CO7093 - Big Data & Predictive Analytics CW Assignment Brief | University of Leicester


Published: 08 Mar, 2025

Category: Assignment
Subject: Computer Science
University: University of Leicester
Module Title: CO7093 - Big Data & Predictive Analytics

Overview

The primary goal of this assignment is to familiarize you with the key stages of a data science project, including formulating and answering questions about data, visualizing insights, and building predictive models capable of making new predictions. You will apply the knowledge gained throughout the module. The dataset is provided in CSV format.

Learning Objectives

After this homework, students will be able to:

Work with basic Python data structures such as dict, tuple, list etc.
Use Pandas as the primary tool to process structured data in Python with CSV files. Handle extreme cases appropriately. Use appropriate methods to address missing data.
Use PyPlot to make simple plots to investigate a specific phenomenon. Read plotting library documentation and use example plotting code to figure out how to create more complex Seaborn plots.
Train a machine learning model and use it to make a prediction about the future using the scikit-learn library.
Use PySpark to explore efficient approaches to handling big data.

Problem Statement

In this coursework, you will create a classification model that, given a Covid-19 patient's current symptoms, status, and medical history, predicts whether the patient is at high risk. The dataset contains a large amount of anonymized patient-related information, including pre-existing conditions. The raw dataset consists of 21 unique features and 1,048,576 unique patients. We have applied some simple data cleansing techniques to reduce the data to 200,031 unique patients and 21 unique features.

Objective

Using the given dataset, the goal is to determine whether the patient is at high risk and will be admitted to the ICU. You will use appropriate performance metrics to evaluate your model. The data is not clean, and you will have to apply appropriate methods to clean it. Additionally, using unsupervised clustering, you will implement a cluster-based classification model that may improve performance. The (partially) processed dataset is available to download from Blackboard.

Tasks

Your first task is to prepare the data and carry out data cleansing, bearing in mind the question you would like to answer. For example: which factor is the most important in predicting the readmission of a patient?

Part 1: Building up a basic predictive model

Load the dataset patients.csv into a pandas DataFrame and carry out the following tasks.

Data Cleaning and Transformation

If you take a closer look at the dataset, you will see that it contains many inconsistencies. While there are no explicit ‘null’ values, some binary attributes contain entries such as ‘?’, which represent missing values. These need to be handled appropriately. For the first task, adopt an aggressive approach to address these issues. Below is a list of steps you should consider. This list is not exhaustive, so feel free to explore additional techniques that demonstrate your understanding of exploratory data analysis (EDA).

Check dataset shape.
Remove irrelevant columns. Clearly justify any deletions in your report.
Identify and handle missing values. Some missing values are represented as strings like ‘?’. Some columns may contain values that fall outside their expected range. Identify all the missing values and convert them to NaN.
Summarize missing values before and after handling them.
Verify and adjust data types as needed for consistency.
Drop rows containing null values.
Analyse numerical features. Display summary statistics and identify potential outliers.
Remove outliers if necessary.
Normalize features where applicable.
Check the final dataset shape after preprocessing.
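
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not the expected solution: the column names (PATIENT_ID, AGE, DIABETES, ICU), the toy values, and the 0 to 120 valid age range are all assumptions made for the example and may not match patients.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for patients.csv; names and values are assumptions.
raw = pd.DataFrame({
    "PATIENT_ID": [1, 2, 3, 4, 5, 6],            # identifier, not a useful predictor
    "AGE": [34, 51, 67, 29, 250, 45],            # 250 falls outside a plausible range
    "DIABETES": ["1", "2", "?", "1", "2", "2"],  # '?' marks a missing value
    "ICU": ["2", "1", "2", "?", "2", "1"],
})
print("shape before:", raw.shape)

# Remove an irrelevant column (justify such deletions in the report).
df = raw.drop(columns=["PATIENT_ID"])

# Convert the '?' placeholder to NaN, then fix the data types.
df = df.replace("?", np.nan)
for col in ["DIABETES", "ICU"]:
    df[col] = pd.to_numeric(df[col])

# Treat out-of-range values as missing (the 0-120 range is an assumption).
df["AGE"] = df["AGE"].where(df["AGE"].between(0, 120))

# Summarise missing values before handling them.
missing_before = df.isna().sum()
print(missing_before)

# Aggressive approach: drop every row with at least one missing value.
clean = df.dropna().copy()
missing_after = clean.isna().sum()

# Min-max normalise the numeric feature to the [0, 1] interval.
age = clean["AGE"]
clean["AGE"] = (age - age.min()) / (age.max() - age.min())

print(clean.describe())
print("shape after:", clean.shape)
```

On the real data, the same pattern would start from pd.read_csv("patients.csv"), and the choice of columns to drop and ranges to enforce would need to be justified from the data itself.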

Data Visualisation

Consider the resulting DataFrame. This first aggressive cleaning should give a smaller dataset, with which you can start exploring relationships between the various features.

Plot the distribution of unique classes of the target variable.
Plot the number of ICU cases against age.
Plot a graph that displays the count of target variable against ‘CLASIFFICATION_FINAL’.
Show the scatter matrix plot and the correlation matrices. Can you identify pairs of highly correlated features?
Generate additional plots that demonstrate your understanding of the problem and the data. You are free to select the plots and features for visualisation. For better visualisation and understanding of the data, consider using the seaborn library.
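
Alongside the plots, highly correlated pairs can also be listed numerically from the correlation matrix. The sketch below is one possible way to do this; the synthetic columns (including the deliberately redundant AGE_MONTHS) and the 0.8 threshold are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 90, size=n)
df = pd.DataFrame({
    "AGE": age,
    "AGE_MONTHS": age * 12,                  # deliberately redundant with AGE
    "DIABETES": rng.integers(0, 2, size=n),
    "ICU": rng.integers(0, 2, size=n),
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (self-correlation) is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()

# The 0.8 cut-off is an arbitrary choice for this example.
strong = pairs[pairs.abs() > 0.8]
print(strong)
```

The same corr matrix can be passed to seaborn.heatmap(corr, annot=True) for a visual version, and pandas.plotting.scatter_matrix(df) draws the scatter matrix.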

Model Building

Consider the resulting DataFrame:

Select the predictors that are likely to have an impact on predicting ICU admission.
Build a first linear model with appropriate predictors and evaluate it. Split the data into training and test sets. Evaluate your model using a cross-validation procedure.
Use different performance metrics to evaluate your model. You might have noticed that the data is imbalanced: positive examples make up less than 8% of the total dataset. Choose performance metrics that are appropriate for imbalanced data.
Balance your data using a data balancing technique. Train your model again and evaluate its performance. Did you achieve better predictive performance with more balanced data?
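
The workflow above might look roughly like the sketch below. It uses synthetic data from make_classification with about 8% positives as a stand-in for the cleaned dataset, and it handles imbalance through class re-weighting (class_weight="balanced") rather than resampling; both choices are assumptions made for this example, and resampling approaches such as over- or under-sampling are equally valid answers to the task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the cleaned data (about 8% positives).
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.92, 0.08], random_state=0)

# Stratify so the class ratio is preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, class_weight in [("plain", None), ("balanced", "balanced")]:
    model = LogisticRegression(max_iter=1000, class_weight=class_weight)

    # Cross-validated F1 on the training set; accuracy would be misleading here.
    cv_f1 = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1").mean()

    # Fit on the full training set and score on the held-out test set.
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "cv_f1": cv_f1,
        "test_recall": recall_score(y_test, pred),
        "test_f1": f1_score(y_test, pred),
    }
    print(name, results[name])
```

Recall and F1 on the positive class are reported because a model that always predicts the majority class would already score above 90% accuracy on data this imbalanced.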

Part 2: Improved model

This is an open-ended task, allowing you to apply your problem-solving skills to develop a high-performance model. Your goal is to explore various approaches and build an effective solution. For full credit, you must use PySpark to demonstrate your understanding of handling big data in a distributed environment.

Consider the entire dataset again. Develop an improved classification model that predicts the patient’s risk. You should aim for a model with higher performance while using as many data points as possible. This implies treating missing values differently, for example through imputation rather than dropping them. Validate your model and compare its performance with that of the model you built previously.
Use the K-Means algorithm to cluster your cleansed dataset and compare the obtained clusters with the distribution found in the data. Justify your clustering and visualise your clusters as appropriate.
Build local classifiers based on your clustering and discuss how this cluster-based classification compares to the model obtained in the first step of Part 2.
As in Part 1, balance the data, then train and test your model with the balanced data.
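
The idea of imputation followed by cluster-based local classifiers can be sketched as below. Note the assumptions: this is a single-machine scikit-learn illustration on synthetic data, not the PySpark implementation the brief asks for, and the choice of median imputation, three clusters, and logistic regression as the local model is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the data, with 5% of entries knocked out.
X, y = make_classification(n_samples=4000, n_features=8, n_informative=5,
                           weights=[0.92, 0.08], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Impute instead of dropping rows; fit on training data only to avoid leakage.
imputer = SimpleImputer(strategy="median").fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))
X_train = scaler.transform(imputer.transform(X_train))
X_test = scaler.transform(imputer.transform(X_test))

# Cluster the training data; k=3 is an arbitrary choice for this sketch.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
train_clusters = kmeans.labels_
test_clusters = kmeans.predict(X_test)

# Global baseline trained on all of the training data.
global_model = LogisticRegression(max_iter=1000, class_weight="balanced")
global_model.fit(X_train, y_train)

# One local classifier per cluster, falling back to the global model
# when a cluster does not contain both classes.
local_models = {}
for c in range(kmeans.n_clusters):
    mask = train_clusters == c
    if len(np.unique(y_train[mask])) < 2:
        local_models[c] = global_model
    else:
        local = LogisticRegression(max_iter=1000, class_weight="balanced")
        local_models[c] = local.fit(X_train[mask], y_train[mask])

# Route each test point to the classifier of its assigned cluster.
local_pred = np.empty_like(y_test)
for c, model in local_models.items():
    mask = test_clusters == c
    if mask.any():
        local_pred[mask] = model.predict(X_test[mask])

print("global F1:", f1_score(y_test, global_model.predict(X_test)))
print("local F1:", f1_score(y_test, local_pred))
```

Whether the local classifiers beat the global model depends on the data and the number of clusters, so the comparison itself is part of the discussion the task asks for.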

