Category | Assignment | Subject | Computer Science |
---|---|---|---|
University | University of Leicester | Module Title | CO7093 - Big Data & Predictive Analytics |
The primary goal of this assignment is to familiarize you with the key stages of a data science project, including formulating and answering questions about data, visualizing insights, and building predictive models capable of making new predictions. You will apply the knowledge gained throughout the module. The dataset is provided in CSV format.
After this homework, students will be able to:
In this coursework, you will create a classification model that, given a Covid-19 patient's current symptoms, status, and medical history, predicts whether the patient is at high risk. The dataset contains a large amount of anonymized patient-related information, including pre-existing conditions. The raw dataset consists of 21 unique features and 1,048,576 unique patients. We have applied some simple data cleansing techniques to reduce the data to 200,031 unique patients and 21 unique features.
Using the given dataset, the goal is to determine whether the patient is at high risk and will be admitted to the ICU. You will use appropriate performance metrics to evaluate your model. The data is not clean, and you will have to apply appropriate methods to clean it. Additionally, using unsupervised clustering, you will have to implement a cluster-based classification model that may improve on the performance of the baseline model. The (partially) processed dataset is available to download from Blackboard.
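One possible reading of "cluster-based classification" is to cluster the training data first and then fit a separate classifier inside each cluster. The sketch below illustrates that idea with scikit-learn on synthetic data. The data, the choice of k-means with three clusters, and logistic regression as the per-cluster model are all assumptions made for illustration, not requirements of the brief.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned patient features and a binary risk label.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 1: unsupervised clustering on the training features only.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
train_clusters = kmeans.labels_

# Step 2: fit one classifier per cluster.
models = {}
for c in range(kmeans.n_clusters):
    mask = train_clusters == c
    if len(np.unique(y_train[mask])) < 2:
        # Only one class present in this cluster: store it as a constant prediction.
        models[c] = int(y_train[mask][0])
    else:
        models[c] = LogisticRegression(max_iter=1000).fit(X_train[mask], y_train[mask])

def predict(X_new):
    """Route each row to its nearest cluster, then use that cluster's model."""
    clusters = kmeans.predict(X_new)
    preds = np.empty(len(X_new), dtype=int)
    for c, model in models.items():
        mask = clusters == c
        if not mask.any():
            continue
        preds[mask] = model if isinstance(model, int) else model.predict(X_new[mask])
    return preds

preds = predict(X_test)
print("F1 on held-out data:", f1_score(y_test, preds))
```

Whether this beats a single global classifier depends on the data, so it is worth comparing both with the same metric on the same held-out split.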
Your first task is to prepare the data and carry out data cleansing, bearing in mind the question you would like to answer. For example, which factor is the most important in predicting the readmission of a patient?
Load the dataset patients.csv into a pandas DataFrame and carry out the following tasks.
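As a starting point, loading the file and checking where placeholder values occur might look like the sketch below. An in-memory CSV stands in for patients.csv so the example is self-contained, and the column names (`age`, `diabetes`, `icu`) are hypothetical rather than taken from the real dataset.

```python
import io
import pandas as pd

# Tiny stand-in for patients.csv; the real column names will differ.
csv_text = "age,diabetes,icu\n34,1,0\n61,?,1\n47,0,?\n"

# With the real file this would be: df = pd.read_csv("patients.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Columns containing '?' are read as text rather than numbers,
# which is a quick hint about where missing values are hiding.
print(df.dtypes)

# Count the '?' placeholders in each column.
question_marks = (df == "?").sum()
print(question_marks)
```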
Data Cleaning and Transformation
If you take a closer look at the dataset, you will see that it contains many inconsistencies. While there are no explicit ‘null’ values, some binary attributes contain entries such as ‘?’, which represent missing values. These need to be handled appropriately. For the first task, adopt an aggressive approach to address these issues. Below is a list of steps you should consider. This list is not exhaustive, so feel free to explore additional techniques that demonstrate your understanding of exploratory data analysis (EDA).
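A minimal sketch of one such aggressive pass, assuming "aggressive" means dropping every row that has any missing value: treat ‘?’ as missing, drop incomplete rows and duplicates, then convert the remaining columns to numeric types. The toy DataFrame and its column names are hypothetical, and the helper `aggressive_clean` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative, not from patients.csv.
raw = pd.DataFrame({
    "age": [34, 61, 47, 72, 55],
    "diabetes": ["1", "?", "0", "1", "0"],
    "asthma": ["0", "0", "?", "1", "0"],
    "icu": ["0", "1", "0", "1", "?"],
})

def aggressive_clean(df: pd.DataFrame, missing_token: str = "?") -> pd.DataFrame:
    """Mark `missing_token` as missing, then drop incomplete and duplicate rows."""
    cleaned = df.replace(missing_token, np.nan)  # make missing values explicit
    cleaned = cleaned.dropna()                   # aggressive: drop any row with a gap
    cleaned = cleaned.drop_duplicates()
    # Columns that held '?' were stored as text; convert them back to numbers.
    cleaned = cleaned.apply(pd.to_numeric)
    return cleaned.reset_index(drop=True)

clean = aggressive_clean(raw)
print(f"{len(raw)} rows before cleaning, {len(clean)} rows after")
print(clean.dtypes)
```

Dropping rows is simple but can discard a large share of the data, so it is worth recording how many rows are lost before moving on to less aggressive strategies such as imputation.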
Consider the resulting DataFrame. This first aggressive cleaning should give a smaller dataset, with which you can start exploring relationships between the various features.
Consider the resulting Dataframe:
This is an open-ended task, allowing you to apply your problem-solving skills to develop a high-performance model. Your goal is to explore various approaches and build an effective solution. For full credit, you must use PySpark to demonstrate your understanding of handling big data in a distributed environment.