| Category | Assignment | Subject | Education |
|---|---|---|---|
| University | University of London (UOL) | Module Title | DSM030 Statistics and Statistical Data Mining |
MSc Data Science
Module: Statistics and Statistical Data Mining
Task Name: Data Preprocessing and Engineering using Python 3
Assignment Date: Monday, 09 March 2026
Resubmitting will delete the version you previously submitted and replace it with your updated version, so your Turnitin similarity score should not be affected. If your Turnitin similarity score does change, it will be due to changes you made to your coursework.
You are asked to submit a Jupyter notebook that contains your solution (Weighted at 50% of final mark for the module). You will be given a Jupyter notebook that you can use as a skeleton/guide. Please make sure you use Python 3 and not Python 2. Python 2 code will not be marked and will be considered as a non-submission.
Task Name: Data Engineering and Pre-processing
Data pre-processing and engineering is a very important step in statistical data mining. This step might look straightforward, but it can easily become a nightmare. The difficulty can come from several sources, including: 1) the nature of the problem, 2) the number of variables and their types (e.g. numerical, categorical), and 3) selecting the correct transformation if one is required.
In this task, you will implement several data pre-processing and engineering steps that are common in data science and machine learning. These steps involve several key topics in statistics.
You are expected to learn some simple techniques that are required to finish this task, if you do not already know them. There will be a video explaining the task further to assist you.
Data description: the dataset you will use for this task contains data about house sale prices. The file ‘data_description.txt’ contains a detailed description of all the variables, what they represent, their values and so on. The target variable is ‘SalePrice’, which is the house’s sale price in US dollars.
Here is a description of the steps you are asked to implement and their corresponding marks:
1. Import the required libraries.
2. Load the data using pandas and plot a Histogram of the SalePrice column. This code is provided for you, do not change it.
3. The SalePrice column is not normally distributed (i.e. not Gaussian). Prove this by running a statistical test, then obtaining and interpreting the p-value. [5 marks]
4. Split data into train and test sets making sure the test set is 30% of the original data and the remaining 70% are for training. This code is provided for you, do not change it.
5. Create a list of all categorical variables (by checking their type in the original dataset). [2 marks]
6. Using the training set (X_train), create a list of all categorical variables that contain missing data and print the percentage of missing values per variable in X_train. [3 marks]
7. Using the result of the previous step: for categorical variables with more than 10% of data missing, replace missing data with the word ‘Missing’; in the other variables, replace the missing data with the most frequent category in the training set (apply the replacement to X_train and X_test, and make sure it is based on the results you obtained from the training set). [5 marks]
8. Create a list of all numerical variables (do not include SalePrice). [2 marks]
9. Create a list of all numerical variables that contain missing data and print out the percentage of missing values per variable (use the training data). [3 marks]
10. Using the result of the previous step: for numerical variables with less than 15% of data missing, replace missing data with the mean of the variable; in the other variables, replace the missing data with the median of the variable in the training set (apply the replacement to X_train and X_test, and make sure it is based on the results you obtained from the training set). [5 marks]
11. In the train and test sets, replace the values of the variables ‘YearBuilt’, ‘YearRemodAdd’ and ‘GarageYrBlt’ with the time elapsed between them and the year in which the house was sold, ‘YrSold’. After that, drop the ‘YrSold’ column. [5 marks]
12. Apply mappings to categorical variables that have an order (in total there should be 14 of them). Some of the categorical variables have values with an assigned order, related to quality (for more information, check the data description file). This means you can replace categories by numbers to determine quality. For example, values in ‘BsmtExposure’ can be mapped as follows: ‘No’ can be mapped to 1, ‘Mn’ can be mapped to 2, ‘Av’ can be mapped to 3 and ‘Gd’ can be mapped to 4.
One way of doing this is to manually create mappings similar to the example given. Each mapping can be saved as a Python dictionary and used to perform the actual mapping to transform the described variables from categorical to numerical.
To make it easier for you, here are groups of variables that have the same mappings (Hint: you can map both categories ‘Missing’ and ‘NA’ to 0):
[‘ExterQual’, ‘ExterCond’, ‘BsmtQual’, ‘BsmtCond’, ‘HeatingQC’,
‘KitchenQual’, ‘FireplaceQu’, ‘GarageQual’, ‘GarageCond’]
[‘BsmtFinType1’, ‘BsmtFinType2’]
Each of the following variables has its own mapping: ‘BsmtExposure’, ‘GarageFinish’, ‘Fence’. [5 marks]
13. Replace rare labels with ‘Rare’. For the remaining five categorical variables (those you did not apply value mappings to), group the categories that are present in less than 1% of the observations in the training set. That is, all values of categorical variables that are shared by less than 1% of houses in the training set will be replaced by the string ‘Rare’ in both the training and test sets. Hint: if you count how many times each unique value of a categorical variable appears in the training set, you can compute the percentage of each unique value by dividing its count by the total number of observations. Remember to make the computations using the training set and to apply the replacement in both the training and test sets. [5 marks]
14. Perform one-hot encoding to transform the previous five categorical variables into binary variables. Make sure you do it correctly for both the training and test sets. After the encoding, remember to drop the original five categorical variables (the ones with the strings) from the training and test sets. [5 marks]
15. Feature scaling. Now that all variables in our two datasets (i.e. the training and test sets) are numerical, the final step in this exercise is to apply scaling so that the minimum value in each variable is 0 and the maximum value is 1. For this step, you can use MinMaxScaler() from scikit-learn. Make sure you apply it correctly by transforming the test set based on the training set. [5 marks]
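Several of the steps above share one pattern: learn a statistic from the training set, then apply it to both sets. The sketch below illustrates that pattern on made-up data and is not a solution to the coursework. The column ‘Zone’, the six-row frames, the synthetic price sample, and the toy rare-label threshold of 0.2 are all assumptions chosen for illustration. Shapiro-Wilk is used only as one possible normality test, since the brief does not name one.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

# Step 3 idea: Shapiro-Wilk on a synthetic right-skewed sample (not the real SalePrice).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=500)
_, p_value = stats.shapiro(prices)
print(f"p-value = {p_value:.3g}")  # a p-value below 0.05 rejects normality

# Hypothetical toy frames standing in for X_train and X_test.
X_train = pd.DataFrame({
    "Zone": ["RL", "RL", "RM", np.nan, "RL", "FV"],
    "BsmtExposure": ["No", "Gd", "Av", "Missing", "Mn", "No"],
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0, 9550.0, 14260.0],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
    "YrSold": [2008, 2007, 2008, 2006, 2008, 2009],
})
X_test = pd.DataFrame({
    "Zone": [np.nan, "C"],
    "BsmtExposure": ["Gd", "NA"],
    "LotArea": [np.nan, 10000.0],
    "YearBuilt": [1990, 2005],
    "YrSold": [2010, 2006],
})

# Steps 7 and 10 idea: learn the fill values from the training set only.
zone_mode = X_train["Zone"].mode()[0]
lot_mean = X_train["LotArea"].mean()

# Step 12 idea: the BsmtExposure mapping quoted in the brief, with 'Missing' and 'NA' as 0.
exposure_map = {"Missing": 0, "NA": 0, "No": 1, "Mn": 2, "Av": 3, "Gd": 4}

for df in (X_train, X_test):
    df["Zone"] = df["Zone"].fillna(zone_mode)
    df["LotArea"] = df["LotArea"].fillna(lot_mean)
    # Step 11 idea: years elapsed between construction and sale, then drop YrSold.
    df["YearBuilt"] = df["YrSold"] - df["YearBuilt"]
    df.drop(columns="YrSold", inplace=True)
    df["BsmtExposure"] = df["BsmtExposure"].map(exposure_map)

# Step 13 idea: labels below the threshold in the training set become 'Rare'.
RARE_THRESHOLD = 0.2  # toy value for six rows; the brief specifies 1% (0.01)
freq = X_train["Zone"].value_counts(normalize=True)
frequent = freq[freq >= RARE_THRESHOLD].index
for df in (X_train, X_test):
    df["Zone"] = df["Zone"].where(df["Zone"].isin(frequent), "Rare")

# Step 14 idea: one-hot encode, then align the test columns to the training columns.
X_train = pd.get_dummies(X_train, columns=["Zone"], dtype=int)
X_test = pd.get_dummies(X_test, columns=["Zone"], dtype=int)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# Step 15 idea: fit the scaler on the training set and reuse it on the test set.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
print(X_train_scaled.agg(["min", "max"]))
```

Because the scaler is fitted on the training set, the scaled training columns span exactly 0 to 1, while scaled test values can fall slightly outside that range. The test label ‘C’ never appears in the toy training set, so it is grouped under ‘Rare’ rather than producing a column the training set lacks.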
After applying all the previous steps, the overall mean value of all entries in the training set was approximately 0.249 and in the test set was approximately 0.247.
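One way to compare a finished notebook against these reference values is to average every entry of the scaled frame at once. The sketch below assumes the scaled data is held in a pandas DataFrame; the frame `X_scaled` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical scaled frame; in the notebook this would be the scaled training or test set.
X_scaled = pd.DataFrame({"a": [0.0, 0.5, 1.0], "b": [0.0, 0.25, 1.0]})

# Mean over all entries: flatten to a NumPy array and average every cell.
overall_mean = X_scaled.to_numpy().mean()
print(round(overall_mean, 3))
```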
Please refer to Appendix C of the Programme Regulations for detailed Assessment Criteria.
Plagiarism:
This is cheating. Do not be tempted and certainly do not succumb to temptation. Plagiarised copies are invariably rooted out and severe penalties apply. All assignment submissions are electronically tested for plagiarism.