7BUIS008W.1 Data Mining & Machine Learning Coursework (CW1) | University of Westminster

Published: 05 Feb, 2025

Category	Coursework	Subject	Computer Science
University	University of Westminster	Module Title	7BUIS008W.1 Data Mining & Machine Learning

Description

Students are expected to engage critically in applying and evaluating novel data mining and machine learning techniques for a specific problem domain, and to reflect on how different data mining and machine learning algorithms perform in terms of biases for that domain. Students are also expected to analyse the output of the data mining tasks and machine learning algorithms methodically, drawing technically sound conclusions from applying these algorithms to the given problem.

Learning Outcomes Covered in this Assignment

This assignment contributes towards the following Learning Outcomes (LOs):

LO2 fully implement data mining/machine learning projects, focused on problem analysis, data pre-processing and data post-processing, by choosing and implementing appropriate algorithms;
LO4 fully implement, encode and test data mining and machine learning algorithms using a programming language (such as Python) and standard packages and toolkits;
LO6 perform a critical evaluation of performance metrics for data mining and machine learning algorithms for a given domain/application.

Coursework Description

The Real-world Problem Description

A) The Domain

The deployment of machine learning modelling in this coursework aims to tackle a real-world problem by developing effective early screening machine learning models for breast cancer mortality and survival prediction, to help doctors enhance their treatment planning and management.

Cancer is a disease in which cells in the body grow out of control. Breast cancer is a disease in which abnormal breast cells grow out of control and form tumours. If left unchecked, the tumours can spread throughout the body and become fatal. Breast cancer cells begin inside the milk ducts and/or the milk-producing lobules of the breast.

In females in the UK, breast cancer is the 2nd most common cause of cancer death, with around 11,400 deaths every year (2017-2019). In males in the UK, breast cancer is not among the 20 most common causes of cancer death, with around 85 deaths every year (2017-2019).

Stages and grades of breast cancer

The tests and scans the patient has to diagnose breast cancer give information about:

the size of the cancer and whether it has spread (the stage)
how abnormal the cells look under the microscope (the grade)

Knowing the stage and grade helps the doctor plan the patient’s treatment. The stage of a cancer tells the patient how big it is and whether it has spread. It helps the doctor decide which treatment the patient needs.

There are different systems used in the UK to stage breast cancer. The most common one is the TNM system. TNM stands for Tumour, Node and Metastasis. The patient might also be told about the number staging system. There are 4 main stages in this system, from 1 to 4.

The tests the patient has also give information about the type of breast cancer they have. The information below is an overview of the TNM staging for all types of cancer.

T describes the size of the tumour (cancer)
N describes whether there are any cancer cells in the nearby lymph nodes
M describes whether the cancer has spread to parts of the body further away from where the cancer started

Stage 1 breast cancer means that the cancer is small and only in the breast tissue or it might be found in lymph nodes close to the breast (see Figure 1). It is an early-stage breast cancer.

The stage of cancer tells the patient how big it is and how far it has spread. It helps the doctor decide the best treatment for the patient. There are different systems used in the UK to stage breast cancer. Stage 1 is part of the number staging system. Doctors may also use the TNM staging system.

Staging for breast cancer is very complex. Many different factors are considered before doctors can confirm the patient’s final stage.

Fig.1 Illustration of stage 1 breast cancer


Stage 2 breast cancer means that the cancer is either in the breast or in the nearby lymph nodes or both. It is an early-stage breast cancer.
Stage 2 is part of the number staging system. Doctors may also use the TNM staging system.
Stage 2 can be divided into 2A and 2B. Opposite is a simplified description of stage 2A and 2B breast cancer (see Figure 2).

Fig.2 Illustration of stage 2 breast cancer

Stage 3 means that the cancer has spread from the breast to the lymph nodes close to the breast, to the skin of the breast or to the chest wall. It is also called locally advanced breast cancer. Stage 3 is part of the number staging system. Doctors may also use the TNM staging system.
Stage 3 can be divided into 3A, 3B and 3C. Opposite is a simplified description of stage 3A, 3B and 3C breast cancer (see Figure 3).

Fig.3 Illustration of stage 3 breast cancer

Stage 4 breast cancer has spread to another part of the body (see Figure 4). It is also called advanced cancer or secondary breast cancer. The aim of treatment is to control the cancer and any symptoms. Treatment depends on a number of factors.
In stage 4 breast cancer:

the cancer can be any size
the lymph nodes may or may not contain cancer cells
the cancer has spread (metastasised) to other parts of the body such as the bones, lungs, liver or brain.

Fig.4 Illustration of stage 4 breast cancer

Hormonal Treatment

Hormone therapy is a common treatment for secondary breast cancer. It can often shrink and control the cancer wherever it is in the body. It works well if the cancer cells have particular proteins called hormone receptors, such as the estrogen receptor and the progesterone receptor.
If one hormone therapy stops working so well, the doctor might suggest you try a different one.

Other Treatment

The doctor will take many different factors into account when deciding which treatment is best for the patient. These include:

the type of cells the cancer started in
which part of your body the cancer has spread to
the treatment you have already had
your general health
whether you have had the menopause
whether the cancer is growing slowly or more quickly
whether the cancer cells have receptors for particular cancer drugs

If your cancer doesn't have hormone receptors or has spread to the liver or lungs, the doctor might suggest chemotherapy. Radiotherapy might be recommended if the cancer has spread to the bones or the skin near the breast. Targeted and immunotherapy drugs might be recommended for secondary breast cancer.

C) The Domain Problem

Predicting mortality and the short- and long-term survival of patients with cancer may improve their care. Prior predictive models either use data with limited availability or predict the outcome of only one type of cancer, in this case breast cancer.

D) Your Role as A Data Scientist

You are hired as a data scientist to work alongside a team of doctors to:
1- Build predictive machine-learning models for breast cancer mortality status.
2- Build predictive machine-learning models to estimate a patient’s survival period.
The team of doctors has provided you with historical records of breast cancer patients, including their mortality status and the number of months they survived.

The doctors rely on your work to answer the following two research questions on the dataset. The key objective is to create a new predictive tool, powered by a machine learning model, to assist doctors in enhancing their treatment planning and cancer care. The Research Questions are:
a) Does machine learning have the potential to assist doctors to predict those who would survive breast cancer or not?
b) For patients who would not survive cancer, can machine learning offer a reliable estimate of their survival period?

E) Your Dataset

This dataset of breast cancer patients was obtained from the 2017 November update of the SEER Program of the NCI, which provides information on population-based cancer statistics. The dataset contains the following attributes:

Your Coursework Tasks & Framework

As a data scientist, you are a logician, a mathematician, a technician, and an analyst, and you need doctors to understand your analyses. Doctors are usually busy individuals, and they don’t have all the time in the world. One essential skill that you must adhere to is to be concise and straight to the point. Focus on the answers needed for each task, and provide just enough words for the answer only. There is no need to provide lengthy descriptions of algorithms and methods unless you are asked to do so.

Also, they are only interested in assessing your interpretation of the modelling results, so you MUST NOT paste any Python code in this report unless specifically asked to do so. You will receive a separate link to submit your code as a Python notebook file with the .ipynb extension (mandatory). Your data mining tasks will be aligned with the popular CRISP-DM methodology phases but without the deployment phase (see Figure 5).

Fig.5 CRISP-DM Phases

PART (A) Breast Cancer Mortality Prediction [65 MARKS]

Does machine learning have the potential to assist doctors to predict those who would survive breast cancer or not?

Task (1) – Domain Understanding: Classification [Total 6 Marks]

The doctors decided that classification modelling is required. Indicate in the table below, for each of the listed variables in your data, which ones you should RETAIN and can be included in the classification modelling of Breast Cancer Mortality (Alive vs. Dead) and which variables you should DROP (REMOVE). Justify your decision logically and/or by research (include in-text citation).

Variable Name	Retain or Drop	Brief justification for retention or dropping
Patient ID
Month of Birth
Age
Sex
Race
Marital Status
Occupation Code
Adopted Status
T Stage
N Stage
6th Stage
Differentiate
Grade
A Stage
Tumour Size
Estrogen Status
Progesterone Status
Regional Node Examined
Regional Node Positive
Survival Months
Mortality Status

Task (2) – Data Understanding: Producing Your Experimental Design

From your Python notebook, for your RETAINED input variables and your class “target” variable, produce a basic statistical description and variable scale type. Plot the distribution of your target variable. (Paste screenshots of code OUTPUTS ONLY for evidence).
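
A minimal sketch of how the statistical description and the target distribution might be produced with pandas. The toy records below are invented for illustration and are not taken from the SEER dataset; only the column names follow the variable list in Task (1), and the two-way "numeric"/"categorical" labelling is a rough starting point rather than a full scale-type analysis.

```python
import pandas as pd

# Hypothetical toy records: values are made up, column names mirror the brief.
df = pd.DataFrame({
    "Age": [45, 61, 52, 38, 70, 49],
    "Tumour Size": [12, 41, 25, 8, 60, 30],
    "T Stage": ["T1", "T3", "T2", "T1", "T4", "T2"],
    "Mortality Status": ["Alive", "Dead", "Alive", "Alive", "Dead", "Alive"],
})

# Basic statistical description: numeric columns only, then every column.
numeric_summary = df.describe()
full_summary = df.describe(include="all")

# Rough scale type per column, to be refined with domain knowledge
# (for example, T Stage is ordinal while Race would be nominal).
scale_type = {
    col: "numeric" if pd.api.types.is_numeric_dtype(df[col]) else "categorical"
    for col in df.columns
}

# Class distribution of the target variable.
target_counts = df["Mortality Status"].value_counts()

print(numeric_summary, full_summary, scale_type, target_counts, sep="\n\n")
```

Calling target_counts.plot(kind="bar") in a notebook with matplotlib installed would chart the same distribution.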

Task (3) – Data Preparation: Cleaning and Transforming your data

a) Investigate any issues in your retained dataset and the possible variables. Based on the issues you find in your data, suggest a suitable possible method to mitigate each of these issues and provide your justification for using each method. Use the table below to organise your findings, add more rows if needed:

Variable Name	Issue description	Proposed mitigation	Justification for used mitigation



⋮	⋮	⋮	⋮

b) With the aid of Python packages and a notebook, implement the mitigations you suggested in (Task 3.a) and show evidence of the implementation (use screenshots of code OUTPUTS ONLY). Annotate each screenshot to indicate which issue it resolves. Show screenshots of code outputs before and after implementing your solution.
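
One way to capture before-and-after evidence is to run the same check on either side of each fix. The sketch below assumes, purely for illustration, that the issue is missing values and that median/mode imputation is the chosen mitigation; the data frame is invented, and other issues would call for other checks and fixes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "Age": [45.0, np.nan, 52.0, 38.0, 52.0],
    "Race": ["White", "Black", None, "White", "White"],
})

# Evidence BEFORE the fix: missing values per column.
missing_before = df.isna().sum()

# Assumed mitigation: median for the numeric column, mode for the categorical one.
cleaned = df.copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Race"] = cleaned["Race"].fillna(cleaned["Race"].mode()[0])

# Evidence AFTER the fix: the same check on the cleaned copy.
missing_after = cleaned.isna().sum()

print("Before:", missing_before, "After:", missing_after, sep="\n")
```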

Task (4) – Modelling: Create Predictive Classification Models

a) From the classification algorithms which you learned in the module, three different algorithms were selected: Logistic Regression (LR), K Nearest Neighbour (KNN) and Naïve Bayes (NB). These algorithms are a mix of parametric and non-parametric algorithms. List the type of each algorithm (parametric vs non-parametric), name any learnable parameters, and list any possible hyperparameters for each algorithm which you may want to consider tuning. Note the Python package and module for importing each algorithm. Again, organise your answer in a table as before. See below:

Algorithm Name	Algorithm Type	Learnable Parameters	Some Possible Hyperparameters	Imported Python package to use the algorithm
NB
LR
KNN (K=?)

b) With the aid of the Python packages, use the training–test split approach with your retained applicable categorical input features only and the class output feature to build your predictive classification models.

Screenshot the list of all feature names used for building your classification models and the corresponding data shape function output.
In less than 100 words, research and justify your choice of the training-test split ratio and provide an in-text citation.
In less than 100 words, discuss the overall purpose of using a training-test approach in contrast to the use of validation sets in K-fold cross-validation, and describe the cases in which each approach should be applied.
Provide as evidence the code line from your source code that ensures that all models were tested on the same test dataset. Also ensure that the ratio of Mortality Status “Alive” to Mortality Status “Dead” labels is the same in the training and test sets.
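
A sketch of how a single, reproducible, class-balanced split might look with scikit-learn. The data are synthetic, and both test_size=0.2 and random_state=42 are arbitrary placeholder choices rather than values prescribed by the brief; the split ratio still needs its own justification.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 80 "Alive" and 20 "Dead" records.
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series(["Alive"] * 80 + ["Dead"] * 20, name="Mortality Status")

# Splitting ONCE with a fixed random_state and reusing X_test/y_test for every
# model keeps the test set identical across models; stratify=y preserves the
# Alive:Dead ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_ratio = y_train.value_counts(normalize=True)
test_ratio = y_test.value_counts(normalize=True)
print(train_ratio, test_ratio, sep="\n\n")
```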

Task (5) – Evaluation: How good are your models

Your healthcare professionals provided the following success criteria to guide you when evaluating your models.
“When evaluating your model's performance, which addresses your first research question (a), the model is expected to misclassify some subjects. Thus, the model should aim to predict the “Dead” mortality status for as many subjects as possible to increase the urgency in treatment planning. However, the model should demonstrate that its high “Dead” mortality prediction rate is mainly due to a larger portion of correctly detected (predicted) subjects who belong to the Mortality Status “Dead” class.”

a) With the aid of Python packages, paste the test confusion matrix for each trained model as screenshots from the output of your Python code.
b) Five different classification evaluation metrics are noted. Paste each model’s test performance results. State which evaluation metric/metrics to “USE” or “DO NOT USE” to closely interpret the above success criteria. For justification, explain how closely your choice of “USE” or “DO NOT USE” for a metric interprets the given success criteria. With the aid of Python packages, document the TEST SCORES for each built model.

Metrics	USE or DO NOT USE	Justification in relation to the success criteria	Model Name	Test Score
Accuracy			NB
			LR
			KNN (K=?)
Recall			NB
			LR
			KNN (K=?)
Precision			NB
			LR
			KNN (K=?)
F-Score			NB
			LR
			KNN (K=?)
AUC-ROC			NB
			LR
			KNN (K=?)
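
For reference, the five metrics in the table can all be computed from test labels and predictions with scikit-learn. The labels, predictions and scores below are hand-made so the arithmetic can be checked by eye; encoding “Dead” as 1 (the positive class) is an assumption of this sketch.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hand-made example: 1 = "Dead" (positive class), 0 = "Alive".
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
# Hypothetical predicted probabilities of class 1, needed for AUC-ROC.
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

scores = {
    "Accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / all
    "Recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "Precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "F-Score": f1_score(y_true, y_pred),
    "AUC-ROC": roc_auc_score(y_true, y_score),
}
print(cm)
print(scores)
```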

c) Suggest a single best classification model based on the ‘USED’ performance metrics scores you identified in (Task 5. b). Briefly describe how well your best model satisfies the needs of your healthcare professionals.

d) Investigate, with evidence, whether your selected best model is a good fit, underfit or overfit.
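
A common way to gather such evidence is to compare the same score on the training and test sets. The sketch below uses synthetic data and a deliberately flexible KNN with K=1, chosen only to make the train/test gap visible; it is not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy binary classification data (flip_y adds label noise).
X, y = make_classification(n_samples=300, n_features=6, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# K=1 memorises the training set, so it is prone to overfitting.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
gap = train_acc - test_acc

# A high training score with a clearly lower test score suggests overfitting;
# low scores on both suggest underfitting; similar, acceptable scores suggest a good fit.
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```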

e) To enhance the performance of your selected best model, tune some of its possible hyperparameters, which you indicated in (Task 4. a) for that specific algorithm. With the aid of Python packages, re-train the algorithm with GridSearchCV. [5 marks]

i. Indicate the number of cross-validation K folds used.
ii. For the newly tuned model, document the estimated best hyperparameters.
iii. Present the test confusion matrix for the best models before and after tuning.
iv. Calculate and document the new score/s of the “USED” performance metric/s of your choice to interpret the success criteria identified in (Task 5.b) before and after tuning.
v. Use your observations to comment on whether the tuning of hyperparameters of your best model improved its positive predictive ability in line with the success criteria.
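
The tuning steps above can be sketched as follows. KNN, the grid values, cv=5 and scoring="recall" are all assumptions made for illustration on synthetic data; the algorithm, grid and scoring metric should follow from the answers to the earlier tasks.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; label 1 plays the role of the positive class.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline model before tuning (scikit-learn defaults).
baseline = KNeighborsClassifier().fit(X_train, y_train)
recall_before = recall_score(y_test, baseline.predict(X_test))

# Hypothetical grid; cross-validation uses the training data only.
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

best_params = search.best_params_
recall_after = recall_score(y_test, search.predict(X_test))
cm_after = confusion_matrix(y_test, search.predict(X_test))

print("folds:", search.n_splits_, "best:", best_params)
print("recall before:", recall_before, "after:", recall_after)
print(cm_after)
```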

f) Based on your best model, draft an answer for the research question, criticise your best-performing model, and state any limitations you may have identified. Research and try to explain, in no more than 100 words, why your selected algorithm outperformed all other models. State any ethical issues your model may raise if used to screen for breast cancer mortality.

PART (B) Breast Cancer Survival Rate Prediction [35 Marks]

Task (1) – Domain Understanding: Regression

The doctors decided that regression modelling is required. Using Python functions, show the dimensions of the data subset that you will RETAIN for this regression modelling problem. Using Python functions, list the names of the features that you intend to use for modelling from Table.1.

Task (2) – Data Understanding: Producing Your Experimental Design

From your Python notebook, plot the distribution of your RETAINED input variables and your “target” variable (paste screenshots of code OUTPUTS ONLY, “the plots”, for evidence).

Task (3) – Data Preprocessing: Transforming your data

a) By looking at the dataset, establish whether there is a need for scaling your dataset attributes. Explain, with evidence from your Python code output, the reasoning behind your recommendation.

b) In general, when applying scaling to any dataset for regression modelling, would you scale the input features only, the target feature only, or all features? In less than 150 words, briefly justify your answer and include in-text citation where appropriate.
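
As a reminder of the mechanics only, the sketch below standardises a small hand-made array with scikit-learn's StandardScaler, fitting on the training rows and reusing the fitted statistics on the test row. Which columns to scale (inputs, target or both) is deliberately left open here, since that is the question being asked; the numbers are invented.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hand-made example: two columns on different scales.
X_train = np.array([[30.0, 10.0], [50.0, 30.0], [70.0, 50.0]])
X_test = np.array([[50.0, 20.0]])

# Evidence for whether scaling is needed: compare the column ranges.
ranges = X_train.max(axis=0) - X_train.min(axis=0)

# Fit on the training rows only, then apply the same mean/std to the test row,
# so no information from the test set leaks into the transformation.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(ranges)
print(X_train_scaled)
print(X_test_scaled)
```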

Task (4) – Modelling: Build Predictive Regression Models

a) From the regression algorithms which you learned in the module, doctors decided on the use of a Decision Tree Regression (DT) algorithm. In less than 50 words, explain the added benefit of using a DT regressor to this healthcare prediction problem.

b) With the aid of the Python packages, you will use a training–test split of 80:20 to build and test two DT regression models, Model 1 and Model 2. Model 1 uses numeric features only, and Model 2 uses all your retained features:

i. From your Python notebook, insert in your report, as evidence, the code line from your source code that ensures reproducibility of your training-test sampling.

ii. Using Python packages, show from your code output the dimensions of the training and test subsets used for each model. List the subset of feature names used for Model 1 and Model 2.

Task (5) – Evaluation: How good are your models

Your healthcare professionals provided the following success criteria to guide you when evaluating your models. “When evaluating the performance of both models, which addresses your research question (b), the model is expected to make some errors in estimating the survival months. However, the selected model out of the two built models should have input features that are better at explaining the recorded values of survival months.”

a) Four different regression evaluation metrics are noted. State which evaluation metric/metrics to USE or NOT USE to interpret the above success criteria closely. Justify your choice of USE or DO NOT USE. With the aid of Python packages, document the TEST SCORES for each built model.

Metrics	USE or DO NOT USE	Justification in relation to the success criteria	Model Name	Test Score
MSE			DT (Numeric Features Only)
			DT (All Features)
MAE			DT (Numeric Features Only)
			DT (All Features)
R-Square			DT (Numeric Features Only)
			DT (All Features)
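
The three metrics in the table can be computed with scikit-learn as sketched below. The true and predicted survival months are hand-made so that each score can be verified by hand.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hand-made survival months (true vs predicted), for illustration only.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

# MSE: mean of squared errors   = (4 + 4 + 9 + 9) / 4 = 6.5
# MAE: mean of absolute errors  = (2 + 2 + 3 + 3) / 4 = 2.5
# R^2: 1 - SS_res / SS_tot      = 1 - 26 / 500        = 0.948
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse} MAE={mae} R2={r2}")
```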

b) Describe any caveats to your selected performance metric assessing the ability of your model meeting the success criteria.

c) Suggest a single best regression model (Model 1 or Model 2) based on the ‘USED’ performance metrics scores you identified in (Task 5. a). Briefly describe how well your selected best model satisfies the needs of the healthcare professionals.

d) Healthcare professionals aim to explain your best model’s decision of estimated survival months to the patient. Therefore, rebuild your best model while performing pre-pruning (4 levels limit) to ease the interpretation of your best model’s decision. Plot the pruned tree and paste it here from your Python notebook results. Describe with evidence if there were any performance advantages or disadvantages of pruning your best tree model.
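
A sketch of pre-pruning with scikit-learn, assuming the “4 levels limit” maps to max_depth=4. The data are synthetic, the feature names f0 to f3 are placeholders, and export_text is used here as a dependency-free stand-in for a plot; sklearn.tree.plot_tree would draw the same tree in a notebook with matplotlib.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic regression data standing in for the retained features.
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# 80:20 split; the fixed random_state makes the sampling reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Unpruned tree versus a tree pre-pruned to at most 4 levels.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

# Test-set R^2 before and after pruning, as evidence of any performance change.
full_r2 = full_tree.score(X_test, y_test)
pruned_r2 = pruned_tree.score(X_test, y_test)

# Text rendering of the pruned tree's decision rules.
rules = export_text(pruned_tree, feature_names=["f0", "f1", "f2", "f3"])

print("depths:", full_tree.get_depth(), pruned_tree.get_depth())
print("test R2 full:", full_r2, "pruned:", pruned_r2)
print(rules)
```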

e) Using your pruned model, predict the survival months for breast cancer patient B002565 whose attributes values are the following:

Variable Name	Value
Patient ID	B002565
Month of Birth	July
Age	56 Years old
Sex	Female
Race	White
Marital Status	Single
Occupation Code	15
Adopted Status	Not Adopted
T Stage	T3
N Stage	N3
6th Stage	IIIC
Differentiate	Moderately differentiated
Grade	2
A Stage	Regional
Tumour Size	41
Estrogen Status	Positive
Progesterone Status	Positive
Regional Node Examined	5
Regional Node Positive	1
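
When predicting for a single patient, the new record has to be encoded with exactly the same columns as the training data. The sketch below shows one way to do that with pandas; the training records and survival months are invented, only three of the patient's attributes (Age, Tumour Size, T Stage) are used, and one-hot encoding via get_dummies is an assumption of this example rather than a required approach.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy training data; survival months are made up.
train = pd.DataFrame({
    "Age": [45, 61, 52, 38, 70, 49],
    "Tumour Size": [12, 41, 25, 8, 60, 30],
    "T Stage": ["T1", "T3", "T2", "T1", "T4", "T2"],
    "Survival Months": [80, 30, 60, 95, 12, 55],
})

# One-hot encode the categorical column; dtype=int keeps all columns numeric.
X_train = pd.get_dummies(
    train[["Age", "Tumour Size", "T Stage"]], columns=["T Stage"], dtype=int
)
y_train = train["Survival Months"]

model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

# Patient B002565: Age, Tumour Size and T Stage as listed in the table above.
patient = pd.DataFrame([{"Age": 56, "Tumour Size": 41, "T Stage": "T3"}])

# Encode the single row, then align it to the training columns so that
# dummy columns absent from this one record are filled with 0.
patient_encoded = pd.get_dummies(patient, columns=["T Stage"], dtype=int).reindex(
    columns=X_train.columns, fill_value=0
)

prediction = model.predict(patient_encoded)
print(patient_encoded)
print("Predicted survival months:", prediction[0])
```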
