Master Big Data Tools & Techniques, Assignment, UOS

Published: 17 Jan, 2025
Category: Assignment | Subject: Computer Science
University: University of Salford | Module Title: Master Big Data Tools & Techniques

Assessment information 

Tasks

You will be given two tasks, each with a separate dataset and a set of problem statements. You are required to implement a solution to each problem based on the task's description.

Task 1  

You will be using clinical trial datasets in this work and combining the information with a list of pharmaceutical companies. You will be given the answers to the questions, for a basic implementation, for two historical datasets, so you can verify your basic solution to the problems. All data will be available from Blackboard.

Datasets: 

The data necessary for this assignment will be zipped CSV files. The .csv files have a header describing the files' contents. They are:

1. Clinicaltrial_2023.csv: 

Every row in the dataset corresponds to an individual clinical trial and is described by several variables. It's important to note that the first column contains a mixture of variables separated by a delimiter, and the date columns appear in various formats. Please take these issues into account and ensure that the dataset is appropriately prepared before starting any analysis.

(Source: ClinicalTrials.gov) 
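As a rough illustration of the preparation involved, here is a minimal PySpark sketch for splitting the delimited first column and normalising mixed date formats. The delimiter "|" and the column name "Completion" are assumptions for illustration only; inspect the real header before relying on either.

```python
import re
from pyspark.sql import functions as F

# Assumptions (verify against the real file): the delimiter is "|"
# and one date column is named "Completion". Both are illustrative.
raw = spark.read.text("/FileStore/tables/clinicaltrial_2023.csv")

delim = "|"
header_line = raw.first()["value"]
header = [c.strip() for c in header_line.split(delim)]

# Drop the header row, split each line, and name the resulting fields.
body = raw.filter(F.col("value") != header_line)
fields = body.select(F.split("value", re.escape(delim)).alias("f"))
df = fields.select(*[F.col("f")[i].alias(c) for i, c in enumerate(header)])

# The date columns mix formats, so try each known pattern in turn.
df = df.withColumn(
    "Completion",
    F.coalesce(F.to_date("Completion", "yyyy-MM-dd"),
               F.to_date("Completion", "MMM yyyy")))
```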

2. pharma.csv: 

The file contains a small subset of a publicly available list of pharmaceutical violations. For the purposes of this work, we are interested in the second column, Parent Company, which contains the name of the pharmaceutical company in question.

(Source: https://violationtracker.goodjobsfirst.org/industry/pharmaceuticals) 

When creating tables for this task, you must name them as follows: 

➢ clinicaltrial_2023

➢ pharma

The uploaded datasets must exist (and be named) in the following locations:

➢ /FileStore/tables/clinicaltrial_2023.csv

➢/FileStore/tables/pharma.csv 

This is to ensure that we can run your notebooks when testing your code (marks are allocated for your code running).
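For reference, a minimal sketch of registering both tables under the required names, assuming pharma.csv is an ordinary comma-separated file with a header and that df is the cleaned DataFrame from the preparation sketch above:

```python
# pharma.csv is assumed to be a standard CSV with a header row.
pharma_df = (spark.read
             .option("header", True)
             .csv("/FileStore/tables/pharma.csv"))
pharma_df.createOrReplaceTempView("pharma")

# df is the cleaned clinical-trial DataFrame from the earlier sketch.
df.createOrReplaceTempView("clinicaltrial_2023")
```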

Problem statement

You are a data scientist / AI engineer whose client wishes to gain further insight into clinical trials. You are tasked with answering these questions, using visualisations where these would support your conclusions.

You should address the following questions.  

1. The number of studies in the dataset. You must ensure that you explicitly check for distinct studies.

2. You should list all the types (as contained in the Type column) of studies in the dataset, along with the frequencies of each type. These should be ordered from most frequent to least frequent.

3. The top 5 conditions (from Conditions) with their frequencies. 

4. Find the 10 most common sponsors that are not pharmaceutical companies, along with the number of clinical trials they have sponsored. Hint: for a basic implementation, you can assume that the Parent Company column contains all possible pharmaceutical companies. (A DataFrame sketch for this question follows the list.)

5. Plot the number of completed studies for each month in 2023. You need to include your visualisation as well as a table of all the values you have plotted for each month.
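As a hedged illustration of Question 4, the following DataFrame sketch keeps only sponsors that do not appear as a pharmaceutical parent company; the column names "Sponsor" and "Parent Company" are assumptions based on the dataset descriptions above.

```python
from pyspark.sql import functions as F

# Remove sponsors that match a pharma parent company, then count trials.
non_pharma = (df.join(pharma_df.select(F.col("Parent Company").alias("Sponsor")),
                      on="Sponsor", how="left_anti")
                .groupBy("Sponsor").count()
                .orderBy(F.col("count").desc())
                .limit(10))
non_pharma.show(truncate=False)
```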

You are to implement all 5 problems 3 times: once in Spark SQL and twice in PySpark (once using RDDs and once using the DataFrame API).
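As a rough, non-authoritative sketch of what the three required styles look like, here is Question 1 (the distinct study count) in each; the identifier column name "Id" is an assumption and should be replaced with whatever the prepared dataset actually uses.

```python
# 1) RDD style: map to the identifier field, then count distinct values.
print(df.rdd.map(lambda row: row["Id"]).distinct().count())

# 2) DataFrame style.
print(df.select("Id").distinct().count())

# 3) Spark SQL style, against the registered clinicaltrial_2023 view.
spark.sql("SELECT COUNT(DISTINCT Id) FROM clinicaltrial_2023").show()
```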

For the visualisation of the results, you are free to use any tool that fulfils the requirements, such as Python's matplotlib, Excel, Power BI, Tableau, or any other free open-source tool you may find suitable. Using built-in visualisations directly is permitted; however, it will not yield a high number of marks. Your report needs to state the software used to generate each visualisation, otherwise a built-in visualisation will be assumed.
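For instance, a minimal matplotlib sketch for the Question 5 plot might look like the following, assuming a small aggregated DataFrame monthly with columns "month" and "count" has already been produced by grouping completed 2023 studies by completion month (both names are assumptions):

```python
import matplotlib.pyplot as plt

# `monthly` is assumed to hold one row per month with a completed-study count.
pdf = monthly.orderBy("month").toPandas()

plt.bar(pdf["month"], pdf["count"])
plt.xlabel("Month (2023)")
plt.ylabel("Completed studies")
plt.title("Completed studies per month, 2023")
plt.show()
```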

Extra features to be implemented for extra marks: 

For this task, an implementation of the above which is correct, fully documented and clearly explained can only receive a maximum mark of 69% for this component. Higher-scoring submissions will need to implement additional features, such as (but not limited to):

➢ Unzipping the data inside the Databricks system. (You can unzip the file on your computer before uploading it to Databricks. However, to earn extra marks, you should be able to successfully unzip it within the Databricks environment. Additionally, your code should be reusable for us, meaning it needs to include proper cleanup procedures to remove any unnecessary files and folders from the filesystem. This ensures our ability to run your code without errors. A minimal unzip-and-cleanup sketch follows this list.)

➢ A maximum of 3 further analyses of the data, motivated by the questions asked (new problem statements other than the above 5 problems)

➢ Writing general and reusable code, for example code that works across different versions of the data. We have provided the clinicaltrial_2020 and clinicaltrial_2021 datasets only for this purpose, should you want to use them (don't forget, the main dataset is clinicaltrial_2023; the 2020 and 2021 versions are only for extra marks, and it is not compulsory to use them).

➢ Using more advanced methods to solve the problems, such as defining and using user-defined functions (UDFs).

➢ Successfully implementing Spark functions that you have not used in the workshop. 

➢ Creation of additional visualisations presenting useful information based on your own exploration which is not covered by the problem statements.
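A minimal unzip-and-cleanup sketch, assuming the archive was uploaded to /FileStore/tables/clinicaltrial_2023.zip (the path and archive name are assumptions; dbutils is available in Databricks notebooks):

```python
import zipfile

# Copy the archive from DBFS to the driver's local disk, where zipfile can read it.
local_zip = "/tmp/clinicaltrial_2023.zip"
dbutils.fs.cp("dbfs:/FileStore/tables/clinicaltrial_2023.zip", f"file:{local_zip}")

with zipfile.ZipFile(local_zip) as zf:
    zf.extractall("/tmp/extracted")

# Move the extracted CSV to the location the assignment requires.
dbutils.fs.mv("file:/tmp/extracted/clinicaltrial_2023.csv",
              "dbfs:/FileStore/tables/clinicaltrial_2023.csv")

# Cleanup: remove intermediate files so the notebook can be re-run safely.
dbutils.fs.rm(f"file:{local_zip}")
dbutils.fs.rm("file:/tmp/extracted", recurse=True)
```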

Task 2 

Problem statement

For your second task, you are working with a dataset extracted from Steam, an online video game distribution service. This dataset is available on Blackboard and named steam-200k.csv. It provides details on the games different members have purchased and played, along with the number of hours they have played each game. It contains four columns:

➢ The first column contains a unique identifier for each member

➢ The second column contains the name of the game they purchased or played

➢ The third column contains details of the member behaviour, either 'purchase' or 'play'. Because a game has to be purchased before it can be played, there will be two entries for the same game / member combination in some instances

➢ The fourth column is set to 1 for rows where the behaviour is 'purchase'. For rows where the behaviour is 'play', the value corresponds to the number of hours of play

We can use both purchase and play behaviours as implicit user feedback, which is useful for training a recommender system.

Your task as a data scientist is to do the following: 

➢ Load the dataset into a Spark DataFrame. You may want to consider carrying out some initial exploratory analysis of the data, which you are welcome to do using DataFrames, Spark SQL, Databricks visualisations, another visualisation library, etc.

➢ Use MLlib to train a collaborative filtering recommender system on the provided data, evaluate its performance, and explore some of the resulting recommendations. You will need to carry out all pre-processing steps, such as splitting the data into training and test sets. It is your decision whether to include both 'purchase' and 'play' behaviours or to choose one of these as more suitable for your purposes. You may wish to experiment with more than one approach.

Note: To run Alternating Least Squares (ALS) matrix factorization using MLlib, we need to have integer ID values for both users and items, and this dataset does not contain IDs for the games. You will need to find a way to generate a unique integer ID for each game and add this into the DataFrame. There is an additional file, games.csv, which you can use to do this, but you will receive more marks if you are able to complete this within Databricks without using this additional csv file.
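One hedged sketch of such a pipeline, using StringIndexer to generate the integer game IDs inside Databricks rather than relying on games.csv. The column names, the choice of 'play' hours as the implicit rating, and the uploaded file name steam_200k.csv are all illustrative assumptions, not the required approach:

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# The file has no header; name the four columns on load (names are assumed).
steam = (spark.read.csv("/FileStore/tables/steam_200k.csv")
         .toDF("user_id", "game", "behaviour", "value"))

# One possible choice: use 'play' hours as the implicit feedback signal.
plays = (steam.filter("behaviour = 'play'")
         .withColumn("user_id", F.col("user_id").cast("int"))
         .withColumn("hours", F.col("value").cast("float")))

# Generate an integer ID per game without games.csv.
plays = StringIndexer(inputCol="game", outputCol="game_id").fit(plays).transform(plays)

train, test = plays.randomSplit([0.8, 0.2], seed=42)

als = ALS(userCol="user_id", itemCol="game_id", ratingCol="hours",
          implicitPrefs=True, coldStartStrategy="drop")
model = als.fit(train)

# NB: with implicitPrefs the predictions are preference scores, so RMSE
# against raw hours is only a rough proxy for quality.
evaluator = RegressionEvaluator(metricName="rmse", labelCol="hours",
                                predictionCol="prediction")
print(evaluator.evaluate(model.transform(test)))

# Inspect a few recommendations: top 5 games per user.
model.recommendForAllUsers(5).show(5, truncate=False)
```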

Extra features to be implemented for extra marks:

For this task, an implementation of the above which is correct, fully documented and clearly explained can only receive a maximum mark of 69% for this component. Higher-scoring submissions will need to implement additional features, such as:

➢ You can receive more marks for multiple runs as part of your experiment, for example, training models with different hyperparameters.

➢ Track your experiment with MLflow. You must include screenshots from the Databricks Experiment UI in your report to evidence that you have done this. (A minimal MLflow sketch follows this list.)

➢ More in-depth data exploration / visualisation of the data, and more in-depth exploration and evaluation of the recommendations generated. For example, you may want to summarise and visualise information such as the most frequently purchased games and the games with the highest total 'play' time.
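A minimal MLflow tracking sketch, reusing train, test, and evaluator from the earlier ALS sketch; the rank values swept here are arbitrary examples. On Databricks, runs logged this way appear in the Experiment UI, which is where the required screenshots come from:

```python
import mlflow
from pyspark.ml.recommendation import ALS

# train, test and evaluator are assumed from the earlier ALS sketch.
for rank in [5, 10, 20]:  # illustrative hyperparameter sweep
    with mlflow.start_run():
        als = ALS(rank=rank, userCol="user_id", itemCol="game_id",
                  ratingCol="hours", implicitPrefs=True,
                  coldStartStrategy="drop")
        model = als.fit(train)
        rmse = evaluator.evaluate(model.transform(test))
        mlflow.log_param("rank", rank)
        mlflow.log_metric("rmse", rmse)
```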

Report structure

A 5000-word report that documents your solution should be included with your submission. In this module, background research, a literature review, and citations are not required. The format of the report should be as follows:

Task 1: 

Compulsory Part 

1) Description of any setup required to be able to complete this task. 

2) Data cleaning and preparation (including descriptions, justifications, and screenshots of all code). 

3) Problem answers. 

Question 1 

1. Assumptions made about the dataset before answering this question.

2. PySpark implementation outline in RDD (description of main ideas in words and screenshot of code)

3. PySpark implementation outline in DataFrame (description of main ideas in words and screenshot of code)

4. SQL implementation outline (description of main ideas in words and screenshot of code)

5. Discussion of result

Question 2 

1. Assumptions made about the dataset before answering this question.

2. PySpark implementation outline in RDD (description of main ideas in words and screenshot of code)

3. PySpark implementation outline in DataFrame (description of main ideas in words and screenshot of code)

4. SQL implementation outline (description of main ideas in words and screenshot of code)

5. Discussion of result

Questions 3-5

Use the same structure as Questions 1 and 2 above.

Optional Part 

4) Further analysis 1 

Use the same structure as the main questions, but choose only one implementation (RDD, DF, or SQL).

5) Further analysis 2 

Use the same structure as the main questions, but choose only one implementation (other than what you chose for the first further analysis).

6) Further analysis 3 

Use the same structure as the main questions, but choose only one implementation (other than what you chose for the first and second further analyses).

Task 2: 

1) Description of any setup required to complete this task.

2) Loading data into a Spark DataFrame and any exploratory analysis or visualisation carried out prior to training.

3) Data preparation and pre-processing carried out prior to training the model.

4) Selection of hyperparameters, model training and evaluation, and MLflow experiment tracking.

5) Brief discussion of the result.

 

If you are looking for the best help with your Big Data Tools & Techniques project assignment, you've come to the right place! Our PhD-qualified experts provide online assignment help in the UK that solves your tasks according to academic standards. Whether you need help with computer science assignments or report-writing services for UK students, we provide solutions using clinical trial datasets, Spark SQL, PySpark, and visualisation tools. With our services, you'll receive a detailed, high-quality Big Data Tools and Techniques assignment example that is 100% plagiarism-free and covers all critical project areas. We also implement advanced methods and reusable code for higher grades.
