Category | Assignment | Subject | Computer Science |
---|---|---|---|
University | University of Salford | Module Title | Big Data Tools & Techniques |
You will be given 2 tasks, each with its own dataset and a set of problem statements. You are required to implement a solution to each problem based on the task's description.
Task 1
You will be using clinical trial datasets in this work, combining the information with a list of pharmaceutical companies. For two historical datasets, you will be given the answers to the questions for a basic implementation, so that you can verify your basic solution to the problems. All data will be available from Blackboard.
Datasets:
The data necessary for this assignment will be supplied as zipped CSV files. Each .csv file has a header describing its contents. They are:
1. Clinicaltrial_2023.csv:
Every row in the dataset corresponds to an individual clinical trial and is identified by several variables. Note that the first column packs several variables together, separated by a delimiter, and the date columns appear in more than one format. Take these issues into account and ensure the dataset is appropriately prepared before starting any analysis (a sketch of one possible approach follows the dataset descriptions below).
(Source: ClinicalTrials.gov)
2. pharma.csv:
The file contains a small subset of a publicly available list of pharmaceutical violations. For the purposes of this work, we are interested in the second column, Parent Company, which contains the name of the pharmaceutical company in question.
(Source: https://violationtracker.goodjobsfirst.org/industry/pharmaceuticals)
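To illustrate the kind of preparation the clinical trial file needs, here is a minimal PySpark sketch. The delimiter, the column names (Id, Study_Title, Completion), and the date formats are all assumptions for illustration; inspect the actual file header before relying on any of them.

```python
from pyspark.sql import functions as F

# Read the raw file; header=True takes column names from the first line.
raw = spark.read.option("header", True).csv(
    "/FileStore/tables/clinicaltrial_2023.csv")

# The first column packs several variables together. The delimiter and the
# promoted column names below are assumptions -- confirm them against the file.
parts = F.split(F.col(raw.columns[0]), "\t")
df = (raw.withColumn("Id", parts.getItem(0))
         .withColumn("Study_Title", parts.getItem(1)))

# The date columns appear in more than one format; try each candidate
# format in turn and keep the first one that parses.
df = df.withColumn(
    "Completion_Date",
    F.coalesce(F.to_date("Completion", "yyyy-MM-dd"),
               F.to_date("Completion", "MMM yyyy")))
```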
When creating tables for this task, you must name them as follows:
➢clinicaltrial_2023
➢pharma
The uploaded datasets must exist (and be named) in the following locations:
➢/FileStore/tables/clinicaltrial_2023.csv
➢/FileStore/tables/pharma.csv
This is to ensure that we can run your notebooks when testing your code (marks are allocated for your code running).
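As a minimal sketch (assuming the files have been uploaded to the locations above), one way to register a dataset under its required table name is a temporary view, which Spark SQL queries can then reference directly:

```python
# Register pharma.csv under the required table name. The same pattern
# applies to clinicaltrial_2023 once it has been cleaned.
spark.read.option("header", True) \
     .csv("/FileStore/tables/pharma.csv") \
     .createOrReplaceTempView("pharma")
```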
Problem statement
You are a data scientist / AI engineer whose client wishes to gain further insight into clinical trials. You are tasked with answering the following questions, using visualisations where they would support your conclusions.
1. The number of studies in the dataset. You must ensure that you explicitly count distinct studies.
2. List all the types of studies in the dataset (as contained in the Type column), along with the frequency of each type, ordered from most frequent to least frequent.
3. The top 5 conditions (from Conditions) with their frequencies.
4. Find the 10 most common sponsors that are not pharmaceutical companies, along with the number of clinical trials they have sponsored. Hint: For a basic implementation, you can assume that the Parent Company column contains all possible pharmaceutical companies.
5. Plot the number of completed studies for each month in 2023. Include your visualisation as well as a table of all the values you have plotted for each month.
You are to implement all 5 questions 3 times: once in Spark SQL and twice in PySpark (once using RDDs and once using DataFrames).
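For illustration, here is a minimal sketch of question 1 in all three APIs. It assumes the study identifier sits in the first field of each parsed RDD row and in a DataFrame column named Id; adjust both to match your cleaned dataset.

```python
# RDD: assumes `rdd` holds parsed rows with the study ID in position 0.
n_rdd = rdd.map(lambda row: row[0]).distinct().count()

# DataFrame: assumes a cleaned DataFrame `df` with an `Id` column.
n_df = df.select("Id").distinct().count()

# Spark SQL, against the required table name.
n_sql = spark.sql(
    "SELECT COUNT(DISTINCT Id) AS n FROM clinicaltrial_2023").first()["n"]
```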
For the visualisation of the results, you are free to use any tool that fulfils the requirements, such as Python's matplotlib, Excel, Power BI, Tableau, or any other free open-source tool you may find suitable. Using built-in visualisations directly is permitted; however, it will not yield a high number of marks. Your report needs to state the software used to generate each visualisation, otherwise a built-in visualisation will be assumed.
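For example, a minimal matplotlib sketch for the question 5 plot might look as follows, assuming `monthly` is a list of (month, count) pairs you have already computed:

```python
import matplotlib.pyplot as plt

# `monthly` is assumed to be a list like [("Jan", 310), ...] from question 5.
months, counts = zip(*monthly)
plt.bar(months, counts)
plt.xlabel("Month (2023)")
plt.ylabel("Completed studies")
plt.title("Completed studies per month, 2023")
plt.show()
```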
Extra features to be implemented for extra marks:
For this task, an implementation of the above which is correct, fully documented and clearly explained can only receive a maximum mark of 69% for this component. Higher scoring submissions will need to implement additional features, such as (but not limited to):
➢ Unzipping the data inside the Databricks system. (You can unzip the file on your computer before uploading it to Databricks; however, to earn extra marks, you should be able to unzip it within the Databricks environment. Additionally, your code should be reusable for us, meaning it needs to include proper cleanup procedures to remove any unnecessary files and folders from the filesystem. This ensures our ability to run your code without errors. A sketch of one possible approach follows this list.)
➢ A maximum of 3 further analyses of the data, motivated by the questions asked (new problem statements other than the 5 problems above).
➢ Writing general and reusable code, for example code that works across different versions of the data. We have provided the clinicaltrial_2020 and clinicaltrial_2021 datasets only for this purpose, if you wish to use them (don't forget, the main dataset is clinicaltrial_2023; the 2020 and 2021 versions are only for extra marks, and using them is not compulsory).
➢ Using more advanced methods to solve the problems, such as defining and using user-defined functions.
➢ Successfully implementing Spark functions that you have not used in the workshop.
➢ Creation of additional visualizations presenting useful information based on your own exploration which is not covered by the problem statements.
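As an example of the unzip-and-clean-up feature, here is a minimal sketch using Python's zipfile module against the DBFS FUSE mount. The archive name and paths are assumptions based on the required locations above; dbutils is available inside Databricks notebooks.

```python
import zipfile

# Extract the uploaded archive directly on DBFS via the /dbfs FUSE mount.
zip_path = "/dbfs/FileStore/tables/clinicaltrial_2023.zip"  # assumed upload path
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("/dbfs/FileStore/tables/")

# Cleanup: remove the archive so the notebook reruns from a clean state.
dbutils.fs.rm("dbfs:/FileStore/tables/clinicaltrial_2023.zip")
```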
Task 2
Problem statement
For your second task, you are working with a dataset extracted from Steam, an online video game distribution service. This dataset is available on Blackboard and named steam 200k.csv. It provides details on the games different members have purchased and played, along with the number of hours they have played each game. It contains four columns:
➢The first column contains a unique identifier for each member
➢The second column contains the name of the game they purchased or played
➢ The third column contains details of the member's behaviour, either 'purchase' or 'play'. Because a game has to be purchased before it can be played, there will be two entries for the same game / member combination in some instances.
➢ The fourth column is set to 1 for rows where the behaviour is 'purchase'. For rows where the behaviour is 'play', the value in the fourth column corresponds to the number of hours of play.
We can use both purchase and play behaviours as implicit user feedback, which is useful for training a recommender system.
Your task as a data scientist is to do the following:
➢ Load the dataset into a Spark DataFrame. You may want to consider carrying out some initial exploratory analysis of the data, which you are welcome to do using DataFrames, Spark SQL, Databricks visualisations, another visualisation library etc.
➢ Use MLlib to train a collaborative filtering recommender system on the provided data, evaluate its performance and explore some of the resulting recommendations. You will need to carry out all pre-processing steps, such as splitting the data into training and test sets. It is your decision whether to include both ‘purchase’ and ‘play’ behaviours or to choose one of these as more suitable for your purposes. You may wish to experiment with more than one approach.
Note: To run Alternating Least Squares (ALS) matrix factorization using MLlib, we need to have integer ID values for both users and items, and this dataset does not contain IDs for the games. You will need to find a way to generate a unique integer ID for each game and add this into the DataFrame. There is an additional file, games.csv, which you can use to do this, but you will receive more marks if you are able to complete this within Databricks without using this additional csv file.
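To make the note above concrete, here is a minimal sketch that generates integer game IDs with StringIndexer (no games.csv needed) and trains ALS on the 'play' rows. The file path, column names, and the choice of hours as the implicit rating are illustrative assumptions, and evaluating an implicit-feedback model with RMSE is only one of several reasonable choices.

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# The file has no header row, so assign column names ourselves (assumed path).
df = (spark.read.csv("/FileStore/tables/steam_200k.csv", inferSchema=True)
      .toDF("member_id", "game", "behaviour", "hours"))

# Keep the 'play' rows and treat hours played as implicit feedback.
plays = df.filter(F.col("behaviour") == "play")

# Generate an integer ID per game name; ALS needs integer item IDs.
plays = (StringIndexer(inputCol="game", outputCol="game_id")
         .fit(plays).transform(plays)
         .withColumn("game_id", F.col("game_id").cast("int")))

train, test = plays.randomSplit([0.8, 0.2], seed=42)

als = ALS(userCol="member_id", itemCol="game_id", ratingCol="hours",
          implicitPrefs=True,          # hours are implicit, not explicit ratings
          coldStartStrategy="drop")    # drop test rows with unseen users/games
model = als.fit(train)

rmse = (RegressionEvaluator(metricName="rmse", labelCol="hours",
                            predictionCol="prediction")
        .evaluate(model.transform(test)))

# Explore a few of the resulting recommendations per member.
model.recommendForAllUsers(5).show(5, truncate=False)
```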
Extra features to be implemented for extra marks:
For this task, an implementation of the above which is correct, fully documented and clearly explained can only receive a maximum mark of 69% for this component. Higher scoring submissions will need to implement additional features, such as:
➢ You can receive more marks for multiple runs as part of your experiment, for example, training models with different hyperparameters.
➢ Track your experiment with MLflow. You must include screenshots from the Databricks Experiment UI in your report to evidence that you have done this. (A minimal tracking sketch follows this list.)
➢ More in-depth data exploration / visualisation of the data, and more in-depth exploration and evaluation of the recommendations generated. For example, you may want to summarise and visualise information such as the most frequently purchased games and the games with the highest total 'play' time.
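Continuing the Task 2 sketch above (reusing its imports and train/test split), a minimal MLflow tracking loop might look like the following. The hyperparameter values are purely illustrative; Databricks notebooks surface these runs in the Experiment UI automatically.

```python
import mlflow

# Try a few ranks and log one MLflow run per setting (values are illustrative).
for rank in [5, 10, 20]:
    with mlflow.start_run():
        als = ALS(userCol="member_id", itemCol="game_id", ratingCol="hours",
                  rank=rank, implicitPrefs=True, coldStartStrategy="drop")
        model = als.fit(train)
        rmse = (RegressionEvaluator(metricName="rmse", labelCol="hours",
                                    predictionCol="prediction")
                .evaluate(model.transform(test)))
        mlflow.log_param("rank", rank)
        mlflow.log_metric("rmse", rmse)
```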
A 5000-word report that documents your solution should be included with your submission. In this module, background research, a literature review, and citations are not required. The format of the report should be as follows:
Task 1:
Compulsory Part
1) Description of any setup required to be able to complete this task.
2) Data cleaning and preparation (including descriptions, justifications, and screenshots of all code).
3) Problem answers.
Question 1
1. Assumptions made about the dataset before answering this question.
2. PySpark implementation outline using RDDs (description of main ideas in words and a screenshot of code)
3. PySpark implementation outline using DataFrames (description of main ideas in words and a screenshot of code)
4. SQL implementation outline (description of main ideas in words and a screenshot of code)
5. Discussion of result
Question 2
1. Assumptions made about the dataset before answering this question.
2. PySpark implementation outline using RDDs (description of main ideas in words and a screenshot of code)
3. PySpark implementation outline using DataFrames (description of main ideas in words and a screenshot of code)
4. SQL implementation outline (description of main ideas in words and a screenshot of code)
5. Discussion of result
Questions 3, 4 and 5
Use the same structure as Questions 1 and 2.
Optional Part
4) Further analysis 1
Use the same structure as the main questions, but choose only one implementation (RDD, DataFrame, or SQL)
5) Further analysis 2
Use the same structure as the main questions, but choose only one implementation (other than the one you chose for the first further analysis)
6) Further analysis 3
Use the same structure as the main questions, but choose only one implementation (other than those you chose for the first and second further analyses)
Task 2:
1) Description of any setup required to complete this task.
2) Loading data into Spark DataFrame and any exploratory analysis or visualisation carried out prior to training.
3) Data preparation and pre-processing carried out prior to training the model.
4) Selection of hyperparameters, model training and evaluation, and MLflow experiment tracking.
5) Brief discussion of the result.