Category | Assignment |
Subject | Computer Science |
University | ___ |
Module Title | 771762 Big Data and Data Mining |
Word Count | 2000 words |
Assignment Format | Report |
Due Date | Thursday 28th of August 2025 |
771762 Context
Unlike in our prior module, Fundamentals of Data Science, this assignment is based on real-world data from two sources, one for each of two separate tasks. Firstly, we would like you to focus on road traffic accidents in 2019; you will have already encountered this database for the presentation assignment. Secondly, we will provide you with social network data from the Stanford Network Analysis Platform (SNAP), which was obtained via Facebook. This assignment is a chance to test your skills against real-world data by analysing them, interpreting the results of that analysis, and producing meaningful conclusions based upon what has been taught in this module.
Project Report Background Information.
The project report is split into two parts, each with a set of tasks for you to complete:
- The first part deals with real accident data from 2019 in the ‘accident_data_v1.0.0_2023.db’ database. You will collate, process and analyse these data, then interpret your findings and conclude by outlining recommendations for policy changes or interventions that should be made by the UK Government.
- The second part deals with real Facebook data: the edges from a series of egonets produced by SNAP, compiled into ‘facebook_combined.txt’. You’ll explore the structure and properties of a social network using this dataset. The objective here is to analyse the network's structure, identify communities and provide insights from a network science context, rather than to generate policy recommendations as in part (a).
Again, in this assignment we will be using the data from 2019 as it represents a very complete sample with a lot of ancillary data available. We have uploaded the relevant data to Canvas here.
771762 Tasks
(a) - Report on Accidents
Imagine that you are a data scientist working for the Department for Transport (DfT) confronted with the accident database detailed earlier.
Your task is to advise the DfT (and any other relevant UK Gov department) on the policy changes/interventions required to improve road safety, as well as to create a model that would predict such accidents and the injuries that people may incur.
Main Objective: You will write a formal report detailing your analysis and results (with visualisations), together with concluding remarks recommending changes in policy or interventions, either locally or nationally, depending on what the data show.
Alongside the main objective, you should also aim to address these questions (at minimum) in the project report using accidents from 2019 only (although there may be occasional use of ‘historical data’ present in the dataset):
- Are there any particular hours of the day, and days of the week, on which these accidents are significantly more likely to occur? If there are, using your data analysis, what possible reasons could help explain such a pattern?
- For motorbikes, are there any particular hours of the day, and days of the week, on which these accidents are significantly more likely to occur? We suggest a focus on comparisons between Motorcycles of 125cc and under; Motorcycles over 125cc and up to 500cc; Motorcycles over 500cc. If there are, using your data analysis, what reasons would there be for a category of motorcycle to have more accidents than others on certain days of the week and at certain times of day?
- For pedestrians involved in accidents, are there any particular hours of the day, and days of the week, on which they are significantly more likely to be involved in said accidents? If there are, using your data analysis, what could explain the patterns you observe?
- Using the Apriori algorithm, explore the impact of selected variables on accident severity.
- Identify accidents in our region: Kingston upon Hull, Humberside, and the East Riding of Yorkshire ONLY. You can do this by filtering on the LSOA or police region, or by another method if you can find one. Run clustering algorithms on these data and analyse the results. What do these clusters reveal about the distribution of accidents across our region?
- Choose three policing areas by filtering the data using the "police_force" column, then create a separate time series model for each policing area chosen to predict weekly accident counts for 2019 based on historical data from 2017 to 2018. How do these predictions compare for each of the chosen policing areas with the actual 2019 accident data?
- Identify the top thirty (30) Lower Layer Super Output Areas (LSOAs) for the City of Hull that recorded the highest number of road accidents in the first three months of 2019. Then aggregate these top thirty records together so that you can fit a time series model to data for the first six months of 2019 (e.g., January to June) for these high-incident areas and forecast the daily accident occurrences for the following month (e.g., July).
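As a possible starting point for the hour-of-day and day-of-week questions above, the following is a minimal sketch rather than a prescribed method. It uses only the Python standard library and a tiny in-memory SQLite table. The table name `accident` and the columns `accident_year`, `day_of_week` and `time` are assumptions made for illustration and should be checked against the actual schema of the supplied database; the demo rows are invented and are not real accident records.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a HYPOTHETICAL schema and invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accident ("
        "accident_year INTEGER, day_of_week INTEGER, time TEXT)"
    )
    rows = [
        (2019, 6, "08:15"),
        (2019, 6, "08:40"),
        (2019, 6, "17:05"),
        (2019, 2, "17:30"),
        (2018, 6, "08:00"),  # different year, excluded by the query below
    ]
    conn.executemany("INSERT INTO accident VALUES (?, ?, ?)", rows)
    return conn


def counts_by_day_and_hour(conn, year):
    """Return {(day_of_week, hour): count} for one year.

    Assumes `time` is stored as 'HH:MM' text (an assumption, see lead-in).
    """
    query = """
        SELECT day_of_week,
               CAST(substr(time, 1, 2) AS INTEGER) AS hour,
               COUNT(*) AS n
        FROM accident
        WHERE accident_year = ?
        GROUP BY day_of_week, CAST(substr(time, 1, 2) AS INTEGER)
    """
    return {(day, hour): n for day, hour, n in conn.execute(query, (year,))}


def chi_square_uniform(observed):
    """Pearson chi-square statistic against equal expected counts per bin."""
    total = sum(observed)
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


conn = build_demo_db()
counts = counts_by_day_and_hour(conn, 2019)

# Collapse the (day, hour) table into 24 hourly bins.
hourly = [0] * 24
for (day, hour), n in counts.items():
    hourly[hour] += n

stat = chi_square_uniform(hourly)
print(counts)
print(stat)
```

The statistic would still need to be compared against a chi-square critical value (23 degrees of freedom for 24 hourly bins) before any claim of significance is made, and the test is only meaningful with realistic sample sizes, not the four demo rows used here. The same grouping could be restricted to motorcycle or pedestrian records once the relevant tables and columns have been identified.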
(b) - Social Network Analysis
Main Objective: You will write an analysis of the outcomes of constructing a social network based on the edge information present in ‘facebook_combined.txt’. You will explore the structure and properties of the social network within this dataset, identify any communities that may be present and explain the implications/meaning of this analysis within a network science context. This task will be shorter than task (a).
Alongside the Main Objective for task (b), please complete the tasks below:
- Construct a social network using the provided data and visualise the network, then provide the basic network characteristics, including the numbers of nodes and edges, network density and average degree.
- Calculate the edge centrality of this network and plot the distribution of the edge centrality values.
- Use two community detection algorithms to detect the clusters/communities within this social network, then compare the results (the number of clusters and the number of nodes in each cluster).
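To illustrate what the basic network characteristics in the first task above amount to, here is a minimal sketch using only the Python standard library. It assumes the edge file holds one whitespace-separated pair of integer node IDs per line and that the network is undirected; both are assumptions to verify against ‘facebook_combined.txt’. The four-edge sample is invented, and the function names are hypothetical. A graph library would normally be used for the visualisation, centrality and community detection tasks.

```python
from io import StringIO


def read_edges(lines):
    """Parse 'u v' lines into a set of undirected edges (self-loops dropped)."""
    edges = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        u, v = int(parts[0]), int(parts[1])
        if u != v:
            # Store each edge with the smaller ID first so (u, v) == (v, u).
            edges.add((min(u, v), max(u, v)))
    return edges


def basic_characteristics(edges):
    """Node count, edge count, density and average degree of a simple graph."""
    nodes = {node for edge in edges for node in edge}
    n, m = len(nodes), len(edges)
    # Undirected simple graph: density = 2m / (n(n-1)), average degree = 2m / n.
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    average_degree = 2 * m / n if n else 0.0
    return {
        "nodes": n,
        "edges": m,
        "density": density,
        "average_degree": average_degree,
    }


# Invented sample standing in for the real edge file.
sample = StringIO("0 1\n0 2\n1 2\n2 3\n")
stats = basic_characteristics(read_edges(sample))
print(stats)
```

To run this on the real data, the `StringIO` sample could be replaced with an open file handle, for example `open("facebook_combined.txt")`, assuming the file is in the working directory.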
771762 Report Structures (Suggested Approach).
Your structure for (a).
Please structure your report as follows.
- Short introduction. No more than a few sentences introducing the dataset and the problems that you seek to solve using it.
- Analysis and Results. Present an analysis of the data, including any visualizations, that addresses questions 1-7, above. This should be broken down into analysing when, where, and under what conditions accidents happen, as per the questions above.
- Predictions and Discussion. This should be working models to address points 6 and 7 in Task (a), above, that can predict the conditions under which accidents are most likely to occur, and the severity of injuries sustained given the conditions they happen under.
- Recommendations. What recommendations can be made to government agencies based on this data and your analysis to improve safety? Keep this to your top 4 or 5 bullet points.
Your structure for (b).
Please structure your report as follows.
- Short introduction. No more than a few sentences introducing the dataset and how you will construct the social network with the provided data.
- Analysis and Results. Present an analysis of the constructed social network, detailing its structure and the communities including any visualizations. Do not forget to justify, by referencing where appropriate, any method or algorithm you use for those objectives outlined in task (b) (i.e. for each point 1-3).
- Discussion. Discuss the results of your analysis and the implications/interpretations/meaning behind what you find.
- Conclusions. Keep this to a few bullet points that highlight the key scientific outcomes from constructing this social network.
Grading.
The following grading rubric (on the next page) will be applied to your supplied answers. The total number of marks available for this assignment is 100. Please note that submitting lots of data is unlikely to attract many marks. Instead, we want to see fully reasoned analyses supported by evidence derived from the data supplied.
Given the word count, it is essential to be concise in your answers. It is strongly suggested that you illustrate your answers with appropriate diagrams (i.e. visualisations) or appendices of example calculations. Further, you might need to read around the topic and undertake library/online research to help with this assignment to achieve the highest grades.
Please upload:
- Your report for Tasks (a) and (b).
- The code you wrote to produce the results and/or visualisations used in the assignment.