Project: A Crime Analysis of the Last Decade in NYC

Big Data Project with Apache Spark + Amazon EMR

The following are the slides and written report for the Big Data Seminar course. The project is written with Spark and was deployed first on Databricks and then on Amazon EMR. The packages involved are SparkML, developed by the Apache Spark team, and Azure Machine Learning, developed by Microsoft.

I used the Community Edition of Databricks, and the EMR usage cost around $5 (paid by the school). If you are interested in replicating the results yourself, feel free to take my code from the following links:

PySpark Code for Visualization and Exploratory Analysis

PySpark Code for Feature Engineering and Modeling

Project Slides

Project Report
