Big Data in Hadoop & Machine Learning using Spark

Hadoop

This project uses Spark to explore and analyze the Million Song Dataset (MSD), which is stored in the Hadoop Distributed File System (HDFS). The main dataset comes from a project initiated by The Echo Nest and LabROSA. Each record contains the song ID, the track ID, the artist ID, and 51 other fields, including the year, title, artist tags, and audio properties such as loudness, beat, tempo, and time signature. A detailed introduction and the source dataset are available at http://millionsongdataset.com/.

Spark DataFrame API

The report uses Spark's APIs, primarily the DataFrame API, to transform and pre-process the Million Song Dataset, which is on the order of billions of records. The project covers three main tasks: data processing, audio similarity, and song recommendation.

Machine Learning

Machine learning algorithms such as Logistic Regression, Random Forest, and Support Vector Machine (SVM) are used to perform classification. Alternating Least Squares (ALS) is applied to perform collaborative filtering. In addition, random sampling, stratified sampling, and other re-sampling techniques are covered in this project.
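To make the idea of stratified sampling concrete, here is a minimal sketch in plain Python that splits labeled rows into training and test sets while keeping each class's share roughly equal in both. The function name `stratified_split`, the genre labels, and the 80/20 split are hypothetical choices for illustration and are not taken from the project's code.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into (train, test), sampling each class separately.

    `label_of` is a function returning the class label of a row.
    """
    rng = random.Random(seed)

    # Group rows by class label (one group per stratum).
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        # Take the same fraction from every stratum.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


if __name__ == "__main__":
    # Hypothetical imbalanced data: 90 "rock" rows and 10 "jazz" rows.
    rows = [("rock", i) for i in range(90)] + [("jazz", i) for i in range(10)]
    train, test = stratified_split(rows, label_of=lambda r: r[0])
    print(len(train), len(test))
```

In Spark itself, `DataFrame.sampleBy(col, fractions, seed)` offers per-stratum sampling. It samples rows independently, so the resulting group sizes are approximate rather than exact.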