Annaqin0929/Spark_Hadoop

Big Data in Hadoop & Machine Learning using Spark

Hadoop

This project uses Spark to explore and analyze the Million Song Dataset (MSD), which is stored in the Hadoop Distributed File System (HDFS). The main dataset comes from a project initiated by The Echo Nest and LabROSA. Each record contains the song ID, the track ID, the artist ID, and 51 other fields, including the year, title, artist tags, and audio properties such as loudness, beat, tempo, and time signature. A detailed introduction and the source dataset are available at http://millionsongdataset.com/.

Spark DataFrame API

This project uses Spark's APIs, primarily the DataFrame API, to transform and pre-process the Million Song Dataset, which is too large to handle comfortably on a single machine. The work covers three main tasks: data processing, audio similarity, and song recommendation.

Machine Learning

Machine learning algorithms such as Logistic Regression, Random Forest, and Support Vector Machine (SVM) are used to perform classification. Alternating Least Squares (ALS) is applied to perform collaborative filtering. In addition, random sampling, stratified sampling, and other re-sampling techniques are covered in this project.
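
To illustrate the idea behind stratified sampling, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. It is not the code used in this project: the `stratified_sample` helper, the toy `rows` data, the genre labels, and the fractions are all assumptions made up for the example. The point is that each class is sampled at its own rate, so a minority class can be kept at a higher proportion than the majority class.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fractions, seed=0):
    """Draw a fixed fraction of rows from each class (stratum).

    rows      -- list of records
    label_of  -- function returning the class label of a record
    fractions -- dict mapping label -> fraction to keep (missing labels keep 0)
    """
    rng = random.Random(seed)

    # Group the records by class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    # Sample without replacement inside each group at that group's rate.
    sample = []
    for label, group in by_label.items():
        k = round(len(group) * fractions.get(label, 0.0))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical, imbalanced toy data: 80 "rock" tracks and 20 "jazz" tracks.
rows = [(f"T{i}", "rock") for i in range(80)]
rows += [(f"T{i}", "jazz") for i in range(80, 100)]

# Keep 25% of the majority class and 50% of the minority class.
sample = stratified_sample(
    rows,
    label_of=lambda r: r[1],
    fractions={"rock": 0.25, "jazz": 0.5},
)
print(len(sample))
```

In Spark, `DataFrame.sampleBy(col, fractions, seed)` offers a comparable per-class fraction interface, with the difference that it samples each row independently, so the resulting class sizes are approximate rather than exact as in this sketch.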