Big Data in Hadoop & Machine Learning using Spark

Hadoop

This project uses Spark to explore and analyze the Million Song Dataset (MSD), which is stored in the Hadoop Distributed File System (HDFS). The main dataset comes from a project initiated by The Echo Nest and LabROSA. It contains the song ID, the track ID, the artist ID, and 51 other fields, including the year, title, artist tags, and audio properties such as loudness, beat, tempo, and time signature. A detailed introduction and the source dataset are available at http://millionsongdataset.com/.

Spark DataFrame API

The report uses Spark's APIs, primarily the DataFrame API, to transform and pre-process the Million Song Dataset, which is on the order of billions of records. The project covers three main tasks: data processing, audio similarity, and song recommendation.

Machine Learning

Machine learning algorithms such as Logistic Regression, Random Forest, and Support Vector Machine (SVM) are used for classification. Alternating Least Squares (ALS) is applied for collaborative filtering. Random sampling, stratified sampling, and other re-sampling techniques are also covered in this project.
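To make the collaborative filtering step more concrete, here is a minimal sketch of the idea behind ALS in plain Python. It is not the project's code and not Spark's implementation (in Spark this would normally go through `pyspark.ml.recommendation.ALS`). The rank-1 factorization, the function name `als_rank1`, the toy `(user, item, rating)` triples, and the regularization value are all assumptions chosen for illustration.

```python
# Minimal rank-1 ALS sketch for collaborative filtering.
# Illustrative only: hypothetical data, not the project's code or Spark's ALS.

def als_rank1(ratings, n_users, n_items, reg=0.1, n_iters=10):
    """Fit rating ~ u[user] * v[item] on observed (user, item, rating) triples.

    reg must be > 0 so that users or items without ratings do not cause
    a division by zero. Returns (u, v, losses), where losses[k] is the
    regularized squared error after iteration k.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    losses = []
    for _ in range(n_iters):
        # Hold item factors fixed; each user factor is a 1-D ridge solution.
        num = [0.0] * n_users
        den = [reg] * n_users
        for user, item, r in ratings:
            num[user] += r * v[item]
            den[user] += v[item] ** 2
        u = [n / d for n, d in zip(num, den)]

        # Hold user factors fixed; each item factor is a 1-D ridge solution.
        num = [0.0] * n_items
        den = [reg] * n_items
        for user, item, r in ratings:
            num[item] += r * u[user]
            den[item] += u[user] ** 2
        v = [n / d for n, d in zip(num, den)]

        # Regularized squared error over the observed entries only.
        err = sum((r - u[user] * v[item]) ** 2 for user, item, r in ratings)
        penalty = reg * (sum(x * x for x in u) + sum(x * x for x in v))
        losses.append(err + penalty)
    return u, v, losses


# Hypothetical toy data: (user index, song index, implicit rating).
ratings = [
    (0, 0, 5.0), (0, 1, 3.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 2.0), (2, 2, 1.0),
]
u, v, losses = als_rank1(ratings, n_users=3, n_items=3)

# Predicted score for a (user, song) pair that was never observed.
predicted = u[0] * v[2]
```

Each half-step solves its subproblem exactly while the other side is held fixed, so the regularized loss cannot increase from one iteration to the next. A real run on the MSD would use a higher rank and a distributed solver.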

Annaqin0929/Spark_Hadoop
