Classic Car Retail Data Pipeline AWS

Overview

This repository contains an example project that demonstrates the full data engineering lifecycle. As part of the Week 2 assignment for my Data Engineering course, this lab walks through the setup and execution of a data pipeline on AWS. The source system used is a MySQL database hosted on Amazon RDS. AWS Glue is used for the ETL (Extract, Transform, Load) process, with Amazon S3 as the storage destination. The pipeline concludes by querying the data via Amazon Athena and visualizing the results using Jupyter Lab.

The objective of the project is to transform the source data into a more query-efficient format, enabling data analysts to perform quick analytical queries without impacting the performance of the production database.

Problem Statement

The project simulates a scenario where a retailer specializing in classic car scale models stores historical purchases and customer information in a relational MySQL database. The marketing team is interested in analyzing this data to understand customer behavior, sales distribution across countries, and product line performance. However, querying this data from the production system is not efficient for analytical purposes.

As a Data Engineer, my responsibility is to design and implement a data pipeline that extracts, transforms, and loads the data into a query-optimized format, allowing the marketing team to analyze it with ease and efficiency.

Architecture

The data pipeline is built on the following AWS services:

Amazon RDS (Relational Database Service): Hosts the source MySQL database.
AWS Glue: Handles the ETL process to extract and transform the data.
Amazon S3: Serves as the destination for the transformed data.
Amazon Athena: Provides the querying capability for the stored data.
Terraform: Infrastructure as Code (IaC) tool used to provision and manage the AWS services.

Database Schema

Pipeline Workflow

Data Exploration: Query and explore the sample MySQL database stored in Amazon RDS.
ETL Process: Use AWS Glue to extract the data from the MySQL database, transform it as needed, and load it into Amazon S3.
Data Querying: Use Amazon Athena to query the data stored in S3, making it available for analysis.
Data Visualization: Build a dashboard using Jupyter Lab to visualize the queried data.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
data		data
images		images
infrastructure		infrastructure
scripts		scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Classic Car Retail Data Pipeline AWS

Overview

Problem Statement

Architecture

Database Schema

Pipeline Workflow

About

Releases

Packages

Languages

aruadecarvalho/classic-car-retail-data-pipeline-aws

Folders and files

Latest commit

History

Repository files navigation

Classic Car Retail Data Pipeline AWS

Overview

Problem Statement

Architecture

Database Schema

Pipeline Workflow

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages