Rag to Riches is a repository with everything a Data Engineer needs.
Hadoop and Spark are two powerful frameworks widely used in big data processing, but they have distinct roles and advantages. Below is a breakdown of their architectures, differences, and how they can complement each other.
Overview
Hadoop:
- Core Components:
  - HDFS (Hadoop Distributed File System): Storage layer that manages large volumes of data distributed across multiple nodes.
  - MapReduce: Batch processing framework using disk-based intermediate storage.
  - YARN (Yet Another Resource Negotiator): Resource management and job scheduling layer.
- Strengths:
  - Handles massive amounts of data reliably.
  - Fault-tolerant and scalable.
  - Cost-effective storage on commodity hardware.
  - Designed primarily for batch processing.
Spark:
- Core Components:
  - RDD (Resilient Distributed Dataset): In-memory data abstraction for fault-tolerant distributed computing.
  - DataFrame & Dataset APIs: Higher-level abstractions for querying and transforming structured data.
  - Spark SQL: Module for querying data with SQL syntax.
  - Spark Streaming: Real-time data processing module.
  - MLlib: Machine learning library.
  - GraphX: Graph computation library.
- Strengths:
  - Faster than Hadoop MapReduce thanks to in-memory processing.
  - Supports batch, streaming, machine learning, and graph processing.
  - Provides easy-to-use APIs for Python, Java, Scala, and R.
  - Integrates with HDFS, Hive, HBase, Cassandra, and other storage systems.
Although Spark is faster and more versatile, it doesn’t replace Hadoop entirely. Instead, they work well together:
- Storage: Spark uses Hadoop's HDFS for distributed file storage.
- Resource Management: Spark can run on YARN for resource management, leveraging an existing Hadoop cluster.
- Hive Integration: Spark SQL can query data in Hive, which uses HDFS as the storage layer.
- Batch and Streaming: Spark processes streaming data (Spark Streaming) alongside Hadoop's batch jobs.
# When to Use Hadoop vs. Spark
- Hadoop: Best for cost-effective storage and batch processing of large-scale data.
- Spark: Best for real-time processing, machine learning, and iterative workloads.
Ideal Scenarios for Hadoop + Spark:
- Processing large historical datasets stored in HDFS with Spark's in-memory speed.
- Running ETL jobs where Spark processes data quickly and Hadoop stores the data reliably.
- Combining batch jobs (MapReduce) and real-time streaming jobs (Spark Streaming).
- HDFS: Stores input and output data.
- YARN: Manages Spark jobs and resources.
- Spark Applications: Run on top of YARN, reading from and writing to HDFS.
- Reading Data from HDFS:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("HDFSExample").getOrCreate()
df = spark.read.text("hdfs://namenode:9000/path/to/input")
df.show()
- Running Spark on YARN:
spark-submit --master yarn --deploy-mode cluster my_app.py
- Writing Data Back to HDFS:
df.write.format("parquet").save("hdfs://namenode:9000/path/to/output")
- Hadoop provides scalable, fault-tolerant storage with HDFS and resource management with YARN.
- Spark builds on this foundation to provide faster, more versatile data processing.
- Together, they form a robust ecosystem for big data processing.
Compute
- Amazon EC2 (Elastic Compute Cloud) - Scalable virtual servers.
- AWS Lambda - Serverless compute service.
- Amazon ECS (Elastic Container Service) - Container orchestration service.
- Amazon EKS (Elastic Kubernetes Service) - Managed Kubernetes service.
- AWS Fargate - Serverless compute for containers.
- AWS Elastic Beanstalk - Managed platform for web apps.
Storage
- Amazon S3 (Simple Storage Service) - Scalable object storage.
- Amazon EBS (Elastic Block Store) - Persistent block storage for EC2.
- Amazon EFS (Elastic File System) - Managed file storage for Linux.
- AWS Backup - Centralized backup service.
- AWS Snowball - Data transfer appliance.
Database
- Amazon RDS (Relational Database Service) - Managed relational databases (e.g., MySQL, PostgreSQL, MariaDB, Oracle, SQL Server).
- Amazon DynamoDB - NoSQL database service.
- Amazon Aurora - High-performance relational database.
- Amazon Redshift - Data warehousing service.
- Amazon ElastiCache - Managed in-memory data store (supports Redis and Memcached).
- Amazon DocumentDB - Managed document database service compatible with MongoDB.
Networking and Content Delivery
- Amazon VPC (Virtual Private Cloud) - Isolated cloud networks.
- AWS CloudFront - Content delivery network (CDN).
- AWS Direct Connect - Private network connection to AWS.
- Elastic Load Balancing - Automatic traffic distribution.
- Amazon Route 53 - Scalable domain name system (DNS).
Machine Learning
- Amazon SageMaker - Machine learning model building and deployment.
- AWS DeepLens - AI-enabled camera.
- Amazon Comprehend - Natural language processing (NLP).
- Amazon Rekognition - Image and video analysis.
- Amazon Textract - Text extraction from scanned documents.
Analytics
- Amazon EMR (Elastic MapReduce) - Managed clusters for big data frameworks such as Hadoop, PySpark, Sqoop, and HBase. The cluster runs on EC2 under the hood and scales by adding or removing nodes. PySpark scripts are submitted to the cluster as EMR steps (a minimal sketch follows after this list).
- AWS Glue - ETL (Extract, Transform, Load) service for data preparation. Typical crawler flow: create a crawler in the Data Catalog, point it at a data source such as an S3 bucket location, choose the target database where the tables will be created, then run the crawler to populate the catalog.
- Amazon Athena - Serverless query service for analyzing data in S3 using SQL. Supports advanced SQL, window functions, and arrays. Approachable for non-technical users and BI analysts.
- Amazon Kinesis - Real-time data streaming.
- Amazon QuickSight - Business intelligence (BI) service.
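As mentioned above, PySpark scripts are run on EMR by adding steps to the cluster. Here is a minimal boto3 sketch of that idea; the region, cluster ID, and S3 script path are placeholders, not values from this repository:
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Add a step that runs a PySpark script already uploaded to S3 (hypothetical path).
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "run-my-pyspark-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/scripts/my_job.py"],
            },
        }
    ],
)
print(response["StepIds"])  # IDs of the newly added steps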
Developer Tools
- AWS CodePipeline - Continuous integration and delivery (CI/CD).
- AWS CodeBuild - Build and test applications.
- AWS CodeDeploy - Automated application deployment.
- AWS Cloud9 - Cloud-based IDE.
- AWS X-Ray - Debugging and tracing applications.
Security and Identity
- AWS IAM (Identity and Access Management) - Access control.
- AWS KMS (Key Management Service) - Encryption key management.
- AWS Shield - DDoS protection.
- AWS WAF (Web Application Firewall) - Application security.
- Amazon Macie - Data security and privacy service.
Management and Governance
- AWS CloudFormation - Infrastructure as code (IaC).
- AWS CloudTrail - Track user activity and API usage.
- Amazon CloudWatch - Monitoring and logging.
- AWS Config - Resource configuration tracking.
- AWS Organizations - Multi-account management.
IoT
- AWS IoT Core - Connect IoT devices to the cloud.
- AWS Greengrass - Edge computing for IoT.
- AWS IoT Analytics - Analytics for IoT data.
- AWS IoT Events - Detect and respond to events from IoT devices.
Migration and Transfer
- AWS Migration Hub - Centralized migration tracking.
- AWS DataSync - Automated data transfer.
- AWS Snow Family - Data transport appliances.
- AWS Application Migration Service - Simplified app migration.
- Amazon EventBridge - Workflow scheduling (cron-based rules) and automation.
Message brokers: Amazon Simple Notification Service (SNS) and Amazon Simple Queue Service (SQS) are both message brokers in AWS that enable asynchronous communication between components.
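A minimal sketch of that asynchronous messaging with boto3; the topic ARN and queue URL are placeholders:
import boto3

sns = boto3.client("sns", region_name="us-east-1")  # assumed region
sqs = boto3.client("sqs", region_name="us-east-1")

# Publish a notification to an SNS topic (fan-out to all subscribers).
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:orders",  # hypothetical topic
    Message="order_created: 42",
)

# Send a message to an SQS queue (point-to-point, pulled later by a consumer).
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue",  # hypothetical queue
    MessageBody="order_created: 42",
)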
Here’s a list of popular Microsoft Azure services across various categories:
Compute
- Azure Virtual Machines - Scalable virtual servers.
- Azure App Service - Platform for web and mobile apps.
- Azure Kubernetes Service (AKS) - Managed Kubernetes for containerized applications.
- Azure Functions - Serverless compute service.
- Azure Container Instances (ACI) - Run containers without managing servers.
- Azure Batch - Batch computing for large-scale parallel jobs.
Storage
- Azure Blob Storage - Scalable object storage for unstructured data.
- Azure Disk Storage - Managed disks for VMs.
- Azure File Storage - Fully managed file shares in the cloud.
- Azure Data Lake Storage - Storage optimized for big data analytics.
- Azure Backup - Simplified and secure backup solutions.
Database
- Azure SQL Database - Managed relational database service.
- Azure Cosmos DB - Globally distributed NoSQL database.
- Azure Database for MySQL/PostgreSQL - Managed open-source databases.
- Azure Synapse Analytics - Unified analytics platform for big data and data warehousing.
- Azure Cache for Redis - Managed in-memory caching service.
Networking
- Azure Virtual Network (VNet) - Private network in the cloud.
- Azure Traffic Manager - DNS-based traffic load balancer.
- Azure Load Balancer - Distributes traffic across multiple servers.
- Azure Application Gateway - Layer 7 load balancing and WAF.
- Azure Content Delivery Network (CDN) - Deliver content globally with low latency.
AI and Machine Learning
- Azure Machine Learning - Build and deploy machine learning models.
- Azure Cognitive Services - Pre-built AI services for vision, speech, and text.
- Azure Bot Service - Develop and manage intelligent chatbots.
- Azure OpenAI Service - Access to advanced AI models like GPT.
- Azure Video Indexer - AI-powered video analysis.
Analytics
- Azure Data Factory - ETL service for data integration.
- Azure Stream Analytics - Real-time stream processing.
- Azure Log Analytics - Analyze and query log data.
- Azure Event Hubs - Big data streaming platform.
- Azure Monitor - Monitor applications and infrastructure.
Developer Tools
- Azure DevOps - CI/CD pipelines, version control, and project management.
- Azure DevTest Labs - Manage development and testing environments.
- Azure Pipelines - CI/CD automation for app deployment.
- Azure Repos - Cloud-hosted Git repositories.
- Azure Artifacts - Package management service.
Security
- Azure Active Directory (Azure AD) - Identity and access management.
- Azure Key Vault - Securely store and manage keys, secrets, and certificates.
- Azure Security Center - Unified security management.
- Azure Sentinel - Cloud-native SIEM and threat detection.
- Azure Firewall - Managed network security service.
Management and Monitoring
- Azure Resource Manager (ARM) - Infrastructure as code (IaC).
- Azure Advisor - Personalized recommendations for best practices.
- Azure Cost Management - Monitor and optimize cloud costs.
- Azure Automation - Automate repetitive tasks.
- Azure Blueprints - Define and deploy cloud environments consistently.
Internet of Things (IoT)
- Azure IoT Hub - Connect and manage IoT devices.
- Azure IoT Central - Fully managed IoT application platform.
- Azure Digital Twins - Create digital replicas of physical systems.
- Azure Time Series Insights - Analytics and visualization for IoT data.
- Azure Sphere - Secure IoT devices and applications.
Hybrid and Multicloud
- Azure Arc - Manage resources across on-premises, Azure, and other clouds.
- Azure Stack - Run Azure services on-premises.
- Azure Backup - Unified data backup across hybrid environments.
- Azure Site Recovery - Disaster recovery as a service.
Migration Tools
- Azure Migrate - Discover, assess, and migrate workloads.
- Azure Database Migration Service - Simplify database migrations.
- Azure Data Box - Secure offline data transfer appliance.
Blockchain
- Azure Blockchain Service - Simplified blockchain development and deployment.
- Azure Blockchain Workbench - Tools to build blockchain solutions.
Business Intelligence
- Azure Power BI Embedded - Integrate Power BI reports into applications.
- Azure Analysis Services - Enterprise-grade analytics and BI modeling.
As a Data Engineer, Apache Spark is a versatile and powerful tool to handle large-scale data processing tasks. Spark’s unified ecosystem supports batch processing, streaming, machine learning, and SQL-based analytics on massive datasets. Here’s an overview of the key tasks you can perform with Spark as a Data Engineer:
- ETL (Extract, Transform, Load) Pipelines
  - Extract data from various sources (HDFS, AWS S3, Kafka, RDBMS, NoSQL databases, etc.).
  - Transform data to clean, enrich, and apply business logic.
  - Load processed data into data warehouses, databases, or analytical platforms.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ETL Example").getOrCreate()
raw_df = spark.read.csv("hdfs://path/to/data.csv", header=True)
transformed_df = raw_df.filter("age > 18").withColumnRenamed("name", "full_name")
transformed_df.write.parquet("hdfs://path/to/output")
- Batch Processing of Large Datasets
  - Process historical or static data stored in HDFS, S3, Azure Blob Storage, etc.
  - Use DataFrame and Dataset APIs for efficient batch operations.
  - Spark can replace traditional MapReduce jobs with much faster execution.
Example:
df.groupBy("country").agg({"revenue": "sum"}).show()
- Real-Time Data Processing (Streaming)
  - Build real-time streaming pipelines using Spark Streaming or Structured Streaming.
  - Connect with streaming sources such as Apache Kafka, Flume, Kinesis, or HDFS.
  - Process and analyze data in near real time.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Streaming Example").getOrCreate()
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic_name")
      .load())
query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()
- Data Integration
  - Integrate data from various sources like RDBMS, NoSQL, APIs, cloud storage, etc.
  - Spark's JDBC connector allows reading/writing from databases like Oracle, MySQL, or Postgres.
  - Merge datasets to create a unified data lake or data warehouse.
Example:
jdbc_url = "jdbc:mysql://dbserver:3306/dbname"
connection_properties = {"user": "username", "password": "password"}
df = spark.read.jdbc(url=jdbc_url, table="employees", properties=connection_properties)
df.show()
- SQL-based Analytics (Spark SQL)
  - Run SQL queries on large datasets with Spark SQL.
  - Integrate with tools like Hive, Presto, and BI tools (Power BI, Tableau, etc.).
  - Useful for structured data analysis.
Example:
spark.sql("SELECT country, SUM(revenue) FROM sales GROUP BY country").show()
- Machine Learning Pipelines (MLlib)
  - Build and execute machine learning pipelines for data engineering and feature engineering tasks.
  - Use Spark's MLlib for large-scale machine learning algorithms.
  - Prepare, clean, and transform data for ML workflows.
Example:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")
df = assembler.transform(df)
kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(df)
predictions = model.transform(df)
predictions.show()
- Data Lake and Warehouse Management
  - Use Spark to manage and process data stored in data lakes or data warehouses.
  - Integrate with tools like Delta Lake for transaction support and schema enforcement.
  - Use Spark to load, transform, and optimize data storage.
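A minimal Delta Lake sketch, assuming the delta-spark package is installed and the SparkSession is configured with the Delta extensions; the table path is a placeholder:
# Write a DataFrame as a Delta table, then read it back.
df.write.format("delta").mode("overwrite").save("hdfs://path/to/delta/sales")  # placeholder path
delta_df = spark.read.format("delta").load("hdfs://path/to/delta/sales")
delta_df.show()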
- Graph Processing (GraphX)
  - Perform graph computation tasks for scenarios like network analysis or social graph processing.
  - GraphX itself exposes a Scala/Java API; from PySpark, graph workloads are typically handled with the GraphFrames package (see the sketch below).
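A minimal sketch using the third-party graphframes package (an assumption, not part of core PySpark); the vertex/edge column names follow the GraphFrames convention of id/src/dst:
from graphframes import GraphFrame

# Toy graph: two people and one "follows" relationship.
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()  # number of incoming edges per vertex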
- Performance Optimization
  - Optimize queries and jobs with partitioning, caching, and broadcast joins.
  - Inspect memory usage and execution plans using the Spark UI.
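A small sketch of two of these techniques, caching and a broadcast join; the DataFrame names and join key are placeholders:
from pyspark.sql.functions import broadcast

# Cache a frequently reused DataFrame in memory.
large_df.cache()

# Broadcast the small dimension table so the join avoids shuffling the large side.
joined = large_df.join(broadcast(small_dim_df), on="customer_id")
joined.explain()  # inspect the physical plan (a BroadcastHashJoin is expected)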
- Data Migration
  - Move and transform data between on-premises systems and cloud platforms (AWS, Azure, GCP).
  - Use Spark as a bridge for seamless data transfers.
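A minimal sketch of using Spark as that bridge, reading from an on-premises database over JDBC and writing Parquet to S3; connection details and the bucket are placeholders, and the cluster is assumed to have the JDBC driver and s3a connector configured:
# Read from an on-prem MySQL table over JDBC (placeholder host/credentials).
src_df = spark.read.jdbc(
    url="jdbc:mysql://onprem-db:3306/sales",
    table="orders",
    properties={"user": "username", "password": "password"},
)

# Write the data to S3 as Parquet (placeholder bucket).
src_df.write.mode("overwrite").parquet("s3a://my-bucket/migrated/orders")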
Common Tools and Integrations for Data Engineers
- Hadoop Ecosystem: HDFS, Hive, HBase, YARN.
- Messaging Systems: Kafka, RabbitMQ.
- Databases: MySQL, Oracle, Cassandra, MongoDB.
- Cloud: AWS S3, Azure Blob Storage, GCP BigQuery.
- Orchestration: Apache Airflow, Oozie.
- Delta Lake: Versioned data lake with ACID transactions.
- Unified platform for batch, real-time, and machine learning workloads.
- Scales efficiently to petabyte-scale data.
- Compatible with a variety of data sources.
- User-friendly APIs for rapid development.
Spark empowers Data Engineers to build end-to-end data pipelines, integrate multiple systems, and enable data-driven decision-making. It bridges the gap between data storage, data processing, and analytics.
Kafka is a distributed event-streaming platform designed to handle high-throughput, real-time data feeds. It is primarily used for building data pipelines, stream processing, and event-driven applications. Kafka was originally developed by LinkedIn and later open-sourced as part of the Apache Software Foundation.
- Distributed Architecture: Kafka scales horizontally by distributing data across multiple brokers (servers) in a cluster.
- High Throughput: It can process a vast number of events per second, making it suitable for real-time data ingestion and processing.
- Fault Tolerance: Kafka replicates data across brokers to ensure durability and high availability.
- Scalable: Adding new brokers to a Kafka cluster allows horizontal scaling without downtime.
- Persistent Storage: Kafka stores data on disk in a fault-tolerant manner, allowing consumers to reprocess events if needed.
Key Components:
- Producer: Applications that publish (write) data to Kafka topics.
- Topic: A category to which messages are sent by producers. Topics are partitioned for parallel processing.
- Partition: A subset of a topic, enabling Kafka to distribute data across a cluster for better scalability.
- Consumer: Applications that subscribe to (read) data from topics.
- Broker: A server in a Kafka cluster that stores and serves data.
- ZooKeeper: Used for managing cluster metadata (e.g., brokers, topics, and partitions). Modern Kafka deployments are moving toward Kafka Raft (KRaft) for this purpose.
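A minimal producer/consumer sketch using the third-party kafka-python package (an assumption; the Spark examples above show the other common way to consume Kafka data). The broker address and topic name are placeholders:
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a message to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("topic_name", b"order_created: 42")
producer.flush()

# Consumer: read messages from the same topic, starting from the earliest offset.
consumer = KafkaConsumer("topic_name",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)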
Common Use Cases:
- Real-time Analytics: Stream and process data for dashboards or analytics in real time.
- Data Integration: Connect multiple data sources and sinks, enabling ETL pipelines.
- Log Aggregation: Collect and centralize logs for analysis and monitoring.
- Event Sourcing: Maintain a history of events for replay and debugging.
- Messaging: Use Kafka as a messaging system for decoupling microservices.
- IoT Applications: Process data streams from IoT devices.
Advantages:
- Scalability: Kafka handles large-scale data ingestion and processing.
- Reliability: Replication ensures no data loss.
- Flexibility: Supports a variety of data sources and sinks.
- Open Ecosystem: Integrates well with big data tools like Apache Spark, Flink, and Hadoop.
Kafka Ecosystem Tools:
- Kafka Streams: A library for building stream-processing applications.
- Kafka Connect: A tool to integrate Kafka with external systems (e.g., databases, file systems).
- Confluent Platform: A commercial distribution of Kafka offering additional tools like monitoring and a schema registry.