Rag to Riches is a repository with everything a Data Engineer needs.
Hadoop and Spark are two powerful frameworks widely used in big data processing, but they have distinct roles and advantages. Below is a breakdown of their architectures, differences, and how they can complement each other.
Overview
Hadoop:
- Core Components:
  - HDFS (Hadoop Distributed File System): Storage layer that manages large volumes of data distributed across multiple nodes.
  - MapReduce: Batch processing framework using disk-based intermediate storage.
  - YARN (Yet Another Resource Negotiator): Resource management and job scheduling layer.
- Strengths:
  - Handles massive amounts of data reliably.
  - Fault-tolerant and scalable.
  - Cost-effective storage on commodity hardware.
  - Designed primarily for batch processing.
Spark:
- Core Components:
  - RDD (Resilient Distributed Dataset): In-memory data abstraction for fault-tolerant distributed computing.
  - DataFrame & Dataset APIs: Higher-level abstractions for querying and transforming structured data.
  - Spark SQL: Module for querying data with SQL syntax.
  - Spark Streaming: Real-time data processing module.
  - MLlib: Machine learning library.
  - GraphX: Graph computation library.
- Strengths:
  - Faster than Hadoop MapReduce thanks to in-memory processing.
  - Supports batch, streaming, machine learning, and graph processing.
  - Provides easy-to-use APIs for Python, Java, Scala, and R.
  - Integrates with HDFS, Hive, HBase, Cassandra, and other storage systems.
Although Spark is faster and more versatile, it doesn’t replace Hadoop entirely. Instead, they work well together:
- Storage: Spark uses Hadoop's HDFS for distributed file storage.
- Resource Management: Spark can run on YARN for resource management, leveraging an existing Hadoop cluster.
- Hive Integration: Spark SQL can query data in Hive, which uses HDFS as the storage layer.
- Batch and Streaming: Spark processes streaming data (Spark Streaming) alongside Hadoop's batch jobs.
# When to Use Hadoop vs. Spark
- Hadoop: Best for cost-effective storage and batch processing of large-scale data.
- Spark: Best for real-time processing, machine learning, and iterative workloads.
Ideal Scenarios for Hadoop + Spark:
- Processing large historical datasets stored in HDFS with Spark's in-memory speed.
- Running ETL jobs where Spark processes data quickly and Hadoop stores the data reliably.
- Combining batch jobs (MapReduce) and real-time streaming jobs (Spark Streaming).
- HDFS: Stores input and output data.
- YARN: Manages Spark jobs and resources.
- Spark Applications: Run on top of YARN, reading from and writing to HDFS.
- Reading Data from HDFS:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("HDFSExample").getOrCreate()
df = spark.read.text("hdfs://namenode:9000/path/to/input")
df.show()
- Running Spark on YARN:
spark-submit --master yarn --deploy-mode cluster my_app.py
- Writing Data Back to HDFS:
df.write.format("parquet").save("hdfs://namenode:9000/path/to/output")
- Hadoop provides scalable, fault-tolerant storage with HDFS and resource management with YARN.
- Spark builds on this foundation to provide faster, more versatile data processing.
- Together, they form a robust ecosystem for big data processing.
Compute
- Amazon EC2 (Elastic Compute Cloud) - Scalable virtual servers.
- AWS Lambda - Serverless compute service.
- Amazon ECS (Elastic Container Service) - Container orchestration service.
- Amazon EKS (Elastic Kubernetes Service) - Managed Kubernetes service.
- AWS Fargate - Serverless compute for containers.
- AWS Elastic Beanstalk - Managed platform for web apps.
Storage
- Amazon S3 (Simple Storage Service) - Scalable object storage.
- Amazon EBS (Elastic Block Store) - Persistent block storage for EC2.
- Amazon EFS (Elastic File System) - Managed file storage for Linux.
- AWS Backup - Centralized backup service.
- AWS Snowball - Data transfer appliance.
Database
- Amazon RDS (Relational Database Service) - Managed relational databases (e.g., MySQL, PostgreSQL, MariaDB, Oracle, SQL Server).
- Amazon DynamoDB - NoSQL database service.
- Amazon Aurora - High-performance relational database.
- Amazon Redshift - Data warehousing service.
- Amazon ElastiCache - Managed in-memory data store (supports Redis and Memcached).
- Amazon DocumentDB - Managed document database service compatible with MongoDB.
Networking and Content Delivery
- Amazon VPC (Virtual Private Cloud) - Isolated cloud networks.
- AWS CloudFront - Content delivery network (CDN).
- AWS Direct Connect - Private network connection to AWS.
- Elastic Load Balancing - Automatic traffic distribution.
- Amazon Route 53 - Scalable domain name system (DNS).
Machine Learning
- Amazon SageMaker - Machine learning model building and deployment.
- AWS DeepLens - AI-enabled camera.
- Amazon Comprehend - Natural language processing (NLP).
- Amazon Rekognition - Image and video analysis.
- Amazon Textract - Text extraction from scanned documents.
Analytics
- Amazon EMR (Elastic MapReduce) - Managed clusters for big data frameworks such as Hadoop, PySpark, Sqoop, and HBase. The cluster runs on EC2 under the hood and scales by adding or removing nodes. PySpark scripts are submitted to the cluster as EMR steps (a minimal sketch follows after this list).
- AWS Glue - ETL (Extract, Transform, Load) service for data preparation. Typical crawler flow: create a crawler in the Data Catalog, point it at a data source such as an S3 bucket location, choose the target database where the tables will be created, then run the crawler to populate the catalog.
- Amazon Athena - Serverless query service for analyzing data in S3 using SQL. Supports advanced SQL, window functions, and arrays. Approachable for non-technical users and BI analysts.
- Amazon Kinesis - Real-time data streaming.
- Amazon QuickSight - Business intelligence (BI) service.
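As mentioned above, PySpark scripts are run on EMR by adding steps to the cluster. Here is a minimal boto3 sketch of that idea; the region, cluster ID, and S3 script path are placeholders, not values from this repository:
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Add a step that runs a PySpark script already uploaded to S3 (hypothetical path).
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "run-my-pyspark-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/scripts/my_job.py"],
            },
        }
    ],
)
print(response["StepIds"])  # IDs of the newly added steps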
Developer Tools
- AWS CodePipeline - Continuous integration and delivery (CI/CD).
- AWS CodeBuild - Build and test applications.
- AWS CodeDeploy - Automated application deployment.
- AWS Cloud9 - Cloud-based IDE.
- AWS X-Ray - Debugging and tracing applications.
Security and Identity
- AWS IAM (Identity and Access Management) - Access control.
- AWS KMS (Key Management Service) - Encryption key management.
- AWS Shield - DDoS protection.
- AWS WAF (Web Application Firewall) - Application security.
- Amazon Macie - Data security and privacy service.
Management and Governance
- AWS CloudFormation - Infrastructure as code (IaC).
- AWS CloudTrail - Track user activity and API usage.
- Amazon CloudWatch - Monitoring and logging.
- AWS Config - Resource configuration tracking.
- AWS Organizations - Multi-account management.
IoT
- AWS IoT Core - Connect IoT devices to the cloud.
- AWS Greengrass - Edge computing for IoT.
- AWS IoT Analytics - Analytics for IoT data.
- AWS IoT Events - Detect and respond to events from IoT devices.
Migration and Transfer
- AWS Migration Hub - Centralized migration tracking.
- AWS DataSync - Automated data transfer.
- AWS Snow Family - Data transport appliances.
- AWS Application Migration Service - Simplified app migration.
- Amazon EventBridge - Workflow scheduling (cron-based rules) and automation.
Message brokers: Amazon Simple Notification Service (SNS) and Amazon Simple Queue Service (SQS) are both message brokers in AWS that enable asynchronous communication between components.
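A minimal sketch of that asynchronous messaging with boto3; the topic ARN and queue URL are placeholders:
import boto3

sns = boto3.client("sns", region_name="us-east-1")  # assumed region
sqs = boto3.client("sqs", region_name="us-east-1")

# Publish a notification to an SNS topic (fan-out to all subscribers).
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:orders",  # hypothetical topic
    Message="order_created: 42",
)

# Send a message to an SQS queue (point-to-point, pulled later by a consumer).
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue",  # hypothetical queue
    MessageBody="order_created: 42",
)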
Here’s a list of popular Microsoft Azure services across various categories:
Compute
- Azure Virtual Machines - Scalable virtual servers.
- Azure App Service - Platform for web and mobile apps.
- Azure Kubernetes Service (AKS) - Managed Kubernetes for containerized applications.
- Azure Functions - Serverless compute service.
- Azure Container Instances (ACI) - Run containers without managing servers.
- Azure Batch - Batch computing for large-scale parallel jobs.
Storage
- Azure Blob Storage - Scalable object storage for unstructured data.
- Azure Disk Storage - Managed disks for VMs.
- Azure File Storage - Fully managed file shares in the cloud.
- Azure Data Lake Storage - Storage optimized for big data analytics.
- Azure Backup - Simplified and secure backup solutions.
Database
- Azure SQL Database - Managed relational database service.
- Azure Cosmos DB - Globally distributed NoSQL database.
- Azure Database for MySQL/PostgreSQL - Managed open-source databases.
- Azure Synapse Analytics - Unified analytics platform for big data and data warehousing.
- Azure Cache for Redis - Managed in-memory caching service.
Networking
- Azure Virtual Network (VNet) - Private network in the cloud.
- Azure Traffic Manager - DNS-based traffic load balancer.
- Azure Load Balancer - Distributes traffic across multiple servers.
- Azure Application Gateway - Layer 7 load balancing and WAF.
- Azure Content Delivery Network (CDN) - Deliver content globally with low latency.
AI and Machine Learning
- Azure Machine Learning - Build and deploy machine learning models.
- Azure Cognitive Services - Pre-built AI services for vision, speech, and text.
- Azure Bot Service - Develop and manage intelligent chatbots.
- Azure OpenAI Service - Access to advanced AI models like GPT.
- Azure Video Indexer - AI-powered video analysis.
Analytics
- Azure Data Factory - ETL service for data integration.
- Azure Stream Analytics - Real-time stream processing.
- Azure Log Analytics - Analyze and query log data.
- Azure Event Hubs - Big data streaming platform.
- Azure Monitor - Monitor applications and infrastructure.
Developer Tools
- Azure DevOps - CI/CD pipelines, version control, and project management.
- Azure DevTest Labs - Manage development and testing environments.
- Azure Pipelines - CI/CD automation for app deployment.
- Azure Repos - Cloud-hosted Git repositories.
- Azure Artifacts - Package management service.
Security
- Azure Active Directory (Azure AD) - Identity and access management.
- Azure Key Vault - Securely store and manage keys, secrets, and certificates.
- Azure Security Center - Unified security management.
- Azure Sentinel - Cloud-native SIEM and threat detection.
- Azure Firewall - Managed network security service.
Management and Monitoring
- Azure Resource Manager (ARM) - Infrastructure as code (IaC).
- Azure Advisor - Personalized recommendations for best practices.
- Azure Cost Management - Monitor and optimize cloud costs.
- Azure Automation - Automate repetitive tasks.
- Azure Blueprints - Define and deploy cloud environments consistently.
Internet of Things (IoT)
- Azure IoT Hub - Connect and manage IoT devices.
- Azure IoT Central - Fully managed IoT application platform.
- Azure Digital Twins - Create digital replicas of physical systems.
- Azure Time Series Insights - Analytics and visualization for IoT data.
- Azure Sphere - Secure IoT devices and applications.
Hybrid and Multicloud
- Azure Arc - Manage resources across on-premises, Azure, and other clouds.
- Azure Stack - Run Azure services on-premises.
- Azure Backup - Unified data backup across hybrid environments.
- Azure Site Recovery - Disaster recovery as a service.
Migration Tools
- Azure Migrate - Discover, assess, and migrate workloads.
- Azure Database Migration Service - Simplify database migrations.
- Azure Data Box - Secure offline data transfer appliance.
Blockchain
- Azure Blockchain Service - Simplified blockchain development and deployment.
- Azure Blockchain Workbench - Tools to build blockchain solutions.
Business Intelligence
- Azure Power BI Embedded - Integrate Power BI reports into applications.
- Azure Analysis Services - Enterprise-grade analytics and BI modeling.
As a Data Engineer, Apache Spark is a versatile and powerful tool to handle large-scale data processing tasks. Spark’s unified ecosystem supports batch processing, streaming, machine learning, and SQL-based analytics on massive datasets. Here’s an overview of the key tasks you can perform with Spark as a Data Engineer:
- ETL (Extract, Transform, Load) Pipelines
  - Extract data from various sources (HDFS, AWS S3, Kafka, RDBMS, NoSQL databases, etc.).
  - Transform data to clean, enrich, and apply business logic.
  - Load processed data into data warehouses, databases, or analytical platforms.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ETL Example").getOrCreate()
raw_df = spark.read.csv("hdfs://path/to/data.csv", header=True)
transformed_df = raw_df.filter("age > 18").withColumnRenamed("name", "full_name")
transformed_df.write.parquet("hdfs://path/to/output")
- Batch Processing of Large Datasets
  - Process historical or static data stored in HDFS, S3, Azure Blob Storage, etc.
  - Use DataFrame and Dataset APIs for efficient batch operations.
  - Spark can replace traditional MapReduce jobs with much faster execution.
Example:
df.groupBy("country").agg({"revenue": "sum"}).show()
- Real-Time Data Processing (Streaming)
  - Build real-time streaming pipelines using Spark Streaming or Structured Streaming.
  - Connect with streaming sources such as Apache Kafka, Flume, Kinesis, or HDFS.
  - Process and analyze data in near real time.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Streaming Example").getOrCreate()
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic_name")
      .load())
query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()
- Data Integration
  - Integrate data from various sources like RDBMS, NoSQL, APIs, cloud storage, etc.
  - Spark's JDBC connector allows reading/writing from databases like Oracle, MySQL, or Postgres.
  - Merge datasets to create a unified data lake or data warehouse.
Example:
jdbc_url = "jdbc:mysql://dbserver:3306/dbname"
connection_properties = {"user": "username", "password": "password"}
df = spark.read.jdbc(url=jdbc_url, table="employees", properties=connection_properties)
df.show()
- SQL-based Analytics (Spark SQL)
  - Run SQL queries on large datasets with Spark SQL.
  - Integrate with tools like Hive, Presto, and BI tools (Power BI, Tableau, etc.).
  - Useful for structured data analysis.
Example:
spark.sql("SELECT country, SUM(revenue) FROM sales GROUP BY country").show()
- Machine Learning Pipelines (MLlib)
  - Build and execute machine learning pipelines for data engineering and feature engineering tasks.
  - Use Spark's MLlib for large-scale machine learning algorithms.
  - Prepare, clean, and transform data for ML workflows.
Example:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")
df = assembler.transform(df)
kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(df)
predictions = model.transform(df)
predictions.show()
- Data Lake and Warehouse Management
  - Use Spark to manage and process data stored in data lakes or data warehouses.
  - Integrate with tools like Delta Lake for transaction support and schema enforcement.
  - Use Spark to load, transform, and optimize data storage.
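A minimal Delta Lake sketch, assuming the delta-spark package is installed and the SparkSession is configured with the Delta extensions; the table path is a placeholder:
# Write a DataFrame as a Delta table, then read it back.
df.write.format("delta").mode("overwrite").save("hdfs://path/to/delta/sales")  # placeholder path
delta_df = spark.read.format("delta").load("hdfs://path/to/delta/sales")
delta_df.show()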
- Graph Processing (GraphX)
  - Perform graph computation tasks for scenarios like network analysis or social graph processing.
  - GraphX itself exposes a Scala/Java API; from PySpark, graph workloads are typically handled with the GraphFrames package (see the sketch below).
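A minimal sketch using the third-party graphframes package (an assumption, not part of core PySpark); the vertex/edge column names follow the GraphFrames convention of id/src/dst:
from graphframes import GraphFrame

# Toy graph: two people and one "follows" relationship.
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()  # number of incoming edges per vertex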
- Performance Optimization
  - Optimize queries and jobs with partitioning, caching, and broadcast joins.
  - Inspect memory usage and execution plans using the Spark UI.
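A small sketch of two of these techniques, caching and a broadcast join; the DataFrame names and join key are placeholders:
from pyspark.sql.functions import broadcast

# Cache a frequently reused DataFrame in memory.
large_df.cache()

# Broadcast the small dimension table so the join avoids shuffling the large side.
joined = large_df.join(broadcast(small_dim_df), on="customer_id")
joined.explain()  # inspect the physical plan (a BroadcastHashJoin is expected)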
- Data Migration
  - Move and transform data between on-premises systems and cloud platforms (AWS, Azure, GCP).
  - Use Spark as a bridge for seamless data transfers.
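A minimal sketch of using Spark as that bridge, reading from an on-premises database over JDBC and writing Parquet to S3; connection details and the bucket are placeholders, and the cluster is assumed to have the JDBC driver and s3a connector configured:
# Read from an on-prem MySQL table over JDBC (placeholder host/credentials).
src_df = spark.read.jdbc(
    url="jdbc:mysql://onprem-db:3306/sales",
    table="orders",
    properties={"user": "username", "password": "password"},
)

# Write the data to S3 as Parquet (placeholder bucket).
src_df.write.mode("overwrite").parquet("s3a://my-bucket/migrated/orders")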
Common Tools and Integrations for Data Engineers
- Hadoop Ecosystem: HDFS, Hive, HBase, YARN.
- Messaging Systems: Kafka, RabbitMQ.
- Databases: MySQL, Oracle, Cassandra, MongoDB.
- Cloud: AWS S3, Azure Blob Storage, GCP BigQuery.
- Orchestration: Apache Airflow, Oozie.
- Delta Lake: Versioned data lake with ACID transactions.
- Unified platform for batch, real-time, and machine learning workloads.
- Scales efficiently to petabyte-scale data.
- Compatible with a variety of data sources.
- User-friendly APIs for rapid development.
Spark empowers Data Engineers to build end-to-end data pipelines, integrate multiple systems, and enable data-driven decision-making. It bridges the gap between data storage, data processing, and analytics.
Kafka is a distributed event-streaming platform designed to handle high-throughput, real-time data feeds. It is primarily used for building data pipelines, stream processing, and event-driven applications. Kafka was originally developed by LinkedIn and later open-sourced as part of the Apache Software Foundation.
- Distributed Architecture: Kafka scales horizontally by distributing data across multiple brokers (servers) in a cluster.
- High Throughput: It can process a vast number of events per second, making it suitable for real-time data ingestion and processing.
- Fault Tolerance: Kafka replicates data across brokers to ensure durability and high availability.
- Scalable: Adding new brokers to a Kafka cluster allows horizontal scaling without downtime.
- Persistent Storage: Kafka stores data on disk in a fault-tolerant manner, allowing consumers to reprocess events if needed.
Key Components:
- Producer: Applications that publish (write) data to Kafka topics.
- Topic: A category to which messages are sent by producers. Topics are partitioned for parallel processing.
- Partition: A subset of a topic, enabling Kafka to distribute data across a cluster for better scalability.
- Consumer: Applications that subscribe to (read) data from topics.
- Broker: A server in a Kafka cluster that stores and serves data.
- ZooKeeper: Used for managing cluster metadata (e.g., brokers, topics, and partitions). Modern Kafka deployments are moving toward Kafka Raft (KRaft) for this purpose.
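A minimal producer/consumer sketch using the third-party kafka-python package (an assumption; the Spark examples above show the other common way to consume Kafka data). The broker address and topic name are placeholders:
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a message to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("topic_name", b"order_created: 42")
producer.flush()

# Consumer: read messages from the same topic, starting from the earliest offset.
consumer = KafkaConsumer("topic_name",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)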
Common Use Cases:
- Real-time Analytics: Stream and process data for dashboards or analytics in real time.
- Data Integration: Connect multiple data sources and sinks, enabling ETL pipelines.
- Log Aggregation: Collect and centralize logs for analysis and monitoring.
- Event Sourcing: Maintain a history of events for replay and debugging.
- Messaging: Use Kafka as a messaging system for decoupling microservices.
- IoT Applications: Process data streams from IoT devices.
Advantages:
- Scalability: Kafka handles large-scale data ingestion and processing.
- Reliability: Replication ensures no data loss.
- Flexibility: Supports a variety of data sources and sinks.
- Open Ecosystem: Integrates well with big data tools like Apache Spark, Flink, and Hadoop.
Kafka Ecosystem Tools:
- Kafka Streams: A library for building stream-processing applications.
- Kafka Connect: A tool to integrate Kafka with external systems (e.g., databases, file systems).
- Confluent Platform: A commercial distribution of Kafka offering additional tools like monitoring and a schema registry.