Spark’s conceptual foundation comes from (Hadoop) MapReduce.
MapReduce: Fixed pattern — map → shuffle → reduce, always writing intermediate data to disk.
Spark: General DAG (Directed Acyclic Graph) engine — any chain of transformations, in-memory, with automatic fault recovery.
Spark extends the MapReduce model to support more types of computation, including:
Interactive Queries
Stream Processing
Batch applications
Iterative algorithms
The driver program creates a SparkSession and communicates with the cluster manager to allocate resources and create RDDs (Resilient Distributed Datasets).
To create an RDD, the data is divided up and distributed across worker nodes in a cluster.
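For reference, a minimal sketch of creating a SparkSession in the driver; the application name and local master URL are placeholder assumptions (on a real cluster the master would point at the cluster manager):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-notes-demo")   // hypothetical application name
  .master("local[*]")            // placeholder; use the cluster manager's URL in production
  .getOrCreate()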
We can perform two types of operations on RDDs:
Transformations produce new RDDs from existing ones; they execute on the cluster.
Actions trigger computation and return results to the driver program.
Input → loaded as RDD.
Apply transformations (lazy).
When an action is called, Spark builds and executes the DAG (Directed Acyclic Graph) of transformations, as in the sketch below.
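A small sketch of this flow, assuming the SparkSession `spark` from above; the two transformations are only recorded until the final action runs:
val nums = spark.sparkContext.parallelize(1 to 100)   // input loaded as an RDD
val doubled = nums.map(_ * 2)                         // transformation: lazy, just added to the DAG
val large = doubled.filter(_ > 100)                   // transformation: still nothing executed
println(large.count())                                // action: the DAG is built and executed here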
RDD is the fundamental data abstraction in Spark. It’s an immutable, distributed collection of objects that can be processed in parallel across a cluster.
Immutable: Once created, it cannot be changed.
Distributed: Data is split across multiple nodes.
Resilient: Can recover automatically from node failures using lineage (how it was built from other RDDs).
// Create an RDD by parallelizing a local collection across the cluster
val data = Array(1, 2, 3, 4, 5)
val rdd = spark.sparkContext.parallelize(data)
Operations on RDDs that return a new RDD. They are lazy, meaning Spark doesn’t execute them immediately but builds a DAG (Directed Acyclic Graph) of transformations.
Common transformations:
map(): Apply a function to each element.
val rdd2 = rdd.map(x => x * 2) // Multiply each element by 2
filter(): Keep elements that satisfy a condition.
val rdd3 = rdd2.filter(x => x > 5) // Keep elements > 5
flatMap(): Like map, but can return zero or more outputs per input element.
reduceByKey(): Aggregate values by key (for key-value pair RDDs); both appear in the sketch after this list.
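A minimal sketch of flatMap and reduceByKey together; the sample sentences are made up for illustration:
val lines = spark.sparkContext.parallelize(Seq("spark is fast", "spark is general"))
val wordCounts = lines
  .flatMap(line => line.split(" "))   // one input line becomes several words
  .map(word => (word, 1))             // build key-value pairs
  .reduceByKey(_ + _)                 // sum the counts for each word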
Operations that trigger execution of the DAG and return results to the driver program or write to storage.
Common actions:
collect(): Returns all elements to the driver.
val result = rdd3.collect() // Triggers computation
count(): Counts elements.
reduce(): Aggregates elements using a function.
take(n): Returns first n elements.
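A short sketch of these actions, continuing from the rdd3 defined earlier:
val total = rdd3.count()          // number of elements in the RDD
val sum = rdd3.reduce(_ + _)      // aggregate all elements with an associative function
val firstTwo = rdd3.take(2)       // return the first 2 elements to the driver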
A distributed collection of data organized into named columns, similar to a table in a relational database or a Pandas DataFrame.
Think of DataFrames as RDDs with a schema, allowing smarter optimization and SQL-like querying.
Advantages over RDDs:
Optimized using Catalyst optimizer.
Can use SQL-like operations.
Easier to work with structured data.
val df = spark.read.json("people.json")   // infer the schema from the JSON records
df.select("name", "age").show()           // project two columns and print the result
Module for querying structured data using SQL syntax or DataFrame APIs.
Features:
Supports SQL queries directly on DataFrames.
Integrates with Hive, JDBC, Parquet, ORC, JSON, etc.
Leverages Catalyst optimizer for performance.
df.createOrReplaceTempView("people")                               // register the DataFrame as a SQL view
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")  // query the view with SQL
adults.show()
A lightweight API that lets developers process real-time data streams using the same programming model as batch jobs.
It processes live data streams by splitting them into micro-batches, producing a discretized stream (DStream) of RDDs.
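A minimal DStream sketch; the one-second batch interval and the text source on localhost:9999 are placeholder assumptions:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(1))   // 1-second micro-batches
val stream = ssc.socketTextStream("localhost", 9999)             // each batch arrives as an RDD of lines
stream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()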
Machine learning library for building pipelines that combine feature engineering and model training.
MLlib eases the development and deployment of scalable machine learning algorithms.
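A pipeline sketch using standard MLlib building blocks; the training DataFrame and its text/label columns are assumptions for illustration:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")       // feature engineering
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")   // term-frequency features
val lr = new LogisticRegression().setMaxIter(10)                                 // algorithm to train
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)   // `training` is a hypothetical labeled DataFrame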
This API extends RDDs to a resilient distributed property graph, a directed multigraph whose vertices and edges carry user-defined properties, enabling analysis of relationships in the data.
Interactive graph computations
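A small property-graph sketch; the vertex and edge data here are invented for illustration:
import org.apache.spark.graphx.{Edge, Graph}

val users = spark.sparkContext.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val follows = spark.sparkContext.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(users, follows)            // vertices and edges both carry properties
graph.inDegrees.collect().foreach(println)   // how many followers each user has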
ETL pipelines: Extract, transform, and load required data from multiple sources.
Real-time analytics: Fraud detection, log analysis, and monitoring.
Machine learning: Customer segmentation, predictive analytics.
Recommendation systems: Personalized content in media and e-commerce.
Graph processing: Social network analysis and relationship mapping.
Problem: Suppose you have a large log file (100 GB) stored in HDFS (Hadoop Distributed File System) or Amazon S3, and you want to count how many times each word appears.
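One way to solve this with RDDs; the input and output paths below are placeholders:
val logLines = spark.sparkContext.textFile("hdfs:///logs/access.log")   // hypothetical path
val counts = logLines
  .flatMap(_.split("\\s+"))      // split each line into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)            // runs in parallel across the cluster
counts.saveAsTextFile("hdfs:///logs/word-counts")   // action: triggers the job and writes results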
The example program loads a CSV file, performs DataFrame operations, and executes SQL queries on the data.
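A sketch of such a program; the file name people.csv and its name/age columns are assumptions:
val people = spark.read.option("header", "true").option("inferSchema", "true").csv("people.csv")
people.filter(people("age") >= 18).select("name", "age").show()             // DataFrame operations
people.createOrReplaceTempView("people")
spark.sql("SELECT name, COUNT(*) AS n FROM people GROUP BY name").show()    // SQL query on the view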