Spark’s conceptual foundation comes from (Hadoop) MapReduce.
MapReduce: Fixed pattern — map → shuffle → reduce, always writing intermediate data to disk.
Spark: General DAG (Directed Acyclic Graph) engine — any chain of transformations, in-memory, with automatic fault recovery.
Spark extends the MapReduce model to support more types of computation, including:
Interactive Queries
Stream Processing
Batch applications
Iterative algorithms
The driver program creates a SparkSession and communicates with the cluster manager to allocate resources and create RDDs (Resilient Distributed Datasets).
To create an RDD, the data is divided up and distributed across worker nodes in a cluster.
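For reference, a minimal sketch of creating a SparkSession in the driver; the application name and local master URL are placeholder assumptions (on a real cluster the master would point at the cluster manager):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-notes-demo")   // hypothetical application name
  .master("local[*]")            // placeholder; use the cluster manager's URL in production
  .getOrCreate()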
We can perform two types of operations on RDDs:
Transformations produce new RDDs from existing ones; they execute on the cluster.
Actions trigger computation and return results to the driver program.
Input → loaded as RDD.
Apply transformations (lazy).
When an action is called, Spark builds and executes the DAG (Directed Acyclic Graph) of transformations, as in the sketch below.
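A small sketch of this flow, assuming the SparkSession `spark` from above; the two transformations are only recorded until the final action runs:
val nums = spark.sparkContext.parallelize(1 to 100)   // input loaded as an RDD
val doubled = nums.map(_ * 2)                         // transformation: lazy, just added to the DAG
val large = doubled.filter(_ > 100)                   // transformation: still nothing executed
println(large.count())                                // action: the DAG is built and executed here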
RDD is the fundamental data abstraction in Spark. It’s an immutable, distributed collection of objects that can be processed in parallel across a cluster.
Immutable: Once created, it cannot be changed.
Distributed: Data is split across multiple nodes.
Resilient: Can recover automatically from node failures using lineage (how it was built from other RDDs).
// Create an RDD by parallelizing a local collection across the cluster
val data = Array(1, 2, 3, 4, 5)
val rdd = spark.sparkContext.parallelize(data)
Operations on RDDs that return a new RDD. They are lazy, meaning Spark doesn’t execute them immediately but builds a DAG (Directed Acyclic Graph) of transformations.
Common transformations:
map(): Apply a function to each element.
val rdd2 = rdd.map(x => x * 2) // Multiply each element by 2
filter(): Keep elements that satisfy a condition.
val rdd3 = rdd2.filter(x => x > 5) // Keep elements > 5
flatMap(): Like map, but can return zero or more outputs per input element.
reduceByKey(): Aggregate values by key (for key-value pair RDDs); both appear in the sketch after this list.
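A minimal sketch of flatMap and reduceByKey together; the sample sentences are made up for illustration:
val lines = spark.sparkContext.parallelize(Seq("spark is fast", "spark is general"))
val wordCounts = lines
  .flatMap(line => line.split(" "))   // one input line becomes several words
  .map(word => (word, 1))             // build key-value pairs
  .reduceByKey(_ + _)                 // sum the counts for each word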
Operations that trigger execution of the DAG and return results to the driver program or write to storage.
Common actions:
collect(): Returns all elements to the driver.
val result = rdd3.collect() // Triggers computation
count(): Counts elements.
reduce(): Aggregates elements using a function.
take(n): Returns first n elements.
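A short sketch of these actions, continuing from the rdd3 defined earlier:
val total = rdd3.count()          // number of elements in the RDD
val sum = rdd3.reduce(_ + _)      // aggregate all elements with an associative function
val firstTwo = rdd3.take(2)       // return the first 2 elements to the driver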
A distributed collection of data organized into named columns, similar to a table in a relational database or a Pandas DataFrame.
Think of DataFrames as RDDs with a schema, allowing smarter optimization and SQL-like querying.
Advantages over RDDs:
Optimized using Catalyst optimizer.
Can use SQL-like operations.
Easier to work with structured data.
val df = spark.read.json("people.json")   // infer the schema from the JSON records
df.select("name", "age").show()           // project two columns and print the result
Module for querying structured data using SQL syntax or DataFrame APIs.
Features:
Supports SQL queries directly on DataFrames.
Integrates with Hive, JDBC, Parquet, ORC, JSON, etc.
Leverages Catalyst optimizer for performance.
df.createOrReplaceTempView("people")                               // register the DataFrame as a SQL view
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")  // query the view with SQL
adults.show()
A lightweight API that lets developers process real-time data streams using the same programming model as batch jobs.
It processes live data streams by splitting them into micro-batches, producing a discretized stream (DStream) of RDDs.
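A minimal DStream sketch; the one-second batch interval and the text source on localhost:9999 are placeholder assumptions:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(1))   // 1-second micro-batches
val stream = ssc.socketTextStream("localhost", 9999)             // each batch arrives as an RDD of lines
stream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()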
Machine learning library for building pipelines that combine feature engineering and model training.
MLlib eases the development and deployment of scalable machine learning algorithms.
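A pipeline sketch using standard MLlib building blocks; the training DataFrame and its text/label columns are assumptions for illustration:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")       // feature engineering
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")   // term-frequency features
val lr = new LogisticRegression().setMaxIter(10)                                 // algorithm to train
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)   // `training` is a hypothetical labeled DataFrame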
This API extends RDDs to a resilient distributed property graph, a directed multigraph whose vertices and edges carry user-defined properties, enabling analysis of relationships in the data.
Interactive graph computations
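A small property-graph sketch; the vertex and edge data here are invented for illustration:
import org.apache.spark.graphx.{Edge, Graph}

val users = spark.sparkContext.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val follows = spark.sparkContext.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(users, follows)            // vertices and edges both carry properties
graph.inDegrees.collect().foreach(println)   // how many followers each user has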
ETL pipelines: Extract, transform, and load required data from multiple sources.
Real-time analytics: Fraud detection, log analysis, and monitoring.
Machine learning: Customer segmentation, predictive analytics.
Recommendation systems: Personalized content in media and e-commerce.
Graph processing: Social network analysis and relationship mapping.
Problem: Suppose you have a large log file (100 GB) stored in HDFS (Hadoop Distributed File System) or Amazon S3, and you want to count how many times each word appears.
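One way to solve this with RDDs; the input and output paths below are placeholders:
val logLines = spark.sparkContext.textFile("hdfs:///logs/access.log")   // hypothetical path
val counts = logLines
  .flatMap(_.split("\\s+"))      // split each line into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)            // runs in parallel across the cluster
counts.saveAsTextFile("hdfs:///logs/word-counts")   // action: triggers the job and writes results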
The example program loads a CSV file, performs DataFrame operations, and executes SQL queries on the data.
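A sketch of such a program; the file name people.csv and its name/age columns are assumptions:
val people = spark.read.option("header", "true").option("inferSchema", "true").csv("people.csv")
people.filter(people("age") >= 18).select("name", "age").show()             // DataFrame operations
people.createOrReplaceTempView("people")
spark.sql("SELECT name, COUNT(*) AS n FROM people GROUP BY name").show()    // SQL query on the view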