Apache Spark 3.5 Tutorial with Examples

In this Apache Spark Tutorial for Beginners, you will learn Spark version 3.5 with Scala code examples. All Spark examples provided in this tutorial are basic, simple, and easy to practice for beginners who are enthusiastic about learning Spark, and all of them were tested in our development environment.

Note: If you can’t find the spark sample code example you are looking for on this tutorial page, I would recommend using the Search option from the menu bar to find your tutorial.

Table of Contents

Note that every sample example explained here is available at the Spark Examples GitHub project for reference.

What is Apache Spark?

Apache Spark Tutorial – Apache Spark is an open-source analytical processing engine for large-scale distributed data processing and machine learning applications. Spark was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. In February 2014, Spark became a Top-Level Apache Project, and it has since received contributions from thousands of engineers, making Spark one of the most active open-source projects at Apache.

Apache Spark 3.5 is a framework that supports Scala, Python, R, and Java. Below are the different language implementations of Spark.

The examples explained in this Spark tutorial use Scala, and the same examples are also explained in the PySpark Tutorial (Spark with Python) Examples. Python also supports Pandas, which provides a DataFrame as well, but it is not distributed.

Features of Apache Spark

Advantages of Apache Spark

What Versions of Java & Scala Spark 3.5 Supports?

Apache Spark 3.5 is compatible with Java versions 8, 11, and 17, Scala versions 2.12 and 2.13, Python 3.8 and newer, as well as R 3.5 and beyond. However, it’s important to note that support for Java 8 versions prior to 8u371 has been deprecated starting from Spark 3.5.0.

[Image: Apache Spark architecture]

For additional learning on this topic, I would recommend reading the following.

Cluster Manager Types

As of writing this Apache Spark Tutorial, Spark supports the following cluster managers:

Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.

Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and other applications.

Hadoop YARN – the resource manager of Hadoop 2 and 3.

Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.

local – not really a cluster manager, but mentioned here because we pass “local” to master() in order to run Spark on a laptop/computer.

Spark Installation

In order to run the Apache Spark examples mentioned in this tutorial, you need to have Spark and the tools it requires installed on your computer. Since most developers use Windows for development, I will explain how to install Spark on Windows in this tutorial. You can also install Spark on a Linux server if needed.

Download Apache Spark by going to the Spark Download page and selecting the link from “Download Spark (point 3)”. If you want to use a different version of Spark & Hadoop, select the one you want from the dropdowns; the link in point 3 changes to the selected version and provides you with an updated link to download.


After downloading, untar the binary using 7zip and copy the extracted folder spark-3.5.0-bin-hadoop3 to c:\apps

Now set the following environment variables.

SPARK_HOME = C:\apps\spark-3.5.0-bin-hadoop3
HADOOP_HOME = C:\apps\spark-3.5.0-bin-hadoop3
PATH = %PATH%;C:\apps\spark-3.5.0-bin-hadoop3\bin

Setup winutils.exe

Download the winutils.exe file from winutils and copy it to the %SPARK_HOME%\bin folder. winutils.exe is different for each Hadoop version, hence download the right version from https://github.com/steveloughran/winutils

spark-shell

The Spark binary distribution comes with an interactive spark-shell. In order to start a shell, go to your SPARK_HOME/bin directory and type “spark-shell”. This command loads Spark and displays the version of Spark you are using.

By default, spark-shell provides the spark (SparkSession) and sc (SparkContext) objects for you to use. Let’s see some examples.
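For instance, here is a small sketch of commands you can type at the spark-shell prompt (the example values are my own illustration, not output from a specific run):

// spark (SparkSession) and sc (SparkContext) are pre-created by spark-shell
spark.version                                 // prints the Spark version, e.g. 3.5.0
sc.appName                                    // "Spark shell"
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))  // create an RDD from a local collection
rdd.sum()                                     // 15.0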

[Screenshot: spark-shell]

spark-shell also creates a Spark context Web UI, which by default can be accessed at http://localhost:4040 (if that port is already in use, Spark picks the next available one, e.g. 4041).

Spark-submit

The spark-submit command is a utility used to run or submit a Spark or PySpark application (or job) to the cluster by specifying options and configurations; the application you are submitting can be written in Scala, Java, or Python (PySpark) code. You can use this utility to do the following.

  1. Submitting Spark applications on different cluster managers like YARN, Kubernetes, Mesos, and Standalone.
  2. Submitting Spark applications in client or cluster deployment mode.
./bin/spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  --driver-memory <value>g \
  --executor-memory <value>g \
  --executor-cores <number-of-cores> \
  --jars <comma-separated-dependencies> \
  --class <main-class> \
  <application-jar> \
  [application-arguments]
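For example, a concrete submit command might look like the following sketch; the class name, jar file, and resource values are hypothetical placeholders, not values from this tutorial:

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 2 \
  --class com.sparkbyexamples.MyApp \
  my-spark-app.jar \
  arg1 arg2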

Spark Web UI

Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, resource consumption of the Spark cluster, and Spark configurations. On Spark Web UI, you can see how the operations are executed.

[Screenshot: Spark Web UI]

Spark History Server

The Spark History Server keeps a log of all completed Spark applications you submit via spark-submit or spark-shell. Before you start it, first you need to set the below config in spark-defaults.conf.

spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path

Now, start the Spark history server on Linux or Mac by running.

 $SPARK_HOME/sbin/start-history-server.sh 

If you are running Spark on Windows, you can start the History Server by running the below command.

 $SPARK_HOME/bin/spark-class.cmd org.apache.spark.deploy.history.HistoryServer 

By default, the History Server listens on port 18080, and you can access it from the browser using http://localhost:18080/

[Screenshot: Spark History Server]

By clicking on each App ID, you will get the details of the application in Spark web UI.

The History Server is very helpful when you are doing Spark performance tuning to improve Spark jobs, as it lets you cross-check a previous application run against the current run.

Spark Modules

[Image: Spark modules and components]

Spark Core

In this section of the Apache Spark Tutorial, you will learn different concepts of the Spark Core library with examples in Scala code. Spark Core is the main base library of Spark that provides the abstractions for distributed task dispatching, scheduling, basic I/O functionality, etc.

Before getting your hands dirty with Spark programming, have your Development Environment Setup in place to run Spark Examples using IntelliJ IDEA.

SparkSession

SparkSession, introduced in version 2.0, is the entry point to the underlying Spark functionality for programmatically working with Spark RDD, DataFrame, and Dataset. Its object spark is available by default in spark-shell.

Creating a SparkSession instance would be the first statement you write in a program that works with RDDs, DataFrames, and Datasets. A SparkSession is created using the SparkSession.builder() builder pattern.

// Create SparkSession
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

Spark Context

SparkContext is available since Spark 1.x (JavaSparkContext for Java) and used to be the entry point to Spark and PySpark before SparkSession was introduced in 2.0. Creating a SparkContext was the first step in a program that works with RDDs and connects to a Spark cluster. Its object sc is available by default in spark-shell.

Since Spark 2.x, when you create a SparkSession, a SparkContext object is created by default, and it can be accessed using spark.sparkContext.

Note that you can create just one SparkContext per JVM but can create many SparkSession objects.
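As a quick illustration, here is a minimal sketch that reuses the spark session created above and reads a few properties from its SparkContext:

// Access the SparkContext that was created along with the SparkSession
val sc = spark.sparkContext
println(sc.appName)              // SparkByExamples.com
println(sc.master)               // local[1]
println(sc.defaultParallelism)   // depends on the master setting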

RDD Spark Tutorial

RDD (Resilient Distributed Dataset) is a fundamental data structure of Spark and the primary data abstraction in Apache Spark and Spark Core. RDDs are fault-tolerant, immutable distributed collections of objects, which means once you create an RDD you cannot change it. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster.

This Apache Spark RDD Tutorial will help you start understanding and using Apache Spark RDD (Resilient Distributed Dataset) with Scala code examples. All RDD examples provided in this tutorial were also tested in our development environment and are available at GitHub spark scala examples project for quick reference.

In this section of the Apache Spark tutorial, I will introduce the RDD and explain how to create them and use their transformation and action operations. Here is the full article on Spark RDD in case you want to learn more about it and get your fundamentals strong.

RDD creation

RDDs are created primarily in two different ways: first, by parallelizing an existing collection, and second, by referencing a dataset in an external storage system (HDFS, S3, and many more).

sparkContext.parallelize()

sparkContext.parallelize is used to parallelize an existing collection in your driver program. This is a basic method to create RDD.

//Create RDD from parallelize
val dataSeq = Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))
val rdd = spark.sparkContext.parallelize(dataSeq)

sparkContext.textFile()

Using the textFile() method, we can read a text (.txt) file from many sources like HDFS, S3, Azure, local, etc. into an RDD.

//Create RDD from external data source
val rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

RDD Operations

On Spark RDDs, you can perform two kinds of operations: transformations and actions.

RDD Transformations

Spark RDD transformations are lazy operations, meaning they don’t execute until you call an action on the RDD. Since RDDs are immutable, when you run a transformation (for example map()), instead of updating the current RDD, it returns a new RDD.

Some transformations on RDDs are flatMap(), map(), reduceByKey(), filter(), sortByKey(), and all of these return a new RDD instead of updating the current one.
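As a minimal sketch (assuming rdd2 from the textFile() example above holds lines of text), the following chains several transformations; nothing executes yet because no action has been called:

// Transformations are lazy; each call returns a new RDD
val wordsRDD  = rdd2.flatMap(_.split(" "))        // split each line into words
val pairRDD   = wordsRDD.map(word => (word, 1))   // pair each word with a count of 1
val countsRDD = pairRDD.reduceByKey(_ + _)        // sum the counts per word
val filterRDD = countsRDD.filter(_._2 > 1)        // keep words that occur more than once
val sortedRDD = filterRDD.sortByKey()             // sort by word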

RDD Actions

An RDD action operation returns values from an RDD to the driver node. In other words, any RDD function that returns something other than RDD[T] is considered an action. Actions trigger the computation and return the result to the driver program.

Some actions on RDDs are count(), collect(), first(), max(), reduce(), and more.
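Continuing the sketch above, calling actions triggers the computation and returns the results to the driver:

// Actions trigger execution and return non-RDD values to the driver
println(sortedRDD.count())                   // number of words kept by the filter
sortedRDD.collect().foreach(println)         // bring all (word, count) pairs to the driver
println(sortedRDD.first())                   // first (word, count) pair
println(countsRDD.map(_._2).reduce(_ + _))   // total number of words in the file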

RDD Examples

DataFrame Spark Tutorial with Basic Examples

The DataFrame definition is very well explained by Databricks, hence I do not want to define it again and confuse you. Below is the definition I took from Databricks.

DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

– Databricks

DataFrame creation

The simplest way to create a Spark DataFrame is from a seq collection. Spark DataFrame can also be created from an RDD and by reading files from several sources.

using createDataFrame()

By using createDataFrame() function of the SparkSession you can create a DataFrame.

// Create DataFrame
val data = Seq(("James", "", "Smith", "1991-04-01", "M", 3000),
  ("Michael", "Rose", "", "2000-05-19", "M", 4000),
  ("Robert", "", "Williams", "1978-09-05", "M", 4000),
  ("Maria", "Anne", "Jones", "1967-12-01", "F", 4000),
  ("Jen", "Mary", "Brown", "1980-02-17", "F", -1)
)
val columns = Seq("firstname", "middlename", "lastname", "dob", "gender", "salary")
val df = spark.createDataFrame(data).toDF(columns: _*)

Since a DataFrame is a structured format that contains column names and types, we can get the schema of the DataFrame using df.printSchema().

df.show() shows the first 20 rows of the DataFrame.

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|dob       |gender|salary|
+---------+----------+--------+----------+------+------+
|James    |          |Smith   |1991-04-01|M     |3000  |
|Michael  |Rose      |        |2000-05-19|M     |4000  |
|Robert   |          |Williams|1978-09-05|M     |4000  |
|Maria    |Anne      |Jones   |1967-12-01|F     |4000  |
|Jen      |Mary      |Brown   |1980-02-17|F     |-1    |
+---------+----------+--------+----------+------+------+

In this Apache Spark SQL DataFrame Tutorial, I have explained the most commonly used operations/functions on DataFrame & Dataset with working Scala examples.

Spark DataFrame Advanced concepts

Spark Array and Map operations

Spark Aggregate

Spark SQL Joins

Spark Performance

Other Helpful topics on DataFrame

Spark SQL Schema & StructType

Spark SQL Functions

Spark SQL provides several built-in functions. When possible, try to leverage the standard library, as built-in functions are a little more compile-time safe, handle nulls, and perform better compared to UDFs. If your application is performance-critical, try to avoid using custom UDFs at all costs, as their performance is not guaranteed.

In this section, we will see several Spark SQL functions Tutorials with Scala examples.
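As a small sketch of built-in functions, the following reuses the df DataFrame created earlier in this tutorial (firstname, lastname, gender, and salary are columns from that example):

import org.apache.spark.sql.functions._

val result = df
  .withColumn("fullname", concat_ws(" ", col("firstname"), col("lastname")))  // combine name columns
  .withColumn("salary_x2", col("salary") * 2)                                 // derive a new column
  .filter(upper(col("gender")) === "M")                                       // built-in string function
result.show()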

Spark Data Source with Examples

Spark SQL supports operating on a variety of data sources through the DataFrame interface. This section of the tutorial describes reading and writing data using the Spark Data Sources with Scala examples. Using the Data Source API, we can load data from or save data to RDBMS databases, Avro, Parquet, XML, etc.
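Below is a minimal sketch of the DataFrame reader/writer API; the file paths are hypothetical:

// Read a CSV file into a DataFrame
val csvDF = spark.read
  .option("header", "true")       // first line contains column names
  .option("inferSchema", "true")  // infer column types
  .csv("/path/to/input.csv")

// Write the same data back out in Parquet format
csvDF.write
  .mode("overwrite")
  .parquet("/path/to/output/parquet")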

Text

CSV

JSON

JSON’s readability, flexibility, language-agnostic nature, and support for semi-structured data make it a preferred choice in big data Spark applications where diverse sources, evolving schemas, and efficient data interchange are common requirements.

Key characteristics and reasons why JSON is used in big data include its human-readable format, language-agnostic nature, support for semi-structured data, schema evolution, etc.
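Here is a minimal sketch of reading JSON into a DataFrame (the path is hypothetical; the multiLine option is only needed when a single JSON record spans multiple lines):

// Read a JSON file; the schema is inferred from the JSON structure
val jsonDF = spark.read
  .option("multiLine", "true")
  .json("/path/to/input.json")
jsonDF.printSchema()
jsonDF.show()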

Parquet

Avro

ORC

XML

Hive & Tables

SQL Spark Tutorial

Spark SQL is one of the most used Spark modules and is used for processing structured, columnar data. Once you have a DataFrame created, you can interact with the data by using SQL syntax. In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on Spark DataFrames. In the later sections of this Apache Spark tutorial, you will learn in detail how to use SQL select, where, group by, join, union, etc.

In order to use SQL, first we need to create a temporary view on the DataFrame using the createOrReplaceTempView() function. Once created, this view can be accessed throughout the SparkSession, and it is dropped when the SparkSession terminates.

SQL queries are executed on the view using the sql() method of the SparkSession, and this method returns a new DataFrame.

df.createOrReplaceTempView("PERSON_DATA")
val df2 = spark.sql("SELECT * from PERSON_DATA")
df2.printSchema()
df2.show()

Let’s see another example using group by .

val groupDF = spark.sql("SELECT gender, count(*) from PERSON_DATA group by gender")
groupDF.show()

This yields the below output

+------+--------+
|gender|count(1)|
+------+--------+
|     F|       2|
|     M|       3|
+------+--------+

Similarly, you can run any traditional SQL queries on DataFrames using Spark SQL.

Spark HDFS & S3 Tutorial

Spark Streaming Tutorial & Examples

Spark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads. It is used to process real-time data from sources like file system folders, TCP sockets, S3, Kafka, Flume, Twitter, and Amazon Kinesis, to name a few. The processed data can be pushed to databases, Kafka, live dashboards, etc.

[Image: Spark Streaming]
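As an illustration, below is a minimal Structured Streaming (the DataFrame-based streaming API) word-count sketch; it assumes a socket server such as nc -lk 9999 is running on localhost and is a sketch rather than code from a specific section of this tutorial:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("StreamingExample")
  .getOrCreate()
import spark.implicits._

// Read lines from the socket as an unbounded (streaming) DataFrame
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words and compute a running count per word
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Write the running counts to the console as data arrives
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()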

Spark with Kafka Tutorials

Spark – HBase Tutorials & Examples

In this section of the Spark Tutorial, you will learn about several Apache HBase Spark connectors, how to read an HBase table into a Spark DataFrame, and how to write a DataFrame to an HBase table.

Apache HBase is an open-source, distributed, and scalable NoSQL database that runs on top of the Hadoop Distributed File System (HDFS). It provides real-time read and write access to large datasets and is designed for handling massive amounts of unstructured or semi-structured data, making it suitable for big data applications.

Spark – Hive Tutorials

In this section, you will learn what Apache Hive is and see several examples of connecting to Hive, creating Hive tables, and reading them into a DataFrame.

Apache Hive is a data warehousing and SQL-like query language tool built on top of Hadoop. It facilitates querying and managing large datasets stored in Hadoop Distributed File System (HDFS) using a familiar SQL syntax. Hive translates SQL-like queries into MapReduce or Apache Tez jobs, enabling users without extensive programming skills to analyze big data. It supports schema-on-read, allowing flexible data structures, and integrates with HBase and other Hadoop ecosystem components. Hive is particularly useful for batch processing, data summarization, and analysis tasks, making big data analytics accessible to a broader audience within the Apache Hadoop ecosystem.
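As a minimal sketch, enabling Hive support on the SparkSession lets you query Hive tables with spark.sql(); this assumes Hive is configured (e.g. hive-site.xml is available to Spark), and the database and table names below are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkHiveExample")
  .enableHiveSupport()   // connect to the Hive metastore
  .getOrCreate()

spark.sql("SHOW DATABASES").show()                      // list Hive databases
val hiveDF = spark.sql("SELECT * FROM my_db.my_table")  // read a Hive table into a DataFrame
hiveDF.show()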

Spark GraphX and GraphFrames

Spark GraphFrames is a DataFrame-based graph library introduced to support graphs on DataFrames. Prior to GraphFrames, Spark had the GraphX library, which runs on RDDs and therefore lacks DataFrame capabilities.

GraphFrames is a graph processing library for Apache Spark that provides high-level abstractions for working with graphs and performing graph analytics. It extends Spark’s DataFrame API to support graph operations, allowing users to express complex graph queries using familiar DataFrame operations.

Below is an example of how to create and use Spark GraphFrame.

// Import necessary libraries
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

// Create a Spark session
val spark = SparkSession.builder.appName("GraphFramesExample").getOrCreate()

// Define vertices and edges as DataFrames
val vertices = spark.createDataFrame(Seq(
  (1, "Scott", 30),
  (2, "David", 40),
  (3, "Mike", 45)
)).toDF("id", "name", "age")

val edges = spark.createDataFrame(Seq(
  (1, 2, "friend"),
  (2, 3, "follow")
)).toDF("src", "dst", "relationship")

// Create a GraphFrame
val graph = GraphFrame(vertices, edges)

// Display vertices and edges
graph.vertices.show()
graph.edges.show()

// Perform graph queries: edges leaving vertex 1 (Scott), joined with the destination vertex attributes
val friendEdges = graph.edges.filter("src = 1")
val scottConnections = friendEdges
  .join(graph.vertices, friendEdges("dst") === graph.vertices("id"))
  .select("dst", "name")
scottConnections.show()

// Graph analytics - in-degrees
val inDegrees = graph.inDegrees
inDegrees.show()

// Subgraph creation
val subgraph = graph.filterVertices("age >= 40").filterEdges("relationship = 'friend'")
subgraph.vertices.show()
subgraph.edges.show()

// Graph algorithms - PageRank
val pageRankResults = graph.pageRank.resetProbability(0.15).maxIter(10).run()
pageRankResults.vertices.show()
pageRankResults.edges.show()

// Stop the Spark session
spark.stop()

What are the key features and improvements released in Spark 3.5.0

References:

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn
