As you work through the workshop, you will learn what distributed computing is, how the MapReduce and Spark approaches differ, and the basics of Spark architecture. You will be able to start a Spark job on a standalone cluster and work with the basic Spark APIs: RDDs and Datasets/DataFrames. The workshop focuses only on the Spark SQL module.
NOTE: This workshop was initially created for DevFest 2017 in Prague.
As the first step, you have to set up your Spark environment so that everything works. This includes installing Docker and a description of how to run a Docker container with Apache Spark ready to use.
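With Docker installed, getting started typically boils down to pulling an image and opening a shell inside a container. The image name below is a placeholder, not the workshop's real image; the setup chapter gives the exact commands.

```bash
# Placeholder image name -- use the one from the setup chapter.
docker pull spark-workshop
# Port 4040 exposes the Spark UI of a running application.
docker run -it -p 4040:4040 --name spark-workshop spark-workshop bash
```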
Let's find out what distributed computing means and when to actually choose this approach.
Why isn't the MapReduce approach good enough, and how does Spark differ? You can read about it here.
To understand how to use Spark, it helps to know the basics of Spark architecture.
Get to know Spark and the Spark REPL, and run your first job.
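As a rough sketch of what that first job can look like: in spark-shell a SparkSession (`spark`) and SparkContext (`sc`) are created for you, so a job is just a few lines.

```scala
// Inside spark-shell: `spark` (SparkSession) and `sc` (SparkContext) already exist.
val numbers = sc.parallelize(1 to 1000)           // distribute a local collection
val multiplesOfSeven = numbers.filter(_ % 7 == 0) // lazy transformation
println(multiplesOfSeven.count())                 // the action triggers the computation
```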
You will write your first Spark application. Word count is the "hello world" of distributed computing.
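A minimal sketch of such an application is shown below; the application name and input path are placeholders, not the workshop's actual values.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count").getOrCreate()

    // Hypothetical input path -- replace with the workshop's data file.
    val counts = spark.sparkContext
      .textFile("data/input.txt")
      .flatMap(line => line.split("\\s+"))  // split lines into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))               // pair each word with a count of 1
      .reduceByKey(_ + _)                   // sum the counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```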
You will analyze real data with the help of the RDD and Dataset APIs.
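To give a flavor of the Dataset/DataFrame side, here is a small sketch you can paste into spark-shell; the CSV path and the column names (city, age) are assumptions for illustration, not the workshop's real dataset.

```scala
import org.apache.spark.sql.functions._

// Hypothetical CSV with columns such as name, city, age.
val people = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/people.csv")

// Aggregate per city and sort by the number of rows.
people
  .groupBy(col("city"))
  .agg(count("*").as("people"), avg("age").as("avg_age"))
  .orderBy(desc("people"))
  .show()
```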
You can submit and run all of the Spark jobs on a Spark standalone cluster in cluster deploy mode.
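A sketch of what the submission can look like; the master URL, class name, and jar path below are placeholders for the values used in the workshop.

```bash
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode cluster \
  --class WordCount \
  path/to/spark-workshop.jar
```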
Recommended further reading: Spark: The Definitive Guide