Collections and their Standard Methods (e.g. map())
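As a quick illustration of the standard methods named here, a minimal plain-Scala sketch (the values are made up):

```scala
// Standard collection methods on an immutable List
val nums = List(1, 2, 3, 4)

val doubled = nums.map(_ * 2)          // transform each element
val evens   = nums.filter(_ % 2 == 0)  // keep only matching elements
val total   = nums.reduce(_ + _)       // combine all elements into one value

println(doubled)  // List(2, 4, 6, 8)
println(evens)    // List(2, 4)
println(total)    // 10
```

The same method names (`map`, `filter`, `reduce`) reappear later on Spark RDDs, which is why they are covered first.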
Working with:
o Functions
o Methods
o Function Literals
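The distinction among these three can be sketched as follows (the names `addMethod`, `addFn`, and `lifted` are illustrative):

```scala
// A method: defined with `def`, belongs to an enclosing object or class
def addMethod(a: Int, b: Int): Int = a + b

// A function literal (anonymous function) assigned to a function value
val addFn: (Int, Int) => Int = (a, b) => a + b

// A method lifted to a function value (eta-expansion)
val lifted: (Int, Int) => Int = addMethod

println(addMethod(2, 3)) // 5
println(addFn(2, 3))     // 5
println(lifted(2, 3))    // 5
```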
Define the Following as They Relate to Scala:
o Class
o Object
o Case Class
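A compact sketch of all three constructs (names such as `Counter`, `MathUtil`, and `Point` are illustrative):

```scala
// A class: a blueprint for objects, instantiated with `new`
class Counter(start: Int) {
  private var value = start
  def increment(): Int = { value += 1; value }
}

// An object: a singleton instance, commonly used for utilities and companions
object MathUtil {
  def square(n: Int): Int = n * n
}

// A case class: immutable data with equals, hashCode, copy,
// and pattern-matching support generated automatically
case class Point(x: Int, y: Int)

val c = new Counter(0)
println(c.increment())        // 1
println(MathUtil.square(4))   // 16
println(Point(1, 2).copy(y = 5)) // Point(1,5)
```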
Overview, Motivations, Spark Systems
Spark Ecosystem
Spark vs. Hadoop
Acquiring and Installing Spark
The Spark Shell, SparkContext
Day 1 Demonstrations
Setting Up the Lab Environment
Starting the Scala Interpreter
A First Look at Spark
A First Look at the Spark Shell
Day 2 Objective
RDD Concepts, Lifecycle, Lazy Evaluation
RDD Partitioning and Transformations
Working with RDDs, Including:
o Creating and Transforming (map, filter, etc.)
An Overview of RDDs
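A minimal sketch of the RDD lifecycle, assuming a local SparkSession (the app name and sample data are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rdd-demo").getOrCreate()
val sc = spark.sparkContext

val nums    = sc.parallelize(1 to 10)   // create an RDD from a local collection
val evens   = nums.filter(_ % 2 == 0)   // lazy transformation: nothing runs yet
val squares = evens.map(n => n * n)     // another lazy transformation

println(squares.collect().toList)       // action: triggers evaluation
// List(4, 16, 36, 64, 100)
spark.stop()
```

Note that `filter` and `map` build up a lineage without executing anything; only the `collect()` action forces evaluation.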
SparkSession, Loading/Saving Data, Data Formats (JSON, CSV, Parquet, text …)
Introducing DataFrames and DataSets (Creation and Schema Inference)
Identify Supported Data Formats, Including:
o JSON
o Text
o CSV
o Parquet
Working with the DataFrame (untyped) Query DSL, Including:
o Column
o Filtering
o Grouping
o Aggregation
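A sketch of how these DSL pieces fit together, assuming a local session (the column names and data are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("df-demo").getOrCreate()
import spark.implicits._

val sales = Seq(("east", 100.0), ("west", 250.0), ("east", 75.0)).toDF("region", "amount")

val summary = sales
  .filter($"amount" > 50)            // Column-based filtering
  .groupBy($"region")                // grouping
  .agg(sum($"amount").as("total"))   // aggregation

summary.show()
spark.stop()
```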
SQL-based Queries
Working with the DataSet (typed) API
Mapping and Splitting (flatMap(), explode(), and split())
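One common pattern that ties these functions together is splitting lines into words, sketched here with illustrative data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

val spark = SparkSession.builder().master("local[*]").appName("split-demo").getOrCreate()
import spark.implicits._

val lines = Seq("spark makes big data simple", "so does scala").toDF("line")

// split() turns each line into an array of words;
// explode() then produces one row per array element
val words = lines.select(explode(split($"line", " ")).as("word"))

words.groupBy("word").count().show()
spark.stop()
```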
DataSets vs. DataFrames vs. RDDs
Day 2 Demonstrations
RDD Basics
Operations on Multiple RDDs
Data Formats
Spark SQL Basics
DataFrame Transformations
The DataSet Typed API
Splitting Up Data
Day 3 Objective
Working with:
o Grouping
o Reducing
o Joining
Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
Exploring the Catalyst Query Optimizer (explain(), Query Plans, Issues with lambdas)
The Tungsten Optimizer (Binary Format, Cache Awareness, Whole-Stage Code Gen)
Discuss Caching, Including:
o Concepts
o Storage Type
o Guidelines
Minimizing Shuffling for Increased Performance
Using Broadcast Variables and Accumulators
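A sketch of both shared-variable mechanisms, using an illustrative lookup table and a local session:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("shared-vars").getOrCreate()
val sc = spark.sparkContext

// Broadcast variable: read-only data shipped once per executor, not per task
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))

// Accumulator: a counter that tasks add to and the driver reads back
val misses = sc.longAccumulator("misses")

val named = sc.parallelize(Seq(1, 2, 3)).map { n =>
  if (!lookup.value.contains(n)) misses.add(1)
  lookup.value.getOrElse(n, "unknown")
}

println(named.collect().toList) // List(one, two, unknown)
println(misses.value)           // 1
spark.stop()
```

Accumulator values should only be trusted after an action has run, since transformations are lazy and may be re-executed.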
General Performance Guidelines
o Using the Spark UI
o Efficient Transformations
o Data Storage
o Monitoring
Day 3 Demonstrations
Exploring Group Shuffling
Seeing Catalyst at Work
Seeing Tungsten at Work
Working with Caching, Joins, Shuffles, Broadcasts, Accumulators
Broadcast General Guidelines
Day 4 Objective
Core API, SparkSession.Builder
Configuring and Creating a SparkSession
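Configuring and creating a session can be sketched as follows (the app name and config setting shown are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("my-app")                           // shown in the Spark UI
  .master("local[*]")                          // usually omitted when set by spark-submit
  .config("spark.sql.shuffle.partitions", "8") // example tuning setting
  .getOrCreate()
```

`getOrCreate()` returns an existing session if one is already active, which makes this safe to call from shared application code.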
Building and Running Applications – sbt/build.sbt and spark-submit
Application Lifecycle (Driver, Executors, and Tasks)
Cluster Managers (Standalone, YARN, Mesos)
Logging and Debugging
Introduction and Streaming Basics
Spark Streaming (Spark 1.0+)
o DStreams, Receivers, Batching
o Stateless Transformation
o Windowed Transformation
o Stateful Transformation
Structured Streaming (Spark 2+)
o Continuous Applications
o Table Paradigm, Result Table
o Steps for Structured Streaming
o Sources and Sinks
Consuming Kafka Data
o Kafka Overview
o Structured Streaming – “kafka” Format
o Processing the Stream
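A sketch of consuming a Kafka topic with Structured Streaming; the broker address and topic name here are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-stream").getOrCreate()

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribe", "events")                       // assumed topic name
  .load()

// Kafka keys and values arrive as binary; cast to String before processing
val values = raw.selectExpr("CAST(value AS STRING) AS value")

val query = values.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
```

Running this requires the spark-sql-kafka connector package on the classpath, typically supplied via `spark-submit --packages`.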
Day 4 Demonstrations
Spark Job Submission
Additional Spark Capabilities
Spark Streaming
Spark Structured Streaming
Spark Structured Streaming with Kafka
Course Prerequisites
Students should be familiar with programming principles and have prior experience developing software in Scala. Previous experience with data streaming, SQL, and Hadoop is helpful but not required.