Course Outline

Scala Primer
- A rapid introduction to Scala
- Labs: Familiarizing yourself with Scala

Spark Fundamentals
- Historical background and context
- The relationship between Spark and Hadoop
- Core concepts and architectural design
- The Spark ecosystem (core components, Spark SQL, MLlib, and Streaming)
- Labs: Installing and launching Spark
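The install-and-launch lab above can be sketched roughly as follows; the install path, version, and thread count are placeholders, not values prescribed by the course:

```shell
# Assumes a Spark binary distribution has already been downloaded and unpacked
# (path is illustrative).
export SPARK_HOME=/opt/spark

# Launch an interactive Scala shell in local mode with 2 worker threads.
"$SPARK_HOME/bin/spark-shell" --master "local[2]"
```

Once the shell starts, the Spark Web UI is typically reachable at http://localhost:4040.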

First Impressions of Spark
- Executing Spark in local mode
- Navigating the Spark Web UI
- Utilizing the Spark shell
- Dataset analysis – Part 1
- Examining Resilient Distributed Datasets (RDDs)
- Labs: Exploring the Spark shell
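A first shell session along the lines of this module might look like the sketch below. It assumes a running `spark-shell`, where `sc` (the SparkContext) is predefined; the file path is an example:

```scala
// Create an RDD from a local text file (path is illustrative).
val lines = sc.textFile("README.md")

// Actions trigger computation on the RDD.
lines.count()                              // number of lines in the file
lines.filter(_.contains("Spark")).take(5)  // first 5 lines mentioning "Spark"
```

Running these in the shell is also a convenient way to watch jobs and stages appear in the Spark Web UI.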

Resilient Distributed Datasets (RDDs)
- Foundational RDD concepts
- Understanding partitions
- RDD operations and transformations
- Various RDD types
- Working with Key-Value pair RDDs
- Implementing MapReduce patterns on RDDs
- Strategies for caching and persistence
- Labs: Creating and inspecting RDDs; Implementing RDD caching
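The key-value, MapReduce, and caching topics above come together in the classic word count, sketched here for a `spark-shell` session (`sc` predefined; input path is an example):

```scala
// MapReduce pattern on RDDs: map to (key, value) pairs, then reduce by key.
val counts = sc.textFile("input.txt")       // example path
  .flatMap(_.split("\\s+"))                 // split lines into words
  .map(word => (word, 1))                   // pair RDD: (word, 1)
  .reduceByKey(_ + _)                       // sum counts per word

counts.cache()                              // persist for reuse across actions
counts.take(10)                             // first action materializes and caches
counts.count()                              // second action reads from the cache
```

Calling `cache()` (shorthand for `persist()` at the default storage level) pays off whenever more than one action runs over the same RDD.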

Programming with Spark APIs
- Introduction to the Spark API and RDD API
- Submitting your first Spark program
- Techniques for debugging and logging
- Managing configuration properties
- Labs: Coding with the Spark API and submitting jobs
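A first submitted program typically has the shape below; the object name and computation are illustrative, not the course's exact lab:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Skeleton of a self-contained Spark application.
object FirstApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FirstApp")
    val sc   = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000)
    println(s"sum = ${data.sum()}")

    sc.stop()
  }
}
```

After packaging it into a JAR, it would be launched with something like `spark-submit --class FirstApp --master "local[2]" first-app.jar`.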

Spark SQL
- SQL capabilities within Spark
- Understanding DataFrames
- Defining tables and importing datasets
- Executing SQL queries on DataFrames
- Storage formats: JSON and Parquet
- Labs: Creating and querying DataFrames; Assessing data formats
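The DataFrame, SQL, and storage-format topics can be sketched in a few lines. This assumes a Spark 2.x-style shell where `spark` (a SparkSession) is predefined, and an example JSON file:

```scala
// Load JSON into a DataFrame; the schema is inferred.
val people = spark.read.json("people.json")   // example file
people.printSchema()

// Register the DataFrame as a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

// Write the data back out in the columnar Parquet format.
people.write.parquet("people.parquet")        // example output path
```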

MLlib
- Overview of MLlib
- Available MLlib algorithms
- Labs: Developing MLlib applications
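As one illustration of the MLlib surface (using the DataFrame-based `spark.ml` API; the toy data and parameters are made up for the sketch, and `spark` is a SparkSession):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Tiny labeled training set (values are arbitrary).
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0))
)).toDF("label", "features")

// Fit a logistic regression model.
val lr    = new LogisticRegression().setMaxIter(10)
val model = lr.fit(training)
```

Other estimators in the library (clustering, regression, recommendation, and so on) follow the same fit/transform pattern.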

GraphX
- Overview of the GraphX library
- GraphX APIs
- Labs: Processing graph data with Spark
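A minimal GraphX sketch, for a `spark-shell` session (`sc` predefined; the vertices, edges, and tolerance value are made up):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Build a tiny property graph from vertex and edge RDDs.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph    = Graph(vertices, edges)

println(graph.numEdges)                    // basic structural query
val ranks = graph.pageRank(0.001).vertices // built-in PageRank algorithm
```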

Spark Streaming
- Streaming overview
- Evaluating various Streaming platforms
- Core Streaming operations
- Implementing sliding window operations
- Labs: Developing Spark Streaming applications
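The core operations and sliding windows above can be sketched with the classic DStream API; the socket source, batch interval, and window sizes are illustrative (`sc` is an existing SparkContext):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 5-second batch interval.
val ssc   = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)   // example source

// Word counts over a 30-second window, recomputed every 10 seconds.
val counts = lines.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

counts.print()
ssc.start()
ssc.awaitTermination()
```

Note that the window and slide durations must both be multiples of the batch interval.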

Spark and Hadoop Integration
- Introduction to Hadoop (HDFS and YARN)
- Architecture of Hadoop combined with Spark
- Running Spark on Hadoop YARN
- Processing HDFS files using Spark
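Reading from HDFS uses the same RDD API as local files; only the URI scheme changes. The NameNode host, port, and path below are placeholders:

```scala
// Same textFile() call as for local files, but with an HDFS URI.
val logs = sc.textFile("hdfs://namenode:8020/data/logs/*.log")
logs.filter(_.contains("ERROR")).count()   // count error lines across the cluster
```

To run against YARN instead of local mode, the same application would be launched with `spark-submit --master yarn`.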

Spark Performance and Tuning
- Utilizing Broadcast variables
- Understanding Accumulators
- Memory management and caching strategies
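Broadcast variables and accumulators can be sketched together in a few lines (a `spark-shell` session with `sc` predefined; the lookup table and data are made up):

```scala
// Broadcast variable: ship a read-only lookup table to every executor once,
// instead of once per task.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val mapped = sc.parallelize(Seq("a", "b", "a"))
  .map(k => lookup.value.getOrElse(k, 0))

// Accumulator: a counter that tasks write to and the driver reads back.
val badRecords = sc.longAccumulator("badRecords")
sc.parallelize(Seq("1", "x", "3")).foreach { s =>
  if (!s.forall(_.isDigit)) badRecords.add(1)
}
println(badRecords.value)   // 1 ("x" is the only non-numeric record)
```

For reliable results, accumulators should be updated inside actions (such as `foreach`) rather than transformations, which may be re-executed.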

Spark Operations
- Deploying Spark in production environments
- Review of sample deployment templates
- Essential configurations
- Monitoring best practices
- Troubleshooting common issues

Requirements

Prerequisites:
- Proficiency in at least one of the following languages: Java, Scala, or Python (lab exercises are conducted in Scala and Python).
- A fundamental understanding of the Linux development environment, including command-line navigation and file editing with vi or nano.

Testimonials (6)
Doing similar exercises in different ways really helps in understanding what each component (Hadoop/Spark, standalone/cluster) can do on its own and together. It gave me ideas on how I should test my application on my local machine during development versus when it is deployed on a cluster.
Thomas Carcaud - IT Frankfurt GmbH
Course - Spark for Developers
Ajay was very friendly, helpful and also knowledgeable about the topic he was discussing.
Biniam Guulay - ICE International Copyright Enterprise Germany GmbH
Course - Spark for Developers
Ernesto did a great job explaining the high level concepts of using Spark and its various modules.
Michael Nemerouf
Course - Spark for Developers
The trainer made the class interesting and entertaining, which helps quite a bit with all-day training.
Ryan Speelman
Course - Spark for Developers
We know a lot more about the whole environment.
John Kidd
Course - Spark for Developers
Richard is very calm and methodical, with an analytic insight - exactly the qualities needed to present this sort of course.