Course Outline

  1. Scala Primer

    • A rapid introduction to Scala
    • Labs: Familiarizing yourself with Scala
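The primer covers Scala essentials of roughly this flavor — a minimal illustrative sketch, not actual course material:

```scala
// Core Scala idioms: immutable values, first-class functions,
// higher-order collection methods, and pattern matching.
object Primer {
  def main(args: Array[String]): Unit = {
    val greeting: String = "Hello, Scala"   // immutable value

    // Functions are first-class values.
    val double = (n: Int) => n * 2

    // Transform a collection with higher-order functions.
    val evens = (1 to 10).filter(_ % 2 == 0).map(double).toList

    // Pattern matching on a list.
    evens match {
      case head :: _ => println(s"$greeting — first even, doubled: $head")
      case Nil       => println("empty")
    }
  }
}
```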
  2. Spark Fundamentals

    • Historical background and context
    • The relationship between Spark and Hadoop
    • Core concepts and architectural design
    • The Spark ecosystem (core components, Spark SQL, MLlib, and Streaming)
    • Labs: Installing and launching Spark
  3. First Impressions of Spark

    • Executing Spark in local mode
    • Navigating the Spark Web UI
    • Utilizing the Spark shell
    • Dataset analysis – Part 1
    • Examining Resilient Distributed Datasets (RDDs)
    • Labs: Exploring the Spark shell
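A first spark-shell session looks roughly like this. The shell predefines a SparkContext as `sc`; the input file name here is a placeholder:

```scala
// Typed at the spark-shell prompt (`sc` is predefined by the shell).
val lines = sc.textFile("data.txt")   // an RDD[String], one element per line

lines.count()                         // action: total number of lines
lines.first()                         // action: the first line
lines.take(5).foreach(println)        // inspect a small sample
```

While these actions run, the Spark Web UI (by default at `localhost:4040`) shows the corresponding jobs and stages.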
  4. Resilient Distributed Datasets (RDDs)

    • Foundational RDD concepts
    • Understanding partitions
    • RDD operations and transformations
    • Various RDD types
    • Working with Key-Value pair RDDs
    • Implementing MapReduce patterns on RDDs
    • Strategies for caching and persistence
    • Labs: Creating and inspecting RDDs; Implementing RDD caching
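The RDD ideas above — partitions, lazy transformations, key-value pairs, MapReduce-style aggregation, and caching — can be sketched in a few lines, assuming a SparkContext named `sc` as in spark-shell:

```scala
// Create an RDD from a local collection, split into 4 partitions.
val nums = sc.parallelize(1 to 100, numSlices = 4)

// Transformations are lazy; nothing executes until an action runs.
val squares = nums.map(n => n * n)

// Key-value pair RDD plus a classic MapReduce-style reduction:
// key each square by its last digit, then count per key.
val counts = squares
  .map(n => (n % 10, 1))
  .reduceByKey(_ + _)

// Cache an RDD that will be reused across multiple actions.
squares.cache()

println(counts.collect().mkString(", "))   // action: triggers the computation
```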
  5. Programming with Spark APIs

    • Introduction to the Spark API and RDD API
    • Submitting your first Spark program
    • Techniques for debugging and logging
    • Managing configuration properties
    • Labs: Coding with the Spark API and submitting jobs
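A first standalone application of the kind submitted in the labs might look like this — the object name and JAR name are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FirstApp {
  def main(args: Array[String]): Unit = {
    // In an application (unlike spark-shell) you create the context yourself.
    val conf = new SparkConf().setAppName("FirstApp")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000)
    println(s"sum = ${data.sum()}")

    sc.stop()
  }
}
```

Packaged as a JAR, it would be launched with something like `spark-submit --class FirstApp --master local[2] first-app.jar`; configuration properties can also be passed there with `--conf key=value`.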
  6. Spark SQL

    • SQL capabilities within Spark
    • Understanding DataFrames
    • Defining tables and importing datasets
    • Executing SQL queries on DataFrames
    • Storage formats: JSON and Parquet
    • Labs: Creating and querying DataFrames; Assessing data formats
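The DataFrame workflow above — load data, register a view, query with SQL, write Parquet — in outline form, assuming Spark 2.x+ with a SparkSession and an illustrative `people.json` input file:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SqlDemo").getOrCreate()

// Load JSON; Spark infers the schema.
val people = spark.read.json("people.json")
people.printSchema()

// Register as a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()

// Persist the result in a columnar format.
adults.write.parquet("adults.parquet")
```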
  7. MLlib

    • Overview of MLlib
    • Available MLlib algorithms
    • Labs: Developing MLlib applications
  8. GraphX

    • Overview of the GraphX library
    • GraphX APIs
    • Labs: Processing graph data with Spark
  9. Spark Streaming

    • Streaming overview
    • Evaluating various Streaming platforms
    • Core Streaming operations
    • Implementing sliding window operations
    • Labs: Developing Spark Streaming applications
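A sliding-window word count in the classic DStream API gives the flavor of these operations. This sketch assumes a socket source on `localhost:9999` (e.g. fed by `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WindowDemo")
val ssc = new StreamingContext(conf, Seconds(1))   // 1-second batch interval

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))

// Count words over a 30-second window that slides every 10 seconds.
val windowedCounts = words
  .map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
windowedCounts.print()

ssc.start()
ssc.awaitTermination()
```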
  10. Spark and Hadoop Integration

    • Introduction to Hadoop (HDFS and YARN)
    • Architecture of Hadoop combined with Spark
    • Running Spark on Hadoop YARN
    • Processing HDFS files using Spark
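Reading HDFS data from Spark is a one-line change from reading local files — only the URI scheme differs. The namenode host, port, and path below are placeholders; when running under YARN, the master is typically set by `spark-submit --master yarn` rather than in code:

```scala
// RDD backed by an HDFS file; one partition per HDFS block by default.
val logs = sc.textFile("hdfs://namenode:8020/logs/access.log")

val errors = logs.filter(_.contains("ERROR"))
println(s"error lines: ${errors.count()}")
```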
  11. Spark Performance and Tuning

    • Utilizing Broadcast variables
    • Understanding Accumulators
    • Memory management and caching strategies
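Broadcast variables and accumulators address opposite directions of data flow: a broadcast ships read-only data to executors once instead of with every task, while an accumulator lets tasks add to a value that only the driver reads. A minimal sketch, assuming a SparkContext `sc` and an illustrative lookup table:

```scala
// Broadcast: an efficiently shared, read-only lookup table.
val countryNames = sc.broadcast(Map("us" -> "United States", "fr" -> "France"))

// Accumulator: tasks add to it; the driver reads the total afterwards.
val unknown = sc.longAccumulator("unknown country codes")

val codes = sc.parallelize(Seq("us", "fr", "de", "us"))
val named = codes.map { code =>
  countryNames.value.getOrElse(code, {
    unknown.add(1)
    "unknown"
  })
}

named.collect()                                   // action: runs the tasks
println(s"unknown codes seen: ${unknown.value}")  // read on the driver only
```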
  12. Spark Operations

    • Deploying Spark in production environments
    • Review of sample deployment templates
    • Essential configurations
    • Monitoring best practices
    • Troubleshooting common issues

Requirements

  • Proficiency in at least one of the following languages: Java, Scala, or Python (lab exercises are conducted in Scala and Python)
  • A fundamental understanding of the Linux development environment, including command-line navigation and file editing with vi or nano

Duration: 21 hours
