Course Outline
PySpark & Machine Learning
Module 1: Big Data & Spark Foundations
- Overview of the Big Data ecosystem and the role of Spark in modern data platforms
- Understanding Spark architecture: driver, executors, cluster manager, lazy evaluation, DAG and execution planning
- Differences between RDD and DataFrame APIs and when to use each approach
- Creating and configuring SparkSession and understanding application configuration fundamentals
Module 2: PySpark DataFrames
- Reading data from and writing data to enterprise sources and formats (CSV, JSON, Parquet, Delta)
- Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins and aggregations
- Implementing advanced operations such as window functions, handling timestamps and working with nested data
- Applying data quality checks and writing reusable, maintainable PySpark code
Module 3: Processing Large Datasets Efficiently
- Understanding performance fundamentals: partitioning strategies, shuffle behaviour, caching and persistence
- Using optimisation techniques including broadcast joins and execution plan analysis
- Efficient processing of large datasets and best practices for scalable data workflows
- Understanding schema evolution and modern storage formats used in enterprise environments
Module 4: Feature Engineering at Scale
- Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables and feature scaling
- Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
- Introduction to feature selection and handling imbalanced datasets
Module 5: Machine Learning with Spark MLlib
- Understanding MLlib architecture and the Estimator/Transformer pattern
- Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
- Comparing models and interpreting results in distributed Machine Learning workflows
Module 6: End-to-End ML Pipelines
- Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering and modelling
- Applying train/validation/test split strategies
- Performing cross-validation and hyperparameter tuning using grid search and random search
- Structuring reproducible Machine Learning experiments
Module 7: Model Evaluation & Practical ML Decision Making
- Applying appropriate evaluation metrics for regression and classification problems
- Identifying overfitting and underfitting and making practical model selection decisions
- Interpreting feature importance and understanding model behaviour
Module 8: Production & Enterprise Practices
- Persisting and loading models in Spark
- Implementing batch inference workflows on large datasets
- Understanding the Machine Learning lifecycle in enterprise environments
- Introduction to versioning, experiment tracking concepts and basic testing strategies
Practical Outcomes
Upon completion, participants will be able to:
- Work autonomously with PySpark
- Process large datasets efficiently
- Perform feature engineering at scale
- Build scalable Machine Learning pipelines
Requirements
Participants should have the following background:
- Basic Python programming knowledge, including working with functions, data structures and libraries
- Fundamental understanding of data analysis concepts such as datasets, transformations and aggregations
- Basic knowledge of SQL and relational data concepts
- Introductory understanding of Machine Learning concepts such as training datasets, features and evaluation metrics
- Familiarity with command line environments and basic software development practices is recommended
- Experience with Pandas, NumPy or similar data processing libraries is helpful but not mandatory
Reviews (1)
the practice exercises
Pawel Kozikowski - GE Medical Systems Polska Sp. z o.o.
Course - Python and Spark for Big Data (PySpark)
Machine Translated