Course Outline
Introduction
- History and core concepts of Hadoop.
- The Hadoop Ecosystem.
- Overview of Distributions.
- High-level architecture.
- Common Hadoop myths.
- Challenges in Hadoop (hardware and software).
- Labs: Discussion of your own Big Data projects and challenges.
Planning and Installation
- Selecting software and Hadoop distributions.
- Cluster sizing and planning for future growth.
- Hardware and network selection.
- Rack topology considerations.
- Installation procedures.
- Multi-tenancy management.
- Directory structures and log management.
- Benchmarking performance.
- Labs: Installing the cluster and running performance benchmarks.
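A first benchmark on a new cluster is commonly run with the TeraSort suite that ships in the Hadoop examples jar. A minimal sketch, assuming a stock Apache Hadoop install (the jar's exact name and path vary by version and distribution):

```bash
# Generate ~1 GB of synthetic input (10,000,000 rows of 100 bytes each).
hadoop jar hadoop-mapreduce-examples.jar teragen 10000000 /benchmarks/teragen

# Sort it; the elapsed time is the headline figure to compare across runs.
hadoop jar hadoop-mapreduce-examples.jar terasort /benchmarks/teragen /benchmarks/terasort

# Verify the output is globally sorted.
hadoop jar hadoop-mapreduce-examples.jar teravalidate /benchmarks/terasort /benchmarks/teravalidate
```

Re-running the same job after configuration changes gives a like-for-like comparison of cluster tuning.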
HDFS Operations
- Core concepts (horizontal scaling, replication, data locality, rack awareness).
- Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode).
- Health monitoring techniques.
- Administration via command-line and browser interfaces.
- Expanding storage and replacing defective drives.
- Labs: Familiarizing with HDFS command-line tools.
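The lab works with commands along these lines; a minimal sketch (the paths are illustrative):

```bash
# Create a directory in HDFS and load a local file into it.
hdfs dfs -mkdir -p /user/student
hdfs dfs -put access.log /user/student/

# List files and check block placement and replication health.
hdfs dfs -ls /user/student
hdfs fsck /user/student/access.log -files -blocks

# Cluster-wide capacity and DataNode status (administrator view).
hdfs dfsadmin -report
```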
Data Ingestion
- Using Flume to ingest logs and other data into HDFS.
- Using Sqoop to import data from SQL databases into HDFS and export results back to SQL.
- Implementing Hadoop data warehousing with Hive.
- Transferring data between clusters using distcp.
- Leveraging S3 as a complement to HDFS.
- Best practices and architectures for data ingestion.
- Labs: Setting up and utilizing Flume and Sqoop.
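To give a flavor of the lab, here is a hedged sketch of a Sqoop import/export pair and a distcp copy (the JDBC URL, table names, and cluster addresses are hypothetical):

```bash
# Import a SQL table into HDFS over JDBC; -P prompts for the password.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --target-dir /ingest/orders

# Export processed results from HDFS back into the database.
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table order_summaries \
  --export-dir /warehouse/order_summaries

# Copy a dataset between two clusters (NameNode addresses are hypothetical).
hadoop distcp hdfs://nn1:8020/ingest/orders hdfs://nn2:8020/backup/orders
```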
MapReduce Operations and Administration
- Parallel computing precedents: Comparing HPC with Hadoop administration.
- Managing MapReduce cluster loads.
- Nodes and daemons (JobTracker, TaskTracker).
- Walkthrough of the MapReduce User Interface.
- MapReduce configuration settings.
- Job configuration details.
- Strategies for optimizing MapReduce performance.
- Ensuring robust MapReduce operations: Guidance for programmers.
- Labs: Executing MapReduce examples.
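The examples jar bundled with Hadoop supplies ready-made jobs for this lab; a minimal sketch (input and output paths are illustrative):

```bash
# Classic WordCount over data already loaded into HDFS.
hadoop jar hadoop-mapreduce-examples.jar wordcount /user/student/input /user/student/wc-out

# A purely CPU-bound job: estimate pi with 16 maps of 100,000 samples each.
hadoop jar hadoop-mapreduce-examples.jar pi 16 100000

# List running jobs (MRv1-era command; the YARN equivalents follow in the next module).
hadoop job -list
```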
YARN: New Architecture and Capabilities
- YARN design goals and implementation architecture.
- New components: ResourceManager, NodeManager, ApplicationMaster.
- Installing YARN.
- Job scheduling within YARN.
- Labs: Investigating job scheduling mechanisms.
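A few commands useful for observing scheduling behavior during the lab; a minimal sketch:

```bash
# NodeManagers registered with the ResourceManager.
yarn node -list

# Applications currently known to YARN, with their queue and state.
yarn application -list

# After editing capacity-scheduler.xml, reload queue definitions without a restart.
yarn rmadmin -refreshQueues
```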
Advanced Topics
- Hardware monitoring strategies.
- Cluster-wide monitoring.
- Adding and removing servers, and upgrading Hadoop.
- Backup, recovery, and business continuity planning.
- Managing Oozie job workflows.
- Achieving Hadoop High Availability (HA).
- Implementing Hadoop Federation.
- Securing the cluster with Kerberos.
- Labs: Setting up monitoring systems.
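Alongside dedicated monitoring tools, the Hadoop daemons expose their metrics over HTTP through a built-in JMX servlet; a minimal sketch (the hostname is hypothetical, and 50070 is the Hadoop 2.x NameNode web UI default):

```bash
# Quick health snapshot from the command line.
hdfs dfsadmin -report | head -n 20

# Pull NameNode metrics as JSON for scripted monitoring.
curl -s 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'
```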
Optional Tracks
- Cloudera Manager: For cluster administration, monitoring, and routine tasks, including installation and usage. In this track, all exercises and labs are conducted within the Cloudera distribution environment (CDH5).
- Ambari: For cluster administration, monitoring, and routine tasks, including installation and usage. In this track, all exercises and labs are conducted within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0).
Requirements
- Proficiency in basic Linux system administration.
- Foundational scripting skills.
While prior knowledge of Hadoop and distributed computing is not mandatory, these topics will be introduced and explained throughout the course.
Lab environment
Zero Installation Required: Students do not need to install any Hadoop software on their own machines; a fully functional Hadoop cluster is provided.
Participants will need to have the following tools installed:
- An SSH client (Linux and Mac systems come with SSH clients by default; for Windows, PuTTY is recommended).
- A web browser to access the cluster. We recommend using Firefox with the FoxyProxy extension installed.
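FoxyProxy is typically paired with an SSH SOCKS tunnel so the browser can reach web UIs inside the cluster's network; a minimal sketch (the gateway host and local port are hypothetical):

```bash
# Open a dynamic (SOCKS) tunnel; point FoxyProxy at localhost:8157.
ssh -N -D 8157 student@cluster-gateway.example.com
```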
21 Hours
Testimonials (1)
Hands-on exercises. The class should have been 5 days, but the 3 days helped clear up a lot of questions I already had from working with NiFi.