The Ultimate Hands-On Hadoop: Tame your Big Data!

via Skillshare

This course may be unavailable.

Overview

Learn and master the most popular big data technologies in this comprehensive course, taught by a former engineer and senior manager from Amazon and IMDb. We'll go well beyond Hadoop itself and dive into the many distributed systems you may need to integrate with.

  • Install and work with a real Hadoop installation right on your desktop with Hortonworks and the Ambari UI
  • Manage big data on a cluster with HDFS and MapReduce
  • Write programs to analyze data on Hadoop with Pig and Spark
  • Store and query your data with Sqoop, Hive, MySQL, HBase, Cassandra, MongoDB, Drill, Phoenix, and Presto
  • Design real-world systems using the Hadoop ecosystem
  • Learn how your cluster is managed with YARN, Mesos, ZooKeeper, Oozie, Zeppelin, and Hue
  • Handle streaming data in real time with Kafka, Flume, Spark Streaming, Flink, and Storm

Syllabus

  • Introduction
  • Install Hadoop on your Desktop
  • Hadoop Overview and History
  • Overview of the Hadoop Ecosystem
  • HDFS: What it is, and how it works
  • [Activity] Install the MovieLens dataset into HDFS using the Ambari UI
  • [Activity] Install the MovieLens dataset into HDFS using the command line
  • MapReduce: What it is, and how it works
  • How MapReduce distributes processing
  • MapReduce example: Break down movie ratings by rating score
  • [Activity] Installing Python, MRJob, and nano
  • [Activity] Code up the ratings histogram MapReduce job and run it
  • [Exercise] Rank movies by their popularity
  • [Activity] Check your results against mine!
  • Introducing Ambari
  • Introducing Pig
  • Example: Find the oldest movie with a 5-star rating using Pig
  • [Activity] Find old 5-star movies with Pig
  • More Pig Latin
  • [Exercise] Find the most-rated one-star movie
  • Pig Challenge: Compare Your Results to Mine!
  • Why Spark?
  • The Resilient Distributed Dataset (RDD)
  • [Activity] Find the movie with the lowest average rating - with RDDs
  • Datasets and Spark 2.0
  • [Activity] Find the movie with the lowest average rating - with DataFrames
  • [Activity] Movie recommendations with MLlib
  • [Exercise] Filter the lowest-rated movies by number of ratings
  • [Activity] Check your results against mine!
  • What is Hive?
  • [Activity] Use Hive to find the most popular movie
  • How Hive works
  • [Exercise] Use Hive to find the movie with the highest average rating
  • Compare your solution to mine.
  • Integrating MySQL with Hadoop
  • [Activity] Install MySQL and import our movie data
  • [Activity] Use Sqoop to import data from MySQL to HDFS/Hive
  • [Activity] Use Sqoop to export data from Hadoop to MySQL
  • Why NoSQL?
  • What is HBase?
  • [Activity] Import movie ratings into HBase
  • [Activity] Use HBase with Pig to import data at scale.
  • Cassandra overview
  • [Activity] Installing Cassandra
  • [Activity] Write Spark output into Cassandra
  • MongoDB Overview
  • [Activity] Install MongoDB, and integrate Spark with MongoDB
  • [Activity] Using the MongoDB shell
  • Choosing a database technology
  • [Exercise] Choose a database for a given problem
  • Overview of Drill
  • [Activity] Setting Up Drill
  • [Activity] Querying across multiple databases with Drill
  • Overview of Phoenix
  • [Activity] Install Phoenix and query HBase with it
  • [Activity] Integrate Phoenix with Pig
  • Overview of Presto
  • [Activity] Install Presto, and query Hive with it.
  • [Activity] Query both Cassandra and Hive using Presto.
  • YARN explained
  • Tez explained
  • [Activity] Use Hive on Tez and measure the performance benefit
  • Mesos explained
  • ZooKeeper explained
  • [Activity] Simulating a failing master with ZooKeeper
  • Oozie explained
  • [Activity] Set up a simple Oozie workflow
  • Zeppelin overview
  • [Activity] Use Zeppelin to analyze movie ratings, part 1
  • [Activity] Use Zeppelin to analyze movie ratings, part 2
  • Hue overview
  • Other technologies worth mentioning
  • Kafka explained
  • [Activity] Setting up Kafka, and publishing some data.
  • [Activity] Publishing web logs with Kafka
  • Flume explained
  • [Activity] Set Up Flume and publish logs with Spark
  • [Activity] Set up Flume to monitor a directory and store its data in HDFS
  • Spark Streaming: Introduction
  • [Activity] Analyze web logs published with Flume using Spark Streaming
  • [Exercise] Monitor Flume-published logs for errors in real time
  • Exercise solution: Aggregating HTTP access codes with Spark Streaming
  • Apache Storm: Introduction
  • [Activity] Count words with Storm
  • Flink: An Overview
  • [Activity] Counting words with Flink
  • The Best of the Rest
  • Review: How the pieces fit together
  • Understanding your requirements
  • Sample application: consume webserver logs and keep track of top-sellers
  • Sample application: serving movie recommendations to a website
  • [Exercise] Design a system to report web sessions per day
  • Exercise solution: Design a system to count daily sessions

Taught by

Frank Kane
