Complete PySpark Developer Course (Spark with Python)

via Udemy

This course may be unavailable.

Overview

Learn PySpark in depth with hundreds of practical examples. Become a complete PySpark developer. Set up a Hadoop cluster.

What you'll learn:
  • Complete Curriculum for a successful PySpark Developer
  • Hadoop Single Node Cluster Set up and Integrate with Spark 2.x and Spark 3.x
  • Complete Flow of Installation of PySpark (Windows and Unix)
  • Detailed HDFS Course
  • Python Crash Course
  • Introduction to Spark
  • Understand SparkSession
  • Spark RDD Fundamentals, Operations, Persistence. Practical Examples to solve problems.
  • Spark Cluster Architecture - Execution, YARN, JVM Processes, DAG Scheduler, Task Scheduler
  • Spark Shared Variables
  • Spark SQL Architecture, Catalyst Optimizer, Volcano Iterator Model, Tungsten Execution Engine
  • DataFrame Fundamentals
  • DataFrame Rows, Columns and DataTypes. Practical examples.
  • ETL Using DataFrame (Extraction APIs, Transformation APIs, and Loading APIs). Practical Examples.
  • Optimization and Management - Join Strategies, Driver Conf, Executor Conf, etc.

This is a complete PySpark Developer course for Data Engineers, Data Scientists, and others who want to process Big Data effectively. We will cover the topics below and more:

  • Complete Curriculum for a successful PySpark Developer

  • Set up Hadoop Single Node Cluster and Integrate it with Spark 2.x and Spark 3.x

  • Complete Flow of Installation of Standalone PySpark (Unix and Windows Operating System)

  • Detailed HDFS Commands and Architecture.

  • Python Crash Course

  • Introduction to Spark (Why Spark was Developed, Spark Features, Spark Components)

  • Understand SparkSession

  • Spark RDD Fundamentals

  • How to Create RDDs

  • RDD Operations (Transformations & Actions)

  • Spark Cluster Architecture - Execution, YARN, JVM Processes, DAG Scheduler, Task Scheduler

  • RDD Persistence

  • Spark Shared Variables - Broadcast

  • Spark Shared Variables - Accumulators

  • Spark SQL Architecture, Catalyst Optimizer, Volcano Iterator Model, Tungsten Execution Engine, Different Benchmarks

  • Difference between Catalyst Optimizer and Volcano Iterator Model

  • Commonly Used Spark Functions - version, range, createDataFrame, sql, table, sparkContext, conf, read, udf, newSession, stop, catalog, etc.

  • DataFrame Built-in Functions - new column functions, encryption functions, string functions, regexp functions, date functions, null functions, collection functions, na functions, math and statistics functions, explode functions, flatten functions, formatting and JSON functions

  • What is Partition

  • What is Repartition

  • What is Coalesce

  • Repartition Vs Coalesce

  • Extraction - CSV file, text file, Parquet file, ORC file, JSON file, Avro file, Hive, JDBC

  • DataFrame Fundamentals

  • What is a DataFrame

  • DataFrame Sources

  • DataFrame Features

  • DataFrame Organization

  • DataFrame Rows

  • DataFrame Columns

  • DataTypes. Practical examples.

  • Perform ETL Using DataFrame

    -- Extraction APIs

    -- Transformation APIs

    -- Loading APIs

    -- Practical Examples.

  • Optimization and Management - Join Strategies, Driver Conf, Parallelism Configurations, Executor Conf, etc.


Taught by

Learn-Spark.info (Spark University)

Reviews

4.3 rating at Udemy based on 645 ratings
