
PySpark Best Practices

Open Data Science via YouTube

Overview

Learn best practices for using PySpark in real-world applications through this conference talk from ODSC West 2015. Discover how to manage dependencies on a cluster, avoid common pitfalls of Python's duck typing, and understand Spark's computational model for effective distributed code execution. Explore techniques for package management with virtualenv, testing PySpark applications, and structuring code for optimal performance. Gain insights into handling complex dependencies, implementing proper logging, and navigating multiple Python environments. Follow along with a practical example of a statistical analysis on time series data to reinforce key concepts and improve your PySpark development skills.
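To make the "simple main method" and "write serializable code" advice concrete, here is a minimal sketch of one way a PySpark job might be structured so its logic is testable and its functions pickle cleanly when shipped to executors. The module, function, and column names are hypothetical, and the sketch uses the modern SparkSession entry point rather than the SparkContext/SQLContext idioms current at the time of the talk.

```python
# analysis.py -- hypothetical module name; a sketch, not the talk's code.
import sys

from pyspark.sql import SparkSession


def parse_record(line):
    """Top-level function: pickles cleanly when shipped to executors."""
    timestamp, value = line.split(",")
    return timestamp, float(value)


def mean_value(rdd):
    """Pure function over an RDD, easy to unit test with a local context."""
    return rdd.map(lambda pair: pair[1]).mean()


def main(args):
    spark = SparkSession.builder.appName("timeseries-stats").getOrCreate()
    lines = spark.sparkContext.textFile(args[0])
    records = lines.map(parse_record)
    print("mean value:", mean_value(records))
    spark.stop()


if __name__ == "__main__":
    main(sys.argv[1:])
```

A job shaped this way can be submitted with `spark-submit analysis.py data.csv`, with any extra modules shipped to the cluster via `--py-files`, which is one common answer to the dependency questions the talk raises.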

Syllabus

Cloudera
Spark Execution Model
PySpark Driver Program
How do we ship around Python functions?
Pickle!
DataFrame is just another word for...
Use DataFrames (see the DataFrame sketch after this syllabus)
REPLs and Notebooks
Share your code
Standard Python Project
What is the shape of a PySpark job?
PySpark Structure?
Simple Main Method
Write Testable Code
Write Serializable Code
Testing with SparkTestingBase (see the testing sketch after this syllabus)
Testing Suggestions
Writing distributed code is the easy part...
Get Serious About Logs
Know your environment
Complex Dependencies
Many Python Environments
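The "Use DataFrames" point above comes down to letting Spark's optimizer and columnar execution do the work instead of per-record Python functions over RDDs. A minimal sketch, assuming a hypothetical time series of (sensor, hour, reading) rows:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

# Hypothetical time series rows: (sensor id, hour, reading).
rows = [
    ("sensor-a", 0, 1.2),
    ("sensor-a", 1, 1.5),
    ("sensor-b", 0, 0.7),
    ("sensor-b", 1, 0.9),
]
df = spark.createDataFrame(rows, ["sensor", "hour", "reading"])

# Aggregations expressed on a DataFrame run through Spark's optimizer,
# avoiding per-record Python function calls on the executors.
stats = df.groupBy("sensor").agg(
    F.avg("reading").alias("mean_reading"),
    F.stddev("reading").alias("stddev_reading"),
)
stats.show()
```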
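For the testing topic, the talk points to SparkTestingBase; as a rough stand-in that assumes nothing about that library's API and uses only stock unittest plus a local Spark session, a test of the job's logic might look like this:

```python
import unittest

from pyspark.sql import SparkSession


class MeanValueTest(unittest.TestCase):
    """Runs the job's logic against a local, in-process Spark session."""

    @classmethod
    def setUpClass(cls):
        # local[2]: two worker threads, no real cluster needed.
        cls.spark = (
            SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
        )

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_mean_of_known_values(self):
        rdd = self.spark.sparkContext.parallelize([("t0", 1.0), ("t1", 3.0)])
        self.assertAlmostEqual(rdd.map(lambda pair: pair[1]).mean(), 2.0)


if __name__ == "__main__":
    unittest.main()
```

Sharing one session across the test class keeps the suite fast, which is the main convenience a helper library like SparkTestingBase also provides.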

Taught by

Open Data Science

