Overview
Learn best practices for using PySpark in real-world applications in this conference talk from ODSC West 2015. Discover how to manage dependencies on a cluster, avoid common pitfalls of Python's duck typing, and understand Spark's computational model so that your distributed code runs effectively. Explore techniques for package management with virtualenv, testing PySpark applications, and structuring code for performance. Gain insights into handling complex dependencies, implementing proper logging, and working across multiple Python environments. Follow along with a practical example of a statistical analysis of time series data that reinforces the key concepts and improves your PySpark development skills.
Syllabus
Cloudera
Spark Execution Model
PySpark Driver Program
How do we ship around Python functions?
Pickle!
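Spark ships the functions you pass to transformations by serializing them with a pickle variant (cloudpickle, which also handles lambdas and closures the stdlib cannot). A minimal stdlib sketch of the round trip, using a plain module-level function:

```python
import pickle

def add_one(x):
    # A top-level function: pickle serializes a reference to it by module and name.
    return x + 1

blob = pickle.dumps(add_one)    # roughly what the driver ships over the wire
restored = pickle.loads(blob)   # roughly what a worker reconstructs
print(restored(41))             # → 42
```

The catch is that everything a shipped function touches must itself be serializable, which is what the later slides on writing serializable code are about.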
DataFrame is just another word for...
Use DataFrames
REPLs and Notebooks
Share your code
Standard Python Project
What is the shape of a PySpark job?
PySpark Structure?
Simple Main Method
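A hedged sketch of the "simple main method" shape: all Spark wiring lives in `main()`, while the actual logic sits in pure functions. The app name, input path, and `parse_record` helper are illustrative, and pyspark is assumed to be available when the job actually runs under spark-submit:

```python
def parse_record(line):
    # Pure Python logic: unit-testable with no cluster and no SparkContext.
    name, value = line.split(",")
    return name.strip(), float(value)

def main():
    # Spark is touched only here, so importing this module from a test is cheap.
    from pyspark import SparkContext   # assumed installed in the job environment
    sc = SparkContext(appName="timeseries-stats")
    try:
        pairs = sc.textFile("hdfs:///data/series.csv").map(parse_record)
        print(pairs.take(5))
    finally:
        sc.stop()

# spark-submit runs the module as __main__, so the real file ends with the
# usual guard:  if __name__ == "__main__": main()
```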
Write Testable Code
Write Serializable Code
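The classic serialization pitfall: returning a bound method (or a closure over `self`) drags the whole object into the pickle, including anything unpicklable it holds. A stdlib sketch of both the trap and the fix; the `Model` class and its lock are hypothetical stand-ins for, say, an object holding a database connection:

```python
import pickle
import threading
from functools import partial

def scale(factor, x):
    # Top-level function: serializes cleanly by reference.
    return factor * x

class Model:
    def __init__(self, factor):
        self.factor = factor
        self.lock = threading.Lock()   # unpicklable attribute (stand-in for a connection)

    def bad_mapper(self):
        # A bound method pulls the entire instance, lock included, into the pickle.
        return self.times

    def times(self, x):
        return self.factor * x

    def good_mapper(self):
        # Capture only the value needed; partial over a top-level function pickles fine.
        return partial(scale, self.factor)

m = Model(3)
try:
    pickle.dumps(m.bad_mapper())
    bad_ok = True
except TypeError:
    bad_ok = False           # fails: cannot pickle the lock inside self

good = pickle.loads(pickle.dumps(m.good_mapper()))
print(bad_ok, good(14))      # → False 42
```

The same principle applies when passing methods to `map` or `filter` in Spark: copy the fields you need into locals (or a `partial`) instead of referencing `self`.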
Testing with SparkTestingBase
Testing Suggestions
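spark-testing-base, named above, provides base test cases that manage a local SparkContext for you (its exact class names are omitted here). Independent of any library, a recurring suggestion is to keep most logic in pure functions so plain unittest covers it without a cluster. A minimal sketch, with a hypothetical `rolling_mean` time-series helper:

```python
import unittest

def rolling_mean(values, window):
    # Pure time-series helper: trivially testable, no Spark needed.
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

class RollingMeanTest(unittest.TestCase):
    def test_window_of_two(self):
        self.assertEqual(rolling_mean([1.0, 3.0, 5.0], 2), [2.0, 4.0])

    def test_window_equal_to_length(self):
        self.assertEqual(rolling_mean([2.0, 4.0], 2), [3.0])

# exit=False so the runner returns instead of calling sys.exit().
result = unittest.main(exit=False, argv=["rolling-mean-tests"]).result
```

Once the pure logic is trusted, the Spark-level tests only need to check the wiring (that the right function is applied to the right RDD or DataFrame), which keeps the slow, cluster-backed test suite small.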
Writing distributed code is the easy part...
Get Serious About Logs
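On the driver, standard Python logging works as usual; code running inside executors writes to the worker logs instead, and Spark's own logging goes through log4j. A small driver-side setup sketch (the logger name and format string are arbitrary choices):

```python
import logging
import sys

def get_logger(name="pyspark-job"):
    # Configure once; repeated calls reuse the same handler.
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

log = get_logger()
log.info("driver started")   # lands in the driver's stderr
```

Anything logged inside a function shipped to `map` or `filter` ends up in the executor logs on the workers, so collect those (e.g. via the Spark UI or YARN log aggregation) rather than expecting it on the driver console.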
Know your environment
Complex Dependencies
Many Python Environments
Taught by
Open Data Science