Data Science on Google Cloud Platform: Building Data Pipelines
Overview
Learn how to design and build big data pipelines on Google Cloud Platform.
Cloud computing brings on-demand scalability and elasticity to data science applications. Expertise in the major platforms, such as Google Cloud Platform (GCP), is essential for IT professionals. This course, one of a series by veteran cloud engineering specialist and data scientist Kumaran Ponnambalam, shows how to use the latest technologies in GCP to build a big data pipeline that ingests, transports, and transforms data entirely in the cloud. Learn how to set up data processing jobs using Apache Beam and Cloud Dataflow. Discover how to use Cloud Pub/Sub for stream ingestion and real-time messaging. Finally, find out how to process the stream events in Cloud Dataflow. The course works through an end-to-end use case that shows how to apply its knowledge and best practices in a practical data science workflow.
Syllabus
Introduction
- What goes into a data pipeline?
- Data science modules covered
- GCP data pipeline options
- Cloud Dataproc
- Cloud Dataflow
- Cloud Pub/Sub
- What is Apache Beam?
- Beam pipelines
- PCollections
- Transforms
- Pipeline I/O
- Runners
- Setting up GCP for Dataflow
- Setting up Python
- Creating a simple pipeline
- Executing in Dataflow
- Reading text files
- ParDo
- GroupBy
- Map
- Combine
- Writing data to text files
- Other capabilities
- What is Pub/Sub?
- Topics and messages
- Publishers
- Subscribers
- Create a topic
- Create a subscription
- Publish and receive
- Python SDK
- Streaming with Dataflow
- Windowing with Dataflow
- Streaming and windowing example
- Next steps
Taught by
Kumaran Ponnambalam
Related Courses
- Data Science on Google Cloud Platform: Exploratory Data Analytics
- Data Science on Google Cloud Platform: Architecting Solutions
- Google Cloud Platform Big Data and Machine Learning Fundamentals
- Google Cloud Platform for Enterprise Essential Training (2019) (rated 3.0)
- Google Cloud Platform Essential Training for Administrators
- Google Cloud Platform (GCP) Essential Training for Developers