
In this course, you will learn data engineering with Apache Spark in Azure Synapse Analytics, including working with Delta Lake and visualizing data.
You will master the core features and capabilities of Apache Spark for large-scale data processing and analytics within the Azure Synapse Analytics environment. This includes configuring Spark pools and using notebooks to run code that loads, analyzes, and visualizes data from a data lake. You will also learn how Spark works in a distributed environment and how to use dataframes and Spark SQL to manipulate data.
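As a preview of that notebook workflow, here is a minimal PySpark sketch of loading data from a data lake into a dataframe and querying it with Spark SQL. The storage path, file format, and view name are illustrative assumptions, not taken from the course itself.

```python
from pyspark.sql import SparkSession

# In a Synapse notebook a SparkSession named `spark` is provided automatically;
# creating one explicitly keeps this sketch self-contained.
spark = SparkSession.builder.appName("explore-data-lake").getOrCreate()

# Load CSV files from the data lake into a dataframe (hypothetical path).
orders = spark.read.load(
    "abfss://files@datalake.dfs.core.windows.net/orders/*.csv",
    format="csv",
    header=True,
    inferSchema=True,
)

# Inspect the inferred schema and a sample of rows.
orders.printSchema()
orders.show(10)

# Register the dataframe as a temporary view so it can be queried with Spark SQL.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) AS order_count FROM orders").show()
```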
You will be introduced to Delta Lake, an open-source storage layer that brings ACID transactions to Apache Spark. You will learn to create and work with Delta Lake tables, including updating them and querying previous versions through time travel. You will also explore how to define Delta Lake tables in the Spark metastore and query them with SQL, and how to use them as sources and sinks for streaming data.
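The sketch below illustrates those Delta Lake operations: saving a table, updating it transactionally, reading an earlier version, defining a metastore table, and reading the table as a stream. It assumes the `spark` session available in a Synapse notebook (where Delta Lake is preinstalled); the path, table name, and columns are hypothetical.

```python
from delta.tables import DeltaTable

delta_path = "abfss://files@datalake.dfs.core.windows.net/delta/products"

# Create a small dataframe and save it in Delta format, which writes the
# data files plus the transaction log that enables ACID guarantees.
df = spark.createDataFrame(
    [(1, "Bike Helmet", "Accessories", 24.99), (2, "Road Bike", "Bikes", 999.99)],
    ["ProductID", "Name", "Category", "Price"],
)
df.write.format("delta").mode("overwrite").save(delta_path)

# Update rows in place; Delta Lake applies the change as an atomic transaction.
products = DeltaTable.forPath(spark, delta_path)
products.update(
    condition="Category = 'Accessories'",
    set={"Price": "Price * 0.9"},
)

# Time travel: read the table as it existed at an earlier version.
original = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

# Define a table in the metastore over the same files and query it with SQL.
spark.sql(f"CREATE TABLE IF NOT EXISTS products USING DELTA LOCATION '{delta_path}'")
spark.sql("SELECT * FROM products").show()

# Use the Delta table as a streaming source; newly committed rows arrive
# as a stream that can be processed and written to a sink.
stream_df = spark.readStream.format("delta").load(delta_path)
```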
Finally, you will transform data using Spark: loading data into dataframes, restructuring it, and saving it in formats such as Parquet. The course covers partitioning data for optimization and filtering partitioned data in queries, using SQL to query and transform data, and visualizing data within Spark notebooks using built-in charts and Python libraries such as Matplotlib.
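To tie those steps together, here is a hedged sketch, continuing from the `orders` dataframe and `spark` session in the first example: derive a column, write the data as partitioned Parquet, read back a single partition, aggregate with SQL, and chart the result with Matplotlib. Column names and paths are again illustrative assumptions.

```python
from pyspark.sql.functions import year, col
import matplotlib.pyplot as plt

# Derive a Year column and write the data partitioned by it; each distinct
# year becomes a Year=<value> subfolder in the data lake.
transformed = orders.withColumn("Year", year(col("OrderDate")))
transformed.write.partitionBy("Year").mode("overwrite").parquet(
    "abfss://files@datalake.dfs.core.windows.net/transformed/orders"
)

# Reading one partition folder lets Spark skip all the other years' files.
orders_2023 = spark.read.parquet(
    "abfss://files@datalake.dfs.core.windows.net/transformed/orders/Year=2023"
)

# Aggregate with Spark SQL, then collect the small result to pandas for plotting.
orders_2023.createOrReplaceTempView("orders_2023")
totals = spark.sql(
    "SELECT Category, SUM(Price) AS Revenue FROM orders_2023 GROUP BY Category"
).toPandas()

# Render an inline bar chart in the notebook with Matplotlib.
plt.bar(totals["Category"], totals["Revenue"])
plt.title("Revenue by category (2023)")
plt.xlabel("Category")
plt.ylabel("Revenue")
plt.show()
```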