This repository contains resources for the Data Engineering Specialization, a MOOC offered by DeepLearning.AI on developing and deploying data systems that deliver real business value. It includes course subtitles and solutions to the lab exercises for all courses in the specialization.
This specialization equips learners with the skills to approach data engineering problems systematically, focusing on the full data engineering lifecycle. Key topics include building effective data pipelines, designing scalable architectures, and applying data transformations to serve business needs.
- Think critically about the components of an end-to-end data architecture that satisfies requirements while remaining flexible for the future.
- Evaluate technologies and tools against the context of requirements and good data architecture.
- Design a data architecture and implement batch and streaming pipelines on AWS.
- Identify data formats and appropriate source systems for various use cases.
- Differentiate between relational and NoSQL databases, and understand ACID compliance.
- Perform batch and streaming data ingestion using ETL and ELT patterns.
- Interact with REST APIs, object storage, and event-streaming platforms for ingestion.
- Learn DataOps concepts such as CI/CD, Infrastructure as Code, and observability.
- Orchestrate data pipelines using Airflow and integrate data quality tests.
- Explore storage systems (object, block, and file) and their impact on performance.
- Understand the differences between row-oriented and column-oriented databases.
- Learn about data warehouses, data lakes, and the lakehouse architecture.
- Implement advanced SQL queries and strategies for query performance optimization.
- Execute aggregate and join queries on streaming data.
- Define data modeling techniques and apply normalization to data.
- Understand warehouse modeling approaches (Inmon, Kimball, Data Vault) and transform data for analytics.
- Prepare tabular, textual, and image data for machine learning models.
- Compare transformation frameworks such as pandas and Spark, and perform batch and streaming transformations.
- Serve processed data to stakeholders using modern architectures.

The repository includes:

- Transcripts
- Lab solutions
This repository is intended for educational purposes only. Its contents, including the subtitles, are not my own. The lab solutions are provided for reference only and are not endorsed by the course instructors or the platform. Please adhere to the course’s honor code and guidelines when using these resources.