The Azure Formula1 project implements a data pipeline that consumes data from the Ergast API and makes F1 driver and constructor standings available for Business Intelligence consumption. The pipeline is built on Microsoft Azure, with ADLS Gen2 as the data lake, Databricks/Spark as the data transformation framework, and Data Factory as the orchestrator.
The project ingests the following files from the Ergast API. Because each file arrives in a different format, each one requires a different approach with the Spark reader API.
| File Name | Format |
|---|---|
| Races | CSV |
| Constructors | Single Line JSON |
| Drivers | Single Line Nested JSON |
| Results | Single Line JSON |
| Pit Stops | Multi Line JSON |
| Lap Times | Split CSV Files |
| Qualifying | Split Multi Line JSON Files |
Ingestion requirements:
- Ingest all 8 files into the data lake.
- Ingested data must have a schema applied.
- Ingested data must have audit columns.
- Ingested data must be stored in a columnar format.
- It must be possible to analyze the ingested data via SQL.
- The ingestion logic must be able to handle incremental loads.
Transformation requirements:
- Join the key information required for reporting to create a new table.
- Join the key information required for analysis to create a new table.
- Transformed tables must have audit columns.
Analysis requirements:
- Driver standings for each year.
- Constructor standings for each year.
- Most dominant drivers over the years.
- Most dominant constructors over the years.
- Visualize the outputs.
Prerequisites:
- Microsoft Azure account
- Azure Databricks Service
- Azure Data Factory

