An Overview of the Medallion Pipeline

Data pipeline name: Medallion to Bronze ETL Pipeline

Owner: Data Engineering Team

Used since: July 2023

Purpose: The Medallion pipeline is an automated process that handles the extraction, loading, and processing of CSV data from the SATUK or GMISUK source systems into bronze storage as Parquet files.

Overview

The Medallion pipeline begins by setting source variables based on whether the workspace is dev or prod. If the workspace is dev, the storage container is set to dev and the source is set to SATUK. Otherwise, the storage container is set to prd and the source to GMISUK. Next, a notebook counts the number of CSV files in the respective source folder, either SATUK or GMISUK. An if condition then checks whether the file count is greater than 10. If it is not, the pipeline fails, so that a run never proceeds with insufficient data.

With the source variables set and the file count validated, the pipeline executes two child pipelines: Move Files to import and Archive Files. Move Files to import first deletes all previous files from the import folder, then copies the CSVs from the source folder into the import folder. Archive Files copies the CSVs from the source folder to an archive folder tagged with the current date. After the source files have been archived and moved to import, the Move Files to Processed pipeline is triggered to copy the CSVs from the import folder to a processed folder.

Next, a “Delete Previous Bronze Import” Notebook executes, deleting all files in the bronze import directory. Then a “Source to Bronze” Notebook reads the CSV files from staging/import and writes them to bronze/import in Parquet format. Finally, a “Handle Files” Notebook is executed. This copies the files from bronze/import to the bronze archive and bronze processed folders.

The Medallion pipeline runs on a scheduled nightly trigger at 3 AM UTC.

Pipeline Steps

1. Set Source Variables

  • An if condition checks if workspace = dev
    • If true:
      • Set storage container to dev
      • Set source variable to SATUK
    • Else (workspace != dev):
      • Set storage container to prd
      • Set source variable to GMISUK
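The branching above can be sketched in a few lines of Python. This is only an illustration of the logic, not the pipeline's actual activity definition; the names SourceConfig and set_source_variables are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical container for the two variables set in step 1."""
    storage_container: str
    source: str


def set_source_variables(workspace: str) -> SourceConfig:
    """Mirror the if condition: dev uses SATUK, any other workspace uses GMISUK."""
    if workspace == "dev":
        return SourceConfig(storage_container="dev", source="SATUK")
    # Else branch (workspace != dev): production container and source.
    return SourceConfig(storage_container="prd", source="GMISUK")
```

Note that every workspace value other than dev falls into the else branch, which matches the condition as described.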

2. Count Source Files

  • The ‘Run CheckImportFiles’ Notebook counts the CSV files in the source folder (/SATUK or /GMISUK)
  • An if condition checks if file count > 10
    • If true, continue
    • If false (10 or fewer files), fail pipeline
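A minimal sketch of the count-and-validate check, assuming the source folder is reachable as a local path. The function names and the case-insensitive match on the .csv extension are assumptions for illustration; the real notebook may count files differently.

```python
from pathlib import Path

# The pipeline continues only when the count is strictly greater than this value.
MIN_FILE_COUNT = 10


def count_csv_files(source_folder: Path) -> int:
    """Count the CSV files directly inside the source folder."""
    return sum(
        1
        for p in source_folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".csv"
    )


def check_import_files(source_folder: Path) -> int:
    """Return the file count, or raise to fail the run when there are 10 or fewer files."""
    count = count_csv_files(source_folder)
    if count <= MIN_FILE_COUNT:
        raise RuntimeError(
            f"Only {count} CSV file(s) in {source_folder}; more than {MIN_FILE_COUNT} required"
        )
    return count
```

Because the condition is "greater than 10", a folder holding exactly 10 files still fails the run.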

3. Move Source Files

  • Execute Move Files pipeline:
    • Deletes all previous files from /import
    • Copies CSV files from source to /import
  • Execute Archive Files pipeline:
    • Copies CSV files from source to /archive, using current date
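The two child pipelines in this step can be approximated with the standard library. This sketch assumes local folders, treats "all previous files" as the files directly inside the import folder, and uses an ISO date (YYYY-MM-DD) for the archive subfolder; the actual date format is not stated, so that choice is an assumption, as are the function names.

```python
from __future__ import annotations

import shutil
from datetime import date
from pathlib import Path


def move_files_to_import(source: Path, import_dir: Path) -> list[Path]:
    """Delete previous files from the import folder, then copy the source CSVs into it."""
    import_dir.mkdir(parents=True, exist_ok=True)
    for old in import_dir.iterdir():
        if old.is_file():
            old.unlink()
    copied = []
    for csv_path in sorted(source.glob("*.csv")):
        target = import_dir / csv_path.name
        shutil.copy2(csv_path, target)
        copied.append(target)
    return copied


def archive_files(source: Path, archive_root: Path, run_date: date | None = None) -> Path:
    """Copy the source CSVs into an archive subfolder named after the run date."""
    run_date = run_date or date.today()
    target_dir = archive_root / run_date.isoformat()  # assumed folder naming
    target_dir.mkdir(parents=True, exist_ok=True)
    for csv_path in sorted(source.glob("*.csv")):
        shutil.copy2(csv_path, target_dir / csv_path.name)
    return target_dir
```

Both functions copy rather than move, so the source folder is left untouched. The same delete-then-copy pattern applies to the /import to /processed step that follows.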

4. Process Imported Files

  • Execute Process Imports pipeline:
    • Deletes previous files from /processed
    • Copies CSV files from /import to /processed

5. Load to Bronze

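  • The ‘Delete Previous Bronze Import’ Notebook deletes all files in the bronze import directory
  • The ‘Source to Bronze’ Notebook copies the staged CSVs from staging/import to bronze/import as Parquet files
  • The ‘Handle Files’ Notebook copies the Parquet files from bronze/import into the bronze archive and processed directories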
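The conversion and file-handling logic can be sketched as follows. The sketch uses pandas as a stand-in for whatever engine the notebooks actually run on, and it assumes local folders plus an installed Parquet engine such as pyarrow; all function names are hypothetical.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def parquet_path_for(csv_path: Path, bronze_import: Path) -> Path:
    """Map a staged CSV to its Parquet target, keeping the base file name."""
    return bronze_import / csv_path.with_suffix(".parquet").name


def source_to_bronze(staging_import: Path, bronze_import: Path) -> list[Path]:
    """Convert every staged CSV to a Parquet file in the bronze import folder."""
    # Assumption: pandas and a Parquet engine (for example pyarrow) are installed.
    import pandas as pd

    bronze_import.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(staging_import.glob("*.csv")):
        target = parquet_path_for(csv_path, bronze_import)
        pd.read_csv(csv_path).to_parquet(target, index=False)
        written.append(target)
    return written


def handle_files(bronze_import: Path, bronze_archive: Path, bronze_processed: Path) -> None:
    """Copy each Parquet file from bronze import to both archive and processed."""
    for folder in (bronze_archive, bronze_processed):
        folder.mkdir(parents=True, exist_ok=True)
    for parquet_path in sorted(bronze_import.glob("*.parquet")):
        for folder in (bronze_archive, bronze_processed):
            shutil.copy2(parquet_path, folder / parquet_path.name)
```

Letting pandas infer column types keeps the sketch short. A production load would more likely pin an explicit schema so that column types stay stable from one night to the next.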

6. Trigger Schedule

  • Nightly at 3 AM UTC
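For reference, the nightly 3 AM UTC schedule corresponds to the following next-run calculation. This is a sketch of the timing rule only, not the trigger definition used by the orchestrator, and the function name is hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

# Daily trigger time, expressed in UTC.
RUN_TIME_UTC = time(hour=3, tzinfo=timezone.utc)


def next_run(now: datetime) -> datetime:
    """Return the next 03:00 UTC trigger strictly after `now` (must be timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), RUN_TIME_UTC)
    if candidate <= now_utc:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

Working in UTC avoids any shift when local clocks change for daylight saving time.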
