Objective:
The Snapshot pipeline aims to streamline data analysis and management by creating optimised views from raw data files.
Process Overview:
- Loading Parquet Files:
- The pipeline begins by reading raw data stored in Parquet format from the bronze container. These files contain data sourced from IFS systems.
- Delta Lake Tables:
- Upon successful loading, the data is saved as Delta Lake tables within the Synapse Workspace. This ensures data integrity and enables efficient processing.
- View Creation:
- SQL queries stored in Azure DevOps are executed using Spark SQL to generate the views for each Data Mart. Each query is written and reviewed for accuracy and consistency.
- Storage and Accessibility:
- Generated views are uploaded to a designated storage container for easy accessibility. They are then made accessible via an application API, which serves as a bridge for further pipeline processes.
- Integration into Data Warehouse:
- Views are integrated into the data warehouse under the ‘svw’ (Snapshot view) schema. This integration enables centralised data management and retrieval within the data warehouse ecosystem.
Loading Parquet Files and Delta Lake Table Creation:
- Parquet Data Ingestion:
- The Snapshot pipeline kicks off by reading raw data stored in Parquet format. This process is orchestrated using a Synapse Notebook.
- A ForEach loop iterates through the Parquet files so that each file is ingested.
- Dynamic Table Names:
- Table names are dynamically acquired via MyBMT Entity 26 and stored in variables.
- These names remain accessible throughout the pipeline.
- Delta Lake Transformation:
- To ensure compatibility with Delta Lake and Spark SQL, any special characters (such as “$”) in table names or columns are replaced with underscores.
- The tables are then saved as Delta Lake tables within the Synapse Workspace, as sketched below.
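A minimal PySpark sketch of this step is shown below. It assumes `spark` is the session provided by the Synapse Notebook runtime; the storage path, table name, and helper names are illustrative placeholders rather than the pipeline's actual values.

```python
import re


def sanitize(name: str) -> str:
    # Replace characters that Delta Lake / Spark SQL cannot handle (e.g. "$") with "_".
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


def load_parquet_as_delta(source_path: str, table_name: str) -> None:
    # Read the raw Parquet file from the bronze container.
    df = spark.read.parquet(source_path)

    # Rename any columns containing special characters.
    for col in df.columns:
        clean = sanitize(col)
        if clean != col:
            df = df.withColumnRenamed(col, clean)

    # Persist the result as a Delta Lake table in the Synapse workspace.
    df.write.format("delta").mode("overwrite").saveAsTable(sanitize(table_name))


# Example call for one of the tables returned by the ForEach loop
# (hypothetical path and table name).
load_parquet_as_delta(
    "abfss://bronze@<storage_account>.dfs.core.windows.net/ifs/orders.parquet",
    "ifs$orders",
)
```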
View Creation:
- Query Commitment in Azure DevOps:
- SQL queries are committed in Azure DevOps, ensuring version control and collaboration.
- These queries define the logic for creating views.
- Release Pipeline and Storage Container:
- A release pipeline uploads the view queries from the svw folder into a designated storage container.
- This container makes the queries available to the pipeline at run time.
- Custom Function and Query Transformation:
- A custom function reads queries from the container.
- Each view is dynamically created based on its specific query.
- Python’s .replace() string method rewrites parts of each query so that it is compatible with Spark SQL (see the sketch after this list).
- Spark SQL Execution and Pandas Dataframe:
- Spark SQL executes the modified query, resulting in a Spark dataframe.
- The dataframe is then converted to a Pandas dataframe.
- Two additional columns are added: one for the pipeline log (current timestamp) and another for next Sunday’s date.
- Parquet Upload to Gold Container:
- Each Pandas dataframe is written as a Parquet file and uploaded to the gold container within the storage account.
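The sketch below illustrates this flow inside the Synapse Notebook. It assumes `spark` is the session provided by the notebook runtime; the string substitutions, the audit column names (pipeline_log, snapshot_date), the STORAGE_CONNECTION_STRING setting, and the helper names are assumptions made for illustration, not the pipeline's exact implementation.

```python
import io
import os
from datetime import datetime, timedelta

from azure.storage.blob import ContainerClient

# Client for the gold container; in the real pipeline the connection string
# would come from a linked service or a secret store.
gold = ContainerClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"], container_name="gold"
)


def next_sunday(today: datetime) -> datetime:
    # Days until the coming Sunday (weekday 6). Whether "next Sunday" includes
    # today when today is already a Sunday is an assumption made here.
    return today + timedelta(days=(6 - today.weekday()) % 7)


def build_view(view_name: str, raw_query: str) -> None:
    # Rewrite T-SQL-specific parts of the query for Spark SQL. The actual
    # substitutions used by the pipeline are not documented here; these two
    # are placeholders.
    spark_query = raw_query.replace("[dbo].", "").replace("$", "_")

    # Execute with Spark SQL and convert the result to a Pandas dataframe.
    pdf = spark.sql(spark_query).toPandas()

    # Add the two audit columns: pipeline log timestamp and next Sunday's date.
    now = datetime.utcnow()
    pdf["pipeline_log"] = now
    pdf["snapshot_date"] = next_sunday(now).date()

    # Serialise to Parquet in memory and upload to the gold container.
    buffer = io.BytesIO()
    pdf.to_parquet(buffer, index=False)
    buffer.seek(0)
    gold.upload_blob(name=f"{view_name}.parquet", data=buffer, overwrite=True)
```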
API Access and Loading into the Data Warehouse:
- Python Flask App for Data Retrieval:
- A Python Flask app, hosted as an Azure Web App, acts as the data gateway.
- It reads data from the gold container and serves it in JSON format (see the sketch after this list).
- Copy Activity for Dynamic Table Creation:
- The JSON data becomes the source for a copy activity.
- Each view within the container is dynamically copied into its own specific table in the data warehouse.
- These tables reside under the ‘svw’ schema.
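A minimal sketch of the Flask gateway is shown below, assuming the gold container is read with the azure-storage-blob SDK. The route, container name, and STORAGE_CONNECTION_STRING setting are illustrative assumptions, not the app's actual contract.

```python
import io
import os

import pandas as pd
from azure.storage.blob import ContainerClient
from flask import Flask

app = Flask(__name__)

# Client for the gold container holding the Snapshot views as Parquet files.
gold = ContainerClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"], container_name="gold"
)


@app.route("/views/<view_name>")
def get_view(view_name: str):
    # Download the Parquet file for the requested view and return it as JSON,
    # which the copy activity then uses as its source.
    data = gold.get_blob_client(f"{view_name}.parquet").download_blob().readall()
    df = pd.read_parquet(io.BytesIO(data))
    return app.response_class(df.to_json(orient="records"), mimetype="application/json")
```

In the pipeline itself, the copy activity points at this endpoint and maps each view's JSON payload to its own table under the ‘svw’ schema.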
Conclusion:
The Snapshot pipeline offers an efficient solution for transforming raw data into actionable insights. By automating view creation and integration into the data warehouse, it enhances data accessibility and streamlines data analysis processes.
Snapshot views can then be consumed downstream, for example from Power Query:

```
= Sql.Database(DataMart_Host, DataMart_Endpoint, [Query="EXEC get.myView @dataMart = 'project', @viewName = 'details_plus', @token = '" & DataMart_Token & "', @version = '202404'"])
```