Overview
The myOSH_To_Bronze notebook is designed to handle multiple MyOSH API endpoints dynamically, meaning it can process different endpoints such as records or users without hardcoding logic. For most endpoints (like users), the process is straightforward: it performs a regular API call and writes the response directly to JSON in the source container and to Parquet in the Bronze layer.
When the endpoint is records, the notebook adds extra logic to merge the latest API data with the historical archive, remove duplicates, and filter out invalid records before writing the cleaned dataset back to the Bronze Processed folder. This ensures that the records dataset remains complete and analytics-ready, while other endpoints follow a simpler workflow.
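This endpoint-aware branching can be pictured with a short sketch. Everything in it is illustrative rather than the notebook's actual code: the base URL, function names, and the inline dedupe/filter stand in for the fuller merge-and-clean steps described later.

```python
import requests

API_BASE = "https://api.myosh.example/v4"   # hypothetical base URL, not the real MyOSH endpoint

def fetch_endpoint(endpoint: str) -> list[dict]:
    # Plain GET; the real notebook also adds authentication and paging (see step 1 below).
    resp = requests.get(f"{API_BASE}/{endpoint}", timeout=60)
    resp.raise_for_status()
    return resp.json()

def process(endpoint: str) -> list[dict]:
    rows = fetch_endpoint(endpoint)
    if endpoint == "records":
        # Records only: drop rows without a creationDate and keep one row per id.
        seen, cleaned = set(), []
        for row in rows:
            if row.get("creationDate") and row.get("id") not in seen:
                seen.add(row["id"])
                cleaned.append(row)
        return cleaned
    return rows   # users and other endpoints pass straight through to Bronze
```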
Why Do We Need It?
APIs often return only current data, and archives can accumulate duplicates or incomplete records over time. Without cleaning and merging, analytics and compliance reporting can be compromised. This notebook solves that by:
- Fetching the latest records from the MyOSH API.
- Combining them with the historical archive.
- Removing duplicates and invalid entries.
- Writing the cleaned dataset back to the Bronze layer for downstream processing.
What Happens Behind the Scenes?
1. Fetch Current Records
The notebook connects to the MyOSH API and retrieves all available records in batches of 1,000 rows.
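A minimal paging sketch is shown below. The URL, authentication scheme, and paging parameter names are assumptions; only the 1,000-row batch size comes from the notebook itself.

```python
import requests

API_URL = "https://api.myosh.example/v4/records"   # hypothetical URL
PAGE_SIZE = 1_000                                   # batch size used by the notebook

def fetch_all_records(token: str) -> list[dict]:
    """Page through the records endpoint 1,000 rows at a time."""
    headers = {"Authorization": f"Bearer {token}"}   # auth header is an assumption
    records, page = [], 1
    while True:
        resp = requests.get(
            API_URL,
            headers=headers,
            params={"page": page, "pageSize": PAGE_SIZE},  # parameter names assumed
            timeout=60,
        )
        resp.raise_for_status()
        batch = resp.json()
        records.extend(batch)
        if len(batch) < PAGE_SIZE:   # last (partial) page reached
            break
        page += 1
    return records
```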
2. Write Raw Data to Import and Archive
The raw API response is written to:
- Import folder (JSON format for immediate reference).
- Archive folder (dated snapshot for historical tracking).
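The dual write might look like the following sketch. The local folder paths are placeholders for the actual storage containers, and the file-naming pattern is an assumption.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Placeholder paths; the notebook writes to lake storage containers instead.
IMPORT_DIR = Path("/lake/bronze/myosh/import")
ARCHIVE_DIR = Path("/lake/bronze/myosh/archive")

def write_raw(records: list[dict], endpoint: str) -> None:
    IMPORT_DIR.mkdir(parents=True, exist_ok=True)
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)

    payload = json.dumps(records, indent=2)

    # Import copy: latest pull, overwritten each run for immediate reference.
    (IMPORT_DIR / f"{endpoint}.json").write_text(payload)

    # Archive copy: dated snapshot kept for historical tracking.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    (ARCHIVE_DIR / f"{endpoint}_{stamp}.json").write_text(payload)
```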
3. Serialise Nested Data
Complex JSON fields are converted into strings to ensure compatibility with CSV and Parquet formats.
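Assuming the data is held in a PySpark DataFrame at this point, the conversion can be done by turning every struct, array, or map column into a JSON string, as in this sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, col
from pyspark.sql.types import StructType, ArrayType, MapType

spark = SparkSession.builder.getOrCreate()   # provided automatically inside the notebook

def serialise_nested(df):
    """Convert struct/array/map columns into JSON strings so the frame
    can be written to CSV or Parquet without nested types."""
    for field in df.schema.fields:
        if isinstance(field.dataType, (StructType, ArrayType, MapType)):
            df = df.withColumn(field.name, to_json(col(field.name)))
    return df
```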
4. Add Pipeline Log
A timestamp column is added to track when the data was processed.
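In PySpark terms this is a single derived column; the column name used here is illustrative.

```python
from pyspark.sql.functions import current_timestamp

def add_pipeline_log(df):
    # Stamp every row with the processing time (column name is an assumption).
    return df.withColumn("pipeline_log_datetime", current_timestamp())
```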
5. Remove Duplicates and Invalid Records
- Duplicate rows based on id are dropped.
- Records with an empty creationDate are removed to ensure data integrity.
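A sketch of both checks on a PySpark DataFrame, assuming id and creationDate are the column names:

```python
from pyspark.sql.functions import col, trim

def clean_records(df):
    # Keep one row per id, then drop rows where creationDate is null or blank.
    deduped = df.dropDuplicates(["id"])
    return deduped.filter(
        col("creationDate").isNotNull() & (trim(col("creationDate")) != "")
    )
```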
6. Merge with Existing Archive
The cleaned API data is combined with the existing archive stored in the Bronze layer. This creates a unified dataset containing both historical and current records.
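One way to picture the merge is a union followed by another dedupe on id; the archive path and the choice to let the new data win on duplicates are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for the existing Bronze archive location.
ARCHIVE_PATH = "abfss://bronze@lake.dfs.core.windows.net/myosh/records_archive"

def merge_with_archive(cleaned_df):
    archive_df = spark.read.parquet(ARCHIVE_PATH)
    # Align columns by name (Spark 3.1+), then keep the first occurrence of each id,
    # so the freshly cleaned rows take precedence over archived ones.
    combined = cleaned_df.unionByName(archive_df, allowMissingColumns=True)
    return combined.dropDuplicates(["id"])
```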
7. Write to Bronze Processed
The final dataset is written back to:
- Bronze Processed folder (Parquet format for analytics).
- Archive container (dated snapshot for compliance).
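A sketch of the final write, with placeholder container paths and an assumed date-partition layout for the snapshot:

```python
from datetime import datetime, timezone

# Placeholder output locations for the Bronze Processed and Archive containers.
PROCESSED_PATH = "abfss://bronze@lake.dfs.core.windows.net/myosh/processed/records"
ARCHIVE_ROOT = "abfss://archive@lake.dfs.core.windows.net/myosh/records"

def write_outputs(final_df):
    # Analytics copy: overwrite the processed Parquet dataset.
    final_df.write.mode("overwrite").parquet(PROCESSED_PATH)

    # Compliance copy: dated Parquet snapshot alongside the historical archive.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    final_df.write.mode("overwrite").parquet(f"{ARCHIVE_ROOT}/{stamp}")
```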
Why Is This Important?
- Accuracy: Removes duplicates and invalid entries.
- Completeness: Combines current and historical data.
- Analytics-ready: Stores data in Parquet for efficient querying.
Key Benefits
- Automated data cleaning and merging.
- Historical snapshots for auditing.
- Optimised storage for downstream analytics.