Overview
The myOSH_To_Bronze notebook is designed to handle multiple MyOSH API endpoints dynamically, meaning it can process different endpoints such as records or users without hardcoding logic. For most endpoints (like users), the process is straightforward: it performs a regular API call and writes the response directly to JSON in the source container and to Parquet in the Bronze layer.
When the endpoint is records, the notebook adds extra logic to merge the latest API data with the historical archive, remove duplicates, and filter out invalid records before writing the cleaned dataset back to the Bronze Processed folder. This ensures that the records dataset remains complete and analytics-ready, while other endpoints follow a simpler workflow.
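This endpoint-aware branching can be pictured with a short sketch. Everything in it is illustrative rather than the notebook's actual code: the base URL, function names, and the inline dedupe/filter stand in for the fuller merge-and-clean steps described later.

```python
import requests

API_BASE = "https://api.myosh.example/v4"   # hypothetical base URL, not the real MyOSH endpoint

def fetch_endpoint(endpoint: str) -> list[dict]:
    # Plain GET; the real notebook also adds authentication and paging (see step 1 below).
    resp = requests.get(f"{API_BASE}/{endpoint}", timeout=60)
    resp.raise_for_status()
    return resp.json()

def process(endpoint: str) -> list[dict]:
    rows = fetch_endpoint(endpoint)
    if endpoint == "records":
        # Records only: drop rows without a creationDate and keep one row per id.
        seen, cleaned = set(), []
        for row in rows:
            if row.get("creationDate") and row.get("id") not in seen:
                seen.add(row["id"])
                cleaned.append(row)
        return cleaned
    return rows   # users and other endpoints pass straight through to Bronze
```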
Why Do We Need It?
APIs often return only current data, and archives can accumulate duplicates or incomplete records over time. Without cleaning and merging, analytics and compliance reporting can be compromised. This notebook solves that by:
- Fetching the latest records from the MyOSH API.
- Combining them with the historical archive.
- Removing duplicates and invalid entries.
- Writing the cleaned dataset back to the Bronze layer for downstream processing.
What Happens Behind the Scenes?
1. Fetch Current Records
The notebook connects to the MyOSH API and retrieves all available records in batches of 1,000 rows.
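A minimal paging sketch is shown below. The URL, authentication scheme, and paging parameter names are assumptions; only the 1,000-row batch size comes from the notebook itself.

```python
import requests

API_URL = "https://api.myosh.example/v4/records"   # hypothetical URL
PAGE_SIZE = 1_000                                   # batch size used by the notebook

def fetch_all_records(token: str) -> list[dict]:
    """Page through the records endpoint 1,000 rows at a time."""
    headers = {"Authorization": f"Bearer {token}"}   # auth header is an assumption
    records, page = [], 1
    while True:
        resp = requests.get(
            API_URL,
            headers=headers,
            params={"page": page, "pageSize": PAGE_SIZE},  # parameter names assumed
            timeout=60,
        )
        resp.raise_for_status()
        batch = resp.json()
        records.extend(batch)
        if len(batch) < PAGE_SIZE:   # last (partial) page reached
            break
        page += 1
    return records
```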
2. Write Raw Data to Import and Archive
The raw API response is written to:
- Import folder (JSON format for immediate reference).
- Archive folder (dated snapshot for historical tracking).
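The dual write might look like the following sketch. The local folder paths are placeholders for the actual storage containers, and the file-naming pattern is an assumption.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Placeholder paths; the notebook writes to lake storage containers instead.
IMPORT_DIR = Path("/lake/bronze/myosh/import")
ARCHIVE_DIR = Path("/lake/bronze/myosh/archive")

def write_raw(records: list[dict], endpoint: str) -> None:
    IMPORT_DIR.mkdir(parents=True, exist_ok=True)
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)

    payload = json.dumps(records, indent=2)

    # Import copy: latest pull, overwritten each run for immediate reference.
    (IMPORT_DIR / f"{endpoint}.json").write_text(payload)

    # Archive copy: dated snapshot kept for historical tracking.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    (ARCHIVE_DIR / f"{endpoint}_{stamp}.json").write_text(payload)
```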
3. Serialise Nested Data
Complex JSON fields are converted into strings to ensure compatibility with CSV and Parquet formats.
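Assuming the data is held in a PySpark DataFrame at this point, the conversion can be done by turning every struct, array, or map column into a JSON string, as in this sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, col
from pyspark.sql.types import StructType, ArrayType, MapType

spark = SparkSession.builder.getOrCreate()   # provided automatically inside the notebook

def serialise_nested(df):
    """Convert struct/array/map columns into JSON strings so the frame
    can be written to CSV or Parquet without nested types."""
    for field in df.schema.fields:
        if isinstance(field.dataType, (StructType, ArrayType, MapType)):
            df = df.withColumn(field.name, to_json(col(field.name)))
    return df
```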
4. Add Pipeline Log
A timestamp column is added to track when the data was processed.
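In PySpark terms this is a single derived column; the column name used here is illustrative.

```python
from pyspark.sql.functions import current_timestamp

def add_pipeline_log(df):
    # Stamp every row with the processing time (column name is an assumption).
    return df.withColumn("pipeline_log_datetime", current_timestamp())
```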
5. Remove Duplicates and Invalid Records
- Duplicate rows based on id are dropped.
- Records with an empty creationDate are removed to ensure data integrity.
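A sketch of both checks on a PySpark DataFrame, assuming id and creationDate are the column names:

```python
from pyspark.sql.functions import col, trim

def clean_records(df):
    # Keep one row per id, then drop rows where creationDate is null or blank.
    deduped = df.dropDuplicates(["id"])
    return deduped.filter(
        col("creationDate").isNotNull() & (trim(col("creationDate")) != "")
    )
```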
6. Merge with Existing Archive
The cleaned API data is combined with the existing archive stored in the Bronze layer. This creates a unified dataset containing both historical and current records.
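One way to picture the merge is a union followed by another dedupe on id; the archive path and the choice to let the new data win on duplicates are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for the existing Bronze archive location.
ARCHIVE_PATH = "abfss://bronze@lake.dfs.core.windows.net/myosh/records_archive"

def merge_with_archive(cleaned_df):
    archive_df = spark.read.parquet(ARCHIVE_PATH)
    # Align columns by name (Spark 3.1+), then keep the first occurrence of each id,
    # so the freshly cleaned rows take precedence over archived ones.
    combined = cleaned_df.unionByName(archive_df, allowMissingColumns=True)
    return combined.dropDuplicates(["id"])
```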
7. Write to Bronze Processed
The final dataset is written back to:
- Bronze Processed folder (Parquet format for analytics).
- Archive container (dated snapshot for compliance).
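A sketch of the final write, with placeholder container paths and an assumed date-partition layout for the snapshot:

```python
from datetime import datetime, timezone

# Placeholder output locations for the Bronze Processed and Archive containers.
PROCESSED_PATH = "abfss://bronze@lake.dfs.core.windows.net/myosh/processed/records"
ARCHIVE_ROOT = "abfss://archive@lake.dfs.core.windows.net/myosh/records"

def write_outputs(final_df):
    # Analytics copy: overwrite the processed Parquet dataset.
    final_df.write.mode("overwrite").parquet(PROCESSED_PATH)

    # Compliance copy: dated Parquet snapshot alongside the historical archive.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    final_df.write.mode("overwrite").parquet(f"{ARCHIVE_ROOT}/{stamp}")
```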
Why Is This Important?
- Accuracy: Removes duplicates and invalid entries.
- Completeness: Combines current and historical data.
- Analytics-ready: Stores data in Parquet for efficient querying.
Key Benefits
- Automated data cleaning and merging.
- Historical snapshots for auditing.
- Optimised storage for downstream analytics.