Overview
The myosh_Archive notebook is designed to maintain a complete and accurate archive of records from the MyOSH API. It ensures that no data is lost by identifying gaps in the current API response and backfilling missing records, then merging everything into a single, consolidated archive stored in the Azure Data Lake container.
Why Do We Need It?
APIs often return only the latest data, and older records can be lost if they are not stored properly. This notebook solves that problem by:
- Detecting missing IDs that should exist.
- Fetching those records individually.
- Combining them with the existing archive for a complete dataset.
This guarantees data integrity and makes historical analysis possible.
What Happens Behind the Scenes?
Here’s the process in plain language:
1. Pull Current Data
The notebook makes paginated requests to the MyOSH records endpoint, retrieving data in batches of 1,000 rows. All batches are appended to a single DataFrame for processing.
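A minimal sketch of this paginated pull is shown below. The endpoint URL, bearer-token header, response shape, and the `page`/`pageSize` parameter names are assumptions for illustration, not the notebook's actual values.

```python
import requests
import pandas as pd

BASE_URL = "https://api.myosh.com/v4/records"   # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}   # hypothetical auth

def pull_current_records(page_size: int = 1000) -> pd.DataFrame:
    frames, page = [], 1
    while True:
        resp = requests.get(
            BASE_URL,
            headers=HEADERS,
            params={"pageSize": page_size, "page": page},  # assumed parameter names
        )
        resp.raise_for_status()
        batch = resp.json().get("records", [])             # assumed response shape
        if not batch:
            break
        frames.append(pd.DataFrame(batch))
        page += 1
    # Append all batches into a single DataFrame for processing
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```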
2. Identify Missing IDs
Once the data is collected, the DataFrame is sorted by id to determine the highest ID returned by the API. Using this value, the notebook creates the complete range of IDs that should exist. It then compares this range against:
- IDs from the current API call.
- IDs already stored in the existing records_archive CSV.
Any IDs not found in either set are flagged as missing. These missing IDs form the list used for targeted individual API calls, as sketched below.
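A minimal sketch of the gap detection, assuming IDs are positive integers starting at 1 and that the archive is readable from a CSV path (the function name `find_missing_ids` is illustrative):

```python
import pandas as pd

def find_missing_ids(current_df: pd.DataFrame, archive_csv_path: str) -> list:
    # IDs already stored in the existing archive (e.g. records_archive.csv)
    archive_df = pd.read_csv(archive_csv_path, usecols=["id"])

    # Complete range of IDs that should exist, up to the highest ID returned
    highest_id = int(current_df["id"].max())
    expected_ids = set(range(1, highest_id + 1))

    # Anything not seen in the current pull or in the archive is missing
    known_ids = set(current_df["id"]) | set(archive_df["id"])
    return sorted(expected_ids - known_ids)
```

The resulting list then feeds the backfill in step 3.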
3. Backfill Missing Records
The notebook loops through the list of missing IDs and makes individual API calls to retrieve those records. These backfilled records are combined and stored in a new DataFrame called final_df.
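A minimal sketch of that backfill loop, reusing the hypothetical `BASE_URL` and `HEADERS` from the step 1 sketch; the real notebook may also add retries or rate limiting:

```python
import requests
import pandas as pd

BASE_URL = "https://api.myosh.com/v4/records"   # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}   # hypothetical auth

def backfill_records(missing_ids: list) -> pd.DataFrame:
    records = []
    for record_id in missing_ids:
        resp = requests.get(f"{BASE_URL}/{record_id}", headers=HEADERS)
        if resp.ok:
            records.append(resp.json())  # assumed: one JSON object per record
        # IDs that never existed (or were deleted) return nothing and are skipped
    final_df = pd.DataFrame(records)
    return final_df
```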
4. Merge with Existing Archive
The new records are combined with the existing archive:
- CSV Archive: The current archive file (records_archive.csv) is read and concatenated with final_df to create an updated CSV containing all historical and new records.
- Parquet Archive: The same process is applied to the Parquet file in the Bronze layer, ensuring analytics-ready storage. A sketch of the merge follows this list.
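The merge could look roughly like the following sketch, assuming both archive files are reachable as pandas-readable paths (for example a mounted lake or abfss URIs):

```python
import pandas as pd

def merge_with_archive(final_df: pd.DataFrame,
                       csv_path: str,
                       parquet_path: str):
    # CSV archive: read the current file and append the backfilled records
    updated_csv = pd.concat([pd.read_csv(csv_path), final_df],
                            ignore_index=True)

    # Parquet archive (Bronze layer): same concatenation against the Parquet copy
    updated_parquet = pd.concat([pd.read_parquet(parquet_path), final_df],
                                ignore_index=True)

    return updated_csv, updated_parquet
```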
5. Save Outputs
After merging the new records with the existing archive, the notebook writes the updated dataset back to Azure Data Lake in two formats, CSV and Parquet, across two locations (sketched after this list):
- Import Folder (Rolling Archive)
This is the main archive file (records_archive) that is continuously updated with new records. It acts as a rolling archive, meaning it always contains the most complete and current version of all records collected so far.
- Archive Folder (Historical Snapshots)
A dated copy of the archive is saved here each time the notebook runs. These snapshots provide a point‑in‑time view of the data for auditing and compliance purposes.
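A minimal sketch of this output step; the container root, folder names, file names, and date format are assumptions based on the description above:

```python
from datetime import date
import pandas as pd

def save_outputs(updated_df: pd.DataFrame, lake_root: str) -> None:
    # Import folder: the rolling archive, overwritten with the consolidated data
    updated_df.to_csv(f"{lake_root}/Import/records_archive.csv", index=False)
    updated_df.to_parquet(f"{lake_root}/Import/records_archive.parquet", index=False)

    # Archive folder: a dated, point-in-time snapshot for auditing and compliance
    stamp = date.today().isoformat()
    updated_df.to_csv(f"{lake_root}/Archive/records_archive_{stamp}.csv", index=False)
    updated_df.to_parquet(f"{lake_root}/Archive/records_archive_{stamp}.parquet", index=False)
```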
Why Is This Important?
- Completeness: No missing records, even if the API skips data.
- Analytics-ready: Data is stored in Parquet for efficient querying.