Overview:
The Bronze Layer is the initial raw data storage tier in the data warehouse, representing the true start of the data lake. It is designed to store data in its raw, untransformed state, yet in an organised and accessible manner. This layer ensures that the data is preserved in its original form, allowing for traceability and reprocessing if needed.
Key Characteristics:
- Structure and Organization:
- Sub-layers: The Bronze Layer is organised into three sub-layers: Import, Archive, and Processed.
- Import: This sub-layer is where the data lands after being prepared in the Import Layer. Data is stored in Parquet format, organised by source and date (see the storage sketch after this list).
- Archive: The Archive sub-layer holds historical data, maintaining an immutable record of all data as it was received. This helps in audits and allows for backtracking if issues arise in later stages.
- Processed: This sub-layer contains data that has undergone initial processing and transformation, ready to be moved to the Silver layer.
- Data Handling:
- Parquet Format: Data in the Bronze Layer is stored in Parquet format, providing efficient storage and fast read/write access. Parquet’s columnar storage is particularly well suited to big data analytics.
- Metadata Management: Alongside the raw data, metadata is maintained to describe the schema, source, and any transformations applied. This metadata is crucial for data governance and lineage tracking.
- Data Processing:
- Initial Transformation: Data in the Bronze Layer undergoes minimal transformation, primarily standardizing formats, cleaning obvious errors, and enriching data with additional metadata. This ensures the data is consistent and easier to work with in subsequent layers.
- Preparation for Silver Layer: Once the data is adequately processed in the Bronze Layer, it is submitted to the Staging SQL database, which forms the Silver Layer. This involves ensuring the data is in a ready-to-query state, optimized for performance.
- Integration with Data Catalogue (myBMT):
- Data Cataloguing: The data catalogue, myBMT, continues to play a vital role in the Bronze Layer by keeping track of data lineage, schema, and processing history. This ensures a clear understanding of the data’s journey from its raw state to its more refined forms.
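The sketch below illustrates how the Import sub-layer and its accompanying metadata described above might look in practice: a dataset is written to Parquet, partitioned by source and load date, with a sidecar metadata record capturing schema, source, and load time. The paths, naming conventions, and the pandas/pyarrow tooling are illustrative assumptions rather than the project's actual conventions, and the myBMT integration itself is not shown.

```python
# Minimal sketch of landing data in the Import sub-layer (assumptions: pandas
# and pyarrow are available; paths and naming conventions are illustrative).
import json
import os
from datetime import date, datetime, timezone

import pandas as pd

BRONZE_IMPORT = "bronze/import"  # hypothetical root of the Import sub-layer


def land_in_import(df: pd.DataFrame, source: str) -> str:
    """Write raw data to Parquet, partitioned by source and load date,
    and keep a sidecar metadata record for lineage and cataloguing."""
    load_date = date.today().isoformat()
    target_dir = os.path.join(BRONZE_IMPORT, f"source={source}", f"date={load_date}")
    os.makedirs(target_dir, exist_ok=True)

    # Columnar, compressed storage keeps raw files compact and fast to scan.
    parquet_path = os.path.join(target_dir, "data.parquet")
    df.to_parquet(parquet_path, engine="pyarrow", compression="snappy", index=False)

    # Sidecar metadata describing schema, source, and load time.
    metadata = {
        "source": source,
        "load_date": load_date,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(df),
        "schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }
    with open(os.path.join(target_dir, "_metadata.json"), "w") as fh:
        json.dump(metadata, fh, indent=2)

    return parquet_path


# Example usage with a toy dataset and a hypothetical "crm" source:
# land_in_import(pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}), source="crm")
```

A layout along the lines of source=&lt;name&gt;/date=&lt;YYYY-MM-DD&gt; also makes it straightforward for the Archive sub-layer to keep immutable copies of each day's landing and for downstream jobs to pick up only new partitions.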
Workflow Summary:
- Data Arrival: Data lands in the Import sub-layer from the Import Layer.
- Archival: Data is archived to maintain historical integrity.
- Initial Processing: Data undergoes minimal processing and standardisation.
- Preparation for Silver Layer: Data is prepared and submitted to the Staging SQL database (Silver Layer), as sketched below.
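A minimal sketch of this workflow, assuming pandas and SQLAlchemy and using purely illustrative paths, table names, and connection strings, might look like the following; the real pipeline's orchestration, validation, and myBMT updates are not shown.

```python
# Minimal sketch of the Bronze workflow: archive, minimal standardisation,
# then submission to the Staging SQL database (Silver Layer). All names,
# paths, and the connection string are illustrative assumptions.
import shutil

import pandas as pd
from sqlalchemy import create_engine


def bronze_to_silver(import_path: str, archive_path: str, staging_url: str) -> None:
    # 1. Archival: keep an immutable copy of the file exactly as received.
    shutil.copy2(import_path, archive_path)

    # 2. Initial processing: minimal, non-destructive standardisation only.
    df = pd.read_parquet(import_path)
    df.columns = [c.strip().lower() for c in df.columns]  # standardise column names
    df = df.drop_duplicates()                             # clean obvious errors

    # 3. Preparation for Silver: load into the Staging SQL database.
    engine = create_engine(staging_url)
    df.to_sql("stg_bronze_extract", engine, if_exists="append", index=False)


# Example invocation with hypothetical locations:
# bronze_to_silver(
#     "bronze/import/source=crm/date=2024-01-01/data.parquet",
#     "bronze/archive/crm_2024-01-01.parquet",
#     "mssql+pyodbc://staging_dsn",
# )
```

In a fuller implementation the standardised frame would first be written to the Processed sub-layer, and each step would update the metadata and catalogue entries described earlier.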
Importance:
- Raw Data Preservation: The Bronze Layer ensures that the original raw data is preserved, allowing for complete traceability and reprocessing if needed.
- Foundation for Data Lake: By organising and minimally processing the data, the Bronze Layer provides a strong foundation for further data refinement and analysis in the Silver and Gold layers.
RAID:
Risks:
- Data Integrity: Errors in converting data to Parquet format can compromise data integrity (see the verification sketch after this list).
- System Performance: The process of transforming and loading data can strain system resources and affect performance.
- Access Control: Unauthorized access to, or modification of, raw data would compromise its integrity; access must be restricted to authorized users.
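As one illustration of guarding against the data-integrity risk noted above, a check like the following could compare the written Parquet output against its source before the file is accepted; this is an assumed approach, not an existing project utility.

```python
# Minimal sketch of a post-conversion integrity check (assumed approach).
import pandas as pd


def verify_parquet(source_df: pd.DataFrame, parquet_path: str) -> None:
    """Fail loudly if the Parquet output does not match its source."""
    written = pd.read_parquet(parquet_path)
    if len(written) != len(source_df):
        raise ValueError(f"Row count mismatch: {len(written)} written vs {len(source_df)} source")
    if list(written.columns) != list(source_df.columns):
        raise ValueError("Column mismatch between source and Parquet output")
```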
Issues:
- Processing Failures: Failures in the ETL process can lead to incomplete or incorrect data being moved to the next layer.
- Storage Management: Efficiently managing storage to avoid excessive costs and performance issues is critical.
- Data Catalog Maintenance: Keeping the data catalog up-to-date with new data sources and changes can be labor-intensive.
Dependencies:
- ETL Tools and Frameworks: Dependence on robust ETL tools and frameworks for data transformation.
- Data Storage Solutions: Reliable and scalable storage solutions for storing raw data.
- Data Integration: Effective integration with various data sources.
Assumptions:
- Data Consistency: Assumption that data remains consistent during the ETL process.
- Data Quality: Assumption that incoming data is of acceptable quality, requiring minimal cleansing.
- ETL Process Efficiency: Assumption that ETL processes are optimized for performance.
Opportunities:
- Efficient Data Transformation: Optimizing ETL processes to reduce transformation time and resource usage.
- Data Compression: Leveraging data compression techniques to reduce storage costs (see the codec sketch after this list).
- Metadata Management: Enhancing metadata management to improve data discoverability and governance.
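To make the compression opportunity concrete, the sketch below writes the same illustrative dataset with a few Parquet codecs and prints the resulting file sizes; the codecs and sample data are assumptions, and the best choice depends on the actual data and query patterns.

```python
# Minimal sketch comparing Parquet compression codecs (illustrative only).
import os

import pandas as pd

df = pd.DataFrame({"id": range(100_000), "value": ["sample text"] * 100_000})

for codec in ("snappy", "gzip", "zstd"):
    path = f"sample_{codec}.parquet"
    df.to_parquet(path, engine="pyarrow", compression=codec, index=False)
    print(f"{codec}: {os.path.getsize(path):,} bytes")
```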
Mitigations:
- Error Handling: Implementing comprehensive error handling and logging mechanisms in ETL processes (see the sketch after this list).
- Storage Optimization: Regularly monitoring and optimizing storage usage to prevent excessive costs.
- Automated Data Cataloging: Using automated tools to maintain an up-to-date data catalog.
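As a sketch of the error-handling mitigation, a wrapper such as the one below could log each ETL step and re-raise failures so that incomplete or incorrect data is never promoted to the next layer; the function names and logging configuration are illustrative assumptions.

```python
# Minimal sketch of ETL step error handling and logging (names and logging
# configuration are illustrative assumptions).
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("bronze_etl")


def run_step(step_name, step_fn, *args, **kwargs):
    """Run one ETL step; log the outcome and re-raise on failure so the
    pipeline stops instead of promoting incomplete data."""
    try:
        result = step_fn(*args, **kwargs)
        logger.info("Step %s completed", step_name)
        return result
    except Exception:
        logger.exception("Step %s failed", step_name)
        raise


# Example usage with the workflow sketch above (hypothetical):
# run_step("bronze_to_silver", bronze_to_silver, import_path, archive_path, staging_url)
```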