Overview:
The Import Layer serves as the initial landing zone for all incoming data into the data warehouse. This layer is designed to handle and store data in a variety of formats and from different sources, ensuring a flexible and robust intake process. It acts as a staging area where data can be validated, archived, and prepared for subsequent processing into the Bronze layer.
Key Characteristics:
- Data Submission:
  - Flexible Formats: Data owners can submit their data in multiple formats, including CSV, JSON, XML, and other common file types. This flexibility ensures that data from various sources can be ingested without the need for pre-conversion.
  - API Responses: This layer also stores responses from API calls, capturing real-time or scheduled data fetches from external systems or services.
- Data Archival:
  - Date-based Organization: Once data arrives in the import folder, it is archived into directories named according to the date of arrival (e.g., YYYY-MM-DD). This organizational structure helps maintain an audit trail and facilitates easy retrieval of historical data.
  - Storage Management: The archived data is managed to ensure data integrity and availability, supporting both short-term processing needs and long-term storage requirements.
- Data Catalog Integration:
  - Identification and Metadata: Files are identified and cataloged based on metadata defined in the data catalog, known as myBMT. This includes details such as data source, format, schema, and any specific processing instructions.
  - Data Catalog (myBMT): The data catalog maintains an inventory of all incoming data, ensuring that each dataset is processed according to predefined rules and standards.
- Processing and Transformation:
  - Parquet Conversion: Data identified in the myBMT catalog is transformed into Parquet format. Parquet is a columnar storage file format optimized for big data processing, providing efficient data compression and encoding schemes (a minimal sketch of this step follows this list).
  - Preparation for Bronze Layer: Once converted to Parquet, the data is sent to the Bronze layer, the raw data storage tier of the data warehouse, where it is kept in its raw form but stored in a structured, optimized manner for further processing and analysis.
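The catalog-driven conversion described above can be illustrated with a short Python sketch. The in-memory MYBMT_CATALOG dictionary, its fields, and the Bronze-layer path layout are illustrative assumptions only; the real myBMT lookup and folder conventions are not specified here. pandas with a Parquet engine such as pyarrow is assumed to be available.

```python
from pathlib import Path
import pandas as pd

# Hypothetical in-memory stand-in for a myBMT catalog lookup; the real
# catalog, its fields, and its API are not described in this document.
MYBMT_CATALOG = {
    "customers.csv": {"format": "csv", "bronze_table": "customers"},
    "orders.json":   {"format": "json", "bronze_table": "orders"},
}

def process_import_file(file_path: Path, bronze_root: Path) -> Path | None:
    """Convert a cataloged import file to Parquet and place it under the Bronze root."""
    entry = MYBMT_CATALOG.get(file_path.name)
    if entry is None:
        # Files not identified in the catalog are left in place for review.
        return None

    # Parse according to the format recorded in the catalog entry.
    if entry["format"] == "csv":
        df = pd.read_csv(file_path)
    elif entry["format"] == "json":
        df = pd.read_json(file_path)
    else:
        raise ValueError(f"Unsupported format: {entry['format']}")

    # Write a columnar Parquet file into the Bronze layer; a real pipeline
    # would also add partitioning, schema checks, and naming conventions.
    target = bronze_root / entry["bronze_table"] / f"{file_path.stem}.parquet"
    target.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(target, index=False)
    return target
```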
Workflow Summary:
- Data Submission: Data owners submit their data in various formats.
- Archival: The data is archived in date-specific folders.
- Identification: Files are identified and cataloged using myBMT.
- Processing: Identified data is converted to Parquet format.
- Transfer: The processed data is moved to the Bronze layer.
By organizing the import layer in this manner, the data warehouse provides a consistent, efficient, and scalable process for handling incoming data, enabling better data management and more effective downstream processing. A sketch of the end-to-end flow follows.
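The sketch below shows how the workflow steps might chain together: newly arrived files are moved from the import folder into a date-named archive directory and then handed to the hypothetical process_import_file function from the earlier sketch. The directory names and the single-pass loop are assumptions for illustration, not a description of the production scheduler.

```python
import shutil
from datetime import date
from pathlib import Path

def archive_and_process(import_dir: Path, archive_root: Path, bronze_root: Path) -> None:
    """Move newly arrived files into a YYYY-MM-DD archive folder, then process cataloged ones."""
    # Archive folder named after the arrival date, e.g. archive/2024-05-17/.
    day_dir = archive_root / date.today().isoformat()
    day_dir.mkdir(parents=True, exist_ok=True)

    for incoming in sorted(import_dir.iterdir()):
        if not incoming.is_file():
            continue
        archived = day_dir / incoming.name
        shutil.move(str(incoming), str(archived))     # Archival step
        process_import_file(archived, bronze_root)    # Identification, processing, transfer
```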
RAID (Risks, Assumptions, Issues, Dependencies):
Risks:
- Data Quality: Inconsistent or poor-quality data submissions can lead to data quality issues downstream.
- Data Security: Sensitive data may be exposed if proper access controls and encryption are not implemented.
- Data Volume: Large volumes of data can overwhelm storage and processing capabilities.
Issues:
- Format Variability: Handling multiple data formats (CSV, JSON, etc.) requires robust parsing and validation logic.
- Data Duplication: Lack of proper deduplication mechanisms can lead to redundant data storage (see the hashing sketch after this list).
- Archiving Challenges: Properly archiving and organizing incoming data by date can be complex and error-prone.
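One common way to address the duplication issue is content hashing. The sketch below illustrates that technique rather than an existing mechanism in this warehouse; a real implementation would persist the set of seen hashes (for example alongside the myBMT metadata) instead of keeping it in memory.

```python
import hashlib
from pathlib import Path

def file_fingerprint(path: Path) -> str:
    """Content hash used to detect re-submissions of the same file."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_duplicate(path: Path, seen_hashes: set[str]) -> bool:
    """Return True if an identical file has already been ingested."""
    fingerprint = file_fingerprint(path)
    if fingerprint in seen_hashes:
        return True
    seen_hashes.add(fingerprint)
    return False
```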
Dependencies:
- Data Sources Availability: The availability and accessibility of external and internal data sources.
- Network Connectivity: Stable and reliable network connections for data transfer.
- Data Source Formats: Consistency in data source formats and structure for seamless ingestion.
Assumptions:
- Data Source Reliability: Assumption that data sources provide accurate and timely data.
- Data Access Permissions: Necessary permissions are granted for accessing various data sources.
- Data Volume Management: Assumption that the infrastructure can handle the incoming data volume.
- myBMT Catalog Consistency: Assumption that the myBMT catalog provides consistent and up-to-date metadata for accurate data ingestion.
Opportunities:
- Efficient Data Transformation: Using optimized ETL processes to improve data transformation efficiency.
- Data Compression: Leveraging data compression techniques to reduce storage costs (see the compression sketch after this list).
- Metadata Management: Enhancing metadata management to improve data discoverability and governance.
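As an illustration of the compression opportunity, Parquet supports per-column compression codecs. The sketch below assumes pandas with the pyarrow engine; the zstd codec is an example choice, not a recommendation from this document.

```python
import pandas as pd

def write_compressed_parquet(df: pd.DataFrame, target: str) -> None:
    """Write a DataFrame as Parquet with an explicit compression codec."""
    # Parquet compresses per column chunk; zstd typically trades a little CPU
    # for noticeably smaller files than the default snappy codec.
    df.to_parquet(target, compression="zstd", index=False)
```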
Mitigations:
- Error Handling: Implementing comprehensive error handling and logging mechanisms in ETL processes (see the logging sketch after this list).
- Storage Optimization: Regularly monitoring and optimizing storage usage to prevent excessive costs.
- Automated Data Cataloging: Using automated tools to maintain an up-to-date data catalog.
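A minimal sketch of the error-handling mitigation, assuming the hypothetical process_import_file function from the earlier sketch and Python's standard logging module; the quarantine behaviour mentioned in the comment is an assumption, not an existing feature.

```python
import logging
from pathlib import Path

logger = logging.getLogger("import_layer")

def safe_process(file_path: Path, bronze_root: Path) -> bool:
    """Process one file, logging failures instead of aborting the whole batch."""
    try:
        process_import_file(file_path, bronze_root)   # from the earlier sketch
        logger.info("Processed %s", file_path.name)
        return True
    except Exception:
        # Log and continue so one bad file does not block the rest of the batch;
        # failed files could later be retried or routed to a quarantine folder.
        logger.exception("Failed to process %s", file_path.name)
        return False
```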