Data Warehouse Architecture

Julian.Kellett

Purpose: Begin with the high-level purpose of the data warehouse, explaining how it integrates data from various sources into a centralised repository for analytics, reporting, and business intelligence.
Layers: Briefly outline any layered structure (e.g., Bronze, Silver, Gold, Platinum) and explain how each layer contributes to data quality and processing efficiency.

Storage Accounts and Containers: Describe how data is stored within Azure Storage Accounts, Containers, and any subfolder hierarchy.
Data Organisation: Outline how each storage layer (Bronze, Silver, Gold, Platinum) maps to specific containers or folders and how this mapping supports data governance.
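
Example (illustrative only): one way to document the layer-to-container mapping is a small lookup that builds the ABFS URI for a folder in each layer. The sketch below assumes one container per layer; the container names, the storage account name used in the usage note, and the helper layer_uri are placeholders rather than the actual estate.

```python
# Hypothetical mapping of warehouse layers to containers (placeholder names).
LAYER_CONTAINERS = {
    "bronze": "bronze",
    "silver": "silver",
    "gold": "gold",
    "platinum": "platinum",
}


def layer_uri(account: str, layer: str, *folders: str) -> str:
    """Build the ABFS URI of a folder inside the container backing a layer."""
    container = LAYER_CONTAINERS[layer.lower()]  # KeyError for an unknown layer
    # Drop empty segments and stray slashes so the path is always well formed.
    path = "/".join(part.strip("/") for part in folders if part.strip("/"))
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
```

For instance, layer_uri("examplestore", "Bronze", "sales", "orders") yields abfss://bronze@examplestore.dfs.core.windows.net/sales/orders. Keeping the mapping in one place makes it easier to review which container holds which layer.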

ETL/ELT Processes: Detail the process for Extract, Transform, and Load (ETL) or Extract, Load, and Transform (ELT) used to move data from source systems to the data warehouse.
Data Flow: Describe the data flow, from raw data ingestion in the Bronze layer through transformations in Silver and Gold layers to high-level analytical datasets.
Automation: Mention any tools or services like Azure Data Factory (ADF), Azure Synapse, or Databricks for managing these workflows, including triggers and schedules.
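
Example (illustrative only): the Bronze to Silver to Gold flow can be shown with a toy in-memory pipeline. This is a minimal sketch using plain Python lists rather than ADF, Synapse, or Databricks; the column names (order_id, region, amount) and the functions to_silver and to_gold are assumptions made for illustration.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Clean raw Bronze rows: drop incomplete rows, deduplicate, cast types."""
    seen = set()
    silver = []
    for row in bronze_rows:
        order_id = row.get("order_id")
        amount = row.get("amount")
        if order_id is None or amount in (None, ""):
            continue  # incomplete record, left behind in Bronze
        if order_id in seen:
            continue  # duplicate, keep the first occurrence only
        seen.add(order_id)
        silver.append(
            {
                "order_id": order_id,
                "region": (row.get("region") or "unknown").strip().lower(),
                "amount": float(amount),
            }
        )
    return silver


def to_gold(silver_rows):
    """Aggregate cleaned Silver rows into a Gold summary: total amount per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)
```

In a real deployment each function would correspond to a scheduled or triggered pipeline activity or notebook, reading from one layer's container and writing to the next.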

Storage Layer Access Control: Detail the access structure, focusing on the hierarchy of Storage Accounts, Containers, and Folders, and how each layer is assigned specific user groups based on data access needs.
Role-Based Access Control (RBAC): Explain RBAC usage, such as granting the Storage Blob Data Reader and Storage Blob Data Contributor roles at the appropriate scope (storage account or container).
Access Control Lists (ACLs): If ACLs are used at the folder level, discuss how they are applied to restrict access, ensuring that users only see authorised folders within a container.
Data Security: Include encryption at rest, any integration with Active Directory (for example Microsoft Entra ID, formerly Azure Active Directory), and conditional access policies.
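
Example (illustrative only): folder-level ACLs in a hierarchical-namespace storage account follow a POSIX-style model, where a group needs execute (--x) on every parent folder to reach a target folder and read plus execute (r-x) on the target to list and read it. The sketch below only computes which ACL entry each folder would need; it does not call any Azure SDK or CLI, and the function name folder_acl_plan and the group object ID in the usage note are placeholders.

```python
from typing import Dict


def folder_acl_plan(
    folder_path: str, group_object_id: str, permissions: str = "r-x"
) -> Dict[str, str]:
    """Return {folder: ACL entry} granting a group access to a single folder.

    The container root and every parent folder get execute-only, so the group
    can traverse to the target without being able to list sibling folders.
    """
    parts = [p for p in folder_path.strip("/").split("/") if p]
    if not parts:
        raise ValueError("folder_path must name at least one folder")

    traverse = f"group:{group_object_id}:--x"
    plan = {"/": traverse}  # container root
    current = ""
    for part in parts[:-1]:
        current += "/" + part
        plan[current] = traverse
    plan["/" + "/".join(parts)] = f"group:{group_object_id}:{permissions}"
    return plan
```

Calling folder_acl_plan("finance/payroll/2024", "<group-object-id>") returns execute-only entries for /, /finance and /finance/payroll, and r-x for /finance/payroll/2024. Note that a data-plane RBAC role assigned at container or account scope is evaluated before ACLs, so folder-level restrictions like these only take effect for groups that do not already hold such a role. Default ACLs for newly created child items are outside the scope of this sketch.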

Compute Resources: Describe the compute resources involved (e.g., Azure Synapse, Databricks clusters) and how they support parallel processing and dynamic resource allocation.
Load Balancing: If load balancing is used to distribute incoming requests across resources, explain how it supports efficient processing and user queries.
Auto-Scaling: Mention any auto-scaling configurations that adjust compute resources based on demand, optimising cost and performance.

Partitioning: Describe any partitioning strategy (e.g., date-based partitions) used to optimise query performance.
Indexing and Caching: Discuss indexing on frequently accessed tables and caching or precomputation strategies (such as materialised views) to enhance read performance.
Data Pruning: Include any mechanisms for archiving or pruning historical data to maintain efficiency.
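
Example (illustrative only): date-based partitioning and retention-based pruning can be sketched together. The code below assumes Hive-style folder names (year=/month=/day=) and a simple "older than N days" retention rule; the table name and the retention period in the usage note are placeholders, and the actual layout and policy may differ.

```python
from datetime import date, timedelta


def partition_path(table: str, day: date) -> str:
    """Folder path of the daily partition holding `day` for a table."""
    return f"{table}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"


def partitions_in_range(table: str, start: date, end: date) -> list:
    """Partitions a query must scan for an inclusive date range.

    Reading only these folders, instead of the whole table, is what
    partition pruning buys at query time.
    """
    days = (end - start).days
    return [partition_path(table, start + timedelta(days=i)) for i in range(days + 1)]


def partitions_to_archive(partition_days, today: date, retention_days: int) -> list:
    """Partition dates strictly older than the retention cutoff, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_days if d < cutoff)
```

For example, partition_path("sales", date(2024, 3, 5)) gives sales/year=2024/month=03/day=05, and a query for 28 February to 1 March 2024 touches three partitions. The archive helper only selects candidates; moving them to a cooler tier or deleting them is a separate, deliberate step.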

Tracking Lineage: Outline tools or processes for tracking data lineage to help understand the data’s journey through transformations and loading.
Metadata Repository: Describe where metadata (like column definitions and transformations) is stored and how it’s accessed by users.
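
Example (illustrative only): at its simplest, lineage is a graph from each dataset to the datasets it is derived from. The sketch below assumes a hand-maintained dictionary of such edges and walks it breadth-first; the dataset names and the upstream helper are hypothetical, and a dedicated catalogue or lineage tool would normally hold this information.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets it is derived from.
LINEAGE = {
    "gold.sales_by_region": ["silver.orders", "silver.regions"],
    "silver.orders": ["bronze.orders_raw"],
    "silver.regions": ["bronze.regions_raw"],
}


def upstream(dataset: str, lineage=LINEAGE) -> list:
    """Return every dataset feeding into `dataset`, nearest sources first."""
    seen = set()
    order = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited, also guards against cycles
        seen.add(node)
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order
```

Here upstream("gold.sales_by_region") lists the two Silver tables first, followed by the Bronze tables they are built from, which is the kind of answer users typically want when tracing where a reported figure came from.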

Access Points: Describe the tools or applications (e.g., Power BI, Azure Data Explorer) that users access to query data.
Query Optimisation: Outline query optimisation strategies and how users are supported with performant access to the data warehouse.