Service Level Agreement (SLA) for DataMart availability

This draft framework outlines an SLA for DataMart availability that balances business needs, technical feasibility, and stakeholder communication.


1. Availability Target

Set a clear uptime percentage that aligns with the criticality of the DataMart to your business processes.

  • Example:
    • DataMarts must maintain 99% availability per month.
    • Define the allowed downtime (e.g., ~7.3 hours/month for 99% uptime, based on an average month of about 730 hours).
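
The allowed-downtime figure follows directly from the target percentage and the length of the month. The following is a minimal Python sketch of that arithmetic, assuming downtime is measured against every hour of the calendar month. The function names are illustrative and not part of any DataMart tooling.

```python
import calendar


def hours_in_month(year: int, month: int) -> int:
    """Total clock hours in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24


def allowed_downtime_hours(target_pct: float, year: int, month: int) -> float:
    """Downtime budget (in hours) that still meets the availability target."""
    return hours_in_month(year, month) * (1 - target_pct / 100)


def availability_pct(downtime_hours: float, year: int, month: int) -> float:
    """Availability actually achieved, given the recorded downtime."""
    return 100 * (1 - downtime_hours / hours_in_month(year, month))
```

For a 30-day month, a 99% target leaves a budget of roughly 7.2 hours, and a 31-day month leaves roughly 7.4 hours, so the budget is worth recomputing per month rather than fixing it at a single value.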

2. Operational Hours

Define the expected operational hours of the DataMart. Specify if it needs to be available 24/7 or only during specific business hours.

  • Example:
    • “The DataMart is expected to be available 24/7, excluding planned maintenance windows.”
    • “Critical periods are 07:00–18:00 (GMT) on weekdays.”

3. Planned Maintenance

Outline the process and expectations for planned downtime.

  • Example:
    • “Planned maintenance must be scheduled at least 7 days in advance, and the maintenance window must not exceed 4 hours per occurrence unless approved.”
    • “Maintenance should occur during low-traffic periods, typically between 19:00 and 00:00 (GMT).”
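
The example rules above (7 days' notice, at most 4 hours unless approved, preferably between 19:00 and 00:00 GMT) can be checked mechanically before a window is confirmed. The sketch below is illustrative only: it assumes naive `datetime` values that are already in GMT, and the function name and `approved` flag are hypothetical.

```python
from datetime import datetime, time, timedelta

MIN_NOTICE = timedelta(days=7)      # example value from the SLA text
MAX_DURATION = timedelta(hours=4)   # example value from the SLA text
LOW_TRAFFIC_START = time(19, 0)     # low-traffic period is 19:00-00:00


def maintenance_issues(announced_at, start, end, approved=False):
    """Return a list of rule violations for a proposed maintenance window."""
    issues = []
    if start - announced_at < MIN_NOTICE:
        issues.append("less than 7 days notice")
    if end - start > MAX_DURATION and not approved:
        issues.append("longer than 4 hours without approval")
    # The window must start at or after 19:00 and end by the following midnight.
    next_midnight = datetime.combine(start.date() + timedelta(days=1), time.min)
    if start.time() < LOW_TRAFFIC_START or end > next_midnight:
        issues.append("outside the 19:00-00:00 low-traffic period")
    return issues
```

An empty list means the proposed window satisfies all three example rules. Because the low-traffic period is phrased as "should" rather than "must", that check may be better treated as a warning than as a hard failure.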

4. Incident Response and Resolution

Define how quickly issues should be addressed and resolved based on their severity.

PRD (Production) Environment

(Aligns with Business Report Availability)

  • Critical Incidents: The DataMart is completely inaccessible, impacting business-critical reports or processes.
    • Response Time: Within 1 hour.
    • Resolution Time: Within 8 hours.
  • High Incidents: Partial functionality issues (e.g., degraded performance or failure of key components) that impact significant reporting processes but have workarounds available.
    • Response Time: Within 4 hours.
    • Resolution Time: Within 2 business days.
  • Low Incidents: Non-critical errors (e.g., minor data discrepancies, cosmetic issues, or low-priority feature failures) that do not impact core business operations.
    • Response Time: Within 2 business days.
    • Resolution Time: Within 5 business days.

DEV (Development) Environment

(Aligns with Agile Sprint Planning)

  • Critical Incidents: The DataMart is inaccessible, blocking development, testing, or deployment of features within the sprint.
    • Response Time: Within 4 hours.
    • Resolution Time: Within 2 business days.
  • High Incidents: Partial functionality issues (e.g., key features are non-functional or workflows are impaired) impacting sprint goals but not blocking progress entirely.
    • Response Time: Within 2 business days.
    • Resolution Time: Within 5 business days.
  • Low Incidents: Non-critical issues or errors (e.g., enhancements, minor bugs, or performance tuning requests) with minimal impact on sprint goals.
    • Response Time: Within 5 business days.
    • Resolution Time: Within 15 business days.
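
The PRD and DEV targets above amount to a lookup table keyed by environment and severity. The sketch below shows one way to turn that table into concrete deadlines for a ticket. It rests on assumptions the SLA text does not state: business days are Monday to Friday with no holiday calendar, and a target of "N business days" falls due at the same clock time N business days after the incident is opened. All names are illustrative.

```python
from datetime import datetime, timedelta

HOURS, BDAYS = "hours", "business_days"

# (environment, severity) -> (response target, resolution target)
SLA_TARGETS = {
    ("PRD", "critical"): ((HOURS, 1), (HOURS, 8)),
    ("PRD", "high"): ((HOURS, 4), (BDAYS, 2)),
    ("PRD", "low"): ((BDAYS, 2), (BDAYS, 5)),
    ("DEV", "critical"): ((HOURS, 4), (BDAYS, 2)),
    ("DEV", "high"): ((BDAYS, 2), (BDAYS, 5)),
    ("DEV", "low"): ((BDAYS, 5), (BDAYS, 15)),
}


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole business days, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def deadline(opened_at: datetime, target) -> datetime:
    """Turn one (unit, amount) target into a concrete timestamp."""
    unit, amount = target
    if unit == HOURS:
        return opened_at + timedelta(hours=amount)
    return add_business_days(opened_at, amount)


def sla_deadlines(environment: str, severity: str, opened_at: datetime):
    """Return (response_deadline, resolution_deadline) for an incident."""
    response, resolution = SLA_TARGETS[(environment, severity)]
    return deadline(opened_at, response), deadline(opened_at, resolution)
```

Under these assumptions, a High incident opened in PRD on a Friday at 10:00 would need a response by 14:00 the same day and a resolution by 10:00 the following Tuesday. Whether hour-based targets should also pause outside business hours is a policy decision the SLA would need to spell out.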

5. Performance Metrics

Set thresholds for DataMart performance to ensure usability.

  • Example:
    • Queries executed directly against the DataMart should complete within 10 seconds for at least 95% of queries.
    • Load/refresh processes should complete within 30 minutes of their scheduled time.
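
Both example thresholds can be evaluated from logged timings. The sketch below is a hedged illustration: it assumes query durations are available as a list of seconds, and it reads "within 30 minutes of their scheduled time" as 30 minutes after the scheduled start, which is one interpretation and should be confirmed with stakeholders. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def meets_query_sla(durations_s, limit_s=10.0, required_fraction=0.95):
    """True if the required share of queries finished within the limit."""
    if not durations_s:
        return True  # assumption: a period with no queries is not a breach
    within = sum(1 for d in durations_s if d <= limit_s)
    return within / len(durations_s) >= required_fraction


def refresh_on_time(scheduled_at: datetime, completed_at: datetime,
                    allowed=timedelta(minutes=30)) -> bool:
    """True if the load/refresh finished within the allowed time."""
    return completed_at - scheduled_at <= allowed
```

Counting the share of queries under the limit checks the "95% of queries" wording directly and avoids choosing between different percentile interpolation methods.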

6. Data Availability

Define how often the DataMart is refreshed and how up-to-date the data must be.

  • Example:
    • “DataMart data must be refreshed every 24 hours by 06:00 (GMT). Any delays exceeding 1 hour must be communicated to stakeholders.”
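
The freshness rule in this example can be expressed as a small status check that a monitoring job might run. This is a sketch under stated assumptions: GMT is treated as UTC, timestamps are timezone-aware, and the status labels and function name are invented for illustration.

```python
from datetime import date, datetime, time, timedelta, timezone

REFRESH_DEADLINE = time(6, 0)       # 06:00 GMT, from the example above
NOTIFY_AFTER = timedelta(hours=1)   # delays beyond this must be communicated


def refresh_status(day, completed_at, now):
    """Classify the daily refresh for `day`.

    `completed_at` is None while the refresh is still running.
    """
    deadline = datetime.combine(day, REFRESH_DEADLINE, tzinfo=timezone.utc)
    reference = completed_at if completed_at is not None else now
    if reference <= deadline:
        return "on_time" if completed_at is not None else "pending"
    if reference - deadline > NOTIFY_AFTER:
        return "notify_stakeholders"
    return "late"


# Example: the refresh is still running at 07:30 GMT on 10 March 2025.
example = refresh_status(
    date(2025, 3, 10),
    completed_at=None,
    now=datetime(2025, 3, 10, 7, 30, tzinfo=timezone.utc),
)
```

In this example the refresh is 90 minutes past the deadline, so `example` is `"notify_stakeholders"`.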

7. Communication and Escalation

Establish protocols for informing stakeholders about issues or changes.

  • Example:
    • “Notification of unplanned downtime must be sent to stakeholders via email or Teams within 1 hour of detection.”
    • “Weekly performance reports will include uptime metrics, incident summaries, and upcoming maintenance schedules.”

8. Service Corrective Actions

Consider corrective actions for unmet SLAs to ensure accountability.

  • Example:
    • “Repeated SLA breaches over 3 consecutive months may trigger a review or additional corrective actions.”
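
A trigger such as "3 consecutive months" is easy to misjudge by eye, so it can help to compute it from the monthly availability figures. The sketch below assumes a breach means a month below the availability target, and that the figures are supplied in chronological order. The function name and defaults are illustrative.

```python
def needs_review(monthly_availability, target_pct=99.0, consecutive=3):
    """True if the target was missed in `consecutive` months in a row."""
    streak = 0
    for pct in monthly_availability:
        # A month that meets the target resets the streak.
        streak = streak + 1 if pct < target_pct else 0
        if streak >= consecutive:
            return True
    return False
```

For instance, availabilities of 99.5, 98.7, 98.9, and 98.0 would trigger a review, while 98.0, 99.2, 98.5, and 98.9 would not, because the second month resets the count.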

9. Exclusions

Specify scenarios where the SLA does not apply.

  • Examples:
    • Issues caused by third-party services (e.g., Azure outages).
    • User errors, or queries that exceed system capacity and were not discussed in advance.
    • Force majeure events (natural disasters, widespread power outages, etc.).

10. Review and Improvement

Include a provision for periodic SLA reviews to adapt to changing business requirements.

  • Example:
    • “SLAs will be reviewed quarterly to ensure they remain aligned with business needs and technical capabilities.”
