Util.Identity_PersonID

Overview

util.identity_personid provides a resolved and reusable identity mapping between source-system person records and a canonical PERSON_UID.

It supports the Person domain by combining:

  • person identifiers
  • email-based linkage
  • payroll references
  • survivorship logic

into a single trusted identity layer.

This process reduces duplication, improves cross-system linking, and provides a stable foundation for:

  • person.core_details
  • related Person domain views

Purpose

  • Establish a canonical person key (PERSON_UID)
  • Resolve multiple source records representing the same individual
  • Link person records using hashed email identity
  • Apply survivorship rules to select the most appropriate record
  • Provide a reusable identity bridge for the Person domain and future identity-led reporting

How It Works

1. Extract Source Email Identities

Source:

  • ifs.QDIM_FND_USER_PROPERTY_OL
  • Filter: SMTP_MAIL_ADDRESS

Processing:

  • Rename to person_email
  • Convert to lowercase
  • Generate email_hash:
    • MD5(lower(email))
    • first 8 characters
    • uppercased

Outcome:

  • A compact, repeatable identity link key

2. Enrich with Person Details

Join to:

  • ifs.QDIM_PERSON_INFO_OL
    • PERSON_ID, NAME
  • ifs.QDIM_COMPANY_PERSON_ALL_CF_OL
    • EMPLOYEE_STATUS
    • PAYROLL_ID

Outcome:

  • Combined identity + operational context for ranking and selection

3. Clean and Standardise

Rules applied:

  • PAYROLL_ID = 'n/a' → null
  • Email normalised to lowercase
  • Data types standardised
  • Null / empty values filtered

Outcome:

  • Consistent and predictable identity inputs across source variations

4. Apply Survivorship Ranking

A priority score (sortStatus) is calculated to resolve duplicates.

Current precedence:

  1. Active employees
  2. Records with PAYROLL_ID

Outcome:

  • Deterministic selection of the most operationally useful record

5. Resolve to Canonical PERSON_UID

Join to:

  • identity_email_proxy (derived from identity_bridge)

Logic:

  • If bridge match exists → use mapped PERSON_UID
  • Else → fallback to email_hash

Outcome:

  • Supports both:
    • bridge-led canonical identity
    • safe fallback identity

6. Deduplicate to One Row per Person

Final step:

  • Retain one row per PERSON_UID
  • Apply survivorship ranking
  • Remove technical columns:
    • sortStatus, IDENTITY, ID, CODE, pipeline_log

Output:

  • PERSON_UID
  • PERSON_ID
  • NAME
  • PAYROLL_ID
  • email_hash
  • person_email

Design Principles

Canonical Where Possible, Fallback Where Necessary

  • Use bridge-defined PERSON_UID where available
  • Fallback to email_hash ensures continuity

Email as the Linking Mechanism

  • Email is the most consistent cross-system identifier
  • Hashing enables secure, repeatable linkage

Deterministic Survivorship

  • Selection is rule-based, not arbitrary
  • Ensures consistent outputs across refresh cycles

Privacy-Aware Linkage

  • Uses hashed email as primary key
  • Limits exposure of raw personal data

Separated Identity Architecture

  • identity_bridge → resolves identity relationships
  • util.identity_personid → materialises identity
  • person.core_details → consumes identity

What This Enables

  • Stable joins using PERSON_UID
  • Consistent person-level reporting across systems
  • Controlled linkage between Person and Employee domains
  • Foundation for:
    • person.core_details
    • person.meta_* views
  • Future Entra / proxy identity integration

Reduces risk of:

  • duplicate person records
  • inconsistent joins
  • ad hoc identity logic in reports

Known Constraints & Design Notes

1. Bridge Dependency

  • Relies on src_identity_bridge for canonical mapping
  • Falls back to email_hash where mapping is unavailable

2. Email-Led Logic

Depends on:

  • valid SMTP email population
  • stable email ownership
  • reliable system linkage

3. Survivorship is Contextual

Current prioritisation:

  • Active status
  • Payroll presence

This reflects operational preference, not universal identity truth.


4. Hashing Implementation Constraint

  • Microsoft Dataflows (Power Query M) does not provide a native hashing function
  • A deterministic hash is required to:
    • create a stable email_hash
    • support cross-system identity linkage
    • avoid reliance on raw email values as join keys

Current approach:

  • Uses an external MD5 hashing service (HashToMD5)
  • Applies:
    • lowercase normalisation
    • MD5 hash
    • first 8 characters (uppercased)

Outcome:

  • Produces a compact, repeatable, platform-independent identity key

Design position:

  • This is an intentional workaround for platform limitations, not an ad hoc solution
  • Ensures consistent hashing behaviour across all Dataflow executions

Future consideration:

  • If/when native hashing becomes available in Fabric / Dataflows, this implementation can be internalised without changing the model or downstream dependencies

Summary

util.identity_personid provides the canonical identity resolution layer for the Person domain.

It transforms multiple source-system representations into a single, stable PERSON_UID, enabling consistent, secure, and scalable person-level modelling across the Data Warehouse.

Leave a Comment