Release 70

View: project_forecast_202409

Enhancements:

  • Inclusion of the following fields from ifs.DIM_PROJ_FORECAST_OL:
    • Forecast.FORECAST_VERSION_ID
    • Forecast.VERSION_DESCRIPTION
    • Forecast.FORECAST_STATUS

Spark-SQL Compatibility and Safe-Code Updates

Changes Implemented

To improve compatibility with Spark SQL and prepare views for future transition into Fabric or Lakehouse environments, the following adjustments have been made:

  • Square Brackets Removed
    Removed the use of [square_brackets] from object references. Spark SQL does not support square brackets and instead uses backticks (`) for quoting, though best practice is to avoid reserved keywords entirely where possible.
  • Dollar Signs Removed from Table References (Bronze)
    Applied a consistent naming convention across all Bronze-layer objects to ensure compatibility with Spark-based engines and allow SQL parsing to succeed in Fabric or Delta Lake. Replaced $ suffixes in legacy table names with a standardised Q prefix to maintain source traceability while avoiding Spark SQL syntax errors. For example:
    • DIM_PROJECT_OL$ → QDIM_PROJECT_OL
    • FACT_CUSTOMER_ORDER_LINE_OL$ → QFACT_CUSTOMER_ORDER_LINE_OL
    • FACT_BUSINESS_OPP_OL$ → QFACT_BUSINESS_OPP_OL
    • DIM_PROC_TEMPLATE_STAGE_OL$ → QDIM_PROC_TEMPLATE_STAGE_OL
    • Removed $ suffixes from ALL other QFACT and QDIM tables
  • ISNULL() Replaced with COALESCE()
    Standardised all null-handling logic to use COALESCE(), which is ANSI-compliant and supported by both SQL Server and Spark SQL.
  • Quote-Safe Descriptions Implemented
    Introduced REPLACE(text, "'", "^") logic within the SQL code generator to escape single quotes within string literals, particularly in DESCRIPTION fields, avoiding SQL injection risks and parsing errors. (A combined before/after sketch of these conventions follows this list.)
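
A minimal before/after sketch of the conventions above; the table and column names are illustrative rather than actual view definitions:

-- Before (SQL Server only): square brackets, $-suffixed table name, ISNULL()
SELECT ISNULL([DESCRIPTION], 'N/A') AS [PROJECT_DESCRIPTION]
FROM DIM_PROJECT_OL$

-- After (Spark-safe): backticks where quoting is needed, Q-prefixed table, COALESCE()
SELECT COALESCE(`DESCRIPTION`, 'N/A') AS PROJECT_DESCRIPTION
FROM QDIM_PROJECT_OL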

Outstanding Compatibility Issues


1. IIF() in object CODE should be CASE

  • Issue: IIF() is a SQL Server–specific shorthand for simple conditional logic. Spark SQL (and ANSI SQL) does not support it.
  • Insight: Convert IIF(condition, true_value, false_value) to a CASE WHEN block.
  • Suggested Action: As part of your metadata-driven SQL generator, introduce a conversion step:
IIF(A = B, 1, 0) → CASE WHEN A = B THEN 1 ELSE 0 END
  • Risk: Negligible, assuming generator logic can apply consistent rewriting.
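
A hedged before/after sketch of that conversion; the table name assumes the Q-prefix convention above and the status value is illustrative:

-- SQL Server shorthand (fails to parse in Spark SQL)
SELECT IIF(FORECAST_STATUS = 'Approved', 1, 0) AS IS_APPROVED
FROM QDIM_PROJ_FORECAST_OL

-- Portable CASE WHEN equivalent
SELECT CASE WHEN FORECAST_STATUS = 'Approved' THEN 1 ELSE 0 END AS IS_APPROVED
FROM QDIM_PROJ_FORECAST_OL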

2. Square brackets on reserved words (e.g. [identifier])

  • Issue: Square brackets ([]) are not valid in Spark SQL. Spark uses backticks (`) for escaping reserved keywords.
  • Insight: Reserved words such as identifier, timestamp, group, etc., should be escaped using backticks if necessary.
  • Suggested Action: Replace [identifier] with `identifier`, or better, alias to avoid escaping.
  • Note: Consider adding a reserved-word dictionary and applying escaping or aliasing systematically.
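
A short sketch of both options; the column names are the reserved words cited above and the table is illustrative:

-- Escaped with backticks (valid Spark SQL)
SELECT `identifier`, `timestamp`
FROM QDIM_PROJECT_OL

-- Preferred: alias once in the view so downstream SQL needs no escaping
SELECT `identifier` AS project_identifier, `timestamp` AS event_timestamp
FROM QDIM_PROJECT_OL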

3. CF$ field prefixes remain

  • Issue: Spark interprets $ in column names literally and often fails to parse such fields, especially in Delta or Hive environments.
  • Insight: Either rename in source metadata using AS CF_FORECAST_VALUE or access via backticks `CF$FIELDNAME`.
  • Suggested Action: Add an alias rule in the view layer to strip or rename CF$ columns.
  • Long-term: If feasible, clean naming at ingestion stage.
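
A minimal sketch of the view-layer alias rule, assuming a CF$FORECAST_VALUE source column and the Q-prefixed forecast table:

-- Expose CF$ source columns under Spark-friendly names in the view layer
SELECT `CF$FORECAST_VALUE` AS CF_FORECAST_VALUE
FROM QDIM_PROJ_FORECAST_OL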

4. Use of CROSS APPLY for unpivot logic

  • Issue: CROSS APPLY is a T-SQL construct and unsupported in Spark SQL.
  • Insight: Often used to simulate row expansion (e.g. JSON or table-valued functions).
  • Alternatives:
    • Use LATERAL VIEW with explode() for arrays.
    • Use inline SELECT with UNION ALL for fixed-column unpivoting.
  • Challenge: If your unpivot relies on UDFs or metadata-based expressions, a full rewrite may be required.
  • Recommendation: Flag CROSS APPLY usage as a structural blocker and isolate in a compatibility audit.
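
For the fixed-column case, a hedged sketch of the UNION ALL rewrite; the table and metric columns are illustrative:

-- T-SQL pattern (not valid in Spark SQL):
--   SELECT f.project_id, v.metric_name, v.metric_value
--   FROM QFACT_PROJECT_OL f
--   CROSS APPLY (VALUES ('budget', f.budget), ('actual', f.actual)) v(metric_name, metric_value)

-- Spark-safe rewrite: inline SELECTs combined with UNION ALL
SELECT project_id, 'budget' AS metric_name, budget AS metric_value FROM QFACT_PROJECT_OL
UNION ALL
SELECT project_id, 'actual' AS metric_name, actual AS metric_value FROM QFACT_PROJECT_OL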

5. Use of CHARINDEX()

  • Issue: CHARINDEX() is not supported in Spark; use instr() instead.
  • Example: CHARINDEX('ABC', col) → instr(col, 'ABC')
  • Caveat: The two behave differently in some edge cases (e.g. null-handling).
  • Suggested Action: Add a search function shim to auto-convert when emitting Spark SQL.
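
A minimal sketch of the substitution; the column and table are illustrative:

-- SQL Server: CHARINDEX('ABC', DESCRIPTION)
-- Spark SQL:  instr(DESCRIPTION, 'ABC')  -- note the swapped argument order
SELECT instr(DESCRIPTION, 'ABC') AS abc_position
FROM QDIM_PROJECT_OL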

6. STRING_AGG() unsupported until Spark 4.0

  • Issue: No native aggregation for string concatenation in Spark < v4.0.
  • Workaround Options:
    • Use collect_list() + array_join(): array_join(collect_list(col), ', ')
    • Use Python or Scala UDFs in notebooks, though not portable to SQL-only contexts.
  • Recommendation: Isolate views needing this and defer porting until your environment upgrades to Spark 4.
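
A hedged sketch of the first workaround in Spark SQL; the grouping column and table are illustrative, and note that collect_list() does not guarantee element order:

-- Spark SQL: per-group string concatenation without STRING_AGG()
SELECT project_id,
       array_join(collect_list(DESCRIPTION), ', ') AS descriptions
FROM QDIM_PROJECT_OL
GROUP BY project_id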
🛠 Example (Future-Proofing Idea)

To make this portable and repeatable:

-- CONCAT() is supported by both SQL Server (2012+) and Spark SQL; the T-SQL '+' operator is not portable
SELECT
    CONCAT(
        '{',
        CASE WHEN code IS NOT NULL
             THEN CONCAT('"', attribute, '":', COALESCE(value_metadata, '{}'))
             ELSE '' END,
        '}'
    ) AS meta_data
FROM metadata_template

Run that with fixed inputs and combine the rows with UNION ALL, all without STRING_AGG.


7. Use of GETDATE() as a runtime substitution

  • Issue: GETDATE() is not Spark SQL compliant. Equivalent: current_timestamp().
  • Insight: If GETDATE() is used as a dynamic substitution placeholder in generator logic (e.g. for snapshotting), ensure it’s replaced before SQL compilation.
  • Suggested Action: Replace all GETDATE() with current_timestamp() in Spark-safe code paths.

8. REPLACE() required for quotes in CODE fields

  • Issue: Embedded quotes (') or double quotes (") in data can cause SQL syntax errors when constructing literal strings.
  • Insight: Spark SQL does support REPLACE(), so this is more of a hygiene step than a compatibility issue.
  • Best Practice: Always escape embedded quotes using: REPLACE(col, "'", "''")
  • Recommendation: Keep this as a pre-step in your code generator and encode as part of the literal quoting logic.
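
A minimal sketch following the REPLACE pattern above; the column and table are illustrative, and in SQL Server (where double quotes delimit identifiers) the equivalent single-quoted form is REPLACE(col, '''', ''''''):

-- Double up embedded single quotes before the value is used inside a generated literal
SELECT REPLACE(DESCRIPTION, "'", "''") AS description_quote_safe
FROM QDIM_PROJECT_OL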
