6.1.1
CDC SCD2 pipeline builds fail intermittently because of timestamp casting errors. The failures stem from the new datetime parser that Spark adopted in version 3.0, which validates datetime strings more strictly than the legacy parser. This article explains the root cause and the steps to resolve the issue.
The error stems from a change in datetime parsing behavior in Spark >= 3.0. By default, Spark uses the parser introduced in version 3.0, which rejects some datetime strings that the legacy parser accepted, resulting in errors such as:
org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER]
Fail to parse '2024-11-26 7:20:59' in the new parser.
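The error occurs because spark.sql.legacy.timeParserPolicy defaults to EXCEPTION. Under this policy, Spark raises a SparkUpgradeException when the new parser cannot parse a datetime string that the legacy parser would have accepted. In the example above, the hour is a single digit (7 rather than 07), which is likely why the string does not strictly match the expected datetime pattern. Any pipeline that relies on Spark's datetime parsing can be affected.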
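To illustrate why a value like the one in the error message is rejected, the following sketch flags timestamp strings whose fields are not zero-padded and shows one way to normalize them. It is plain Python rather than Spark code, and it assumes the pipeline expects the pattern yyyy-MM-dd HH:mm:ss, which this article does not confirm. The names is_strict and zero_pad are hypothetical helpers, not part of any product API.

```python
import re

# Assumed target layout: yyyy-MM-dd HH:mm:ss with every field zero-padded.
STRICT_TS = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

# Lenient layout: same field order, but one or two digits are allowed
# for month, day, hour, minute, and second.
LENIENT_TS = re.compile(
    r"(\d{4})-(\d{1,2})-(\d{1,2}) (\d{1,2}):(\d{1,2}):(\d{1,2})"
)


def is_strict(value: str) -> bool:
    """Return True if the string has the fully zero-padded layout."""
    return STRICT_TS.fullmatch(value) is not None


def zero_pad(value: str) -> str:
    """Rewrite a leniently formatted timestamp with zero-padded fields.

    Only the layout is checked; field ranges (for example hour < 24)
    are not validated here.
    """
    match = LENIENT_TS.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognized timestamp layout: {value!r}")
    year, *rest = match.groups()
    month, day, hour, minute, second = (part.zfill(2) for part in rest)
    return f"{year}-{month}-{day} {hour}:{minute}:{second}"
```

A check like this could be run against a sample of source values to estimate how many rows are affected before deciding whether to change the parser policy or clean the data upstream.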
To work around this issue, adjust the timeParserPolicy to use the legacy datetime parser. This can be achieved using one of the following approaches:
Navigate to the settings page of the affected pipeline.
In the Advanced Configuration section, add the following key-value pair:
Key: iw_spark_app_conf
Value: spark.sql.legacy.timeParserPolicy=LEGACY
Use an ephemeral cluster for pipeline execution.
Open the Advanced Configuration settings for the compute in your environment.
Add the following key-value pair:
Key: spark.sql.legacy.timeParserPolicy
Value: LEGACY
Restart the compute to apply the changes.
Directly set the following configuration in the compute settings:
Key: spark.sql.legacy.timeParserPolicy
Value: LEGACY
Restart the compute cluster.
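After applying one of the approaches above, the effective value can be checked from a Spark session. The sketch below assumes access to a PySpark SparkSession, for example in a notebook attached to the same compute, which this article does not describe. The helper name ensure_time_parser_policy is hypothetical; spark.conf.get and spark.conf.set are the standard PySpark runtime configuration calls.

```python
CONF_KEY = "spark.sql.legacy.timeParserPolicy"

# The three values Spark 3.x accepts for this setting.
VALID_POLICIES = {"LEGACY", "CORRECTED", "EXCEPTION"}


def ensure_time_parser_policy(spark, policy="LEGACY"):
    """Set the parser policy on this session if it differs, then return it.

    `spark` is assumed to be a SparkSession (or any object exposing the
    same conf.get / conf.set interface). This changes only the current
    session; it does not replace the pipeline or compute settings.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unsupported policy: {policy!r}")
    current = spark.conf.get(CONF_KEY)
    if current.upper() != policy:
        spark.conf.set(CONF_KEY, policy)
    return spark.conf.get(CONF_KEY)


# Example usage, assuming an existing SparkSession named `spark`:
# effective = ensure_time_parser_policy(spark)
# print(effective)
```

If the returned value is still EXCEPTION after a restart, the configuration was probably not picked up by the compute that runs the pipeline.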