Data Transformation materializes the data as a flattened Hive table after performing all the operations defined by the pipeline design, such as joins, aggregations, and unions, on the big data sources. The target node specifies the details of the table, such as its schema name, table name, and target file system path.
Following are the steps to apply a Target node in a pipeline:
You can also select Append to append the data of one pipeline target to an already built pipeline target. This is achieved by providing a reference table in the pipeline target.
On setting the Sync Type to Append or Merge, a dropdown for Reference Table appears. All previously built pipeline tables are available as reference table options. On selecting a reference table, properties such as schema name, table name, and target file system path become visible. The appendable/mergeable target inherits these properties from the reference table by default; they are non-editable from the target node.
There is also a separate section for column mappings between the reference table and the target node. In the mapping section, all columns from the reference table are listed on the left-hand side, while all columns from the target node are listed on the right-hand side.
You can reorder the target node columns (using the edit feature) if the order of the target columns does not match the order of the reference table columns. The number of target columns must be greater than or equal to the number of reference table columns. All unmapped columns from the target table are excluded automatically on saving. All audit columns must be mapped to their respective audit columns. For all other columns, the data types of mapped columns must match.
If a column name in the target differs from the corresponding mapped column of the reference table, you are notified to rename the column; on clicking OK, the column in the target node is renamed. On saving the mapping section, the schema of the target table is changed to match the schema of the reference table.
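The mapping rules above can be sketched as a small validation routine. This is an illustrative model only, not the product's implementation; the column structures, the `validate_mapping` helper, and the assumption that audit columns start with "ZIW_" are all hypothetical.

```python
# Sketch of the mapping checks described above (illustrative only).
# Columns are (name, data_type) pairs; audit columns start with "ZIW_".

def validate_mapping(reference_cols, target_cols, mapping):
    """mapping: dict of reference column name -> mapped target column name.
    Returns the target columns that must be renamed to the reference name."""
    # The target must have at least as many columns as the reference table.
    if len(target_cols) < len(reference_cols):
        raise ValueError("target has fewer columns than the reference table")

    ref_types = dict(reference_cols)
    tgt_types = dict(target_cols)
    renames = {}
    for ref_name, tgt_name in mapping.items():
        # Audit columns must map to their respective audit columns.
        if ref_name.startswith("ZIW_") and ref_name != tgt_name:
            raise ValueError(f"audit column {ref_name} must map to itself")
        # Data types of mapped columns must match.
        if ref_types[ref_name] != tgt_types[tgt_name]:
            raise ValueError(f"type mismatch on {ref_name}")
        # Differently named target columns get renamed to the reference name.
        if ref_name != tgt_name:
            renames[tgt_name] = ref_name
    return renames

ref = [("id", "int"), ("name", "string"), ("ZIW_ACTIVE", "boolean")]
tgt = [("id", "int"), ("full_name", "string"),
       ("ZIW_ACTIVE", "boolean"), ("extra", "string")]
print(validate_mapping(ref, tgt,
                       {"id": "id", "name": "full_name",
                        "ZIW_ACTIVE": "ZIW_ACTIVE"}))
# {'full_name': 'name'}  -- "extra" is unmapped and would be excluded on save
```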
SCD Type includes two options, SCD 1 and SCD 2. These are the two models supported for merge sync type.
SCD 1 (Slowly Changing Dimension type-1): SCD 1 maintains only the most recent records (no history) in the table, while SCD 2 also maintains history. For a mergeable target, you can select a reference table of type SCD 1 or SCD 2 depending on the requirement.
SCD 2 (Slowly Changing Dimension type-2): To maintain history, different levels of granularity are supported: Year, Month, Day, Hour, Minute, and Second.
For example, if you configure the granularity as Day, only the most recent update per day is maintained for a record. For any updates that happen on the next day, only the latest update of that day is maintained. To maintain all history, set the granularity level to Second. By default, the granularity level is Second.
Following are the three audit columns used in SCD 2:
ZIW_ACTIVE: The value is true for all active records and false for history records.
ZIW_TARGET_START_TIMESTAMP: Maintains the start timestamp of a record. For example, for Day-level granularity, the value is the first second of the day, and for Hour-level granularity, the first second of the hour.
ZIW_TARGET_END_TIMESTAMP: Maintains the end timestamp of a record. For active records, the value is an INFINITY timestamp, whereas for non-active records, the value is the last second of the day/hour/month, etc.
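The granularity rules for the two timestamp columns can be sketched as follows. This is a simplified model, not the product's implementation; the `period_bounds` helper is hypothetical, and only the Day, Hour, and Second granularities from the list above are shown.

```python
from datetime import datetime, timedelta

def period_bounds(ts, granularity):
    """Return (start, end) of the period containing ts, per the rules above:
    start is the first second of the period, end is the last second."""
    if granularity == "Day":
        start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        end = start + timedelta(days=1) - timedelta(seconds=1)
    elif granularity == "Hour":
        start = ts.replace(minute=0, second=0, microsecond=0)
        end = start + timedelta(hours=1) - timedelta(seconds=1)
    elif granularity == "Second":
        start = ts.replace(microsecond=0)  # every second is its own period
        end = start
    else:
        raise ValueError(f"unsupported granularity: {granularity}")
    return start, end

# An update at 14:30:05 on 9/7/2020 with Day granularity:
start, end = period_bounds(datetime(2020, 9, 7, 14, 30, 5), "Day")
print(start)  # 2020-09-07 00:00:00  (ZIW_TARGET_START_TIMESTAMP)
print(end)    # 2020-09-07 23:59:59  (ZIW_TARGET_END_TIMESTAMP once closed)
```

For an active record, ZIW_TARGET_END_TIMESTAMP would hold the INFINITY timestamp instead of the period end.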
Following is a record with three attributes, ID, First_Name, and Last_Name, as of 9/7/2020. First_Name is configured as an SCD2 attribute, while Last_Name is configured as a non-SCD2 attribute. The granularity is set to Second (the default).
ID | First_Name | Last_Name | ZIW_Active | ZIW_ST | ZIW_ET | Active/Inactive |
---|---|---|---|---|---|---|
1 | Fname1 | LName1 | TRUE | 9/7/2020 | 9/9/9999 | Active Record |
Now, let's say that on 9/10/2020, the First_Name value for this record was updated to "Fname2". Since the First_Name attribute is marked as SCD2, history is maintained, and the updated target records now appear as given below:
ID | First_Name | Last_Name | ZIW_Active | ZIW_ST | ZIW_ET | Active/Inactive |
---|---|---|---|---|---|---|
1 | Fname1 | LName1 | FALSE | 9/7/2020 | 9/10/2020 | Inactive Record |
1 | Fname2 | LName1 | TRUE | 9/10/2020 | 9/9/9999 | Active Record |
Now, let's say that on 9/15/2020, the Last_Name value for this record was updated to "LName2". Since the Last_Name attribute is marked as non-SCD2, history is not maintained, and the same active record is updated directly:
ID | First_Name | Last_Name | ZIW_Active | ZIW_ST | ZIW_ET | Active/Inactive |
---|---|---|---|---|---|---|
1 | Fname1 | LName1 | FALSE | 9/7/2020 | 9/10/2020 | Inactive Record |
1 | Fname2 | LName2 | TRUE | 9/15/2020 | 9/9/9999 | Active Record |
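The behavior in the example tables above can be sketched in plain Python. This is a simplified model, not the product's merge implementation; the `apply_update` helper and the row representation are hypothetical, and timestamps follow the example tables (including the start timestamp moving on an in-place, non-SCD2 update).

```python
from datetime import date

INFINITY = date(9999, 9, 9)  # the "9/9/9999" open-ended timestamp above

def apply_update(table, key, updates, scd2_attrs, update_date):
    """Simplified SCD2 merge for one record; rows are dicts with audit columns."""
    active = next(r for r in table if r["ID"] == key and r["ZIW_ACTIVE"])
    if any(a in scd2_attrs for a in updates):
        # SCD2 attribute changed: close the active row, insert a new active row.
        active["ZIW_ACTIVE"] = False
        active["ZIW_TARGET_END_TIMESTAMP"] = update_date
        table.append(dict(active, **updates,
                          ZIW_ACTIVE=True,
                          ZIW_TARGET_START_TIMESTAMP=update_date,
                          ZIW_TARGET_END_TIMESTAMP=INFINITY))
    else:
        # Non-SCD2 attribute changed: update the same active row in place.
        active.update(updates)
        active["ZIW_TARGET_START_TIMESTAMP"] = update_date  # per the table above
    return table

table = [{"ID": 1, "First_Name": "Fname1", "Last_Name": "LName1",
          "ZIW_ACTIVE": True,
          "ZIW_TARGET_START_TIMESTAMP": date(2020, 9, 7),
          "ZIW_TARGET_END_TIMESTAMP": INFINITY}]

apply_update(table, 1, {"First_Name": "Fname2"}, {"First_Name"}, date(2020, 9, 10))
apply_update(table, 1, {"Last_Name": "LName2"}, {"First_Name"}, date(2020, 9, 15))
print(len(table))  # 2: one inactive history row, one active row
```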
DateTimeParseException
If the following error occurs: Root cause error : java.time.format.DateTimeParseException, set "spark.sql.legacy.timeParserPolicy=LEGACY" as an advanced configuration at the pipeline level.
Undefined Function: 'iwtime_similarity'
If the "org.apache.spark.sql.AnalysisException: Undefined function: 'iwtime_similarity'" error occurs on a persistent cluster in the Spark-Databricks environment, re-run the job using an ephemeral cluster.
Spark Hive Compatibility
Spark bucketing semantics are different from Hive bucketing semantics. This behaviour can be controlled by the Hive Compatible Target check box.
Infoworks Data Transformation handles these differences using the following methods:
Hive Compatible Targets: These targets are written so that they remain readable by Hive.
Non-Hive Compatible Targets: Spark bucketing is applied. These targets will not be readable by Hive.
Spark SQL and Hive SQL expressions are compatible in the majority of cases. However, there are some differences. Some of the known issues are captured here:
For example, consider select col1 from tab1 union all select col1 from tab2. If tab1.col1 is decimal(7,2) and tab2.col1 is decimal(38,0), the two engines may resolve the result column to different decimal types.
See the Spark SQL documentation for other known issues.
Additive Schema Handling
If only new columns have been added to a source table or pipeline, additive schema handling alters the table to add these columns at the end of the table. Additive schema handling is supported only in the Merge mode.
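Conceptually, additive schema handling computes the set of newly added columns and appends them at the end of the table. The sketch below is illustrative only; the `additive_schema_changes` helper and the DDL shown in the comment are hypothetical, not the product's internal logic.

```python
def additive_schema_changes(current_cols, incoming_cols):
    """Return columns present in the incoming schema but not in the current
    table, preserving their order; these are appended at the end of the
    table. Dropped or re-typed columns are NOT handled (additive only)."""
    current_names = {name for name, _ in current_cols}
    return [(n, t) for n, t in incoming_cols if n not in current_names]

current = [("id", "int"), ("name", "string")]
incoming = [("id", "int"), ("name", "string"),
            ("email", "string"), ("age", "int")]
added = additive_schema_changes(current, incoming)
print(added)  # [('email', 'string'), ('age', 'int')]
# A corresponding (hypothetical) Hive DDL would be:
# ALTER TABLE target ADD COLUMNS (email string, age int)
```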
The additive schema changes are displayed in the build log summary of the pipeline.
An optimized merge algorithm has been introduced to overcome the performance issue of pipeline merge. The issue occurred due to partition-by-partition merge of the target table in the pipeline, and performance degraded as the number of partitions in the target table increased.
By default, the optimized pipeline merge is used. To switch back to the previous partition-level pipeline merge, set the following in the pipeline advanced configurations: dt_use_partition_level_merge=true.
This option enables you to add labels to your tables in BigQuery. For more information about BigQuery labels, refer to BigQuery Labels.