Infoworks supports Redshift as a target for data transformation pipelines. Data can be exported to Redshift in parallel for faster loads, and analytics can then be performed directly on Redshift. The Redshift table can also be optimized for user query patterns.
Configure the following fields to use a Redshift target in a pipeline:
Field | Description |
---|---|
Build Mode | The options include Overwrite, Append, and Merge. Overwrite: drops and recreates the Redshift target. Append: appends data to the existing Redshift target. Merge: merges data into the existing table based on the natural key. |
Data Connection Name | Data connection to be used by the Redshift target. |
Schema Name | Schema name of the Redshift target. |
Create a schema if it does not exist | Enable this option to create a new schema with the name provided above. Ensure that the user has sufficient privileges to create a schema in Redshift target. |
Table Name | Table name of the Redshift target. |
Is Existing Table | When enabled, the existing table is treated as an external table to Infoworks; Infoworks does not create, drop, or otherwise manage it. |
Natural Keys | The required natural keys for the Redshift target. |
Compute Stats | The option to execute the ANALYZE statement (which updates table statistics for the query planner) after writing to the table. |
Distribution Style | The distribution style of data across the nodes in AWS Redshift. The options include Even, All, Key and Auto. The default value is Auto. |
Sort Style | Determines the order in which rows in the table are stored, which can improve query performance. The options include Compound and Interleaved. See the DDL sketch after this table. |
Sort Keys | The columns for the corresponding sort style. |
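To make the Distribution Style, Sort Style, and Sort Keys settings concrete, the following is a minimal sketch of the kind of Redshift DDL these settings correspond to, issued here through psycopg2. It is illustrative only, not Infoworks' internal code; the cluster endpoint, schema, table, and column names are all hypothetical.

```python
import psycopg2  # assumed available; any Redshift SQL client would work

# Hypothetical table using Distribution Style: Key and Sort Style: Compound.
ddl = """
CREATE TABLE IF NOT EXISTS analytics.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(18, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)                    -- Distribution Style: Key
COMPOUND SORTKEY (order_date, order_id)  -- Sort Style: Compound, with Sort Keys
"""

with psycopg2.connect(
    host="redshift-cluster.example.com",  # hypothetical endpoint
    port=5439, dbname="dev", user="etl_user", password="***",
) as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)
```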
For details on setting the Redshift data connection, see Setting Target Data Connections.
The Redshift Spark connector is used to write to a table in Redshift. The connector writes the data to a temporary location and then uses the Redshift COPY command to load the data into the table. The intermediate data is deleted after it is loaded into the Redshift table.
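The following PySpark sketch illustrates this write path using the community spark-redshift connector. It is a hedged example, not Infoworks' internal code: the format string depends on the connector build in use, and the JDBC URL, table names, S3 bucket, and IAM role shown are hypothetical.

```python
# Sketch of the Spark -> temporary S3 location -> Redshift COPY write path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-write-sketch").getOrCreate()
df = spark.table("curated.orders")  # hypothetical source DataFrame

(df.write
   .format("io.github.spark_redshift_community.spark.redshift")
   .option("url", "jdbc:redshift://redshift-cluster.example.com:5439/dev")
   .option("dbtable", "analytics.orders")
   .option("tempdir", "s3a://my-bucket/tmp/redshift/")  # intermediate staging location
   .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-copy-role")
   .mode("append")  # corresponds to the Append build mode
   .save())
# The connector stages df under tempdir and then issues a Redshift COPY;
# per the behavior described above, the staged data is removed after the load.
```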
Redshift Additional Write Options
The Redshift Spark connector uses the Redshift connection details to load data into Redshift. You can use the following configuration to pass additional options to the connector.
dt_spark_redshift_extra_options
: The configuration to set additional options for the Redshift connector using semicolon-separated (;) values. For example, the value can be dt_spark_redshift_extra_options=tempformat=CSV;extracopyoptions=NO.
You can set this key in the pipeline Settings > Advanced Configurations > Add Configuration section.
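For illustration, here is one way such a semicolon-separated value could be split into individual connector options. This is a hypothetical sketch, not Infoworks' actual parsing logic, and the extracopyoptions value shown is only an example.

```python
# Illustrative only: splitting a semicolon-separated option string into
# individual key=value pairs for the connector.
raw = "tempformat=CSV;extracopyoptions=TRUNCATECOLUMNS"  # hypothetical value

options = dict(pair.split("=", 1) for pair in raw.split(";") if pair)
print(options)  # {'tempformat': 'CSV', 'extracopyoptions': 'TRUNCATECOLUMNS'}

# Each entry would then be applied to the DataFrame writer, e.g.
#     writer = writer.option(key, value)
# for every (key, value) in options.items().
```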
For other options that can be set using the dt_spark_redshift_extra_options configuration, see Redshift Additional Options.
Disabling Delta Format Check in Databricks SaaS Environment
To disable the delta format check in a Databricks SaaS environment, set the spark.databricks.delta.formatCheck.enabled Spark configuration to false in the pipeline advanced configuration.
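For reference, the equivalent setting as it would appear in a Databricks notebook or Spark session is shown below; in Infoworks, add the same key and value under the pipeline's Advanced Configurations instead.

```python
# Disable the Delta format check for this Spark session.
spark.conf.set("spark.databricks.delta.formatCheck.enabled", "false")
```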