
Exporting Data to Redshift

Infoworks supports Redshift as a target for data transformation pipelines. Data can be exported to Redshift in a fast, parallelized way, and data analytics can then be performed on Redshift. The Redshift table can be optimized for user query patterns.

Prerequisites

  • Ensure that the Redshift target data connection is configured.
  • Ensure that read/write permissions to Amazon S3 and Redshift are granted for Databricks Virtual Private Cloud (VPC).

Setting Redshift Target Properties

Following are the steps to use a Redshift target in a pipeline:

  1. Double-click the Redshift Table node. The properties page is displayed.
  2. Click Edit Properties, and set the following fields.
  3. Click Save.
  • Build Mode: The write mode for the Redshift target. The options include Overwrite, Append, and Merge.
      Overwrite: Drops and recreates the Redshift target.
      Append: Appends data to the existing Redshift target.
      Merge: Merges data into the existing table based on the natural key.
  • Data Connection Name: The data connection to be used by the Redshift target.
  • Schema Name: The schema name of the Redshift target.
  • Create a schema if it does not exist: Enable this option to create a new schema with the name provided above. Ensure that the user has sufficient privileges to create a schema in the Redshift target.
  • Table Name: The table name of the Redshift target.
  • Is Existing Table: When enabled, the existing table behaves as an external table to Infoworks; the table is not created, dropped, or otherwise managed by Infoworks.
  • Natural Keys: The natural keys required for the Redshift target (used by the Merge build mode).
  • Compute Stats: The option to execute the ANALYZE statement after writing to the table.
  • Distribution Style: The distribution style of data across the nodes in AWS Redshift. The options include Even, All, Key, and Auto. The default value is Auto.
  • Sort Style: The order in which rows in a table are stored, used to improve query performance. The options include Compound and Interleaved.
  • Sort Keys: The columns for the corresponding sort style. For how these properties map to Redshift DDL, see the sketch after this list.

NOTE An Interleaved Sort Key can use a maximum of eight columns.
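To make Distribution Style, Sort Style, and Sort Keys concrete, the following is a minimal sketch of the Redshift DDL clauses these properties correspond to. Infoworks generates the actual statement internally; the table and column names below are illustrative only.

    # Illustrative only: how the target properties map to Redshift DDL clauses.
    def build_ddl(table, columns, dist_style="AUTO", dist_key=None,
                  sort_style="COMPOUND", sort_keys=None):
        """Compose a CREATE TABLE statement with DISTSTYLE and SORTKEY clauses."""
        cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
        ddl = f"CREATE TABLE {table} ({cols}) DISTSTYLE {dist_style}"
        if dist_style == "KEY" and dist_key:
            ddl += f" DISTKEY({dist_key})"
        if sort_keys:
            if sort_style == "INTERLEAVED" and len(sort_keys) > 8:
                raise ValueError("An interleaved sort key allows at most 8 columns")
            ddl += f" {sort_style} SORTKEY({', '.join(sort_keys)})"
        return ddl

    print(build_ddl("analytics.orders",
                    [("order_id", "BIGINT"), ("order_date", "DATE")],
                    dist_style="KEY", dist_key="order_id",
                    sort_style="COMPOUND", sort_keys=["order_date"]))
    # CREATE TABLE analytics.orders (order_id BIGINT, order_date DATE)
    #   DISTSTYLE KEY DISTKEY(order_id) COMPOUND SORTKEY(order_date)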

For details on setting the Redshift data connection, see Setting Target Data Connections.

Redshift Spark Configurations

The Redshift Spark connector is used to write to a table in Redshift. The connector writes the data to a temporary location and then uses the Redshift COPY command to load the data into the table. The intermediate data is deleted after it is loaded into the Redshift table.
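For illustration, a write through the open-source Spark Redshift connector typically looks like the sketch below: the connector stages the DataFrame in the S3 tempdir and then issues a COPY into the target table. All connection values here are placeholders, not the values Infoworks generates.

    # Sketch of a Spark-to-Redshift write via a temporary S3 location.
    # df is an existing Spark DataFrame holding the pipeline output;
    # the URL, bucket, table, and IAM role below are placeholders.
    df.write \
        .format("io.github.spark_redshift_community.spark.redshift") \
        .option("url", "jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS") \
        .option("dbtable", "analytics.orders") \
        .option("tempdir", "s3a://example-bucket/redshift-staging/") \
        .option("aws_iam_role", "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole") \
        .mode("append") \
        .save()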

Redshift Additional Write Options

The Redshift Spark connector uses the Redshift connection details to load data into Redshift. You can use the following configuration to pass additional options to the connector.

dt_spark_redshift_extra_options: The configuration to set additional options to the Redshift connector using semicolon-separated (;) values. For example, the value can be dt_spark_redshift_extra_options=tempformat=CSV;extracopyoptions=NO.
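As a sketch, the semicolon-separated value decomposes into individual connector options as shown below; the parsing code is illustrative and not Infoworks' actual implementation.

    # Illustrative parsing of the semicolon-separated option string into
    # key/value pairs that would be applied as connector options.
    raw = "tempformat=CSV;extracopyoptions=NO"
    options = dict(pair.split("=", 1) for pair in raw.split(";") if pair)
    print(options)  # {'tempformat': 'CSV', 'extracopyoptions': 'NO'}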

You can set this key in the pipeline Settings > Advanced Configurations > Add Configuration section.

For other options, see Redshift Additional Options.

NOTE Infoworks currently uses the default options provided by the Spark Redshift connector. If required, you can override a default value supported by the connector by setting the appropriate value in the dt_spark_redshift_extra_options configuration.

Disabling Delta Format Check in Databricks SaaS Environment

To disable delta format check in Databricks SaaS environment, set the spark.databricks.delta.formatCheck.enabled Spark configuration to false in the pipeline advanced configuration.
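For reference, the equivalent setting applied directly to a Spark session (for example, in a Databricks notebook) is the one-liner below; within Infoworks, set it through the pipeline advanced configuration as described above.

    # Disable the Delta format check for the current Spark session.
    spark.conf.set("spark.databricks.delta.formatCheck.enabled", "false")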
