Infoworks 6.1.3
Onboard Data

Onboarding Data from Teradata TPT

Infoworks supports ingesting data from Teradata via Teradata Parallel Transporter (TPT), a native tool provided by Teradata. Ingestion via TPT reduces the load (AMP CPU) on the source Teradata system (a 5x to 10x reduction in source load with TPT) and provides better performance than JDBC. TPT is the preferred load/unload tool for Teradata Database.

Prerequisite

Ensure that TPT, along with the cloud-specific access module, is installed on the master node of the interactive cluster using the TPT installer.

Perform the following steps to install TPT on the edge node, interactive cluster, and persistent cluster as applicable.

NOTE The TPT installation procedure is OS specific.

  1. Download the Teradata Tools and Utilities (TTU) installer package for the required OS.
  2. Extract the installer package to a directory.
  3. Navigate to the extracted directory.
  4. Execute the following command: sudo ./setup.bat a

For more information on TPT installation steps, refer to Teradata Tools and Utilities Installation.
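
As an illustration, on a Linux node the steps above typically reduce to commands of the following form (the package and directory names are placeholders for the TTU installer downloaded for your OS):

    tar -xzf <ttu-installer-package>.tar.gz    # extract the installer package
    cd <extracted-directory>                   # navigate to the extracted directory
    sudo ./setup.bat a                         # run the installer as described in step 4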

Creating a Teradata TPT Source

To create a Teradata TPT source, see Onboarding an RDBMS Source. Ensure that the Source Type selected is Teradata.

Teradata Configurations

Fetch Data Using: The mechanism through which Infoworks fetches data from the database.

Connection URL: The connection URL through which Infoworks connects to the database. The URL must be in the following format: jdbc:teradata://<ip>/TMODE=<tmode>,database=<databasename> (see the example after this table).

Username: The username for the connection to the database.

Authentication Type for Password: Select the authentication type from the dropdown, for example, Infoworks Managed or External Secret Store. If you select Infoworks Managed, provide the password in Authentication Password for Password. If you select External Secret Store, select the Secret which contains the password. NOTE Only the secrets accessible to the user will be available in the drop-down list.

Source Schema: The schema in the database to be crawled. The schema value is case sensitive.

Enable Ingestion via Teradata Parallel Transporter (TPT): Check the box to enable ingestion via TPT. This option cannot be disabled if any of the tables is configured for TPT (or TPT without IW Processing). To disable this option, change the Ingest Using field setting of those tables to JDBC.

Cloud Type: The options include Azure Blob Storage (WASB), ADLS Gen2, Amazon S3, and GCS. The cloud types displayed depend on the environment. If the data environment is Azure, select WASB or ADLS Gen2 from the Cloud Type dropdown.

Custom Tags: This dropdown provides the list of tags from which you can choose. Tags can be used to identify or segregate the source.
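
As an illustration, with placeholder values for the host, transaction mode, and database name, a connection URL could look like the following:

    jdbc:teradata://192.168.10.25/TMODE=ANSI,database=sales_db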

For ADLS Gen2, enter the following:

Access scheme: The scheme used to access ADLS Gen2. The available options are abfs:// and abfss://.

Authentication mechanism: The mechanism using which the security information is stored. The options include Access Key and Service Principal. Access Key refers to the key used to authenticate applications when making requests to this Azure storage account. Service Principal functions as the identity of the application instance.

FileSystem: Provide a file system where all the data of an artifact will be stored.

Storage account name: Provide the name of the Azure storage account.

Access Key: Provide the storage access key to connect to ADLS. NOTE This field appears only when the authentication mechanism selected is Access Key.

Application ID: Provide the ID that uniquely identifies the user application. NOTE This field appears only when the authentication mechanism selected is Service Principal.

Directory ID: Provide the ID that uniquely identifies the Azure AD instance. NOTE This field appears only when the authentication mechanism selected is Service Principal.

Service Credential: Provide the credential string value that the application uses to prove its identity. NOTE This field appears only when the authentication mechanism selected is Service Principal.
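
For reference, the access scheme, file system, and storage account name configured above correspond to an ADLS Gen2 URI of the following general form (the names in angle brackets are placeholders):

    abfss://<filesystem>@<storage account name>.dfs.core.windows.net/<path>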

Windows Azure Storage Blob (WASB), also known as Blob Storage, is a file system for storing large amounts of unstructured object data, such as text or binary data. For more details, see Azure Blob Storage Documentation.

For Azure Blob Storage (WASB), enter the following:

Storage Account name: The unique name of the storage account. For more details, see Create a storage account.

Authentication Type for Storage Account Access Key: The storage account access key. This option is displayed if the Authentication Mechanism used is Account Key. For more details, see Manage storage account access keys. If you select Infoworks Managed from the Authentication Type for Storage Account Access Key dropdown, then provide Authentication Password for Storage Account Access Key. If you select External Secret Store, select the secret from the Secret for Storage Account Access Key dropdown.

Storage Account End Point: The account end point of WASB storage. The default value is blob.core.windows.net. If the staging cloud storage is in one of the Azure Government Cloud regions, use the following end point: blob.core.usgovcloudapi.net.

Container name: The name of the container in the storage account, in which the files are stored. A container organizes a set of blobs, similar to a directory in a file system. For more details, see Create a container.
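
For reference, a blob in this container is typically addressed using a WASB path of the following general form, built from the container name, storage account name, and end point above (the names in angle brackets are placeholders):

    wasbs://<container name>@<storage account name>.blob.core.windows.net/<path>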

Amazon Simple Storage Service (S3) is storage for the Internet. Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. For more details, see Amazon S3 Documentation.

For Amazon S3, enter the following:

Authentication Mechanism: The mechanism using which the security information is stored.

Access ID: The access ID uniquely identifies an AWS account. You can use the access ID to send authenticated requests to Amazon S3. The Access ID is a 20-character, alphanumeric string, for example, AKIAIOSFODNN7EXAMPLE. For the access ID details, contact your AWS Cloud admin.

Authentication Type for Secret Key: The secret key to send requests using the AWS account. The secret access key is a 40-character string, for example, wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY. For the secret key details, contact your AWS Cloud admin. For more details, see AWS Account Access Keys. If you select Infoworks Managed from the Authentication Type for Secret Key dropdown, then provide Authentication Password for Secret Key. If you select External Secret Store, select the secret from the Secret for Secret Key dropdown.

Bucket Name: The name of the container/bucket in the storage account in which the files are stored. The container/bucket organizes a set of objects, similar to a directory in a file system. For more details, see Creating a bucket.

Google Cloud Storage (GCS) is an online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. In GCS, buckets are the basic containers that hold the data. For more details, see GCS documentation.

For GCS, enter the following:

Project ID: The ID of the project in the source. For more details, see Project ID.

Authentication Mechanism: The mechanism using which the security information is stored. The Service Account Credential File option refers to the key used to authenticate applications when making requests to this storage account. For more details, see Creating and managing service account keys.

Service Account Credential File Upload: You can upload or enter the location where the service account JSON credential file is available on the edge node. This option is displayed if the Authentication Mechanism selected is Service Account Credential File.

Bucket Name: The name of the container/bucket in the storage account in which the files are stored. The container/bucket organizes a set of objects, similar to a directory in a file system. For more details, see Creating storage buckets.

Enabling TD Wallet

Enable TD Wallet: The option to enable TD Wallet.

TD wallet key used for password: The TD Wallet key for the current Teradata connection.
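
As an illustrative sketch (assuming the tdwallet command-line utility that ships with Teradata Tools and Utilities is available on the node), a wallet entry can be created and then referenced by its key name in the field above; td_prod_password below is a hypothetical key name:

    tdwallet add td_prod_password    # prompts for the password value and stores it in the wallet

Enter the key name (td_prod_password in this sketch) in the TD wallet key used for password field.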

Once the settings are saved, you can test the connection.

Configuring a Teradata Table

Ingest Using: The fetch mechanism to be used for ingestion. The options include JDBC, TPT, and TPT without IW Processing. You cannot change the fetch mechanism if the selected option is TPT without IW Processing and any of the tables is in the ready state. However, you can switch between TPT and JDBC even after a table is in the ready state. NOTE This option is visible only when the Enable Ingestion via Teradata Parallel Transporter (TPT) check box is selected on the Source Setup page.

Ingest Type: The type of synchronization for the table. The options include full refresh and incremental.

Natural Keys: The combination of keys to uniquely identify a row. This field is mandatory for incremental ingestion tables. It helps in identifying and merging incremental data with the data already existing on the target. NOTE At least one of the columns in the natural key must have a non-null value for the Infoworks merge to work.

Incremental Mode: The option to indicate whether the incremental data must be appended or merged to the base table. This field is displayed only for incremental ingestion. The options include append and merge. For TPT without IW Processing, only append mode is available.

Watermark Column: Select single or multiple watermark columns to identify the incremental records. The selected watermark columns must be of the same datatype.

Ingest subset of data: The option to configure filter conditions to ingest a subset of data. This option is available for all the RDBMS and Generic JDBC sources. For more details, see Filter Query for RDBMS Sources.

Sorting Columns: Select the columns on which the target table will be sorted. NOTE This feature is available only for Snowflake environments.

NOTE While using TPT with Query as a Table, it is advised to enclose column names, the schema name, and the table name in double quotes, as shown in the example below.
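
For example, a hypothetical query used as a table might quote its identifiers as follows (the schema, table, and column names are placeholders):

    SELECT "order_id", "order_date" FROM "SALES_DB"."ORDERS"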

Target Configuration

Configure the following fields:

Target Table Name: The name of the target table.

Catalog Name: The name of the target catalog. NOTE This field is available only for Unity Catalog enabled data environments.

Staging Catalog Name: The staging catalog name for temp tables.

Staging Schema Name: The staging schema name for temp tables. NOTE Applicable only for Unity Catalog enabled environments on Azure Databricks.

Schema Name: The schema name of the target table.

Storage Format: The format in which the tables must be stored. The options include Read Optimized (Delta), Read Optimized (Parquet), Read Optimized (ORC), Write Optimized (Avro), Universal Format (UniForm), and Delimited text files. NOTE UniForm is available only for Unity Catalog enabled data environments.

Partition Column: The column used to partition the data in the target. Selecting the Create Derived Column option allows you to derive a column from the partition column. This option is enabled only if the partition column datatype is date or timestamp. Provide the Derived Column Function and Derived Column Name. Data will be partitioned based on this derived column. NOTE The Partition section is disabled when the ingestion is done using TPT without IW Processing.

Optimization Configuration

Configure the following fields:

Split By Column: The column used to crawl the table in parallel with multiple connections to the database. The split-by column can be an existing column in the data; any column for which minimum and maximum values can be computed can be a split-by key. Select the Create Derived Split Column option and provide the Derived Split Column Function to derive a column from the Split By column. This option is enabled only if the Split By column datatype is date or timestamp. The data will be split based on the derived value.

Is Wide Table: The option to be enabled when any insert, update, or delete statement in the logs can exceed 4000 characters.

Advanced Configurations

Following are the steps to set advanced configuration for a table:

  1. Click the Data Catalog menu and click Ingest for the required source.

NOTE For an already ingested table, click View Source, click the Tables tab, click Configure for the required table, and click the Advanced Configuration tab.

  2. Click the Configure Tables tab, click the Advanced Configuration tab, and click Add Configuration.
  3. Enter the Key, Value, and Description. You can also select the configuration from the list displayed.

Configure Multiple TPT Data Writers and Readers

To configure multiple data writers and readers for TPT:

  1. Click Advanced Configuration -> Add Configuration.
  2. In the Add Configuration screen, set the following configuration parameters to the desired integer values (see the example values after these steps):
  • tpt_reader_instances
  • tpt_writer_instances

NOTE If Cloud Type is WASB and tpt_writer_instances is set to a value greater than 1, then tpt_add_directory_separator_to_object should be set to true.

  3. Click Save to save the configuration changes.
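
For example, the following illustrative entries configure four reader instances and two writer instances (the numbers are placeholder values; choose values appropriate for your workload):

    Key: tpt_reader_instances, Value: 4
    Key: tpt_writer_instances, Value: 2
    Key: tpt_add_directory_separator_to_object, Value: true (required only when Cloud Type is WASB and tpt_writer_instances is greater than 1)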

Configure the Default Data Fetch Mechanism

To configure the default data fetch mechanism, set the following configuration at the source level or the global level.

  1. Click Advanced Configuration -> Add Configuration.
  2. In the Add Configuration screen, enter data_fetch_type in the Key field.
  3. Enter one of the following values in the Value field:
  • jdbc
  • tpt
  • skip_iw_processing
  4. Click Save to save the configuration changes.
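
For example, to make TPT the default fetch mechanism, add the following entry:

    Key: data_fetch_type, Value: tpt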

Sync Data to Target

Using this option, you can configure the Target connections and sync data as described in the section Synchronizing Data to External Target.

The following are the steps to sync data to target.

  1. From the Data Sources menu, select the required source and click the View Source/Ingest button.
  2. Select the source table to be synchronized to the Target.
  3. Click the Sync Data to Target button.
  4. Enter the mandatory fields listed below:
Job Name: The name of the ingestion job.

Max Parallel Tables: The maximum number of tables that can be crawled at a given instance.

Compute Cluster: The template based on which the cluster will spin up for each table. The compute clusters created by the admin and accessible to the user are listed in the dropdown.

Overwrite Worker Count: The option to override the maximum and minimum number of worker node values as configured in the compute template.

Number of Worker Nodes: The number of worker nodes that will spin up in the cluster.

Save as a Table Group: The option to save the list of tables as a table group.

Click Onboarding an RDBMS Source to navigate back and complete the onboarding process.
