Onboarding Data from Vertica

NOTE The Vertica version supported is 12.0.0.

Creating a Vertica Source

For onboarding data from a Vertica source, see Onboarding an RDBMS Source. Ensure that the Source Type selected is Vertica.

Vertica Configurations

Field	Description
Fetch Data Using	The mechanism through which Infoworks fetches data from the database.
Connection URL	The connection URL through which Infoworks connects to the database. The URL must be in the following format: `jdbc:vertica://<ip>:<port>/<databasename>`
Username	The username for the connection to the database.
Authentication Type for Password	Select the authentication type from the dropdown. For example, Infoworks Managed or External Secret Store. If you select Infoworks Managed, then provide Authentication Password for Password. If you select External Secret Store, then select the Secret which contains the password.
Source Schema	The schema in the database to be crawled. The schema value is case sensitive.

Once the settings are saved, you can test the connection.
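
As an illustration of the Connection URL field, the following minimal sketch assembles a URL from its parts, assuming Vertica's standard JDBC scheme `jdbc:vertica://`. The helper name `build_vertica_jdbc_url`, the host, and the database name are hypothetical; 5433 is Vertica's default port, and your deployment may use a different one.

```python
def build_vertica_jdbc_url(host: str, port: int, database: str) -> str:
    """Assemble a URL of the form jdbc:vertica://<ip>:<port>/<databasename>.

    Hypothetical helper for illustration; it is not part of Infoworks.
    """
    if not host or not database:
        raise ValueError("host and database must be non-empty")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"jdbc:vertica://{host}:{port}/{database}"


# Example values are placeholders, not real endpoints.
url = build_vertica_jdbc_url("10.0.0.12", 5433, "analytics")
print(url)
```

The resulting string is what you would paste into the Connection URL field before testing the connection.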

Configuring a Vertica Table

With the source metadata in the catalog, you can now configure the table for CDC and incremental synchronization.

Step 1: Click the Configuration link for the desired table.

Step 2: Provide the ingestion configuration details.

Field	Description
Fetch Using	The method used to run the Spark job. The options include JDBC and Spark Vertica Connector.
Query	The custom query based on which the table has been created. NOTE This field is only visible if the table is ingested using Add Query as Table.
Ingest Type	The type of synchronization for the table. The options include full refresh and incremental.
Natural Keys	The combination of keys that uniquely identifies a row. This field is mandatory for incremental ingestion tables. It is used to identify incremental data and merge it with the existing data on the target. NOTE At least one of the columns in the natural key must have a non-null value for Infoworks merge to work.
Incremental Mode	The option to indicate if the incremental data must be appended or merged to the base table. This field is displayed only for incremental ingestion. The options include append and merge.
Incremental Fetch Mechanism	The fetch mechanism options include Archive Log and Watermark Column. This field is available only for Oracle log-based ingestion.
Watermark Column	Select one or more watermark columns to identify the incremental records. All selected watermark columns must be of the same datatype.
Enable Watermark Offset	For Timestamp and Date watermark columns, this option enables an additional offset (decrement) to the starting point for ingested data. Records created or modified within the offset time period are included in the next incremental ingestion job. NOTE A Timestamp watermark column has three offset options: Days, Hours, and Minutes. A Date watermark column has only the Days option. In both cases, the selected offset is subtracted from the starting point.
Ingest subset of data	The option to configure filter conditions to ingest a subset of data. This option is available for all the RDBMS and Generic JDBC sources. For more details, see Filter Query for RDBMS Sources.
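
The watermark offset described in the table can be sketched as simple date arithmetic. This is a minimal illustration, not the Infoworks implementation: the function name `offset_start` and the sample values are hypothetical. It only shows how a starting point is decremented by Days, Hours, or Minutes for a Timestamp watermark, and by Days for a Date watermark.

```python
from datetime import date, datetime, timedelta
from typing import Union


def offset_start(last_watermark: Union[date, datetime],
                 days: int = 0, hours: int = 0, minutes: int = 0) -> Union[date, datetime]:
    """Return the starting point for the next incremental fetch,
    decremented by the configured offset (hypothetical helper)."""
    # datetime is a subclass of date, so check for datetime first.
    if isinstance(last_watermark, datetime):
        return last_watermark - timedelta(days=days, hours=hours, minutes=minutes)
    # A plain Date watermark supports only a Days offset.
    if hours or minutes:
        raise ValueError("Date watermark columns support only a Days offset")
    return last_watermark - timedelta(days=days)


# Timestamp watermark with a 2-hour offset.
print(offset_start(datetime(2023, 3, 20, 12, 0), hours=2))  # 2023-03-20 10:00:00
# Date watermark with a 1-day offset.
print(offset_start(date(2023, 3, 20), days=1))              # 2023-03-19
```

Records whose watermark value falls between the decremented starting point and the previous starting point are fetched again, which is why the merge mode and a reliable natural key matter when an offset is enabled.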

Target Configuration

Configure the following fields:

Field	Description
Target Table Name	The name of the target table.
Storage Format	The format in which the tables must be stored. The options include Read Optimized (Delta), Read Optimized (Parquet), Read Optimized (ORC), Write Optimized (Avro).
Partition Column	The column used to partition the data in target. Selecting the Create Derived Column option allows you to derive a column and then use that as the partition column. This option is enabled only if the partition column datatype is date or timestamp. Provide the Derived Column Function and Derived Column Name. Data will be partitioned based on this derived column.
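
To make the derived partition column concrete, here is a minimal sketch of deriving a coarser value from a timestamp column and storing it under a new column name. The function names in `DERIVE_FUNCTIONS`, the helper `add_derived_partition`, and the sample column names are all assumptions for illustration; the actual Derived Column Function choices are those listed in the Infoworks UI.

```python
from datetime import datetime

# Hypothetical derive functions keyed by an illustrative name.
DERIVE_FUNCTIONS = {
    "year": lambda ts: f"{ts.year:04d}",
    "year_month": lambda ts: f"{ts.year:04d}-{ts.month:02d}",
    "year_month_day": lambda ts: ts.strftime("%Y-%m-%d"),
}


def add_derived_partition(rows, source_column, function_name, derived_name):
    """Return copies of rows with a derived column computed from source_column."""
    derive = DERIVE_FUNCTIONS[function_name]
    return [{**row, derived_name: derive(row[source_column])} for row in rows]


rows = [
    {"order_id": 1, "order_ts": datetime(2023, 3, 20, 9, 30)},
    {"order_id": 2, "order_ts": datetime(2023, 4, 2, 18, 5)},
]
for row in add_derived_partition(rows, "order_ts", "year_month", "order_month"):
    print(row["order_id"], row["order_month"])
```

Partitioning on a derived month or day value rather than on the raw timestamp keeps the number of partitions bounded, which is the usual reason to derive the column in the first place.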

Optimization Configuration

Configure the following fields:

Field	Description
Split By Column	The column used to crawl the table in parallel with multiple connections to the database. The split-by column can be an existing column in the data. Any column for which minimum and maximum values can be computed can be a split-by key. Select the Create Derived Split Column option and provide the Derived Split Column Function to derive a column from the Split By column. This option is enabled only if the Split By column datatype is date or timestamp. The data will be split based on the derived value. NOTE This option is not available with the Spark Vertica Connector.
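
The reason a split-by column needs computable minimum and maximum values can be shown with a small sketch: the value range is divided into contiguous sub-ranges, and each sub-range can be fetched over its own connection. This is an assumed, simplified scheme for integer keys, not the exact logic Infoworks or Spark uses; `split_ranges` and the column name `order_id` are hypothetical.

```python
def split_ranges(min_value: int, max_value: int, num_splits: int):
    """Divide [min_value, max_value] into contiguous inclusive ranges.

    Returns at most num_splits (start, end) tuples of near-equal size.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    total = max_value - min_value + 1
    num_splits = min(num_splits, total)  # never create empty ranges
    base, extra = divmod(total, num_splits)
    ranges = []
    start = min_value
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each range could back one parallel fetch, for example as a WHERE predicate.
for lo, hi in split_ranges(1, 100, 4):
    print(f"order_id BETWEEN {lo} AND {hi}")
```

A column with heavily skewed values produces uneven splits under this scheme, so a roughly uniformly distributed key is generally the better choice.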

Advanced Configurations

Following are the steps to set advanced configuration for a table:

Step 1: Click the Data Catalog menu and click Ingest for the required source.

NOTE For an already ingested table, click View Source, click the Tables tab, click Configure for the required table and click the Advanced Configuration tab.

Step 2: Click the Configure Tables tab, click the Advanced Configuration tab and click Add Configuration.

Step 3: Enter key, value, and description. You can also select the configuration from the list displayed.

Sync Data to Target

Using this option, you can configure the Target connections and sync data as described in the section Synchronizing Data to External Target.

The following are the steps to sync data to target.

Step 1: From the Data Sources menu, select one of the tables and click the View Source/Ingest button.

Step 2: Select the source table to be synchronized to Target.

Step 3: Click the Sync Data to Target button.

Step 4: Enter the mandatory fields as listed in the table below:

Field	Description
Job Name	The name of the ingestion job.
Max Parallel Tables	The maximum number of tables that can be crawled at a given time.
Compute Cluster	The template based on which the cluster will spin up for each table. The compute clusters that are created by the admin and are accessible to the user are listed in the dropdown.
Overwrite Worker Count	The option to override the maximum and minimum number of worker nodes configured in the compute template.
Number of Worker Nodes	The number of worker nodes that will spin up in the cluster.
Save as a Table Group	The option to save the list of tables as a table group.

Click Onboarding an RDBMS Source to navigate back to complete the onboarding process.

Last updated on Mar 20, 2023