Infoworks 6.2.0
Onboard Data

Onboarding Data from Generic JDBC Source Types

The generic JDBC source types allow you to onboard data from any data source that includes a JDBC driver. This feature allows you to create new sources by providing JDBC jars.

Prerequisites

  • The JDBC Driver must be JDBC Type 4 compliant and SQL-92 compliant.
  • The number of generic JDBC source types that can be registered depends on your license. Ensure that you are licensed to use generic JDBC source types and that you have free slots available to register new source types.

Limitation: When using a generic JDBC source, the ingestion performance and the load on the source system depend on the driver used.

Registering a Generic Source Type

To register a generic source type, perform the following steps:

  1. Navigate to Admin > Extensions > Generic Source Types.
  2. Click the Add Generic Source Type button.

NOTE You can delete a generic source type only after deleting the sources that use the specific generic source type.

  3. Enter the following details:

  • Source Classification — The classification of the source type to be created. Select Generic JDBC.
  • Name — A unique name for the generic JDBC source type to be registered. Enter the source name, for example, MySQL.
  • Driver Class Name — The driver class name for the JDBC jar. Enter the class name, for example, com.mysql.jdbc.Driver for a MySQL JDBC driver.
  • Upload Files — The option to upload the driver jars. Click the Upload Files button and select the file. You can upload multiple files.
  • Enter Server Location — The location on the edge node where the driver jars are available. Enter the server folder location.

  4. Click Save. The Generic JDBC Source will be registered as a new source type.

  5. Click Manage Source Type Registrations to view all the generic source types.
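Before registering, it can help to confirm that the driver class name you plan to enter actually resolves against the uploaded jars. The following is a hypothetical local sanity check (not an Infoworks API); run it with the driver jar on your classpath:

```java
// Hypothetical sanity check, not part of Infoworks: verify that a driver
// class name resolves on a classpath that includes the uploaded jar(s).
public class DriverCheck {
    /** Returns true if the given class name can be loaded. */
    static boolean isDriverAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // java.sql.Driver ships with the JDK, so it always loads; a vendor
        // class such as com.mysql.jdbc.Driver loads only when its jar is
        // on the classpath.
        System.out.println(isDriverAvailable("java.sql.Driver"));        // true
        System.out.println(isDriverAvailable("com.example.missing.Driver")); // false
    }
}
```

If the vendor class fails to load here, registering it in Infoworks with the same jars will also fail at connection time.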

Creating a Generic JDBC Source

For onboarding data from a generic JDBC source, see Onboarding an RDBMS Source. The name of the generic JDBC source type you registered is displayed in the Source Type drop-down list.

Generic JDBC Source Configurations

  • JDBC URL — The connection URL through which Infoworks connects to the database. Enter the JDBC URL for the source system; it typically takes the form jdbc:<subprotocol>://<host>:<port>/<database>.
  • Username — The username for the connection to the database. Enter the username, for example, root.
  • Authentication Type for Password — Select the authentication type from the drop-down, for example, Infoworks Managed or External Secret Store. If you select Infoworks Managed, provide the Authentication Password. If you select External Secret Store, select the Secret that contains the password.
  • Source Schema — The schema in the database to be crawled. Enter the source schema, for example, AUTOMATION_DB. The schema value is case-sensitive.
  • Additional Connection Parameters — Key-value pairs to be used when connecting to the source database. Click the Add button and enter each key-value pair. Check Encrypt Value to store sensitive information, and check Is Active to use the parameter.
  • Use catalog for querying the tables during the metadata crawl — Some JDBC drivers list tables when queried by catalog name rather than schema name, for example, MySQL. Select this checkbox to fetch tables via the catalog for such databases.
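The "Additional Connection Parameters" described above are driver options passed along with the connection. As an illustration only (the parameter names useSSL and serverTimezone are examples, not Infoworks defaults), active key-value pairs might be appended to the JDBC URL like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: appending active key-value connection parameters to a
// base JDBC URL. Inactive parameters (Is Active unchecked) are skipped.
public class JdbcUrlBuilder {
    /** Appends only the active parameters to the base JDBC URL. */
    static String withParams(String baseUrl, Map<String, String> params,
                             Map<String, Boolean> isActive) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char sep = baseUrl.contains("?") ? '&' : '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (Boolean.TRUE.equals(isActive.get(e.getKey()))) {
                sb.append(sep).append(e.getKey()).append('=').append(e.getValue());
                sep = '&';
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("useSSL", "false");
        params.put("serverTimezone", "UTC");
        Map<String, Boolean> active = new LinkedHashMap<>();
        active.put("useSSL", true);
        active.put("serverTimezone", false); // Is Active unchecked: skipped
        System.out.println(withParams("jdbc:mysql://db-host:3306/sales", params, active));
        // prints jdbc:mysql://db-host:3306/sales?useSSL=false
    }
}
```

Many drivers also accept the same options through a java.util.Properties object passed to DriverManager.getConnection; consult your driver's documentation for which form it expects.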

Source SQL Functions

You can override the default SQL functions at the source level and customize the generated SQLs for the corresponding driver. You can override the default SQL select clause at the table level.

  • Quoted Identifier — The syntax for quoting identifiers in SQL statements, used for escaping column and table names. Select a value from the drop-down list: double quote, single quote, or backtick.
  • Override Default Timestamp Cast Function — The option to override the default timestamp cast function used in the WHERE clause while constructing queries. Check this option to enable the Timestamp Cast Function field.
  • Timestamp Cast Function — The function to convert a value into the timestamp datatype, used for constructing the WHERE clause when the incremental column for the table is a timestamp column. Enter the function or use the default value. For example, cast('${column}' as datetime) for a MySQL source and cast('${column}' as timestamp) for Oracle.
  • Override Default Date Cast Function — The option to override the default date cast function used in the WHERE clause while constructing queries. Check this option to enable the Date Cast Function field.
  • Date Cast Function — The function to convert a value into the date datatype, used for constructing the WHERE clause when the incremental column for the table is a date column. Enter the function or use the default value, for example, cast('${column}' as date).
  • Sample Data Query — The query used to fetch sample data, for example, select ${columnList} from <schema>.${tableName} limit 100 for MySQL or select top 100 ${columnList} from <schema>.${tableName} for SQL Server. In the sample query, ${columnList} is a placeholder for all the columns to be queried and ${tableName} is a placeholder for the table name; both placeholders are mandatory. The schema prefix is prefilled based on the schema name configured for the source.
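The functions and queries above are templates: placeholders such as ${column}, ${columnList}, and ${tableName} are substituted when the SQL is generated. A minimal sketch of that substitution (this is an illustration, not Infoworks internals):

```java
import java.util.Map;

// Sketch of placeholder substitution for configured SQL functions/queries.
// Each ${name} token is replaced with its configured value.
public class SqlTemplate {
    /** Replaces each ${name} placeholder with its value from the map. */
    static String render(String template, Map<String, String> values) {
        String out = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            out = out.replace("${" + e.getKey() + "}", e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        String tsCast = "cast('${column}' as datetime)";
        System.out.println(render(tsCast, Map.of("column", "2024-01-01 00:00:00")));
        // prints cast('2024-01-01 00:00:00' as datetime)

        String sample = "select ${columnList} from ${tableName} limit 100";
        System.out.println(render(sample,
                Map.of("columnList", "id, name", "tableName", "AUTOMATION_DB.customers")));
        // prints select id, name from AUTOMATION_DB.customers limit 100
    }
}
```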

Once the settings are saved, you can test the connection.

Configuring a Generic JDBC Table

With the source metadata in the catalog, you can now configure the table for CDC and incremental synchronization.

  1. Click the Configuration link for the desired table.
  2. Provide the ingestion configuration details.
Ingestion Configuration

  • Query — The custom query based on which the table has been created. NOTE: This field is visible only if the table is ingested using Add Query as Table.
  • Ingest Type — The type of synchronization for the table. The options include full refresh and incremental.
  • Natural Keys — The combination of keys that uniquely identifies a row. This field is mandatory for incremental ingestion tables; it helps in identifying and merging incremental data with the data already existing on the target.
  • Incremental Mode — Indicates whether the incremental data must be appended or merged to the base table. This field is displayed only for incremental ingestion. The options include append, merge, and insert overwrite.
  • Watermark Column — Select one or more watermark columns to identify the incremental records. The selected watermark columns must be of the same datatype.
  • Ingest subset of data — The option to configure filter conditions to ingest a subset of data. For more details, see Filter Query for RDBMS Sources.

SQL Queries

  • Select Query — Additional inputs required with a select clause, such as optimization hints or lock changes. The default value is SELECT ${columnList} FROM ${tableName}, where ${columnList} is the placeholder for all the columns to be queried and ${tableName} is prefilled with the appropriate value. NOTES: (1) Ensure that the SQL query syntax is correct for the database used. (2) The select query can be used to select a subset of columns; it does not support a WHERE condition. To use custom conditions, onboard tables with the Query as Table option.

Target Configuration

  • Target Table Name — The name of the target table. The name cannot contain special characters other than underscore.
  • Schema Name — The name of the target schema. The name cannot contain special characters other than underscore.
  • Storage Format — The format in which the tables must be stored. The options include Read Optimized (Delta), Write Optimized (Avro), and Delimited Text Files.
  • Partition Column — The column used to partition the data in the target. Selecting the Create Derived Column option allows you to derive a column from the partition column; this option is enabled only if the partition column datatype is date or timestamp. Provide the Derived Column Function and Derived Column Name; the data will be partitioned based on the derived column.

Optimization Configuration

  • Split By Column — The column used to crawl the table in parallel over multiple connections to the database. The split-by column can be any existing column in the data for which minimum and maximum values can be computed. Select the Create Derived Split Column option and provide the Derived Split Column Function to derive a column from the split-by column; this option is enabled only if the split-by column datatype is date or timestamp. The data will be split based on the derived value.
  • Generate History View — The option to preserve data in the history table. If enabled, after each CDC the data is appended to the history table.
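The split-by idea above can be pictured as range partitioning: once the minimum and maximum of the split column are known, the table can be read over N parallel connections, each with its own WHERE range. A sketch under the assumption of a numeric split column and even ranges (this illustrates the concept, not Infoworks' actual query generation):

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch: build one WHERE predicate per parallel split for a
// numeric split-by column, given its observed min and max values.
public class SplitRanges {
    /** Builds one BETWEEN predicate per split over [min, max]. */
    static List<String> predicates(String col, long min, long max, int splits) {
        List<String> out = new ArrayList<>();
        long span = (max - min + 1 + splits - 1) / splits; // ceiling division
        for (int i = 0; i < splits; i++) {
            long lo = min + i * span;
            if (lo > max) break;                 // fewer rows than splits
            long hi = Math.min(lo + span - 1, max);
            out.add(col + " BETWEEN " + lo + " AND " + hi);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(predicates("id", 1, 100, 4));
        // four ranges: 1-25, 26-50, 51-75, 76-100
    }
}
```

This is why any column with computable minimum and maximum values can serve as a split-by key: the ranges only require an ordered domain.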

Advanced Configurations

Following are the steps to set advanced configuration for a table:

  1. Click the Data Catalog menu and click Ingest for the required source.

NOTE For an already ingested table, click View Source, click the Tables tab, click Configure for the required table and click the Advanced Configuration tab.

  2. Click Configure Tables > Advanced Configuration > Add Configuration.
  3. Enter the key, value, and description. You can also select the configuration from the list that appears.

NOTE You can also configure Catalog separator and Schema separator through Advanced configurations.
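The catalog and schema separators mentioned in the note above control how the fully qualified table identifier is assembled for the driver. A small illustration (the separator values shown are examples; most databases use "."):

```java
// Illustration: how a configurable separator changes the fully qualified
// table identifier sent to the JDBC driver. Most databases join schema and
// table with ".", but some drivers expect a different separator.
public class QualifiedName {
    /** Joins a schema (or catalog) and table name with the given separator. */
    static String qualify(String schema, String table, String separator) {
        return schema + separator + table;
    }

    public static void main(String[] args) {
        System.out.println(qualify("sales", "orders", "."));  // sales.orders
        System.out.println(qualify("sales", "orders", ":"));  // sales:orders
    }
}
```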

Sync Data to Target

Using this option, you can configure the target connections and sync data as described in the section Synchronizing Data to External Target.

The following are the steps to sync data to target.

  1. From the Data Sources menu, select one of the tables and click the View Source/Ingest button.
  2. Select the source table to be synchronized to Target.
  3. Click the Sync Data to Target button.
  4. Enter the mandatory fields as listed in the table below:
  • Job Name — The name of the ingestion job.
  • Max Parallel Tables — The maximum number of tables that can be crawled at a given instance.
  • Compute Cluster — The template based on which the cluster spins up for each table. The compute clusters created by the admin and accessible by the user are listed in the drop-down.
  • Overwrite Worker Count — The option to override the maximum and minimum number of worker nodes configured in the compute template.
  • Number of Worker Nodes — The number of worker nodes that will spin up in the cluster.
  • Save as a Table Group — The option to save the list of tables as a table group.

Previewing Query

You can preview the queries used for ingesting the table in the Preview Query tab.

Limitation The derived split-by option is currently not supported.

Click Onboarding an RDBMS Source to navigate back and complete the onboarding process.

Adding a Column to the Table

After the metadata crawl is complete, you can add a target column to the table.

A target column is a special column you need in the target table in addition to the columns present in the source.

You can select the datatype for the specific column.

You can select either of the following transformation modes: Simple or Advanced.

Simple Mode

In this mode, you must add a transformation function to be applied to the column. A target column with no transformation function applied will have null values in the target.

Advanced Mode

In this mode, you can provide a Spark expression in this field. For more information, refer to Adding Transform Derivation.

NOTE When a table is in the ready state (already ingested), schema editing is disabled.

Ingesting Data

For ingesting a generic JDBC source, see Onboarding Data.

Reference

For details on the additional drivers supported by Infoworks, see Additional Connectors for Onboarding Data.

  Last updated by Monika Momaya