Infoworks 6.1.3
Onboard Data

Onboarding Data from SQL Server

Configuring a SQL Server Source

For onboarding data from a SQL Server source, see Onboarding a RDBMS Source. Ensure that the Source Type selected is SQL Server.

SQL Server Configurations

Field | Description
Fetch Data Using | The mechanism through which Infoworks fetches data from the database.
Connection URL | The connection URL through which Infoworks connects to the database. The URL must be in the following format: jdbc:sqlserver://<ip>;database=<databasename>
Username | The username for the connection to the database.
Authentication Type for Password | Select the authentication type from the dropdown: Infoworks Managed or External Secret Store. If you select Infoworks Managed, provide the Authentication Password for Password. If you select External Secret Store, select the Secret that contains the password. NOTE: Only the secrets accessible to the user are available in the drop-down list.
Source Schema | The schema in the database to be crawled. The schema value is case sensitive.
Custom Tags | The dropdown list of tags you can choose from, used to identify or segregate the source based on the selected tags.

Once the settings are saved, you can test the connection.

NOTE: To connect to Azure SQL using the SQL Server connector, use catalog filters to load tables and schemas from a single catalog. Ensure that the catalog filter matches exactly one catalog; unlike other connectors, you cannot browse the entire source. Perform the following:

Add the catalog name to the Connection URL string. For example, to connect to the test_sql_pool_catalog, the following string can be used: jdbc:sqlserver://<ip>;database=test_sql_pool

Use the same catalog name in the Catalog filter field.

Configuring a SQL Server Table

With the source metadata in the catalog, you can now configure the table for CDC and incremental synchronization.

  • Click the Configuration link for the desired table.
  • Provide the ingestion configuration details.
Field | Description
Ingestion Configuration
Query | The custom query based on which the table has been created. NOTE: This field is visible only if the table was ingested using Add Query as Table.
Ingest Type | The type of synchronization for the table. The options include full refresh and incremental.
Natural Keys | The combination of keys that uniquely identifies a row. This field is mandatory for incremental ingestion tables; it is used to identify and merge incremental data with the data already existing on the target. NOTE: At least one of the columns in the natural key must have a non-null value for the Infoworks merge to work.
Incremental Mode | The option to indicate whether incremental data is appended or merged to the base table. Displayed only for incremental ingestion. The options include append, merge, and insert overwrite.
Incremental Fetch Mechanism | The fetch mechanism options include Archive Log and Watermark Column. This field is available only for Oracle log-based ingestion.
Watermark Column | Select one or more watermark columns to identify the incremental records. The selected watermark columns must be of the same datatype.
Enable Watermark Offset | For Timestamp and Date watermark columns, this option enables an additional offset (decrement) to the starting point for ingested data. Records created or modified within the offset period are included in the next incremental ingestion job. NOTE: A Timestamp watermark column has three offset options (Days, Hours, and Minutes); a Date watermark column has only the Days option. In both cases, the offset is decremented from the starting point.
Ingest subset of data | The option to configure filter conditions to ingest a subset of data. For more details, see Filter Query for RDBMS Sources.
Sorting Columns | Select the columns on which the target table will be sorted. NOTE: This feature is available only for Snowflake environments.
Target Configuration
Target Table Name | The name of the target table. The name cannot contain special characters except underscore.
Database Name | The name of the target database.
Catalog Name | The name of the target catalog. NOTE: This field is available only for Unity Catalog-enabled data environments.
Staging Catalog Name | The staging catalog name for temporary tables.
Staging Schema Name | The staging schema name for temporary tables. NOTE: Applicable only for Unity Catalog-enabled environments on Azure Databricks.
Schema Name | The name of the target schema. The name cannot contain special characters except underscore.
Merged Table Name | The name of the merged table. This table is created only in Merge mode and holds the de-duplicated data.
Table Name | The name of the target table maintained in the Infoworks UI.
Storage Format | The format in which the tables must be stored. The options include Read Optimized (Delta), Write Optimized (Avro), and Universal Format (UniForm). NOTE: Selecting the storage format is not supported in Snowflake environments. UniForm is available only for Unity Catalog-enabled data environments.
Partition Column | The column used to partition the data on the target. Selecting the Create Derived Column option allows you to derive a column and use it as the partition column; this option is enabled only if the partition column datatype is date or timestamp. Provide the Derived Column Function and Derived Column Name; data will be partitioned based on this derived column.
Optimization Configuration
Split By Column | The column used to crawl the table in parallel over multiple connections to the database. The split-by column can be an existing column in the data; any column for which minimum and maximum values can be computed can be a split-by key. Select the Create Derived Split Column option and provide the Derived Split Column Function to derive a column from the split-by column; this option is enabled only if the split-by column datatype is date or timestamp. The data will be split based on the derived value.
Generate History View | The option to preserve data in the history table. If enabled, after each CDC run, the data is appended to the history table.
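As an illustration of the split-by mechanism described above, the boundary computation typically amounts to a min/max query over the split-by column, followed by one range read per connection. The table and column names below are hypothetical, and the exact queries Infoworks generates are not shown in this document:

```sql
-- Compute the split boundaries for a hypothetical split-by column "id".
SELECT MIN(id) AS min_id, MAX(id) AS max_id
FROM dbo.orders;

-- One of the resulting parallel range reads
-- (assuming 4 splits over an id range of 1..4000):
SELECT *
FROM dbo.orders
WHERE id >= 1 AND id < 1001;
```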

Performing Incremental Ingestion from SQL Server Replication Table

Through the MSSQL Server log-based method, data is ingested and continuously synchronized with the database using the captured CDC tables provided by MSSQL Server. However, this method works only with the enterprise editions of MSSQL Server. In this method, change data capture must be enabled for the incremental tables. Before enabling CDC for a table, it must be enabled for the database in which the table resides.

Infoworks uses SQL Server transactional replication to capture change data from SQL Server distribution databases. For CDC to work, you must enable SQL Server Replication on the system from which change data is to be captured. If the database has a high volume of change activity, a distributed server should be used to host the distribution database.

To configure CDC in Infoworks, you must define a capture registration for each source table. In the capture registration, you can select a subset of columns for which the data is to be captured.

Prerequisites

  1. The replication table must be enabled in the source SQL Server database.

Logging in MSSQL Server is enabled using the CDC feature of SQL Server. This feature is bundled with the Enterprise, Developer, and Evaluation editions of MSSQL Server. Once logging is enabled for a table, the corresponding table changes are stored in a capture table instance in the cdc schema. The capture table name can be the same as the source table name, or the user can configure it while enabling logging for the table.

Enabling logging is a two-step process: perform the first step below, followed by either the second or the third. The scripts assume a database called automation_db containing a table named timestamp_inserts_updates.

  • Enable CDC for the database automation_db by executing the CDC enablement procedure on the database.
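A minimal sketch of this step, using SQL Server's documented sys.sp_cdc_enable_db procedure and the example database from this section:

```sql
-- Switch to the database to be CDC-enabled (the example database in this guide).
USE automation_db;
GO
-- Enable CDC at the database level; this creates the cdc schema,
-- the cdc user, and metadata tables such as cdc.change_tables.
EXEC sys.sp_cdc_enable_db;
GO
```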
  • Enable CDC for the table timestamp_inserts_updates with the default capture instance name.

NOTE: If you do not specify a capture instance name, SQL Server derives the default capture instance name from the source table name, and the change table is created in the cdc schema.

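A sketch of this step using the documented sys.sp_cdc_enable_table procedure; the dbo schema and the NULL gating role are assumptions for illustration:

```sql
USE automation_db;
GO
-- Enable CDC for the table, letting SQL Server pick the default
-- capture instance name.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',                        -- assumed schema
    @source_name   = N'timestamp_inserts_updates',
    @role_name     = NULL;  -- no gating role; use a role name to restrict access
GO
```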
  • Enable CDC for the table with a user-specified capture instance name.

NOTE: If you specify a capture instance name while enabling logging for the table, the change table for the source table timestamp_inserts_updates is created under that capture instance name (for example, name_of_capture_instance).

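The same procedure, this time with an explicit capture instance name; the dbo schema and the NULL gating role remain assumptions:

```sql
USE automation_db;
GO
-- Enable CDC with an explicit capture instance name; the change table
-- is created under that name rather than one derived from the table name.
EXEC sys.sp_cdc_enable_table
    @source_schema    = N'dbo',                        -- assumed schema
    @source_name      = N'timestamp_inserts_updates',
    @capture_instance = N'name_of_capture_instance',
    @role_name        = NULL;
GO
```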
  • Once logging is enabled for a table, the metadata of the corresponding capture table is stored in the cdc.change_tables table, which can be queried to verify the list of captured tables.
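A query along these lines lists the registered capture instances; the column names are from the documented cdc.change_tables catalog table:

```sql
-- List CDC-enabled tables and their capture instances.
SELECT capture_instance,
       OBJECT_SCHEMA_NAME(source_object_id) AS source_schema,
       OBJECT_NAME(source_object_id)        AS source_table,
       create_date
FROM cdc.change_tables;
```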

The output of this query shows the CDC-enabled tables.

Following are the steps to perform incremental ingestion from SQL Server replication table:

  1. Navigate to the General Configurations page for the required table.
  2. Select the Use Capture Table for Incremental check box.
  3. In the Capture Table Name field, enter the name of the capture table provided while enabling replication in the source database. When you perform CDC, the changed data is fetched from this table.

Known Issue: If a SQL Server replication table undergoes a schema change, the table must be recrawled (Tables > Recrawl Metadata) and the table configuration must be saved again (General Configurations).

Advanced Configurations

Following are the steps to set advanced configuration for a table:

  1. Click the Data Sources menu and click Ingest for the required source.

NOTE: For an already ingested table, click View Source, click the Tables tab, click Configure for the required table, and then click the Advanced Configuration tab.

  2. Click the Configure Tables tab, click the Advanced Configuration tab, and click Add Configuration.
  3. Enter the key, value, and description. You can also select a configuration from the displayed list.

Sync Data to Target

Using this option, you can configure the target connections and sync data as described in the section Synchronizing Data to External Target.

The following are the steps to sync data to target.

  1. From the Data Sources menu, select the required source and click the View Source/Ingest button.
  2. Select the source table to be synchronized to Target.
  3. Click the Sync Data to Target button.
  4. Enter the mandatory fields as listed in the table below:
Field | Description
Job Name | The name of the ingestion job.
Max Parallel Tables | The maximum number of tables that can be crawled at a given instance.
Compute Cluster | The template based on which the cluster spins up for each table. The compute clusters created by the admin and accessible to the user are listed in the dropdown.
Overwrite Worker Count | The option to override the maximum and minimum number of worker nodes configured in the compute template.
Number of Worker Nodes | The number of worker nodes that will spin up in the cluster.
Save as a Table Group | The option to save the list of tables as a table group.

Click Onboarding a RDBMS Source to navigate back to complete the onboarding process.

  Last updated by Monika Momaya