Infoworks supports ingesting data from Snowflake database in a scalable and parallelized way.
For onboarding data from a Snowflake source, see Onboarding an RDBMS Source. Ensure that the Source Type selected is Snowflake.
Field | Description |
---|---|
Fetch Data Using | The mechanism through which Infoworks fetches data from the database. Currently, Infoworks supports fetching data using Spark. |
Connection URL | The URL of the Snowflake account through which Infoworks connects to the database. |
Account Name | Name of Snowflake account. |
Authentication Type | Select the type of authentication. This is a mandatory field. In the dropdown, select Default or OAuth. |
Username | Username of Snowflake account. Provide user's Snowflake username required to connect to Snowflake. This field appears only when the Authentication type is Default. |
Authentication Type for Password | Select the authentication type from the dropdown. For example, Infoworks Managed or External Secret Store. If you select Infoworks Managed, then provide Authentication Password for Password. If you select External Secret Store, then select the Secret which contains the password. |
OAuth Service | Infoworks supports OAuth service provided by Snowflake and Azure AD as external authorization provider. Select the required OAuth service. Choose from Snowflake or Azure Ad. |
Warehouse | Name of the Snowflake warehouse. This field is mandatory |
Data Environment Name | Environment defines where and how your data will be stored and accessed. Environment name must help the user to identify the environment being configured. User-defined. Provide a meaningful name for the environment being configured. |
Description | Description for the environment being configured. User-defined. Provide required description for the environment being configured. |
Client ID | In Azure AD: This is the ID for the application registered as Snowflake Client in Azure AD. In Snowflake: This is the public identifier for the security integration created in Snowflake. Provide the application id as applicable. |
Authentication Type for Client Secret | In Azure AD: The confidential secret for the application registered as Snowflake Client in Azure AD. In Snowflake: The confidential secret to connect to the Snowflake account. Provide the secret corresponding to the client ID above. If you select Infoworks Managed from the Authentication Type for Client Secret dropdown, then provide Authentication Password for Client Secret. If you select External Secret Store, select the secret from Secret for Client Secret dropdown. |
User Name | Snowflake service account user. Provide the account user details. |
Scope | A scope is a way to limit the permitted actions to a particular set of resources as part of a role assignment. Provide the scope as defined in Azure AD. This field appears when the OAuth Type selected is Azure AD |
Token End Point URL | End point to invoke for getting access token from the Azure AD. Provide the end point url. |
Authentication Type for Refresh Token | If included in the authorization URL, Snowflake presents the user with the option to consent to offline access. In this context, offline access refers to allowing the client to refresh access tokens when the user is not present. With user consent, the authorization server returns a refresh token in addition to an access token when redeeming the authorization code. Provide a refresh token. If you select Infoworks Managed, then provide Authentication Password for Refresh Token. If you select External Secret Store, then select the secret from the Secret for Refresh Token dropdown |
Section: Additional Parameters | Click the Add button to provide the parameters in key-value pair. Provide additional parameters required to connect to Snowflake. It is non mandatory. |
Section: Session Parameters | Click the Add button to provide the parameters in key-value pair. Provide session parameters required to connect to Snowflake. It is non mandatory. |
Once the settings are saved, you can test the connection.
With the source metadata in the catalog, you can now configure the table for CDC and incremental synchronization.
Step 1: Click the Configuration link, for the desired table.
Step 2: Provide the ingestion configuration details.
Field | Description |
---|---|
Query | The custom query based on which the table has been created. |
Ingest Type | The type of synchronization for the table. The options include full refresh and incremental. |
Natural Keys | The combination of keys to uniquely identify the row. This field is mandatory in incremental ingestion tables. It helps in identifying and merging incremental data with the already existing data on target. |
Incremental Mode | The option to indicate if the incremental data must be appended or merged to the base table. This field is displayed only for incremental ingestion. The options include append and merge. |
Incremental Fetch Mechanism | The fetch mechanism options include Archive Log and Watermark Column. This field is available only for Oracle log-based ingestion. |
Watermark Column | Select single/multiple watermark columns to identify the incremental records. The selected watermark column(s) should be of the same datatype. |
Enable Watermark Offset | For Timestamp and Date watermark columns, this option enables an additional offset (decrement) to the starting point for ingested data. Records created or modified within the offset time period are included in the next incremental ingestion job. |
Ingest subset of data | The option to configure filter conditions to ingest a subset of data. This option is available for all the RDBMS and Generic JDBC sources. For more details, see Filter Query for RDBMS Sources |
Configure the following fields:
Field | Description |
---|---|
Target Table Name | The name of the target table. |
Storage Format | The format in which the tables must be stored. The options include Read Optimized (Delta), Read Optimized (Parquet), Read Optimized (ORC), Write Optimized (Avro). |
Partition Column | The column used to partition the data in target. Selecting the Create Derived Column option allows you to derive a column and then use that as the partition column. This option is enabled only if the partition column datatype is date or timestamp. Provide the Derived Column Function and Derived Column Name. Data will be partitioned based on this derived column. |
Following are the steps to set advanced configuration for a table:
Step 1: Click the Data Catalog menu and click Ingest for the required source.
Step 2: Click the Configure Tables tab, click the Advanced Configuration tab and click Add Configuration.
Step 3: Enter key, value, and description. You can also select the configuration from the list displayed.
Using this option, you can configure the Target connections and sync data as described in the section Synchronizing Data to External Target.
The following are the steps to sync data to target.
Step 1:From the Data Sources menu, select one of the tables and click View Source/Ingest button.
Step 2: Select the source table to be synchronized to Target.
Step 3: Click the Sync Data to Target button.
Step 4: Enter the mandatory fields as listed in the table below:
Field | Description |
---|---|
Job Name | The name of the ingestion job. |
Max Parallel Tables | The maximum number of tables that can be crawled at a given instance. |
Compute Cluster | The template based on which the cluster will spin up for each table.The compute clusters created by admin and are accessible by the user are listed in the drop down. |
Overwrite Worker Count | The option to override the maximum and minimum number of worker node values as configured in the compute template |
Number of Worker Nodes | The number of worker nodes that will spin up in the cluster. |
Save as a Table Group | The option to save the list of tables as a table group. |
Click Onboarding an RDBMS Source to navigate back to complete the onboarding process.