Infoworks 6.0.0
Getting Started

Configuring Infoworks with Snowflake

Introduction

Infoworks automates onboarding data directly to Snowflake and supports data transformation and orchestration in Snowflake. To onboard data directly to Snowflake, you should configure a Snowflake environment that includes cloud storage and one or more Spark clusters. Cloud storage is used temporarily to stage data during ingestion and to store sample data.

Prerequisites

  • Ensure that the Snowflake database user has an active Snowflake account.

  • Ensure that the Snowflake database user has the following privileges. These privileges apply to Snowflake tables (a sample grant script is sketched after this list).

    • User-Managed table: SELECT, INSERT, UPDATE, TRUNCATE, DELETE, and ALTER (if sync schema is enabled)
    • Infoworks-Managed table: OWNERSHIP
  • Ensure that the Snowflake database user has the following privileges. These privileges apply to the Snowflake schema.

    • User-Managed table (Target Schema): No additional permissions.
    • User-Managed table (Staging Schema): CREATE TABLE
    • Infoworks-Managed table (Target Schema): CREATE TABLE
    • Infoworks-Managed table (Staging Schema): CREATE TABLE
  • Infoworks requires access to an existing database on the Snowflake accounts configured with Infoworks, which is used to calculate billing information. By default, the public database is used. No data is stored in this database; it is needed for querying purposes only.

  • If you are unable to create the public database, configure an alternative database using the snowflake_billing_default_database configuration.

  • The database name must not contain the # character.
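
The privileges above can be granted with standard Snowflake GRANT statements. The following is a minimal sketch using the snowflake-connector-python package (an assumption; any SQL client works equally well), with placeholder role, database, schema, and table names that you should replace with the objects your Infoworks environment will use:

    # Sketch: grant the prerequisite privileges to the role used by the
    # Infoworks Snowflake database user. All object names are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="<account_identifier>",   # e.g. xy12345.us-east-1
        user="<admin_user>",
        password="<password>",
        role="SECURITYADMIN",             # any role allowed to grant privileges
    )
    cur = conn.cursor()

    # User-managed target table: DML privileges (ALTER only if sync schema is enabled).
    cur.execute(
        "GRANT SELECT, INSERT, UPDATE, TRUNCATE, DELETE, ALTER "
        "ON TABLE ANALYTICS_DB.TARGET_SCHEMA.ORDERS TO ROLE IW_ROLE"
    )

    # Staging schema for user-managed tables, and target/staging schemas for
    # Infoworks-managed tables: CREATE TABLE.
    cur.execute("GRANT CREATE TABLE ON SCHEMA ANALYTICS_DB.STAGING_SCHEMA TO ROLE IW_ROLE")
    cur.execute("GRANT CREATE TABLE ON SCHEMA ANALYTICS_DB.TARGET_SCHEMA TO ROLE IW_ROLE")

    # Infoworks-managed tables must be owned by the Infoworks role; ownership of an
    # existing table can be transferred explicitly if needed.
    cur.execute(
        "GRANT OWNERSHIP ON TABLE ANALYTICS_DB.TARGET_SCHEMA.IW_TABLE "
        "TO ROLE IW_ROLE COPY CURRENT GRANTS"
    )

    cur.close()
    conn.close()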

Procedure

To configure and connect to the required Snowflake environment, navigate to Admin > Manage Data Environments, and then click the Add button under the Snowflake option.

The following window appears:

There are three tabs to be configured as follows:

Data Environment

To configure the data environment details, enter values in the following fields. These define the environment parameters that connect Infoworks to the required Snowflake instance:
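
The exact field names on this tab may vary by release, but they correspond to standard Snowflake connection parameters (account, warehouse, user, authentication details, and role). The following is a minimal sketch, assuming the snowflake-connector-python package and placeholder values, for verifying that the details you plan to enter here can actually reach Snowflake before you save the environment:

    # Sketch: verify Snowflake connection details before entering them in the
    # Data Environment tab. All values below are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="<account_identifier>",    # e.g. xy12345.us-east-1
        user="<snowflake_user>",
        password="<password>",
        warehouse="<warehouse_name>",
        database="<default_database>",
        role="<role_name>",
    )
    cur = conn.cursor()
    cur.execute("SELECT CURRENT_ACCOUNT(), CURRENT_WAREHOUSE(), CURRENT_ROLE()")
    print(cur.fetchone())
    conn.close()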

After entering all the required values, click Continue to move to the Compute tab.

Compute

A compute template defines the infrastructure used to execute a job. This compute infrastructure requires access to the metastore and to the storage that holds the data to be processed. To configure the compute details, enter values in the following fields. These define the compute template parameters for the required Snowflake environment.

You can select one of the clusters as the default cluster for running jobs. However, this can be overridden at the individual job level.

Infoworks supports creating multiple persistent clusters in a Snowflake environment by clicking the Add Compute button.

Enter the fields in the Compute section:

NOTE If an admin edits the interactive cluster, the cluster restarts, and any job running on that cluster fails.

LIMITATIONS

The following limitations apply when running batch jobs on a Databricks persistent cluster:

  • Credential configuration issues: Overriding the configurations passed to Spark and to the distributed file system on the cluster during job initialization is not supported on Databricks persistent clusters. This can lead to job failures when multiple environments use different credentials.
  • Limitations on running CDATA sources: CDATA sources require RSD files to be distributed to all worker nodes when the cluster is initialized. This is not supported on persistent clusters because jobs are submitted to an already running cluster.
  • Limitations on the number of parallel jobs: The limit depends on the resources available to the Spark driver. Because a single Spark driver serves the cluster, jobs that consume driver resources can limit driver performance.
  • Switching between JAR versions: If the same JAR is used with different versions, Spark always picks the version that was installed first. There is no way for jobs to pick the required version. This is a product limitation.
  • Cluster restart after a JAR update: If a JAR is updated, the cluster must be restarted for Spark to pick up the updated JAR. This is required for upgrades and patches.

Storage

To configure the storage details, enter values in the following fields. These define the storage parameters for the required Snowflake environment. After configuring a storage option, you can make it the default storage for all jobs. However, this can be overridden at the individual job level.

NOTE To configure a new storage after the first-time configuration, click the Add button on the UI.

Enter the following fields under the Storage section:

Azure Data Lake Storage (ADLS) Gen 1

On selecting Azure Data Lake Storage (ADLS) Gen 1 as the storage type, the following fields appear:

Azure Data Lake Storage (ADLS) Gen 2

On selecting Azure Data Lake Storage (ADLS) Gen 2 as the storage type, the following fields appear:

WASB

On selecting WASB as the storage type, the following fields appear:

S3

On selecting S3 as the storage type, the following fields appear:

GCS

On selecting GCS as the storage type, the following fields appear:
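
Whichever storage type you select, the staging locations follow the standard URI conventions of the corresponding cloud storage service. The following is a sketch of typical path formats with placeholder account, container, and bucket names (assumptions for illustration, not values taken from the Infoworks UI); the exact fields requested on this tab may differ:

    # Sketch: typical staging-path URI formats for each supported storage type.
    # Account, container, bucket, and path names are placeholders.
    example_staging_paths = {
        "ADLS Gen 1": "adl://<account>.azuredatalakestore.net/<path>",
        "ADLS Gen 2": "abfss://<container>@<account>.dfs.core.windows.net/<path>",
        "WASB":       "wasbs://<container>@<account>.blob.core.windows.net/<path>",
        "S3":         "s3a://<bucket>/<path>",
        "GCS":        "gs://<bucket>/<path>",
    }

    for storage_type, uri in example_staging_paths.items():
        print(f"{storage_type}: {uri}")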

After entering all the required values, click Save. Click Return to Manage Environments to view and access the list of all configured environments. Edit, Clone, and Delete actions are available on the UI for every configured environment.
