Infoworks 5.5.2
Getting Started

Configuring Infoworks with Snowflake

Introduction

Infoworks automates onboarding data directly to Snowflake and supports data transformation and orchestration in Snowflake. To onboard data directly to Snowflake, you should configure a Snowflake environment that includes cloud storage and one or more Spark clusters. Cloud storage is used temporarily to stage data during ingestion and to store sample data.

Prerequisites

  • Ensure that the Snowflake database user has an active Snowflake account.

  • Ensure that the Snowflake database user has the following privileges. These privileges apply to the Snowflake tables.

    • User-Managed table: SELECT, INSERT, UPDATE, TRUNCATE, DELETE, and ALTER (if sync schema is enabled)
    • Infoworks-Managed table: OWNERSHIP
  • Ensure that the Snowflake database user has the following privileges. These privileges apply to the Snowflake schema.

    • User-Managed table (Target Schema): No additional permissions.
    • User-Managed table (Staging Schema): CREATE TABLE
    • Infoworks-Managed table (Target Schema): CREATE TABLE
    • Infoworks-Managed table (Staging Schema): CREATE TABLE
  • Infoworks requires access to an existing database on the Snowflake accounts configured with Infoworks, which it uses to calculate billing information. By default, the public database is used. No data is stored in this database; it is needed for querying purposes only (see the verification sketch after this list).

  • If you are unable to create the public database, configure an alternative database using the snowflake_billing_default_database configuration.

  • The database name must not contain the # character.
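
The privileges and the billing database can be checked up front. Below is a minimal verification sketch, assuming the snowflake-connector-python package and hypothetical account, user, and warehouse names; adapt the values to your own account.

```python
# Minimal verification sketch (pip install snowflake-connector-python).
# Account, user, and warehouse names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1",   # hypothetical Snowflake account identifier
    user="IFW_USER",               # the database user Infoworks will use
    password="...",
    warehouse="IFW_WH",
)
cur = conn.cursor()

# Confirm the billing database (PUBLIC by default) is visible to the user.
cur.execute("SHOW DATABASES LIKE 'PUBLIC'")
print(cur.fetchall())

# Inspect the grants held by the user to cross-check the table and schema privileges above.
cur.execute("SHOW GRANTS TO USER IFW_USER")
for row in cur.fetchall():
    print(row)

conn.close()
```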

Procedure

To configure and connect to the required Snowflake environment, navigate to Admin > Manage Data Environments, and then click the Add button under the Snowflake option.

The configuration window that appears has three tabs, which are configured as follows:

Data Environment

To configure the data environment details, enter values in the following fields. These define the environment parameters that allow Infoworks to connect to the required Snowflake instance:

| Field | Description | Details |
| --- | --- | --- |
| Data Environment Name | Environment defines where and how your data will be stored and accessed. The environment name must help the user identify the environment being configured. | User-defined. Provide a meaningful name for the environment being configured. |
| Description | Description for the environment being configured. | User-defined. Provide the required description for the environment being configured. |
| Connection URL | URL of the Snowflake account. | User-defined. Provide the required Snowflake account URL. |
| Account Name | Name of the Snowflake account. | Provide the required Snowflake account name. |

| Field | Description | Details |
| --- | --- | --- |
| Profile Name | Unique name to identify the profile. | This is a mandatory field. |
| Authentication Type | Select the type of authentication. | This is a mandatory field. In the dropdown, select Default or OAuth. |
| Username | Username of the Snowflake account. | Provide the user's Snowflake username required to connect to Snowflake. This field appears only when the Authentication Type is Default. |
| Password | Password of the Snowflake account. | Provide the user's Snowflake password required to connect to Snowflake. This field appears only when the Authentication Type is Default. |
| OAuth Service | Infoworks supports the OAuth service provided by Snowflake and Azure AD as an external authorization provider. | Select the required OAuth service: Snowflake or Azure AD. |
| Client ID | In Azure AD: the ID of the application registered as a Snowflake client in Azure AD. In Snowflake: the public identifier of the security integration created in Snowflake. | Provide the application ID as applicable. |
| Client Secret | In Azure AD: the confidential secret of the application registered as a Snowflake client in Azure AD. In Snowflake: the confidential secret used to connect to the Snowflake account. | Provide the secret corresponding to the Client ID above. |
| Scope | A scope is a way to limit the permitted actions to a particular set of resources as part of a role assignment. | Provide the scope as defined in Azure AD. This field appears only when the OAuth service selected is Azure AD. |
| Token End Point URL | Endpoint to invoke for getting an access token from Azure AD. | Provide the endpoint URL. |
| Refresh Token | If included in the authorization URL, Snowflake presents the user with the option to consent to offline access, that is, allowing the client to refresh access tokens when the user is not present. With user consent, the authorization server returns a refresh token in addition to an access token when redeeming the authorization code. | Provide a refresh token. |
| Section: Additional Parameters | Click the Add button to provide the parameters as key-value pairs. | Provide any additional parameters required to connect to Snowflake. This field is not mandatory. |
| Section: Session Parameters | Click the Add button to provide the parameters as key-value pairs. | Provide any session parameters required to connect to Snowflake. This field is not mandatory. |
| Warehouses | Click the icon to add the name and description of the warehouse(s). | This field is mandatory. More than one warehouse can be added, and one of them can be set as the default. |
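
For reference, the connection fields above can be exercised outside Infoworks with the Snowflake Python connector. The sketch below uses hypothetical account, user, warehouse, and token values; the OAuth variant assumes an access token has already been obtained from the configured OAuth provider.

```python
import snowflake.connector

# Default authentication: username and password from the profile.
conn = snowflake.connector.connect(
    account="xy12345.us-east-1",                       # hypothetical account identifier
    user="IFW_USER",
    password="...",
    warehouse="IFW_WH",                                # one of the warehouses added above
    session_parameters={"QUERY_TAG": "infoworks"},     # optional session parameters
)

# OAuth authentication: a token issued by Snowflake or Azure AD.
conn_oauth = snowflake.connector.connect(
    account="xy12345.us-east-1",
    authenticator="oauth",
    token="<access-token>",                            # token from the OAuth provider
    warehouse="IFW_WH",
)
```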

After entering all the required values, click Continue to move to the Compute tab.

Compute

A Compute template is the infrastructure used to execute a job. This compute infrastructure requires access to the metastore and the storage that needs to be processed. To configure the compute details, enter values in the following fields. These define the compute template parameters that allow Infoworks to be configured against the required Snowflake instance.

You can select one of the clusters as the default cluster for running jobs. However, this can be overridden at the individual job level.

Infoworks supports creating multiple persistent clusters in a Snowflake environment by clicking the Add Compute button.

Enter the fields in the Compute section:

Compute Provider

| Field | Description | Details |
| --- | --- | --- |
| Compute Provider | The type of compute provider required for creating compute templates. | Choose from the available options: Azure Databricks, Databricks on AWS, and Databricks on GCP. |
| Workspace URL | URL of the workspace that Infoworks must be attached to. | Provide the required workspace URL. |
| Databricks Token | Databricks access token of the user who uses Infoworks. | Provide the required Databricks token. |
| Region | Geographical location where you can host your resources. | Provide the required region. For example: East US. |

Compute Clusters

| Field | Description | Details |
| --- | --- | --- |
| Cluster Type | The type of compute cluster that you want to launch. | Choose from the available options: Persistent or Ephemeral. Jobs can be submitted on both Ephemeral and Persistent clusters. |
| Name | Name of the compute template that you want to use for the jobs. | User-defined. Provide a meaningful name for the compute template being configured. |
| Description | Description of the compute template. | User-defined. Provide the required description for the compute template being configured. |
| Runtime Version | Runtime version of the compute cluster being used. | Select the Runtime version as 9.1 from the drop-down for Snowflake. If you have selected Azure Databricks from the Compute Provider dropdown, the Runtime Version dropdown shows 9.1, 11.3, and 11.3_photon_enabled; you can select any of them. If you have selected Databricks on AWS or Databricks on GCP, only version 9.1 appears. |
| Region | Geographical location where you can host your resources. | Provide the required region. For example: East US. |
| Workspace URL | URL of the workspace that Infoworks must be attached to. | Provide the required workspace URL. |
| Databricks Token | Databricks access token of the user who uses Infoworks. | Provide the required Databricks token. |
| Allow Single Node Instance | Option to run single node clusters. | A single node cluster consists of an Apache Spark driver and no Spark workers. |
| Use Instance Pool | Option to use a set of idle instances, which optimizes cluster start and auto-scaling times. | If the Use Instance Pool check box is selected, provide the ID of the created instance pool in the additional field that appears. |
| Worker Type | Worker type configured in the edge node. | This field appears only if the Use Instance Pool check box is cleared. Provide the required worker type. For example: Standard_L4. |
| Driver Type | Driver type configured in the edge node. | This field appears only if the Use Instance Pool check box is cleared. Provide the required driver type. For example: Standard_L8. |
| Max Allowed Worker Nodes | Maximum number of worker instances allowed. | Provide the maximum allowed limit of worker instances. |
| Enable Autoscale | Option for the cluster to automatically scale the number of workers between the configured minimum and maximum. | Select this option to enable autoscaling. |
| Default Min Workers | Minimum number of workers that the workspace maintains. | This field appears only if the Enable Autoscale check box is selected. |
| Default Max Workers | Maximum number of workers that the workspace maintains. | This field appears only if the Enable Autoscale check box is selected. This must be greater than or equal to the Default Min Workers value. |
| Number of Worker Nodes | Number of workers configured for availability. | This field appears only if the Enable Autoscale check box is cleared. |
| Terminate after minutes of inactivity | Number of minutes of inactivity after which the cluster is terminated. | Provide the minimum number of minutes of inactivity to be maintained before termination. |
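
For context, the compute cluster fields above map closely to the settings accepted by the Databricks Clusters API. The sketch below is purely illustrative and is not how Infoworks provisions clusters internally; the workspace URL, token, and node types are hypothetical placeholders.

```python
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical Workspace URL
TOKEN = "<databricks-token>"                                           # Databricks Token

# Mapping of the UI fields to a Databricks cluster specification.
cluster_spec = {
    "cluster_name": "iwx-persistent-cluster",           # Name
    "spark_version": "9.1.x-scala2.12",                 # Runtime Version 9.1
    "node_type_id": "Standard_L4s",                     # Worker Type (hypothetical)
    "driver_node_type_id": "Standard_L8s",              # Driver Type (hypothetical)
    "autoscale": {"min_workers": 2, "max_workers": 8},  # Enable Autoscale: Default Min/Max Workers
    "autotermination_minutes": 30,                      # Terminate after minutes of inactivity
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.status_code, resp.json())
```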

NOTE If an admin edits the interactive cluster, the cluster restarts, and hence the job running on that cluster fails.

LIMITATIONS

You will encounter the following limitations while running batch jobs on a Databricks Persistent Cluster:

  • Credential configuration issues: Overriding the configurations passed to Spark and the Distributed File System on the cluster during job initialization is not supported on Databricks persistent clusters, and could potentially lead to job failures when there are multiple environments with different credentials.
  • Limitations on running CDATA sources: CDATA sources require RSD files to be passed to all the worker nodes during cluster initialization. This is not supported on persistent clusters, because jobs are submitted to an already running cluster.
  • Limitations on the number of parallel jobs: This depends on the resources available to the single Spark driver running on the cluster; the number of jobs consuming driver resources can limit driver performance.
  • Switching JARs between different versions: If the same JAR is used with different versions, Spark always picks the one that was installed first. There is no way for jobs to pick the right version. This is a limitation from the product side.
  • Restart cluster after JAR update: If a JAR is updated, the cluster must be restarted for Spark to pick up the updated JAR. This is required for upgrades or patches.

Storage

To configure the storage details, enter values in the following fields. These define the storage parameters that allow Infoworks to be configured against the required Snowflake instance. After configuring a storage, you can choose to make it the default storage for all jobs. However, this can be overridden at the individual job level.

NOTE To configure a new storage after the first-time configuration, click the Add button on the UI.

Enter the following fields under the Storage section:

| Field | Description | Details |
| --- | --- | --- |
| Name | Storage name must help the user identify the storage credentials being configured. | User-defined. Provide a meaningful name for the storage setup being configured. |
| Description | Description for the storage setup being configured. | User-defined. Provide the required description for the storage setup being configured. |
| Storage Type | Type of storage system where all the artifacts will be stored. This depends on the cloud/platform provider chosen in the Compute tab. The available options for Azure are Azure DataLake Storage (ADLS) Gen 1, Azure DataLake Storage (ADLS) Gen 2, and WASB. The available option for AWS is S3. The available option for GCP is GCS. | Select the required storage type from the drop-down menu. |

Snowflake Integration

| Field | Description | Details |
| --- | --- | --- |
| Storage Integration Name | Name of the Snowflake storage integration object used to authenticate between Snowflake and the cloud storage. | Provide the required storage integration name. |
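
The storage integration named here must already exist in Snowflake. As an illustration only, the sketch below creates one for S3 through the Python connector; the integration name, role ARN, and bucket are hypothetical, and the statement must be run by a user with the CREATE INTEGRATION privilege.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1",   # hypothetical account identifier
    user="ADMIN_USER",             # a user whose role has the CREATE INTEGRATION privilege
    password="...",
)

# Create a storage integration that lets Snowflake authenticate to the cloud staging location.
conn.cursor().execute("""
    CREATE STORAGE INTEGRATION iwx_s3_integration
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/iwx-stage-role'
      STORAGE_ALLOWED_LOCATIONS = ('s3://iwx-staging-bucket/')
""")
conn.close()
```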

Azure DataLake Storage (ADLS) Gen 1

On selecting Azure DataLake Storage (ADLS) Gen 1 as the storage type, the following fields appear:

| Field | Description | Details |
| --- | --- | --- |
| Access Scheme | Scheme used to access ADLS Gen 1. | Select the available option: adl:// |
| Authentication Mechanism | Mechanism using which the security information is stored. Available options are Service Principal and Access Key. | Select the required authentication mechanism from the drop-down menu. |
| Storage Account Name | Name of the Azure Storage Account. | Provide the required storage account name. For example, my_data_acc. |
| Application ID | ID that uniquely identifies the user application. | Provide the required application ID. |
| Directory ID | ID that uniquely identifies the Azure AD instance. | Provide the required directory ID. |
| Service Credentials | Credential that the application uses to prove its identity. | Provide the credential string value. |

Azure DataLake Storage (ADLS) Gen 2

On selecting Azure DataLake Storage (ADLS) Gen 2 as the storage type, the following fields appear:

| Field | Description | Details |
| --- | --- | --- |
| Access Scheme | Scheme used to access ADLS Gen 2. Available options are abfs:// and abfss://. | Select the required access scheme from the drop-down menu. |
| Authentication Mechanism | Mechanism using which the security information is stored. Available options are Service Principal and Access Key. | Select the required authentication mechanism from the drop-down menu. |
| File System | File system where all the data of an artifact will be stored. | Provide the required file system parameter. |
| Storage Account Name | Name of the Azure Storage Account. | Provide the required storage account name. For example, my_data_acc. |
| Application ID | ID that uniquely identifies the user application. | Provide the required application ID. |
| Directory ID | ID that uniquely identifies the Azure AD instance. | Provide the required directory ID. |
| Service Credential | Credential that the application uses to prove its identity. | Provide the credential string value. |
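
For reference, the ADLS Gen 2 fields correspond to the standard Hadoop ABFS settings shown in the PySpark sketch below. The account, file system, and credential values are hypothetical placeholders; this mirrors the usual Azure configuration rather than Infoworks internals.

```python
from pyspark.sql import SparkSession

ACCOUNT = "mydataacc"                        # Storage Account Name (hypothetical)
FILESYSTEM = "iwx-artifacts"                 # File System (hypothetical)
SUFFIX = f"{ACCOUNT}.dfs.core.windows.net"

# Service Principal authentication via the standard Hadoop ABFS properties.
spark = (
    SparkSession.builder
    .config(f"spark.hadoop.fs.azure.account.auth.type.{SUFFIX}", "OAuth")
    .config(f"spark.hadoop.fs.azure.account.oauth.provider.type.{SUFFIX}",
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.id.{SUFFIX}", "<application-id>")
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.secret.{SUFFIX}", "<service-credential>")
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.endpoint.{SUFFIX}",
            "https://login.microsoftonline.com/<directory-id>/oauth2/token")
    .getOrCreate()
)

# Access Scheme + File System + Storage Account Name form the storage path.
staging_path = f"abfss://{FILESYSTEM}@{SUFFIX}/iwx"
```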

WASB

On selecting WASB as the storage type, the following fields appear:

| Field | Description | Details |
| --- | --- | --- |
| Access Scheme | Scheme used to access WASB. Available options are wasb:// and wasbs://. | Select the required access scheme from the drop-down menu. |
| Container Name | Name of the WASB container. | Provide the WASB container name. |
| Storage Account | Account name of the WASB storage. | Provide the storage account name. |
| Account Key | Access key of the WASB storage. | Provide the required access key. |

S3

On selecting S3 as the storage type, the following fields appear:

| Field | Description | Details |
| --- | --- | --- |
| Access Scheme | Scheme used to access S3. Available options are s3a://, s3n://, and s3://. | Select the required access scheme from the drop-down menu. |
| Bucket Name | The AWS bucket name is part of the domain in the URL. For example: http://bucket.s3.amazonaws.com. | Provide the required bucket name. This field is displayed only for the S3 storage type. |
| Access Key | Unique 20-character, alphanumeric string that identifies your AWS account. For example, AKIAIOSFODNN7EXAMPLE. | Provide the required access key. |
| Secret Key | Unique 40-character string that allows you to send requests using the AWS account. For example, wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY. | Provide the required secret key. |
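
The S3 fields map to the standard Hadoop S3A credential settings, as in this minimal PySpark sketch; the bucket name and keys are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Standard Hadoop S3A credential settings corresponding to the fields above.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE")  # Access Key
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")          # Secret Key
    .getOrCreate()
)

# Access Scheme + Bucket Name form the staging path.
staging_path = "s3a://my-iwx-bucket/staging"
```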

GCS

On selecting GCS as the storage type, the following fields appear:

| Field | Description | Details |
| --- | --- | --- |
| Access Scheme | Scheme used to access GCS. The available option is gs://. | Select gs:// from the drop-down menu. |
| Bucket Name | Buckets are the basic containers that hold, organise, and control access to your data. | Provide the storage bucket name. Do not include the gs:// prefix in the bucket name. |
| Authentication Mechanism | The authentication mechanism used to access GCP storage. | The following authentication mechanisms are available. Use system role credentials: the default credentials of the instance identify and authorize the application. Use environment level service account credentials: the IAM Service Account defined at the environment level identifies and authorizes the application. Override environment authentication mechanism: overrides the credentials provided for the environment. |
| Service Credentials | The credentials used to authenticate calls to Google Cloud APIs. | Upload File: upload the file where the service credentials are stored. Enter File Location: path of the file to be uploaded; you must enter the server file location. |
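
Similarly, the GCS fields map to the Hadoop GCS connector settings shown below; the bucket and key file path are hypothetical, and the key file corresponds to the Override environment authentication mechanism option.

```python
from pyspark.sql import SparkSession

# Hadoop GCS connector settings for service-account key file authentication.
spark = (
    SparkSession.builder
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/dbfs/FileStore/iwx/service-account.json")  # Service Credentials file location (hypothetical)
    .getOrCreate()
)

# The gs:// scheme plus the Bucket Name (without the gs:// prefix) form the storage path.
staging_path = "gs://my-iwx-bucket/staging"
```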

After entering all the required values, click Save. Click Return to Manage Environments to view and access the list of all the environments configured. Edit, Clone, and Delete actions are available on the UI, corresponding to every configured environment.
