Infoworks 6.2.0
Getting Started

Configuring Infoworks with Spark on Kubernetes

Introduction

Infoworks’ Spark on Kubernetes environment enables streamlined execution of Apache Spark workloads on Kubernetes-managed clusters. Infoworks handles Kubernetes integration—submitting Spark jobs directly to the Kubernetes API so that Spark driver and executor components run as Kubernetes pods managed by Kubernetes’ native scheduler.

For more information and technical details, see the official Apache Spark documentation on Running Spark on Kubernetes.
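
For context, the snippet below is a minimal, hypothetical sketch of how a Spark session targets a Kubernetes cluster using standard Apache Spark properties. Infoworks performs the equivalent submission internally, so you do not run this yourself; the API server address, container image, and namespace are placeholders, not Infoworks defaults.

```python
# Minimal sketch (not an Infoworks API): standard Apache Spark properties used
# when Spark runs on Kubernetes. Replace the placeholders with real values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-on-k8s-sketch")
    .master("k8s://https://<kubernetes-api-server>:443")          # Kubernetes API endpoint
    .config("spark.kubernetes.container.image", "<spark-image>")  # image for driver/executor pods
    .config("spark.kubernetes.namespace", "<namespace>")          # namespace where pods are created
    .config("spark.executor.instances", "2")                      # each executor runs as a pod
    .getOrCreate()
)
```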

Prerequisites

  • A Kubernetes-based deployment for Infoworks must be in place, ensuring Spark workloads can run in Kubernetes-managed containers.
  • An external Hive metastore is required for managing metadata used by Spark SQL operations. This metastore must be hosted and maintained by the customer—not managed internally by Infoworks.

The platform supports two relational database backends for the Hive metastore:

  • PostgreSQL
  • Microsoft SQL Server (MSSQL)

Infoworks provides schema setup scripts compatible with both supported databases. These scripts must be executed manually on your chosen database instance.

Procedure

To configure and connect to a Spark on Kubernetes environment, navigate to Admin > Manage Compute Environments, and then click the Add button under the Spark on Kubernetes option.

The configuration window that appears contains three tabs: Data Environment, Compute, and Storage.

Data Environment

To configure the data environment details for Spark on Kubernetes, enter values in the following fields. These settings define the compute environment parameters required for Infoworks to submit and manage Spark workloads on your Kubernetes cluster.

  • Data Environment Name: Name for the Spark on Kubernetes environment. User-defined label to identify the data environment.
  • Description: Optional description for the environment. Brief notes about the purpose or scope of this environment.
  • Network Connection to Kubernetes: Network route used to connect to the Kubernetes cluster. Only Internal is currently available; Infoworks services and Spark jobs run in the same Kubernetes cluster.
  • Metastore DB: Type of database used to host the Hive metastore. Choose between PostgreSQL and MS SQL.
  • JDBC URL: JDBC connection string to the Hive metastore database. Must follow the JDBC syntax for the selected metastore database (PostgreSQL or MSSQL).
  • Username: Username for authenticating with the Hive metastore, typically a database user with read/write permissions.
  • Authentication Type for Password: How the password for the metastore user is managed. The options are Infoworks-Managed (Infoworks stores the password in its database) and External Secret Store (the password is retrieved from an external secret store/key vault as required).
  • Authentication Password for Password: The password corresponding to the metastore user. Secure input field; provide the password for the metastore user.
  • Secret for Password: Reference to a secret stored in an external secret manager such as Key Vault. Used when the password is not managed by Infoworks; must reference a secret from an external secret store.
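
For reference, Hive metastore JDBC URLs follow the standard driver formats shown below. The host, port, and database names are placeholders, not values prescribed by Infoworks.

```python
# Hypothetical JDBC URLs for the Hive metastore; substitute your own host,
# port, and database name for the selected backend.
postgres_metastore_url = "jdbc:postgresql://metastore-host:5432/hive_metastore"
mssql_metastore_url = "jdbc:sqlserver://metastore-host:1433;databaseName=hive_metastore"
```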

Compute

In the Spark on Kubernetes environment, a compute template defines the infrastructure used to execute Spark jobs directly within the Kubernetes cluster where Infoworks is installed. Spark drivers and executors run as Kubernetes pods, enabling native containerized execution.

Unlike other environments (e.g., Azure Databricks), where ephemeral clusters can be created on demand, Spark on Kubernetes supports only persistent clusters. This is because the underlying Kubernetes cluster remains continuously available and managed, providing a stable foundation for job execution.

Below are the field definitions required to configure a Spark on Kubernetes compute template:

  • Name: Name for the compute template that you want to use for jobs. User-defined; provide a meaningful name for the compute template being configured.
  • Description: Description for the compute template. User-defined; provide the required description for the compute template being configured.
  • Use this as an interactive cluster: Select this check box to designate the cluster as an interactive cluster. Interactive clusters allow you to perform tasks such as displaying sample data for sources and pipelines. Only one interactive cluster can be defined for use by all artifacts at any given time.
  • Custom Tags: Metadata tags for filtering or organization. User-defined key-value pairs for governance and automation across environments.
  • Runtime Version: Version of Apache Spark used for execution. Currently, only 3.5.5 is supported.
  • Advanced Configurations: Optional settings to fine-tune Spark and Kubernetes behavior, such as driver/executor memory, core limits, and Spark config overrides, for job performance tuning.

NOTE You can configure custom Spark properties for Spark-on-Kubernetes clusters by defining the advanced configuration key: iw_environment_cluster_spark_config. This allows you to pass key-value pairs such as Spark driver/executor memory, number of cores, etc., tailored to your workloads.
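
As an illustration, the key-value pairs below are standard Spark properties that such an override might carry. The values are placeholders, and the exact format expected by the advanced configuration field may differ in your deployment.

```python
# Hypothetical Spark property overrides that could be supplied through
# iw_environment_cluster_spark_config; values are placeholders for illustration.
spark_overrides = {
    "spark.driver.memory": "4g",       # memory for the driver pod
    "spark.executor.memory": "8g",     # memory per executor pod
    "spark.executor.cores": "4",       # cores per executor pod
    "spark.executor.instances": "6",   # number of executor pods
}
```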

Storage

To configure the storage details, enter values in the following fields. These settings define the storage parameters that Infoworks uses to store data and artifacts for the Spark on Kubernetes environment.

To configure additional storage after the initial configuration, click the Add button on the UI.

Enter the following fields under the Storage section:

  • Name: Name that helps identify the storage credentials being configured. User-defined; provide a meaningful name for the storage setup.
  • Description: Description for the storage setup being configured. User-defined; provide the required description.
  • Storage Type: Type of storage system where all the artifacts will be stored. This depends on the cloud/platform provider chosen in the Compute tab: the available option for Azure is Azure DataLake Storage (ADLS) Gen 2, for AWS it is S3, and for GCP it is GCS. Select the required storage type from the drop-down menu.

Azure DataLake Storage (ADLS) Gen 2

On selecting Azure DataLake Storage (ADLS) Gen 2 as the storage type, the following fields appear:

  • Access Scheme: Scheme used to access ADLS Gen 2. Available options are abfs:// and abfss://. Select the required access scheme from the drop-down menu.
  • Authentication Mechanism: Mechanism using which the security information is stored. Available options are Service Principal and Access Key. Select the required authentication mechanism from the drop-down menu.
  • File System: File system where all the data of an artifact will be stored. Provide the required file system parameter.
  • Storage Account Name: Name of the Azure Storage Account. Provide the required storage account name, for example, my_data_acc.
  • Application ID: ID that uniquely identifies the user application. Provide the required application ID.
  • Directory ID: ID that uniquely identifies the Azure AD instance. Provide the required directory ID.
  • Service Credential: Credential that the application uses to prove its identity. Provide the credential string value.
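
For reference, the access scheme, file system, and storage account name combine into an ADLS Gen 2 path of the general form shown below. The names are placeholders, not Infoworks defaults.

```python
# Hypothetical ADLS Gen 2 base URI built from the fields above (placeholder names).
access_scheme = "abfss://"
file_system = "my-filesystem"
storage_account = "mydataacc"
adls_base_path = f"{access_scheme}{file_system}@{storage_account}.dfs.core.windows.net/"
```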

S3

On selecting S3 as the storage type, the following fields appear:

  • Access Scheme: Scheme used to access S3. Available options are s3a://, s3n://, and s3://. Select the required access scheme from the drop-down menu.
  • Bucket Name: The AWS bucket name, which is part of the domain in the URL (for example, http://bucket.s3.amazonaws.com). Provide the required bucket name. This field is displayed only for the S3 storage type.
  • Access Key: Unique 20-character alphanumeric string that identifies your AWS account, for example, AKIAIOSFODNN7EXAMPLE. Provide the required access key.
  • Secret Key: Unique 40-character string that allows you to send requests using the AWS account, for example, wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY. Provide the required secret key.
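
For reference, the access scheme and bucket name from the fields above form the base URI that Spark uses for S3 paths. The bucket name below is a placeholder.

```python
# Hypothetical S3 base URI built from the Access Scheme and Bucket Name fields.
s3_base_path = "s3a://my-infoworks-bucket/"   # placeholder bucket name
```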

GCS

On selecting GCS as the storage type, the following fields appear:

  • Access Scheme: Scheme used to access GCS. The available option is gs://. Select gs:// from the drop-down menu.
  • Bucket Name: Buckets are the basic containers that hold, organize, and control access to your data. Provide the storage bucket name; do not include the gs:// prefix.
  • Authentication Mechanism: Authentication mechanism used to access GCP storage. The following options are available: Use system role credentials (the default credentials of the instance identify and authorize the application), Use environment level service account credentials (the IAM Service Account defined at the environment level identifies and authorizes the application), and Override environment authentication mechanism (overrides the credentials provided for the environment).
  • Service Credentials: Credentials used to authenticate calls to Google Cloud APIs. Upload File: upload the file where the service credentials are stored. Enter File Location: path of the file to be uploaded; you must enter the server file location.
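
For reference, the bucket name entered above resolves to a GCS base URI of the form shown below. The bucket name is a placeholder.

```python
# Hypothetical GCS base URI; enter only the bucket name in the UI (no gs:// prefix).
gcs_base_path = "gs://my-infoworks-bucket/"   # placeholder bucket name
```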

After entering all the required values, click Save. Click Return to Manage Environments to view the list of all configured environments. Edit, Clone, and Delete actions are available on the UI for every configured environment.
