Double-clicking each node displays the following list of settings.
The Input tab includes the following:
The Transformation tab includes the following options:
The derivation can be Output or Temporary. An Output derivation is reflected in the output columns, while a Temporary derivation is not. If an expression generates multiple columns (for example, posexplode or a UDF), use the Add button to add column names to the expression.
For example, the expression posexplode(array(5,6,7)) AS ('arr_pos','arr_value') generates multiple columns ('arr_pos' and 'arr_value') and multiple rows in the output.
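A minimal PySpark sketch of the same expression, assuming a Spark session is available (the session and input DataFrame here are illustrative only, not part of the pipeline UI):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import posexplode, array, lit

spark = SparkSession.builder.getOrCreate()

# Single input row; posexplode expands the array into one row per element
# and produces two output columns: the position and the value.
df = spark.range(1)
result = df.select(
    posexplode(array(lit(5), lit(6), lit(7))).alias("arr_pos", "arr_value")
)
result.show()
# +-------+---------+
# |arr_pos|arr_value|
# +-------+---------+
# |      0|        5|
# |      1|        6|
# |      2|        7|
# +-------+---------+
```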
The configurations required to perform the transformation.
This tab includes sample data for the selected transformation/node.
The dt_interactive_service_cluster setting controls where interactive queries are executed for Databricks Compute:
local (Default): Queries are run using Databricks Connect, utilizing local resources for execution.
databricks: When set to a Databricks cluster, interactive queries are submitted as jobs to the specified cluster, leveraging its compute resources.
This allows users to choose between running queries locally or on a Databricks-managed cluster based on their needs.
Interactive requests may time out if the interactive cluster has not been created or an environment issue occurs. Interactive service logs are available in the ${IW_HOME}/logs/dt/ folder. In case of environment issues, fix the issues and then restart the interactive service.
For Databricks runtime version 14.3, local mode is not supported.
dt_interactive_service_cluster=databricks
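For reference, when the setting is local, interactive queries go through Databricks Connect. The following is a minimal sketch of such an execution, assuming the databricks-connect package is installed and a Databricks workspace/cluster connection is already configured in the environment; it is illustrative and not the product's internal implementation:

```python
from databricks.connect import DatabricksSession

# Build a Spark session backed by Databricks Connect, using the
# connection details configured in the environment or profile.
spark = DatabricksSession.builder.getOrCreate()

# Run a simple interactive query and inspect the sample result.
df = spark.sql("SELECT 1 AS id, 'sample' AS label")
df.show()
```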
This tab includes the list of output columns of the transformation performed on the inputs.
You can edit the configurations for each node.
Click the Edit Configurations button and check the Enable Repartitioning option to configure the properties for repartitioning columns, setting the number of partitions, and sorting columns.
The post-processing configuration allows you to dynamically partition the output of a node for subsequent node processing. It includes the Repartitioning and Sorting options.
You can select a list of columns; rows that share the same values in the selected columns are partitioned into the same set, so subsequent downstream nodes can process the partitioned data faster. You can select any column that is part of the Outputs.
Once the data is partitioned, you can sort the data within each partition by the selected columns. This enables faster data access and processing in downstream nodes.
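Conceptually, repartitioning and sorting correspond to Spark-style operations on the node's output. The following is an illustrative PySpark sketch, assuming a Spark-based execution engine; the column names (region, event_date) and partition count are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample output of an upstream node.
df = spark.createDataFrame(
    [("US", "2024-01-01", 10), ("EU", "2024-01-02", 20), ("US", "2024-01-03", 30)],
    ["region", "event_date", "amount"],
)

# Repartition on the selected column with a chosen number of partitions,
# then sort rows within each partition for faster downstream processing.
partitioned = df.repartition(4, "region").sortWithinPartitions("event_date")
partitioned.show()
```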
This tab displays the representational SQL queries for the transformations performed. The actual queries are optimized for the execution engine.
This tab displays the set of SQL statements to be executed to create the target and insert data into it. The Preview SQL feature is available only in CDW environments for target nodes in 5.2 Pilot.
This tab displays the list of all actions performed on the specific node of the pipeline.