Optimizing Ingestion Performance and Reducing Cluster Costs Using Split-By and Table Group Parallelism in Infoworks

Issue Summary

Users may observe slow ingestion performance and higher compute costs when ingesting large tables or many tables in a single job. This is often due to the default behavior where ingestion happens sequentially using a single connection. To overcome this, Infoworks provides support for parallel ingestion using Split-By columns and Table Groups.

Solution

1. Use a Split-By Column for Table-Level Parallelism

Split-By enables Spark to partition the table into parallel tasks, each reading a segment of the data independently. This results in faster ingestion, especially for large tables.

How this reduces cost:

Faster ingestion = less cluster runtime
Better resource utilization = avoids overprovisioned VMs sitting idle

Ideal use case: Primary key or timestamp columns that allow even data distribution.

How it works:

The Split-By column should have calculable min and max values.
Ideal columns: primary key, numeric, date, or timestamp.
Infoworks will partition the column range into slices based on the number of connections allowed.

Example: A table with a user_id column ranging from 1 to 1000, split into 10 ranges, will result in 10 parallel ingestion tasks.

Optional: If the Split-By column is a date or timestamp, enable Derived Split Column and use a function like YEAR(date_col) or TO_CHAR(date_col, 'YYYY-MM') to control how data is split.

2. Use Table Groups for Multi-Table Management

A Table Group allows multiple tables to be ingested as part of a single job. Two key configurations control how these tables are handled in parallel:

Max Parallel Entities: Max number of tables that can be ingested at once.
Max Connections: Total DB connections Infoworks can use across all tables.

Example: If your group has 12 tables, but Max Parallel Entities = 3, only 3 tables run at any given time. The rest wait in the queue. To ingest all 12 at once, set Max Parallel Entities = 12.

3. Sample Configuration Strategy

You have:

10 tables: 5 small, 5 large
Goal: Ingest small tables with 1 connection and large tables with 10 parallel threads

Suggested Configuration:

Max Parallel Entities = 10
Max Connections = 55
(5 small × 1 connection + 5 large × 10 connections)*

⚠️ Important: Always check with your source database administrator to understand the upper limit on concurrent connections. Setting this too high may result in connection failures.

⚠️ Common Pitfall

Enabling Split-By at the table level alone does not launch parallel ingestion. You must also increase Max Connections in the associated Table Group to activate parallel crawling.

Best Practices

Item	Recommendation
Split-By Column	Use primary key or numeric/date columns
Avoid Data Skew	Ensure evenly distributed values
Derived Split Column	Use for timestamp/date fields
Table Group Configuration	Set proper Max Parallel Entities and Max Connections
DB Limits	Consult DB admin for connection limits

Last updated on

Was this page helpful?