Optimizing Ingestion Performance and Reducing Cluster Costs Using Split-By and Table Group Parallelism in Infoworks

Optimizing Ingestion Performance and Reducing Cluster Costs Using Split-By and Table Group Parallelism in Infoworks

Issue Summary

Users may observe slow ingestion performance and higher compute costs when ingesting large tables or many tables in a single job. This is often due to the default behavior where ingestion happens sequentially using a single connection. To overcome this, Infoworks provides support for parallel ingestion using Split-By columns and Table Groups.

Solution

1. Use a Split-By Column for Table-Level Parallelism

Split-By enables Spark to partition the table into parallel tasks, each reading a segment of the data independently. This results in faster ingestion, especially for large tables.

How this reduces cost:

  • Faster ingestion = less cluster runtime
  • Better resource utilization = avoids overprovisioned VMs sitting idle

Ideal use case: Primary key or timestamp columns that allow even data distribution.

How it works:

  • The Split-By column should have calculable min and max values.
  • Ideal columns: primary key, numeric, date, or timestamp.
  • Infoworks will partition the column range into slices based on the number of connections allowed.

Example: A table with a user_id column ranging from 1 to 1000, split into 10 ranges, will result in 10 parallel ingestion tasks.

Optional: If the Split-By column is a date or timestamp, enable Derived Split Column and use a function like YEAR(date_col) or TO_CHAR(date_col, 'YYYY-MM') to control how data is split.

2. Use Table Groups for Multi-Table Management

A Table Group allows multiple tables to be ingested as part of a single job. Two key configurations control how these tables are handled in parallel:

  • Max Parallel Entities: Max number of tables that can be ingested at once.
  • Max Connections: Total DB connections Infoworks can use across all tables.

Example: If your group has 12 tables, but Max Parallel Entities = 3, only 3 tables run at any given time. The rest wait in the queue. To ingest all 12 at once, set Max Parallel Entities = 12.

3. Sample Configuration Strategy

You have:

  • 10 tables: 5 small, 5 large
  • Goal: Ingest small tables with 1 connection and large tables with 10 parallel threads

Suggested Configuration:

  • Max Parallel Entities = 10
  • Max Connections = 55
  • (5 small × 1 connection + 5 large × 10 connections)*

⚠️ Important: Always check with your source database administrator to understand the upper limit on concurrent connections. Setting this too high may result in connection failures.

⚠️ Common Pitfall

Enabling Split-By at the table level alone does not launch parallel ingestion. You must also increase Max Connections in the associated Table Group to activate parallel crawling.

Best Practices

ItemRecommendation
Split-By ColumnUse primary key or numeric/date columns
Avoid Data SkewEnsure evenly distributed values
Derived Split ColumnUse for timestamp/date fields
Table Group ConfigurationSet proper Max Parallel Entities and Max Connections
DB LimitsConsult DB admin for connection limits
VariableType to search · ESC to discard
GlossaryType to search · ESC to discard
InsertType to search · ESC to discard
No matches