Introduction
Data preparation is the process of preparing, transforming, and augmenting data using Infoworks.
Feature Highlights
Data preparation offers the following feature highlights:
- User-friendly GUI-driven method to transform data.
- SQL syntax support for transforming data.
- ETL workload import support for importing required SQL files to automatically build data pipelines.
- Immediate feedback on syntax and semantic errors.
- Enterprise-friendly features, including domain-based source access and audits.
- Multiple pipelines to create workflows.
- Interactive data preparation for large data sets.
- Intelligent data sampling to view data changes as transformations are applied.
- Visual debugging using interactive data.
- Advanced analytics nodes.
Optimizations
Data preparation provides the following optimizations:
Design-Time Optimizations
- Instant feedback on data, syntax, and semantic errors: during transformation, data errors such as an invalid data format in a column or a regular expression that fails on a column can be verified against the sample data.
- Support for visualizing the flow of data, which helps in designing better flows.
- Auto-materialize transformation nodes for faster responses.
- Automatic dependency management: when a transformation node is modified, the system automatically computes the nodes that depend on it. The platform uses a mark-and-sweep algorithm to do this efficiently.
- Safe refactoring: column include, exclude, and rename operations are handled safely, even inside user-defined expressions.
- Automatic renaming of duplicate column names.
- Reuse of Hive and Impala connections to support interactive viewing of data while designing pipelines.
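The automatic dependency management listed above can be pictured with a small sketch. This is not Infoworks code. It is a minimal illustration of one plausible reading of mark and sweep, assuming the pipeline is a directed acyclic graph whose edges run from a node to the nodes that consume its output. The names `mark_dependents` and `sweep`, the `edges` mapping, and the node names are all hypothetical.

```python
from collections import deque


def mark_dependents(edges, modified):
    """Mark phase: collect the modified node and every node downstream of it.

    `edges` maps a node name to the nodes that consume its output
    (a hypothetical representation, not the Infoworks data model).
    """
    marked = {modified}
    queue = deque([modified])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in marked:
                marked.add(child)
                queue.append(child)
    return marked


def sweep(cache, marked):
    """Sweep phase: drop cached results of marked nodes, keep the rest."""
    return {node: rows for node, rows in cache.items() if node not in marked}


# Hypothetical pipeline: source -> filter -> (join, derive), lookup -> join -> target
edges = {
    "source": ["filter"],
    "filter": ["join", "derive"],
    "lookup": ["join"],
    "join": ["target"],
}
nodes = ["source", "filter", "lookup", "join", "derive", "target"]
cache = {name: f"sample rows for {name}" for name in nodes}

stale = mark_dependents(edges, "filter")
cache = sweep(cache, stale)

print(sorted(stale))  # ['derive', 'filter', 'join', 'target']
print(sorted(cache))  # ['lookup', 'source']
```

Only the marked nodes lose their cached sample data, so nodes that do not depend on the edited node (`source` and `lookup` here) need no recomputation.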
Execution Optimizations
- Automatic detection of pipeline-specific intermediate storage by the backend engine.
- Automatic parallelization when populating multiple targets.
- Expression reuse: an expression is computed once and used multiple times, reducing CPU and I/O usage.
- Optimal merge process to handle updates.
- Ability to use pipeline-specific environment settings for MapReduce, Hive, memory, and compression for build.
- Automatic selection of only the columns that the pipeline requires, which reduces CPU and I/O usage.
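The column selection optimization in the list above can be sketched as a backward walk from the target columns to the source columns they depend on. This is an illustrative sketch only, not the Infoworks implementation. The function `required_source_columns`, the `derived` mapping, and all column names are assumptions made for the example.

```python
def required_source_columns(target_columns, derived):
    """Return the source columns needed to produce `target_columns`.

    `derived` maps a derived column name to the columns its expression reads
    (a hypothetical structure chosen for this sketch).
    """
    required = set()
    seen = set()
    stack = list(target_columns)
    while stack:
        column = stack.pop()
        if column in seen:
            continue
        seen.add(column)
        if column in derived:
            # Expand a derived column into the columns it is built from.
            stack.extend(derived[column])
        else:
            # Anything that is not derived must be read from the source.
            required.add(column)
    return required


source_columns = ["order_id", "customer_id", "price", "quantity", "notes", "created_at"]
derived = {
    "total": ["price", "quantity"],
    "total_with_tax": ["total"],
}

needed = required_source_columns(["order_id", "total_with_tax"], derived)
projection = [c for c in source_columns if c in needed]
print(projection)  # ['order_id', 'price', 'quantity']
```

In this example three of the six source columns are never read, which is the kind of saving the bullet describes.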
Query-Time Optimizations
- Primary partitions on tables created through pipeline targets.
- A choice of storage formats, ORC and Parquet, on pipeline targets.
- Table statistics on pipeline targets are updated after every build for use by the cost-based optimizer.
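To make the three query-time optimizations concrete, the sketch below builds the kind of HiveQL statements that correspond to them: a partitioned table stored as ORC or Parquet, followed by a statistics refresh. It is an assumption that a pipeline target maps to statements like these. The source does not show what Infoworks actually generates, and the helper names, table name, and columns are hypothetical.

```python
def target_ddl(table, columns, partition_columns, storage_format="ORC"):
    """Build a HiveQL CREATE TABLE statement for a partitioned target."""
    if storage_format not in ("ORC", "PARQUET"):
        raise ValueError(f"unsupported storage format: {storage_format}")
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_columns)
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"PARTITIONED BY ({parts}) STORED AS {storage_format}"
    )


def stats_statement(table, partition_columns):
    """Build the HiveQL statement that refreshes statistics for all partitions."""
    parts = ", ".join(name for name, _ in partition_columns)
    return f"ANALYZE TABLE {table} PARTITION ({parts}) COMPUTE STATISTICS"


# Hypothetical target table, used only for illustration.
columns = [("order_id", "BIGINT"), ("total", "DOUBLE")]
partitions = [("order_date", "STRING")]

print(target_ddl("sales.orders_target", columns, partitions, "PARQUET"))
print(stats_statement("sales.orders_target", partitions))
```

Partitioning lets queries skip irrelevant data, the columnar formats reduce the amount read per query, and fresh statistics give the cost-based optimizer accurate row counts to plan with.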
© UNIPHORE TECHNOLOGIES 2025 | Confidential