Introduction

Data preparation is the process of preparing, transforming, and augmenting data using Infoworks.

Feature Highlights

The following are some highlights of the data preparation features:

User-friendly GUI-driven method to transform data.
SQL syntax support for transforming data.
ETL workload import: required SQL files can be imported to build data pipelines automatically.
Immediate feedback on syntax and semantic errors.
Enterprise-friendly features, including domain-based source access and audits.
Multiple pipelines to create workflows.
Interactive data preparation for large data sets.
Intelligent data sampling to view data changes as transformations are applied.
Visual debugging using interactive data.
Advanced analytics nodes.

Data preparation provides the following optimizations:

Instant feedback on data, syntax, and semantic errors: during transformation, data errors such as an unexpected data format in a column or a regular expression that fails on a column can be verified in the sample data.
Visualization of the flow of data, to help design better flows.
Auto-materialize transformation nodes for faster responses.
Automatic dependency management: When a transformation node is modified, the system automatically computes the dependent nodes. The platform uses a Mark and Sweep algorithm to perform this efficiently.
Safe refactoring: column include, exclude, and rename operations are handled safely, even inside user-defined expressions.
Automatic rename of duplicate column names.
Reuse of Hive and Impala connections to support interactive viewing of data while designing pipelines.
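
The automatic dependency management described above can be sketched as a two-phase pass over the pipeline graph. This is a minimal illustration only: the source does not describe how the platform implements its Mark and Sweep step, so the graph representation, the function names (mark_dependents, sweep), and the node names are all assumptions. The mark phase collects every node downstream of the modified node; the sweep phase returns those nodes in an order in which they could safely be recomputed.

```python
from collections import defaultdict, deque


def mark_dependents(edges, modified):
    """Mark phase: collect the modified node and everything downstream of it."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    marked = {modified}
    queue = deque([modified])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in marked:
                marked.add(child)
                queue.append(child)
    return marked


def sweep(edges, marked):
    """Sweep phase: order the marked nodes so that parents come before children.

    Uses Kahn's topological sort restricted to the marked subgraph.
    """
    indegree = {node: 0 for node in marked}
    children = defaultdict(list)
    for parent, child in edges:
        if parent in marked and child in marked:
            children[parent].append(child)
            indegree[child] += 1

    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in sorted(children[node]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order


# Hypothetical pipeline: source -> filter -> join -> target, with lookup -> join.
EDGES = [
    ("source", "filter"),
    ("filter", "join"),
    ("lookup", "join"),
    ("join", "target"),
]

if __name__ == "__main__":
    dirty = mark_dependents(EDGES, "filter")
    print(sweep(EDGES, dirty))
```

In this sketch, editing the filter node leaves source and lookup untouched, which is the point of tracking dependencies: only the affected part of the pipeline needs to be recomputed.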

Automatic detection of pipeline-specific intermediate storage by the backend engine.
Automatic parallelization when populating multiple targets.
Expression reuse: an expression is computed once and used multiple times, reducing CPU/IO.
Optimal merge process to handle updates.
Ability to use pipeline-specific environment settings for MapReduce, Hive, memory, and compression for build.
Automatic selection of only the columns the pipeline requires, which reduces CPU/IO usage.
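
To illustrate the idea behind selecting only required columns, the sketch below walks a linear pipeline backwards from its target and works out which source columns actually need to be read. It is an assumption-laden toy model, not the platform's implementation: the step dictionaries, the "derive" and "filter" kinds, and the function name required_source_columns are all hypothetical.

```python
def required_source_columns(steps, target_columns):
    """Walk the pipeline backwards and return the source columns that must be read."""
    required = set(target_columns)
    for step in reversed(steps):
        if step["kind"] == "derive":
            # A derived column matters only if something downstream needs it.
            if step["output"] in required:
                required.discard(step["output"])
                required |= set(step["inputs"])
        elif step["kind"] == "filter":
            # A filter always reads the columns used in its predicate.
            required |= set(step["inputs"])
        else:
            raise ValueError(f"unknown step kind: {step['kind']}")
    return required


# Hypothetical pipeline steps, listed in execution order.
STEPS = [
    {"kind": "derive", "output": "full_name", "inputs": ["first_name", "last_name"]},
    {"kind": "derive", "output": "age_bucket", "inputs": ["age"]},
    {"kind": "filter", "inputs": ["country"]},
]

if __name__ == "__main__":
    print(sorted(required_source_columns(STEPS, ["customer_id", "full_name"])))
```

Here the target never uses age_bucket, so the age column is dropped from the read set, while country is kept because the filter needs it even though it is not written to the target.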

Primary partitions on tables created through pipeline targets.
Choice of storage format, ORC or Parquet, on pipeline targets.
Table statistics on pipeline targets updated after every build, for use by the cost-based optimizer.
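
As a rough picture of what the target options above amount to in Hive terms, the sketch below builds two standard HiveQL statements for a partitioned target: a CREATE TABLE with PARTITIONED BY and STORED AS, and an ANALYZE TABLE that refreshes statistics. The source does not show the statements the platform actually issues, so the function name build_target_statements, the table name, and the column names are illustrative assumptions.

```python
def build_target_statements(table, columns, partition_column, storage_format):
    """Return HiveQL for creating a partitioned target and refreshing its statistics.

    columns: list of (name, type) pairs for the non-partition columns.
    partition_column: (name, type) pair for the partition column.
    storage_format: "ORC" or "PARQUET" (case-insensitive).
    """
    fmt = storage_format.upper()
    if fmt not in {"ORC", "PARQUET"}:
        raise ValueError(f"unsupported storage format: {storage_format}")

    part_name, part_type = partition_column
    # In Hive, the partition column is declared in PARTITIONED BY, not in the column list.
    column_defs = ", ".join(
        f"{name} {dtype}" for name, dtype in columns if name != part_name
    )
    create = (
        f"CREATE TABLE IF NOT EXISTS {table} ({column_defs}) "
        f"PARTITIONED BY ({part_name} {part_type}) STORED AS {fmt}"
    )
    # Refresh table and partition statistics so a cost-based optimizer can use them.
    analyze = f"ANALYZE TABLE {table} PARTITION ({part_name}) COMPUTE STATISTICS"
    return [create, analyze]


if __name__ == "__main__":
    # Hypothetical target definition.
    for statement in build_target_statements(
        "sales_target",
        [("order_id", "BIGINT"), ("amount", "DOUBLE")],
        ("order_date", "STRING"),
        "orc",
    ):
        print(statement)
```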
