Infoworks 6.1.3
Prepare Data

Introduction

Data preparation is the process of transforming and augmenting data using Infoworks.

Feature Highlights

Following are some data preparation feature highlights:

  • User-friendly GUI-driven method to transform data.
  • SQL syntax support for transforming data.
  • ETL workload import support, which imports existing SQL files to automatically build data pipelines.
  • Immediate feedback on syntax and semantic errors.
  • Enterprise-friendly features, including domain-based source access and audits.
  • Support for combining multiple pipelines into workflows.
  • Interactive data preparation for large data sets.
  • Intelligent data sampling to view data changes as transformations are applied.
  • Visual debugging using interactive data.
  • Advanced analytics nodes.

Optimizations

Data preparation provides the following optimizations:

Design-Time Optimizations

  • Instant feedback on data, syntax, and semantic errors: during transformation, data errors such as malformed column formats or regular expressions that fail against column values can be verified in the sample data.
  • Visualization of the flow of data to help design better flows.
  • Auto-materialization of transformation nodes for faster responses.
  • Automatic dependency management: when a transformation node is modified, the system automatically computes the dependent nodes. The platform uses a Mark and Sweep algorithm to perform this efficiently (see the sketch after this list).
  • Safe handling of refactors: column include, exclude, and rename operations are propagated even into user-defined expressions.
  • Automatic renaming of duplicate column names.
  • Reuse of Hive and Impala connections to support interactive viewing of data while designing pipelines.
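
The dependency management described above can be pictured with a minimal Python sketch of a mark-and-sweep pass over a pipeline DAG. The node names and data structures here are hypothetical stand-ins, not the Infoworks API:

    from collections import defaultdict

    # Illustrative mark-and-sweep over a pipeline DAG; names and
    # structures are hypothetical, not the Infoworks implementation.
    def stale_nodes(modified, edges):
        """Mark every node downstream of the modified node, then sweep:
        return the set of nodes whose cached results must be recomputed."""
        children = defaultdict(list)
        for src, dst in edges:
            children[src].append(dst)

        # Mark phase: depth-first walk from the modified node.
        marked, stack = set(), [modified]
        while stack:
            node = stack.pop()
            if node in marked:
                continue
            marked.add(node)
            stack.extend(children[node])

        # Sweep phase: everything marked except the edited node is stale.
        marked.discard(modified)
        return marked

    # Example DAG: source -> filter -> join -> target, plus a lookup branch.
    edges = [("source", "filter"), ("filter", "join"),
             ("lookup", "join"), ("join", "target")]
    print(stale_nodes("filter", edges))  # {'join', 'target'}

Only the marked downstream nodes are recomputed; untouched branches (here, source and lookup) keep their cached results.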

Execution Optimizations

  • Automatic detection of pipeline-specific intermediate storage by the backend engine.
  • Automatic parallelization when populating multiple targets.
  • Expression reuse: an expression is computed once and its result used multiple times, reducing CPU/IO (see the sketch after this list).
  • Optimal merge process to handle updates.
  • Ability to use pipeline-specific environment settings (MapReduce, Hive, memory, and compression) for builds.
  • Automatic selection of only the columns the pipeline requires, reducing CPU/IO usage.
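
To illustrate expression reuse, the following minimal Python sketch computes each distinct expression once per row and shares the result across output columns. The expression strings and the eval-based evaluator are hypothetical stand-ins, not the engine's implementation:

    # Illustrative expression reuse (compute once, use multiple times);
    # the evaluator and expressions are hypothetical.
    def evaluate_row(row, column_exprs):
        """Compute output columns, evaluating each distinct expression once."""
        cache, out = {}, {}
        for col, expr in column_exprs.items():
            if expr not in cache:                 # compute once ...
                cache[expr] = eval(expr, {}, row)
            out[col] = cache[expr]                # ... use multiple times
        return out

    row = {"price": 100.0, "qty": 3}
    exprs = {
        "total":      "price * qty",
        "total_copy": "price * qty",  # same expression, served from cache
        "unit":       "price",
    }
    print(evaluate_row(row, exprs))
    # {'total': 300.0, 'total_copy': 300.0, 'unit': 100.0}

Each distinct expression is evaluated once per row no matter how many output columns reference it, which is where the CPU saving comes from.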

Query-Time Optimizations

  • Primary partitioning of tables created through pipeline targets.
  • Support for different storage formats (ORC and Parquet) on pipeline targets.
  • Table statistics updated after every build, so the cost-based optimizer has fresh statistics on pipeline targets (see the sketch below).
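
As an illustration of this query-time setup on a Hive-backed target, the Python sketch below composes a partitioned Parquet table definition and the standard Hive ANALYZE TABLE statements that refresh statistics for the cost-based optimizer. The table, column names, and helper code are hypothetical, and the statements Infoworks actually issues may differ:

    # Hypothetical pipeline target: a partitioned Parquet table plus a
    # post-build statistics refresh for the cost-based optimizer.
    TARGET = "sales_target"

    ddl = f"""
    CREATE TABLE IF NOT EXISTS {TARGET} (
      order_id BIGINT,
      amount   DOUBLE
    )
    PARTITIONED BY (ds STRING)   -- primary partition column
    STORED AS PARQUET            -- or ORC, per target configuration
    """

    # Standard Hive statements that refresh table- and column-level
    # statistics after a build.
    stats_refresh = [
        f"ANALYZE TABLE {TARGET} PARTITION (ds) COMPUTE STATISTICS",
        f"ANALYZE TABLE {TARGET} PARTITION (ds) COMPUTE STATISTICS FOR COLUMNS",
    ]

    for stmt in [ddl, *stats_refresh]:
        print(stmt.strip())  # in practice these run on the Hive engine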