Data Ingestion and Synchronization

This chapter describes the ingestion functionalities supported by Infoworks.

Ingestion is the first step in performing data analytics on Hadoop. Ingestion brings data from various sources, such as RDBMS, delimited files, and unstructured files, onto Hadoop.

The types of ingestion supported depend on the source database.

Following are the two main prerequisites for ingestion, described in the sections below: creating a source and installing any required external client drivers.

Creating Source

NOTE: Only an admin can create a source.

  • Log in to Infoworks DF.
  • Navigate to Admin > Sources > New Source.

In the Create New Source page, enter the following details (a configuration sketch follows this list):

  • Source Name: Name of the source on the Infoworks platform.
  • Source Type: The Files source type includes Structured Files (CSV, TSV), JSON Files, XML Files, and Unstructured Files. The RDBMS source type includes Teradata, MySQL, MariaDB, Oracle, SQL Server, DB2, Netezza, SAP HANA, Hive, Sybase IQ, Apache Ignite, Redshift, and Vertica. The NoSQL source type includes MapR DB. The CRM source type includes Salesforce.com. For the list of data types supported by these source types, see the Data Types section.
  • Driver Name: After you select the Source Type, the JDBC driver name for the database is displayed. You can edit the driver name if required.
  • Target Hive Schema: Hive schema name created by the Hadoop admin.
  • Target HDFS Location: HDFS path on the Hadoop cluster, created by the Hadoop admin.
  • Click Save Settings.
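
The same details can also be captured as a simple configuration object, for example when scripting source creation. The following is a minimal sketch in Python; the `create_source` helper, the endpoint path, and the field names are hypothetical placeholders rather than the documented Infoworks API, and the MySQL driver class is shown only as an example value.

```python
import requests  # assumed available; any HTTP client would do

# Hypothetical representation of the Create New Source form fields.
source_config = {
    "source_name": "sales_mysql",                 # Source Name on the Infoworks platform
    "source_type": "RDBMS",                       # Files, RDBMS, NoSQL, or CRM
    "driver_name": "com.mysql.jdbc.Driver",       # example JDBC driver class; editable after selecting the type
    "target_hive_schema": "sales_raw",            # Hive schema created by the Hadoop admin
    "target_hdfs_location": "/iw/sources/sales",  # HDFS path created by the Hadoop admin
}

def create_source(base_url: str, token: str, config: dict) -> dict:
    """Post the source definition to a hypothetical admin endpoint.

    The URL path and auth header are placeholders; consult the Infoworks
    REST API documentation for the actual endpoint and payload schema.
    """
    response = requests.post(
        f"{base_url}/api/admin/sources",  # hypothetical endpoint
        json=config,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```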

Installing External Client Drivers

To install external client drivers such as Netezza, SAP HANA, and Teradata, see External Client Drivers.

Ingestion Jobs

Following are the types of ingestion jobs you can perform to achieve specific types of ingestion (a small classification sketch follows this list):

  • source_test_connection: Test connection job for all RDBMS.
  • source_fetch_metadata: Metadata crawl for RDBMS.
  • source_structured_fetch_metadata: Metadata crawl for file-based ingestion.
  • source_crawl: Initialize and ingest for RDBMS over JDBC.
  • source_crawl_tpt: Initialize and ingest for Teradata sources while using TPT.
  • source_crawl_stage_tpt: Stage data job for Teradata while using TPT when crawling for the first time.
  • source_crawl_process_tpt: Processing job for Teradata while using TPT when crawling for the first time.
  • source_crawl_chunk_load: Segmented load for RDBMS.
  • source_stage_tpt: Stage data task for incremental load.
  • source_cdc: CDC job for RDBMS.
  • source_cdc_tpt: CDC job for Teradata while using TPT.
  • source_merge: Merge job for RDBMS.
  • source_merge_tpt: Merge job for Teradata while using TPT.
  • source_switch: Commit to the data after merge.
  • source_cdc_merge: CDC and merge (ingest now) job for RDBMS.
  • source_stage_cdc_merge_tpt: Ingest now for Teradata sources while using TPT.
  • source_unstructured_crawl: Initialize and ingest now for unstructured files.
  • source_unstructured_cdc: Ingest now for unstructured files.
  • source_structured_crawl: Initialize and ingest now for CSV and fixed-width ingestion.
  • source_structured_cdc_merge: Ingest now for CSV and fixed-width ingestion.
  • source_structured_cdc: CDC job for CSV and fixed-width ingestion.
  • source_structured_merge: Merge job for CSV and fixed-width ingestion.
  • source_semistructured_crawl: Initialize and ingest now job for JSON and XML ingestion.
  • source_semistructured_cdc_merge: Ingest now job for JSON and XML.
  • source_semistructured_cdc: CDC job for JSON and XML.
  • source_semistructured_merge: Merge job for JSON and XML.
  • source_reorganize: Data reorganize job for all sources.
  • source_reconcile: Data reconciliation job for RDBMS.
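
When scripting against job logs or monitoring output, it can help to classify these job type identifiers by the naming convention visible above. The sketch below is an illustration in Python: the job-type strings and their meanings come from the list above, but the classification helper itself is an assumption for demonstration, not part of any Infoworks API.

```python
# Minimal sketch: classify Infoworks ingestion job types by naming convention.
def classify_job(job_type: str) -> str:
    """Return the ingestion family implied by a job type identifier."""
    if job_type.endswith("_tpt"):
        return "Teradata via TPT"
    if job_type.startswith("source_semistructured_"):
        return "JSON/XML (semi-structured files)"
    if job_type.startswith("source_structured_"):
        return "CSV/fixed-width (structured files)"
    if job_type.startswith("source_unstructured_"):
        return "unstructured files"
    return "RDBMS or source-wide job"

if __name__ == "__main__":
    for job in ("source_crawl", "source_cdc_tpt", "source_semistructured_merge",
                "source_unstructured_crawl", "source_reorganize"):
        print(f"{job:30s} -> {classify_job(job)}")
```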

Ingestion Notification Services

Source Level Notifications

  • In the Source Settings page, click the Add New Subscriber button.
  • Enter the email ID of the subscriber.
  • Select the Notify Via options, which include email and Slack.
  • Select the jobs for which the subscriber must be notified. The jobs include fetch metadata and deletion of the source.
  • Click Save. The subscriber will be notified for the selected jobs.

Table Level Notifications

  • In the Table Configuration page, click the Add New Subscriber button.
  • Enter the email ID of the subscriber.
  • Select the Notify Via options, which include email and Slack.
  • Select the jobs for which the subscriber must be notified. The jobs include crawl, reconciliation, reorganization, and export of the table.
  • Click Save. The subscriber will be notified for the selected jobs. A subscriber configuration sketch, covering both source-level and table-level notifications, follows this list.
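
Subscriber entries at both levels carry the same pieces of information: who to notify, over which channels, and for which jobs. The sketch below models that as plain Python data; the `Subscriber` class, its field names, and the job-name strings are illustrative assumptions, not the actual Infoworks configuration schema.

```python
from dataclasses import dataclass, field

# Illustrative model of a notification subscriber; field names are assumptions.
@dataclass
class Subscriber:
    email: str
    notify_via: list = field(default_factory=lambda: ["email"])  # "email" and/or "slack"
    jobs: list = field(default_factory=list)                     # jobs that trigger a notification

# Source-level subscription: notified on fetch metadata and deletion of the source.
source_subscriber = Subscriber(
    email="data-team@example.com",
    notify_via=["email", "slack"],
    jobs=["fetch_metadata", "source_deletion"],
)

# Table-level subscription: notified on crawl, reconciliation, reorganization, and export.
table_subscriber = Subscriber(
    email="oncall@example.com",
    notify_via=["email"],
    jobs=["crawl", "reconciliation", "reorganization", "export"],
)

print(source_subscriber)
print(table_subscriber)
```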