Data Ingestion and Synchronization

This chapter describes the ingestion functionalities supported by Infoworks.

Ingestion is the first step in performing data analytics on Hadoop. Ingestion brings data from various sources, such as RDBMS, delimited files, and unstructured files, onto Hadoop.

The types of ingestion supported depend on the source database.

Following are the two main prerequisites for ingestion, described in the sections below: creating a source and installing any required external client drivers.

Creating Source

NOTE: Only an admin can create a source.

  • Log in to Infoworks DF.
  • Navigate to Admin > Sources > New Source.

In the Create New Source page, enter the following details (a configuration sketch follows this list):

  • Source Name: Name of the source on the Infoworks platform.
  • Source Type: The Files source type includes Structured Files (CSV, TSV), JSON Files, XML Files, and Unstructured Files. The RDBMS source type includes Teradata, MySQL, MariaDB, Oracle, SQL Server, DB2, Netezza, SAP HANA, Hive, Sybase IQ, Apache Ignite, Redshift, and Vertica. The NoSQL source type includes MapR DB. The CRM source type includes Salesforce.com. For the list of data types supported by these source types, see the Data Types section.
  • Driver Name: After you select the Source Type, the JDBC driver name for the database is displayed. You can edit the driver name if required.
  • Target Hive Schema: Hive schema name created by the Hadoop admin.
  • Target HDFS Location: HDFS path on the Hadoop cluster, created by the Hadoop admin.
  • Click Save Settings.
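
The same details can also be captured as a simple configuration object, for example when scripting source creation. The following is a minimal sketch in Python; the `create_source` helper, the endpoint path, and the field names are hypothetical placeholders rather than the documented Infoworks API, and the MySQL driver class is shown only as an example value.

```python
import requests  # assumed available; any HTTP client would do

# Hypothetical representation of the Create New Source form fields.
source_config = {
    "source_name": "sales_mysql",                 # Source Name on the Infoworks platform
    "source_type": "RDBMS",                       # Files, RDBMS, NoSQL, or CRM
    "driver_name": "com.mysql.jdbc.Driver",       # example JDBC driver class; editable after selecting the type
    "target_hive_schema": "sales_raw",            # Hive schema created by the Hadoop admin
    "target_hdfs_location": "/iw/sources/sales",  # HDFS path created by the Hadoop admin
}

def create_source(base_url: str, token: str, config: dict) -> dict:
    """Post the source definition to a hypothetical admin endpoint.

    The URL path and auth header are placeholders; consult the Infoworks
    REST API documentation for the actual endpoint and payload schema.
    """
    response = requests.post(
        f"{base_url}/api/admin/sources",  # hypothetical endpoint
        json=config,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```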

Installing External Client Drivers

To install external client drivers such as Netezza, SAP HANA, and Teradata, see External Client Drivers.

Ingestion Jobs

Following are the types of ingestion jobs you can perform to achieve specific types of ingestion (a small classification sketch follows this list):

  • source_test_connection: Test connection job for all RDBMS.
  • source_fetch_metadata: Metadata crawl for RDBMS.
  • source_structured_fetch_metadata: Metadata crawl for file-based ingestion.
  • source_crawl: Initialize and ingest for RDBMS over JDBC.
  • source_crawl_tpt: Initialize and ingest for Teradata sources while using TPT.
  • source_crawl_stage_tpt: Stage data job for Teradata while using TPT when crawling for the first time.
  • source_crawl_process_tpt: Processing job for Teradata while using TPT when crawling for the first time.
  • source_crawl_chunk_load: Segmented load for RDBMS.
  • source_stage_tpt: Stage data task for incremental load.
  • source_cdc: CDC job for RDBMS.
  • source_cdc_tpt: CDC job for Teradata while using TPT.
  • source_merge: Merge job for RDBMS.
  • source_merge_tpt: Merge job for Teradata while using TPT.
  • source_switch: Commit to the data after merge.
  • source_cdc_merge: CDC and merge (ingest now) job for RDBMS.
  • source_stage_cdc_merge_tpt: Ingest now for Teradata sources while using TPT.
  • source_unstructured_crawl: Initialize and ingest now for unstructured files.
  • source_unstructured_cdc: Ingest now for unstructured files.
  • source_structured_crawl: Initialize and ingest now for CSV and fixed-width ingestion.
  • source_structured_cdc_merge: Ingest now for CSV and fixed-width ingestion.
  • source_structured_cdc: CDC job for CSV and fixed-width ingestion.
  • source_structured_merge: Merge job for CSV and fixed-width ingestion.
  • source_semistructured_crawl: Initialize and ingest now job for JSON and XML ingestion.
  • source_semistructured_cdc_merge: Ingest now job for JSON and XML.
  • source_semistructured_cdc: CDC job for JSON and XML.
  • source_semistructured_merge: Merge job for JSON and XML.
  • source_reorganize: Data reorganize job for all sources.
  • source_reconcile: Data reconciliation job for RDBMS.
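
When scripting against job logs or monitoring output, it can help to classify these job type identifiers by the naming convention visible above. The sketch below is an illustration in Python: the job-type strings and their meanings come from the list above, but the classification helper itself is an assumption for demonstration, not part of any Infoworks API.

```python
# Minimal sketch: classify Infoworks ingestion job types by naming convention.
def classify_job(job_type: str) -> str:
    """Return the ingestion family implied by a job type identifier."""
    if job_type.endswith("_tpt"):
        return "Teradata via TPT"
    if job_type.startswith("source_semistructured_"):
        return "JSON/XML (semi-structured files)"
    if job_type.startswith("source_structured_"):
        return "CSV/fixed-width (structured files)"
    if job_type.startswith("source_unstructured_"):
        return "unstructured files"
    return "RDBMS or source-wide job"

if __name__ == "__main__":
    for job in ("source_crawl", "source_cdc_tpt", "source_semistructured_merge",
                "source_unstructured_crawl", "source_reorganize"):
        print(f"{job:30s} -> {classify_job(job)}")
```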

Ingestion Notification Services

Source Level Notifications

  • In the Source Settings page, click the Add New Subscriber button.
  • Enter the email ID of the subscriber.
  • Select the Notify Via options, which include email and Slack.
  • Select the jobs for which the subscriber must be notified. The jobs include fetch metadata and deletion of the source.
  • Click Save. The subscriber will be notified for the selected jobs.

Table Level Notifications

  • In the Table Configuration page, click the Add New Subscriber button.
  • Enter the email ID of the subscriber.
  • Select the Notify Via options, which include email and Slack.
  • Select the jobs for which the subscriber must be notified. The jobs include crawl, reconciliation, reorganization, and export of the table.
  • Click Save. The subscriber will be notified for the selected jobs. A subscriber configuration sketch, covering both source-level and table-level notifications, follows this list.
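
Subscriber entries at both levels carry the same pieces of information: who to notify, over which channels, and for which jobs. The sketch below models that as plain Python data; the `Subscriber` class, its field names, and the job-name strings are illustrative assumptions, not the actual Infoworks configuration schema.

```python
from dataclasses import dataclass, field

# Illustrative model of a notification subscriber; field names are assumptions.
@dataclass
class Subscriber:
    email: str
    notify_via: list = field(default_factory=lambda: ["email"])  # "email" and/or "slack"
    jobs: list = field(default_factory=list)                     # jobs that trigger a notification

# Source-level subscription: notified on fetch metadata and deletion of the source.
source_subscriber = Subscriber(
    email="data-team@example.com",
    notify_via=["email", "slack"],
    jobs=["fetch_metadata", "source_deletion"],
)

# Table-level subscription: notified on crawl, reconciliation, reorganization, and export.
table_subscriber = Subscriber(
    email="oncall@example.com",
    notify_via=["email"],
    jobs=["crawl", "reconciliation", "reorganization", "export"],
)

print(source_subscriber)
print(table_subscriber)
```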