Onboarding Data from Kafka

Kafka Ingestion

Features

  • Ingest data from multiple topics into single or multiple tables via Topic Mapping.
  • Ingest data in batches or in real time, using Spark Streaming.
  • View a snapshot of the configured topics under Topic Preview in Raw or Structured format.
  • Visual Schema Projection to create tables from a subset of data.

Ingestion Flow

  • Infoworks creates a Kafka consumer for the topics, based on the user-provided configuration.
  • The consumer reads records from the topic(s) based on the configuration provided under Topic Mappings.
  • A configurable number of messages is read, and the schema for those records is crawled.
  • Messages are crawled using Spark Structured Streaming, which converts the messages into a DataFrame that is appended to continuously.
  • The Value field in each message is parsed as JSON, and the detected schema is applied.
  • Based on configurations such as the storage format and the path to the output directory, a delta target table is populated.
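
The schema crawl step in the flow above can be illustrated with a minimal, pure-Python sketch. It is not the Infoworks implementation (which, per the flow above, uses Spark Structured Streaming). The function names `infer_type`, `merge_schema`, and `crawl_schema`, the `sample_size` parameter, and the type labels are all hypothetical and chosen only for illustration. The sketch reads a limited number of messages, parses each value as JSON, and merges the fields it sees into one schema.

```python
import json


def infer_type(value):
    """Map a parsed JSON value to a simple type description (labels are illustrative)."""
    if isinstance(value, dict):
        return {key: infer_type(child) for key, child in value.items()}
    if isinstance(value, list):
        # Describe an array by the type of its first element, if there is one.
        return [infer_type(value[0])] if value else ["unknown"]
    if isinstance(value, bool):  # bool must be tested before int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "unknown"
    return "string"


def merge_schema(left, right):
    """Union two inferred schemas so that fields seen in either message are kept."""
    if isinstance(left, dict) and isinstance(right, dict):
        merged = dict(left)
        for key, child in right.items():
            merged[key] = merge_schema(merged[key], child) if key in merged else child
        return merged
    # On a conflict, keep the first type seen unless it carried no information.
    return right if left == "unknown" else left


def crawl_schema(messages, sample_size):
    """Infer one schema from at most `sample_size` JSON message values."""
    schema = {}
    for raw in messages[:sample_size]:
        schema = merge_schema(schema, infer_type(json.loads(raw)))
    return schema
```

Calling `crawl_schema(values, 2)` on a list of JSON strings inspects only the first two messages, so fields that first appear in later messages are not part of the crawled schema.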

Prerequisites and Considerations

  • Infoworks supports Kafka records in JSON format only.
  • Kafka Ingestion cannot be integrated into a workflow because it is a near-real-time, continuously streaming job.

Creating a Kafka Source

To create a Kafka source, see $link[page,252043,Creating Source,creating-source]. Ensure that the Source Type selected is Kafka.

Configuring a Kafka Source

Click the Data Catalog menu and click the Ingest button for the source you created.

The configuration flow is organised into four tabs as follows:

$inline[badge,NOTE,primary] Configure the details in each tab, then click the next tab name displayed at the top of the window to continue the Kafka configuration.

Set Up Source

The Set Up Source screen is as follows:

Source Tab Fields and Description

Target Fields and Description

After entering the required details, click the Save and Test Connection button to save the settings and verify that Infoworks can connect to the source system. Clicking Save and Test Connection also ensures that the Topics list is populated with suggestions when you configure Topic Mappings in the next step.

Other sections available under the Source setup tab are:

  • $link[page,252059,Configuration Migration,configuration-migration]
  • $link[page,252059,Advanced Configurations,advanced-configurations]
  • $link[page,252059,Subscribers,subscribers]

Now, click the Topic Mappings tab.

Topic Mappings

This tab allows you to subscribe to the required topics using a list or regex patterns, and map them to a single table or multiple tables.

You may preview the messages in the topics subscribed, crawl them, and configure tables.
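
The two subscription modes described above (an explicit list or a regex pattern) can be sketched as follows. This is only an illustration: the helper `select_topics` and its parameters are hypothetical, it uses Python regex syntax with full-name matching, and the matching rules that Infoworks or the underlying Kafka client actually apply may differ.

```python
import re


def select_topics(available, names=None, pattern=None):
    """Return the available topics matched by an explicit list or by a regex pattern."""
    if names is not None:
        wanted = set(names)
        return [topic for topic in available if topic in wanted]
    if pattern is not None:
        compiled = re.compile(pattern)
        # Assumption: the pattern must match the whole topic name.
        return [topic for topic in available if compiled.fullmatch(topic)]
    return []
```

For example, with topics `orders_eu`, `orders_us`, and `payments`, the pattern `orders_.*` selects the first two, while the list `["payments"]` selects only the third.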

Click the Topic Mapping button to configure topics. The following window appears:

Topic Mapping Fields and Description

After adding the required topics, click Save to save the configured settings.

The following tabs are displayed in the window that appears:

  • $link[page,252059,Topic Preview Tab,topic-preview-tab]
  • $link[page,252059,Schema Tab,schema-tab]

Topic Preview tab

The Topic Preview tab allows you to quickly view a snapshot of the topics.

Click the + icon corresponding to each displayed message to preview its content, and then click the Crawl Schema button.

The maximum number of message rows available for preview is set by the topic_preview_row_count parameter in the Advanced Configuration section. The default value is 100 rows.

The topic preview offers two views: Raw (displays the content as it is read) and Pretty (displays the content in a structured format).
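
As a rough illustration of the row limit and the two views, the sketch below caps the preview at a row count and optionally re-indents each JSON message. The function `preview` and its arguments are hypothetical, and the assumption that the Pretty view corresponds to indented JSON is ours, not something the product documentation states.

```python
import json


def preview(messages, row_count=100, pretty=False):
    """Return at most `row_count` messages, either as read or as indented JSON."""
    rows = messages[:row_count]
    if not pretty:
        return rows  # Raw view: content exactly as it was read
    # Pretty view (assumed here to mean indented JSON)
    return [json.dumps(json.loads(message), indent=2) for message in rows]
```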

$inline[badge,Note,primary] If the Topic Preview is not populated, check the broker connection details under the source configuration, or check the list or regex entered in the Topic Mappings page.

Schema tab

The Schema tab is then displayed as follows:

Select any path to create a table with the corresponding columns in it. Click the required node. You can also hold the Shift key and click multiple required nodes. The child nodes of all the selected parent nodes become table columns.

For example, if you select only the address node in the example above, the created table consists of four columns: street address, city, state, and postal code.
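
The projection described above, where the children of a selected node become columns, can be sketched in a few lines. The helper `project_columns`, the sample record, and its field names (`streetAddress`, `postalCode`, and so on) are hypothetical. The check that rejects non-struct nodes mirrors the restriction listed under Limitations.

```python
def project_columns(record, path):
    """Return the children of the struct node at `path` as column name/value pairs."""
    node = record
    for key in path:
        node = node[key]
    if not isinstance(node, dict):
        # A non-struct node (such as a plain id field) cannot be a table root.
        raise ValueError("selected node is not a struct: " + ".".join(path))
    return dict(node)


# Hypothetical sample record used only for this illustration.
record = {
    "id": 7,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021",
    },
}
columns = list(project_columns(record, ["address"]))
```

Here `columns` holds the four child names of `address`, while `project_columns(record, ["id"])` raises a `ValueError` because `id` is not a struct.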

To manage the nodes in the schema, use the Add Node, Edit Node, and Remove Node buttons in the top-right corner of the tab. To revert the edits to the schema, recrawl it from the Topic Preview tab.

Click Create Table to create the table with the selected nodes. The following window is displayed:

Create Table Field Description

$inline[badge,Note,primary] For CDW onboarding (Snowflake environment), table names are converted to upper case once they are saved. To create tables with case-sensitive names, enter the table names within quotes.
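
The naming rule in the note above can be illustrated with a small sketch. The function `resolve_table_name` is hypothetical, it assumes the quotes are double quotes, and it ignores edge cases such as escaped quotes inside a name.

```python
def resolve_table_name(name):
    """Show how a saved table name resolves: quoted names keep case, others are upper-cased."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1]  # quoted: case is preserved
    return name.upper()  # unquoted: converted to upper case
```

Under these assumptions, `orders` is saved as `ORDERS`, while `"Orders"` keeps its mixed case.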

After configuring the required details, click Save.

The left panel displays the list of tables created. Click a table name to view or edit the table schema and to view the sample data. Click the edit icon corresponding to the table name to edit the table configuration.

Now, navigate to the Configure Tables tab.

Configure Synchronization

For configuring Kafka tables, select the required table, and then enter required details.

Configure Table Field Description

Merge Details

Enter the following fields under Merge Details.

Edit User

You can edit the user details for either the current user or a different user.

  1. Select either Current User or Different User.
  2. Enter the E-mail.
  3. Enter the refresh Token available under My Profile -> Settings.
  4. Click Save.

Schedule Details

If you select Scheduler under Merge details, then you can set the recurrence details as follows.

Recurrence Type: Select one of the following recurrence types. The default recurrence type is Daily.

  • Only Once
  • By minutes
  • Hourly
  • Daily
  • Weekly
  • Monthly

Effective duration: Enter the effective start date of the schedule.

$inline[badge,NOTE,primary] If a scheduled job overlaps another running job, then it will be queued until the running job is completed.

For more information on table configuration, see $link[page,252078,auto$].

After configuring the required details, click Save to save the settings.

Target Configuration

Enter the following fields under Target Configuration.

Adding a column to the table

After the metadata crawl is complete, you can add a target column to the table.

A target column is an additional column that you add when the target table needs columns beyond those present in the source.

You can select the data type for the column.

You can select either of the following transformation modes: Simple or Advanced.

Simple Mode

In this mode, you must add the transformation function to be applied to the column. A target column with no transformation function applied has null values in the target.

Advanced Mode

In this mode, you provide a Spark expression in this field. For more information, refer to $link[page,252078,Adding Transform Derivation,editing-table-schema].

$inline[badge,NOTE,primary] When a table is in the ready state (already ingested), schema editing is disabled.

Onboard Data

Perform the following steps to onboard data from Kafka:

  1. Click the Onboard Data tab.
  2. Select the required table(s), and then click Start to start streaming the data.
  3. You may also stop the data streaming by clicking the Stop button. The Truncate button allows you to delete a table.
  4. On clicking Start, the following window appears:
  5. Fill in the required details and then click Ingest. Ensure that the cluster template setup is configured for your source. For more information on field values, see section in the topic.

The following window appears:

Click the Click here to track progress link to view the ingestion status. This takes a few minutes. On clicking the link, the job status and summary are displayed on the tab.

Click the Ingestion metrics tab to view a detailed summary of the job. This tab is equipped with helpful filters.

This summarises the complete Kafka ingestion process.

Configuration Migration

$inline[badge,NOTE,primary] The configuration for tables that are in the ready state is not migrated.

For details on the configuration migration process, see $link[page,252044,Configuration Migration,configuration-migration].

Advanced Configurations

For setting up advanced configuration, see $link[page,252044,Advanced Configurations,setting-advanced-configuration].

Subscribers

For more information on subscribers, see $link[page,252044,Subscribers,setting-ingestion-notification-services].

Limitations

  • Non-struct nodes cannot be selected as the root element of the table. For example, nodes such as id or type cannot be selected to create tables.
  • Two different struct nodes that are not directly connected cannot be used to create table columns. For example, nodes such as item and batter cannot be selected to create the same table.
  • Two struct nodes at the same level cannot be selected to create the same table. For example, nodes such as batters and topping cannot be selected to create the same table.