Metadata Crawl from BigQuery

Overview

This feature retrieves the metadata of existing BigQuery tables so that they can be used in downstream pipelines, alongside tables ingested from other sources.

Creating a BigQuery Source

The following are the steps to create a BigQuery source:

Step 1: In the left navigation pane of the Infoworks UI, click the Data Sources icon.

Step 2: Click Onboard New Data. The Source Connectors page appears with the list of all available connectors.

Step 3: In the Search... bar, type “BigQuery Metadata Sync”.

Step 4: Click the BigQuery Metadata Sync connector. The configuration page of the connector appears.

NOTE A BigQuery Metadata Sync source can be created only in a BigQuery environment.

Configuring a BigQuery Source

The following are the steps to configure a BigQuery source:

Configure Source & Target

Step 1: In the Configure Source & Target page, enter the following configuration details.

Field	Description
Source Name	Provide a source name for the target table.
Project ID	Provide the respective Project ID. This ID is present in the Google BigQuery Console.
Data Environment	Select the environment where the tables are registered. Infoworks spawns a Spark session in the persistent cluster running in that environment and fetches all registered tables. NOTE The dropdown list shows only the available BigQuery environments.
Temporary Storage	Select from one of the storage options defined in the BigQuery environment.
Base Location	The path to the base/target directory where all the data should be stored.
Make available in infoworks domains	Select the relevant domain from the dropdown list to make the source available in the selected domain.

Step 2: Click the Save button. Click Next.
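
Before entering these values, it can help to sanity-check them offline. The following is a minimal sketch and not part of Infoworks: the SourceConfig class, its field names, and the sample values are hypothetical stand-ins for the form fields. The Project ID check follows Google Cloud's published naming rules (6 to 30 characters; lowercase letters, digits, and hyphens; starts with a letter; does not end with a hyphen). Legacy domain-scoped project IDs that contain a colon are not handled.

```python
import re
from dataclasses import dataclass, fields
from typing import List

# Google Cloud project ID rules: 6-30 characters, lowercase letters, digits,
# and hyphens; must start with a letter and must not end with a hyphen.
PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


@dataclass
class SourceConfig:
    """Hypothetical container mirroring the form fields; not an Infoworks API."""
    source_name: str
    project_id: str
    data_environment: str
    temporary_storage: str
    base_location: str


def validate(config: SourceConfig) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Every field in the form is treated as required in this sketch.
    for f in fields(config):
        if not getattr(config, f.name).strip():
            problems.append(f"{f.name} is empty")
    project_id = config.project_id.strip()
    if project_id and not PROJECT_ID_RE.fullmatch(project_id):
        problems.append("project_id does not look like a Google Cloud project ID")
    return problems


# Sample values are made up for illustration only.
config = SourceConfig(
    source_name="bq_meta",
    project_id="my-project-123",
    data_environment="bq-env",
    temporary_storage="gcs-temp",
    base_location="/iw/sources/bq_meta",
)
print(validate(config) or "configuration looks plausible")
```

A check like this only catches formatting mistakes; whether the project, environment, and storage actually exist is still verified by Infoworks when the source is saved.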

Select Tables

You can select the tables for which the metadata crawl is required. You can add more tables later.

Step 1: In the Select Tables step, you can choose to Browse entire source or Filter tables to browse.

Step 2: Filter the tables by Schema Name or Table Name. You can enter multiple names separated by commas, and use "%" as a wildcard.

Step 3: Click Browse Source. The Browse source area appears.

NOTE The Browse Source page can take a long time to appear because bulk_payload_record_size is set to 6500 by default.

To make the tables appear faster, scroll down to the Advanced Configurations section and set bulk_payload_record_size to 100. This value can be changed at the admin and source levels.

Step 4: Select the check boxes next to the relevant tables, and click Add Selected Tables.

Step 5: Click Crawl Metadata to proceed. A success message appears.

Metadata crawl has been triggered. To view the job status, click View Job Status.
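
The filter syntax used in the Select Tables step (comma-separated names, with "%" as a wildcard) can be illustrated with a small sketch. The matching semantics here are an assumption modeled on SQL LIKE, where "%" matches any run of characters; the function name matches_filter and the sample table names are hypothetical, and Infoworks may differ in details such as case sensitivity.

```python
import re


def matches_filter(name: str, filter_text: str) -> bool:
    """Return True if name matches any comma-separated pattern.

    Assumed semantics: "%" matches any run of characters (as in SQL LIKE),
    matching is case-insensitive, and an empty filter matches everything.
    """
    patterns = [p.strip() for p in filter_text.split(",") if p.strip()]
    if not patterns:
        return True
    for pattern in patterns:
        # Escape regex metacharacters, then turn each "%" into ".*".
        regex = ".*".join(re.escape(part) for part in pattern.split("%"))
        if re.fullmatch(regex, name, flags=re.IGNORECASE):
            return True
    return False


# Hypothetical table names, filtered the way the Table Name field might be.
tables = ["sales_2022", "sales_2023", "customers", "orders"]
selected = [t for t in tables if matches_filter(t, "sales_%, orders")]
print(selected)
```

Trying a pattern against a known list of table names in this way can help confirm that a filter is neither too broad nor too narrow before browsing a large source.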

Last updated on Jan 9, 2023
