Infoworks 6.1.0
Onboard Data

Onboarding Data from Structured Files

Infoworks allows you to ingest data in structured file formats, such as Comma-Separated Values (CSV), into the data lake. You can fetch the CSV files from DBFS, SFTP, and cloud storage.

Demo

Here is a demo on ingesting data in the structured file format.

Creating a Structured File Source

For creating a structured file source, see Creating a Source. Ensure that the Source Type selected is Structured Files (CSV).

Setting Up a Structured File Source

For setting up a structured file source, see Setting Up a Source.

Structured File Configurations

Select one of the following storage systems depending on where your files are stored:

  • Databricks File System (DBFS)
  • Remote Server (using SFTP)
  • Cloud Storage

Databricks File System (DBFS)

Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable cloud object storage. For more details, see Databricks Documentation.

For preparing data for ingestion from Databricks File System (DBFS), enter the following:

Source Base Path: The path to the base directory in DBFS where all the CSV files to be accessed are stored. All other file paths are relative to this base path. For example, if the CSV file is stored in iw/filestorage/ingestion/sample in DBFS, the base path refers to iw/filestorage/ingestion.
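
For instance, you can confirm the directory layout from a Databricks notebook before configuring the source. The following is a minimal sketch, assuming the hypothetical iw/filestorage/ingestion base path used above; dbutils is available implicitly in Databricks notebooks and is not part of Infoworks.

  # Sketch: list the hypothetical DBFS base path and its contents.
  # The path below is the example base path from this section, not a required value.
  for entry in dbutils.fs.ls("dbfs:/iw/filestorage/ingestion"):
      print(entry.path, entry.size)
  # CSV files for one table would then sit under a relative path such as
  # dbfs:/iw/filestorage/ingestion/sample/.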

Remote Server (Using SFTP)

Infoworks allows you to access files stored on remote servers using SSH File Transfer Protocol (SFTP). This ensures your data is transferred over a secure channel.

For preparing data for ingestion from a Remote Server (using SFTP), enter the following:

Source Base Path: The path to the base directory in the file system where all the CSV files to be accessed are stored. All other file paths are relative to this base path. For example, if the CSV file is stored in iw/filestorage/ingestion/sample, the base path refers to iw/filestorage/ingestion.

SFTP Host: The host from which data will be read.

SFTP Port: The port on which the SFTP service runs.

Username: The username to log in to the SFTP host.

Authentication Mechanism: Select whether you want to authenticate using a private key or a password.

Authentication Type for Password: Select the authentication type from the dropdown, for example, Infoworks Managed or External Secret Store. If you select Infoworks Managed, then provide the password in Authentication Password for Password. If you select External Secret Store, then select the Secret which contains the password.

Private Key: The private key to log in to the SFTP host. It can be entered as text, uploaded as a file, or provided as a path on the edge node.

NOTE This field appears only when Using Private Key is selected as the Authentication Mechanism.

NOTES

When using Private Key as authentication mechanism:

  • The client public key needs to be added under ~/.ssh/authorized_keys on the SFTP server. The corresponding private key on the job cluster will be used to connect.
  • The private key should be in RSA format. If it is available in OpenSSH format, use the command "ssh-keygen -p -f <file> -m pem" to convert it into RSA format.

To resolve file ingestion failures from SFTP, refer to File Ingestion Failure from SFTP Server.
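
Before configuring the source, it can help to verify that the SFTP host, port, username, and private key work together. The following is a minimal sketch using the third-party paramiko library (not part of Infoworks); the host, port, username, key path, and directory are placeholders.

  # Sketch: verify SFTP connectivity with an RSA private key using paramiko.
  # All values below are placeholders, not Infoworks settings.
  import paramiko

  key = paramiko.RSAKey.from_private_key_file("/path/to/id_rsa")  # PEM/RSA-format key
  transport = paramiko.Transport(("sftp.example.com", 22))
  transport.connect(username="ingest_user", pkey=key)

  sftp = paramiko.SFTPClient.from_transport(transport)
  print(sftp.listdir("iw/filestorage/ingestion"))  # contents of the source base path
  sftp.close()
  transport.close()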

Cloud Storage

Infoworks allows you to access files from cloud storage, ingest the data, and perform analytics on it. Cloud storage refers to data stored on remote servers accessed over the internet (the cloud).

For preparing data for ingestion from Cloud Storage, enter the following:

Cloud Type: The options include Azure Blob Storage (WASB), Amazon S3, GCS, and Azure DataLake Storage (ADLS) Gen2.

Windows Azure Storage Blob (WASB), also known as Blob Storage, is a file system for storing large amounts of unstructured object data, such as text or binary data. For more details, see Azure Blob Storage Documentation.

For Azure Blob Storage (WASB), enter the following:

Container Name: The name of the container in the storage account in which the files are stored. A container organizes a set of blobs, similar to a directory in a file system. For more details, see Create a container.

Storage Account Name: The unique name of the storage account. For more details, see Create a storage account.

Authentication Mechanism: The mechanism using which the security information is stored. The options include Account Key and None. Account Key refers to the access key used to authenticate applications when making requests to this Azure storage account. You can choose None to access data from public cloud storage folders.

Enable Support for Azure Government Cloud Regions: Select this check box to enable ingestion from a source available in the Government Cloud regions. This option is not editable if the data source tables are already created.

Authentication Type for Storage Account Key: The storage account access key. This option is displayed only if the Authentication Mechanism is Account Key. For more details, see Manage storage account access keys. If you select Infoworks Managed from the Authentication Type for Storage Account Key dropdown, then provide the key in Authentication Password for Storage Account Key. If you select External Secret Store, select the secret from the Secret for Storage Account Key dropdown.

Source Base Path: The path to the base directory in the file system where all the CSV files to be accessed are stored. All other file paths are relative to this base path. The default value is /, which indicates the base path of the following format: wasbs://<container_name>@<account_name>.blob.core.windows.net/

Snowflake Warehouse: Snowflake warehouse name. The warehouse is pre-filled from the selected Snowflake environment and this field is editable.

NOTE This field appears only when the Data Environment is snowflake.
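
As a quick sanity check outside Infoworks, you can list the blobs under the intended source base path with the azure-storage-blob Python package. This is a minimal sketch; the account name, container name, account key, and prefix are placeholders.

  # Sketch: list blobs under the source base path using azure-storage-blob.
  from azure.storage.blob import BlobServiceClient

  service = BlobServiceClient(
      account_url="https://<account_name>.blob.core.windows.net",
      credential="<storage_account_key>",
  )
  container = service.get_container_client("<container_name>")
  for blob in container.list_blobs(name_starts_with="iw/filestorage/ingestion/"):
      print(blob.name, blob.size)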

Amazon Simple Storage Service (S3) is storage for the Internet. Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. For more details, see Amazon S3 Documentation.

For Amazon S3, enter the following:

Authentication Mechanism: The mechanism using which the security information is stored.

Access ID: The access ID uniquely identifies an AWS account. You can use the access ID to send authenticated requests to Amazon S3. The access ID is a 20-character alphanumeric string, for example, AKIAIOSFODNN7EXAMPLE. For the access ID details, contact your AWS Cloud admin.

Authentication Type for Secret Key: The secret key used to send requests using the AWS account. The secret access key is a 40-character string, for example, wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY. For the secret key details, contact your AWS Cloud admin. For more details, see AWS Account Access Keys. If you select Infoworks Managed from the Authentication Type for Secret Key dropdown, then provide the key in Authentication Password for Secret Key. If you select External Secret Store, select the secret from the Secret for Secret Key dropdown.

Source Base Path: The bucket where all the CSV files to be accessed are stored. For more details, see Creating a bucket.

Snowflake Warehouse: Snowflake warehouse name. The warehouse is pre-filled from the selected Snowflake environment and this field is editable.

NOTE This field appears only when the Data Environment is snowflake.
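
Similarly, you can confirm that the access ID and secret key can read the intended bucket with the boto3 package. This is a minimal sketch; the bucket name, prefix, and credentials are placeholders (the access ID shown is the documented AWS example value).

  # Sketch: list objects under a prefix in the source bucket using boto3.
  import boto3

  s3 = boto3.client(
      "s3",
      aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
      aws_secret_access_key="<secret_access_key>",
  )
  response = s3.list_objects_v2(Bucket="<bucket_name>", Prefix="iw/filestorage/ingestion/")
  for obj in response.get("Contents", []):
      print(obj["Key"], obj["Size"])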

Google Cloud Storage (GCS) is an online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. In GCS, buckets are the basic containers that hold the data. For more details, see GCS documentation.

For GCS, enter the following:

Project ID: The ID of the project in the source. For more details, see Project ID.

Authentication Mechanism: The mechanism using which the security information is stored. The Service Account Credential File option refers to the key used to authenticate applications when making requests to this Google Cloud Storage account. For more details, see Creating and managing service account keys.

Service Account Credential File Upload: You can upload the service account JSON credential file or enter the location where it is available on the edge node. This option is displayed if the Authentication Mechanism selected is Service Account Credential File.

Source Base Path: The path to the base directory in the file system where all the CSV files to be accessed are stored. The base path must begin with gs://. For more details, see Creating storage buckets.

Snowflake Warehouse: Snowflake warehouse name. The warehouse is pre-filled from the selected Snowflake environment and this field is editable.

NOTE This field appears only when the Data Environment is snowflake.
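
To confirm that the service account credential file can read the intended bucket, you can use the google-cloud-storage package. This is a minimal sketch; the credential file path, bucket name, and prefix are placeholders.

  # Sketch: list objects under a prefix in the GCS bucket using google-cloud-storage.
  from google.cloud import storage

  client = storage.Client.from_service_account_json("/path/to/service_account.json")
  bucket = client.bucket("<bucket_name>")
  for blob in bucket.list_blobs(prefix="iw/filestorage/ingestion/"):
      print(blob.name, blob.size)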

For Azure DataLake Storage (ADLS) Gen2, enter the following:

Access Scheme: The scheme used to access ADLS Gen2. The available options are abfs:// and abfss://.

Authentication Mechanism: The mechanism using which the security information is stored. The options include Access Key and Service Principal. Access Key refers to the key used to authenticate applications when making requests to this Azure storage account. A Service Principal functions as the identity of the application instance.

File System: Provide the file system where all the data of an artifact will be stored.

Storage Account Name: Provide the name of the Azure storage account.

Authentication Type for Access Key: Provide the storage access key to connect to ADLS. If you select Infoworks Managed from the Authentication Type for Access Key dropdown, then provide the key in Authentication Password for Access Key. If you select External Secret Store, select the secret from the Secret for Access Key dropdown.

NOTE This field appears only when the authentication mechanism selected is Access Key.

Application ID: Provide the ID that uniquely identifies the user application.

NOTE This field appears only when the authentication mechanism selected is Service Principal.

Directory ID: Provide the ID that uniquely identifies the Azure AD instance.

NOTE This field appears only when the authentication mechanism selected is Service Principal.

Authentication Type for Service Credential: Provide the credential string value that the application uses to prove its identity. If you select Infoworks Managed from the Authentication Type for Service Credential dropdown, then provide the credential in Authentication Password for Service Credential. If you select External Secret Store, select the secret from the Secret for Service Credential dropdown.

NOTE This field appears only when the authentication mechanism selected is Service Principal.
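
To confirm access to the file system before configuring the source, you can list its paths with the azure-storage-file-datalake package. This is a minimal sketch using the Access Key mechanism; the account name, file system, access key, and directory are placeholders.

  # Sketch: list paths in an ADLS Gen2 file system using azure-storage-file-datalake.
  from azure.storage.filedatalake import DataLakeServiceClient

  service = DataLakeServiceClient(
      account_url="https://<account_name>.dfs.core.windows.net",
      credential="<storage_access_key>",
  )
  fs = service.get_file_system_client("<file_system>")
  for path in fs.get_paths(path="iw/filestorage/ingestion"):
      print(path.name, path.is_directory)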

Target Fields and Description
Data Environment: Select the data environment from the drop-down list, where the data will be onboarded.

Storage: Select from one of the storage options defined in the environment.

NOTE This field becomes Temporary Storage when the Data Environment is snowflake.

Base Location: The path to the base/target directory where all the data should be stored.

Schema Name: The schema name of the target.

NOTE This field becomes Snowflake Schema Name when the Data Environment is snowflake.

Snowflake Database Name: The database name of the Snowflake target.

NOTE This field appears only when the Data Environment is snowflake.

Make available in Infoworks domains: Select the relevant domain from the dropdown list to make the source available in the selected domain.

Mapping File

This page allows you to map individual files stored in the file system. You can add tables to represent each file, crawl them and map the file details based on the preview.

  1. Click the File Mappings tab and click Add Table.
  2. Enter the following file location details:

Table Name: The name to represent the table in the Infoworks User Interface.

Target Table Name: The name for the target table to be created in the data lake.

Source Relative Path: The path to the directory (relative to source base path) where the CSV files are stored. For example, if the CSV file is stored in iw/filestorage/ingestion/sample, the relative path refers to /sample.

Target Relative Path: The target directory path (relative to the target base path) to store the crawled output. Hive external tables will be created on this directory.

Include Files Pattern (regex): The regex pattern to include files. Only the files matching this Java regex will be crawled.

Exclude Files Pattern (regex): The regex pattern to exclude files. The files matching this Java regex will not be crawled.

Include Subdirectories: The option to ingest the CSV files present in the subdirectories of the source path.

File Type: The type of structured file to be ingested.

  3. Click Save. The File Preview will be displayed.
  4. Enter the following mapping details based on the file preview (see the parsing sketch after this procedure):

Number of Header Rows: The number of header rows in the file. If the files contain any header rows, enter the value as 1. This field also allows you to skip lines after the header row. If n lines are to be skipped after the header in all the files, enter the value as n+1.

Number of Footer Rows: This field allows you to skip n lines from the bottom of the file.

Column Separator: The delimiter in the delimited files. All the files in the mentioned source path must have the same value. This value must be a character.

Column Enclosed By: The character used to enclose data in the delimited columns. Generally, CSV file columns are enclosed by single quotes. All the files in the mentioned source path must have the same value. This value must be a character.
Escape Character: The character used in the delimited files to escape occurrences of Column Separator and Column Enclosed By in the data. All the files in the mentioned source path must have the same value. This value must be a character. The default value is \.

Character Encoding: The character encoding of the delimited files. The default value is UTF-8.

Use Column Record Separator: Select this check box to specify a record separator/line terminator other than \n, \r, and \r\n. You can provide the record separator/line terminator if you enable this option.

Record Separator: Provide the custom record separator in this field. For example, \u001E.
  5. Click Save and Crawl Schema.

The schema will be crawled, and sample records of the table will be displayed.
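
The mapping fields above correspond closely to how a CSV parser is configured. The following is a minimal local sketch using Python's csv module to preview one file under hypothetical settings (comma separator, single-quote enclosure, backslash escape, one header row); Infoworks performs the actual parsing on the cluster, so this only illustrates what the settings mean. The file name and include pattern are placeholders.

  # Sketch: preview a CSV file locally under hypothetical mapping settings.
  import csv
  import re

  include_pattern = re.compile(r".*\.csv")                  # Include Files Pattern (regex)
  file_name = "orders_2024.csv"                             # placeholder file name

  if include_pattern.fullmatch(file_name):
      with open(file_name, newline="", encoding="UTF-8") as f:   # Character Encoding
          reader = csv.reader(f,
                              delimiter=",",                      # Column Separator
                              quotechar="'",                      # Column Enclosed By
                              escapechar="\\")                    # Escape Character
          rows = list(reader)
      header_rows = 1                                             # Number of Header Rows
      for row in rows[header_rows:header_rows + 5]:               # first few data records
          print(row)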

Edit Schema

You can edit the schema before ingesting the table. For details, see Editing Table Schema.

Configuring a Structured File Table

For configuring a structured file table, see Configuring a Table.

Ingesting Structured File Data

For ingesting a structured file source, see Onboarding Data.

Known Issues

  • The total number of files cannot be more than 500.
  • By default, the Sample Data section displays the datatype as String for every column.

Adding a column to the table

After metadata crawl is complete, you have the flexibility to add a column to the table. It can either be a source column or a target column.

Source Column refers to adding a source column to the table when the metadata crawl of the table did not infer any schema (since the schema is inferred from the smallest file).

Target Column refers to adding a target column if you need any special columns in the target table apart from what is present in that source.

In both cases, you can select the datatype for the specific column. You can select either of the following transformation modes: Simple or Advanced.

Simple Mode

In this mode, you add a transformation function to be applied to that column. A target column with no transformation function applied will have null values in the target.

Advanced Mode

In this mode, you provide a Spark expression for the column. For more information, refer to the Adding Transform Derivation section.
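
For illustration, a Spark SQL expression of the kind this field accepts might look like concat(first_name, ' ', last_name) or cast(year as int). The following minimal sketch shows the effect of such expressions using PySpark outside Infoworks; the column names and expressions are hypothetical examples, not the product's API.

  # Sketch: effect of Spark SQL expressions like those used as advanced derivations.
  from pyspark.sql import SparkSession
  from pyspark.sql.functions import expr

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame([("Ada", "Lovelace", "1815")],
                             ["first_name", "last_name", "year"])

  derived = (df.withColumn("full_name", expr("concat(first_name, ' ', last_name)"))
               .withColumn("year_int", expr("cast(year as int)")))
  derived.show()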

NOTE When the table is in the ready state (already ingested), schema editing is disabled.

Additional Information

NOTE During data crawl, the data that cannot be crawled are stored in an error table, <tablename>_error.

  • For details on audit columns, see Audit Columns.
  • Records that do not adhere to the schema are written as error records; if the number of error records crosses the threshold value, the job fails.
  • Compressed file support: Infoworks supports two types of compressed files: .gz (Gzip) and .bz2 (Bzip2).