Infoworks allows you to ingest data in the Structured File (Comma-Separated Values, or CSV) format into the data lake. You can fetch the CSV files from DBFS, SFTP, and cloud storage.
For creating a structured file source, see Creating a Source. Ensure that the Source Type selected is Structured Files (CSV).
For setting up a structured file source, see Setting Up a Source.
Select one of the following storage systems depending on where your files are stored:
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable cloud object storage. For more details, see Databricks Documentation.
For preparing data for ingestion from Databricks File System (DBFS), enter the following:
Source Base Path: The path to the base directory in DBFS where all the CSV files to be accessed are stored. All other file paths are relative to this path. For example, if the CSV file is stored in iw/filestorage/ingestion/sample in DBFS, the base path is iw/filestorage/ingestion.
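No code is required in the Infoworks UI here, but you may want to confirm the base path exists before configuring the source. The following is a minimal sketch, assuming it is run in a Databricks notebook where dbutils is available; the path simply reuses the example above.

```python
# Assumes a Databricks notebook, where the dbutils helper is available.
# The path reuses the example Source Base Path above; adjust to your layout.
base_path = "dbfs:/iw/filestorage/ingestion"

# List files and subdirectories under the Source Base Path to confirm the
# CSV files (for example, the sample/ directory) are where you expect them.
for entry in dbutils.fs.ls(base_path):
    print(entry.path)
```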
Infoworks allows you to access files stored in remote servers using SSH File Transfer Protocol (SFTP). This ensures your data is transferred on a secured channel.
For preparing data for ingestion from a Remote Server (using SFTP), enter the following:
Field | Description |
---|---|
Source Base Path | The path to the base directory in the file system where all the CSV files to be accessed are stored. All other file paths are relative to this path. For example, if the CSV file is stored in iw/filestorage/ingestion/sample, the base path is iw/filestorage/ingestion. |
SFTP Host | The host from which data will be read. |
SFTP port | The port where the SFTP service is run. |
Username | The username to log in to the SFTP host. |
Authentication Mechanism | Select if you want to authenticate using private key or password. |
Authentication Type for Password | Select the authentication type from the dropdown. For example, Infoworks Managed or External Secret Store. If you select Infoworks Managed, then provide Authentication Password for Password. If you select External Secret Store, then select the Secret which contains the password. |
Private Key | The private key used to log in to the SFTP host. It can be provided as text, uploaded as a file, or specified as a path on the edge node. |
When using Private Key as the authentication mechanism, the public key must be present in ~/.ssh/authorized_keys on the SFTP server. The corresponding private key on the job cluster will be used to connect. If the private key is not in RSA format, run ssh-keygen -p -f <file> -m pem to convert it into RSA format.

To resolve file ingestion failures from SFTP, refer to File Ingestion Failure from SFTP Server.
Infoworks allows you to access files from cloud storage, ingest the data, and perform analytics on them. Cloud storage refers to data stored on remote servers that is accessed over the internet (the cloud).
For preparing data for ingestion from Cloud Storage, enter the following:
Cloud Type: The options include Azure Blob Storage (WASB), Amazon S3, GCS, and Azure DataLake Storage (ADLS) Gen2.
Windows Azure Storage Blob (WASB), also known as Blob Storage, is a file system for storing large amounts of unstructured object data, such as text or binary data. For more details, see Azure Blob Storage Documentation.
For Azure Blob Storage (WASB), enter the following:
Field | Description |
---|---|
Container name | The name of the container in the storage account, in which the files are stored. A container organizes a set of blobs, similar to a directory in a file system. For more details, see Create a container. |
Storage Account name | The unique name of the storage account. For more details, see Create a storage account. |
Authentication Mechanism | The mechanism using which the security information is stored. The options include Account Key and None. Account key refers to the access key to authenticate applications when making requests to this Azure storage account. You can choose None to access data from public cloud storage folders. |
Enable Support for Azure Government Cloud Regions | Select this check box to enable ingestion from a source available in the Government Cloud regions. This option is not editable if the data source tables are already created. |
Authentication Type for Storage Account Key | The storage account access key. This option is displayed if the Authentication Mechanism used is Account Key. For more details, see Manage storage account access keys. If you select Infoworks Managed from the Authentication Type for Storage Account Key dropdown, then provide Authentication Password for Storage Account Key. If you select External Secret Store, select the secret from Secret for Storage Account Key dropdown. |
Source Base Path | The path to the base directory in the file system where all the CSV files to be accessed are stored. All other file paths are relative to this path. The default value is /, which indicates the base path of the following format: wasbs://<container_name>@<account_name>.blob.core.windows.net/ |
Snowflake Warehouse | Snowflake warehouse name. The warehouse is pre-filled from the selected Snowflake environment, and this field is editable. |
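The Container name, Storage Account name, Account Key, and Source Base Path resolve to a wasbs:// location of the format shown above. The sketch below is illustrative only: placeholder values and standard Spark/Hadoop Azure settings rather than Infoworks internals, showing how the fields combine to read CSV files from Blob Storage.

```python
# Placeholder values; `spark` is an existing SparkSession on the cluster.
container = "ingestion-container"       # Container name
account = "mystorageaccount"            # Storage Account name
account_key = "<storage-account-key>"   # Account Key authentication mechanism
source_base_path = "/"                  # default Source Base Path (container root)

# Standard Hadoop/Spark setting for account-key access to Blob Storage.
spark.conf.set(f"fs.azure.account.key.{account}.blob.core.windows.net", account_key)

# The fields combine into a wasbs:// location of the documented format.
wasbs_path = f"wasbs://{container}@{account}.blob.core.windows.net{source_base_path}"
df = spark.read.option("header", "true").csv(wasbs_path + "sample/")
df.show(5)
```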
Amazon Simple Storage Service (S3) is storage for the Internet. Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. For more details, see Amazon S3 Documentation.
For Amazon S3, enter the following:
Field | Description |
---|---|
Authentication Mechanism | The mechanism using which the security information is stored. |
Access ID | The access ID uniquely identifies an AWS account. You can use the access ID to send authenticated requests to Amazon S3. The Access ID is a 20-character, alphanumeric string, for example, AKIAIOSFODNN7EXAMPLE. For the access ID details, contact your AWS Cloud admin. |
Authentication Type for Secret Key | The secret key to send requests using the AWS account. The secret access key is a 40-character string, for example, wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY. For the secret key details, contact your AWS Cloud admin. For more details, see AWS Account Access Keys. If you select Infoworks Managed from the Authentication Type for Secret Key dropdown, then provide Authentication Password for Secret Key. If you select External Secret Store, select the secret from the Secret for Secret Key dropdown. |
Source Base Path | The bucket where all the CSV files to be accessed are stored. For more details, see Creating a bucket. |
Snowflake Warehouse | Snowflake warehouse name. The warehouse is pre-filled from the selected Snowflake environment, and this field is editable. |
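As an illustration of what the Access ID and Secret Key grant, the following sketch lists objects in the source bucket using boto3. The bucket name, prefix, and credentials are placeholders; Infoworks performs this access for you once the fields are saved.

```python
import boto3

# Placeholder credentials and bucket; these correspond to the Access ID,
# Secret Key, and Source Base Path (bucket) fields above.
ACCESS_ID = "AKIAIOSFODNN7EXAMPLE"
SECRET_KEY = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
BUCKET = "my-ingestion-bucket"

s3 = boto3.client(
    "s3",
    aws_access_key_id=ACCESS_ID,
    aws_secret_access_key=SECRET_KEY,
)

# List the CSV files under an example prefix to confirm the layout.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix="ingestion/sample/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```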
Google Cloud Storage (GCS) is an online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. In GCS, buckets are the basic containers that hold the data. For more details, see GCS documentation.
For GCS, enter the following:
Field | Description |
---|---|
Project ID | The ID of the project in the source. For more details, see Project ID. |
Authentication Mechanism | The mechanism using which the security information is stored. The Service Account Credential File option refers to the key used to authenticate applications when making requests to this Google Cloud Storage account. For more details, see Creating and managing service account keys. |
Service Account Credential File Upload | You can upload or enter the location where the service account JSON credential file is available in the edge node. This option is displayed if the Authentication Mechanism selected is Service Account Credential File. |
Source Base Path | The path to the base directory in the file system where all the CSV files to be accessed are stored. The base path must begin with gs://. For more details, see Creating storage buckets. |
Snowflake Warehouse | Snowflake warehouse name. The warehouse is pre-filled from the selected Snowflake environment, and this field is editable. |
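Similarly, the following sketch uses the google-cloud-storage client with a service account JSON file to list objects under the gs:// base path. All names and paths are placeholders; it only illustrates what the fields above authorize.

```python
from google.cloud import storage

# Placeholder values; the JSON file corresponds to the Service Account
# Credential File, and the bucket to the gs:// Source Base Path above.
PROJECT_ID = "my-gcp-project"
CREDENTIAL_FILE = "/opt/infoworks/keys/service-account.json"
BUCKET = "my-ingestion-bucket"

client = storage.Client.from_service_account_json(CREDENTIAL_FILE, project=PROJECT_ID)

# List the CSV objects under an example prefix within gs://my-ingestion-bucket/.
for blob in client.list_blobs(BUCKET, prefix="ingestion/sample/"):
    print(blob.name)
```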
For Azure DataLake Storage (ADLS) Gen2, enter the following:
Field | Description |
---|---|
Access Scheme | Scheme used to access ADLS Gen2. The available options are abfs:// and abfss://. |
Authentication Mechanism | The mechanism using which the security information is stored. The options include Access Key and Service Principal. Access key refers to the key to authenticate applications when making requests to this Azure storage account. Service Principal functions as the identity of the application instance. |
File System | Provide a file system where all the data of an artifact will be stored. |
Storage Account Name | Provide the name of Azure storage account. |
Authentication Type for Access Key | Provide the storage access key to connect to ADLS. If you select Infoworks Managed from the Authentication Type for Access Key dropdown, then provide Authentication Password for Access Key. If you select External Secret Store, select the secret from Secret for Access Key dropdown. |
Application ID | Provide the ID that uniquely identifies the user application. |
Directory ID | Provide the ID that uniquely identifies the Azure AD instance. |
Authentication Type for Service Credential | Provide the credential string value that the application uses to prove its identity. If you select Infoworks Managed from the Authentication Type for Service Credential dropdown, then provide Authentication Password for Service Credential. If you select External Secret Store, select the secret from the Secret for Service Credential dropdown. |
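The Service Principal fields map onto the standard Hadoop ABFS OAuth settings shown below. This is a sketch with placeholder values, not Infoworks internals; it illustrates how the Application ID, Directory ID, Service Credential, File System, and Access Scheme combine into an abfss:// location.

```python
# Placeholder values; `spark` is an existing SparkSession on the cluster.
account = "mystorageaccount"                 # Storage Account Name
file_system = "ingestion-fs"                 # File System
application_id = "<application-id>"          # Application ID
directory_id = "<directory-id>"              # Directory ID (Azure AD tenant)
service_credential = "<service-credential>"  # Service Credential

# Standard Hadoop ABFS OAuth settings for Service Principal authentication.
suffix = f"{account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{suffix}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", application_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", service_credential)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{suffix}",
    f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
)

# The Access Scheme, File System, and account combine into an abfss:// location.
abfss_path = f"abfss://{file_system}@{suffix}/ingestion/sample/"
df = spark.read.option("header", "true").csv(abfss_path)
df.show(5)
```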
Target Fields and Description

Target Field | Description |
---|---|
Data Environment | Select the data environment from the drop-down list, where the data will be onboarded. |
Storage | Select from one of the storage options defined in the environment. |
Base Location | The path to the base/target directory where all the data should be stored. |
Schema Name | The schema name of the target. |
Snowflake Database Name | The database name of the Snowflake target. |
Make available in infoworks domains | Select the relevant domain from the dropdown list to make the source available in the selected domain. |
This page allows you to map individual files stored in the file system. You can add tables to represent each file, crawl them and map the file details based on the preview.
Field | Description |
---|---|
Table Name | The name to represent the table in the Infoworks User Interface. |
Target Table Name | The name for the target table to be created in the data lake. |
Source Relative Path | The path to the directory (relative to source base path) where the CSV files are stored. For example, if the CSV file is stored in iw/filestorage/ingestion/sample, the relative path refers to /sample. |
Target Relative Path | The target directory path (relative to the target base path) to store the crawled output. Hive external tables will be created on this directory. |
Include Files Pattern (regex) | The regex pattern to include files. Only the files matching this Java regex will be crawled; see the sketch after this table. |
Exclude Files Pattern (regex) | The regex pattern to exclude files. The files matching this Java regex will not be crawled. |
Include Subdirectories | The option to ingest the CSV files present in the subdirectories of the source path. |
File Type | The type of structured file to be ingested. |
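The include/exclude patterns are Java regular expressions matched against file names. The following sketch approximates that selection logic in Python (re.fullmatch mirrors Java's full-string matching); the patterns and file names are examples only.

```python
import re

# Example patterns and file names; Java and Python regex syntax agree for
# simple patterns like these, and matching is against the full file name.
include_pattern = re.compile(r".*\.csv")       # Include Files Pattern (regex)
exclude_pattern = re.compile(r".*_tmp\.csv")   # Exclude Files Pattern (regex)

files = ["orders_2023.csv", "orders_tmp.csv", "readme.txt"]

selected = [
    name for name in files
    if include_pattern.fullmatch(name) and not exclude_pattern.fullmatch(name)
]
print(selected)  # ['orders_2023.csv']
```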
Click Save. The File Preview will be displayed.
Field | Description |
---|---|
Number of Header Rows | The number of header rows in the file. If the file contains a header row, enter the value as 1. This field also allows you to skip lines after the header row: if n lines are to be skipped after the header in all the files, enter the value as n+1. |
Number of Footer Rows | The number of footer rows in the file. This field allows you to skip n lines from the bottom of each file. |
Column Separator | The delimiter in the delimited files. All the files in the mentioned source path must have the same value. This value must be a character. |
Column Enclosed By | The character used to enclose data in the delimited columns. Generally, CSV file columns are enclosed by double quotes. All the files in the mentioned source path must have the same value. This value must be a character. |
Escape Character | The character used in the delimited files to escape occurrences of Column Separator and Column Enclosed By in data. All the files in the mentioned source path must have the same value. This value must be a character. The default value is \ (backslash). |
Character Encoding | The character encoding of the delimited files. The default value is UTF-8. |
Use Custom Record Separator | Select this check box if the files use a record separator/line terminator other than \n, \r, or \r\n. When this option is enabled, you can provide the record separator/line terminator in the Record Separator field. |
Record Separator | Provide the custom record separator in this field. For example, \u001E. |
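The parsing options above correspond to familiar CSV reader settings. The following sketch shows roughly equivalent options in pandas; the file name and values are examples, not Infoworks defaults, and this is only an analogy for how the settings behave, not how Infoworks reads the files.

```python
import pandas as pd

# Example values only; the comments map each option to the field above.
df = pd.read_csv(
    "sample.csv",
    header=0,          # file has one header row (Number of Header Rows = 1)
    sep="|",           # Column Separator
    quotechar='"',     # Column Enclosed By
    escapechar="\\",   # Escape Character
    encoding="UTF-8",  # Character Encoding
    skipfooter=1,      # Number of Footer Rows = 1
    engine="python",   # skipfooter requires the Python parser engine
)
print(df.head())
```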
The schema will be crawled and sample records of the table will be displayed.
Edit Schema
You can edit the schema before ingesting the table. For details, see Editing Table Schema.
For configuring a Structured File source, see Configuring a Table.
Ingesting Structured File Data
For ingesting a structured file source, see Onboarding Data.
After metadata crawl is complete, you have the flexibility to add a column to the table. It can either be a source column or a target column.
Source Column refers to adding a source column to the table when the metadata crawl of the table did not infer any schema (since the schema is inferred from the smallest file).
Target Column refers to adding a target column if you need any special columns in the target table apart from what is present in that source.
In both cases, you can select the datatype for the specific column. You can select either of the following transformation modes: Simple and Advanced.
Simple Mode
In this mode, you must add a transformation function to be applied to that column. A target column with no transformation function applied will have null values in the target.
Advanced Mode
In this mode, you can provide the Spark expression in this field. For more information, refer to the Adding Transform Derivation section.
Error records are written to the <tablename>_error table.
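To illustrate Advanced Mode, the following sketch applies Spark SQL expressions to derive new columns with PySpark. The column names and expressions are made up; the strings passed to F.expr are the kind of expression you would enter in the Advanced Mode field.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source columns; the expressions below stand in for what you
# would type into the Advanced Mode field.
df = spark.createDataFrame(
    [("John", "Doe", 120.0), ("Jane", "Roe", 80.5)],
    ["first_name", "last_name", "amount"],
)

df = df.withColumn("full_name", F.expr("concat(first_name, ' ', last_name)"))
df = df.withColumn(
    "amount_bucket", F.expr("CASE WHEN amount > 100 THEN 'high' ELSE 'low' END")
)
df.show()
```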