Onboarding Data from Fixed-width Structured Files
Creating a Fixed-width Structured File Source
For creating a fixed-width structured file source, see $link[page,312614,auto$]. Ensure that the Source Type selected is Structured Files (Fixed-width).
Setting a Fixed-width Structured File Source
For setting a fixed-width structured file source, see $link[page,312616,auto$].
Fixed-width Structured File Configurations
Select one of the following storage systems, depending on where your files are stored:
- Databricks File System (DBFS)
- Remote Server (using SFTP)
- Cloud Storage
Databricks File System (DBFS)
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable cloud object storage. For more details, see Databricks Documentation.
For preparing data for ingestion from Databricks File System (DBFS), enter the following:
Source Base Path: The path to the base directory in DBFS where all the fixed-width structured files to be accessed are stored. All other file paths are relative to this path. For example, if the file is stored in iw/filestorage/ingestion/sample in DBFS, the base path refers to iw/filestorage/ingestion.
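The relationship between the base path and the paths relative to it can be sketched in Python as follows. This is only an illustration of the path arithmetic, not Infoworks code; the helper name relative_to_base is hypothetical.

```python
def relative_to_base(full_path: str, base_path: str) -> str:
    """Return the part of full_path that lies below base_path.

    Hypothetical helper for illustration; it only does string handling
    and does not touch DBFS.
    """
    base = base_path.strip("/")
    full = full_path.strip("/")
    # The file must be the base directory itself or live underneath it.
    if full != base and not full.startswith(base + "/"):
        raise ValueError(f"{full_path!r} is not under base path {base_path!r}")
    return full[len(base):].lstrip("/")


# Example from the text: the base path is iw/filestorage/ingestion,
# so the file location relative to it is "sample".
print(relative_to_base("iw/filestorage/ingestion/sample", "iw/filestorage/ingestion"))
```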
Remote Server (Using SFTP)
Infoworks allows you to access files stored on remote servers using SSH File Transfer Protocol (SFTP). This ensures that your data is transferred over a secure channel.
For preparing data for ingestion from Remote Server (using SFTP), enter the following:
$inline[badge,NOTES,primary]
When using Private Key as authentication mechanism:
- The client public key needs to be added under ~/.ssh/authorized_keys on the SFTP server. The corresponding private key on the job cluster will be used to connect.
- The private key should be in RSA format. If it is available in OpenSSH format, use the command "ssh-keygen -p -f <file> -m pem" to convert it into RSA format.
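To check whether a key needs conversion before running the command above, one option is to look at the first line of the key file. The sketch below assumes the usual header lines written by OpenSSH tools; the function name private_key_format is hypothetical and not part of Infoworks. Note that ssh-keygen -p rewrites the key file in place, so it may be prudent to work on a copy.

```python
def private_key_format(key_text: str) -> str:
    """Classify a private key by its header line (illustrative helper)."""
    stripped = key_text.strip()
    first_line = stripped.splitlines()[0].strip() if stripped else ""
    if first_line == "-----BEGIN RSA PRIVATE KEY-----":
        return "rsa-pem"   # already in the RSA (PEM) format
    if first_line == "-----BEGIN OPENSSH PRIVATE KEY-----":
        return "openssh"   # needs conversion with ssh-keygen -p -f <file> -m pem
    return "unknown"


# Truncated, fake key bodies; only the header line matters for this check.
print(private_key_format("-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n"))
print(private_key_format("-----BEGIN RSA PRIVATE KEY-----\nMIIEow...\n"))
```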
To resolve a file ingestion failure from SFTP, refer to $link[page,312921,auto$].
Cloud Storage
Infoworks allows you to access files from cloud storage, ingest the data and perform analytics on them. Cloud storage refers to data stored on remote servers accessed from the internet, or cloud.
For preparing data for ingestion from Cloud Storage, enter the following:
Cloud Type: The options include Azure Blob Storage (WASB), Amazon S3, GCS, and Azure DataLake Storage (ADLS) Gen2.
Windows Azure Storage Blob (WASB), also known as Blob Storage, is a file system for storing large amounts of unstructured object data, such as text or binary data. For more details, see Azure Blob Storage Documentation.
For Azure Blob Storage (WASB), enter the following:
Amazon Simple Storage Service (S3) is storage for the Internet. Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. For more details, see Amazon S3 Documentation.
For Amazon S3, enter the following:
Google Cloud Storage (GCS) is an online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. In GCS, buckets are the basic containers that hold the data. For more details, see GCS documentation.
For GCS, enter the following:
For Azure DataLake Storage (ADLS) Gen2, enter the following:
Mapping File
This page allows you to map individual files stored in the file system. You can add tables to represent each file, crawl them and map the file details based on the preview.
- Click the File Mappings tab and click Add Table.
- Enter the following file location details:
- Click Save. The File Preview will be displayed.
- Enter the following mapping details based on the file preview:
- Click Save and Crawl Schema.
The schema is crawled, and sample records of the table are displayed in the sample data.
Edit Schema
You can edit the schema before ingesting the table. For details, see $link[page,312629,auto$].
Advanced Configurations
max_chars_per_column
- Default: 512
- Description: The maximum number of characters that each column can have in a fixed-width file. This value is used as the width of the last column when variable-width is chosen.
fixed_width_padding
- Default: “ ” (white space)
- Description: If the data within a column does not use all the width assigned to it, the remaining width is padded with this padding character.
fixed_width_max_columns
- Default: 1024
- Description: The maximum number of columns that a fixed-width file can have.
multiline_fixed_width
- Default: false
- Description: If the records in the fixed-width file are multiline, then this configuration should be set to true
fixed_width_keep_padding
- Default: false
- Description: Set this value to true if the ingested data should have the padding characters
fixed_width_trim_values
- Default: true
- Description: Set this value to false if leading and trailing white spaces should not be trimmed from the ingested values.
fixed_width_skip_trailing_chars_until_new_line
- Default: false
- Description: If multiline_fixed_width is set to true and the record width exceeds the given sum of column widths, then the remaining characters will be considered as part of the next record
- $inline[badge,NOTE,primary] This parameter should be set at admin level.
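The following Python sketch shows one plausible reading of how the column-level settings above interact when a single record is split. It is not the Infoworks parser. The function parse_fixed_width, the use of None to mark a variable-width last column, and the choice to strip padding from both ends of a value are all assumptions made for illustration.

```python
def parse_fixed_width(line, widths, padding=" ", keep_padding=False,
                      trim_values=True, max_chars_per_column=512,
                      max_columns=1024):
    """Split one record into column values (illustrative only).

    widths: list of column widths; None as the LAST entry marks a
    variable-width column, which is read with max_chars_per_column.
    """
    if len(widths) > max_columns:
        raise ValueError("too many columns for fixed_width_max_columns")
    if None in widths[:-1]:
        raise ValueError("only the last column may be variable-width")
    values, pos = [], 0
    for width in widths:
        if width is None:
            width = max_chars_per_column
        raw = line[pos:pos + width]
        pos += width
        if not keep_padding:
            raw = raw.strip(padding)  # assumption: padding removed on both sides
        if trim_values:
            raw = raw.strip()         # remove leading and trailing white space
        values.append(raw)
    return values


record = "0001" + "Alice".ljust(10, "*") + "NYC"
print(parse_fixed_width(record, [4, 10, None], padding="*"))
print(parse_fixed_width(record, [4, 10, None], padding="*", keep_padding=True))
```

With the default white-space padding, trimming and padding removal have the same visible effect in this sketch, which is why the example uses "*" as the padding character.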
Configuring a Fixed-width Structured File Table
For configuring a fixed-width Structured File source, see $link[page,312617,auto$].
Ingesting Fixed-width Structured File Data
For ingesting a fixed-width structured file source, see $link[page,312618,auto$].
$inline[badge,Known Issues,primary]
- The total number of files cannot be more than 500.
- By default, the Sample Data section displays the datatype as String for every column.
Adding a column to the table
After metadata crawl is complete, you have the flexibility to add a column to the table. It can either be a source column or a target column.
Source Column refers to adding a source column to the table when the metadata crawl of the table did not infer the schema for it (the schema is inferred from the smallest file).
Target Column refers to adding a target column if you need any special columns in the target table apart from what is present in that source.
In both cases, you can select the datatype for the specific column. You can select one of the following transformation modes: Simple or Advanced.
Simple Mode
In this mode, you must add a transformation function to be applied to the column. A target column with no transformation function applied will have null values in the target.
Advanced Mode
In this mode, you can provide a Spark expression for the column. For more information, refer to the Adding Transform Derivation section.
$inline[badge,NOTE,primary] When the table is in the ready state (already ingested), schema editing is disabled.
Additional Information
- For details on tables created during the ingestion process, see $link[page,312620,auto$].
$inline[badge,NOTE,primary] During data crawl, records that cannot be crawled are stored in an error table, <tablename>_error.
- For details on audit columns, see $link[page,312620,Audit Columns,audit-columns].
- Any record that does not adhere to the schema is treated as an error record. If the number of error records crosses the threshold value, the job fails.
- Compressed file support: Infoworks supports two types of compressed files: .gz (Gzip) and .bz2 (Bzip2).
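The error-record behaviour described in this section can be sketched as follows. This is a simplified model, not Infoworks code. The function crawl_records and the exception class are hypothetical, "adhering to the schema" is reduced to a record-width check, and "crosses the threshold" is read as strictly greater than the threshold.

```python
class ErrorThresholdExceeded(Exception):
    """Raised when the number of error records crosses the threshold."""


def crawl_records(table_name, records, widths, error_threshold):
    """Route records to the table or to its error table (illustrative)."""
    expected_width = sum(widths)
    valid, errors = [], []
    for record in records:
        # Assumption: a record adheres to the schema if its width matches.
        if len(record) == expected_width:
            valid.append(record)
        else:
            errors.append(record)
    if len(errors) > error_threshold:
        raise ErrorThresholdExceeded(
            f"{len(errors)} error records exceed threshold {error_threshold}"
        )
    # Error records go to <tablename>_error, as described in the note above.
    return {table_name: valid, f"{table_name}_error": errors}


tables = crawl_records("customers", ["0001Alice", "0002Bob  ", "bad"],
                       widths=[4, 5], error_threshold=1)
print(sorted(tables))
```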
© UNIPHORE TECHNOLOGIES 2025 | Confidential