Onboarding Data from Parquet
The Infoworks Parquet Ingestion feature allows users to migrate data from Hadoop/Hive to Databricks.
Infoworks supports the following types of Parquet files:
- Parquet files that have been ingested by Infoworks or created by Infoworks transformations in the source data lake.
- Parquet files for which Spark can obtain partition metadata, that is, files stored in the folder structure $table_path/<partition_column>=<partition_value>/.
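To illustrate the partitioned folder structure described above, the following Python sketch extracts partition columns and values from a file path. The function name `parse_partition_path` and the sample paths are hypothetical and are not part of Infoworks or Spark; Spark performs its own partition discovery, and this sketch only mirrors the naming convention.

```python
from typing import Dict


def parse_partition_path(table_path: str, file_path: str) -> Dict[str, str]:
    """Return {partition_column: partition_value} for a file under table_path.

    Hypothetical helper for illustration only; assumes POSIX-style paths.
    """
    base = table_path.rstrip("/") + "/"
    if not file_path.startswith(base):
        raise ValueError(f"{file_path!r} is not under {table_path!r}")
    relative = file_path[len(base):]
    partitions: Dict[str, str] = {}
    # Every directory between the table path and the file name
    # must follow the <partition_column>=<partition_value> convention.
    for component in relative.split("/")[:-1]:
        column, sep, value = component.partition("=")
        if not sep or not column or not value:
            raise ValueError(f"non-standard partition folder: {component!r}")
        partitions[column] = value
    return partitions


# Sample path is an assumption chosen for this example.
print(parse_partition_path(
    "/data/sales",
    "/data/sales/year=2024/month=01/part-00000.parquet",
))
```

A folder that does not follow the convention raises a `ValueError` in this sketch, which makes non-standard layouts easy to spot before onboarding.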
Setting a Parquet Source
To set up a Parquet source, see $link[page,301740,auto$]. Ensure that the Source Type selected is Parquet.
Configuring a Parquet Source
To configure a Parquet source, see $link[page,301742,auto$].
Parquet Configurations
- Select either of the following storage locations depending on where the files are stored: Databricks File System (DBFS), Cloud Storage.
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable cloud object storage. For more details, see Databricks Documentation.
Infoworks allows you to access files from cloud storage, ingest the data, and perform analytics on them. Cloud storage refers to data stored on remote servers accessed from the internet, or cloud.
To prepare data for ingestion from Databricks File System (DBFS), enter the following:
- Source Base Path: The path to the base directory in DBFS where all the Parquet files to be accessed are stored. All other file paths are relative to this path. For example, if the file is stored in iw/filestorage/ingestion/sample in DBFS, the base path is iw/filestorage/ingestion.
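The relationship between the base path and the files under it can be sketched in Python as follows. The helper name `relative_to_base` is hypothetical and is not an Infoworks API; the paths are the ones used in the example above.

```python
import posixpath


def relative_to_base(base_path: str, file_path: str) -> str:
    """Return file_path expressed relative to base_path (POSIX-style paths)."""
    base = posixpath.normpath(base_path)
    target = posixpath.normpath(file_path)
    # Reject paths that are not located under the base directory.
    if posixpath.commonpath([base, target]) != base:
        raise ValueError(f"{file_path!r} is not under base path {base_path!r}")
    return posixpath.relpath(target, start=base)


print(relative_to_base("iw/filestorage/ingestion",
                       "iw/filestorage/ingestion/sample"))
```

With the base path set to iw/filestorage/ingestion, the file in the example is addressed by the relative path sample.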
To prepare data for ingestion from Cloud Storage, enter the following:
- Cloud Type: The options include Azure Blob Storage (WASB) and Amazon S3.
Windows Azure Storage Blob (WASB), also known as Blob Storage, is a file system for storing large amounts of unstructured object data, such as text or binary data. For more details, see Azure Blob Storage Documentation.
For Azure Blob Storage (WASB), enter the following:
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services that provides object storage through a web service interface. For more details, see Amazon S3 Documentation.
For Amazon S3, enter the following:
For Azure Data Lake Storage (ADLS) Gen2, enter the following:
File Mapping
This page allows you to map individual files stored in the file system. You can add tables to represent each file, crawl them, and map the file details based on the preview.
1. Click the File Mappings tab and click Add Table.
2. Enter the following file location details:
3. Click Save.
4. Select the table and click Crawl Schema.
The metadata will be crawled and the job status will be displayed.
Ingesting Parquet Data
To ingest a Parquet source, see $link[page,301744,auto$].
$inline[badge,Limitations,warning]
- Infoworks does not support Parquet file ingestion when a column name contains space characters.
- Infoworks does not support handling of non-standard Spark partition data, unless the Parquet files are created by Infoworks.
- Infoworks does not support reading data for which the folder names in the partitioned table data are not in the part_col=val format.
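Because column names with spaces are not supported, it can help to check column names before onboarding a file. The following is a minimal Python sketch of such a pre-check; the function name `columns_with_spaces` and the sample column list are hypothetical and are not part of Infoworks.

```python
from typing import Iterable, List


def columns_with_spaces(column_names: Iterable[str]) -> List[str]:
    """Return the column names that contain at least one space character."""
    return [name for name in column_names if " " in name]


# Hypothetical column list, for example as read from a Parquet file's schema.
columns = ["order_id", "customer name", "total"]
print(columns_with_spaces(columns))
```

Any name returned by this check would need to be renamed in the source file before ingestion.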
Editing Schema
To edit the schema:
Step 1: Click the Details button for the required table.
Step 2: In the Edit Schema section, click the required column and perform the required changes. You can also add columns, upload schema, and perform bulk edit.
Additional Information
- If records do not adhere to the schema and there is a data type mismatch, the job fails.
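As a rough illustration of a data type mismatch, the Python sketch below compares sample records against an expected schema. It does not reproduce how Infoworks validates data; the schema, the records, and the function name `find_mismatches` are all assumptions made for this example.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema: column name -> expected Python type.
SCHEMA: Dict[str, type] = {"id": int, "name": str, "amount": float}


def find_mismatches(
    records: List[Dict[str, Any]], schema: Dict[str, type]
) -> List[Tuple[int, str, str]]:
    """Return (record_index, column, actual_type_name) for each mismatch."""
    mismatches = []
    for index, record in enumerate(records):
        for column, expected in schema.items():
            value = record.get(column)
            # Missing or None values are treated as nulls and accepted here.
            if value is not None and not isinstance(value, expected):
                mismatches.append((index, column, type(value).__name__))
    return mismatches


# Hypothetical records; the second one stores "id" as a string.
records = [
    {"id": 1, "name": "a", "amount": 1.5},
    {"id": "2", "name": "b", "amount": 2.0},
]
print(find_mismatches(records, SCHEMA))
```

Running a check of this kind on a sample of the data before ingestion can surface the records that would cause the job to fail.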
© UNIPHORE TECHNOLOGIES 2025 | Confidential