Infoworks Release Notes

v5.4.1.13

Date of Release: February 2024

Enhancements

JIRA ID | Issue
IPD-25542* | Support Enhanced Flexibility Mode (EFM) for Dataproc clusters.
IPD-21466 | Support for read isolation on SQL Server. (NOTE: Set the key query_hint to the value READUNCOMMITTED in the source/table Advanced Configuration to enable read isolation on a SQL Server source.)
IPD-24768 | Disable/hide the graph on the "Usage and License Details" page.
IPD-24010 | The "Custom Tags" field is now mandatory in the UI and REST API.
IPD-27319 | Capture additional logs.

Resolved Issues

NOTE The "*" symbol next to an ID refers to issues that have been resolved in the current release.

JIRA ID | Issue
IPD-25580* | Jobs failing randomly during cluster creation with a "File not found" error after switching to Dataproc 2.0. (NOTE: The advanced configuration dt_classpath_include_unique_jars=false should be removed once upgraded to 5.4.1.13.)
IPD-22878 | Recrawl metadata throws "Provided Table Ids are not present in the Source".
IPD-24824 | Enum constant error on the Microsoft Access source.
IPD-24751 | The import workflow configuration API does not configure the sync-to-target node properly.
IPD-24697 | The workflow completed, but its status was not updated.
IPD-24671 | Delay in viewing the pipeline list in the Workflow editor pipeline task.
IPD-24636 | The Custom Tags field is not present on the Onboard Data page.
IPD-24511 | dt_bigquery_session_project_id error on Dataproc.
IPD-24574 | The pipeline jobs list does not load and reports an error.
IPD-24556 | The ingestion job completed its execution, but the workflow task was marked as failed.
IPD-24503 | Job execution time vs. cluster creation time.
IPD-24493 | Workflow polling failed with a "502 Gateway Time-out" error.
IPD-24472 | The pipeline job completed, but the workflow status was not updated.
IPD-24431 | The job_object.json file of a segmentation ingestion job does not capture table details.
IPD-24430 | Unable to delete or deactivate an advanced configuration key if the key has a trailing space.
IPD-24407 | Export to BigQuery completed successfully, but the pipeline job was marked as failed.
IPD-24308 | Jobs not getting picked up by Hangman.
IPD-24258 | API calls failed with a "504 Gateway Time-out" error.
IPD-23719 | Pipeline versions are getting emptied in production.
IPD-24007 | View Run does not load the workflow run page beyond the most recent 20 workflow runs.
IPD-23613 | The Last Modified Date in workflows is shown incorrectly.
IPD-23636 | The pipeline jobs list is not visible.
IPD-23663 | The file archival process does not archive files containing only header records after upgrading to 5.4.
IPD-23647 | The staging table created by Infoworks during a pipeline build persists in BigQuery if the load job fails.
IPD-23754 | Records were missing in the BigQuery target table even though the pipeline completed successfully.
IPD-23822 | The schema name is null/empty for incremental pipelines in a BigQuery environment when using the use_ingestion_time_as_watermark key.
IPD-23831 | Around 50% of the pipeline documents in production have the same pipeline ID and active version ID.
IPD-23899 | Refresh tokens are not generated for newly created users in 5.4.1.6.
IPD-23515 | Ingestion jobs in a BigQuery environment run queries against the GCP project in the service account JSON instead of parent_project.
IPD-23526 | Issues when using an externally created partitioned BigQuery table in pipelines.
IPD-23452 | REST API: bug in the 5.4.1 config migration where a few keys are in camelCase instead of snake_case.
IPD-23418 | BigQuery pipelines no longer fail with a "java.lang.IllegalArgumentException: Provided query is null or empty" error after the upgrade.
IPD-23327 | Fixed the issue where inactive Spark advanced configurations took effect in 5.4.1.4.
IPD-23263 | The Infoworks scheduler does not submit jobs during periods of high load.
IPD-23498 | Project ID field issue on a Teradata source configured in a BigQuery data environment.
IPD-23216 | Unable to unlock locked entities.
IPD-22964 | The connection.schema_registry_authentication_details.password field is now part of the iw_migration script.
IPD-22983 | For a source table fetched from a Confluent source with incremental mode set to Append, the pipeline source query no longer brings in the entire dataset every time, irrespective of the provided query.
IPD-22963 | For the 5.3.x and 5.4.x versions, when Incremental Load is enabled and the Sync Type is set to Append, the second build of the pipeline no longer copies duplicate records.
IPD-22038 | Fixed the unlock functionality for Admin users.
IPD-22113 | Pipelines can now be created via the API by using the Environment Name, Environment Storage Name, and Environment Compute Template Name.
IPD-22817 | The batch_engine key now validates the user input during pipeline creation via the API.
IPD-22721 | For the Confluent Kafka source, the streaming_group_id_prefix configuration now works as expected.
IPD-22615 | The partition and clustering details now appear in BigQuery tables created via Infoworks pipelines.
IPD-22036 | The pipeline build now succeeds even when the target table for the BigQuery external target already exists and is clustered.
IPD-22090 | For the Delimited File target, the timestamp format can be configured per user requirements and applies to all timestamp columns of that table. For example: timestampFormat=yyyy-MM-dd HH:mm:ss.SSS or timestampFormat=yyyy-MM-dd HH:mm:ss.
IPD-22351 | Added advanced configurations for setting the BigQuery session project ID for Data Transformation and Ingestion jobs (dt_bigquery_session_project_id / ingestion_bigquery_session_project_id).
IPD-21449 | The Import SQL API now picks the correct table, even if a table with the same schema/table name is present in multiple data environments.
IPD-21534 | The Initialize & Ingest and Truncate jobs can now reset the value of the last_merged_watermark key.
IPD-21584 | The Import SQL command can now fetch queries that contain a backtick (`).
IPD-21700 | Fixed the pipeline deletion issue.
IPD-21792 | Duplicate tables can no longer be onboarded on a Hive metadata sync source.

Upgrade

The following steps assume that the IW_HOME variable is set to /opt/infoworks.

Prerequisite

To support rollback after metadata migration, you must take a backup of the metadata. Follow these steps:

Step 1: Install/download the MongoDB tool mongodump, if it is not already available.

Step 2: Create a directory to store the database backup dump using the below command.

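A minimal sketch, assuming /opt/infoworks/backup/mongodb as the backup location (an arbitrary example path; any directory with sufficient free space works):

mkdir -p /opt/infoworks/backup/mongodb   # example backup directory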

Step 3: Use the below command to take a dump (backup) of the databases from the MongoDB server.

If MongoDB is hosted on Atlas

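A hedged example, assuming a standard mongodump installation; the connection string, credentials, and output directory are placeholders to replace with your Atlas values:

mongodump --uri="mongodb+srv://<username>:<password>@<atlas-cluster-host>" --out=/opt/infoworks/backup/mongodb   # replace placeholders with your Atlas connection details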

If MongoDB is installed with Infoworks on the same VM

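A hedged example for a MongoDB instance running locally on the Infoworks VM; the host, port, and credentials shown are assumptions to adjust for your deployment:

mongodump --host 127.0.0.1 --port 27017 --username <username> --password <password> --authenticationDatabase admin --out /opt/infoworks/backup/mongodb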

Procedure

For upgrading from 5.4.1/5.4.1.x to 5.4.1.13, execute the following commands:

Step 1: Use the deployer to upgrade from 5.4.1 to 5.4.1.13.

Step 2: Go to the $IW_HOME/scripts folder on the machine.

Step 3: To ensure that there is no pre-existing update script, execute the following command:

[[ -f update_5.4.1.13.sh ]] && rm update_5.4.1.13.sh

Step 4: Download the update_5.4.1.13.sh script:

wget https://iw-saas-setup.s3.us-west-2.amazonaws.com/5.4/update_5.4.1.13.sh

Step 5: Give the update_5.4.1.13.sh script executable permission:

chmod +x update_5.4.1.13.sh

Step 6 (Optional): If the patch requires the Mongo metadata to be migrated, run export METADB_MIGRATION=Y to ensure that the metadata is migrated; otherwise, run export METADB_MIGRATION=N.

Alternatively, you can enter it in the prompt while running the script.

Step 7: Update the package to the hotfix

source $IW_HOME/bin/env.sh

./update_5.4.1.13.sh -v 5.4.1.13-ubuntu2004

You will receive a "Please select whether metadb migration needs to be done([Y]/N)" message. If you need to perform metadb migration, enter Y; otherwise, enter N.

Post Upgrade Steps

Steps to follow after upgrading Infoworks to 5.4.1.13:

NOTE Make sure to take a backup of the dataproc_defaults.json file before making any changes.

The dataproc_defaults.json file needs to be updated after upgrading to 5.4.1.13. The file is located in the /opt/infoworks/conf directory, where /opt/infoworks is IW_HOME.

The following changes are required in the dataproc_defaults.json file. To edit the file, change directory with cd /opt/infoworks/conf and run vi dataproc_defaults.json.

Step 1: Add the property config.masterConfig.diskConfig.numLocalSsds : 0

Step 2: Add the property config.workerConfig.diskConfig.numLocalSsds : 0

Step 3: Add the object config.secondaryWorkerConfig (see the sketch after Step 4).

Before the update, no key named secondaryWorkerConfig is present inside the config property; after the update, the secondaryWorkerConfig object appears inside config.

Step 4: Add the array num_local_ssds : [0,1,2,3,4,5,6,7,8,16,24]

Before the update, there is no num_local_ssds field; after the update, a new JSON array for num_local_ssds is present.
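For reference, the fragment below is a minimal sketch of how the keys touched by Steps 1 through 4 might look in dataproc_defaults.json. It shows only those keys; the contents of secondaryWorkerConfig (numInstances and its diskConfig) and the top-level placement of num_local_ssds are assumptions, so keep the structure and defaults that already exist in your file and add only the keys called out in the steps above.

{
  "config": {
    "masterConfig": { "diskConfig": { "numLocalSsds": 0 } },
    "workerConfig": { "diskConfig": { "numLocalSsds": 0 } },
    "secondaryWorkerConfig": { "numInstances": 0, "diskConfig": { "numLocalSsds": 0 } }
  },
  "num_local_ssds": [0, 1, 2, 3, 4, 5, 6, 7, 8, 16, 24]
}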

The UI and Platform services must be restarted after applying this configuration change.

Steps to Enable EFM on Dataproc

To add secondary workers to a Dataproc cluster, select the Enable Autoscale checkbox and then select the Enable Secondary Worker checkbox.

The secondary worker type can be one of spot VMs, standard preemptible VMs, or non-preemptible VMs.

As per the Dataproc documentation, the following properties need to be added to enable EFM:

--properties=dataproc:efm.spark.shuffle=primary-worker \
--properties=dataproc:efm.mapreduce.shuffle=hcfs

To add these properties, go to the Advanced Configurations in the Compute section and add the following key:

Key: iw_environment_cluster_dataproc_config

Value: efm.spark.shuffle=primary-worker;efm.mapreduce.shuffle=hcfs

Additionally, the YARN graceful decommission timeout must be set to zero when EFM is enabled. To set it, add the following advanced configuration:

Key: gracefulDecommissionTimeout

Value: 0 (zero)

Additional Notes

  • The number of allowed local SSDs might differ based on the selected machine type. Refer to https://cloud.google.com/compute/docs/disks/local-ssd for the allowed values.
  • Clusters with local SSDs cannot be stopped.
  • Clusters with secondary workers cannot be stopped. To stop such a cluster, the secondary workers must first be scaled down to zero.
  • Existing clusters cannot be updated from single-node to multi-node or vice versa.
  • When autoscale is enabled and the EFM advanced configurations are set, secondary workers must be enabled; otherwise, cluster creation will fail, because primary workers cannot be autoscaled when Spark primary-worker shuffle is enabled.

Rollback

Prerequisite

To roll back the migrated metadata:

Step 1: Install/download the MongoDB tool mongorestore, if it is not already available.

Step 2: Switch to the directory where the backup is saved on the local system.

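A minimal sketch, assuming the backup was taken to /opt/infoworks/backup/mongodb as in the prerequisite above (an example path):

cd /opt/infoworks/backup/mongodb   # directory containing the mongodump output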

Step 3: Use the below command to restore the dump (backup) of the databases to the MongoDB server.

If MongoDB is hosted on Atlas

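A hedged example, assuming a standard mongorestore installation; the connection string and credentials are placeholders to replace with your Atlas values, and the path points at the mongodump output from the backup step:

mongorestore --uri="mongodb+srv://<username>:<password>@<atlas-cluster-host>" /opt/infoworks/backup/mongodb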

If MongoDB is installed with Infoworks on the same VM

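A hedged example for a MongoDB instance running locally on the Infoworks VM; adjust the host, port, and credentials for your deployment:

mongorestore --host 127.0.0.1 --port 27017 --username <username> --password <password> --authenticationDatabase admin /opt/infoworks/backup/mongodb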

Procedure

To go back to the previous checkpoint version:

Step 1: In a web browser, go to your Infoworks system, scroll down to the bottom, and click the Infoworks icon.

Step 2: The Infoworks Manifest Information page opens in a new tab. Scroll down and check the Last Checkpoint Version.

Step 3: SSH to the Infoworks VM and switch to {{IW_USER}}.

Step 4: Initialize the variables in the bash shell.

full_version=5.4.1.13

major_version=$(echo $full_version | cut -d "." -f 1-2)

previous_version=<Previous Version> # Last Checkpoint Version from Step 2

os_suffix=<OS Suffix> # One of [ ubuntu2004 amazonlinux2 rhel8 ]

Step 5: Download the required deployer for the currently applied patch:

https://iw-saas-setup.s3-us-west-2.amazonaws.com/${major_version}/deploy_${full_version}.tar.gz
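For example, on a machine with outbound internet access, the deployer can be fetched with wget; the URL below is the above pattern expanded for full_version=5.4.1.13 and major_version=5.4:

wget https://iw-saas-setup.s3-us-west-2.amazonaws.com/5.4/deploy_5.4.1.13.tar.gz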

Step 6: Copy (scp) the downloaded file to the following path on the Infoworks VM:

${IW_HOME}/scripts/

NOTE Remove any previously downloaded copy of the deploy_${full_version}.tar.gz file from the ${IW_HOME}/scripts/ directory first.
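A minimal sketch of the copy, assuming the tarball was downloaded to the current directory on another machine; <user> and <infoworks-host> are placeholders, and /opt/infoworks is IW_HOME as stated above:

scp deploy_5.4.1.13.tar.gz <user>@<infoworks-host>:/opt/infoworks/scripts/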

Step 7: Extract the deployer tar file (removing any existing iw-installer directory first):

cd ${IW_HOME}/scripts

[[ -d iw-installer ]] && rm -rf iw-installer

tar xzf deploy_${full_version}.tar.gz

cd iw-installer

Step 8: Initialize the environment variables.

source ${IW_HOME}/bin/env.sh

export IW_PLATFORM=saas

Step 9: Run the Rollback command.

./rollback.sh -v ${previous_version}-${os_suffix}