Infoworks Replicator 4.0
Getting Started

Additional Configurations

Configurations can be added to the mr.conf file for batch and incremental replication. The mr.conf file follows the key=value format, and the file_transfer.xml file follows the Hadoop configuration file format.
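For illustration, a minimal mr.conf might look like the following sketch. The keys are taken from the list below; all values (including the ZooKeeper hostnames and paths) are placeholders, not defaults.

```properties
# Illustrative mr.conf fragment; values are examples only
use.temp.path=true
zookeeper.connection.string=zk1.example.com:2181,zk2.example.com:2181
infoworks.replication.encryption.zones=["/user/hive/warehouse/tpcds_bin_partitioned_parquet_3.db","/user/ec2-user/encr"]
infoworks.replication.encryption.zones.rb.cksum=true
```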

Following are the configurations:

  • use.temp.path: This configuration indicates the temporary path specified while creating the destination cluster entity in Infoworks ADE. The data is first written to the temp directory, and the file is then renamed to the actual path. This is not applicable to encryption zones and hence must be set to false for them. The default value is true.
  • zookeeper.connection.string: If this property is set, dynamic throttling obtains the latest configuration properties from zookeeper servers specified in this connection string. This value is not set by default. For more details, see the Throttling section.
  • infoworks.replication.encryption.zones: Set this to a JSON array of encryption zones. If this value is not set and checksum checking is ON, transfer to encryption zone fails. Sample value: ["/user/hive/warehouse/tpcds_bin_partitioned_parquet_3.db","/user/ec2-user/encr"]
  • infoworks.replication.encryption.zones.rb.cksum: This configuration is only applicable to encryption zones.

When this configuration is OFF (by default), the checksum validation is performed as follows:

  • Before the file transfer, the in-memory checksum of the file is calculated.
  • The transfer is started, and the in-memory checksum is calculated again while the data is being transferred.
  • After the transfer, the two checksums are compared. If the values do not match, the transfer is marked as failed and the transferred file is deleted.

When this configuration is turned ON, the checksum validation is performed as follows:

  • Before the file transfer, the in-memory checksum of the file is calculated.
  • After the transfer, the file is read back from the destination to calculate the second checksum.
  • The two checksums are compared. If the values do not match, the transfer is marked as failed and the transferred file is deleted.
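The two validation modes above can be sketched as follows. This is an illustrative Python sketch, not Replicator's implementation: the function names, the use of MD5, and the in-memory "destination" are all assumptions made for the example.

```python
import hashlib

def transfer_with_inflight_checksum(src_bytes, write_chunk):
    """In-flight mode (rb.cksum OFF): hash the source before the transfer,
    then hash each chunk again as it is written to the destination."""
    pre = hashlib.md5(src_bytes).hexdigest()   # checksum before transfer
    inflight = hashlib.md5()
    for i in range(0, len(src_bytes), 4096):
        chunk = src_bytes[i:i + 4096]
        write_chunk(chunk)                     # simulated transfer of one chunk
        inflight.update(chunk)                 # checksum while transferring
    # On mismatch the real system would delete the file and fail the transfer
    return pre == inflight.hexdigest()

def transfer_with_readback_checksum(src_bytes, write_chunk, read_back):
    """Read-back mode (rb.cksum ON): after the transfer, read the destination
    file back and compare its checksum with the pre-transfer checksum."""
    pre = hashlib.md5(src_bytes).hexdigest()
    write_chunk(src_bytes)                     # simulated transfer
    return pre == hashlib.md5(read_back()).hexdigest()
```

Read-back is slower (the destination file is read a second time) but is the only mode that works for encryption zones, where the in-flight bytes differ from what a later read returns.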

TRUNCATE_OVERWRITE: The value for this key is set to TRUE by default. When set to TRUE, the job performs the following functions:

  • Compute Diff – The job computes the differences between the source and the destination. The data for the tables in the underlying job is deleted at the destination.
  • Copy Data – The job copies/replicates the entire data.
  • Update Metadata – The relevant Hive metadata is updated for the selected tables in the job.

Limitation

Infoworks Replicator does not support replication of Hive Managed Tables from ADLS (Azure Data Lake Storage) as source to ADLS as destination. This is because, by default, the location of a Managed Table on the source ADLS is adl://home/, and the destination misinterprets the source's home directory as its own home directory. This issue does not occur with External Tables created with a fully qualified path.

Cluster Configurations

| Key                       | Value Description        | Default                  |
|---------------------------|--------------------------|--------------------------|
| SECURE_CLUSTER_PROPERTIES | Kerberos Configurations  | *                        |
| DATABASE_FILTER           | Regex for schema filter. | *                        |
| MAX_BANDWIDTH_MB          | Integer                  | 2,147,483,647 (INT_MAX)  |
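As an illustration of how a schema-filter regex such as DATABASE_FILTER behaves, the sketch below applies a pattern to a list of schema names; the pattern, schema names, and filtering logic are assumptions for the example, and the default `.*` would match every schema.

```python
import re

# Hypothetical filter: replicate only schemas whose names start with "tpcds_"
pattern = re.compile(r"^tpcds_.*")

schemas = ["tpcds_bin_partitioned_parquet_3", "default", "tpcds_text_10"]
selected = [s for s in schemas if pattern.match(s)]
# selected -> ["tpcds_bin_partitioned_parquet_3", "tpcds_text_10"]
```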

Workflow Configurations

| Key                    | Value Description                                             | Default                                |
|------------------------|---------------------------------------------------------------|----------------------------------------|
| MR_CONFIG_FILE         | Path to mr.conf file. Example: $IW_HOME/conf/replicator/mr.conf |                                      |
| PROXY_BYPASS_PROPERTIES |                                                              | $IW_HOME/conf/replicator/bypass.conf   |
| BATCH_JOB_OUTPUT_DIR   |                                                               | source_root_hdfs/stageoutput           |
| VELOCITY_LOG_LOCATION  |                                                               | $IW_HOME/logs/replicator/velocity.log  |
| JOB_BANDWIDTH_MB       | Integer                                                       |                                        |
| BANDWIDTH              | Integer                                                       | 100 (in MBs)                           |