Setting Spark Configurations
- dt_spark_configfile (interactive mode) must point to the full path of the spark-defaults.conf file on the edge node for interactive pipelines. The default value is /etc/spark2/conf/spark-defaults.conf.
- spark-defaults.conf must include all the properties required to connect to YARN.
- dt_spark_configfile_batch (batch mode) must point to the full path of the spark-defaults.conf file on the edge node for batch pipelines. The default value is /etc/spark2/conf/spark-defaults.conf.
- A hive-site.xml with contents similar to the following must be present in the Spark conf directory:

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://ip-172-30-1-66.ec2.internal:9083</value>
  </property>
</configuration>
```

- export SPARK_DIST_CLASSPATH=/usr/hdp/2.5.5.0-157/spark2/jars/ must be set in the env.sh of Infoworks (the path is the location of the Spark jars).
- The recommended settings in spark-defaults.conf might differ between interactive and batch mode.
- spark.sql.warehouse.dir must be set to /apps/hive/warehouse (or any other location equivalent to the hive warehouse location).
- spark.sql.hive.convertMetastoreParquet false // for parquet ingested tables to be read
- spark.mapreduce.input.fileinputformat.input.dir.recursive true // for parquet ingested tables to be read
- spark.hive.mapred.supports.subdirectories true // for parquet ingested tables to be read
- spark.mapred.input.dir.recursive true // for parquet ingested tables to be read
- spark.sql.shuffle.partitions // to control number of tasks for reduce phase
- spark.dynamicAllocation.enabled true // if dynamic allocation is needed
- spark.shuffle.service.enabled true // if dynamic allocation is needed
- spark.executor.memory // according to workload
- spark.executor.cores // according to workload
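Taken together, the recommended properties above might look like the following spark-defaults.conf fragment. This is a sketch, not a drop-in file: the shuffle-partition count, executor memory, and executor cores shown are illustrative placeholders that must be tuned to your workload.

```properties
# Read Parquet tables ingested by Infoworks
spark.sql.hive.convertMetastoreParquet                     false
spark.mapreduce.input.fileinputformat.input.dir.recursive  true
spark.hive.mapred.supports.subdirectories                  true
spark.mapred.input.dir.recursive                           true

# Hive warehouse location (or your equivalent)
spark.sql.warehouse.dir                                    /apps/hive/warehouse

# Reduce-phase parallelism (illustrative value)
spark.sql.shuffle.partitions                               200

# Only if dynamic allocation is needed
spark.dynamicAllocation.enabled                            true
spark.shuffle.service.enabled                              true

# Sizing -- adjust to workload (illustrative values)
spark.executor.memory                                      4g
spark.executor.cores                                       2
```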
For encoded partitions, the spark.sql.hive.convertMetastoreParquet value must be set to true.
Some pipelines might need additional configurations beyond the above, such as:
- dt_batch_sparkapp_settings: any settings that must be changed at the application master level (such as spark.dynamicAllocation.enabled and spark.executor.memory).
- dt_batch_spark_settings: any settings that must be changed at the SparkSession level (such as spark.sql.crossJoin.enabled).
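As an illustration only (the exact input format for these advanced configurations depends on your Infoworks release; verify it before use), the values supplied might look like:

```properties
# Hypothetical example values -- confirm the expected separator and syntax
# for your Infoworks version before applying.
dt_batch_sparkapp_settings=spark.dynamicAllocation.enabled=true;spark.executor.memory=8g
dt_batch_spark_settings=spark.sql.crossJoin.enabled=true
```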
Supporting Non-Infoworks Hive Tables
To use non-Infoworks Hive tables in pipelines, ensure that Hive ingestion is performed on the tables.
To design a pipeline with non-Infoworks Hive tables that incrementally updates an existing Data Transformation target, you can use the following control variables in a filter:
- $.highWatermark, which is set to the time the last incremental load finished.
- $.lowWatermark, which is initially set to the epoch start (Jan 1, 1970) for the first build. Later builds set it to the highWatermark of the previous build.

For example, to load the ORDERS table incrementally, add a filter that uses these variables. The watermark variables operate like the Load source data incrementally option for Infoworks-managed tables.
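As a sketch of such a filter, assuming a hypothetical timestamp column O_ORDERDATE on the ORDERS table (the column name is illustrative, not part of the product):

```sql
-- Select only rows that arrived since the previous incremental build
O_ORDERDATE > '$.lowWatermark' AND O_ORDERDATE <= '$.highWatermark'
```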
© UNIPHORE TECHNOLOGIES 2025 | Confidential