- dt_spark_configfile must point to the full path of the spark-defaults.conf file on the edge node used for interactive pipelines. The default value is /etc/spark2/conf/spark-defaults.conf.
- The spark-defaults.conf file must include all the properties required to connect to YARN.
- dt_spark_configfile_batch must point to the full path of the spark-defaults.conf file on the edge node used for batch pipelines. The default value is /etc/spark2/conf/spark-defaults.conf.
- A hive-site.xml file with content similar to the following must be present in the Spark conf directory:
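  A minimal sketch is shown below, assuming a typical Thrift metastore setup; the host, port, and any additional properties are placeholders and depend on your cluster:

  ```xml
  <?xml version="1.0" encoding="UTF-8"?>
  <configuration>
    <!-- Hypothetical value: replace with the metastore URI of your cluster -->
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastore-host.example.com:9083</value>
    </property>
    <!-- Optional: set only if your warehouse location differs from the default -->
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/apps/hive/warehouse</value>
    </property>
  </configuration>
  ```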
- export SPARK_DIST_CLASSPATH=/usr/hdp/2.5.5.0-157/spark2/jars/ must be set in the env.sh file of Infoworks (the path is the location of the Spark jars).
- The recommended settings in spark-defaults.conf might differ between interactive and batch modes. The key properties are listed below (a sample spark-defaults.conf follows this list).
- spark.sql.warehouse.dir must be set to /apps/hive/warehouse (or whichever location corresponds to the Hive warehouse directory on your cluster).
- spark.sql.hive.convertMetastoreParquet false // for parquet ingested tables to be read
- spark.mapreduce.input.fileinputformat.input.dir.recursive true // for parquet ingested tables to be read
- spark.hive.mapred.supports.subdirectories true // for parquet ingested tables to be read
- spark.mapred.input.dir.recursive true // for parquet ingested tables to be read
- spark.sql.shuffle.partitions // to control number of tasks for reduce phase
- spark.dynamicAllocation.enabled true // if dynamic allocation is needed
- spark.shuffle.service.enabled true // if dynamic allocation is needed
- spark.executor.memory // according to workload
- spark.executor.cores // according to workload
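A sample spark-defaults.conf that puts the above recommendations together might look like the following; all values are illustrative and must be tuned for your cluster and workload:

```
# Hive warehouse location (match the Hive warehouse directory of your cluster)
spark.sql.warehouse.dir                                    /apps/hive/warehouse

# Required to read Parquet ingested tables (see the note below about encoded partitions)
spark.sql.hive.convertMetastoreParquet                     false
spark.mapreduce.input.fileinputformat.input.dir.recursive  true
spark.hive.mapred.supports.subdirectories                  true
spark.mapred.input.dir.recursive                           true

# Number of tasks for the reduce phase (illustrative value)
spark.sql.shuffle.partitions                               200

# Only if dynamic allocation is needed
spark.dynamicAllocation.enabled                            true
spark.shuffle.service.enabled                              true

# Size according to workload (illustrative values)
spark.executor.memory                                      4g
spark.executor.cores                                       2
```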
NOTE
For encoded partitions, the spark.sql.hive.convertMetastoreParquet value must be set to true.
Some pipelines might need additional configurations apart from the above, such as:
- dt_batch_sparkapp_settings: Any settings that need to be changed at the application master level (such as spark.dynamicAllocation.enabled, spark.executor.memory, and so on).
- dt_batch_spark_settings: Any settings that need to be changed at the SparkSession level (such as spark.sql.crossJoin.enabled). An illustrative sketch of such values follows this list.
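The exact syntax these keys expect depends on your Infoworks version, so treat the following as a hypothetical sketch of the kind of values involved rather than the definitive format:

```
# Hypothetical example values -- verify the exact syntax for your Infoworks version
# Application-master-level settings
dt_batch_sparkapp_settings=spark.dynamicAllocation.enabled=true,spark.executor.memory=4g

# SparkSession-level settings
dt_batch_spark_settings=spark.sql.crossJoin.enabled=true
```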
To use non-Infoworks Hive tables in pipelines, ensure that Hive ingestion is performed on the tables.
To design a pipeline with non-Infoworks Hive tables that incrementally updates an existing Data Transformation target, you can use the following control variables in a filter:
- $.highWatermark, which is set to the time at which the last incremental load finished.
- $.lowWatermark, which is initially set to the epoch start (Jan 1, 1970) for the first build; later builds use the highWatermark value of the previous build.
The watermark variables operate similar to the Load source data incrementally option for the Infoworks managed tables. For example, to load the ORDERS table incrementally, add a filter similar to the following:
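A filter expression along these lines could be used; the column name ORDER_TS is a hypothetical placeholder for whatever timestamp column tracks changes in the ORDERS table, and the exact quoting depends on the column's data type:

```
-- Hypothetical incremental filter on the ORDERS table; ORDER_TS is an assumed timestamp column
ORDER_TS > '$.lowWatermark' AND ORDER_TS <= '$.highWatermark'
```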
NOTE
Global init script must not set any Spark configurations.