Spark pipelines can be configured to run in client mode on the edge node or can be submitted via Apache Livy. By default, if no configuration is specified, Spark pipelines run in client mode on the edge node.
Perform the following to configure Spark pipelines to run in client mode on the edge node:
job.dispatcher.type=native
Perform the following steps to configure Spark pipelines to run in cluster mode:
job.dispatcher_type=spark
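For reference, here is a minimal side-by-side sketch of the two dispatcher settings. It assumes both properties are set in the edge node ${IW_HOME}/conf/conf.properties file, the same file used for the other properties in this section; verify the correct location for your installation.

# Client mode on the edge node (the default when nothing is configured)
job.dispatcher.type=native

# Cluster mode
job.dispatcher_type=spark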
The ${IW_HOME} path on HDFS can be different from the ${IW_HOME} path on the edge node local file system. The ${IW_HOME} on HDFS in the above configuration is used as ${IW_HOME} for pipeline jobs running in cluster mode, so ensure that you copy the ${IW_HOME}/conf folder from the edge node local file system to ${IW_HOME} on HDFS.
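For example, assuming ${IW_HOME} is /opt/infoworks on the edge node and /user/infoworks on HDFS (both paths are placeholders for illustration), the copy can be done with the HDFS shell:

# Create the Infoworks home directory on HDFS and copy the conf folder into it
hdfs dfs -mkdir -p /user/infoworks
hdfs dfs -put /opt/infoworks/conf /user/infoworks/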
Copy the ${IW_HOME}/conf folder from the edge node local file system to the Hadoop ${IW_HOME} (that is, ${IW_HOME} on HDFS).
In the ${IW_HOME}/conf/conf.properties file in Hadoop, ensure that the configuration file paths are prefixed with hdfs:. For example, dt_spark_configfile_batch in the Hadoop ${IW_HOME}/conf/conf.properties file must point to the HDFS path of the dt_spark_defaults.conf file (dt_spark_configfile_batch=hdfs:/<df_spark_default_conf>).
Add the following configuration in the ${IW_HOME}/conf/conf.properties file in the edge node: dt_hdfs_lib_base_path=<HDFS lib base path>
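As an illustrative sketch only, with placeholder HDFS paths, the entries might look like the following. Note that dt_spark_configfile_batch is set in the conf.properties copy on HDFS, while dt_hdfs_lib_base_path is set in the edge node copy.

# In the Hadoop (HDFS) ${IW_HOME}/conf/conf.properties -- placeholder path
dt_spark_configfile_batch=hdfs:/user/infoworks/conf/dt_spark_defaults.conf

# In the edge node ${IW_HOME}/conf/conf.properties -- placeholder path
dt_hdfs_lib_base_path=/user/infoworks/lib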
Spark 2.1 does not allow having the same jar name multiple times, even in different paths. If an error occurs, add the following configuration in the ${IW_HOME}/conf/conf.properties file in the edge node: dt_classpath_include_unique_jars=true
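A quick way to check whether duplicate jar names exist is a shell one-liner such as the following; the lib directory shown is a placeholder, not a path documented here.

# List jar file names that appear more than once under the (placeholder) lib directory
find /opt/infoworks/lib -name '*.jar' -printf '%f\n' | sort | uniq -d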