Submitting Spark Pipelines

Spark pipelines can be configured to run in client mode on the edge node or be submitted via Apache Livy. By default, if no configuration is specified, Spark pipelines run in client mode on the edge node.

Configuring Spark Pipelines to Run in Client Mode on the Edge Node

Perform the following to configure Spark pipelines to run in client mode on the edge node:

  • Add the following configuration in the pipeline Advanced Configuration option: job.dispatcher.type=native

Configuring Spark Pipelines to Run in Cluster Mode

Perform the following to configure Spark pipelines to run in cluster mode:

  • Add the following configuration in the pipeline Advanced Configuration option: job.dispatcher_type=spark
  • Add the required configurations to the dt_spark_defaults.conf file on the edge node, as sketched below.
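The exact entries are installation-specific. As a minimal illustrative sketch, assuming a YARN cluster and standard Spark properties (all values below are placeholders, not the definitive configuration):

    # dt_spark_defaults.conf -- illustrative values only
    spark.master                 yarn
    spark.submit.deployMode      cluster
    spark.driver.memory          2g
    spark.executor.memory        4g
    spark.executor.cores         2
    # The Infoworks-specific entry that sets ${IW_HOME} on HDFS (referenced in the notes
    # below) also belongs in this file; its property name comes from your installation.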

NOTES

The ${IW_HOME} path on HDFS can be different from the ${IW_HOME} path on the edge node local file system. The ${IW_HOME} on HDFS specified in the above configuration is used as ${IW_HOME} for pipeline jobs running in cluster mode, so ensure that you copy the ${IW_HOME}/conf folder from the edge node local file system to ${IW_HOME} on HDFS.

  • Copy the ${IW_HOME}/conf folder from the local file system to the Hadoop ${IW_HOME} (${IW_HOME} on HDFS); see the shell sketch after this list.
  • In cluster mode, the pipeline job runs in the YARN cluster and reads all configuration files from HDFS. When specifying configuration files in the ${IW_HOME}/conf/conf.properties file in Hadoop, ensure that the configuration file paths are prefixed with hdfs:
  • The dt_spark_configfile_batch configuration in the Hadoop ${IW_HOME}/conf/conf.properties file must point to the HDFS path of the dt_spark_defaults.conf file (dt_spark_configfile_batch=hdfs:/<df_spark_default_conf>).
  • When running in cluster mode, the pipeline job uploads lib jars to HDFS. By default, the HDFS path mirrors the local path when the jars are uploaded. For example, if the local jar path is file:/opt/info/lib/df/*, the path hdfs:/opt/info/lib/df/* is created on HDFS and the jars from file:/opt/info/lib/df/* are uploaded to it. To change the base HDFS lib path, add the following configuration in the ${IW_HOME}/conf/conf.properties file on the edge node: dt_hdfs_lib_base_path=<HDFS lib base path> (see the conf.properties excerpt at the end of this section).
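As a rough sketch of the copy step and the HDFS-side conf.properties entry described above, assuming /opt/infoworks as the ${IW_HOME} path on HDFS (substitute your actual locations):

    # Copy the edge node conf folder to ${IW_HOME} on HDFS
    hadoop fs -mkdir -p /opt/infoworks/conf
    hadoop fs -put -f ${IW_HOME}/conf/* /opt/infoworks/conf/

    # In the HDFS copy of conf.properties, reference configuration files with the hdfs: prefix, for example:
    # dt_spark_configfile_batch=hdfs:/opt/infoworks/conf/dt_spark_defaults.conf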

Spark 2.1 does not allow the same jar name to appear multiple times, even in different paths. If an error occurs, add the following configuration in the ${IW_HOME}/conf/conf.properties file on the edge node: dt_classpath_include_unique_jars=true
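For reference, a hypothetical excerpt of the edge node ${IW_HOME}/conf/conf.properties combining the two edge-node settings mentioned above (the HDFS lib base path value is an assumption):

    # Edge node conf.properties (illustrative excerpt)
    # Base HDFS path under which pipeline lib jars are uploaded
    dt_hdfs_lib_base_path=/opt/infoworks/lib
    # Avoid classpath errors from duplicate jar names in Spark 2.1
    dt_classpath_include_unique_jars=true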