Infoworks Path: The path where Infoworks must be installed, for example, /opt/infoworks.
Ports: Infoworks requires port 80 to be open for interacting with the Hadoop cluster from outside the Virtual Private Cloud (VPC) network. Infoworks services use the proxy ports 2999-3009, 5672, 7070, 7071, 7080, 7001, 7005, 7006, 8005, 27017, 3030, 3011, 3012, 23011, 23012 (if platform HA is enabled) on the Infoworks server. These ports only need to communicate within the VPC and do not need to be open outside the internal network.
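A minimal sketch of this setup, assuming a RHEL-based Infoworks server with firewalld; the host name in the last command is a placeholder:
```
# Open port 80 for access from outside the VPC (run on the Infoworks server).
sudo firewall-cmd --permanent --add-port=80/tcp
sudo firewall-cmd --reload

# From another node inside the VPC, verify that an internal service port
# (for example, MongoDB on 27017) is reachable; replace the host name.
nc -zv infoworks-host.example.com 27017
```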
Access to Database: To ingest data from an RDBMS, all nodes in the cluster must have network access to the source database.
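Connectivity can be checked from each node with a simple TCP probe; the host name and port below are placeholders for your RDBMS endpoint (for example, 3306 for MySQL or 1521 for Oracle):
```
# Run on every cluster node; substitute the database host and port.
nc -zv db.example.com 3306
```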
Software: Infoworks requires the software, services, and libraries listed below:
External Libraries: The following libraries must be downloaded and installed by the customer:
Ensure that Infoworks Hive users have the privileges to perform the following:
HBase Privileges
HDFS Privileges
Hive Privileges
Infoworks must be configured before installation to access all services that it is expected to use. In turn, each service to be consumed by Infoworks must listen on an interface accessible from all nodes of the Hadoop cluster. The Infoworks installer configures MongoDB to do this. For the following services, the host addresses must be configured before installation, and the corresponding clients must be installed on the node where Infoworks will run:
The Infoworks product is installed on the local filesystem in a pre-defined directory (typically /opt/infoworks). The following structure is created on the Infoworks Server Node.
Main folder: /opt/infoworks
| Subfolder | Content |
|---|---|
| apricot-meteor/ | Infoworks UI, Job Executors, and State Machine |
| bin/ | Infoworks binaries and shell scripts |
| conf/ | Configuration folder |
| dt/ | DataTransformation |
| logs/ | Logs for Infoworks services |
| lib/ | Dependencies |
| orchestrator-engine/ | Orchestrator |
| resources/ | Third-party tools used by Infoworks: Python, Ant, Node.js |
| RestAPI/ | Ad hoc Query and Scheduler servers |
| temp/ | Temporary generated files |
Infoworks runs under a separate user. The Infoworks user must meet one of the following conditions:
Target folder permissions
Ownership of the top-level directories
The top-level directories must be created by an authorized user, and their ownership must be assigned to the Infoworks user, the Infoworks group, or the user performing the job if impersonation is enabled (see the sketch below).
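A minimal sketch of assigning ownership, assuming the install path /opt/infoworks and an infoworks user and group (both placeholders):
```
# Run as root or another authorized user.
mkdir -p /opt/infoworks
chown -R infoworks:infoworks /opt/infoworks
chmod 755 /opt/infoworks
```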
Hadoop, HBase, and by extension Infoworks, require all nodes in the cluster to have the Network Time Protocol (NTP) installed and synchronized.
To install and perform a one-time synchronization of NTP on RHEL (refer to the ntpd man pages for more details), use:
```
# yum install ntp
# ntpd -qg
# chkconfig ntpd on
# service ntpd start    # (or restart if already installed)
```
To perform synchronization at any time, use:
```
# service ntpd stop
# ntpdate -s time.nist.gov
# service ntpd start
```
Hardware requirements are estimated from the expected data size and the load on the cluster. In general, the Master nodes (Primary and Secondary NameNode, HiveServer2, HBase Master, and Spark Master) and the Slave nodes (DataNode, HBase Region Servers, and Spark Workers) have different configuration requirements. Contact the Infoworks team to determine the precise requirements.
The recommended configurations are as follows:
| Server | Configuration |
|---|---|
| For Masters | 16 vCPU, 64 GB RAM, 1 TB Storage Disk |
| For Slaves (Datanodes) | 32 vCPU, 128 GB RAM, 4-8 TB Storage Disk |
| For Infoworks Servers | 32 vCPU, 256 GB RAM, 1 TB Storage Disk |