How to update the default autoscaling policy for Dataproc Ephemeral clusters

Problem:

I would like to use a custom autoscaling policy for the Dataproc cluster that runs my ephemeral jobs, or I would like that cluster to use secondary worker nodes.

Solution:

Infoworks provides a pre ingestion job hook that runs a bash script before the ingestion job begins.

The steps below use the pre ingestion job hook to replace the default autoscaling policy with a user-defined custom autoscaling policy.

Steps:

  1. Create a custom autoscaling policy in the GCP console and note the autoscaling policy ID.

<https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling>

  2. Create a bash script like the one below:

--autoscaling-policy=autoscale-015a243a33b20d5eba4e5e98

Replace with your actual autoscaling policy ID from step 1.

--region=us-central1

Replace with the actual region of your Dataproc cluster.

  3. Create a pre ingestion job hook and upload the bash script.

<https://docs.infoworks.io/infoworks-5.1.2/admin-and-operations/extensions#managing-job-hooks>

  4. Add the pre ingestion job hook to the Infoworks source where you would like to use the custom autoscaling policy.

    <https://docs.infoworks.io/infoworks-5.1.2/admin-and-operations/extensions#using-a-job-hook>
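The script in step 2 shows only the two options that need to be changed. As a rough sketch of how a complete hook script might look, the example below passes them to `gcloud dataproc clusters update`. Several things here are assumptions and not taken from Infoworks documentation: that the `gcloud` CLI is installed and authenticated where the hook runs, that the cluster name is available in a `CLUSTER_NAME` environment variable (a hypothetical name), and the `run`, `update_policy`, and `DRY_RUN` helpers, which exist only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: attach a custom autoscaling policy to a Dataproc cluster.
# Assumes gcloud is installed and authenticated on the machine running the hook.

POLICY_ID="autoscale-015a243a33b20d5eba4e5e98"  # replace with your policy ID from step 1
REGION="us-central1"                            # replace with your cluster's region

# Run a command, or only print it when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Apply the autoscaling policy to the cluster named in the first argument.
update_policy() {
  run gcloud dataproc clusters update "$1" \
    "--autoscaling-policy=${POLICY_ID}" \
    "--region=${REGION}"
}

# CLUSTER_NAME is a hypothetical variable; how the hook learns the name of the
# ephemeral cluster is not covered here. Do nothing when it is not set.
if [ -n "${CLUSTER_NAME:-}" ]; then
  update_policy "$CLUSTER_NAME"
fi
```

Setting `DRY_RUN=1` prints the command without calling `gcloud`, which may help when checking the policy ID and region before uploading the script as a job hook.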

Note:

  1. The above script updates the autoscaling policy only for ephemeral clusters.

  2. A pre ingestion job hook applies to all tables in the source; it cannot be applied to individual tables.

Affects Version:

Infoworks 5.0, 5.1.X
