Troubleshoot schema-related errors for replicated tables

Problem statement: A pipeline that uses replicated tables as source tables may fail with an error similar to the one shown below.

Caused by: java.lang.ClassCastException: org.apache.orc.storage.serde2.io.DateWritable cannot be cast to org.apache.hadoop.io.Text

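Root cause: The error occurs when the schema defined in the table DDL does not match the schema of the underlying ORC files. In the error above, the ORC files hold a date value (DateWritable) in a column that the DDL declares as a string (Text). For replicated tables, this can happen in the following scenarios.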

Scenario 1: The on-premise table was dropped and re-created with a different schema (for example, the data type of a column changed). To verify this, run the command below against both tables and compare the CreateTime of the on-premise table with that of the replicated table on Dataproc.

Command: describe formatted <table_name>;
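The CreateTime comparison can be scripted. The following is a minimal sketch, not part of the replication tooling: the helper names `extract_create_time` and `recreated_after_replication` are hypothetical. It assumes the CreateTime row is in the usual Java date format (for example `Tue Mar 05 10:21:33 UTC 2024`) and that both clusters report times in the same time zone, because the time zone token is ignored.

```python
import re
from datetime import datetime

# Matches the CreateTime row of `describe formatted` output, with or without
# beeline-style "|" column separators. Assumed value format: "Tue Mar 05 10:21:33 UTC 2024".
_CREATE_TIME = re.compile(
    r"CreateTime:[\s|]*\w{3} (\w{3}) +(\d{1,2}) (\d{2}:\d{2}:\d{2}) \S+ (\d{4})"
)


def extract_create_time(describe_output):
    """Return the table's CreateTime as a naive datetime (time zone token ignored)."""
    match = _CREATE_TIME.search(describe_output)
    if match is None:
        raise ValueError("no CreateTime row found in describe output")
    month, day, clock, year = match.groups()
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")


def recreated_after_replication(onprem_output, dataproc_output):
    """True if the on-premise table was created later than the replicated copy."""
    return extract_create_time(onprem_output) > extract_create_time(dataproc_output)


if __name__ == "__main__":
    # Illustrative sample rows, not real output from any cluster.
    onprem = "CreateTime:         \tTue Mar 05 10:21:33 UTC 2024"
    dataproc = "| CreateTime:   | Mon Jan 15 08:00:00 UTC 2024 | NULL |"
    print(recreated_after_replication(onprem, dataproc))
```

If the on-premise CreateTime is later than the replicated table's CreateTime, the on-premise table was probably re-created after the last replication, which points to this scenario.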

Solution:

  1. Drop the replicated table from Dataproc.

drop table <table_name>;

  2. Re-run the replication for the table with the following config.

Key: TRUNCATE_OVERWRITE

Value: true
