While following a related blog post, I added decimal type support to the DataX hdfswriter; once the data was actually used, a new problem surfaced:
in the ORC files written out, decimal fields are serialized with the default precision, decimal(38,18). In some scenarios this triggers a precision-conversion error and the compute job fails. The 38/18 defaults come straight from HiveDecimal:
```java
public class HiveDecimal implements Comparable<HiveDecimal> {
    public static final int SYSTEM_DEFAULT_PRECISION = 38;
    public static final int SYSTEM_DEFAULT_SCALE = 18;
}
```
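These two constants feed the default decimal TypeInfo. A minimal check of that, assuming a Hive version whose serde2 TypeInfoFactory exposes the default decimal TypeInfo (the class name DefaultDecimalTypeCheck is just for illustration):

```java
import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;

public class DefaultDecimalTypeCheck {
    public static void main(String[] args) {
        // The factory's default decimal TypeInfo is built from the system defaults above.
        DecimalTypeInfo defaultInfo = (DecimalTypeInfo) TypeInfoFactory.decimalTypeInfo;
        System.out.println(defaultInfo.getTypeName()); // expected: decimal(38,18)
    }
}
```

The same decimal(38,18) is what Spark reports when it reads back an ORC file written without an explicit precision: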
```markdown
df.printSchema()
root
 |-- xxxx_rate: decimal(38,18) (nullable = true)
```
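To make the failure mode concrete: with precision 38 and scale 18, only 20 digits remain for the integral part, so a value with a larger integral part (or a write into a narrower target type) cannot survive the precision conversion. A rough sketch of one such conversion, assuming Hive's HiveDecimal API; the exact exception seen in the compute job depends on the engine and the target schema:

```java
import org.apache.hadoop.hive.common.type.HiveDecimal;

public class PrecisionMismatchSketch {
    public static void main(String[] args) {
        // 21 integral digits cannot fit into decimal(38,18): 38 - 18 leaves only 20.
        HiveDecimal value = HiveDecimal.create("123456789012345678901.5");
        HiveDecimal adjusted = HiveDecimal.enforcePrecisionScale(value, 38, 18);
        System.out.println(adjusted); // null: the value is not representable as decimal(38,18)
    }
}
```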
So the hdfswriter needs to record the desired precision and scale for each decimal field when it writes the ORC file.
Looking at how the default ObjectInspector for decimal is instantiated, there is no parameter through which a field's precision and scale could be passed in.
```java
import org.apache.hadoop.hive.common.type.HiveDecimal;
...
public List<ObjectInspector> getColumnTypeInspectors(List<Configuration> columns) {
    ...
    case DECIMAL:
        // create the ObjectInspector for DECIMAL columns
        objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(HiveDecimal.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        break;
}
```
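A quick check of what that reflection call actually hands back (a sketch; the expected output follows from the defaults shown earlier): the inspector reports the default decimal type, with no way to override it per column.

```java
import org.apache.hadoop.hive.common.type.HiveDecimal;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

public class ReflectionInspectorCheck {
    public static void main(String[] args) {
        ObjectInspector oi = ObjectInspectorFactory.getReflectionObjectInspector(
                HiveDecimal.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        // expected: decimal(38,18), i.e. the system defaults
        System.out.println(oi.getTypeName());
    }
}
```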
Drilling into ObjectInspectorFactory.getReflectionObjectInspector to see how the inspector is instantiated:
```java
public static ObjectInspector getReflectionObjectInspector(Type t, ObjectInspectorOptions options) {
    ObjectInspector oi = (ObjectInspector) objectInspectorCache.get(t);
    if (oi == null) {
        // not cached yet: instantiate it for the first time
        oi = getReflectionObjectInspectorNoCache(t, options);
        objectInspectorCache.put(t, oi);
    }
    ...
    return oi;
}
```
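Note the cache keyed by the Java type: every DECIMAL column shares one inspector instance, so precision and scale could not vary per column along this path even if they were settable afterwards. A small sketch of that behavior (class name is illustrative):

```java
import org.apache.hadoop.hive.common.type.HiveDecimal;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

public class InspectorCacheSketch {
    public static void main(String[] args) {
        ObjectInspector first = ObjectInspectorFactory.getReflectionObjectInspector(
                HiveDecimal.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        ObjectInspector second = ObjectInspectorFactory.getReflectionObjectInspector(
                HiveDecimal.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        // the second call hits the cache, so both "columns" share the same default inspector
        System.out.println(first == second); // true
    }
}
```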
On the first instantiation, the ObjectInspector is picked according to the class of the given data type:
```java
private static ObjectInspector getReflectionObjectInspectorNoCache(Type t, ObjectInspectorOptions options) {
    ...
    if (!(t instanceof Class)) {
        throw new RuntimeException(ObjectInspectorFactory.class.getName() + " internal error:" + t);
    } else {
        Class<?> c = (Class) t;
        if (PrimitiveObjectInspectorUtils.isPrimitiveJavaType(c)) {
            return PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(PrimitiveObjectInspectorUtils.getTypeEntryFromPrimitiveJavaType(c).primitiveCategory);
        } else if (PrimitiveObjectInspectorUtils.isPrimitiveJavaClass(c)) {
            // org.apache.hadoop.hive.common.type.HiveDecimal lands here:
            // it is not a primitive type or a boxed primitive, just a plain Java class,
            // and there is no sign of a parameter for custom precision, so another route is needed.
            return PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(PrimitiveObjectInspectorUtils.getTypeEntryFromPrimitiveJavaClass(c).primitiveCategory);
        } else if (PrimitiveObjectInspectorUtils.isPrimitiveWritableClass(c)) {
            return PrimitiveObjectInspectorFactory.getPrimitiveWritableObjectInspector(PrimitiveObjectInspectorUtils.getTypeEntryFromPrimitiveWritableClass(c).primitiveCategory);
        } else if (Enum.class.isAssignableFrom(c)) {
            return PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(PrimitiveCategory.STRING);
        } else {
            ...
        }
    }
}
```
Asking GPT turned up a potential implementation; based on that suggestion, let's give it a quick test:

```java
import org.apache.hadoop.hive.common.type.HiveDecimal;
...
public List<ObjectInspector> getColumnTypeInspectors(List<Configuration> columns) {
    ...
    case DECIMAL:
        // build a DECIMAL ObjectInspector that carries the configured precision and scale
        Integer precision = eachColumnConf.getInt(Key.PRECISION);
        Integer scale = eachColumnConf.getInt(Key.SCALE);
        DecimalTypeInfo typeInfo = TypeInfoFactory.getDecimalTypeInfo(precision, scale);
        objectInspector = PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(typeInfo);
        break;
}
```
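One caveat worth guarding against (a hypothetical hardening, not part of the change above): DataX's Configuration.getInt returns null when the column config omits the key, and unboxing null into getDecimalTypeInfo would throw a NullPointerException. A hedged variant of the case branch that falls back to Hive's system defaults in that case:

```java
// Hypothetical defensive variant: fall back to decimal(38,18) when the job.json
// column does not carry "precision"/"scale", instead of failing on unboxing null.
Integer precision = eachColumnConf.getInt(Key.PRECISION);
Integer scale = eachColumnConf.getInt(Key.SCALE);
int p = (precision != null) ? precision : HiveDecimal.SYSTEM_DEFAULT_PRECISION;
int s = (scale != null) ? scale : HiveDecimal.SYSTEM_DEFAULT_SCALE;
objectInspector = PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(
        TypeInfoFactory.getDecimalTypeInfo(p, s));
```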
After recompiling and repackaging, test with a job.json that specifies precision and scale for the decimal column:
```json
{
    ...
    "writeMode": "append",
    "column": [
        {
            "type": "decimal",
            "name": "xxxx_rate",
            "precision": 38,
            "scale": 0
        },
        ...
    ]
}
```
Reading the written ORC file back with Spark and checking the field's schema shows the correct precision and scale, so the problem is solved:
```markdown
df = spark.read.orc("/user/hive/warehouse/xxxx/xxxxx")
df.printSchema()
root
 |-- xxxx_rate: decimal(38,0) (nullable = true)
```
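As an extra cross-check independent of Spark, the schema recorded in the ORC footer can be read directly. A minimal sketch, assuming the org.apache.orc reader (orc-core) is on the classpath and pointing it at one of the written files (the file name below is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class OrcSchemaCheck {
    public static void main(String[] args) throws Exception {
        Reader reader = OrcFile.createReader(
                new Path("/user/hive/warehouse/xxxx/xxxxx/part-0.orc"),  // placeholder path
                OrcFile.readerOptions(new Configuration()));
        // prints something like: struct<xxxx_rate:decimal(38,0),...>
        System.out.println(reader.getSchema());
    }
}
```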
References: