[Real-world experience] OGG Extract process fails with ORA-07445 [kgherrordmp()+986] and ORA-00600 [17114]: analysis steps

Error

Today I received a problem to handle. The database reported the following errors:


ORA-07445: exception encountered: core dump [kgherrordmp()+986] [SIGSEGV] [ADDR:0x0] [PC:0xFFFFFFF] [Address not mapped to object] []
ORA-00600: internal error code, arguments: [17114], [0xFFFFFFFFFFFF], [], [], [], [], [], [], [], [], [], []

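Analysis

When ORA-07445 and ORA-00600 appear, the usual first step is to search MOS (My Oracle Support) for a matching bug. For this problem, no similar bug could be found in MOS, so the logs needed a deeper analysis.

Checking the logs

The trc logs showed that the failing process was an OGG Extract process. The trace files contained many errors. In such cases we usually look at the trc log of the first incident, because later errors are generally knock-on effects and their logs are of much less value.

An excerpt from the trc log of the first incident: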


  Heaps
    Stop reason: low mem, MaxMem 1047527424, CacheSize 1047595008
    MemSize 1043541312, LWM 1037041664
    Global cache
      [ 0]: bsize 32 counted 100 bcount 100 size 3200
      [ 1]: bsize 64 counted 166 bcount 166 size 10624
      [ 3]: bsize 192 counted 80 bcount 80 size 15360
      [ 4]: bsize 256 counted 114 bcount 114 size 29184
      [ 5]: bsize 320 counted 3 bcount 3 size 960
      [ 6]: bsize 384 counted 2 bcount 2 size 768
      [ 7]: bsize 448 counted 2 bcount 2 size 896
      [ 8]: bsize 512 counted 1 bcount 1 size 512
      [ 9]: bsize 576 counted 193 bcount 193 size 111168
      [10]: bsize 640 counted 1968 bcount 1968 size 1259520
      [11]: bsize 704 counted 6 bcount 6 size 4224
      [12]: bsize 768 counted 7 bcount 7 size 5376
      [13]: bsize 832 counted 3 bcount 3 size 2496
      [14]: bsize 896 counted 2 bcount 2 size 1792
      [15]: bsize 960 counted 3 bcount 3 size 2880
      [16]: bsize 1024 counted 152 bcount 152 size 155648
      [17]: bsize 1088 counted 4 bcount 4 size 4352
      [18]: bsize 1152 counted 4 bcount 4 size 4608
      [19]: bsize 1216 counted 3 bcount 3 size 3648
      [20]: bsize 1280 counted 2 bcount 2 size 2560
      [21]: bsize 1344 counted 1 bcount 1 size 1344
      [22]: bsize 1408 counted 6 bcount 6 size 8448
      [23]: bsize 1472 counted 6 bcount 6 size 8832
      [24]: bsize 1536 counted 2 bcount 2 size 3072
      [26]: bsize 1664 counted 2 bcount 2 size 3328
      [27]: bsize 1728 counted 3 bcount 3 size 5184
      [28]: bsize 1792 counted 1 bcount 1 size 1792
      [30]: bsize 1920 counted 3 bcount 3 size 5760
      [31]: bsize 1984 counted 2 bcount 2 size 3968
      [32]: bsize 2048 counted 1 bcount 1 size 2048
      [34]: bsize 2176 counted 6 bcount 6 size 13056
      [35]: bsize 2240 counted 2 bcount 2 size 4480
      [36]: bsize 2304 counted 1 bcount 1 size 2304
      [39]: bsize 2496 counted 3 bcount 3 size 7488
      [40]: bsize 2560 counted 2 bcount 2 size 5120
      [41]: bsize 2624 counted 7 bcount 7 size 18368
      [42]: bsize 2688 counted 2 bcount 2 size 5376
      [43]: bsize 2752 counted 4 bcount 4 size 11008
      [46]: bsize 2944 counted 2 bcount 2 size 5888
      [48]: bsize 3072 counted 1 bcount 1 size 3072
      [49]: bsize 3136 counted 2 bcount 2 size 6272
      [52]: bsize 3328 counted 1 bcount 1 size 3328
      [54]: bsize 3456 counted 2 bcount 2 size 6912
      [56]: bsize 3584 counted 5 bcount 5 size 17920
      [58]: bsize 3712 counted 2 bcount 2 size 7424
      [59]: bsize 3776 counted 2 bcount 2 size 7552
      [62]: bsize 3968 counted 3 bcount 3 size 11904
      [63]: bsize 4032 counted 2 bcount 2 size 8064
      [64]: bsize 4096 counted 2 bcount 2 size 8192
      [65]: bsize 8192 counted 273 bcount 273 size 2236416
    Local cache
    pid 5, Total 341107008 Used 341107008
    pid 6, Total 343027104 Used 343027104
    pid 7, Total 353783072 Used 353783072
    total freeable 4053696

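The log above clearly shows that the failure was caused by insufficient memory ("Stop reason: low mem"). The question is why only about 1 GB was available: the Streams pool is configured far larger than 1 GB, so why could it not be used?

The official documentation at https://docs.oracle.com/en/middleware/goldengate/core/19.1/ggcab/managing-server-resources-oracle.html says: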
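As a quick sanity check on the "about 1 GB" reading, the counters in the heap summary can be converted to MiB. The sketch below is only an illustration: the summary lines are copied from the trace excerpt above, `parse_heap_summary` is a hypothetical helper name, and treating `MaxMem` as the memory ceiling and the `pid` totals as per-process usage is my interpretation of the dump, not something the trace format documents.

```python
import re

# Summary lines copied from the heap dump in the first incident trace.
HEAP_SUMMARY = """\
Stop reason: low mem, MaxMem 1047527424, CacheSize 1047595008
MemSize 1043541312, LWM 1037041664
pid 5, Total 341107008 Used 341107008
pid 6, Total 343027104 Used 343027104
pid 7, Total 353783072 Used 353783072
total freeable 4053696
"""

MIB = 1024 * 1024


def parse_heap_summary(text):
    """Extract the byte counters from the heap summary (hypothetical helper)."""
    fields = {
        name: int(value)
        for name, value in re.findall(r"(MaxMem|CacheSize|MemSize|LWM) (\d+)", text)
    }
    # Sum the "Total" figure of every "pid N" line in the local cache section.
    fields["local_total"] = sum(
        int(value) for value in re.findall(r"pid \d+, Total (\d+)", text)
    )
    return fields


if __name__ == "__main__":
    for name, value in parse_heap_summary(HEAP_SUMMARY).items():
        print(f"{name:12s} {value:>12d} bytes  {value / MIB:8.2f} MiB")
```

`MaxMem` works out to exactly 999 MiB, and the three `pid` totals together already account for most of it, which is consistent with the "low mem" stop reason.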

By default, one Extract requests the logmining server to run with MAX_SGA_SIZE of 1GB.

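Each logmining server gets 1 GB by default, so extracting from tables with heavy transaction activity can exhaust that memory.

Solution

Split the affected OGG Extract process into several Extract processes. This effectively reduces the risk of running out of memory.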
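Splitting one Extract into several means each new Extract requests its own logmining server memory, so the Streams pool has to be large enough for all of them. The sketch below estimates that size. It assumes the default MAX_SGA_SIZE of 1 GB quoted above, plus a 25 percent headroom, which is the sizing guidance I recall from the same Oracle documentation page and which should be verified there. The function name and parameters are hypothetical.

```python
def recommended_streams_pool_gb(extract_count, max_sga_size_gb=1.0, headroom=0.25):
    """Estimate STREAMS_POOL_SIZE in GB for a number of integrated Extracts.

    Assumption: every Extract requests max_sga_size_gb (1 GB by default),
    and a fraction `headroom` of extra space is kept free on top of that.
    """
    if extract_count < 1:
        raise ValueError("extract_count must be at least 1")
    return extract_count * max_sga_size_gb * (1 + headroom)


if __name__ == "__main__":
    # Example: one busy Extract split into three.
    print(recommended_streams_pool_gb(3))
```

With three Extracts at the default setting, the estimate is 3.75 GB. If the configured Streams pool is smaller than the estimate for the planned number of Extracts, it should be enlarged before splitting.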