ORA-00600 [kjctr_pbmsg:badbmsg2] - Database Programming

NCE CRASH WITH ORA-00600 [KJCTR_PBMSG:BADBMSG2] in 11.2.0.3
This looks like some form of underlying infrastructure/network issue; please work with the customer to have this checked and tested.
Bug 17452853 : LNX64-12.1-EF,DB INST CRASH WITH LMS4 HIT ORA-600 [KJCTR_PBMSG:BADBMSG2] in 12.1.0.2
Bug 17049773 Diagnostic enhancement to give additional parameter in error ORA-600 [ kjctr_pbmsg:badbmsg2] in 12.1.0.1
Note: This fix will not address the root cause of the error but the additional information may help with diagnosis of the cause.
Bug 13917456 : LNX64-12.1-UD: ASM LMD HIT ORA-00600 KJCTR_PBMSG:BADBMSG2 IN NON-UPGRADED NODES in 12.1.0.0.2
It may have occurred during the upgrade from 11.2.0.3 to 12.1. Not related to this SR.

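4. At this point I needed to check the AWR, the oswatcher data, and all of the LMS, LMD, LMON, LMHB and DIAG logs from the time of the problem, to see whether any more information had been recorded.
I also used cluvfy and ORAchk to check the overall RAC environment.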

--. AWR report 22:00~23:00 on Aug 11 from both nodes.

--. Deploy oswatcher, then collect the current OS information while the database workload is high.

--. All the LMS, LMD, LMON, LMHB and DIAG trace files from both nodes.

--. CVU output:

cluvfy stage -pre crsinst -n -verbose

--. Please run ORAchk as root.

ORAchk - Health Checks for the Oracle Stack (Doc ID 1268927.2)
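To make the trace collection step above less error-prone, the file selection can be scripted. The following is only a minimal sketch: it assumes the usual `<instance>_<process>_<ospid>.trc` naming of background process trace files, and the function name `select_rac_traces` and the sample file names are hypothetical, not taken from the SR.

```python
import re

# Assumed naming convention: <instance>_<process><optional slot>_<ospid>.trc
# The process names are the ones requested in the checklist above.
TRACE_NAME = re.compile(
    r"^(?P<instance>.+)_(?P<proc>lms|lmd|lmon|lmhb|diag)"
    r"(?P<slot>[0-9a-z]?)_(?P<pid>\d+)\.trc$"
)


def select_rac_traces(filenames):
    """Group trace file names by RAC background process (lms, lmd, ...)."""
    selected = {}
    for name in filenames:
        match = TRACE_NAME.match(name)
        if match:
            selected.setdefault(match.group("proc"), []).append(name)
    return selected


if __name__ == "__main__":
    # Hypothetical directory listing, for illustration only.
    listing = [
        "orcl1_lms4_1001.trc",
        "orcl1_lmd0_1002.trc",
        "orcl1_lmon_1003.trc",
        "orcl1_lmhb_1004.trc",
        "orcl1_diag_1005.trc",
        "orcl1_ora_2001.trc",
        "alert_orcl1.log",
    ]
    for proc, files in sorted(select_rac_traces(listing).items()):
        print(proc, files)
```

Foreground traces (`_ora_`) and the alert log are skipped on purpose, since the checklist only asks for the five background processes. In practice the listing would come from the trace directory on each node.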

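5. While checking the AWR reports, I found "gc blocks lost". In theory this statistic should not show up if the private network is healthy, so its presence is basically enough to say that the private interconnect is unstable.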
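The check described here can be automated against the text of an AWR report. This is a rough sketch, assuming the statistic appears on one line in the form `gc blocks lost <Total> <per Second> <per Trans>`, as it does in the extracts in this post. The function names are hypothetical, and "any non-zero value is suspicious" is the rule of thumb from this post, not an official threshold.

```python
def gc_blocks_lost(awr_lines):
    """Return the Total of the 'gc blocks lost' statistic, or None if absent."""
    label = "gc blocks lost"
    for line in awr_lines:
        text = line.strip()
        if text.startswith(label):
            fields = text[len(label):].split()
            if fields:
                # AWR prints thousands separators, e.g. 1,234
                return int(fields[0].replace(",", ""))
    return None


def interconnect_suspect(awr_lines):
    """Flag the report when any blocks were lost on the interconnect."""
    lost = gc_blocks_lost(awr_lines)
    return lost is not None and lost > 0


if __name__ == "__main__":
    sample = [
        "Statistic Total per Second per Trans",
        "gc blocks lost 269 0.07 0.01",
    ]
    print(gc_blocks_lost(sample), interconnect_suspect(sample))
```

A report with no such line returns `None` and is not flagged, which keeps "statistic missing" separate from "statistic present and zero".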

awrrpt_2_29557_29558.html

             Snap Id  Snap Time           Sessions  Cursors/Session
Begin Snap:  29557    11-Aug-14 22:00:45  563       1.3
End Snap:    29558    11-Aug-14 23:01:00  551       1.3
Elapsed:     60.24 (mins)
DB Time:     4,835.90 (mins)

Top 5 Timed Foreground Events

Event                      Waits      Time(s)  Avg wait (ms)  % DB time  Wait Class
db file sequential read    6,269,185  185,621  30             63.97      User I/O
DB CPU                                42,433                  14.62
gc current grant 2-way     3,251,636  25,671   8              8.85       Cluster
db file scattered read     550,524    9,873    18             3.40       User I/O
gc cr multi block request  637,442    6,790    11             2.34       Cluster

Instance Activity Stats

Statistic       Total  per Second  per Trans
gc blocks lost  269    0.07        0.01       <<<<<<<<<<<<

awrrpt_1_29557_29558.html

             Snap Id  Snap Time           Sessions  Cursors/Session
Begin Snap:  29557    11-Aug-14 22:00:44  2470      1.0
End Snap:    29558    11-Aug-14 23:00:59  2500      1.0
Elapsed:     60.25 (mins)
DB Time:     4,549.47 (mins)

Top 5 Timed Foreground Events

Event                      Waits      Time(s)  Avg wait (ms)  % DB time  Wait Class
db file sequential read    8,180,795  154,504  19             56.60      User I/O
DB CPU                                44,994                  16.48
gc current grant 2-way     3,699,003  29,357   8              10.75      Cluster
db file scattered read     677,065    10,190   15             3.73       User I/O
gc cr multi block request  718,327    7,856    11             2.88       Cluster

Statistic       Total  per Second  per Trans
gc blocks lost  410    0.11        0.01       <<<<<<<<<<<<
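As a sanity check on the two AWR extracts above, the derived columns can be recomputed from the raw figures. The sketch below assumes the usual relationships (average wait = time / waits, % DB time = time / DB Time, per-second rate = total / elapsed time); the helper names are hypothetical.

```python
def avg_wait_ms(time_s, waits):
    """Average wait in milliseconds from total wait time (s) and wait count."""
    return time_s / waits * 1000.0


def pct_db_time(time_s, db_time_min):
    """Share of DB Time (given in minutes) consumed by an event, in percent."""
    return time_s / (db_time_min * 60.0) * 100.0


def per_second(total, elapsed_min):
    """Per-second rate of a statistic over the snapshot interval (minutes)."""
    return total / (elapsed_min * 60.0)


if __name__ == "__main__":
    # Node 2 (awrrpt_2_29557_29558.html): db file sequential read, gc blocks lost
    print(round(avg_wait_ms(185621, 6269185)))      # 30
    print(round(pct_db_time(185621, 4835.90), 2))   # 63.97
    print(round(per_second(269, 60.24), 2))         # 0.07

    # Node 1 (awrrpt_1_29557_29558.html)
    print(round(avg_wait_ms(154504, 8180795)))      # 19
    print(round(pct_db_time(154504, 4549.47), 2))   # 56.6
    print(round(per_second(410, 60.25), 2))         # 0.11
```

The recomputed values match the reported columns, so the extracts are internally consistent. The lost-block rate is small in absolute terms (roughly one block every 9 to 14 seconds), which is why the total count is the more visible signal here.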

6. This finding makes a private network problem even more likely. The final conclusion is as follows:

Bugs 16240464 and 18015296 were raised for similar issues, and both were closed as "Vendor OS Problem".
The bugs confirmed that this issue is caused by logical block corruption during network transfer over the interconnect, or by an infrastructure issue.

The ORA-00600 [kjctr_pbmsg:badbmsg2] error is purely a result of an unstable network.
The AWR reports confirm that blocks were being lost during the problematic time frame. This is one piece of evidence that the network is either saturated or corrupting packets.

By the way, I checked the AWR report and found "gc blocks lost".
Please involve the OS team and the network team to identify the root cause of the issue. The note below will be helpful for the network issue.
Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)

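7. The handling of this problem actually still lacks one stronger piece of evidence: the oswatcher logs. With oswatcher logs from the time of the problem, the private network issue would have been exposed much more clearly. After all, the "gc blocks lost" statistic and the ORA-00600 [kjctr_pbmsg:badbmsg2] error encountered throughout the analysis,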

ORA-00600 [kjctr_pbmsg:badbmsg2] (Part 2)