Analysis of an 11.2.0.3 RAC (VCS) Node Crash and Hang (Part 1)

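Yesterday, one node of a customer's two-node RAC crashed, and the other node eventually hung as well, so the only option was shutdown abort.
Even after the instance had been shut down with shutdown abort, some processes, including lgwr and arch, could not be killed with kill -9.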

First, let's look at the alert log of the node that crashed that afternoon:

Tue Apr 22 17:16:04 2014
Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4878_9l51y1cc_.arc
Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4879_9l529hc6_.arc
Archived Log entry 10847 added for thread 1 sequence 5314 ID 0xffffffffae21a60f dest 1:
Tue Apr 22 17:25:05 2014
IPC Send timeout detected. Sender: ospid 27573 [oracle@xhdb-server3 (LMON)]
Receiver: inst 2 binc 95439 ospid 13752
Communications reconfiguration: instance_number 2
Tue Apr 22 17:26:49 2014
LMON (ospid: 27573) has not called a wait for 89 secs.
Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_lmhb_27613.trc (incident=14129):
ORA-29770: global enqueue process LMON (OSID 27573) is hung for more than 70 seconds
Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/incident/incdir_14129/authorcl1_lmhb_27613_i14129.trc
Tue Apr 22 17:26:58 2014
Sweep [inc][14129]: completed
Sweep [inc2][14129]: completed
ERROR: Some process(s) is not making progress.
LMHB (ospid: 27613) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
Tue Apr 22 17:26:58 2014
System state dump requested by (instance=1, osid=27613 (LMHB)), summary=[abnormal instance termination].
LMHB (ospid: 27613): terminating the instance due to error 29770
System State dumped to trace file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trc
Tue Apr 22 17:27:00 2014
ORA-1092 : opitsk aborting process
Tue Apr 22 17:27:01 2014
License high water mark = 144
Tue Apr 22 17:27:08 2014
Termination issued to instance processes. Waiting for the processes to exit
Instance termination failed to kill one or more processes
Instance terminated by LMHB, pid = 27613
Tue Apr 22 17:27:15 2014
USER (ospid: 1378): terminating the instance
Termination issued to instance processes. Waiting for the processes to exit
Tue Apr 22 17:27:25 2014
Instance termination failed to kill one or more processes
Instance terminated by USER, pid = 1378
Tue Apr 22 21:51:56 2014
Adjusting the default value of parameter parallel_max_servers
from 640 to 135 due to the value of parameter processes (150)
Starting ORACLE instance (normal)

As we can see, the LMON IPC send timeout error was first raised at Apr 22 17:25:05 2014.

Receiver: inst 2 binc 95439 ospid 13752: the receiver here is process 13752 on node 2, that is, the LMON process of node 2.
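To make the timeline easier to follow, here is a minimal Python sketch that parses a short excerpt of the alert log above. It extracts the receiver details and computes how long it took from the IPC send timeout to the LMHB termination. The helper names (parse_events, first_time) are made up for this illustration, and the timestamp parsing assumes an English/C locale.

```python
import re
from datetime import datetime, timedelta

# Excerpt copied from the alert log above (only the lines needed here).
ALERT_LOG = """\
Tue Apr 22 17:25:05 2014
IPC Send timeout detected. Sender: ospid 27573 [oracle@xhdb-server3 (LMON)]
Receiver: inst 2 binc 95439 ospid 13752
Tue Apr 22 17:26:49 2014
LMON (ospid: 27573) has not called a wait for 89 secs.
Tue Apr 22 17:26:58 2014
LMHB (ospid: 27613): terminating the instance due to error 29770
"""

TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Tue Apr 22 17:25:05 2014"


def parse_events(text):
    """Return (timestamp, message) pairs; a message inherits the latest timestamp line."""
    events = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        try:
            current = datetime.strptime(line, TS_FORMAT)
        except ValueError:
            # Not a timestamp line, so it is a message belonging to `current`.
            if current is not None:
                events.append((current, line))
    return events


def first_time(events, needle):
    """Timestamp of the first message containing `needle`, or None."""
    for ts, msg in events:
        if needle in msg:
            return ts
    return None


events = parse_events(ALERT_LOG)

# Receiver instance number and OS pid from the "Receiver:" line.
match = re.search(r"Receiver: inst (\d+) binc (\d+) ospid (\d+)", ALERT_LOG)
receiver_inst, receiver_ospid = int(match.group(1)), int(match.group(3))

timeout_at = first_time(events, "IPC Send timeout detected")
killed_at = first_time(events, "terminating the instance due to error 29770")
seconds_to_termination = int((killed_at - timeout_at).total_seconds())

# "has not called a wait for 89 secs" lets us estimate when LMON last made progress.
hung_reported_at = first_time(events, "has not called a wait")
hung_secs = int(re.search(r"has not called a wait for (\d+) secs", ALERT_LOG).group(1))
lmon_last_wait = hung_reported_at - timedelta(seconds=hung_secs)

print(receiver_inst, receiver_ospid)   # 2 13752
print(seconds_to_termination)          # 113
print(lmon_last_wait.time())           # 17:25:20
```

Based only on these log lines, LMON stopped calling waits around 17:25:20, about 15 seconds after the IPC send timeout, and LMHB terminated the instance 113 seconds after that timeout.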

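The LMON process mainly monitors the GES information of the RAC, but its role is not limited to that. It also checks the health of
every node in the cluster, and when a node fails it is responsible for the reconfiguration and for recovering the GRD (Global Resource Directory), among other things.
As for the RAC split-brain mechanism: if IO fencing is handled by Oracle itself, that is, by Clusterware, then when LMON detects
a split brain at the instance level it notifies Clusterware to resolve it, but it does not wait indefinitely for Clusterware's result.
Once the wait exceeds a certain time, LMON automatically triggers IMR (instance membership recovery), which is what we usually call
instance membership reconfig.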

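Next, the LMON process relies mainly on two heartbeat mechanisms to judge the health of the cluster nodes:
1) Network heartbeat (detected mainly through ping)
2) Controlfile disk heartbeat, which is simply each node's CKPT process updating the controlfile every 3 seconds.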
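The two heartbeats above can be illustrated with a toy model. The sketch below is not Oracle's implementation. The 3-second interval comes from the text, while MISSED_BEATS_ALLOWED, the NodeHeartbeat class and the classify function are hypothetical names and values chosen only for illustration. Applying the same limit to both heartbeats is also a simplification.

```python
from dataclasses import dataclass

CKPT_INTERVAL_SECS = 3     # from the text: CKPT updates the controlfile every 3 seconds
MISSED_BEATS_ALLOWED = 10  # hypothetical threshold, NOT an Oracle parameter


@dataclass
class NodeHeartbeat:
    name: str
    last_network_beat: float  # time (seconds) of the last network heartbeat seen
    last_disk_beat: float     # time (seconds) of the last controlfile heartbeat seen


def classify(node, now, interval=CKPT_INTERVAL_SECS, allowed=MISSED_BEATS_ALLOWED):
    """Classify a node by which heartbeats are older than interval * allowed seconds."""
    limit = interval * allowed
    network_ok = (now - node.last_network_beat) <= limit
    disk_ok = (now - node.last_disk_beat) <= limit
    if network_ok and disk_ok:
        return "healthy"
    if disk_ok:
        return "network heartbeat lost"
    if network_ok:
        return "disk heartbeat lost"
    return "both heartbeats lost"


# Node 2 stopped answering on the network 60 s ago but still updates the controlfile.
node2 = NodeHeartbeat("node2", last_network_beat=40.0, last_disk_beat=98.0)
print(classify(node2, now=100.0))  # network heartbeat lost
```

A node that still writes its disk heartbeat but is unreachable over the network is the split-brain candidate described earlier, which is why both signals have to be checked together.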

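So LMON clearly needs to access the controlfile; otherwise it could not evaluate the second heartbeat.
Although the errors above show that the instance was terminated by the LMHB process, we first need to explain how LMHB works.
LMHB is a process introduced in Oracle 11gR2. Its job is to monitor core processes such as LMD, LMON, LCK and LMS, in order to prevent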
