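Yesterday one node of a customer's two-node RAC crashed, and eventually the other node hung as well; the only way out was shutdown abort.
Even after shutdown abort, some of the instance's processes could not be killed with kill -9, among them lgwr and arch.
First, let's look at the alert log of the node that crashed in the afternoon: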
Tue Apr 22 17:16:04 2014
Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4878_9l51y1cc_.arc
Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4879_9l529hc6_.arc
Archived Log entry 10847 added for thread 1 sequence 5314 ID 0xffffffffae21a60f dest 1:
Tue Apr 22 17:25:05 2014
IPC Send timeout detected. Sender: ospid 27573 [oracle@xhdb-server3 (LMON)]
Receiver: inst 2 binc 95439 ospid 13752
Communications reconfiguration: instance_number 2
Tue Apr 22 17:26:49 2014
LMON (ospid: 27573) has not called a wait for 89 secs.
Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_lmhb_27613.trc (incident=14129):
ORA-29770: global enqueue process LMON (OSID 27573) is hung for more than 70 seconds
Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/incident/incdir_14129/authorcl1_lmhb_27613_i14129.trc
Tue Apr 22 17:26:58 2014
Sweep [inc][14129]: completed
Sweep [inc2][14129]: completed
ERROR: Some process(s) is not making progress.
LMHB (ospid: 27613) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
Tue Apr 22 17:26:58 2014
System state dump requested by (instance=1, osid=27613 (LMHB)), summary=[abnormal instance termination].
LMHB (ospid: 27613): terminating the instance due to error 29770
System State dumped to trace file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trc
Tue Apr 22 17:27:00 2014
ORA-1092 : opitsk aborting process
Tue Apr 22 17:27:01 2014
License high water mark = 144
Tue Apr 22 17:27:08 2014
Termination issued to instance processes. Waiting for the processes to exit
Instance termination failed to kill one or more processes
Instance terminated by LMHB, pid = 27613
Tue Apr 22 17:27:15 2014
USER (ospid: 1378): terminating the instance
Termination issued to instance processes. Waiting for the processes to exit
Tue Apr 22 17:27:25 2014
Instance termination failed to kill one or more processes
Instance terminated by USER, pid = 1378
Tue Apr 22 21:51:56 2014
Adjusting the default value of parameter parallel_max_servers from 640 to 135
due to the value of parameter processes (150)
Starting ORACLE instance (normal)
As we can see, the earliest symptom appears at Apr 22 17:25:05 2014, when LMON raised the IPC Send timeout error.
In "Receiver: inst 2 binc 95439 ospid 13752", the receiver is process 13752 on node 2, which is node 2's LMON process.
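To make the timeline concrete, here is a minimal Python sketch that groups alert log lines under their timestamp and pulls out the sender and receiver of the IPC Send timeout. The excerpt is abridged from the log above; the helper names `parse_entries` and `first_match` are my own and not part of any Oracle tool, and the regular expressions only assume the message layout seen in this particular log.

```python
import re
from datetime import datetime

# Abridged excerpt copied from the alert log above.
ALERT_LOG = """\
Tue Apr 22 17:25:05 2014
IPC Send timeout detected. Sender: ospid 27573 [oracle@xhdb-server3 (LMON)]
Receiver: inst 2 binc 95439 ospid 13752
Communications reconfiguration: instance_number 2
Tue Apr 22 17:26:49 2014
LMON (ospid: 27573) has not called a wait for 89 secs.
Tue Apr 22 17:26:58 2014
LMHB (ospid: 27613): terminating the instance due to error 29770
"""

# Timestamp layout used by this alert log (English day and month names).
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_entries(text):
    """Return (timestamp, message) pairs; each message inherits the
    most recent timestamp line that precedes it."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        try:
            current = datetime.strptime(line, TS_FORMAT)
        except ValueError:
            # Not a timestamp line, so it is a message.
            if current is not None:
                entries.append((current, line))
    return entries


def first_match(entries, pattern):
    """Return (timestamp, match) for the first message matching pattern."""
    rx = re.compile(pattern)
    for ts, msg in entries:
        m = rx.search(msg)
        if m:
            return ts, m
    return None


entries = parse_entries(ALERT_LOG)
t_ipc, sender = first_match(entries, r"Sender: ospid (\d+) \[.*\((\w+)\)\]")
_, receiver = first_match(entries, r"Receiver: inst (\d+) binc \d+ ospid (\d+)")
t_kill, _ = first_match(entries, r"terminating the instance due to error \d+")

print("sender ospid:", sender.group(1), "process:", sender.group(2))
print("receiver instance:", receiver.group(1), "ospid:", receiver.group(2))
print("seconds from IPC timeout to termination:",
      (t_kill - t_ipc).total_seconds())
```

On this excerpt the script reports sender ospid 27573 (LMON), receiver instance 2 with ospid 13752, and 113 seconds between the first IPC Send timeout and LMHB terminating the instance.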
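The LMON process mainly monitors RAC's GES (Global Enqueue Service) information, but its role goes well beyond that. It also checks the health of
every node in the cluster, and when a node fails it drives the reconfiguration and the recovery of the GRD (Global Resource Directory), among other things.
As for RAC's split-brain handling: if I/O fencing is done by Oracle itself, that is, by Clusterware, then when LMON detects
a split-brain at the instance level it notifies Clusterware to resolve it. LMON does not block on Clusterware's result, though. If the
split-brain is still unresolved after a certain time, LMON triggers IMR (instance membership recovery) on its own, which is what we usually call
instance membership reconfiguration.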
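In addition, LMON relies on two heartbeat mechanisms to judge the health of the cluster nodes:
1) Network heartbeat (checked mainly through ping)
2) Controlfile disk heartbeat, i.e. the CKPT process on each node updating the controlfile every 3 seconds.
So LMON clearly needs to access the controlfile; otherwise it could not evaluate the second heartbeat.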
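The two heartbeats can be pictured with a toy model like the one below. This is only an illustration of the idea, not Oracle's implementation: the `NodeHeartbeat` type, the `classify` function and the `missed_limit` tolerance are all hypothetical; the only value taken from the text is the 3-second controlfile update interval.

```python
from dataclasses import dataclass

# From the text: CKPT updates the controlfile heartbeat every 3 seconds.
CKPT_INTERVAL_SECS = 3


@dataclass
class NodeHeartbeat:
    name: str
    last_network_reply: float  # time (s) of the last network heartbeat reply
    last_ckpt_update: float    # time (s) of the last controlfile heartbeat


def classify(node, now, missed_limit=10):
    """Classify a node from the age of its two heartbeats.

    missed_limit is a hypothetical tolerance: the number of missed
    3-second intervals allowed before a heartbeat counts as lost.
    """
    timeout = CKPT_INTERVAL_SECS * missed_limit
    net_ok = (now - node.last_network_reply) <= timeout
    disk_ok = (now - node.last_ckpt_update) <= timeout
    if net_ok and disk_ok:
        return "healthy"
    if disk_ok:
        return "network heartbeat lost"
    if net_ok:
        return "disk heartbeat lost"
    return "both heartbeats lost"


# Example: the network heartbeat is stale while the disk heartbeat is fresh.
node2 = NodeHeartbeat("inst2", last_network_reply=0.0, last_ckpt_update=98.0)
print(classify(node2, now=100.0))
```

The point of keeping both checks is that a node which is unreachable over the interconnect but still writing its disk heartbeat is a different situation from a node that has stopped entirely.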
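Although the errors above show that the instance was terminated by the LMHB process, it is worth explaining how LMHB works.
LMHB is a process introduced in Oracle 11gR2. Its job is to monitor core processes such as LMD, LMON, LCK and LMS, to prevent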