Previously detected and output hangs are not displayed again. Instead, the 'Victim Information' section will indicate that the victim is from an 'Existing Hang' under the 'Previous Hang' column.

'Verified Hangs' below indicate one or more hangs that were found and identify the final blocking session and instance on which they occurred. Since the current hang resolution state is 'PROCESS', any hangs requiring session or process termination will be automatically resolved. Any hangs with a 'Hang Resolution Action' of 'Unresolvable' will be ignored. These types of hangs will either be resolved by another layer in the RDBMS or cannot be resolved as they may require external intervention.

Deadlocks (also named cycles) are currently NOT resolved even if hang resolution is enabled. The 'Hang Type' of DLCK in the 'Verified Hangs' output identifies these hangs. Below that are the complete hang chains from the time the hang was detected. The following information will assist Oracle Support Services in further analysis of the root cause of the hang.

*** 2014-04-22 17:22:01.537
Verified Hangs in the System
                     Root       Chain Total               Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
     2 HANG VICSELTD    2   833     2     2   HIGH  LOCAL IGNRD:InstKillNotA

Hang Ignored Reason: Since instance termination is not allowed, automatic
    hang resolution cannot be performed.

  inst# SessId  Ser#     OSPID PrcNm Event
  ----- ------ ----- --------- ----- -----
      2    291 59157     10646  M000 DFS lock handle      ---- note the sid, ser#, and PrcNm here
      2    833     1     13788  CKPT control file sequential read
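As an aside, if such a hang is still in progress, the sessions listed in the chain can be cross-checked against the GV$ views. A minimal sketch (the inst#/sid/serial# literals are simply the ones shown in the chain above, not part of the original trace):

-- Sketch: map the DIA0 chain members back to live sessions and OS processes.
SELECT s.inst_id,
       s.sid,
       s.serial#,
       p.spid            AS ospid,
       p.pname           AS prcnm,
       s.event,
       s.seconds_in_wait,
       s.blocking_instance,
       s.blocking_session
  FROM gv$session s
  JOIN gv$process p
    ON p.inst_id = s.inst_id
   AND p.addr    = s.paddr
 WHERE (s.inst_id, s.sid, s.serial#) IN ((2, 291, 59157), (2, 833, 1));

If the wait interface knows the blocker, BLOCKING_INSTANCE/BLOCKING_SESSION will point the M000 session back at CKPT (sid 833); for waits such as 'DFS lock handle' it may stay empty, in which case the DIA0 chain above remains the more reliable source.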
The M000 process mentioned here should be familiar: it is the MMON slave process associated with AWR snapshots, and it is in fact being blocked by CKPT. We can also take a look at the state of this process, as follows:
*** 2014-04-22 17:27:00.778
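Besides reading the trace file, the same chain can usually be seen live on 11.2 and later from V$WAIT_CHAINS, which is populated from the same hang-analysis data the DIA0 process collects. A rough sketch:

-- Sketch: show the current wait chains, e.g. M000 waiting on 'DFS lock handle'
-- behind CKPT (sid 833) on instance 2, if the hang is still active.
SELECT chain_id,
       instance,
       sid,
       sess_serial#,
       blocker_instance,
       blocker_sid,
       wait_event_text,
       in_wait_secs
  FROM v$wait_chains
 ORDER BY chain_id, in_wait_secs DESC;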
From this output, however, the root sess is 833, which is the CKPT process of our Node2. At this point some might say the cause of the problem is already clear: because CKPT on Node2 was abnormal, the LMON process on Node2 also became abnormal, and since it has to communicate with Node1's LMON process, Node1's LMON ran into the IPC send timeout.
In fact it is not that simple: right up to the end we never completely figured out why the CKPT process had been waiting for so long.
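Since the root session (CKPT) was sitting on 'control file sequential read', one quick cross-check for slow storage is the wait-event histogram. A sketch, assuming the GV$ views were still queryable at the time:

-- Sketch: does 'control file sequential read' on instance 2 show a long latency tail?
-- WAIT_TIME_MILLI is the upper bound (in ms) of each histogram bucket.
SELECT inst_id,
       event,
       wait_time_milli,
       wait_count
  FROM gv$event_histogram
 WHERE inst_id = 2
   AND event   = 'control file sequential read'
 ORDER BY wait_time_milli;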
At this point I could not help but suspect an I/O problem, and when I went back and analyzed Node1's diag trace, I found something interesting: