TX packets:58 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:45226 (44.1 KiB) TX bytes:9567 (9.3 KiB)
Interrupt:169 Base address:0x18a4
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:49025 errors:0 dropped:0 overruns:0 frame:0
TX packets:49025 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11292111 (10.7 MiB) TX bytes:11292111 (10.7 MiB)
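Looking at the NIC IP addresses, the IP of the private eth1 interface, which had been withdrawn, is now back. This is because node 2 has just been rebooted. A reboot reinitializes every network interface, so the eth1 interface we disabled was re-enabled and got its IP back.
Check the CRS daemon status: all healthy.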
[root@rac2 cssd]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
Check the status of the cluster, instances, database, listeners, and ASM services: everything is intact and started.
[root@rac2 cssd]# crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....B1.inst application ONLINE ONLINE rac1
ora....B2.inst application ONLINE ONLINE rac2
ora....DB1.srv application ONLINE ONLINE rac1
ora.....TAF.cs application ONLINE ONLINE rac1
ora.RACDB.db application ONLINE ONLINE rac1
ora....SM1.asm application ONLINE ONLINE rac1
ora....C1.lsnr application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora....SM2.asm application ONLINE ONLINE rac2
ora....C2.lsnr application ONLINE ONLINE rac2
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac2
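Reading fifteen rows by eye works here, but the same check can be scripted. The following is a minimal sketch, assuming `crs_stat -t` output shaped like the listing above (columns Name, Type, Target, State, Host). `crs_offline_resources` is a hypothetical helper name, not an Oracle tool.

```shell
# Hypothetical helper (not part of Oracle Clusterware): reads `crs_stat -t`
# output on stdin and prints every resource whose Target or State column
# is not ONLINE. Returns 0 when all resources are ONLINE, 1 otherwise.
crs_offline_resources() {
    awk '
        # Skip the header line and the dashed separator.
        $1 == "Name" || $1 ~ /^-+$/ { next }
        # Data rows: Name Type Target State [Host]
        NF >= 4 && ($3 != "ONLINE" || $4 != "ONLINE") {
            print $1, $3, $4
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Intended usage on a cluster node:
#   crs_stat -t | crs_offline_resources && echo "all resources ONLINE"
```

The helper only inspects the text it is given, so it can also be run against saved output when comparing the cluster state before and after a fault.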
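This concludes the whole process of analyzing and resolving the RAC failure.
III. Simulating an unavailable OCR disk: what RAC does, and the complete fault-location process
OCR disk: the OCR disk registers all RAC resource information, including the cluster, databases, instances, listeners, services, ASM, storage, network, and so on. Only resources registered in the OCR can be managed by CRS, and the CRS daemons manage exactly the resources recorded there. During routine operations the OCR contents can be lost, for example when adding or removing nodes, or when adding or removing OCR disks. Next we simulate the loss of the OCR contents and show how to locate and resolve the failure.
Experiment
1. Check the OCR disk and the CRS daemons
(1) Check the OCR disk. CRS can manage resources properly only when the OCR disk is sound.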
[root@rac2 cssd]# ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 2
Total space (kbytes) : 104344
Used space (kbytes) : 4344
Available space (kbytes) : 100000
ID : 1752469369
Device/File Name : /dev/raw/raw1    (this is the raw device that holds the OCR disk)
Device/File integrity check succeeded
Device/File not configured
Cluster registry integrity check succeeded    (the full integrity check passed with no problems)
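The two things worth reading from this output are how full the OCR is and whether the final integrity line is present. A minimal sketch of pulling both out, assuming output shaped like the listing above; `ocr_summary` is a hypothetical helper name and the `used=` / `integrity=` labels are made up for illustration.

```shell
# Hypothetical helper: reads `ocrcheck` output on stdin, prints the used
# space as a percentage of total space, and reports whether the
# "Cluster registry integrity check succeeded" line was seen.
# Returns 0 only when that line is present.
ocr_summary() {
    awk -F: '
        /Total space/ { total = $2 + 0 }
        /Used space/  { used  = $2 + 0 }
        /Cluster registry integrity check succeeded/ { ok = 1 }
        END {
            if (total > 0) printf "used=%.1f%%\n", used * 100 / total
            print (ok ? "integrity=ok" : "integrity=NOT-CONFIRMED")
            exit (ok ? 0 : 1)
        }
    '
}

# Intended usage on a cluster node (as root):
#   ocrcheck | ocr_summary
```

With the figures shown above (4344 KB used of 104344 KB), this reports about 4.2% used.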
(2) Check the CRS status
[root@rac2 cssd]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
All cluster daemons are healthy.
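Since this same three-line check is repeated throughout the experiment, it can be wrapped in a small function. A minimal sketch, assuming the `crsctl check crs` output format shown above; `crs_all_healthy` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `crsctl check crs` output on stdin and
# succeeds only if CSS, CRS and EVM each report "appears healthy".
crs_all_healthy() {
    input=$(cat)
    for daemon in CSS CRS EVM; do
        printf '%s\n' "$input" | grep -q "^$daemon appears healthy" || {
            echo "$daemon is not healthy" >&2
            return 1
        }
    done
}

# Intended usage on a cluster node:
#   crsctl check crs | crs_all_healthy && echo "CRS stack healthy"
```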
(3) Stop the CRS daemons
[root@rac2 sysconfig]# crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
The shutdown request was issued successfully. Check the CRS status again to confirm the daemons are down:
[root@rac2 sysconfig]# crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
2. As root, export the OCR contents to back up the OCR
[root@rac2 sysconfig]# ocrconfig -export /home/oracle/ocr.exp
[oracle@rac2 ~]$ pwd
/home/oracle
[oracle@rac2 ~]$ ll
total 108
-rw-r--r-- 1 root root 98074 Jul 18 11:20 ocr.exp    (the OCR export file has been created)
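Because the next steps destroy the OCR contents, it is worth confirming that the export really exists and is not empty before going further. A minimal sketch; `ocr_export_ok` is a hypothetical helper name, and it checks only that the file is present and non-empty, not that its contents are a valid export.

```shell
# Hypothetical helper: succeeds only if the given export file exists
# and has a size greater than zero.
ocr_export_ok() {
    [ -s "$1" ] || {
        echo "missing or empty OCR export: $1" >&2
        return 1
    }
}

# Intended usage (as root), with the path used in this experiment:
#   ocrconfig -export /home/oracle/ocr.exp && ocr_export_ok /home/oracle/ocr.exp
```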
3. Restart the CRS daemons
[root@rac2 sysconfig]# crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
Check the CRS status. After the restart everything is healthy again:
[root@rac2 sysconfig]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
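Note that `crsctl start crs` returns at once with "will be started shortly", so a check issued immediately afterwards may still fail. A minimal sketch of polling until the stack is up; `wait_until` is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary example values.

```shell
# Hypothetical helper: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Intended usage on a cluster node: poll every 10 seconds, up to 30 times,
# until all three daemons report "appears healthy".
#   wait_until 30 10 sh -c \
#     'crsctl check crs | grep -c "appears healthy" | grep -qx 3'
```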
4. Use dd on the raw device to overwrite the OCR disk contents with zero bytes, simulating their loss
[root@rac2 sysconfig]# dd if=/dev/ze