某生产库exadata,oracle 11.2.0.4,客户说物理备库与主库不能进行同步,如是要客户把生产库与备库的alert_sid.log文件打包传给我。查看了一个主库的alert.log日志文件发现从15号开始网络就断断续续的,信息如下:
Wed Jun 15 14:25:48 2016 *********************************************************************** Fatal NI connect error 12543, connecting to: (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.137.8.101)(PORT=1521))) (CONNECT_DATA=(SERVICE_NAME=syyjm)(CID=(PROGRAM=oracle)(HOST=yb01dbadm01.dnsserver) (USER=oracle)))) VERSION INFORMATION: TNS for Linux: Version 11.2.0.4.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production Time: 15-JUN-2016 14:25:48 Tracing not turned on. Tns error struct: ns main err code: 12543 TNS-12543: TNS:destination host unreachable ns secondary err code: 12560 nt main err code: 513 TNS-00513: Destination host unreachable nt secondary err code: 113 nt OS err code: 0 *********************************************************************** Fatal NI connect error 12543, connecting to: (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.137.8.101)(PORT=1521))) (CONNECT_DATA=(SERVICE_NAME=syyjm)(CID=(PROGRAM=oracle)(HOST=yb01dbadm01.dnsserver) (USER=oracle)))) VERSION INFORMATION: TNS for Linux: Version 11.2.0.4.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production Time: 15-JUN-2016 14:25:51 Tracing not turned on. Tns error struct: ns main err code: 12543 TNS-12543: TNS:destination host unreachable ns secondary err code: 12560 nt main err code: 513 TNS-00513: Destination host unreachable nt secondary err code: 113 nt OS err code: 0 Wed Jun 15 14:26:02 2016 ...省略...
对应的备库这个时间点的alert.log日志文件信息如下:
Wed Jun 15 14:24:47 2016 *********************************************************************** Fatal NI connect error 12170. Wed Jun 15 14:24:47 2016 *********************************************************************** VERSION INFORMATION: TNS for Linux: Version 11.2.0.4.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production Fatal NI connect error 12170. Time: 15-JUN-2016 14:24:47 VERSION INFORMATION: TNS for Linux: Version 11.2.0.4.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production Tracing not turned on. Time: 15-JUN-2016 14:24:47 Tns error struct: Tracing not turned on. ns main err code: 12535 Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out TNS-12535: TNS:operation timed out ns secondary err code: 12560 ns secondary err code: 12560 nt main err code: 505 nt main err code: 505 TNS-00505: Operation timed out TNS-00505: Operation timed out nt secondary err code: 110 nt secondary err code: 110 nt OS err code: 0 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.137.8.73)(PORT=12282)) Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.137.8.73)(PORT=12091)) RFS[4]: Possible network disconnect with primary database RFS[5]: Possible network disconnect with primary database Wed Jun 15 14:30:55 2016 Killing 1 processes with pids 25908 (idle RFS by thread/sequence) in order to allow the in-transit gap to be queued. Requested by OS process 58349 Wed Jun 15 14:30:57 2016 RFS[10]: Assigned to RFS process 58352 RFS[10]: Selected log 8 for thread 1 sequence 40085 dbid 643576469 branch 867103448 Wed Jun 15 14:30:58 2016 Archived Log entry 3997 added for thread 1 sequence 40085 ID 0x265be592 dest 1: Wed Jun 15 14:30:59 2016 Primary database is in MAXIMUM PERFORMANCE mode RFS[11]: Assigned to RFS process 58363 RFS[11]: Selected log 7 for thread 1 sequence 40087 dbid 643576469 branch 867103448 RFS[10]: Selected log 8 for thread 1 sequence 40086 dbid 643576469 branch 867103448 Wed Jun 15 14:31:01 2016 Killing 1 processes with pids 25910 (idle RFS by thread/sequence) in order to retry receiving a log after reattaching. Requested by OS process 58367 Wed Jun 15 14:31:01 2016 Archived Log entry 3998 added for thread 1 sequence 40086 ID 0x265be592 dest 1: RFS[12]: Assigned to RFS process 58367 RFS[12]: Selected log 10 for thread 2 sequence 35718 dbid 643576469 branch 867103448
从信息RFS[4]: Possible network disconnect with primary database可以看到备库不能与主库进行连接。从信息Killing 1 processes with pids 25908 (idle RFS by thread/sequence) in order to allow the in-transit gap to be queued. Requested by OS process 58349,可知因为不能从主库接受归档日志,且超时而被进程58349给kill掉了,而58349进程是另一个RFS进程用于接受主库所传输过来的归档日志。也就是说,因为网络连接出现了问题,RFS进程在不断的kill,restart重复这样的操作
在2016-06-15 14:34:25这个时间点由于网络恢复日志传输恢复正常,但在2016-05-15 16:59:00这个时间点网络又断了,不能进行日志传输
Wed Jun 15 14:34:25 2016 ARC3: Standby redo logfile selected for thread 1 sequence 40085 for destination LOG_ARCHIVE_DEST_2 Wed Jun 15 14:34:27 2016 Thread 1 advanced to log sequence 40087 (LGWR switch) Current log# 1 seq# 40087 mem# 0: +DATAC1/yyjm/onlinelog/group_1.261.867103449 Wed Jun 15 14:34:27 2016 Archived Log entry 139166 added for thread 1 sequence 40086 ID 0x265be592 dest 1: Wed Jun 15 14:34:27 2016 LNS: Standby redo logfile selected for thread 1 sequence 40087 for destination LOG_ARCHIVE_DEST_2 ARC3: Standby redo logfile selected for thread 1 sequence 40086 for destination LOG_ARCHIVE_DEST_2 Wed Jun 15 15:58:19 2016 Thread 1 advanced to log sequence 40088 (LGWR switch) Current log# 2 seq# 40088 mem# 0: +DATAC1/yyjm/onlinelog/group_2.262.867103449 Wed Jun 15 15:58:19 2016 LNS: Standby redo logfile selected for thread 1 sequence 40088 for destination LOG_ARCHIVE_DEST_2 Wed Jun 15 15:58:20 2016 Archived Log entry 139173 added for thread 1 sequence 40087 ID 0x265be592 dest 1: Wed Jun 15 16:59:00 2016 *********************************************************************** Fatal NI connect error 12170. VERSION INFORMATION: TNS for Linux: Version 11.2.0.4.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production Time: 15-JUN-2016 16:59:00 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12560 nt main err code: 505 TNS-00505: Operation timed out nt secondary err code: 110 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.137.32.4)(PORT=1222)) Wed Jun 15 17:05:01 2016
从备库日志信息来看,备库其实在2016-06-15 16:08:43这个时间点网络就已经断开了
Wed Jun 15 16:08:43 2016 RFS[9]: Possible network disconnect with primary database Wed Jun 15 16:08:43 2016 RFS[6]: Possible network disconnect with primary database Wed Jun 15 17:56:36 2016
这个DG环境的网络时断时连的情况从2016-06-15一直到2016-06-19号,从备库的日志信息可以看到如下信息,不能在备库创建归档日志文件,从信息RFS[23]: No standby redo logfiles available for thread 1 与RFS[24]: No standby redo logfiles available for thread 2,以及ORA-00312: online log 9 thread 1: ‘/data/syyjm/datafile/standy03’可知因为备库的standby log的standy03出现了损坏。
Sun Jun 19 12:15:21 2016 RFS[23]: Assigned to RFS process 18969 RFS[23]: No standby redo logfiles available for thread 1 Creating archive destination file : /arch/syyjm/1_40277_867103448.dbf (12905 blocks) Sun Jun 19 12:15:56 2016 Archiver process freed from errors. No longer stopped Sun Jun 19 12:15:56 2016 Primary database is in MAXIMUM PERFORMANCE mode Re-archiving standby log 11 thread 2 sequence 36153 RFS[24]: Assigned to RFS process 18976 RFS[24]: No standby redo logfiles available for thread 2 ...省略... Sun Jun 19 12:21:03 2016 Unable to create archive log file '/arch/syyjm/1_40254_867103448.dbf' ARC3: Error 19504 Creating archive log file to '/arch/syyjm/1_40254_867103448.dbf' ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance syyjm - Archival Error ORA-16038: log 9 sequence# 40254 cannot be archived ORA-19504: failed to create file "" ORA-00312: online log 9 thread 1: '/data/syyjm/datafile/standy03' Sun Jun 19 12:21:03 2016 ...省略... Sun Jun 19 12:31:03 2016 Suppressing further error logging of LOG_ARCHIVE_DEST_1. Suppressing further error logging of LOG_ARCHIVE_DEST_1. ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance syyjm - Archival Error ORA-16038: log 9 sequence# 40254 cannot be archived ORA-19504: failed to create file "" ORA-00312: online log 9 thread 1: '/data/syyjm/datafile/standy03' Sun Jun 19 12:31:03 2016 ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance syyjm - Archival Error ORA-16014: log 9 sequence# 40254 not archived, no available destinations ORA-00312: online log 9 thread 1: '/data/syyjm/datafile/standy03' ...省略... Wed Jun 22 11:22:11 2016 ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance syyjm - Archival Error ORA-16038: log 9 sequence# 40254 cannot be archived ORA-19504: failed to create file "" ORA-00312: online log 9 thread 1: '/data/syyjm/datafile/standy03' Wed Jun 22 11:22:11 2016 ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance syyjm - Archival Error ORA-16014: log 9 sequence# 40254 not archived, no available destinations ORA-00312: online log 9 thread 1: '/data/syyjm/datafile/standy03'
从主库alert日志文件也可以看到与不能进行对远程目录进行归档操作
Wed Jun 22 11:05:22 2016 Thread 1 advanced to log sequence 41141 (LGWR switch) Current log# 2 seq# 41141 mem# 0: +DATAC1/yyjm/onlinelog/group_2.262.867103449 FAL[server, ARC0]: FAL archive failed, see trace file. ARCH: FAL archive failed. Archiver continuing ORACLE Instance yyjm1 - Archival Error. Archiver continuing. FAL[server, ARC1]: FAL archive failed, see trace file. ARCH: FAL archive failed. Archiver continuing ORACLE Instance yyjm1 - Archival Error. Archiver continuing. Wed Jun 22 11:05:23 2016 Archived Log entry 142098 added for thread 1 sequence 41140 ID 0x265be592 dest 1:
通过主库与备库的alert.log文件信息可以清楚的知道原因是因为DG环境的网络时断时连引起了备库的RFS进程在重复kill,restart操作过程中损坏了standby log文件,解决方法很简单就是重建备库被损不的standby log文件。