Oracle-TNS-12535: TNS:operation timed out-摩杜云开发者社区

1.告警出现

查看相关时段的alert日志，发现了如下的错误：

***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.4.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
Time: 25-OCT-2017 05:00:12
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=x.x.x.x)(PORT=4459))
Wed Oct 25 05:18:13 2017
***********************************************************************

2.MOS的解释

查看MOS后找到如下文章：

Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out (Doc ID 1628949.1)
Fatal NI Connect Error 12170, ‘TNS-12535: TNS:operation timed out’ Reported in 11g Alert Log (Doc ID 1286376.1)

主要内容如下：

CAUSE
The 'nt secondary err code' identifies the underlying network transport, such as (TCP/IP) timeout limit. In the current case 60 identifies Windows underlying transport layer.
The "nt secondary err code" will be different based on the operating system:
Linux x86 or Linux x86-64: "nt secondary err code: 110"
HP-UX : "nt secondary err code: 238"
AIX: "nt secondary err code: 78"
Solaris: "nt secondary err code: 145"

The alert.log message indicates that a connection was terminated AFTER it was established to the instance. In this case, it was terminated 2 hours and 3 minutes after the listener

handed the connection to the database. 

This would indicate an issue with a firewall where a maximum idle time setting is in place.
The connection would not necessarily be "idle". This issue can arise during a long running query or when using JDBC Thin connection pooling.
If there is no data 'on the wirte for lengthy periods of time for any reason, the firewall might terminate the connection.

SOLUTION
The non-Oracle solution would be to remove or increase the firewall setting for maximum idle time. In cases where this is not feasible,
Oracle offers the following suggestion:

The following parameter, set at the **RDBMS_HOME/network/admin/sqlnet.ora, can resolve this kind of problem.
DCD or SQLNET.EXPIRE_TIME can mimic data transmission between the server and the client during long periods of idle time.

SQLNET.EXPIRE_TIME=n Where <n> is a non-zero value set in minutes.

Once this change is in place, there is NO need to restart the listener or the database.
The change will be in place for all newly spawned server processes following the change.
Be aware that connections that were established prior to this setting would not be affected by this change. Therefore, you may continue to experience some timeouts until all remote
connection are established with this setting in place.


See the following : Note 257650.1 Resolving Problems with Connection Idle Timeout With Firewall

从文中可看出，这可能是防火墙策略或者JDBC连接池的原因，服务器自动将长时间没有响应的会话关闭，可以通过设置参数 SQLNET.EXPIRE_TIME来解决。

文中也列出了不同操作系统报出的不同nt secondary err code。

Linux x86 or Linux x86-64: "nt secondary err code: 110"
HP-UX : "nt secondary err code: 238"
AIX: "nt secondary err code: 78"
Solaris: "nt secondary err code: 145"

Linux x86或Linux x86-64: "二级错误码: 110"
HP-UX: "二级错误码: 238"
AIX: "二级错误码: 78"
Solaris: "二级错误码: 145"

问题的基本诊断思路如下：

1.查找问题时间段的alert日志和监听日志，其位置分别是$ORACLE__BASE/diag/rdbms/db_name/$ORACLE_SID/trace/和$ORACLE_BASE/diag/tnslsnr/hostname/listener/trace/

2.如果alert日志显示如上报错，但listener日志显示正常，那么说明确实是数据库以外的问题，按如上文档的solution解决即可。

3.如果监听日志显示了其他报错，那么再根据报错找到相应的解决办法。

3.参数相关的理解

原理:

通过设定参数为非零值(分钟)开启Dead Connection Detection(DCD)，用于检测、标记僵死而没有断开的会话session，再由PMON进行清理，释放资源。

PMON的作用是监控数据库的可用性、数据连接、崩溃恢复等方面.

作用：

1.保持连接上有流量。因为长连接、长时间、无流量而被中间的防火墙等安全设备掐断连接。在连接已经被防火墙等安全设备掐断的情况下，虽然，客户端进程还存在，但连接已经无效，DCD主动将服务端连接退出。由PMON进行资源的再清理释放。

2.探测客户端进程是否已经异常终止。如果发送的探测包没有反馈，则认为客户端异常终止，将主动回滚事务，主动销毁连接。避免客户端异常终止时，数据库端未感知到，锁资源未释放而影响其他会话/连接拿不到资源的问题。

影响：

1.此参数的设置不会主动断开正常的连接，只是起到探活、保持流量和清理异常连接的作用。

2.在大量实践上，并未发现有异常，或其他副作用。

最坏的情况是:

最坏的情况是: Dead Connection Detection(DCD 断开连接检测)不起作用，没有按照预期发送探测包，或者发送的探测包被部分防火墙忽略，长时间不活动的长连接还是被防火墙等安全设备掐断。

4.参数的变更

进入文件文件夹下并编辑sqlnet.ora，如文件存在在末尾添加参数即可，如文件不存在直接编辑添加参数

su - oracle
cd $ORACLE_HOME/network/admin
vi sqlnet.ora
SQLNET.EXPIRE_TIME=10

添加完成需重新reload监听服务

su - grid
crsctl stat res -t
lsnrctl reload listener
lsnrctl reload listener_scan