某公安系统开发人员在plsql中执行查询时报ORA-04030: 在尝试分配 123416 字节 (QERHJ hash-joi,kllcqas:kllsltba) 时进程内存不足。数据库是aix上的rac 11.2.0.4。当时朋友找到我,我当时给他的建议是直接把报错的查询直接在数据库服务器上执行看是否报错,因为在服务器上用sqlplus来执行是没有经过监听程序的,但后面给我的答复是在服务器上执行也报错。所以就远程连接来处理。
查看alert.log文件发现如下错误信息:
Errors in file /u01/app/oracle/diag/rdbms/test/test/incident/incdir_156708/test_ora_15859922_i156708.trc ORA-04030: 在尝试分配 123416 字节 (QERHJ hash-joi,kllcqas:kllsltba) 时进程内存不足
从alert.log文件查看到的非缺省值参数信息如下 :
System parameters with non-default values: processes = 1500 sessions = 2272 _realfree_heap_pagesize_hint= 256K _use_realfree_heap = TRUE spfile = "+DATA_DG/xtrk/spfilextrk.ora" sga_target = 14G control_files = "+DATA_DG/xtrk/controlfile/current.261.851142581" control_files = "+DATA_DG/xtrk/controlfile/current.260.851142581" db_block_size = 8192 _external_scn_rejection_threshold_hours= 1 compatible = "11.2.0.4.0" log_archive_format = "%t_%s_%r.dbf" cluster_database = TRUE db_create_file_dest = "+DATA_DG" db_recovery_file_dest = "+DATA_DG" db_recovery_file_dest_size= 400G thread = 1 undo_tablespace = "UNDOTBS1" instance_number = 1 remote_login_passwordfile= "EXCLUSIVE" db_domain = "" dispatchers = "(PROTOCOL=TCP) (SERVICE=xtrkXDB)" remote_listener = "test-scan:1521" audit_file_dest = "/u01/app/oracle/admin/test/adump" audit_trail = "DB" db_name = "test" open_cursors = 600 pga_aggregate_target = 10G diagnostic_dest = "/u01/app/oracle"
如是查看错误跟踪文件:
Dump continued from file: /u01/app/oracle/diag/rdbms/xtrk/xtrk1/trace/xtrk1_ora_15859922.trc ORA-04030: 在尝试分配 123416 字节 (QERHJ hash-joi,kllcqas:kllsltba) 时进程内存不足 ========= Dump for incident 156708 (ORA 4030) ======== ----- Beginning of Customized Incident Dump(s) ----- ======================================= TOP 10 MEMORY USES FOR THIS PROCESS --------------------------------------- *** 2014-08-27 15:07:09.388 77% 86 MB, 727 chunks: "kllcqas:kllsltba " SQL QERHJ hash-joi ds=11160a378 dsprt=110bcb0f8 7% 8449 KB, 80 chunks: "QERHJ Bit vector " SQL QERHJ hash-joi ds=11160a378 dsprt=110bcb0f8 6% 6771 KB, 4 chunks: "kllcqc:kllcqslt " SQL QERHJ hash-joi ds=11160a378 dsprt=110bcb0f8 5% 5880 KB, 407 chunks: "free memory " SQL QERHJ hash-joi ds=11160a378 dsprt=110bcb0f8 2% 2056 KB, 2 chunks: "kllcqgf:kllsltba " SQL kxs-heap-w ds=110bcb0f8 dsprt=1109f0120 1% 651 KB, 28 chunks: "permanent memory " pga heap ds=110005210 dsprt=0 0% 376 KB, 1 chunk : "kfk_kfkio_freeq " KFK_IO_SUBHEAP ds=1108dae38 dsprt=110005210 0% 336 KB, 1 chunk : "kfkosd_p " KFK_IO_SUBHEAP ds=1108dae38 dsprt=110005210 0% 259 KB, 8 chunks: "free memory " pga heap ds=110005210 dsprt=0 0% 253 KB, 2 chunks: "free memory " top call heap ds=11011f360 dsprt=0
从上面报错进程内存分配信息来看报错的chunks: “kllcqas:kllsltba ” 大小是86M
======================================= PRIVATE MEMORY SUMMARY FOR THIS PROCESS --------------------------------------- ****************************************************** PRIVATE HEAP SUMMARY DUMP 111 MB total: 110 MB commented, 652 KB permanent 606 KB free (0 KB in empty extents), 109 MB, 1 heap: "session heap " 92 KB free held ------------------------------------------------------ Summary of subheaps at depth 1 110 MB total: 110 MB commented, 96 KB permanent 94 KB free (0 KB in empty extents), 108 MB, 1 heap: "kxs-heap-w " 84 KB free held ------------------------------------------------------ Summary of subheaps at depth 2 109 MB total: 108 MB commented, 45 KB permanent 32 KB free (4 KB in empty extents), 106 MB, 4 heaps: "QERHJ hash-joi " 10 KB free held ------------------------------------------------------ Summary of subheaps at depth 3 106 MB total: 100 MB commented, 41 KB permanent 5880 KB free (0 KB in empty extents), 91 MB, 727 chunks: "kllcqas:kllsltba " 5257 KB free held 8786 KB, 80 chunks: "QERHJ Bit vector " 336 KB free held
从111 MB total可知出错进程只分配了111M的内存
========================================= REAL-FREE ALLOCATOR DUMP FOR THIS PROCESS ----------------------------------------- Dump of Real-Free Memory Allocator Heap [0x110959bf0] mag=0xfefe0001 flg=0x5000003 fds=0x0 blksz=262144 blkdstbl=0x110959c00, iniblk=8192 maxblk=524288 numsegs=15 In-use num=402 siz=116391936, Freeable num=0 siz=0, Free num=0 siz=0 ========================================== INSTANCE-WIDE PRIVATE MEMORY USAGE SUMMARY ------------------------------------------ Dumping Work Area Table (level=1) ===================================== Global SGA Info --------------- global target: 10240 MB auto target: 8997 MB max pga: 2047 MB pga limit: 4095 MB pga limit known: 0 pga limit errors: 0 pga inuse: 275 MB pga alloc: 343 MB pga freeable: 11 MB pga freed: 4150 MB pga to free: 0 broker request: 0 pga auto: 32 MB pga manual: 0 MB pga alloc (max): 3043 MB pga auto (max): 32 MB pga manual (max): 0 MB # workareas : 4 # workareas(max): 5 ================================ PER-PROCESS PRIVATE MEMORY USAGE -------------------------------- Private memory usage per Oracle process
当出现ora-04030时该实例分配的总PGA大小如下
------------------------- Top 10 processes: ------------------------- (percentage is of 420 MB total allocated memory) 27% pid 51: 111 MB used of 112 MB allocated < = CURRENT PROC 5% pid 13: 18 MB used of 22 MB allocated 5% pid 14: 18 MB used of 22 MB allocated 5% pid 36: 18 MB used of 22 MB allocated 5% pid 38: 18 MB used of 22 MB allocated 5% pid 39: 18 MB used of 22 MB allocated 5% pid 40: 18 MB used of 22 MB allocated 4% pid 10: 16 MB used of 17 MB allocated 4% pid 12: 13 MB used of 17 MB allocated 4% pid 20: 12 MB used of 15 MB allocated
从上面的percentage is of 420 MB total allocated memory可知这个实例的PGA只分配了420M。从27% pid 51: 111 MB used of 112 MB allocated < = CURRENT PROC可知当前进程分配了112M,使用了111M。而pid=51,那么接下来查看pid=51的详细会话信息:
------------------------------------ Begin session detail for pid 51 sid: 430 ser: 5 audsid: 400022 user: 239/TC_RKXT flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40009) -/-/INC pid: 51 O/S info: user: grid, term: UNKNOWN, ospid: 15859922 image: oracle@xtrkdb1 client details: O/S info: user: Administrator, term: RKXXWEB, ospid: 4120:2956 machine: RKXX\RKXXWEB program: plsqldev.exe application name: PL/SQL Developer, hash value=1190136663 action name: SQL Window - New, hash value=3399691616 current SQL: SELECT b.master_relation 户关系, b.hu_id_new 户号,a.name 姓名,a.pid 身份证号码,b.when_in_logged 入户时间, b.in_category 入户理由, b.dob, a.photo_flag 有无相片, substr(ORGNAME,7,10) 所属派出所, b.quxcun_id,c.name 居委会名称 from t_person a, t_huji b, tc_jcyw.t_org o ,tc_jcyw.t_quxcun c where a.person_id = b.person_id and b.org_id = o.sunit_code and b.quxcun_id=c.quxcun_id and b.zxbz = '0' and a.dob < =to_date( '1996.07.01','yyyy.mm.dd' ) and not Exists(select * from t_pid_accept m where m.iscancel is null and m.pid = a.pid ) End session detail for pid 51 ------------------------------------ ================ SWAP INFORMATION ---------------- swap info: free_mem = 22025.57M rsv = 128.00M alloc = 44.48M avail = 32768.00M swap_free = 32723.52M ----- End of Customized Incident Dump(s) ----- *** 2014-08-27 15:07:09.417 dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0) ----- Current SQL Statement for this session (sql_id=1gbsquwh58w4k) ----- SELECT b.master_relation 户关系, b.hu_id_new 户号,a.name 姓名,a.pid 身份证号码,b.when_in_logged 入户时间, b.in_category 入户理由, b.dob, a.photo_flag 有无相片, substr(ORGNAME,7,10) 所属派出所, b.quxcun_id,c.name 居委会名称 from t_person a, t_huji b, tc_jcyw.t_org o ,tc_jcyw.t_quxcun c where a.person_id = b.person_id and b.org_id = o.sunit_code and b.quxcun_id=c.quxcun_id and b.zxbz = '0' and a.dob <=to_date( '1996.07.01','yyyy.mm.dd' ) and not Exists(select * from t_pid_accept m where m.iscancel is null and m.pid = a.pid )
从上面的信息可知swap大小为32768M,SGA为14G,PGA为10G,当这个实例的总PGA为分配了420M,当这个会话进程只分配111M内存时不应该会报告ora-04030错误.这时我还是亲自在服务器上执行上面的报错查询发现不报错,而开始我要朋友做测试,他是要别人做的,但给我的答案是报错,看来还是得亲自测试。这种情况在MOS有一篇文章(ID 758131.1)就是描述这种情况的并且说在aix下是一个特定的问题。造成这个问题的原因可能是因为操作系统资源限制,说当CRS启动它的资源时是使用root用户可能会出现ora-04030错误,如果是资源通过sqlplus,lsnrctl来启动则没有操作系统资源限制。如果我查看/etc/security/limits文件内容如下:
# cat limits * * Sizes are in multiples of 512 byte blocks, CPU time is in seconds * * fsize - soft file size in blocks * core - soft core file size in blocks * cpu - soft per process CPU time limit in seconds * data - soft data segment size in blocks * stack - soft stack segment size in blocks * rss - soft real memory usage in blocks * nofiles - soft file descriptor limit * fsize_hard - hard file size in blocks * core_hard - hard core file size in blocks * cpu_hard - hard per process CPU time limit in seconds * data_hard - hard data segment size in blocks * stack_hard - hard stack segment size in blocks * rss_hard - hard real memory usage in blocks * nofiles_hard - hard file descriptor limit * * The following table contains the default hard values if the * hard values are not explicitly defined: * * Attribute Value * ========== ============ * fsize_hard set to fsize * cpu_hard set to cpu * core_hard -1 * data_hard -1 * stack_hard 8388608 * rss_hard -1 * nofiles_hard -1 * * NOTE: A value of -1 implies "unlimited" * default: fsize = 2097151 core = 2097151 cpu = -1 data = 262144 # rss = 65536 rss = -1 stack = 65536 nofiles = 2000 root: fsize = -1 data = -1 stack = -1 core = -1 rss = -1 nofiles = -1 stack = -1 stack_hard = -1 data_hard = -1 cpu = -1 core_hard = -1 cpu_hard = -1 fsize_hard = -1 oracle: fsize = -1 data = -1 stack = -1 core = -1 rss = -1 nofiles = -1 stack = -1 stack_hard = -1 data_hard = -1 cpu = -1 core_hard = -1 cpu_hard = -1 fsize_hard = -1
发现我朋友已经修改了root用户和oracle用户的权限但为什么还是报ora-04030错误了。原因为在oracle 11g rac安装不像以前的版本,现在安装Grid Infrastructure 和Oracle数据库软件推荐使用不同的用户来安装。通过查看这里用grid用户来安装Grid Infrastructure ,oracle用户来安装的oracle数据库软。下面的limits文件朋友只修改了root,oracle两个用户,但并没有修改grid用户对操作系统资源的使用限制。而在rac中,监听程序是集群软件来启动的也就是由grid用户来管理的。由于修改grid用户对操作系统资源的使用限制由于在limits文件中增加以下内容:
grid: fsize = -1 data = -1 stack = -1 core = -1 rss = -1 nofiles = -1 stack = -1 stack_hard = -1 data_hard = -1 cpu = -1 core_hard = -1 cpu_hard = -1 fsize_hard = -1
由于重启rac,在重启之后再执行上面报错语句一切正常了,造成这个问题的根本原因是没有修改oracle,grid用户对操作系统资源的使用限制所造成的,而这个问题应该是可以避免的,因为对这种操作系统资源的使用限制其实在安装rac之前是要做修改的在安装文档中也是有的,所以安装也是需要仔细阅读官方文档的。