This kicks off a new series: "Service Monitoring and Optimization."
A while ago, a senior colleague on my team gave a talk titled "Service Observability Governance in Practice," which I found very rewarding. Service monitoring and performance optimization play a key role in day-to-day development work, but unlike a particular framework or technique, they are not something you can easily pick up from demos and tutorials found online. Instead, they depend on the technical experience and knowledge we accumulate on the job. Like the earlier "Being an Experienced Developer" series, this series aims to distill lessons learned from colleagues and industry experts, along with some of our own thoughts and insights.
The Problem
One of our production services has long suffered from abnormally high memory usage, and we needed to get to the bottom of it:
Server spec: 4 cores / 16 GB RAM
JVM options:
java -server -XX:NativeMemoryTracking=detail -Xms12g -Xmx12g -XX:ConcGCThreads=3 -XX:ParallelGCThreads=6 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1ReservePercent=15 -XX:MaxMetaspaceSize=512m -XX:MetaspaceSize=512m -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/work/logs/applogs/ -Xloggc:/home/work/logs/applogs/gc_%t.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintCommandLineFlags -XX:+PrintHeapAtGC -XX:+UseFastAccessorMethods
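For readers less familiar with these flags, here is a quick annotated breakdown. The comments are my own reading of each option, not from the original post; consult the HotSpot documentation for authoritative definitions.

```shell
# -Xms12g -Xmx12g                          fixed 12 GB heap (equal min/max avoids resize churn)
# -XX:+UseG1GC                             use the G1 collector
# -XX:ConcGCThreads=3                      threads for concurrent marking
# -XX:ParallelGCThreads=6                  threads for stop-the-world parallel GC phases
# -XX:InitiatingHeapOccupancyPercent=35    start a concurrent cycle at 35% heap occupancy
# -XX:G1ReservePercent=15                  keep 15% of the heap as a to-space reserve
# -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m   metaspace sized and capped at 512 MB
# -XX:NativeMemoryTracking=detail          detailed native memory tracking (used by jcmd below)
# -XX:+HeapDumpOnOutOfMemoryError          write a heap dump to HeapDumpPath on OOM
# -Xloggc:... -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC   verbose GC logging
# -XX:+UseFastAccessorMethods              legacy performance flag
```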
现象
[root@j5q9t]# ps -p 1 -o rss,vsz
RSS VSZ
14148388 37710676
RSS is the resident (physical) memory and VSZ is the virtual memory, which means the Java process is currently using about 13.5 GB of physical memory.
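Since ps reports these columns in kilobytes, a quick conversion confirms the 13.5 GB figure:

```python
# ps -o rss,vsz prints sizes in kilobytes.
rss_kb = 14_148_388
vsz_kb = 37_710_676

rss_gb = rss_kb / 1024 / 1024  # KB -> MB -> GB
vsz_gb = vsz_kb / 1024 / 1024

print(f"RSS = {rss_gb:.1f} GB, VSZ = {vsz_gb:.1f} GB")  # → RSS = 13.5 GB, VSZ = 36.0 GB
```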
Now let's look at the heap:
[arthas@1]$ dashboard
ID NAME GROUP PRIORITY STATE %CPU DELTA_TIME TIME INTERRUPTE DAEMON
628 DubboServerHandler-10.126.131.22 main 5 WAITING 0.0 0.000 10:2.508 false true
625 DubboServerHandler-10.126.131.22 main 5 WAITING 0.0 0.000 8:58.821 false true
626 DubboServerHandler-10.126.131.22 main 5 WAITING 0.0 0.000 8:44.927 false true
623 DubboServerHandler-10.126.131.22 main 5 WAITING 0.0 0.000 8:33.244 false true
-1 VM Thread - -1 - 0.0 0.000 8:20.135 false true
629 DubboServerHandler-10.126.131.22 main 5 RUNNABLE 0.0 0.000 8:13.524 false true
-1 C2 CompilerThread1 - -1 - 0.0 0.000 8:12.141 false true
639 DubboServerHandler-10.126.131.22 main 5 WAITING 0.0 0.000 8:0.280 false true
-1 C2 CompilerThread0 - -1 - 0.0 0.000 7:52.803 false true
25 BatchSpanProcessor_WorkerThread- main 5 TIMED_WAIT 0.0 0.000 7:36.311 false true
631 DubboServerHandler-10.126.131.22 main 5 WAITING 0.0 0.000 7:26.614 false true
Memory used total max usage GC
heap 1784M 12288M 12288M 14.52% gc.g1_young_generation.count 619
g1_eden_space 868M 7716M -1 11.25% gc.g1_young_generation.time(ms) 28494
g1_survivor_space 28M 28M -1 100.00% gc.g1_old_generation.count 0
g1_old_gen 888M 4544M 12288M 7.23% gc.g1_old_generation.time(ms) 0
nonheap 466M 535M 1256M 37.15%
code_cache 148M 162M 240M 62.05%
metaspace 286M 333M 512M 55.88%
Runtime
os.name Linux
os.version 4.19.91-2048.3.miks7.x86_64
java.version 1.8.0_202
java.home /home/work/app/openjdk/jre
systemload.average 12.81
ID NAME GROUP PRIORITY STATE %CPU DELTA_TIME TIME INTERRUPTE DAEMON
-1 C2 CompilerThread1 - -1 - 10.13 0.503 8:12.645 false true
605 DubboServerHandler-10.126.131.22 main 5 RUNNABLE 4.92 0.244 3:40.428 false true
640 DubboServerHandler-10.126.131.22 main 5 TIMED_WAIT 4.52 0.224 4:8.773 false true
633 DubboServerHandler-10.126.131.22 main 5 WAITING 4.22 0.209 7:7.690 false true
641 DubboServerHandler-10.126.131.22 main 5 TIMED_WAIT 3.76 0.186 6:34.094 false true
616 DubboServerHandler-10.126.131.22 main 5 WAITING 3.52 0.174 5:41.089 false true
610 DubboServerHandler-10.126.131.22 main 5 WAITING 2.88 0.142 4:5.364 false true
-1 C2 CompilerThread0 - -1 - 2.23 0.110 7:52.914 false true
-1 C1 CompilerThread2 - -1 - 1.8 0.089 3:40.105 false true
32762 Timer-for-arthas-dashboard-0c5f3 system 5 RUNNABLE 1.16 0.057 0:0.067 false true
25 BatchSpanProcessor_WorkerThread- main 5 TIMED_WAIT 1.0 0.049 7:36.361 false true
Memory used total max usage GC
heap 2188M 12288M 12288M 17.81% gc.g1_young_generation.count 619
g1_eden_space 1272M 7716M -1 16.49% gc.g1_young_generation.time(ms) 28494
g1_survivor_space 28M 28M -1 100.00% gc.g1_old_generation.count 0
g1_old_gen 888M 4544M 12288M 7.23% gc.g1_old_generation.time(ms) 0
nonheap 466M 535M 1256M 37.14%
code_cache 148M 162M 240M 61.97%
metaspace 286M 333M 512M 55.90%
Runtime
os.name Linux
os.version 4.19.91-2048.3.miks7.x86_64
java.version 1.8.0_202
java.home /home/work/app/openjdk/jre
systemload.average 11.95
ID NAME GROUP PRIORITY STATE %CPU DELTA_TIME TIME INTERRUPTE DAEMON
610 DubboServerHandler-10.126.131.22 main 5 WAITING 6.49 0.324 4:5.688 false true
633 DubboServerHandler-10.126.131.22 main 5 WAITING 5.92 0.296 7:7.986 false true
616 DubboServerHandler-10.126.131.22 main 5 TIMED_WAIT 4.18 0.208 5:41.298 false true
-1 C2 CompilerThread1 - -1 - 3.73 0.186 8:12.832 false true
641 DubboServerHandler-10.126.131.22 main 5 WAITING 3.27 0.163 6:34.257 false true
605 DubboServerHandler-10.126.131.22 main 5 WAITING 2.97 0.148 3:40.576 false true
629 DubboServerHandler-10.126.131.22 main 5 WAITING 1.85 0.092 8:13.648 false true
-1 C2 CompilerThread0 - -1 - 1.46 0.072 7:52.986 false true
640 DubboServerHandler-10.126.131.22 main 5 WAITING 1.44 0.072 4:8.846 false true
25 BatchSpanProcessor_WorkerThread- main 5 TIMED_WAIT 1.13 0.056 7:36.417 false true
-1 C1 CompilerThread2 - -1 - 0.78 0.038 3:40.143 false true
Memory used total max usage GC
heap 2640M 12288M 12288M 21.48% gc.g1_young_generation.count 619
g1_eden_space 1724M 7716M -1 22.34% gc.g1_young_generation.time(ms) 28494
g1_survivor_space 28M 28M -1 100.00% gc.g1_old_generation.count 0
g1_old_gen 888M 4544M 12288M 7.23% gc.g1_old_generation.time(ms) 0
nonheap 466M 535M 1256M 37.14%
code_cache 148M 162M 240M 61.95%
metaspace 286M 333M 512M 55.90%
Runtime
os.name Linux
os.version 4.19.91-2048.3.miks7.x86_64
java.version 1.8.0_202
java.home /home/work/app/openjdk/jre
systemload.average 12.11
ID NAME GROUP PRIORITY STATE %CPU DELTA_TIME TIME INTERRUPTE DAEMON
-1 C2 CompilerThread1 - -1 - 11.74 0.586 8:13.418 false true
640 DubboServerHandler-10.126.131.22 main 5 WAITING 4.16 0.208 4:9.054 false true
619 DubboServerHandler-10.126.131.22 main 5 WAITING 3.76 0.188 5:54.790 false true
618 DubboServerHandler-10.126.131.22 main 5 WAITING 3.67 0.183 5:44.978 false true
633 DubboServerHandler-10.126.131.22 main 5 RUNNABLE 3.17 0.158 7:8.145 false true
641 DubboServerHandler-10.126.131.22 main 5 RUNNABLE 2.56 0.127 6:34.385 false true
616 DubboServerHandler-10.126.131.22 main 5 WAITING 2.54 0.126 5:41.425 false true
610 DubboServerHandler-10.126.131.22 main 5 TIMED_WAIT 2.2 0.110 4:5.798 false true
629 DubboServerHandler-10.126.131.22 main 5 WAITING 1.88 0.093 8:13.742 false true
-1 C2 CompilerThread0 - -1 - 1.87 0.093 7:53.080 false true
-1 VM Thread - -1 - 1.75 0.087 8:20.288 false true
Memory used total max usage GC
heap 3012M 12288M 12288M 24.51% gc.g1_young_generation.count 619
g1_eden_space 2096M 7716M -1 27.16% gc.g1_young_generation.time(ms) 28494
g1_survivor_space 28M 28M -1 100.00% gc.g1_old_generation.count 0
g1_old_gen 888M 4544M 12288M 7.23% gc.g1_old_generation.time(ms) 0
nonheap 466M 535M 1256M 37.15%
code_cache 148M 162M 240M 62.03%
metaspace 286M 333M 512M 55.91%
Runtime
os.name Linux
os.version 4.19.91-2048.3.miks7.x86_64
java.version 1.8.0_202
java.home /home/work/app/openjdk/jre
systemload.average 12.42
ID NAME GROUP PRIORITY STATE %CPU DELTA_TIME TIME INTERRUPTE DAEMON
641 DubboServerHandler-10.126.131.22 main 5 TIMED_WAIT 5.38 0.269 6:34.654 false true
605 DubboServerHandler-10.126.131.22 main 5 RUNNABLE 3.84 0.191 3:40.781 false true
610 DubboServerHandler-10.126.131.22 main 5 WAITING 3.54 0.176 4:5.975 false true
633 DubboServerHandler-10.126.131.22 main 5 WAITING 3.4 0.170 7:8.315 false true
619 DubboServerHandler-10.126.131.22 main 5 TIMED_WAIT 2.38 0.118 5:54.908 false true
-1 C2 CompilerThread1 - -1 - 2.35 0.117 8:13.536 false true
-1 C2 CompilerThread0 - -1 - 1.94 0.096 7:53.177 false true
626 DubboServerHandler-10.126.131.22 main 5 RUNNABLE 1.77 0.088 8:45.062 false true
618 DubboServerHandler-10.126.131.22 main 5 RUNNABLE 1.48 0.073 5:45.052 false true
-1 C1 CompilerThread2 - -1 - 1.1 0.054 3:40.242 false true
25 BatchSpanProcessor_WorkerThread- main 5 TIMED_WAIT 0.98 0.048 7:36.508 false true
Memory used total max usage GC
heap 3380M 12288M 12288M 27.51% gc.g1_young_generation.count 619
g1_eden_space 2464M 7716M -1 31.93% gc.g1_young_generation.time(ms) 28494
g1_survivor_space 28M 28M -1 100.00% gc.g1_old_generation.count 0
g1_old_gen 888M 4544M 12288M 7.23% gc.g1_old_generation.time(ms) 0
nonheap 466M 535M 1256M 37.15%
code_cache 148M 162M 240M 62.02%
metaspace 286M 333M 512M 55.91%
Runtime
os.name Linux
os.version 4.19.91-2048.3.miks7.x86_64
java.version 1.8.0_202
java.home /home/work/app/openjdk/jre
systemload.average 11.99
ID NAME GROUP PRIORITY STATE %CPU DELTA_TIME TIME INTERRUPTE DAEMON
619 DubboServerHandler-10.126.131.22 main 5 TIMED_WAIT 5.37 0.268 5:55.177 false true
626 DubboServerHandler-10.126.131.22 main 5 WAITING 5.0 0.249 8:45.312 false true
616 DubboServerHandler-10.126.131.22 main 5 WAITING 4.24 0.211 5:41.674 false true
618 DubboServerHandler-10.126.131.22 main 5 TIMED_WAIT 4.07 0.203 5:45.255 false true
-1 C2 CompilerThread0 - -1 - 4.04 0.202 7:53.379 false true
605 DubboServerHandler-10.126.131.22 main 5 WAITING 4.01 0.200 3:40.981 false true
633 DubboServerHandler-10.126.131.22 main 5 TIMED_WAIT 3.48 0.173 7:8.489 false true
-1 C2 CompilerThread1 - -1 - 3.21 0.160 8:13.697 false true
641 DubboServerHandler-10.126.131.22 main 5 RUNNABLE 2.5 0.124 6:34.779 false true
164 prometheus-http-1-5 main 5 WAITING 2.25 0.112 1:7.978 false true
610 DubboServerHandler-10.126.131.22 main 5 RUNNABLE 1.89 0.094 4:6.070 false true
Memory used total max usage GC
heap 3920M 12288M 12288M 31.90% gc.g1_young_generation.count 619
g1_eden_space 3004M 7716M -1 38.93% gc.g1_young_generation.time(ms) 28494
g1_survivor_space 28M 28M -1 100.00% gc.g1_old_generation.count 0
g1_old_gen 888M 4544M 12288M 7.23% gc.g1_old_generation.time(ms) 0
nonheap 466M 535M 1256M 37.15%
code_cache 148M 162M 240M 62.03%
metaspace 286M 333M 512M 55.91%
Runtime
os.name Linux
os.version 4.19.91-2048.3.miks7.x86_64
java.version 1.8.0_202
java.home /home/work/app/openjdk/jre
systemload.average 12.07
As shown above, the heap has 12 GB allocated and non-heap takes roughly 0.5 GB, so who is using the remaining ~1 GB? Let's check with jcmd (note that this requires the -XX:NativeMemoryTracking=detail flag, which carries some performance overhead since the JVM has to do extra work to track and record native memory usage; in our experience, though, the impact is negligible unless you have extremely strict performance requirements):
[root@j5q9t]# jcmd 1 VM.native_memory scale=MB
1:
Native Memory Tracking:
Total: reserved=15809MB, committed=14721MB
- Java Heap (reserved=12288MB, committed=12288MB)
(mmap: reserved=12288MB, committed=12288MB)
- Class (reserved=1330MB, committed=345MB)
(classes #50974)
(malloc=10MB #147948)
(mmap: reserved=1320MB, committed=334MB)
- Thread (reserved=1080MB, committed=1080MB)
(thread #1072)
(stack: reserved=1075MB, committed=1075MB)
(malloc=3MB #5366)
(arena=1MB #2140)
- Code (reserved=275MB, committed=197MB)
(malloc=31MB #35600)
(mmap: reserved=244MB, committed=165MB)
- GC (reserved=532MB, committed=532MB)
(malloc=44MB #103632)
(mmap: reserved=488MB, committed=488MB)
- Compiler (reserved=2MB, committed=2MB)
(malloc=2MB #5196)
- Internal (reserved=211MB, committed=211MB)
(malloc=210MB #111485)
- Symbol (reserved=50MB, committed=50MB)
(malloc=46MB #496701)
(arena=4MB #1)
- Native Memory Tracking (reserved=15MB, committed=15MB)
(malloc=1MB #11032)
(tracking overhead=14MB)
- Arena Chunk (reserved=2MB, committed=2MB)
(malloc=2MB)
- Unknown (reserved=24MB, committed=0MB)
(mmap: reserved=24MB, committed=0MB)
Threads alone account for about 1 GB: 1,072 threads in total, roughly 1 MB each.
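The ~1 MB-per-thread figure checks out arithmetically against the NMT output:

```python
# From NMT: "thread #1072" with "stack: ... committed=1075MB".
threads = 1072
stack_mb = 1075

per_thread_mb = stack_mb / threads
print(f"{per_thread_mb:.2f} MB per thread")  # → 1.00 MB per thread (the default -Xss on 64-bit Linux)
```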
Use jstack to inspect the threads:
Analysis & Plan
From the above we can conclude:
- The memory numbers add up: the process uses 13.5 GB in total, made up of 12 GB of heap, 0.5 GB of non-heap, and 1 GB of thread stacks. There is no obvious sign of a heap memory leak, nor of an off-heap (native) memory leak (we had at one point suspected an off-heap leak, since we had configured -XX:+DisableExplicitGC).
- As for threads, the count seems a bit high and leaves some room for optimization, and 1 MB per thread (which is the default stack size) is indeed large.
The plan was therefore:
- Reduce the heap size
- Reduce the thread stack size
- Remove -XX:+DisableExplicitGC
- Increase ConcGCThreads and ParallelGCThreads
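The post does not show the adjusted command line. Purely as an illustration of the four changes above, it may have looked roughly like this; the specific -Xss, ConcGCThreads, and ParallelGCThreads values here are my guesses, loosely inferred from the "after" NMT output (an 8 GB heap, ~0.5 MB of stack per thread), not the actual settings:

```shell
# Illustrative sketch only -- the exact values below are assumptions, not the real config.
java -server \
  -Xms8g -Xmx8g \
  -Xss512k \
  -XX:+UseG1GC \
  -XX:ConcGCThreads=4 \
  -XX:ParallelGCThreads=8 \
  -XX:NativeMemoryTracking=detail \
  -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/work/logs/applogs/
# Heap cut from 12g to 8g; stack size halved to 512k; -XX:+DisableExplicitGC removed;
# GC thread counts raised from the original ConcGCThreads=3 / ParallelGCThreads=6.
```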
Results
After adjusting the parameters on September 14, the change clearly paid off: memory utilization dropped from around 80% to around 60%:
So far the results look good.
Let's take another look at the memory:
[vgjvm]# ps -p 1 -o rss,vsz
RSS VSZ
10784848 78837012
[vgjvm]# jcmd 1 VM.native_memory scale=MB
1:
Native Memory Tracking:
Total: reserved=10983MB, committed=9899MB
- Java Heap (reserved=8192MB, committed=8192MB)
(mmap: reserved=8192MB, committed=8192MB)
- Class (reserved=1301MB, committed=311MB)
(classes #45511)
(malloc=9MB #125984)
(mmap: reserved=1292MB, committed=302MB)
- Thread (reserved=511MB, committed=511MB)
(thread #979)
(stack: reserved=506MB, committed=506MB)
(malloc=3MB #4904)
(arena=1MB #1954)
- Code (reserved=273MB, committed=196MB)
(malloc=30MB #33325)
(mmap: reserved=244MB, committed=167MB)
- GC (reserved=397MB, committed=397MB)
(malloc=61MB #122529)
(mmap: reserved=336MB, committed=336MB)
- Compiler (reserved=3MB, committed=3MB)
(malloc=3MB #4978)
- Internal (reserved=227MB, committed=227MB)
(malloc=227MB #129419)
- Symbol (reserved=47MB, committed=47MB)
(malloc=43MB #471868)
(arena=4MB #1)
- Native Memory Tracking (reserved=15MB, committed=15MB)
(malloc=1MB #10583)
(tracking overhead=14MB)
- Unknown (reserved=16MB, committed=0MB)
(mmap: reserved=16MB, committed=0MB)
[arthas@1]$ dashboard
ID NAME GROUP PRIORITY STATE %CPU DELTA_TIME TIME INTERRUPTE DAEMON
-1 VM Thread - -1 - 0.0 0.000 9:44.358 false true
25 BatchSpanProcessor_WorkerThread- main 5 TIMED_WAIT 0.0 0.000 5:56.554 false true
-1 C2 CompilerThread1 - -1 - 0.0 0.000 5:9.458 false true
-1 C2 CompilerThread0 - -1 - 0.0 0.000 5:8.866 false true
11 com.alibaba.nacos.client.Worker. main 5 TIMED_WAIT 0.0 0.000 4:9.608 false true
281 MQ-AsyncTraceDispatcher-Thread-f main 5 TIMED_WAIT 0.0 0.000 3:44.278 false true
357 MQ-AsyncTraceDispatcher-Thread-e main 5 TIMED_WAIT 0.0 0.000 3:43.311 false true
473 MQ-AsyncTraceDispatcher-Thread-b main 5 TIMED_WAIT 0.0 0.000 3:39.944 false true
396 MQ-AsyncTraceDispatcher-Thread-a main 5 TIMED_WAIT 0.0 0.000 3:38.826 false true
454 MQ-AsyncTraceDispatcher-Thread-1 main 5 TIMED_WAIT 0.0 0.000 3:37.686 false true
319 MQ-AsyncTraceDispatcher-Thread-1 main 5 TIMED_WAIT 0.0 0.000 3:37.064 false true
Memory used total max usage GC
heap 3706M 8192M 8192M 45.25% gc.g1_young_generation.count 1542
g1_eden_space 1472M 4928M -1 29.87% gc.g1_young_generation.time(ms) 63122
g1_survivor_space 60M 60M -1 100.00% gc.g1_old_generation.count 0
g1_old_gen 2174M 3204M 8192M 26.55% gc.g1_old_generation.time(ms) 0
nonheap 444M 508M 1256M 35.39%
code_cache 141M 164M 240M 59.15%
metaspace 273M 309M 512M 53.39%
Runtime
os.name Linux
os.version 4.19.91-2048.3.miks7.x86_64
java.version 1.8.0_202
java.home /home/work/app/openjdk/jre
systemload.average 54.35
One issue remains: the memory the JVM reports as committed is about 633 MB less than the physical memory actually in use. Further analysis is needed to track down the source of this several-hundred-megabyte gap.
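The 633 MB figure comes from comparing the new RSS against what NMT reports as committed:

```python
# RSS from ps (in KB) vs. total memory committed per NMT (in MB).
rss_mb = 10_784_848 / 1024      # ~10532 MB resident
nmt_committed_mb = 9899         # "Total: ... committed=9899MB"

gap_mb = rss_mb - nmt_committed_mb
print(f"unaccounted: ~{gap_mb:.0f} MB")  # → unaccounted: ~633 MB
```

Typical suspects for such a gap include glibc malloc arena overhead, memory-mapped files, and allocations made by native libraries outside the JVM's tracking; a tool like pmap can help attribute it.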
Summary
This was a coarse-grained pass at optimizing the system's memory usage; looking closely, there is still plenty of room for improvement, and I will share any further progress for discussion.
In fact, there are many write-ups online analyzing steadily climbing memory usage. Although the root causes vary, they offer valuable lines of thinking, and when we hit a wall on our own problems, such shared experience can spark new ideas. That is the real value of sharing experience.
Feel free to follow my WeChat official account: