服务监控与优化之一次对服务内存占用过高的排查
  6DOeteBHjp6w 2023年11月02日 123 0


后续会开启一个新的系列:“服务监控与优化”。

前段之间组内大佬分享了《服务可观测性治理实践》,受益匪浅。服务监控和性能优化在日常研发工作中扮演着关键角色,但与某些框架或技术点不同,这并不是可以轻松从网上找到 Demo 和资料来学习的内容。相反,它更依赖于我们在工作中积累的技术经验和知识。与之前的“做一名有经验的开发人员”系列一样,本系列旨在总结从同事和行业专家身上学到的经验,并分享一些我们自己的思考和见解。

问题

我们线上的服务一直存在内存占用异常高的问题,需要解决这个问题:

服务监控与优化之一次对服务内存占用过高的排查_sed

服务器配置:4C16G

JVM 参数:

java -server -XX:NativeMemoryTracking=detail -Xms12g -Xmx12g -XX:ConcGCThreads=3 -XX:ParallelGCThreads=6 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1ReservePercent=15 -XX:MaxMetaspaceSize=512m -XX:MetaspaceSize=512m -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/work/logs/applogs/  -Xloggc:/home/work/logs/applogs/gc_%t.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintCommandLineFlags -XX:+PrintHeapAtGC -XX:+UseFastAccessorMethods

现象

[root@j5q9t]# ps -p 1 -o rss,vsz  
  RSS    VSZ
14148388 37710676

RSS 是实际占用内存,VSZ 是虚拟内存,也就是说当前 Java 进程占用了 13.5G 的内存。

再看看堆的情况:

[arthas@1]$ dashboard
ID    NAME                             GROUP           PRIORITY    STATE      %CPU       DELTA_TIME TIME       INTERRUPTE DAEMON     
628   DubboServerHandler-10.126.131.22 main            5           WAITING    0.0        0.000      10:2.508   false      true       
625   DubboServerHandler-10.126.131.22 main            5           WAITING    0.0        0.000      8:58.821   false      true       
626   DubboServerHandler-10.126.131.22 main            5           WAITING    0.0        0.000      8:44.927   false      true       
623   DubboServerHandler-10.126.131.22 main            5           WAITING    0.0        0.000      8:33.244   false      true       
-1    VM Thread                        -               -1          -          0.0        0.000      8:20.135   false      true       
629   DubboServerHandler-10.126.131.22 main            5           RUNNABLE   0.0        0.000      8:13.524   false      true       
-1    C2 CompilerThread1               -               -1          -          0.0        0.000      8:12.141   false      true       
639   DubboServerHandler-10.126.131.22 main            5           WAITING    0.0        0.000      8:0.280    false      true       
-1    C2 CompilerThread0               -               -1          -          0.0        0.000      7:52.803   false      true       
25    BatchSpanProcessor_WorkerThread- main            5           TIMED_WAIT 0.0        0.000      7:36.311   false      true       
631   DubboServerHandler-10.126.131.22 main            5           WAITING    0.0        0.000      7:26.614   false      true       
Memory                       used     total     max      usage     GC                                                                
heap                         1784M    12288M    12288M   14.52%    gc.g1_young_generation.count     619                              
g1_eden_space                868M     7716M     -1       11.25%    gc.g1_young_generation.time(ms)  28494                            
g1_survivor_space            28M      28M       -1       100.00%   gc.g1_old_generation.count       0                                
g1_old_gen                   888M     4544M     12288M   7.23%     gc.g1_old_generation.time(ms)    0                                
nonheap                      466M     535M      1256M    37.15%                                                                      
code_cache                   148M     162M      240M     62.05%                                                                      
metaspace                    286M     333M      512M     55.88%                                                                      
Runtime                                                                                                                              
os.name                                                            Linux                                                             
os.version                                                         4.19.91-2048.3.miks7.x86_64                                       
java.version                                                       1.8.0_202                                                         
java.home                                                          /home/work/app/openjdk/jre                                        
systemload.average                                                 12.81                                                             
ID    NAME                             GROUP           PRIORITY    STATE      %CPU       DELTA_TIME TIME       INTERRUPTE DAEMON     
-1    C2 CompilerThread1               -               -1          -          10.13      0.503      8:12.645   false      true       
605   DubboServerHandler-10.126.131.22 main            5           RUNNABLE   4.92       0.244      3:40.428   false      true       
640   DubboServerHandler-10.126.131.22 main            5           TIMED_WAIT 4.52       0.224      4:8.773    false      true       
633   DubboServerHandler-10.126.131.22 main            5           WAITING    4.22       0.209      7:7.690    false      true       
641   DubboServerHandler-10.126.131.22 main            5           TIMED_WAIT 3.76       0.186      6:34.094   false      true       
616   DubboServerHandler-10.126.131.22 main            5           WAITING    3.52       0.174      5:41.089   false      true       
610   DubboServerHandler-10.126.131.22 main            5           WAITING    2.88       0.142      4:5.364    false      true       
-1    C2 CompilerThread0               -               -1          -          2.23       0.110      7:52.914   false      true       
-1    C1 CompilerThread2               -               -1          -          1.8        0.089      3:40.105   false      true       
32762 Timer-for-arthas-dashboard-0c5f3 system          5           RUNNABLE   1.16       0.057      0:0.067    false      true       
25    BatchSpanProcessor_WorkerThread- main            5           TIMED_WAIT 1.0        0.049      7:36.361   false      true       
Memory                       used     total     max      usage     GC                                                                
heap                         2188M    12288M    12288M   17.81%    gc.g1_young_generation.count     619                              
g1_eden_space                1272M    7716M     -1       16.49%    gc.g1_young_generation.time(ms)  28494                            
g1_survivor_space            28M      28M       -1       100.00%   gc.g1_old_generation.count       0                                
g1_old_gen                   888M     4544M     12288M   7.23%     gc.g1_old_generation.time(ms)    0                                
nonheap                      466M     535M      1256M    37.14%                                                                      
code_cache                   148M     162M      240M     61.97%                                                                      
metaspace                    286M     333M      512M     55.90%                                                                      
Runtime                                                                                                                              
os.name                                                            Linux                                                             
os.version                                                         4.19.91-2048.3.miks7.x86_64                                       
java.version                                                       1.8.0_202                                                         
java.home                                                          /home/work/app/openjdk/jre                                        
systemload.average                                                 11.95                                                             
ID    NAME                             GROUP           PRIORITY    STATE      %CPU       DELTA_TIME TIME       INTERRUPTE DAEMON     
610   DubboServerHandler-10.126.131.22 main            5           WAITING    6.49       0.324      4:5.688    false      true       
633   DubboServerHandler-10.126.131.22 main            5           WAITING    5.92       0.296      7:7.986    false      true       
616   DubboServerHandler-10.126.131.22 main            5           TIMED_WAIT 4.18       0.208      5:41.298   false      true       
-1    C2 CompilerThread1               -               -1          -          3.73       0.186      8:12.832   false      true       
641   DubboServerHandler-10.126.131.22 main            5           WAITING    3.27       0.163      6:34.257   false      true       
605   DubboServerHandler-10.126.131.22 main            5           WAITING    2.97       0.148      3:40.576   false      true       
629   DubboServerHandler-10.126.131.22 main            5           WAITING    1.85       0.092      8:13.648   false      true       
-1    C2 CompilerThread0               -               -1          -          1.46       0.072      7:52.986   false      true       
640   DubboServerHandler-10.126.131.22 main            5           WAITING    1.44       0.072      4:8.846    false      true       
25    BatchSpanProcessor_WorkerThread- main            5           TIMED_WAIT 1.13       0.056      7:36.417   false      true       
-1    C1 CompilerThread2               -               -1          -          0.78       0.038      3:40.143   false      true       
Memory                       used     total     max      usage     GC                                                                
heap                         2640M    12288M    12288M   21.48%    gc.g1_young_generation.count     619                              
g1_eden_space                1724M    7716M     -1       22.34%    gc.g1_young_generation.time(ms)  28494                            
g1_survivor_space            28M      28M       -1       100.00%   gc.g1_old_generation.count       0                                
g1_old_gen                   888M     4544M     12288M   7.23%     gc.g1_old_generation.time(ms)    0                                
nonheap                      466M     535M      1256M    37.14%                                                                      
code_cache                   148M     162M      240M     61.95%                                                                      
metaspace                    286M     333M      512M     55.90%                                                                      
Runtime                                                                                                                              
os.name                                                            Linux                                                             
os.version                                                         4.19.91-2048.3.miks7.x86_64                                       
java.version                                                       1.8.0_202                                                         
java.home                                                          /home/work/app/openjdk/jre                                        
systemload.average                                                 12.11                                                             
ID    NAME                             GROUP           PRIORITY    STATE      %CPU       DELTA_TIME TIME       INTERRUPTE DAEMON     
-1    C2 CompilerThread1               -               -1          -          11.74      0.586      8:13.418   false      true       
640   DubboServerHandler-10.126.131.22 main            5           WAITING    4.16       0.208      4:9.054    false      true       
619   DubboServerHandler-10.126.131.22 main            5           WAITING    3.76       0.188      5:54.790   false      true       
618   DubboServerHandler-10.126.131.22 main            5           WAITING    3.67       0.183      5:44.978   false      true       
633   DubboServerHandler-10.126.131.22 main            5           RUNNABLE   3.17       0.158      7:8.145    false      true       
641   DubboServerHandler-10.126.131.22 main            5           RUNNABLE   2.56       0.127      6:34.385   false      true       
616   DubboServerHandler-10.126.131.22 main            5           WAITING    2.54       0.126      5:41.425   false      true       
610   DubboServerHandler-10.126.131.22 main            5           TIMED_WAIT 2.2        0.110      4:5.798    false      true       
629   DubboServerHandler-10.126.131.22 main            5           WAITING    1.88       0.093      8:13.742   false      true       
-1    C2 CompilerThread0               -               -1          -          1.87       0.093      7:53.080   false      true       
-1    VM Thread                        -               -1          -          1.75       0.087      8:20.288   false      true       
Memory                       used     total     max      usage     GC                                                                
heap                         3012M    12288M    12288M   24.51%    gc.g1_young_generation.count     619                              
g1_eden_space                2096M    7716M     -1       27.16%    gc.g1_young_generation.time(ms)  28494                            
g1_survivor_space            28M      28M       -1       100.00%   gc.g1_old_generation.count       0                                
g1_old_gen                   888M     4544M     12288M   7.23%     gc.g1_old_generation.time(ms)    0                                
nonheap                      466M     535M      1256M    37.15%                                                                      
code_cache                   148M     162M      240M     62.03%                                                                      
metaspace                    286M     333M      512M     55.91%                                                                      
Runtime                                                                                                                              
os.name                                                            Linux                                                             
os.version                                                         4.19.91-2048.3.miks7.x86_64                                       
java.version                                                       1.8.0_202                                                         
java.home                                                          /home/work/app/openjdk/jre                                        
systemload.average                                                 12.42                                                             
ID    NAME                             GROUP           PRIORITY    STATE      %CPU       DELTA_TIME TIME       INTERRUPTE DAEMON     
641   DubboServerHandler-10.126.131.22 main            5           TIMED_WAIT 5.38       0.269      6:34.654   false      true       
605   DubboServerHandler-10.126.131.22 main            5           RUNNABLE   3.84       0.191      3:40.781   false      true       
610   DubboServerHandler-10.126.131.22 main            5           WAITING    3.54       0.176      4:5.975    false      true       
633   DubboServerHandler-10.126.131.22 main            5           WAITING    3.4        0.170      7:8.315    false      true       
619   DubboServerHandler-10.126.131.22 main            5           TIMED_WAIT 2.38       0.118      5:54.908   false      true       
-1    C2 CompilerThread1               -               -1          -          2.35       0.117      8:13.536   false      true       
-1    C2 CompilerThread0               -               -1          -          1.94       0.096      7:53.177   false      true       
626   DubboServerHandler-10.126.131.22 main            5           RUNNABLE   1.77       0.088      8:45.062   false      true       
618   DubboServerHandler-10.126.131.22 main            5           RUNNABLE   1.48       0.073      5:45.052   false      true       
-1    C1 CompilerThread2               -               -1          -          1.1        0.054      3:40.242   false      true       
25    BatchSpanProcessor_WorkerThread- main            5           TIMED_WAIT 0.98       0.048      7:36.508   false      true       
Memory                       used     total     max      usage     GC                                                                
heap                         3380M    12288M    12288M   27.51%    gc.g1_young_generation.count     619                              
g1_eden_space                2464M    7716M     -1       31.93%    gc.g1_young_generation.time(ms)  28494                            
g1_survivor_space            28M      28M       -1       100.00%   gc.g1_old_generation.count       0                                
g1_old_gen                   888M     4544M     12288M   7.23%     gc.g1_old_generation.time(ms)    0                                
nonheap                      466M     535M      1256M    37.15%                                                                      
code_cache                   148M     162M      240M     62.02%                                                                      
metaspace                    286M     333M      512M     55.91%                                                                      
Runtime                                                                                                                              
os.name                                                            Linux                                                             
os.version                                                         4.19.91-2048.3.miks7.x86_64                                       
java.version                                                       1.8.0_202                                                         
java.home                                                          /home/work/app/openjdk/jre                                        
systemload.average                                                 11.99                                                             
ID    NAME                             GROUP           PRIORITY    STATE      %CPU       DELTA_TIME TIME       INTERRUPTE DAEMON     
619   DubboServerHandler-10.126.131.22 main            5           TIMED_WAIT 5.37       0.268      5:55.177   false      true       
626   DubboServerHandler-10.126.131.22 main            5           WAITING    5.0        0.249      8:45.312   false      true       
616   DubboServerHandler-10.126.131.22 main            5           WAITING    4.24       0.211      5:41.674   false      true       
618   DubboServerHandler-10.126.131.22 main            5           TIMED_WAIT 4.07       0.203      5:45.255   false      true       
-1    C2 CompilerThread0               -               -1          -          4.04       0.202      7:53.379   false      true       
605   DubboServerHandler-10.126.131.22 main            5           WAITING    4.01       0.200      3:40.981   false      true       
633   DubboServerHandler-10.126.131.22 main            5           TIMED_WAIT 3.48       0.173      7:8.489    false      true       
-1    C2 CompilerThread1               -               -1          -          3.21       0.160      8:13.697   false      true       
641   DubboServerHandler-10.126.131.22 main            5           RUNNABLE   2.5        0.124      6:34.779   false      true       
164   prometheus-http-1-5              main            5           WAITING    2.25       0.112      1:7.978    false      true       
610   DubboServerHandler-10.126.131.22 main            5           RUNNABLE   1.89       0.094      4:6.070    false      true       
Memory                       used     total     max      usage     GC                                                                
heap                         3920M    12288M    12288M   31.90%    gc.g1_young_generation.count     619                              
g1_eden_space                3004M    7716M     -1       38.93%    gc.g1_young_generation.time(ms)  28494                            
g1_survivor_space            28M      28M       -1       100.00%   gc.g1_old_generation.count       0                                
g1_old_gen                   888M     4544M     12288M   7.23%     gc.g1_old_generation.time(ms)    0                                
nonheap                      466M     535M      1256M    37.15%                                                                      
code_cache                   148M     162M      240M     62.03%                                                                      
metaspace                    286M     333M      512M     55.91%                                                                      
Runtime                                                                                                                              
os.name                                                            Linux                                                             
os.version                                                         4.19.91-2048.3.miks7.x86_64                                       
java.version                                                       1.8.0_202                                                         
java.home                                                          /home/work/app/openjdk/jre                                        
systemload.average                                                 12.07

可以看到堆已经分配了 12G,非堆有 0.5G 左右,那么还有 1G 内存被谁用了呢,再通过 jcmd 看看(注意要配置 -XX:NativeMemoryTracking=detail 参数,会对服务有一定的性能影响,毕竟它需要额外的处理和记录本机内存使用信息,但结合经验来看,除非对性能有极高的要求,否则这个影响可以忽略):

[root@j5q9t]# jcmd 1 VM.native_memory scale=MB
1:

Native Memory Tracking:

Total: reserved=15809MB, committed=14721MB
-                 Java Heap (reserved=12288MB, committed=12288MB)
                            (mmap: reserved=12288MB, committed=12288MB) 
 
-                     Class (reserved=1330MB, committed=345MB)
                            (classes #50974)
                            (malloc=10MB #147948) 
                            (mmap: reserved=1320MB, committed=334MB) 
 
-                    Thread (reserved=1080MB, committed=1080MB)
                            (thread #1072)
                            (stack: reserved=1075MB, committed=1075MB)
                            (malloc=3MB #5366) 
                            (arena=1MB #2140)
 
-                      Code (reserved=275MB, committed=197MB)
                            (malloc=31MB #35600) 
                            (mmap: reserved=244MB, committed=165MB) 
 
-                        GC (reserved=532MB, committed=532MB)
                            (malloc=44MB #103632) 
                            (mmap: reserved=488MB, committed=488MB) 
 
-                  Compiler (reserved=2MB, committed=2MB)
                            (malloc=2MB #5196) 
 
-                  Internal (reserved=211MB, committed=211MB)
                            (malloc=210MB #111485) 
 
-                    Symbol (reserved=50MB, committed=50MB)
                            (malloc=46MB #496701) 
                            (arena=4MB #1)
 
-    Native Memory Tracking (reserved=15MB, committed=15MB)
                            (malloc=1MB #11032) 
                            (tracking overhead=14MB)
 
-               Arena Chunk (reserved=2MB, committed=2MB)
                            (malloc=2MB) 
 
-                   Unknown (reserved=24MB, committed=0MB)
                            (mmap: reserved=24MB, committed=0MB)

可以看到线程就占了 1G,总共 1072 个线程,大概每个线程 1M。

jstack 看看线程情况:

服务监控与优化之一次对服务内存占用过高的排查_Memory_02

现状分析 & 方案

综上可以得出如下结论:

  • 内存数据都对的上,进程占用 13.5G,12G 堆,0.5G 非堆,1G 线程栈。没有发现明显的内存泄漏的情况,也没发现有堆外内存泄漏的情况 (因为我们之前配置了 -XX:+DisableExplicitGC ,曾一度怀疑是堆外内存泄漏)
  • 至于线程,我觉得数量有点多,有一定的优化空间,但每个线程要 1M 的确大了(默认就是 1M)

于是解决思路如下:

  • 减少堆大小
  • 减少线程栈大小
  • 去除 -XX:+DisableExplicitGC
  • ConcGCThreadsParallelGCThreads 调大

效果分析

9.14 号进行了参数调整后发现的确有效果,内存使用率由之前的 80% 左右降至 60% 左右:

服务监控与优化之一次对服务内存占用过高的排查_sed_03

暂时看效果还是不错的。

再观察一下内存情况:

[vgjvm]# ps -p 1 -o rss,vsz 
  RSS    VSZ
10784848 78837012
[vgjvm]# jcmd 1 VM.native_memory scale=MB
1:

Native Memory Tracking:

Total: reserved=10983MB, committed=9899MB
-                 Java Heap (reserved=8192MB, committed=8192MB)
                            (mmap: reserved=8192MB, committed=8192MB) 
 
-                     Class (reserved=1301MB, committed=311MB)
                            (classes #45511)
                            (malloc=9MB #125984) 
                            (mmap: reserved=1292MB, committed=302MB) 
 
-                    Thread (reserved=511MB, committed=511MB)
                            (thread #979)
                            (stack: reserved=506MB, committed=506MB)
                            (malloc=3MB #4904) 
                            (arena=1MB #1954)
 
-                      Code (reserved=273MB, committed=196MB)
                            (malloc=30MB #33325) 
                            (mmap: reserved=244MB, committed=167MB) 
 
-                        GC (reserved=397MB, committed=397MB)
                            (malloc=61MB #122529) 
                            (mmap: reserved=336MB, committed=336MB) 
 
-                  Compiler (reserved=3MB, committed=3MB)
                            (malloc=3MB #4978) 
 
-                  Internal (reserved=227MB, committed=227MB)
                            (malloc=227MB #129419) 
 
-                    Symbol (reserved=47MB, committed=47MB)
                            (malloc=43MB #471868) 
                            (arena=4MB #1)
 
-    Native Memory Tracking (reserved=15MB, committed=15MB)
                            (malloc=1MB #10583) 
                            (tracking overhead=14MB)
 
-                   Unknown (reserved=16MB, committed=0MB)
                            (mmap: reserved=16MB, committed=0MB)
[arthas@1]$ dashboard
ID    NAME                             GROUP            PRIORITY   STATE      %CPU       DELTA_TIME  TIME       INTERRUPTE DAEMON     
-1    VM Thread                        -                -1         -          0.0        0.000       9:44.358   false      true       
25    BatchSpanProcessor_WorkerThread- main             5          TIMED_WAIT 0.0        0.000       5:56.554   false      true       
-1    C2 CompilerThread1               -                -1         -          0.0        0.000       5:9.458    false      true       
-1    C2 CompilerThread0               -                -1         -          0.0        0.000       5:8.866    false      true       
11    com.alibaba.nacos.client.Worker. main             5          TIMED_WAIT 0.0        0.000       4:9.608    false      true       
281   MQ-AsyncTraceDispatcher-Thread-f main             5          TIMED_WAIT 0.0        0.000       3:44.278   false      true       
357   MQ-AsyncTraceDispatcher-Thread-e main             5          TIMED_WAIT 0.0        0.000       3:43.311   false      true       
473   MQ-AsyncTraceDispatcher-Thread-b main             5          TIMED_WAIT 0.0        0.000       3:39.944   false      true       
396   MQ-AsyncTraceDispatcher-Thread-a main             5          TIMED_WAIT 0.0        0.000       3:38.826   false      true       
454   MQ-AsyncTraceDispatcher-Thread-1 main             5          TIMED_WAIT 0.0        0.000       3:37.686   false      true       
319   MQ-AsyncTraceDispatcher-Thread-1 main             5          TIMED_WAIT 0.0        0.000       3:37.064   false      true       
Memory                       used     total     max      usage     GC                                                                 
heap                         3706M    8192M     8192M    45.25%    gc.g1_young_generation.count      1542                             
g1_eden_space                1472M    4928M     -1       29.87%    gc.g1_young_generation.time(ms)   63122                            
g1_survivor_space            60M      60M       -1       100.00%   gc.g1_old_generation.count        0                                
g1_old_gen                   2174M    3204M     8192M    26.55%    gc.g1_old_generation.time(ms)     0                                
nonheap                      444M     508M      1256M    35.39%                                                                       
code_cache                   141M     164M      240M     59.15%                                                                       
metaspace                    273M     309M      512M     53.39%                                                                       
Runtime                                                                                                                               
os.name                                                            Linux                                                              
os.version                                                         4.19.91-2048.3.miks7.x86_64                                        
java.version                                                       1.8.0_202                                                          
java.home                                                          /home/work/app/openjdk/jre                                         
systemload.average                                                 54.35

这里存在一个问题,已分配的内存小于实际物理内存的使用量,约为 633MB。需要进一步分析,找出造成这数百MB 差距的原因。

总结

本次对系统内存进行了一次粗粒度的优化,但仔细看系统仍然有很大的优化空间,如果有进一步的优化进展,将分享并与大家进行讨论。

实际上,在网上有很多关于内存持续升高问题的分析,尽管原因不尽相同,但它们为我们提供了宝贵的思路,尤其当我们自己遇到困惑时,这些分享能够为我们提供新的启发。而这正是经验分享的真正价值所在。

欢迎关注公众号:

服务监控与优化之一次对服务内存占用过高的排查_jvm_04


【版权声明】本文内容来自摩杜云社区用户原创、第三方投稿、转载,内容版权归原作者所有。本网站的目的在于传递更多信息,不拥有版权,亦不承担相应法律责任。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@moduyun.com

  1. 分享:
最后一次编辑于 2023年11月08日 0

暂无评论

推荐阅读
6DOeteBHjp6w