Node memory leak #130

@masterOcean

Description

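A node leaked memory for 30 minutes until it went down; the retained objects are mainly PubMessage objects

Our production environment is a cluster of 4 nodes (16c 32g). On one of the nodes, memory kept rising for 30 minutes, GC could not bring it down, and the node eventually became unavailable. This is the first time we have seen this. At the time the cluster had about 120,000 connections in total, about 30,000 per node. Each connection published a message roughly every 5 s or 10 s, and the messages were forwarded to Kafka through shared connections.
GC log: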

[2025-05-20T09:40:43.490+0800][30530][gc] GC(168430) Garbage Collection (Allocation Rate) 6626M(54%)->1966M(16%)
[2025-05-20T09:40:58.462+0800][30530][gc] GC(168431) Garbage Collection (Allocation Rate) 7112M(58%)->2040M(17%)
.....
[2025-05-20T09:44:37.140+0800][30530][gc] GC(168446) Garbage Collection (Allocation Rate) 5252M(43%)->2862M(23%)
[2025-05-20T09:44:45.639+0800][30530][gc] GC(168447) Garbage Collection (Allocation Rate) 5222M(42%)->2422M(20%)
[2025-05-20T09:44:55.678+0800][30530][gc] GC(168448) Garbage Collection (Allocation Rate) 4706M(38%)->3060M(25%)
.....
[2025-05-20T09:48:01.016+0800][30530][gc] GC(168460) Garbage Collection (Allocation Rate) 7482M(61%)->3376M(27%)
[2025-05-20T09:48:15.690+0800][30530][gc] GC(168461) Garbage Collection (Allocation Rate) 7518M(61%)->3448M(28%)
[2025-05-20T09:48:30.689+0800][30530][gc] GC(168462) Garbage Collection (Allocation Rate) 7720M(63%)->3502M(28%)
.....
[2025-05-20T09:52:06.422+0800][30530][gc] GC(168483) Garbage Collection (Allocation Rate) 6834M(56%)->4232M(34%)
[2025-05-20T09:52:16.279+0800][30530][gc] GC(168484) Garbage Collection (Allocation Rate) 6762M(55%)->4284M(35%)
[2025-05-20T09:52:26.332+0800][30530][gc] GC(168485) Garbage Collection (Allocation Rate) 6868M(56%)->4376M(36%)
.....
[2025-05-20T10:03:21.311+0800][30530][gc] GC(168626) Garbage Collection (Allocation Rate) 7308M(59%)->7280M(59%)
[2025-05-20T10:03:25.278+0800][30530][gc] GC(168627) Garbage Collection (Allocation Rate) 7302M(59%)->7328M(60%)
.....
[2025-05-20T10:10:03.710+0800][30530][gc] GC(168717) Garbage Collection (Allocation Rate) 9672M(79%)->9866M(80%)
[2025-05-20T10:10:09.355+0800][30843][gc] Allocation Stall (crdt-service-scheduler) 176.979ms
[2025-05-20T10:10:09.355+0800][30755][gc] Allocation Stall (io-rpc-worker-elg-31) 107.565ms
[2025-05-20T10:10:09.355+0800][32197][gc] Allocation Stall (basekv-range-mutator) 560.162ms
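
The trend in the log above can be quantified by comparing the heap size left after the first and the last collection shown. The following is a minimal sketch, not part of BifroMQ: the class `GcTrend`, the record `Sample` and both methods are hypothetical helpers written for this report, and the regular expression only assumes the line layout visible in the excerpt. Applied to GC(168430) and GC(168717), it gives roughly 4.5 M per second of growth in post-collection heap over about 29 minutes.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Locale;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcTrend {
    // Matches lines such as:
    // [2025-05-20T09:40:43.490+0800][30530][gc] GC(168430) ... 6626M(54%)->1966M(16%)
    // group 1 = timestamp, 2 = GC id, 3 = heap before (M), 4 = heap after (M)
    private static final Pattern LINE = Pattern.compile(
        "^\\[([^\\]]+)\\].*GC\\((\\d+)\\).*?(\\d+)M\\(\\d+%\\)->(\\d+)M\\(\\d+%\\)");

    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSSZ");

    record Sample(OffsetDateTime time, long usedAfterM) {}

    // Returns null for lines that are not collection summaries (e.g. Allocation Stall).
    static Sample parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new Sample(OffsetDateTime.parse(m.group(1), TS), Long.parseLong(m.group(4)));
    }

    // Average growth of the post-collection heap between two samples, in M per second.
    static double growthPerSecond(Sample first, Sample last) {
        double seconds = Duration.between(first.time(), last.time()).toMillis() / 1000.0;
        return (last.usedAfterM() - first.usedAfterM()) / seconds;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "[2025-05-20T09:40:43.490+0800][30530][gc] GC(168430) Garbage Collection (Allocation Rate) 6626M(54%)->1966M(16%)",
            "[2025-05-20T10:10:03.710+0800][30530][gc] GC(168717) Garbage Collection (Allocation Rate) 9672M(79%)->9866M(80%)",
            "[2025-05-20T10:10:09.355+0800][30843][gc] Allocation Stall (crdt-service-scheduler) 176.979ms");
        List<Sample> samples = log.stream().map(GcTrend::parse).filter(Objects::nonNull).toList();
        Sample first = samples.get(0);
        Sample last = samples.get(samples.size() - 1);
        System.out.println(String.format(Locale.ROOT, "post-GC heap grew %d M at %.2f M/s",
            last.usedAfterM() - first.usedAfterM(), growthPerSecond(first, last)));
    }
}
```

A steadily rising post-collection heap, as opposed to a rising pre-collection heap, is what points to retained objects rather than a high allocation rate alone.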

Heap histogram of the bad node

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:       4387991      789541360  [Ljava.lang.Object; (java.base)
   2:       6318061      692352160  [B (java.base)
   3:       2070386      560754472  [J (java.base)
   4:      14831152      474596864  java.util.concurrent.CompletableFuture (java.base)
   5:       2057200      298501136  [I (java.base)
   6:       4197419      268634816  java.util.concurrent.CompletableFuture$UniWhenComplete (java.base)
   7:       2076515      215957560  com.baidu.bifromq.type.Message
   8:       3136953      175669368  java.util.concurrent.CompletableFuture$UniRelay (java.base)
   9:       2048731      163898352  [S (java.base)
  10:       2120880      135736320  java.util.concurrent.CompletableFuture$UniApply (java.base)
  11:       2076511      132896704  java.util.concurrent.CompletableFuture$UniExceptionally (java.base)
  12:       4110609      131539488  java.lang.String (java.base)
  13:       1078152      120753024  io.netty.buffer.PooledUnsafeDirectByteBuf
  14:       2076517       99672816  com.baidu.bifromq.basescheduler.CallTask
  15:       2076510       83060400  com.baidu.bifromq.dist.client.scheduler.DistServerCall
  16:       2161485       69167520  java.util.concurrent.ConcurrentLinkedQueue$Node (java.base)
  17:       1061848       67958272  io.netty.buffer.PooledSlicedByteBuf
  18:       1060441       67868224  java.util.concurrent.CompletableFuture$UniAccept (java.base)
  19:       2076518       66448576  com.baidu.bifromq.dist.client.scheduler.BatcherKey
  20:       1060435       59384360  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1774/0x00007fdddea97150
  21:       1060405       59382680  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1777/0x00007fdddea97808
  22:       1038502       58156112  java.util.LinkedHashMap$Entry (java.base)
  23:       1016076       56900256  java.util.concurrent.CancellationException (java.base)
  24:        279987       51446320  [Ljava.util.HashMap$Node; (java.base)
  25:       1060436       50900928  com.baidu.bifromq.plugin.authprovider.type.CheckResult
  26:       1060435       50900880  io.netty.handler.codec.mqtt.MqttPublishMessage
  27:       2088592       50126208  com.google.protobuf.ByteString$LiteralByteString
  28:       2076517       49836408  com.baidu.bifromq.basescheduler.BatchCallScheduler$$Lambda$1429/0x00007fddde92f5f0
  29:       1127869       45114760  java.util.HashMap$Node (java.base)
  30:       1060439       42417560  io.netty.handler.codec.mqtt.MqttFixedHeader
  31:        212802       37065952  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base)
  32:        410806       36150928  io.netty.channel.DefaultChannelHandlerContext
  33:       1060437       33933984  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1693/0x00007fdddea5e250
  34:       1060435       33933920  io.netty.handler.codec.mqtt.MqttPublishVariableHeader
  35:        268707       25795872  java.util.concurrent.ConcurrentHashMap (java.base)
  36:        321774       25741920  java.util.LinkedHashMap (java.base)
  37:       1060435       25450440  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1778/0x00007fdddea97c78
  38:        776047       24833504  io.netty.util.Recycler$DefaultHandle
  39:       1016077       24385848  java.util.concurrent.CompletableFuture$AltResult (java.base)
  40:        526340       21053600  java.util.concurrent.ConcurrentHashMap$Node (java.base)
  41:         45411       13804944  com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler
  42:        137358       10988640  java.util.TreeMap (java.base)
  43:         45864       10273536  io.netty.channel.epoll.EpollSocketChannel
  44:        393943        9454632  java.util.concurrent.atomic.AtomicLong (java.base)
  45:        194612        9341376  com.google.protobuf.MapField
  46:         45609        8756928  io.netty.handler.traffic.TrafficCounter
  47:         91987        8094856  io.netty.util.concurrent.ScheduledFutureTask
  48:        139228        7796768  com.baidu.bifromq.type.ClientInfo
  49:        182432        7297280  io.netty.util.DefaultAttributeMap$DefaultAttribute
  50:        220074        7042368  java.util.concurrent.ConcurrentHashMap$KeySetView (java.base)
  51:        138238        6635424  com.baidu.bifromq.inbox.storage.proto.TopicFilterOption
  52:        194612        6227584  com.google.protobuf.MapField$MutabilityAwareMap
  53:         96349        6166336  java.util.HashMap (java.base)
  54:         44444        6044384  com.baidu.bifromq.inbox.storage.proto.InboxMetadata
  55:          7600        5168000  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
  56:         45605        4742920  com.baidu.bifromq.mqtt.handler.TenantSettings
  57:        195219        4685256  java.util.concurrent.atomic.AtomicReference (java.base)
  58:        194612        4670688  com.google.protobuf.MapField$ImmutableMessageConverter
  59:         45871        4403616  io.netty.channel.DefaultChannelPipeline$HeadContext

Heap histogram of a healthy node

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:        311863      113943464  [Ljava.lang.Object; (java.base)
   2:       2077519      107863216  [B (java.base)
   3:       1940738       62103616  java.lang.String (java.base)
   4:        906314       50753584  java.util.LinkedHashMap$Entry (java.base)
   5:        216048       37429824  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base)
   6:        420222       36979536  io.netty.channel.DefaultChannelHandlerContext
   7:        263687       34769200  [Ljava.util.HashMap$Node; (java.base)
   8:        273474       26253504  java.util.concurrent.ConcurrentHashMap (java.base)
   9:        306730       24538400  java.util.LinkedHashMap (java.base)
  10:        525528       21021120  java.util.concurrent.ConcurrentHashMap$Node (java.base)
  11:         46471       14127184  com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler
  12:        140444       11235520  java.util.TreeMap (java.base)
  13:         46915       10508960  io.netty.channel.epoll.EpollSocketChannel
  14:        421213       10109112  java.util.concurrent.atomic.AtomicLong (java.base)
  15:         46653        8957376  io.netty.handler.traffic.TrafficCounter
  16:        176317        8463216  com.google.protobuf.MapField
  17:         94029        8274552  io.netty.util.concurrent.ScheduledFutureTask
  18:        186600        7464000  io.netty.util.DefaultAttributeMap$DefaultAttribute
  19:        223556        7153792  java.util.concurrent.ConcurrentHashMap$KeySetView (java.base)
  20:        141252        6780096  com.baidu.bifromq.inbox.storage.proto.TopicFilterOption
  21:        120211        6731816  com.baidu.bifromq.type.ClientInfo
  22:         98398        6297472  java.util.HashMap (java.base)
  23:         45324        6164064  com.baidu.bifromq.inbox.storage.proto.InboxMetadata
  24:        176317        5642144  com.google.protobuf.MapField$MutabilityAwareMap
  25:        230407        5529768  java.util.concurrent.atomic.AtomicReference (java.base)
  26:        164961        5278752  java.util.concurrent.CompletableFuture (java.base)
  27:          7440        5059200  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
  28:         46640        4850560  com.baidu.bifromq.mqtt.handler.TenantSettings
  29:         46922        4504512  io.netty.channel.DefaultChannelPipeline$HeadContext
  30:         46653        4478688  io.netty.handler.codec.mqtt.MqttDecoder
  31:         46653        4478688  io.netty.handler.traffic.ChannelTrafficShapingHandler
  32:        176317        4231608  com.google.protobuf.MapField$ImmutableMessageConverter
  33:         46922        4129136  io.netty.channel.DefaultChannelPipeline$TailContext
  34:         46891        4126408  io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl
  35:         46640        4104320  com.baidu.bifromq.mqtt.handler.v3.MQTT3ConnectHandler
  36:        118016        3776512  com.baidu.bifromq.mqtt.service.LocalDistService$TopicFilter
  37:        118014        3776448  com.baidu.bifromq.mqtt.service.LocalDistService$LocalRoutes
  38:         93994        3759760  java.net.InetAddress$InetAddressHolder (java.base)
  39:         46922        3753760  io.netty.channel.DefaultChannelPipeline
  40:         46915        3753200  io.netty.channel.epoll.EpollSocketChannel$EpollSocketChannelUnsafe
  41:         46915        3753200  io.netty.channel.epoll.EpollSocketChannelConfig
  42:         93280        3731200  com.baidu.bifromq.mqtt.session.MQTTSessionAuthProvider
  43:        152437        3658488  java.util.LinkedHashMap$LinkedEntrySet (java.base)
  44:         62679        3510024  java.util.TreeMap$Entry (java.base)
  45:          8682        3504408  [I (java.base)
  46:         18591        3456320  java.lang.Class (java.base)
  47:         93293        3358496  [Lcom.baidu.bifromq.mqtt.handler.condition.Condition;
  48:         46640        3358080  com.baidu.bifromq.mqtt.handler.ConditionalSlowDownHandler
2774:             2             96  io.netty.handler.codec.mqtt.MqttPublishMessage

We suspect the cause is in DistServerCallScheduler: a gRPC timeout blocks the batcher, so every MqttPublishMessage is added to Batcher.callTaskBuffers, which is a ConcurrentLinkedQueue and therefore unbounded.
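
To illustrate the concern about an unbounded buffer, here is a minimal sketch of one possible mitigation: admit tasks into a bounded queue and fail the caller's future immediately when it is full. This is not BifroMQ's actual `Batcher` code. `BoundedCallBuffer`, `Task` and `BufferFullException` are hypothetical names, and whether rejecting publishes is acceptable for a given QoS level is an assumption that would need checking.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for a call-task buffer; not the real BifroMQ class.
final class BoundedCallBuffer<ReqT, RespT> {
    // A queued request together with the future its caller is waiting on.
    record Task<Q, R>(Q request, CompletableFuture<R> onDone) {}

    static final class BufferFullException extends RuntimeException {
        BufferFullException(int capacity) {
            super("call buffer is full (capacity " + capacity + ")");
        }
    }

    private final int capacity;
    private final BlockingQueue<Task<ReqT, RespT>> queue;

    BoundedCallBuffer(int capacity) {
        this.capacity = capacity;
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Never blocks: either enqueues the task or fails the future right away,
    // so a stalled consumer cannot make the buffer grow without limit.
    CompletableFuture<RespT> submit(ReqT request) {
        CompletableFuture<RespT> onDone = new CompletableFuture<>();
        if (!queue.offer(new Task<>(request, onDone))) {
            onDone.completeExceptionally(new BufferFullException(capacity));
        }
        return onDone;
    }

    // Called by the consumer side; returns null when the buffer is empty.
    Task<ReqT, RespT> poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        BoundedCallBuffer<String, String> buffer = new BoundedCallBuffer<>(2);
        buffer.submit("m1");
        buffer.submit("m2");
        CompletableFuture<String> rejected = buffer.submit("m3");
        System.out.println("third submit rejected: " + rejected.isCompletedExceptionally());
        System.out.println("buffered tasks: " + buffer.size());
    }
}
```

With a bound in place, a slow or timed-out downstream call shows up as rejected publishes that can be counted and alerted on, instead of as heap growth.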

Environment

  • Version: [3.2.1]
  • JVM Version: [OpenJDK17, startup options -Xms12g -Xmx12g -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=12g]
  • Hardware Spec: [15c32g, 4 nodes]
  • OS: [Tencent Cloud OS]
