Description
A node leaked memory for 30 minutes until it went down; the retained objects are mainly PubMessage objects
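Our production cluster consists of 4 nodes (16c 32g). On one of them, memory kept rising for 30 minutes, GC could not bring it back down, and the node eventually became unavailable. This is the first time we have seen this. At the time, the cluster had roughly 120,000 connections in total, about 30,000 on a single node. Each connection publishes a message roughly every 5 s or 10 s, and the messages are forwarded out to Kafka over shared connections.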
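As a rough sanity check on the load described above, the publish rate per node can be estimated from the connection count and the publish interval. This is only a back-of-the-envelope sketch: it assumes every connection publishes exactly once per interval, and the class and method names are illustrative, not part of BifroMQ.

```java
class PublishRate {
    /** Messages per second if every connection publishes once per interval. */
    static double messagesPerSecond(int connections, int intervalSeconds) {
        return (double) connections / intervalSeconds;
    }

    public static void main(String[] args) {
        int perNode = 30_000; // connections on a single node, as reported above
        System.out.println(messagesPerSecond(perNode, 10)); // prints 3000.0
        System.out.println(messagesPerSecond(perNode, 5));  // prints 6000.0
    }
}
```

Under that assumption a single node would see somewhere between 3,000 and 6,000 publishes per second.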
GC log:
[2025-05-20T09:40:43.490+0800][30530][gc] GC(168430) Garbage Collection (Allocation Rate) 6626M(54%)->1966M(16%)
[2025-05-20T09:40:58.462+0800][30530][gc] GC(168431) Garbage Collection (Allocation Rate) 7112M(58%)->2040M(17%)
.....
[2025-05-20T09:44:37.140+0800][30530][gc] GC(168446) Garbage Collection (Allocation Rate) 5252M(43%)->2862M(23%)
[2025-05-20T09:44:45.639+0800][30530][gc] GC(168447) Garbage Collection (Allocation Rate) 5222M(42%)->2422M(20%)
[2025-05-20T09:44:55.678+0800][30530][gc] GC(168448) Garbage Collection (Allocation Rate) 4706M(38%)->3060M(25%)
.....
[2025-05-20T09:48:01.016+0800][30530][gc] GC(168460) Garbage Collection (Allocation Rate) 7482M(61%)->3376M(27%)
[2025-05-20T09:48:15.690+0800][30530][gc] GC(168461) Garbage Collection (Allocation Rate) 7518M(61%)->3448M(28%)
[2025-05-20T09:48:30.689+0800][30530][gc] GC(168462) Garbage Collection (Allocation Rate) 7720M(63%)->3502M(28%)
.....
[2025-05-20T09:52:06.422+0800][30530][gc] GC(168483) Garbage Collection (Allocation Rate) 6834M(56%)->4232M(34%)
[2025-05-20T09:52:16.279+0800][30530][gc] GC(168484) Garbage Collection (Allocation Rate) 6762M(55%)->4284M(35%)
[2025-05-20T09:52:26.332+0800][30530][gc] GC(168485) Garbage Collection (Allocation Rate) 6868M(56%)->4376M(36%)
.....
[2025-05-20T10:03:21.311+0800][30530][gc] GC(168626) Garbage Collection (Allocation Rate) 7308M(59%)->7280M(59%)
[2025-05-20T10:03:25.278+0800][30530][gc] GC(168627) Garbage Collection (Allocation Rate) 7302M(59%)->7328M(60%)
.....
[2025-05-20T10:10:03.710+0800][30530][gc] GC(168717) Garbage Collection (Allocation Rate) 9672M(79%)->9866M(80%)
[2025-05-20T10:10:09.355+0800][30843][gc] Allocation Stall (crdt-service-scheduler) 176.979ms
[2025-05-20T10:10:09.355+0800][30755][gc] Allocation Stall (io-rpc-worker-elg-31) 107.565ms
[2025-05-20T10:10:09.355+0800][32197][gc] Allocation Stall (basekv-range-mutator) 560.162ms
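The important signal in the log above is the heap size after each collection (the number to the right of `->`), which climbs from about 2 GB to almost 10 GB. A minimal sketch for extracting that series from log lines in this format follows. The class name `GcFloorTrend` and its method are made up for illustration, and the regular expression only covers the line shape shown in this report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GcFloorTrend {
    // Matches the "<before>M(<pct>%)-><after>M(<pct>%)" part of a GC cycle line.
    private static final Pattern CYCLE =
        Pattern.compile("(\\d+)M\\(\\d+%\\)->(\\d+)M\\(\\d+%\\)");

    /** Returns the post-GC heap size in MB for every GC cycle line, in log order. */
    static List<Integer> postGcHeapMb(List<String> lines) {
        List<Integer> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = CYCLE.matcher(line);
            if (m.find()) {
                result.add(Integer.parseInt(m.group(2)));
            }
            // Lines without a cycle summary (e.g. "Allocation Stall") are skipped.
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "[2025-05-20T09:40:43.490+0800][30530][gc] GC(168430) Garbage Collection (Allocation Rate) 6626M(54%)->1966M(16%)",
            "[2025-05-20T09:48:01.016+0800][30530][gc] GC(168460) Garbage Collection (Allocation Rate) 7482M(61%)->3376M(27%)",
            "[2025-05-20T09:52:06.422+0800][30530][gc] GC(168483) Garbage Collection (Allocation Rate) 6834M(56%)->4232M(34%)",
            "[2025-05-20T10:03:21.311+0800][30530][gc] GC(168626) Garbage Collection (Allocation Rate) 7308M(59%)->7280M(59%)",
            "[2025-05-20T10:10:03.710+0800][30530][gc] GC(168717) Garbage Collection (Allocation Rate) 9672M(79%)->9866M(80%)",
            "[2025-05-20T10:10:09.355+0800][30843][gc] Allocation Stall (crdt-service-scheduler) 176.979ms");
        System.out.println(postGcHeapMb(sample)); // prints [1966, 3376, 4232, 7280, 9866]
    }
}
```

A post-GC value that only ever grows, as in this sample, indicates that the live set itself is growing rather than the collector falling behind on short-lived garbage.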
Heap histogram of the bad node
num #instances #bytes class name (module)
-------------------------------------------------------
1: 4387991 789541360 [Ljava.lang.Object; ([email protected])
2: 6318061 692352160 [B ([email protected])
3: 2070386 560754472 [J ([email protected])
4: 14831152 474596864 java.util.concurrent.CompletableFuture ([email protected])
5: 2057200 298501136 [I ([email protected])
6: 4197419 268634816 java.util.concurrent.CompletableFuture$UniWhenComplete ([email protected])
7: 2076515 215957560 com.baidu.bifromq.type.Message
8: 3136953 175669368 java.util.concurrent.CompletableFuture$UniRelay ([email protected])
9: 2048731 163898352 [S ([email protected])
10: 2120880 135736320 java.util.concurrent.CompletableFuture$UniApply ([email protected])
11: 2076511 132896704 java.util.concurrent.CompletableFuture$UniExceptionally ([email protected])
12: 4110609 131539488 java.lang.String ([email protected])
13: 1078152 120753024 io.netty.buffer.PooledUnsafeDirectByteBuf
14: 2076517 99672816 com.baidu.bifromq.basescheduler.CallTask
15: 2076510 83060400 com.baidu.bifromq.dist.client.scheduler.DistServerCall
16: 2161485 69167520 java.util.concurrent.ConcurrentLinkedQueue$Node ([email protected])
17: 1061848 67958272 io.netty.buffer.PooledSlicedByteBuf
18: 1060441 67868224 java.util.concurrent.CompletableFuture$UniAccept ([email protected])
19: 2076518 66448576 com.baidu.bifromq.dist.client.scheduler.BatcherKey
20: 1060435 59384360 com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1774/0x00007fdddea97150
21: 1060405 59382680 com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1777/0x00007fdddea97808
22: 1038502 58156112 java.util.LinkedHashMap$Entry ([email protected])
23: 1016076 56900256 java.util.concurrent.CancellationException ([email protected])
24: 279987 51446320 [Ljava.util.HashMap$Node; ([email protected])
25: 1060436 50900928 com.baidu.bifromq.plugin.authprovider.type.CheckResult
26: 1060435 50900880 io.netty.handler.codec.mqtt.MqttPublishMessage
27: 2088592 50126208 com.google.protobuf.ByteString$LiteralByteString
28: 2076517 49836408 com.baidu.bifromq.basescheduler.BatchCallScheduler$$Lambda$1429/0x00007fddde92f5f0
29: 1127869 45114760 java.util.HashMap$Node ([email protected])
30: 1060439 42417560 io.netty.handler.codec.mqtt.MqttFixedHeader
31: 212802 37065952 [Ljava.util.concurrent.ConcurrentHashMap$Node; ([email protected])
32: 410806 36150928 io.netty.channel.DefaultChannelHandlerContext
33: 1060437 33933984 com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1693/0x00007fdddea5e250
34: 1060435 33933920 io.netty.handler.codec.mqtt.MqttPublishVariableHeader
35: 268707 25795872 java.util.concurrent.ConcurrentHashMap ([email protected])
36: 321774 25741920 java.util.LinkedHashMap ([email protected])
37: 1060435 25450440 com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1778/0x00007fdddea97c78
38: 776047 24833504 io.netty.util.Recycler$DefaultHandle
39: 1016077 24385848 java.util.concurrent.CompletableFuture$AltResult ([email protected])
40: 526340 21053600 java.util.concurrent.ConcurrentHashMap$Node ([email protected])
41: 45411 13804944 com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler
42: 137358 10988640 java.util.TreeMap ([email protected])
43: 45864 10273536 io.netty.channel.epoll.EpollSocketChannel
44: 393943 9454632 java.util.concurrent.atomic.AtomicLong ([email protected])
45: 194612 9341376 com.google.protobuf.MapField
46: 45609 8756928 io.netty.handler.traffic.TrafficCounter
47: 91987 8094856 io.netty.util.concurrent.ScheduledFutureTask
48: 139228 7796768 com.baidu.bifromq.type.ClientInfo
49: 182432 7297280 io.netty.util.DefaultAttributeMap$DefaultAttribute
50: 220074 7042368 java.util.concurrent.ConcurrentHashMap$KeySetView ([email protected])
51: 138238 6635424 com.baidu.bifromq.inbox.storage.proto.TopicFilterOption
52: 194612 6227584 com.google.protobuf.MapField$MutabilityAwareMap
53: 96349 6166336 java.util.HashMap ([email protected])
54: 44444 6044384 com.baidu.bifromq.inbox.storage.proto.InboxMetadata
55: 7600 5168000 io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
56: 45605 4742920 com.baidu.bifromq.mqtt.handler.TenantSettings
57: 195219 4685256 java.util.concurrent.atomic.AtomicReference ([email protected])
58: 194612 4670688 com.google.protobuf.MapField$ImmutableMessageConverter
59: 45871 4403616 io.netty.channel.DefaultChannelPipeline$HeadContext
Heap histogram of a healthy node
num #instances #bytes class name (module)
-------------------------------------------------------
1: 311863 113943464 [Ljava.lang.Object; ([email protected])
2: 2077519 107863216 [B ([email protected])
3: 1940738 62103616 java.lang.String ([email protected])
4: 906314 50753584 java.util.LinkedHashMap$Entry ([email protected])
5: 216048 37429824 [Ljava.util.concurrent.ConcurrentHashMap$Node; ([email protected])
6: 420222 36979536 io.netty.channel.DefaultChannelHandlerContext
7: 263687 34769200 [Ljava.util.HashMap$Node; ([email protected])
8: 273474 26253504 java.util.concurrent.ConcurrentHashMap ([email protected])
9: 306730 24538400 java.util.LinkedHashMap ([email protected])
10: 525528 21021120 java.util.concurrent.ConcurrentHashMap$Node ([email protected])
11: 46471 14127184 com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler
12: 140444 11235520 java.util.TreeMap ([email protected])
13: 46915 10508960 io.netty.channel.epoll.EpollSocketChannel
14: 421213 10109112 java.util.concurrent.atomic.AtomicLong ([email protected])
15: 46653 8957376 io.netty.handler.traffic.TrafficCounter
16: 176317 8463216 com.google.protobuf.MapField
17: 94029 8274552 io.netty.util.concurrent.ScheduledFutureTask
18: 186600 7464000 io.netty.util.DefaultAttributeMap$DefaultAttribute
19: 223556 7153792 java.util.concurrent.ConcurrentHashMap$KeySetView ([email protected])
20: 141252 6780096 com.baidu.bifromq.inbox.storage.proto.TopicFilterOption
21: 120211 6731816 com.baidu.bifromq.type.ClientInfo
22: 98398 6297472 java.util.HashMap ([email protected])
23: 45324 6164064 com.baidu.bifromq.inbox.storage.proto.InboxMetadata
24: 176317 5642144 com.google.protobuf.MapField$MutabilityAwareMap
25: 230407 5529768 java.util.concurrent.atomic.AtomicReference ([email protected])
26: 164961 5278752 java.util.concurrent.CompletableFuture ([email protected])
27: 7440 5059200 io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
28: 46640 4850560 com.baidu.bifromq.mqtt.handler.TenantSettings
29: 46922 4504512 io.netty.channel.DefaultChannelPipeline$HeadContext
30: 46653 4478688 io.netty.handler.codec.mqtt.MqttDecoder
31: 46653 4478688 io.netty.handler.traffic.ChannelTrafficShapingHandler
32: 176317 4231608 com.google.protobuf.MapField$ImmutableMessageConverter
33: 46922 4129136 io.netty.channel.DefaultChannelPipeline$TailContext
34: 46891 4126408 io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl
35: 46640 4104320 com.baidu.bifromq.mqtt.handler.v3.MQTT3ConnectHandler
36: 118016 3776512 com.baidu.bifromq.mqtt.service.LocalDistService$TopicFilter
37: 118014 3776448 com.baidu.bifromq.mqtt.service.LocalDistService$LocalRoutes
38: 93994 3759760 java.net.InetAddress$InetAddressHolder ([email protected])
39: 46922 3753760 io.netty.channel.DefaultChannelPipeline
40: 46915 3753200 io.netty.channel.epoll.EpollSocketChannel$EpollSocketChannelUnsafe
41: 46915 3753200 io.netty.channel.epoll.EpollSocketChannelConfig
42: 93280 3731200 com.baidu.bifromq.mqtt.session.MQTTSessionAuthProvider
43: 152437 3658488 java.util.LinkedHashMap$LinkedEntrySet ([email protected])
44: 62679 3510024 java.util.TreeMap$Entry ([email protected])
45: 8682 3504408 [I ([email protected])
46: 18591 3456320 java.lang.Class ([email protected])
47: 93293 3358496 [Lcom.baidu.bifromq.mqtt.handler.condition.Condition;
48: 46640 3358080 com.baidu.bifromq.mqtt.handler.ConditionalSlowDownHandler
2774: 2 96 io.netty.handler.codec.mqtt.MqttPublishMessage
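To compare the two histograms above class by class, the rows can be parsed and the instance counts subtracted. The following is a small sketch, not an existing tool: `HistoDiff` and its methods are hypothetical names, and it assumes rows of the form `rank: instances bytes className` as in the listings above. Because the healthy-node listing in this report is truncated, a class missing from it is treated as 0 here, which may overstate the growth for that class.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HistoDiff {
    /** Parses "rank: instances bytes className ..." rows into className -> instances. */
    static Map<String, Long> instancesByClass(List<String> rows) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String row : rows) {
            String[] f = row.trim().split("\\s+");
            // Expect at least: "7:", instances, bytes, class name.
            if (f.length < 4 || !f[0].endsWith(":")) {
                continue; // header, separator, or unrelated line
            }
            try {
                counts.put(f[3], Long.parseLong(f[1]));
            } catch (NumberFormatException e) {
                // Malformed instance count: skip the row.
            }
        }
        return counts;
    }

    /** Per-class instance growth (bad minus healthy); missing classes count as 0. */
    static Map<String, Long> growth(Map<String, Long> bad, Map<String, Long> healthy) {
        Map<String, Long> diff = new LinkedHashMap<>();
        bad.forEach((cls, n) -> diff.put(cls, n - healthy.getOrDefault(cls, 0L)));
        return diff;
    }

    public static void main(String[] args) {
        // A few rows copied from the two histograms in this report.
        List<String> bad = List.of(
            "4: 14831152 474596864 java.util.concurrent.CompletableFuture",
            "7: 2076515 215957560 com.baidu.bifromq.type.Message",
            "26: 1060435 50900880 io.netty.handler.codec.mqtt.MqttPublishMessage");
        List<String> healthy = List.of(
            "26: 164961 5278752 java.util.concurrent.CompletableFuture",
            "2774: 2 96 io.netty.handler.codec.mqtt.MqttPublishMessage");
        growth(instancesByClass(bad), instancesByClass(healthy))
            .forEach((cls, n) -> System.out.println(cls + " " + n));
    }
}
```

For the sample rows this reports roughly 14.7 million extra CompletableFuture instances and about 1.06 million extra MqttPublishMessage instances on the bad node.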
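We suspect the cause is in DistServerCallScheduler: a gRPC call inside the batcher timed out and blocked, so every MqttPublishMessage kept being added to Batcher.callTaskBuffers. That buffer is a ConcurrentLinkedQueue, which is unbounded.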
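To illustrate the difference a bound would make, here is a minimal sketch of a buffer that rejects new work once it is full instead of growing without limit. It is not BifroMQ code and not a proposed patch: `BoundedCallBuffer`, `submit`, and `pending` are hypothetical names, and the real scheduler may need a different rejection policy.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

class BoundedCallBuffer {
    // Unlike ConcurrentLinkedQueue, LinkedBlockingQueue accepts a fixed capacity.
    private final LinkedBlockingQueue<CompletableFuture<Void>> buffer;

    BoundedCallBuffer(int capacity) {
        this.buffer = new LinkedBlockingQueue<>(capacity);
    }

    /** Queues a call, or fails it immediately when the buffer is already full. */
    CompletableFuture<Void> submit(String payload) {
        // The payload is ignored in this sketch; only the pending future is queued.
        CompletableFuture<Void> onDone = new CompletableFuture<>();
        if (!buffer.offer(onDone)) { // offer() returns false instead of blocking
            onDone.completeExceptionally(new IllegalStateException("buffer full"));
        }
        return onDone;
    }

    int pending() {
        return buffer.size();
    }

    public static void main(String[] args) {
        BoundedCallBuffer calls = new BoundedCallBuffer(2);
        calls.submit("a");
        calls.submit("b");
        CompletableFuture<Void> third = calls.submit("c");
        System.out.println(calls.pending());                 // prints 2
        System.out.println(third.isCompletedExceptionally()); // prints true
    }
}
```

With a bound like this, a stalled downstream call would surface as rejected publishes rather than as heap growth, which trades message loss or client backpressure for node stability.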
Environment
- Version: [3.2.1]
- JVM Version: [OpenJDK17, startup options -Xms12g -Xmx12g -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=12g]
- Hardware Spec: [15c32g, 4 nodes]
- OS: [Tencent Cloud OS]