Description
vSomeip Version
v3.4.10
Boost Version
1.82
Environment
Android and QNX
Describe the bug
My automotive system has a *.fidl with ~3500 attributes, one per CAN signal. My *.fdepl maps each attribute to a unique EventGroup.
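For context, here is a rough sketch of what that deployment reduces to on the client side in plain vsomeip API terms. The service, instance, event, and eventgroup IDs are placeholders, as is the 1:1 event-to-eventgroup numbering; this only illustrates that ~3500 separate subscriptions get negotiated over SOME/IP-SD.

```cpp
#include <vsomeip/vsomeip.hpp>
#include <cstdint>

int main() {
    auto app = vsomeip::runtime::get()->create_application("signal_client");
    if (!app->init())
        return 1;

    const vsomeip::service_t  its_service  = 0x1234;  // placeholder IDs
    const vsomeip::instance_t its_instance = 0x0001;

    app->register_availability_handler(its_service, its_instance,
        [&](vsomeip::service_t, vsomeip::instance_t, bool _is_available) {
            if (!_is_available)
                return;
            // One field event per CAN signal, each mapped to its own
            // eventgroup, mirroring the *.fdepl mapping described above.
            for (std::uint16_t i = 0; i < 3500; ++i) {
                const vsomeip::event_t      its_event      = 0x8000 + i;
                const vsomeip::eventgroup_t its_eventgroup = 0x0001 + i;
                app->request_event(its_service, its_instance, its_event,
                                   {its_eventgroup},
                                   vsomeip::event_type_e::ET_FIELD,
                                   vsomeip::reliability_type_e::RT_RELIABLE);
                // Each subscribe() becomes its own SubscribeEventgroup entry
                // in SOME/IP-SD, so ~3500 subscriptions are (re)negotiated.
                app->subscribe(its_service, its_instance, its_eventgroup);
            }
        });

    app->request_service(its_service, its_instance);
    app->start();
    return 0;
}
```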
Especially when resuming from suspend-to-RAM, it is possible that UDP SOME/IP-SD is operational while the TCP socket is broken. This leads to a tce (tcp_client_endpoint) restart(), and during this time any Subscribe receives a SubscribeNack in response:
```
4191 105.781314 10.6.0.10 10.6.0.3 SOME/IP-SD 1408 SOME/IP Service Discovery Protocol [Subscribe]
4192 105.790868 10.6.0.3 10.6.0.10 SOME/IP-SD 1396 SOME/IP Service Discovery Protocol [SubscribeNack]
4193 105.792094 10.6.0.10 10.6.0.3 SOME/IP-SD 1410 SOME/IP Service Discovery Protocol [Subscribe]
4194 105.801525 10.6.0.10 10.6.0.3 SOME/IP-SD 1410 SOME/IP Service Discovery Protocol [Subscribe]
4195 105.802118 10.6.0.3 10.6.0.10 SOME/IP-SD 1398 SOME/IP Service Discovery Protocol [SubscribeNack]
4196 105.819610 10.6.0.3 10.6.0.10 SOME/IP-SD 1398 SOME/IP Service Discovery Protocol [SubscribeNack]
```
As the number of EventGroups scales up, this becomes catastrophic for performance.
In service_discovery_impl::handle_eventgroup_subscription_nack(), each NACKed EventGroup triggers a restart():
vsomeip/implementation/service_discovery/src/service_discovery_impl.cpp
Lines 2517 to 2521 in cf49723

```cpp
if (!its_subscription->is_selective()) {
    auto its_reliable = its_subscription->get_endpoint(true);
    if (its_reliable)
        its_reliable->restart();
}
```
and in tcp_client_endpoint_impl::restart(), while the endpoint is still ::CONNECTING, the code will "early terminate" at most 5 restarts:
vsomeip/implementation/endpoints/src/tcp_client_endpoint_impl.cpp
Lines 77 to 85 in cf49723

```cpp
if (!_force && self->state_ == cei_state_e::CONNECTING) {
    std::chrono::steady_clock::time_point its_current
        = std::chrono::steady_clock::now();
    std::int64_t its_connect_duration = std::chrono::duration_cast<std::chrono::milliseconds>(
            its_current - self->connect_timepoint_).count();
    if (self->aborted_restart_count_ < self->tcp_restart_aborts_max_
            && its_connect_duration < self->tcp_connect_time_max_) {
        self->aborted_restart_count_++;
        return;
```
Thereafter the code falls through, calls shutdown_and_close_socket_unlocked(), and performs the full restart even while a connection attempt is in progress.
As the system continues processing thousands of SubscribeNacks, this becomes a tight loop at 100% CPU load that takes multiple seconds to plow through the workload. This can easily exceed a 2s Service Discovery interval and cascade into further problems.
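To make the amplification concrete, here is a small self-contained model of the behaviour described above. It is not vsomeip code: the constant tcp_restart_aborts_max_ = 5 and the omission of the connect-time window are simplifying assumptions taken from the snippet and the defaults as I understand them.

```cpp
#include <cstdint>
#include <iostream>

// Toy model of tcp_client_endpoint_impl::restart() under a burst of
// SubscribeNacks. Constants and state names mirror the quoted code; treat
// the exact values as assumptions.
struct model_endpoint {
    enum class state_e { CONNECTING, CONNECTED };

    state_e       state_                 = state_e::CONNECTING;
    std::uint32_t aborted_restart_count_ = 0;
    std::uint32_t full_restarts_         = 0;

    static constexpr std::uint32_t tcp_restart_aborts_max_ = 5;

    void restart(bool _force = false) {
        // Same shape as the quoted guard: while CONNECTING (and still inside
        // the connect window, omitted here), at most 5 restarts are absorbed.
        if (!_force && state_ == state_e::CONNECTING
                && aborted_restart_count_ < tcp_restart_aborts_max_) {
            ++aborted_restart_count_;
            return;
        }
        // Fall-through: shutdown_and_close_socket_unlocked() + reconnect,
        // even though a connect is already in progress.
        ++full_restarts_;
        state_ = state_e::CONNECTING;
    }
};

int main() {
    model_endpoint its_endpoint;

    // One SubscribeNack per EventGroup, as in handle_eventgroup_subscription_nack().
    constexpr int its_eventgroups = 3500;
    for (int i = 0; i < its_eventgroups; ++i)
        its_endpoint.restart();

    std::cout << "NACKs handled:     " << its_eventgroups << "\n"
              << "restarts absorbed: " << its_endpoint.aborted_restart_count_ << "\n"
              << "full restarts:     " << its_endpoint.full_restarts_ << "\n";
    // With 3500 eventgroups this prints 5 absorbed and 3495 full
    // shutdown/reconnect cycles for a single broken socket.
    return 0;
}
```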
Reproduction Steps
My reproduction was:
- start with fully established communication between tse (the TCP server endpoint on the service side) and tce (the TCP client endpoint on the subscriber side)
- tce enters suspend-to-RAM with the TCP socket established
- allow tse to keep running until it exceeds the TCP keepalive timeout and closes the TCP socket
- tce resumes from suspend-to-RAM believing the TCP socket is still established, then discovers it has been closed
However, any use case in which tse closes the TCP socket while UDP remains functional should be sufficient.
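To observe the burst from the application side during reproduction, a subscription status handler can count rejected subscriptions after resume. This is a hedged sketch: the handler signature and the meaning of a non-zero status code follow my reading of vsomeip's application.hpp, the IDs are placeholders, and it assumes the request_event()/subscribe() calls from the earlier sketch run in the same application.

```cpp
#include <vsomeip/vsomeip.hpp>
#include <atomic>
#include <cstdint>
#include <iostream>

int main() {
    auto app = vsomeip::runtime::get()->create_application("nack_counter");
    if (!app->init())
        return 1;

    std::atomic<std::uint32_t> its_nacks{0};

    const vsomeip::service_t  its_service  = 0x1234;  // placeholder IDs
    const vsomeip::instance_t its_instance = 0x0001;

    // One handler per eventgroup, matching the ~3500 per-attribute eventgroups.
    for (std::uint16_t i = 0; i < 3500; ++i) {
        const vsomeip::eventgroup_t its_eventgroup = 0x0001 + i;
        app->register_subscription_status_handler(
            its_service, its_instance, its_eventgroup, vsomeip::ANY_EVENT,
            [&its_nacks](vsomeip::service_t, vsomeip::instance_t,
                         vsomeip::eventgroup_t _eventgroup,
                         vsomeip::event_t, std::uint16_t _error) {
                if (_error != 0x00) {  // non-zero: subscription was not accepted
                    auto n = ++its_nacks;
                    if (n % 500 == 0)
                        std::cout << n << " rejected subscriptions so far, eventgroup 0x"
                                  << std::hex << _eventgroup << std::dec << std::endl;
                }
            });
    }

    app->start();
    return 0;
}
```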
Expected behaviour
Thousands of SubscribeNacks against the same broken TCP connection should not each trigger a full endpoint restart. The restarts should be coalesced, or skipped while a reconnect is already in progress, so that CPU load stays bounded and recovery completes well within a Service Discovery cycle.
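One direction that would achieve this, sketched below as a toy model rather than vsomeip code: coalesce the per-EventGroup restart calls so that a burst of SubscribeNacks sharing the same reliable endpoint triggers at most one restart(). The fake_* types are stand-ins invented for illustration; only the is_selective()/get_endpoint(true)/restart() shape is taken from the quoted handler.

```cpp
#include <iostream>
#include <memory>
#include <set>
#include <vector>

// Stand-in for the reliable TCP client endpoint.
struct fake_endpoint {
    int restarts = 0;
    void restart() { ++restarts; }
};

// Stand-in for an eventgroup subscription that shares one reliable endpoint.
struct fake_subscription {
    std::shared_ptr<fake_endpoint> reliable_;
    bool is_selective() const { return false; }
    std::shared_ptr<fake_endpoint> get_endpoint(bool _reliable) const {
        return _reliable ? reliable_ : nullptr;
    }
};

int main() {
    auto its_endpoint = std::make_shared<fake_endpoint>();
    // 3500 NACKed eventgroup subscriptions, all backed by the same endpoint.
    std::vector<fake_subscription> its_nacked(3500, fake_subscription{its_endpoint});

    std::set<std::shared_ptr<fake_endpoint>> its_restarted;
    for (const auto &its_subscription : its_nacked) {
        if (!its_subscription.is_selective()) {
            auto its_reliable = its_subscription.get_endpoint(true);
            // insert().second is true only the first time this endpoint is seen,
            // so the whole burst causes a single restart().
            if (its_reliable && its_restarted.insert(its_reliable).second)
                its_reliable->restart();
        }
    }
    std::cout << "restart() calls: " << its_endpoint->restarts << "\n";  // prints 1
    return 0;
}
```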
Logs and Screenshots
No response