Description
Triggers are sometimes not delivered for some Astarte realms.
This happens at least on Astarte v1.2.
Upon inspection in the RabbitMQ management UI, no AMQP consumer is attached to the queues corresponding to the affected realms.
We then inspected the Trigger Engine instance from a remote IEx shell.
We found that an `AMQPMessageConsumer` GenServer was correctly running for each existing realm.
Inspecting each process with `Process.info(pid)`, `:sys.get_status(pid)` and `:sys.get_state(pid)`, we found that the processes of the affected realms had a `nil` channel in their state, indicating that an error occurred during the connection phase.
Unfortunately, the GenServer code does not log any error when it fails to connect to AMQP. However, we noticed that each GenServer calls `ExRabbitPool.checkout_channel` to acquire a channel from the pool, but never calls `ExRabbitPool.checkin_channel` to release it. This may be intentional, since the consumer needs the channel to send ACKs, but it becomes a problem when the number of channels is limited.
We confirmed the suspicion by looking at how Trigger Engine configures ExRabbitPool. By default, it opens 10 connections with 10 channels per connection.
This allows a total of 100 channels, and indeed only 100 `AMQPMessageConsumer` GenServers have a channel checked out in their state and a corresponding AMQP consumer in RabbitMQ.
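The exhaustion mechanism can be reproduced with a small toy model. This is a Python sketch, not Astarte or ExRabbitPool code: `BoundedChannelPool`, `checkout_channel`, `checkin_channel` and `start_realm_consumers` are hypothetical names, and the model assumes that each realm consumer takes exactly one channel and holds it for its whole lifetime. With the default 10 connections * 10 channels and the 145 realms mentioned in this issue, 45 consumers end up without a channel.

```python
from dataclasses import dataclass, field


@dataclass
class BoundedChannelPool:
    """Toy pool (hypothetical, not the ExRabbitPool API): a fixed number of
    connections, each offering a fixed number of channels."""
    connections: int
    channels_per_connection: int
    checked_out: set = field(default_factory=set)

    @property
    def capacity(self) -> int:
        return self.connections * self.channels_per_connection

    def checkout_channel(self):
        """Return a free (connection, channel) slot, or None if exhausted."""
        for conn in range(self.connections):
            for chan in range(self.channels_per_connection):
                slot = (conn, chan)
                if slot not in self.checked_out:
                    self.checked_out.add(slot)
                    return slot
        return None

    def checkin_channel(self, slot) -> None:
        """Release a slot so that another caller can check it out."""
        self.checked_out.discard(slot)


def start_realm_consumers(pool, realms):
    """Each per-realm consumer checks out one channel and never checks it
    back in; consumers that find the pool exhausted keep None."""
    return {realm: pool.checkout_channel() for realm in realms}


pool = BoundedChannelPool(connections=10, channels_per_connection=10)
realms = [f"realm{i}" for i in range(145)]
state = start_realm_consumers(pool, realms)

with_channel = [r for r, ch in state.items() if ch is not None]
without_channel = [r for r, ch in state.items() if ch is None]
print(pool.capacity, len(with_channel), len(without_channel))  # 100 100 45
```

Calling `checkin_channel` would free a slot, but a long-lived consumer cannot simply do that while it still needs its channel, so the pool size effectively caps the number of realms that get a consumer.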
Since we were working with an Astarte instance deployed with the Astarte Kubernetes Operator, we edited our Astarte resource with `kubectl edit astarte -n astarte` and added the following environment variable for Trigger Engine:
```yaml
apiVersion: api.astarte-platform.org/v1alpha3
kind: Astarte
spec:
  components:
    triggerEngine:
      additionalEnv:
        - name: ASTARTE_TRIGGER_ENGINE_EVENTS_CONSUMER_CONNECTION_NUMBER
          value: "50"
```

This way, the new Trigger Engine instance had 50 connections * 10 channels, for a total of 500 channels available to the `AMQPMessageConsumer` processes.
We had 145 existing realms, well within the 500 available channels, and after the change we confirmed that all of them correctly set up an AMQP consumer and that trigger delivery was working.
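Sizing the static workaround comes down to simple arithmetic. The Python sketch below uses hypothetical helper names and assumes the default of 10 channels per connection reported above, one long-lived channel per realm, and no other channel usage in Trigger Engine; it computes the smallest value for `ASTARTE_TRIGGER_ENGINE_EVENTS_CONSUMER_CONNECTION_NUMBER` that covers a given realm count.

```python
def min_connection_number(realm_count: int, channels_per_connection: int = 10) -> int:
    """Smallest connection count giving every realm consumer one channel.

    Assumes one long-lived channel per realm and ignores any other channels
    the service might need (hypothetical helper, not part of Astarte).
    """
    if channels_per_connection <= 0:
        raise ValueError("channels_per_connection must be positive")
    # Integer ceiling division: ceil(realm_count / channels_per_connection).
    return -(-realm_count // channels_per_connection)


def has_enough_channels(realm_count: int, connections: int,
                        channels_per_connection: int = 10) -> bool:
    """True if connections * channels_per_connection covers every realm."""
    return connections * channels_per_connection >= realm_count


print(min_connection_number(145))    # 15
print(has_enough_channels(145, 10))  # False: 10 * 10 = 100 < 145
print(has_enough_channels(145, 50))  # True: 50 * 10 = 500 >= 145
```

The value has to be recomputed whenever realms are added, which is exactly the maintenance burden of a static configuration.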
This approach works as a quick fix, but:
- it requires knowing in advance how many realms there are
- it requires configuring variables that do not obviously relate to the realm count, so it is easy to fall into this problem inadvertently
- the configuration is static, so it requires continuous manual maintenance if the realm count grows
A dynamic approach could be discussed instead: keep a fixed number of connections, and separate a small number of consumer processes, which hold the AMQP channels, from per-realm processor processes to which they forward messages. See also https://github.com/secomind/mississippi
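To illustrate the proposed split, here is a minimal Python sketch of the idea. It is not Astarte code and not how mississippi is implemented. `Consumer`, `RealmProcessor` and `run` are hypothetical names. The shared `registry` dict stands in for what would presumably be a process registry in Elixir, and round-robin delivery stands in for the broker. The point is that the number of channels equals the number of consumers, independent of the realm count.

```python
from itertools import cycle


class RealmProcessor:
    """Hypothetical per-realm worker: holds no AMQP channel itself."""

    def __init__(self, realm):
        self.realm = realm
        self.handled = []

    def handle(self, payload):
        self.handled.append(payload)


class Consumer:
    """Hypothetical consumer: owns exactly one channel and forwards every
    message it receives to the processor of the message's realm."""

    def __init__(self, channel_id, registry):
        self.channel_id = channel_id
        self.registry = registry

    def on_message(self, realm, payload):
        # Create the realm processor lazily on the first message for that realm.
        processor = self.registry.get(realm)
        if processor is None:
            processor = self.registry[realm] = RealmProcessor(realm)
        processor.handle(payload)


def run(realm_count, consumer_count):
    registry = {}
    consumers = [Consumer(i, registry) for i in range(consumer_count)]
    messages = [(f"realm{i}", f"event{i}") for i in range(realm_count)]
    # Round-robin stands in for the broker spreading deliveries across consumers.
    for consumer, (realm, payload) in zip(cycle(consumers), messages):
        consumer.on_message(realm, payload)
    return consumers, registry


consumers, registry = run(realm_count=145, consumer_count=4)
print(len(consumers), len(registry))  # 4 145
```

Only four channels are in use here, yet all 145 realms get a processor. The sketch does not address whether per-realm message ordering must be preserved when messages of one realm reach different consumers; a real design would need to decide that.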