Writing a message-passing library and some of the things I learned. In multiple installments.
Four years ago I decided we need something like libchirp. I wanted a core that is safe, light-weight, high-performance and as portable as possible. I had a polyglot cloud-software toolkit in mind that bridges all language barriers in software. I was inspired by what I learned about Erlang. I thought that by now we would have many daemons, bindings and upper-layer protocols, but in fact I have an awesome C99 implementation and good bindings for Python. I have created the foundation.
I wanted to have two conflicting properties on more than one occasion. For example:
For performance, memory-safety and protection against spamming a peer, I designed libchirp around a fixed-size message-queue. It allocates the message-queue when a connection opens and then doesn't have to allocate anything during operation. This means existing connections stay operational in low-memory situations. Generally, libchirp handles a failing malloc() gracefully.
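To make the fixed-size idea concrete, here is a minimal sketch in Python. It is not libchirp's actual C code or API: the class name SlotPool and its methods are made up for illustration, and Python itself allocates behind the scenes, so only the shape of the design carries over. All buffers are created once, and a full queue makes the sender wait instead of allocating more.

```python
class SlotPool:
    """Hypothetical fixed-size pool of message slots (not libchirp's real API)."""

    def __init__(self, slot_count, slot_size):
        # Everything is allocated once, when the connection opens.
        self._buffers = [bytearray(slot_size) for _ in range(slot_count)]
        self._free = list(range(slot_count))

    def acquire(self):
        """Return the index of a free slot, or None if the queue is full."""
        if not self._free:
            return None  # The sender has to wait; nothing new is allocated.
        return self._free.pop()

    def buffer(self, index):
        """Return the preallocated buffer that belongs to a slot."""
        return self._buffers[index]

    def release(self, index):
        """Give a slot back to the pool once its message is fully handled."""
        self._free.append(index)
```

A peer that floods such a connection can only fill the slots that already exist, which is where the protection against spamming comes from in this sketch.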
Now, if all peers are synchronous, you can end up in a lock-up situation.
I had about five plans to remedy the situation, but everything I tried only moved the problem farther away. I do randomized testing using hypothesis, and it was always able to find new lock-up situations. Yes, hypothesis rules! Other plans would have violated my fixed-memory property and opened the door to DoS attacks. You just can't have both.
Since I value the performance, memory-safety and DoS-prevention properties of libchirp, I pondered whether it is indispensable for all peers to be synchronous. Synchronous in this context means you can't lose a message without an error (an exception in Python).
It turns out that most of the time an asymmetric approach is absolutely sufficient, and if it isn't, you can always use timeout-based bookkeeping of messages/requests/responses.
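Timeout-based bookkeeping could look roughly like the following Python sketch. This is an assumption about how an application on top of libchirp might do it, not something libchirp provides: PendingRequests and its methods are hypothetical names. Each request gets a deadline when it is sent, and anything not answered by then is reported as lost.

```python
import time


class PendingRequests:
    """Hypothetical tracker for requests that still await a response."""

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock  # Injectable so the logic can be tested.
        self._deadlines = {}  # request id -> deadline

    def sent(self, request_id):
        """Remember a request and the time by which it must be answered."""
        self._deadlines[request_id] = self._clock() + self._timeout

    def answered(self, request_id):
        """Mark a request as answered; return True if it was still pending."""
        return self._deadlines.pop(request_id, None) is not None

    def expired(self):
        """Remove and return the ids of all requests whose deadline passed."""
        now = self._clock()
        late = [rid for rid, deadline in self._deadlines.items() if deadline <= now]
        for rid in late:
            del self._deadlines[rid]
        return late
```

Calling expired() periodically, for example from a timer, lets the application treat the returned ids as errors without requiring every peer to be synchronous.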
Rule of thumb:
If the producer requests an acknowledge (it is synchronous), then the consumer signals that it has finished the job after sending the response. So the response is put on the wire before the acknowledge. Therefore, if the producer reads the acknowledge and there is no response, it is clearly an error.
If there is no response needed, an acknowledge means that the consumer has done its work, for example, committed the data to a database. So for the producer, no error means the transaction was successful.
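The two rules above can be sketched from the producer's point of view. This is illustrative Python, not libchirp's API: Producer, on_response, on_acknowledge and MissingResponseError are hypothetical names. The sketch relies on the ordering described above, where the response is on the wire before the acknowledge.

```python
class MissingResponseError(Exception):
    """Raised when an acknowledge arrives but the expected response did not."""


class Producer:
    """Hypothetical producer-side bookkeeping for responses and acknowledges."""

    def __init__(self):
        self._responses = {}  # request id -> response payload

    def on_response(self, request_id, payload):
        """Store a response until the matching acknowledge arrives."""
        self._responses[request_id] = payload

    def on_acknowledge(self, request_id, expects_response):
        """Handle the acknowledge for a request.

        Without an expected response, the acknowledge alone means the
        consumer has done its work. Otherwise the response must already
        be here, because it was sent before the acknowledge.
        """
        if not expects_response:
            return None
        try:
            return self._responses.pop(request_id)
        except KeyError:
            raise MissingResponseError(request_id) from None
```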
You might ask, why can't we release the slot of the message that triggered the acknowledge? 1. The information about where to send the acknowledge is stored in that slot. 2. The user will get a callback when the acknowledge has been sent, and to identify the callback they need an identity, which is stored in the slot. Yeah, the memory-safety property really makes things complicated, but it's so worth it, believe me. Chirp is more or less as fast as calling only the needed syscalls, with almost no overhead. Also, it keeps on sending messages when you are out of memory. Did I mention that we wanted to use chirp for monitoring? If your server is out of memory, it will be able to tell you about it. By default libchirp will allocate more memory if the message is larger than the allocated slot, but you can disable this.