On Mon, Jul 08, 2013 at 03:39:56PM +0200, Simon Chopin wrote:
Hi,
As some of you might know, I am the student working on adapting fedmsg
for Debian as part of the Google Summer of Code program.
One of the requirements for fedmsg to be part of the Debian infrastructure
is to be resilient in case a network link drops, as we have services
distributed all over the world. Currently, if a client drops out, it has
no way of catching up on what happened while it was offline.
To solve this, I was thinking of the following: all the endpoints that
must be able to replay some messages should provide two URLs, say
tcp://foo.bar:3000 and tcp+pair://foo.bar:3001, the latter listening
for PAIR-type[1] connections. The clients on the simple URL are like the
current clients, but the PAIR socket allows the other clients to request
the missing messages.
The query would come on the $prefix.replay.$topic topic (say,
org.fedoraproject.dev.replay.buildsys.build.state.change), and specify
the IDs to resend, or a time interval (for manual queries), and the
answer(s) would come on the same topic.
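
For illustration, a minimal sketch of how such a replay query could be
built. The function name build_replay_request and the body keys "ids",
"start" and "end" are assumptions made up for this example; the proposal
above only fixes the $prefix.replay.$topic naming, not a wire format.

```python
import json


def build_replay_request(prefix, topic, ids=None, interval=None):
    """Return (replay_topic, json_body) for a hypothetical replay query.

    Exactly one of `ids` (iterable of message IDs) or `interval`
    (a (start, end) pair of timestamps) must be given.
    """
    if (ids is None) == (interval is None):
        raise ValueError("specify exactly one of ids or interval")

    # Topic naming follows the $prefix.replay.$topic scheme from the proposal.
    replay_topic = "%s.replay.%s" % (prefix, topic)

    if ids is not None:
        body = {"ids": list(ids)}          # assumed key name
    else:
        start, end = interval
        body = {"start": start, "end": end}  # assumed key names

    return replay_topic, json.dumps(body, sort_keys=True)
```

With prefix "org.fedoraproject.dev" and topic "buildsys.build.state.change",
this yields the example topic quoted above.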
To be able to detect a missing message, the "i" field would have to be
topic-bound instead of being at the endpoint level.
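
A rough sketch of the gap detection this would enable on the client side,
assuming "i" becomes a per-topic counter that increases by one with each
message. The GapDetector class is hypothetical and not part of fedmsg;
messages are taken to be dicts carrying "topic" and "i" keys.

```python
class GapDetector:
    """Remember the last "i" seen per topic and report skipped values."""

    def __init__(self):
        self.last_seen = {}  # topic -> highest "i" observed so far

    def observe(self, msg):
        """Record a message and return the list of missing "i" values.

        Assumes "i" is topic-bound and increments by exactly one.
        The first message seen on a topic cannot reveal earlier gaps.
        """
        topic, i = msg["topic"], msg["i"]
        last = self.last_seen.get(topic)

        missing = []
        if last is not None and i > last + 1:
            missing = list(range(last + 1, i))

        # Ignore duplicates and late arrivals when updating the high mark.
        if last is None or i > last:
            self.last_seen[topic] = i
        return missing
```

The returned list is what a client would then ask for over the replay
channel.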
Thoughts?
Cheers,
Simon
[1]
https://learning-0mq-with-pyzmq.readthedocs.org/en/latest/pyzmq/patterns/...
Hi Simon, thanks for taking this up.
I like the idea of the special replay topic. That makes for a pretty
clean API for requesting replay of messages. FWIW, a patch was just
introduced in git that adds a "uuid" field to every message in
addition to "i". That could be used to request specific messages.
One problem I see is in the implementation details. How long is an
endpoint expected to hold on to its old messages before discarding
them? Currently, an application that gets a fedmsg hook added doesn't
retain much extra state as a result, whereas this replay-request
proposal would require a bookkeeping mechanism to be added to every
endpoint (in our case, every mod_wsgi/httpd process, among others).
Have you considered using the datagrepper API
(https://apps.fedoraproject.org/datagrepper) for a replay mechanism?
Although we haven't
implemented it in practice yet, I have been anticipating using that
more in the future. I.e., if a consumer crashes and comes back
online, it could request a list of every message during that timespan
from the central store.
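
As a sketch of what that catch-up query might look like, the snippet below
only builds the URL and makes no network request. It assumes datagrepper's
/raw endpoint with "start", "end" (seconds since the epoch) and "topic"
query parameters; the helper name replay_url is made up for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint: the /raw view under the datagrepper URL given above.
DATAGREPPER_RAW = "https://apps.fedoraproject.org/datagrepper/raw"


def replay_url(start, end, topic=None, base=DATAGREPPER_RAW):
    """Build a query URL for messages sent between `start` and `end`.

    `start` and `end` are Unix timestamps covering the time the
    consumer was offline; `topic` optionally narrows the result.
    """
    params = [("start", int(start)), ("end", int(end))]
    if topic is not None:
        params.append(("topic", topic))
    return base + "?" + urlencode(params)
```

A consumer coming back online would record the timestamp of the last
message it processed, use it as start, and page through the results.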
Cheers-
-Ralph