> Anyway, to summarize, I really really want this to be super easy to use
and just work. I hope we can improve it further and I'd love to hear
your thoughts. Do you think my problem statements and design goals are
reasonable? Given those, do you still feel like sending the schema along
is worthwhile?
I actually no longer think it is worthwhile.
> As a consumer, I can validate the JSON in a message matches the JSON schema
in the same message, but what does that get me? It doesn't seem any
different (on the consumer side) than just parsing the JSON outright and
trying to access whatever deserialized object I get.
I completely agree with this.
Let's go through the problems you mentioned:
1. Make catching accidental schema changes as a publisher easy.
We can solve this simply by registering the schema with the publisher
before any content gets published. Based on that schema, the publisher
instance can check that the content it is about to send conforms, which
would catch some bugs before the content actually goes out. If we require
this check on the publisher side, there is no reason to send the schema
alongside the content: the check has already been done, so the consumer
knows the message is alright when it receives it. What should be sent
instead is a schema ID, e.g. just a natural number. The schema ID can
then be used to version the schema, which would be available somewhere
public, e.g. in the service docs, the same way GitHub/GitLab/etc. publish
the structures of their webhook messages. It would basically be part of
the public API of a service.
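To make the idea concrete, here is a minimal sketch of such a publisher-side check. The `Publisher` class, the field-name-to-type schema format, and `SchemaError` are all my own invention for illustration, not an existing API:

```python
import json


class SchemaError(Exception):
    """Raised when content does not conform to its registered schema."""


class Publisher:
    def __init__(self):
        # schema ID (a natural number) -> {field name: expected type}
        self._schemas = {}

    def register_schema(self, schema_id, fields):
        self._schemas[schema_id] = fields

    def publish(self, schema_id, content):
        # Check conformance before anything leaves the process.
        fields = self._schemas[schema_id]
        for name, expected in fields.items():
            if name not in content:
                raise SchemaError("missing field %r" % name)
            if not isinstance(content[name], expected):
                raise SchemaError("field %r is not a %s"
                                  % (name, expected.__name__))
        # Only the schema ID travels with the content, not the schema.
        return json.dumps({"schema_id": schema_id, "body": content})


pub = Publisher()
pub.register_schema(1, {"package": str, "build_id": int})
wire = pub.publish(1, {"package": "copr-cli", "build_id": 42})
```

The point of the sketch is just that a buggy message (a missing field, a wrong type) blows up in the publisher's own process, before anyone downstream sees it.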
2. Make catching mis-behaving publishers on the consuming side easy.
With the check on the publisher side, this shouldn't be necessary.
If someone somehow bypasses the publisher check, at worst the message
won't be parsable, depending on how it is parsed. Anyone who really
wants to make sure a message is what it is supposed to be can integrate
the schema published on the service site into their parsing logic, but
I don't think that's necessary (I personally wouldn't do it in my code).
3. Make changing the schema a painless process for publishers and
consumers.
I think the only way to do this is to send both content versions simultaneously
for some time, each message marked with its schema ID. It would be good
if consumers always specified which schema ID they want to consume.
If a message carries a higher schema ID than the one requested, a warning
could be printed, maybe even to syslog, so that consumers get the
information. At the same time, the change should be communicated on the
service site or by other means. I don't think it is possible to do any
better than this.
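For illustration, the consumer-side pinning could be as small as this. The envelope layout and the function name are assumptions on my part, not an existing interface:

```python
import json
import warnings


def consume(raw_message, wanted_schema_id):
    """Parse a message, warning if a newer schema is being published."""
    envelope = json.loads(raw_message)
    if envelope["schema_id"] > wanted_schema_id:
        # A real framework could route this to syslog instead.
        warnings.warn("schema %d is available, you still consume schema %d"
                      % (envelope["schema_id"], wanted_schema_id))
    return envelope["body"]
```

The consumer keeps working on the old schema through the transition period, but the warning tells its operators that a migration is pending.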
I fail to see the point of packaging the schemas.
If the message content is JSON, then after receiving the message I would
like to be able to just call json.loads(msg) and work with the resulting
structure as I am used to.
Actually, what I would do in Python is turn it into a munch and work
with that. Needing to install an additional package and instantiate
high-level objects just seems clumsy in comparison.
In other languages the procedure would be pretty much the same, I believe,
as they all presumably provide some JSON implementation.
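That consumer path really is this short. munch is a third-party package, so the `Bunch` class below is a minimal stand-in I wrote with the same attribute-style access:

```python
import json


class Bunch(dict):
    """Minimal stand-in for the third-party munch package:
    a dict whose keys are also readable as attributes."""
    __getattr__ = dict.__getitem__


def loads(text):
    # object_hook turns every JSON object into a Bunch, recursively.
    return json.loads(text, object_hook=Bunch)


msg = loads('{"package": "copr-cli", "owner": {"name": "alice"}}')
```

After that, `msg.package` and `msg.owner.name` just work, with nothing but the standard library involved.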
You mentioned:
> In the current proposal, consumers don't interact with the JSON at all,
but with a higher-level Python API that gives publishers flexibility
when altering their on-the-wire format.
Yes, but with the current proposal, if I change the on-the-wire format,
I need to make a new version of the schema, package it, somehow get it
to consumers, and make them use the correct version, which parses the
new on-the-wire format and translates it to what the consumers are used
to consuming. That seems like something very difficult to get done.
And I don't quite see the point. I wouldn't alter the on-the-wire
format if it isn't what users actually work with and if I had to go
through all the steps described above.
If I need to alter the on-the-wire format because the application logic
has changed, then I would want to change the high-level API as well,
so again there is no gain, just more work packaging new schemas.
My main point here is that packaging the schemas to provide high-level
objects seems redundant. I think lots of people would welcome working
with something really simple that the language's standard library
already provides.
For Python: if I had to install and import just a single messaging
library, tell it which hub, topic, and schema ID I want to listen to,
and then consume the incoming messages immediately as munches, I would
be super happy.
Actually, the schema ID might be redundant as well: it could somehow be
made part of the topic, in which case the producer would publish the
content twice on a schema change, at least for some time. A
"deprecated by <topic>" flag on incoming messages would then be nice.
Of course, the producer would need to register the two schemas and mark
one of them as deprecated; the framework would then send the two
messages simultaneously on its behalf. This might be an even easier
solution to the problem. The exact publisher (producer) interface would
need to be thought through.
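One possible shape for that interface, as a rough sketch. Every name here (the `Framework` class, the topic names, the `sent` list standing in for the actual broker) is hypothetical:

```python
import json


class Framework:
    def __init__(self):
        self._deprecated = {}  # old topic -> replacement topic
        self.sent = []         # (topic, wire message); stands in for the broker

    def deprecate(self, old_topic, new_topic):
        self._deprecated[old_topic] = new_topic

    def publish(self, bodies):
        """bodies maps each topic to its own version of the content;
        during a transition the producer supplies both versions and the
        framework sends both messages, flagging the deprecated one."""
        for topic, body in bodies.items():
            envelope = {"body": body}
            if topic in self._deprecated:
                envelope["deprecated_by"] = self._deprecated[topic]
            self.sent.append((topic, json.dumps(envelope)))


fw = Framework()
fw.deprecate("copr.build.end", "copr.build.end.v2")
fw.publish({
    "copr.build.end": {"what": "build 42 of copr-cli ended"},
    "copr.build.end.v2": {"package": "copr-cli", "build_id": 42},
})
```

The producer calls publish() once; the framework fans it out to both topics and stamps only the old-topic message with the deprecation pointer.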
> The big problem is that right now the majority of messages are not
formatted in a way that makes sense and really need to be changed to be
simple, flat structures that contain the information services need and
nothing they don't. I'd like to get those fixed in a way that doesn't
require massive coordinated changes in apps.
In Copr, for example, we take this as an opportunity to change our
format. If the messaging framework supports format deprecation, we
might go that way as well to avoid a sudden change. But we don't
currently have many (or maybe any) consumers, so I am not sure it is
necessary for us.
I am not familiar with protocol buffers, but they seem useful mainly
when you want to send the content in a compact binary form to save as
much space as possible. If we send content that can already be
interpreted as JSON, then building higher-level classes and objects on
top of it seems unnecessary.
I think we could really just take the already existing generic framework
you were talking about (RabbitMQ?) and make sure that we can check the
content against message schemas on the producer side (which is great
for catching little bugs), that we know how a message format gets
deprecated (e.g. the messaging framework adding a
"deprecated_by: <topic>" field to each message and logging warnings on
the consumer side), and that the framework can automatically transform
messages into language-native structures: in Python, munches would
probably be the sexiest.
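Put together, the consumer side of such a framework could be as small as this. Again only a sketch: `Bunch` stands in for munch, and the envelope fields are the ones assumed above, not an existing wire format:

```python
import json
import logging

log = logging.getLogger("messaging")


class Bunch(dict):
    """dict with attribute access, a stand-in for the munch package."""
    __getattr__ = dict.__getitem__


def consume(raw):
    """Parse a raw message, surfacing any deprecation flag it carries."""
    envelope = json.loads(raw, object_hook=Bunch)
    if "deprecated_by" in envelope:
        # The framework, not the application, reports the deprecation.
        log.warning("message format deprecated by topic %s",
                    envelope.deprecated_by)
    return envelope.body
```

The application only ever sees the munch-like body; the schema check happened at the producer and the deprecation warning is the framework's job.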
The whole "let's package schemas" thing seems like something we would
typically do (because we are packagers), but not like something that
would solve the actual problems you mentioned. If anything, it makes
them more difficult to deal with.
I think what you are doing is good, but most people will welcome fewer
dependencies and simpler language-native structures. So if we could
push the framework more in that direction, that would be great.