On 06/19/2013 09:22 PM, Jakub Hrozek wrote:
On Tue, Jun 18, 2013 at 01:51:15PM +0200, Pavel Březina wrote:
>
https://fedorahosted.org/sssd/ticket/1947
>
> A fast explanation how _srv_ expansion works. _srv_ is inserted into server
> list as so called meta server. Let us consider following configuration:
>
> *Setup*
> ipa_server = _srv_, ipa.pb
> server list will contain: meta -> ipa.pb
>
> *Expansion*
> meta -> ipa.pb:389 -> ipa.pb ; remove meta
> ipa.pb:389 -> ipa.pb ; meta
>
> *Collapse*
> remove ipa.pb:389 ; insert meta
> meta -> ipa.pb
>
> The main problem is that expanded SRV servers are marked as NEUTRAL during
> online check, but they don't collapse back into a meta server.
>
> This will trigger another SRV expansion, leaving the old server in the list
> and trying to add the servers again. This is present in both master and 1.9
> (and probably older versions), although the result is slightly different.
>
> In master, we don't insert a server into server list if it is already
> present. Because state->meta is orphaned from the previous SRV expansion,
> state->meta->next is NULL and SSSD crashes later.
>
> In 1.9, we simply insert duplicate servers. Those servers are inserted after
> orphaned state->meta, state->meta is orphaned again, leaving those servers
> globally unreachable. However, it seemingly does not affect the fail over.
> You just run into d25e7c65.
>
> Here are four patches for master, and two patches for 1.9.
One question:
> case SRV_NEUTRAL: /* Request SRV lookup */
> + if (server != NULL && server != state->meta) {
> + /* A server created by expansion of meta server was marked as
> + * neutral. We have to collapse the servers and issue new
> + * SRV resolution. */
> + state->meta = collapse_srv_lookup(&server);
> + }
> +
Wouldn't a more systematic solution be to collapse the lookup from
functions that are setting the status? Or were you afraid of race
conditions where the server might still be in use by another request?
This is currently done only in fo_reset_services() which iterates over
services and servers and sets them to neutral. I think it would be safe
to do the collapse there, I'm not entirely sure though.
However, since resolve_srv_send() allows non meta server as input
parameter, it should not expect it to have correct (!SRV_NEUTRAL) state.
It is safer this way.