On 27 Feb 2019, at 10:21, Rich Megginson <rmeggins(a)redhat.com>
wrote:
On 2/26/19 4:26 PM, William Brown wrote:
>
>> On 26 Feb 2019, at 18:32, Ludwig Krispenz <lkrispen(a)redhat.com> wrote:
>>
>> Hi, I need a bit of time to read the docs and clear my thoughts, but one comment
below
>> On 02/25/2019 01:49 AM, William Brown wrote:
>>>> On 23 Feb 2019, at 02:46, Mark Reynolds <mreynolds(a)redhat.com>
wrote:
>>>>
>>>> I want to start a brief discussion about a major problem we have with backend transaction plugins and the entry caches. I'm finding that when we get into a nested state of be txn plugins and one of the later plugins fails, then while we don't commit the disk changes (they are aborted/rolled back) we DO keep the entry cache changes!
>>>>
>>>> For example, a modrdn operation triggers the referential integrity plugin, which renames the member attribute in some group and changes that group's entry cache entry, but then later on the memberOf plugin fails for some reason. The database transaction is aborted, but the entry cache changes that the RI plugin made are still present :-( I have also found other entry cache issues with modrdn and BE TXN plugins, and we know of other currently non-reproducible entry cache crashes related to mishandling of cache entries after failed operations.
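To make the failure mode concrete, here is a minimal Python sketch (all names here are illustrative, not the actual 389-ds APIs): the backend transaction restores the on-disk state on abort, but nothing restores the cache.

```python
# Illustrative sketch of the bug: the db abort is honored, but the
# entry cache is mutated in place and never rolled back.
# Hypothetical names only -- not the real 389-ds code.

class TxnAbort(Exception):
    pass

def modrdn_with_betxn_plugins(db, entry_cache):
    db_snapshot = dict(db)  # the txn can restore this on abort
    try:
        # RI plugin: rewrites the group on disk *and* in the cache
        db["cn=group"] = "member: cn=new"
        entry_cache["cn=group"] = "member: cn=new"
        # memberOf plugin fails later inside the same backend txn
        raise TxnAbort("memberOf plugin failed")
    except TxnAbort:
        db.clear()
        db.update(db_snapshot)  # disk changes rolled back...
        # ...but nothing rolls back entry_cache

db = {"cn=group": "member: cn=old"}
cache = {"cn=group": "member: cn=old"}
modrdn_with_betxn_plugins(db, cache)
print(db["cn=group"])     # member: cn=old  (rolled back)
print(cache["cn=group"])  # member: cn=new  (stale -- the bug)
```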
>>>>
>>>> It's time to rework how we use the entry cache. We basically need a transaction-style caching mechanism - we should not commit any entry cache changes until the original operation is fully successful. Unfortunately, given the way the entry cache is currently designed and used, this will be a major change.
>>>>
>>>> William wrote up this doc:
http://www.port389.org/docs/389ds/design/cache_redesign.html
>>>>
>>>> But this also does not currently cover the nested plugin scenario (not yet). I don't know how difficult it would be to implement William's proposal, or how difficult it would be to incorporate the txn-style caching into his design. What kind of time frame could this even be implemented in? William, what are your thoughts?
>>> I like coffee? How cool are planes? My thoughts are simple :)
>>>
>>> I think there is a pretty simple mental simplification we can make here
though. Nested transactions “don’t really exist”. We just have *recursive* operations
inside of one transaction.
>>>
>>> Once reframed like that, the entire situation becomes simpler. We have one
thread in a write transaction that can have recursive/batched operations as required,
which means that either “all operations succeed” or “none do”. Really, this is the
behaviour we want anyway, and it’s the transaction model of LMDB and other kv stores that
we could consider (WiredTiger, and sled in the future).
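As a rough illustration of that model (a hypothetical sketch, not LMDB's actual API): a single write lock, a private staged copy of the data, and an atomic all-or-nothing apply, so recursive/batched operations inside one transaction either all land or none do.

```python
import threading

class SingleWriterStore:
    """One writer at a time; a write txn batches recursive operations
    and applies them all or not at all. Illustrative sketch only."""
    def __init__(self):
        self._data = {}
        self._write_lock = threading.Lock()  # single writer, like LMDB

    def write_txn(self, ops):
        with self._write_lock:
            staged = dict(self._data)   # work on a private copy
            try:
                for op in ops:          # recursive/batched operations
                    op(staged)
            except Exception:
                return False            # abort: staged copy is discarded
            self._data = staged         # commit: atomic swap
            return True

    def read(self, key):
        return self._data.get(key)

def set_a(d): d["a"] = 1
def set_b(d): d["b"] = 2
def boom(d): raise ValueError("plugin failed")

store = SingleWriterStore()
ok = store.write_txn([set_a, set_b])                       # commits
failed = store.write_txn([lambda d: d.__setitem__("a", 99), boom])
print(ok, failed, store.read("a"), store.read("b"))  # True False 1 2
```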
>> I think the recursive/nested transactions on the database level are not the problem; we do this correctly already - either all changes become persistent or none do.
>> What we do not manage is the modifications we make in parallel on in-memory structures like the entry cache. Changes to the EC are not managed by any txn, and I do not see how any of the database txn models would help - they do not know about the EC and cannot abort its changes.
>> We would need to incorporate the EC into a generic txn model, or have a way to flag EC entries as garbage if a txn is aborted.
> The issue is that we allow parallel writes, which breaks the consistency guarantees of the EC anyway. LMDB won’t allow parallel writes (it’s single-writer with concurrent parallel readers), and most other modern kv stores take this approach too, so we should be remodelling our transactions to match this IMO. It will make reasoning about the EC much, much simpler, I think.
Some sort of in-memory data structure with fast lookup and transactional semantics is needed (modify operations are stored as MVCC/COW, so each read of the database with a given txn handle sees its own view of the EC; a txn commit updates the parent txn's EC view - or the global EC view if there is no parent - from the copy; a txn abort deletes the txn's copy of the EC). A quick google search turns up several hits. I'm not sure if the B+Tree proposed at
http://www.port389.org/docs/389ds/design/cache_redesign.html has
transactional semantics, or if such code could be added to its implementation.
It does, this is a MVCC B+Tree implementation.
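A toy version of that commit/abort scheme might look like this (plain dict layers rather than a B+Tree; all names are hypothetical, not the proposed implementation):

```python
class TxnCache:
    """COW view of an entry cache. Each txn reads through its own
    copy-on-write layer; commit merges the layer into the parent txn
    (or the global cache if there is no parent); abort just drops it."""
    def __init__(self, backing, parent=None):
        self._backing = backing      # the global cache dict
        self._parent = parent        # enclosing TxnCache, if nested
        self._local = {}             # this txn's copy-on-write layer

    def get(self, dn):
        if dn in self._local:
            return self._local[dn]
        if self._parent is not None:
            return self._parent.get(dn)
        return self._backing.get(dn)

    def set(self, dn, entry):
        self._local[dn] = entry      # never touches parent/global state

    def begin_nested(self):
        return TxnCache(self._backing, parent=self)

    def commit(self):
        target = self._parent._local if self._parent else self._backing
        target.update(self._local)

    def abort(self):
        self._local.clear()

global_cache = {"cn=group": "member: old"}
txn = TxnCache(global_cache)
child = txn.begin_nested()
child.set("cn=group", "member: new")
child.abort()
print(txn.get("cn=group"))       # member: old -- abort changed nothing
txn.set("cn=group", "member: new")
txn.commit()
print(global_cache["cn=group"])  # member: new -- commit published it
```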
With LMDB, if we could make the on-disk entry representation the same as the in-memory
entry representation, then we could use LMDB as the entry cache too - the database would
be the entry cache as well.
Yes, Ludwig has suggested this because it would remove the need for an Entry Cache at all.
>
>>>> If William's design is too huge a change to safely implement in a reasonable time frame, then perhaps we need to look into revising the existing cache design, where we would use "cache_add_tentative"-style functions and only apply them at the end of the op. That is also not a trivial change.
>>> It’s pretty massive as a change - if we want to do it right. I’d say we
need:
>>>
>>> * development and testing of a MVCC/COW cache implementation (proof that it
really really works transactionally)
>>> * allow “disable/disconnect” of the entry cache, but with the higher-level txns, so that we can prove the txn semantics are correct
>>> * re-architect our transaction calls so that they are “higher” up. An example
is that internal_modify shouldn’t start a txn, it should be given the current txn state as
an arg. Combined with the above, we can prove we haven’t corrupted our server transaction
guarantees.
>>> * integrate the transactional cache.
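The third point could look roughly like this (a sketch with stand-in names - `Txn` and `internal_modify` here are illustrative, not the real slapi calls): internal operations receive the caller's txn instead of opening their own, so the whole operation commits or aborts at one level.

```python
# Sketch of the "txn higher up" re-architecture: internal_modify no
# longer begins its own txn; it runs inside the txn handed to it.
# All names are hypothetical, not the actual 389-ds plugin API.

class Txn:
    def __init__(self, db):
        self.db = db
        self.staged = dict(db)      # private working copy

    def commit(self):
        self.db.clear()
        self.db.update(self.staged)

def internal_modify(txn, dn, value):
    # no txn_begin() here: the operation joins the caller's txn, so a
    # failure anywhere aborts everything at the top level together
    txn.staged[dn] = value

def do_modrdn(db):
    txn = Txn(db)                                          # one txn
    internal_modify(txn, "cn=group", "member: renamed")    # RI step
    internal_modify(txn, "uid=user", "memberOf: cn=group") # memberOf step
    txn.commit()

db = {"cn=group": "member: old"}
do_modrdn(db)
print(db["cn=group"], "|", db["uid=user"])
```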
>>>
>>> I don’t know if I would still write a transactional cache the same way as I
proposed in that design, but I think the ideas are on the right path.
>>>
>>>> And what impact would changing the entry cache have on Ludwig's pluggable backend work?
>>> Should be none; it’s separate layers. If anything, this change is going to make Ludwig’s work better, because our current model won’t really take good advantage of the MVCC nature of modern kv stores.
>>>
>>>> Anyway, we need to start thinking about redesigning the entry cache, no matter what approach we take. If anyone has any ideas or comments, please share them, but given the severity of this flaw I think redesigning the entry cache should be one of our next major goals in DS (1.4.1?).
>>>>
>>>> Thanks,
>>>>
>>>> Mark
>>>> _______________________________________________
>>>> 389-devel mailing list -- 389-devel(a)lists.fedoraproject.org
>>>> To unsubscribe send an email to 389-devel-leave(a)lists.fedoraproject.org
>>>> Fedora Code of Conduct:
https://getfedora.org/code-of-conduct.html
>>>> List Guidelines:
https://fedoraproject.org/wiki/Mailing_list_guidelines
>>>> List Archives:
https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproje...
>> --
>> Red Hat GmbH,
http://www.de.redhat.com/, Registered seat: Grasbrunn,
>> Commercial register: Amtsgericht Muenchen, HRB 153243,
>> Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill,
Eric Shander
—
Sincerely,
William Brown
Software Engineer, 389 Directory Server
SUSE Labs