MDEV-38212 MDEV-37686 Breaks Parallel Replication #4462
base: 10.11
Re-fixing MDEV-37686. NULLifying the `THD::rgi_slave` pointer in the `start_new_trans` ctor harmed parallel slave conflict detection. Instead of `NULL`ifying this member, a finer approach is now taken: `THD::rgi_slave` is optionally screened when access to it is attempted within a `start_new_trans` context, that is, when such an out-of-band transaction is run by a slave thread. `start_new_trans` is of course still allowed to access server-local, non-replicated temporary tables, should it ever need to. The original MDEV-37686 aspect is tested with the 12.3 branch's rpl.create_or_replace_mix2. Any possible side effects on temporary table replication are covered by the existing rpl suite tests. There is no need for a specific mtr test to prove the parallel slave side is back to normal, as `THD::rgi_slave` is again available to concurrent transactions.
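For readers unfamiliar with the idea of "screening" a pointer rather than clearing it, here is a minimal sketch. It is not the patch's code: `Session`, `ReplGroupInfo`, `in_new_trans` and `rgi_for_temp_tables()` are hypothetical names chosen for illustration, and the real `THD` and `rpl_group_info` types are far richer. The sketch only shows the general shape: the pointer itself stays set, so code that needs it (such as conflict detection) still sees it, while selected callers go through an accessor that can hide it.

```cpp
#include <cassert>

// Hypothetical placeholder for the replication group info object.
struct ReplGroupInfo {};

// Hypothetical placeholder for the session object (not the real THD).
struct Session {
  ReplGroupInfo *rgi_slave = nullptr;  // stays set for the whole event group
  bool in_new_trans = false;           // assumed flag: out-of-band trx active

  // Screened view: callers that must not see the slave context while an
  // out-of-band transaction runs ask through this accessor.
  ReplGroupInfo *rgi_for_temp_tables() const {
    return in_new_trans ? nullptr : rgi_slave;
  }
};
```

The design trade-off raised in the review applies directly to this shape: every call site has to choose between the raw member and the screened accessor.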
If I understand correctly, this patch basically tricks the execution context into sometimes thinking it is on a master thread and other times on a slave thread, depending on whether it is in the context of a sub-transaction started via `start_new_trans`. Then, for each usage of the `rgi` that may involve being tricked, we have to go through an intermediary to see if we should do this trickery or not.
There are a few issues I see with this approach:
- It generally seems bug prone to have to go through each usage of `rgi` that is vulnerable to this trickery.
- On the master, both parent and sub-transactions use `THD::temporary_tables`; on the slave this would mean the parent transaction would use `rgi` based temporary tables and the sub-transactions would use `THD::temporary_tables`. This inconsistency seems bug prone.
- Can there be nested `start_new_trans`? This would reset `is_new_trans` for the deepest sub-transaction when it is committed, and the parent trans would then see the wrong value.
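The nesting concern in the last point can be illustrated with a small sketch. This is an assumption-laden model, not the patch: `Session`, `NaiveGuard` and `RestoringGuard` are hypothetical names, and whether the actual patch behaves like the naive variant is not established in this thread. It only shows why a flag that is unconditionally cleared on exit breaks under nesting, while one that restores the saved value does not.

```cpp
#include <cassert>

// Hypothetical stand-in for the session object; not the real THD.
struct Session {
  bool is_new_trans = false;
};

// Naive guard: sets the flag on entry and unconditionally clears it on exit.
struct NaiveGuard {
  Session &s;
  explicit NaiveGuard(Session &session) : s(session) { s.is_new_trans = true; }
  ~NaiveGuard() { s.is_new_trans = false; }
};

// Nesting-safe guard: remembers the previous value and restores it on exit.
struct RestoringGuard {
  Session &s;
  bool saved;
  explicit RestoringGuard(Session &session)
      : s(session), saved(session.is_new_trans) {
    s.is_new_trans = true;
  }
  ~RestoringGuard() { s.is_new_trans = saved; }
};

// Returns the flag as the outer scope sees it after an inner scope has ended.
template <typename Guard>
bool flag_after_inner_scope() {
  Session s;
  Guard outer(s);
  { Guard inner(s); }  // inner scope begins and ends here
  return s.is_new_trans;
}
```

With the naive guard the outer scope observes `false` after the inner one ends, even though it is still active; with the restoring guard it observes `true`.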
I'd think it would be more consistent for sub-transactions to continue using rgi based temporary tables, and follow the convention that @vuvova mentions on MDEV-37686:
The code path is almost the same, but there's no crash here. Because "creating a separate transaction" includes saving and resetting the list of temporary tables in the THD.
IIUC, the original problem from MDEV-37686 was that when the sub-transaction (i.e. the one started via start_new_trans) committed, it over-stretched in what it closed because the sub-transaction had too much insight into the parent transaction context. So my thought is, wouldn't it be cleaner to just have start_new_trans correctly save/reset the temporary tables for a slave?
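A rough sketch of what "save/reset the temporary tables for a slave" could look like, under stated assumptions: `Session`, `ReplGroupInfo`, `NewTransScope` and `active_temp_tables()` are hypothetical, table lists are modelled as plain vectors of names, and where the slave-side list really lives in the server is simplified away. It is meant only to make the suggestion concrete, not to describe existing code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real THD and rpl_group_info are far richer.
struct ReplGroupInfo {
  std::vector<std::string> temporary_tables;  // slave-side list (simplified)
};

struct Session {
  std::vector<std::string> temporary_tables;  // master-side list
  ReplGroupInfo *rgi_slave = nullptr;         // set only on a slave applier

  // Picks whichever list the current thread role uses.
  std::vector<std::string> &active_temp_tables() {
    return rgi_slave ? rgi_slave->temporary_tables : temporary_tables;
  }
};

// RAII scope: hides the parent's temporary tables for the duration of the
// out-of-band transaction and puts them back afterwards.
class NewTransScope {
  Session &s_;
  std::vector<std::string> saved_;

 public:
  explicit NewTransScope(Session &s)
      : s_(s), saved_(std::move(s.active_temp_tables())) {
    s_.active_temp_tables().clear();  // moved-from vector: make it empty
  }
  ~NewTransScope() { s_.active_temp_tables() = std::move(saved_); }
};

// Returns how many parent temporary tables are visible inside the scope.
std::size_t visible_inside_scope(Session &s) {
  NewTransScope scope(s);
  return s.active_temp_tables().size();
}
```

In this model the same save/reset logic runs on both master and slave, which is the consistency argument made above; the scoped transaction cannot close tables it cannot see.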
And for a more general question, is there another problem on the parallel slave, s.t. any sub-transactions committed via start_new_trans can't be rolled back via the retry mechanism?
Thanks for the questions, Brandon!
As the core part of the patch is You are up to the point when quote Serg's
So One might start thinking I bet, about why won't I made a mistake to think it was so indeed. But actually the new trans does not allocate any new So the patch deals with the slave path, and since a new
To p.1 vulnerability is that it's not future-proof wrt To p.3, nesting seems to be fine though sharing THD instance is akin of rope-walking.
I'd stay for more: |
My thought is the opposite, that
And I'm not so sure, many of the uses of
And to this, I agree it shouldn't have access to the parent transaction context, but I do think it should have access to the rgi context (which would just be adapted to hide the parent transaction-specific info), if it exists.
It does not matter what is the main trx' applier, any of them would use the same code path never intending to access As to reusing the parent |
I don't follow the logic jump from "isolated from parent transactions" to "therefore is neutral to either master or slave role". The new |
After our online call, I understand better what you are saying, @andrelkin. What I previously called a sub-transaction is a bit of a misnomer; it would be more effectively thought of as an asynchronous transaction, which could in theory run in its own separate thread (where there would be no As the original issue of MDEV-37686 was with temporary table access, can this patch be limited in scope to only temporary tables? I wonder if instead, the I envision something like:
The advantages I see of this approach are:
However, it still leaves the problem that |
You must've assumed presents a scenario of accessing the slave temp table repository without visiting the function. Similar one exists for DROP. However your point can be furthered. Much central role play lock/unlock temporary tables. So the current patch's require more one or two} aforementioned functions. (Actually in principle its Given that the issue can't be covered by changes around a single function, and at the same time BUT! Indeed this "instead" of yours
certainly deserves further interrogation! :-) We might as well engage existing general server worker thread for that.. |
Ah I suppose I was thinking of MDEV-37686 in particular, where the condition that led to the crash was And on that note, now that you mention..
it also looks like conditions revolving around tmp_tables aren't just using E.g., in the same function, and I'd think we need to make those uses consistent with this fix as well.
which makes me immediately scout on. Let it be CREATE-TABLE (it does matter that it's the slave thread is executing it) the new trans starts in where the old stats is removed. I wonder whether the removal could be carried out in parallel with the rest of CREATE handling? Sure it admits race in that the new table and old stats coexist for "short" period of time. @vuvova may cut this proposal short though... |