Race-condition in creation+release of RedLock #979
I don't think it does. That looks like it only removes the lock if it matches the unique value of the lock.
From what I can see,
If you can reproduce this reasonably reliably, then try adding some logs.
The only thing I can think of at the moment is that one lock could delete the key from Redis while the other lock adds the value to Redis (i.e. release and acquire). If this is the case, then another lock starting up at this time would error if it tried to acquire, as it'd fail to fetch an Event to wait on.
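To make that concrete, here is a minimal self-contained model of the pattern (a sketch, not the aiocache code: the two dicts and the `sleep(0)` calls stand in for the class-level Event registry, the Redis keys and the network round-trips). The task that fails to add the key can find the shared Event already gone by the time it tries to wait on it:

```python
import asyncio

# Sketch only: stand-ins for the shared Event registry and the Redis keys.
EVENTS = {}
REDIS = {}


async def acquire(key, value):
    await asyncio.sleep(0)              # "network" round-trip for SET NX
    if key not in REDIS:
        REDIS[key] = value
        EVENTS[key] = asyncio.Event()
    else:
        await asyncio.sleep(0)          # round-trip delivering the "already exists" reply
        await EVENTS[key].wait()        # KeyError if the holder released in the meantime


async def release(key, value):
    await asyncio.sleep(0)              # "network" round-trip for the release script
    if REDIS.get(key) == value:
        del REDIS[key]
        EVENTS.pop(key).set()


async def main():
    await acquire("k", "A")             # A holds the lock
    # C starts acquiring while A releases: C sees the key as taken, yields for its
    # reply, A's release pops the Event, and C's EVENTS[key] raises KeyError.
    print(await asyncio.gather(acquire("k", "C"), release("k", "A"),
                               return_exceptions=True))


asyncio.run(main())
```

With this scheduling the gather result contains a KeyError from C's lookup, which is the "fail to fetch an Event" case described above.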
Thanks for the quick response. Not sure if this notation is clear, but I'll just try to write down the different situations. Ideally it would look like this:
We know all the steps belonging to
Looking at the code, I think there are more issues in the
So I feel like
I don't think you'd see this one happening; that'd require the packets indicating a successful release getting lost for a long time while B continued its work and somehow managed to release the lock and get a response before those packets arrived. Maybe theoretically possible, but I can't imagine it ever being observed.
The key is namespaced.
Is this not what the description refers to? (Lines 11 to 12 in 58301fb)
Again, if you could add some logs and confirm exactly what is happening, then we can come up with a solution.
According to glide maintainers, there's no point to a connection pool for async (#975 (comment)). So, if the described race condition is currently happening, it should not be possible after switching to glide.
True, but by default a cache has no namespace. So for this to work correctly, each cache would have to be given a unique namespace, which I don't think is apparent to the user.
No, the description refers to a single instance of redis in contrast to a redis cluster.
Unfortunately the error only occurs every so often and can only be reproduced in production environments.
I don't really understand why that should be the case. AFAIK it is not possible to rely on the order in which different coroutines are executed with asyncio.
I'm still not clear what the concern is here.
That code has nothing to do with Redis though... The text surely applies if you're using a memory backend too (not that I wrote it, so I'm only guessing at the intention). It would probably be useful if it did support multiple instances of an application though, so if you have any improvements, that'd be good to get implemented.
Both are waiting on a response from Redis; if there's only one connection then the first one to be executed will always be the first one to receive a response and get scheduled back into the loop, thus guaranteeing our order of execution. The only reason this might be the cause is if there's a connection pool: the release happens on one connection and the acquire then happens on another connection, but the packets for the first connection end up taking longer to get back to the client, resulting in the second task getting scheduled back into the loop first.
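A toy model of that ordering argument (nothing aiocache- or Redis-specific; the two classes below are just stand-ins contrasting a serialized connection with a pool of independent ones):

```python
import asyncio
import random


class SingleConnection:
    """One request on the wire at a time: replies come back in submission order."""

    def __init__(self):
        self._in_flight = asyncio.Lock()

    async def command(self, name):
        async with self._in_flight:
            await asyncio.sleep(random.uniform(0, 0.005))   # simulated latency
            return name


class Pool:
    """Each request gets its own connection: a slow reply can let a later request finish first."""

    async def command(self, name):
        await asyncio.sleep(random.uniform(0, 0.005))        # independent latency per request
        return name


async def completion_order(conn):
    done = []

    async def call(name):
        done.append(await conn.command(name))

    await asyncio.gather(call("release by A"), call("acquire by C"))
    return done


async def main():
    print("single connection:", await completion_order(SingleConnection()))  # release always first
    print("connection pool:  ", await completion_order(Pool()))              # either order


asyncio.run(main())
```

With the pool, the "acquire" can finish before the "release" even though it was submitted later, which is the reordering described above.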
If I just naively instantiate two independent caches and try to create RedLocks for them, both locks will share the same key.
The code does not directly, but the term
The RedLock implementation here only implements the first part. That's why I am pretty sure the explanation refers to instances of Redis.
Maybe, but that is again making assumptions about the order of execution of different coroutine calls in asyncio. A test where the cache is mocked should be able to reproduce the race condition in RedLock (regarded independently of any actual cache implementation) by controlling the order of execution of the asyncio coroutines, as sketched below.
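A sketch of what such a test could look like. It assumes RedLock takes (cache, key, lease) and drives the backend through `_add()` (raising ValueError when the key already exists) and `_redlock_release()`, as in the current aiocache source; if the internals differ at your commit, the stub needs adjusting. The gates let the test dictate exactly when each "network" call delivers its result instead of relying on scheduler luck:

```python
import asyncio

from aiocache.lock import RedLock


class ScriptedCache:
    """Backend stub whose calls only deliver their result once the test opens a gate."""

    def __init__(self):
        self._store = {}
        self.gates = {"add": asyncio.Event(), "release": asyncio.Event()}

    async def _add(self, key, value, ttl=None):
        exists = key in self._store          # what the "server" sees when the command arrives
        if not exists:
            self._store[key] = value
        await self.gates["add"].wait()       # hold the reply until the test allows it
        if exists:
            raise ValueError("Key already exists")
        return True

    async def _redlock_release(self, key, value):
        removed = int(self._store.get(key) == value)
        if removed:
            del self._store[key]
        await self.gates["release"].wait()   # hold the reply until the test allows it
        return removed


async def scripted_scenario():
    cache = ScriptedCache()
    cache.gates["add"].set()                          # let the first acquire straight through
    lock_a = RedLock(cache, "resource", 20)
    await lock_a.__aenter__()                         # A now holds the lock

    cache.gates["add"].clear()
    acquire_c = asyncio.create_task(RedLock(cache, "resource", 20).__aenter__())
    release_a = asyncio.create_task(lock_a.__aexit__(None, None, None))
    await asyncio.sleep(0)                            # both tasks are now parked at their gates

    cache.gates["release"].set()                      # deliver A's release reply first...
    await release_a
    cache.gates["add"].set()                          # ...then C's "already exists" reply
    # If the race exists, C now looks up an Event that A's release has already removed.
    print(await asyncio.gather(acquire_c, return_exceptions=True))


asyncio.run(scripted_scenario())
```

Because the gates decide which reply arrives first, the test no longer depends on how asyncio happens to schedule the two coroutines.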
I think getting rid of
There is one more thing I changed in our implementation though, that is more difficult to put into RedLock as a context manager:
If you create them with the same key, sure. Isn't that exactly the same as using independent caches to get any value in the cache? I think that's just how the entire library works.
If it's a single connection, it can't happen. The execution order is entirely predictable if you know the variables.
If you have a proposed solution, then sure, go ahead.
I'm not clear what the proposed alternative is (I'm not the most familiar with all this), so feel free to propose any improvements.
I think that's a different type of lock. That'd be a RW lock, like in aiorwlock.
Hi, Glide maintainer, please let me know if I can help with some info.
After switching from memory cache to Redis (Valkey) we got some errors in the RedLock context-manager release:
I have looked at the code and found the following race-condition.
Assume there are two parallel calls A and B, where A blocks B. Then a third call C is triggered just as A and B resolve. The following can happen:
aiocache/aiocache/lock.py, lines 71 to 72 in 1e55135
aiocache/aiocache/lock.py, line 77 in 1e55135
aiocache/aiocache/lock.py, line 78 in 1e55135
aiocache/aiocache/lock.py, lines 79 to 80 in 1e55135
aiocache/aiocache/lock.py, lines 90 to 91 in 1e55135
aiocache/aiocache/lock.py, line 94 in 1e55135
aiocache/aiocache/lock.py, lines 95 to 96 in 1e55135
Up to this point everything is fine, but then the new call C comes in before B has finished.
aiocache/aiocache/lock.py, line 77 in 1e55135
aiocache/aiocache/lock.py, line 94 in 1e55135
This is already wrong, because the lock key belongs to the run of C and not of A and B.
On the other hand, C will only retrieve the cached value from A, so it is not important that the locking works at this point.
aiocache/aiocache/lock.py, lines 95 to 96 in 1e55135
This (probably) leads to the exception we are seeing.
We are not using the lock as a real synchronization mechanism, but to reduce redundant calculations, so it is fine if the locking does not work perfectly. It just should not throw this kind of exception.
So I would suggest always running
RedLock._EVENTS.pop(self.key)
on release (no matter whether the lock key was found in Valkey or not), but not failing on errors.
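For illustration, a sketch of what that could look like, using the same assumed internals as the test stub above (a `_redlock_release()` call followed by a pop of the class-level `_EVENTS` dict); `TolerantRedLock` is just an illustrative name, and the same change could be made in `RedLock._release` itself:

```python
from aiocache.lock import RedLock


class TolerantRedLock(RedLock):
    """RedLock variant whose release never fails when the shared Event is already gone."""

    async def _release(self):
        removed = await self.client._redlock_release(self.key, self._value)
        # Always drop this lock's entry; the None default keeps the pop from raising
        # when another lock has already removed (or replaced and removed) it.
        event = RedLock._EVENTS.pop(self.key, None)
        if event is not None:
            event.set()
        return removed
```

The trade-off is that a release may now pop and set an Event that a newer lock (C in the walkthrough above) has just registered, waking its waiters early, which is acceptable here because the lock is only used to reduce redundant work rather than for strict mutual exclusion.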