Fix: volatile store partial writes through table flipping #806
Force-pushed ae27de2 to 5875cf7
Nice find.
Sounds complex (why no groups?) and costly, though... There has to be a simpler way than that?
Thanks. The logic is mainly that promotion happens from links to leaves. The other way around would leave dangling references, while this way leaves only orphans, which are fine: they are cleaned up on the next cycle and don't cause any errors. I measured performance to be very close. There is a cost, but I was able to optimize most of it away by pattern matching on old/new hits so I don't promote twice.

The core issue is write atomicity. There are many ways to solve it: a controller process, contexts, per-record expire/generation counters, etc. However, this approach solves it by giving the writer a new table to write to long before erase hits the old table, so erase can never wipe half of an in-flight write. As a bonus, it acts somewhat like an LRU (though far less precise), and cleanup is fast since it just clears the whole table.
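The read path described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the module, the fixed table names `cache_new`/`cache_old`, and the function shapes are all assumptions made for the example.

```erlang
%% Hypothetical sketch of promote-on-read across two ETS tables.
%% Table names and function signatures are illustrative assumptions.
-module(flip_cache_sketch).
-export([init/0, write/2, read/1]).

init() ->
    ets:new(cache_new, [named_table, public, set]),
    ets:new(cache_old, [named_table, public, set]),
    ok.

%% Writes land in both tables, so a flip never truncates a write
%% in progress against the table about to be erased.
write(Key, Val) ->
    true = ets:insert(cache_new, {Key, Val}),
    true = ets:insert(cache_old, {Key, Val}).

%% Reads check "new" first; a miss that hits "old" is promoted
%% into "new", so active entries survive the next flip.
read(Key) ->
    case ets:lookup(cache_new, Key) of
        [{Key, Val}] ->
            {ok, Val};
        [] ->
            case ets:lookup(cache_old, Key) of
                [{Key, Val}] ->
                    ets:insert(cache_new, {Key, Val}),  % promote
                    {ok, Val};
                [] ->
                    not_found
            end
    end.
```

Pattern matching on where the hit came from (new vs. old) is what avoids promoting twice, as mentioned above.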
Force-pushed 5875cf7 to 01d2796
Demonstrates the "link to link: not_found" bug: hb_cache:read gets lazy links, TTL reset fires and wipes the table, then ensure_all_loaded fails because the data behind the links is gone.
The old max-ttl wiped the entire ETS table on a timer, causing dangling links when hb_cache writes span a reset boundary. The new approach uses two tables: writes go to both, reads check "new" first with promote-on-read from "old", and every TTL/2 the old table is wiped and roles flip. Active data survives via promotion; idle data expires atomically — no partial messages, no per-item timestamps, no cleanup sweeps.
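The TTL/2 flip itself can be sketched as below, assuming a single process (e.g. a gen_server) owns both table ids. The record fields, the `flip` message, and the `ttl_ms/0` constant are assumptions for illustration, not the PR's implementation.

```erlang
%% Illustrative sketch of the TTL/2 table flip; record fields and
%% timer wiring are assumptions, not the actual hb_cache code.
-module(flip_timer_sketch).
-export([flip/1]).

-record(state, {new :: ets:tid(), old :: ets:tid()}).

ttl_ms() -> 60000.  % hypothetical max TTL

flip(#state{new = New, old = Old} = State) ->
    %% Anything still only in Old was not read (promoted) within a
    %% full TTL window, so clearing it expires idle data atomically.
    true = ets:delete_all_objects(Old),
    %% Schedule the next flip at TTL/2, then swap roles: the freshly
    %% emptied table takes over as "new".
    erlang:send_after(ttl_ms() div 2, self(), flip),
    State#state{new = Old, old = New}.
```

Because each flip wipes a whole table rather than scanning entries, expiry needs no per-item timestamps and no cleanup sweep, matching the description above.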
Pattern-match on ets-flip presence instead of calling get_tables.
Force-pushed 01d2796 to e0117d8
Bug: during the max-TTL reset, the volatile cache can receive partial writes, causing groups and links to point to stale data. Check the tests.
Proposal: dual table approach with promotion and table flip at TTL/2
It points out the issue. Feel free to fix it another way, but I believe this is the safest approach without any coordinator or mutex-like behaviour to make writes and resets atomic.
As a bonus, it acts kind of like an LRU.