SQLite Forum

Changes to support concurrent writes to wal database distributed over network FS

(1) By Joelmo on 2020-09-13 11:18:22

I want to distribute a SQLite database using Ceph and let multiple nodes have read-write access. I believe some small changes would allow this: split the WAL into multiple files, where each node has exclusive write access to one file and read access to all of them, and only one node is allowed to merge the WAL files. A lookup then requires reading all WAL files, but I don't think this is a huge drawback since the number of nodes is usually small. This change should allow highly concurrent writes without causing database corruption, if I'm not missing something.
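A toy sketch of what that lookup path could look like (plain Python; this has nothing to do with SQLite's actual page or frame format, and all names are made up for illustration): with one WAL per node, a reader has to scan every node's WAL and take the most recent frame for a page, falling back to the merged main database when no WAL has it.

```python
# Illustrative model: each node's WAL is a list of (seq, page_no, data)
# frames, and the merged main database is a plain {page_no: data} dict.

def lookup_page(page_no, node_wals, db_pages):
    """Return the newest version of a page across all per-node WALs,
    or the main-database copy if no WAL contains the page."""
    best = None  # (seq, data) of the newest frame seen so far
    for frames in node_wals.values():
        for seq, pg, data in frames:
            if pg == page_no and (best is None or seq > best[0]):
                best = (seq, data)
    return best[1] if best else db_pages.get(page_no)
```

As the post says, the cost is one scan per node's WAL, which stays cheap while the node count is small.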

What do you think about this?

(2) By Warren Young (wyoung) on 2020-09-13 11:36:59 in reply to 1

I think you should use a distributed SQLite variant from the start. The major ones are BedrockDB, rqlite, and dqlite. It’ll save you a lot of work over reinventing the wheel, and it’ll probably be faster and more secure besides.

(3) By Gunter Hick (gunter_hick) on 2020-09-14 08:53:04 in reply to 1

Since cephFS claims to be POSIX compliant, you should be able to put a journal mode SQLite file on a cephFS volume without any changes to SQLite. If cephFS supports shared memory mapped files, then WAL mode may be usable too.

Alternatively, you would need to write an SQLite VFS that interfaces with the rados library, hiding all the fancy changes from SQLite.

What is the rationale behind wanting to split the WAL file?

(4) By Free Ekanayaka (free.ekanayaka) on 2020-09-14 09:10:54 in reply to 1

Dqlite author here. How are you going to resolve conflicts when merging WAL files? Or, alternatively, how are you going to prevent conflicts from happening in the first place?

(5) By Warren Young (wyoung) on 2020-09-14 13:43:23 in reply to 3

"you should be able to put a journal mode SQLite file on a cephFS volume without any changes to SQLite."

If everything works to spec, then at minimum every SQLite transaction requires two network round-trips. (Commit to journal, move data to DB.) With the HTTP-based API typical of the distributed DB layers over SQLite I linked to, that'll be one round-trip per transaction.

Since network I/O time will typically be an order of magnitude or so slower than the storage I/O in the sort of system where Ceph will be in use, that means using SQLite-on-Ceph will be at least twice as slow as the alternative.
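A back-of-envelope model of that argument, with assumed latency numbers (illustrative, not measurements):

```python
# Rough commit-latency model for the comparison above. NET_RTT_MS and
# STORAGE_IO_MS are assumptions chosen to reflect "network I/O an order of
# magnitude slower than storage I/O", not measured values.

def commit_latency(round_trips, net_rtt_ms, local_io_ms=0.0):
    """Total time for one transaction commit."""
    return round_trips * net_rtt_ms + local_io_ms

NET_RTT_MS = 5.0     # assumed cost of one network round-trip
STORAGE_IO_MS = 0.5  # assumed local storage write, ~10x faster

# SQLite on cephFS: commit to journal + move data into the DB = 2 round-trips.
on_ceph = commit_latency(2, NET_RTT_MS)
# HTTP-based distributed layer: one round-trip, then a local write.
via_api = commit_latency(1, NET_RTT_MS, STORAGE_IO_MS)

print(on_ceph / via_api)  # close to 2, matching "at least twice as slow"
```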

That's all from the single-node perspective, but I don't see that it breaks down when you start talking about multi-node consensus. Both schemes have to arrange a way to do that, so as long as you're comparing apples to apples, the consensus overhead should be approximately equal and you can factor it out of the discussion.

But hey, try it. It's not difficult to work up some simple benchmarks here.

(6) By Joelmo on 2020-09-17 13:06:36 in reply to 4

I think wal2 mode is needed (https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md): each node uses two logs and alternates between them. Every node switches to the alternate log before the merger node starts working; the merger then applies each write in chronological order.
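A toy sketch of that alternation idea (made-up names; this models the proposed scheme, not the actual wal2 mechanics): each node appends to an active log, seals it and switches to the other log before a merge, so the merger only ever reads logs that are no longer being written.

```python
class NodeLogs:
    """One writer node with two logs it alternates between."""
    def __init__(self):
        self.logs = [[], []]  # the two per-node logs
        self.active = 0       # index of the log currently taking writes

    def write(self, seq, data):
        self.logs[self.active].append((seq, data))

    def seal_and_switch(self):
        """Seal the active log for the merger; new writes go to the other
        log, which the merger already drained on the previous cycle."""
        sealed, self.active = self.logs[self.active], 1 - self.active
        self.logs[self.active].clear()
        return sealed

def merge(sealed_logs):
    """Merger node: combine frames from all sealed logs in chronological
    order (here, by sequence number)."""
    frames = [f for log in sealed_logs for f in log]
    return sorted(frames, key=lambda f: f[0])
```

Writers keep going on the alternate log while the merger works, which is the point of having two logs per node.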

For relaxed consistency this requires synced clocks and a timestamp on each commit. If two nodes wrote to a row at exactly the same time, you would have to just pick one. This method is the most performant and is probably acceptable for many applications.

For stricter consistency you need a shared counter: replace the timestamps with the counter and fetch-and-increment it for each write. Preventing conflicts from happening at all would be expensive, because it requires locks or exclusive access.
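A minimal sketch of the two ordering schemes, with made-up names and Python's `itertools.count` standing in for what would really be an atomic fetch-and-increment on a shared network service:

```python
import itertools

def merge_lww(writes):
    """Relaxed variant: last-writer-wins by commit timestamp.
    writes: [(timestamp, node_id, row, value)]. A timestamp tie is
    resolved by just picking one (here, the higher node_id)."""
    state = {}
    for ts, node, row, value in sorted(writes):
        state[row] = value  # later (ts, node) pairs overwrite earlier ones
    return state

# Strict variant: a shared counter replaces the timestamps, giving every
# write a globally unique, totally ordered id. itertools.count is a local
# stand-in for the real fetch-and-increment service.
_counter = itertools.count(1)

def stamp(row, value):
    """Tag a write with the next value of the shared counter."""
    return (next(_counter), row, value)
```

With the counter variant there are no ties to break, at the cost of one round-trip to the counter per write.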

(7) By Joelmo on 2020-09-17 13:24:56 in reply to 3

The main reason is to remove the lock needed for the WAL; unsynchronized access in a distributed system gives better performance. I believe separate files are easier to implement than exclusive write regions within a single WAL. Separate files should also work better when the sync system splits each file into objects, as Ceph does.