Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Update and add further detail to README-server-edition.html. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | server-process-edition |
Files: | files | file ages | folders |
SHA3-256: |
337a0b67e30f1030fdc59f712e5914f4 |
User & Date: | dan 2018-03-30 20:42:25.654 |
Context
2018-03-31
| ||
18:43 | Fix an error in README-server-edition.html. (Leaf check-in: 754ad35cd2 user: dan tags: server-process-edition) | |
2018-03-30
| ||
20:42 | Update and add further detail to README-server-edition.html. (check-in: 337a0b67e3 user: dan tags: server-process-edition) | |
2018-03-28
| ||
15:41 | Update this branch with latest trunk changes. (check-in: df52e89fff user: dan tags: server-process-edition) | |
Changes
Changes to README-server-edition.html.
1 2 3 4 5 6 7 | <html> <center> <h1> The "server-process-edition" Branch</h1> </center> <p> | | | | | > | > > > > > | | | > > | > > > | | | > | > | | | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 | <html> <center> <h1> The "server-process-edition" Branch</h1> </center> <p> The "server-process-edition" branch contains two modifications to stock SQLite that work together to provide concurrent read/write transactions using pessimistic page-level-locking. The system runs in two modes: <ul> <li><p> Single-process mode - where all clients must be within the same address space, and <li><p> Multi-process mode - where clients may be distributed between multiple OS processes. </ul> <p> The system is designed to be most efficient when used with <a href="https://www.sqlite.org/pragma.html#pragma_synchronous"> "PRAGMA synchronous=OFF"</a>, although it does not require this. <p> Up to 16 simultaneous read/write transactions controlled by page-level-locking are possible. Additionally, in single-process mode there may be any number of read-only transactions started using the "BEGIN READONLY" command. Read-only transactions do not block read-write transactions, and read-write transactions do not block read-only transactions. Read-only transactions access a consistent snapshot of the database - writes committed by other clients after the transaction has started are never visible to read-only transactions. In multi-process mode, the "BEGIN READONLY" command is equivalent to a stock "BEGIN". <p> The two features on this branch are: <ol> <li><p> An <a href=#freelist>alternative layout for the database free-page list</a>. This is intended to reduce contention between writers when allocating new database pages, either from the free-list or by extending the database file. <li><p> The <a href=#servermode>"server-mode" extension</a>, which provides read/write page-level-locking concurrency and (in single-process mode) read-only MVCC concurrency mentioned above. </ol> <h2 id=freelist> 1.0 Alternative Free-List Format </h2> <p> The alternative free-list format is very similar to the current format. It differs in the following respects: <ul> <li><p>The "total number of free pages" field in the db header is not |
︙ | ︙ | |||
76 77 78 79 80 81 82 | <p> At present, the free-list format may only be modified when the free-list is completely empty. Which, as the implementation ensures that a free-list that uses the alternative format is never completely emptied, effectively precludes changing the format from 2 (alternative) to 1 (legacy). | > > > > > | | < | | | | > > > > | > > > | > > > > > | > > > > > > | > > > > > > > > > > > | > > | > > > | > > > > | > > > > | > > > > > > > > > < | > > > | > | | > > > > > | > > > | > > > > | > | | > > > | | > > > > > > > > | > > > > > > > > > > > > > > > > > > > | > | > | > > > | > > > > > > | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | > > > > > > > > > > | > > > > > | > > | > > | > > > | > | | > | > > > > > > > > > > > > > > | | 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 | <p> At present, the free-list format may only be modified when the free-list is completely empty. Which, as the implementation ensures that a free-list that uses the alternative format is never completely emptied, effectively precludes changing the format from 2 (alternative) to 1 (legacy). <p> For databases that use the "alternative" free-list format, the read and write versions in the database header (byte offsets 18 and 19) are set to 3 for rollback mode or 4 for wal mode (instead of 1 and 2 respectively). <h2 id=servermode> 2.0 Page level locking - "Server Mode" </h2> <p> A database client automatically enters "server mode" if there exists a <i>directory</i> named "<database>-journal" in the file system alongside the database file "<database>" There is currently no provision for creating this directory, although it could be safely done for a database in rollback mode using something like: <pre> PRAGMA journal_mode = off; BEGIN EXCLUSIVE; <create directory> END; </pre> <p> As well as signalling new clients that they should enter server-mode, creating a directory named "<database>-journal" has the helpful side-effect of preventing legacy clients from accessing the database file at all. <p> If the VFS is one that takes an exclusive lock on the db file (to guarantee that no other process accesses the db file), then the system automatically enters single-process mode. Otherwise, multi-process mode. <p> In both single and multi-process modes, page-level-locking is managed by allocating a fixed-size array of "locking slots". Each locking slot is 32-bits in size. By default, the array contains 262144 (2^18) slots. Pages are assigned to locking slots using the formula (pgno % 262144) - so pages 1, 262145, 524289 etc. share a single locking slot. <p> In single-process mode, the array of locking slots is allocated on the process heap and access is protected by a mutex. In multi-process mode, it is created by memory-mapping a file on disk (similar to the *-shm file in SQLite wal mode) and access is performed using <a href="https://en.wikipedia.org/wiki/Compare-and-swap">atomic CAS primitives</a> exclusively. <p> Each time a read/write transaction is opened, the client assumes a client id between 0 and 15 for the duration of the transaction. Client ids are unique at any point in time - concurrently executing transactions must use different client ids. So there may exist a maximum of 16 concurrent read/write transactions at any one time. <p> Read/write transactions in server-mode are similar to regular SQLite transactions in rollback mode. The most significant differences are that: <ul> <li> <p>Instead of using journal file <database>-journal, server-mode clients use <database>-journal/<client-id>-journal. If there are multiple concurrent transactions, each uses a separate journal file. <li> <p>No database-wide lock is taken. Instead, individual read and write locks are taken on the pages accessed by the transaction. </ul> <p> Each locking slot is 32-bits in size. A locking slot may simultaneously support a single write-lock, up to 16 read-locks from read/write clients, and (in single process mode) up 1024 read-locks from "BEGIN READONLY" clients. Locking slot bits are used as follows: <ul> <li> <p> The least-significant 16-bits are used for read-locks taken by read/write clients. To take a read-lock, bit <client-id> of the locking slot is set. <li> <p> The next 5 bytes are used for the write-lock. If no write-lock is held on the slot, then this 5 byte integer is set to 0. Otherwise, it is set to (<i>C</i> + 1), where <i>C</i> is the <client-id> of the client holding the write-lock. <li> <p> The next 10 bits contain the total number of read-locks held by "BEGIN READONLY" clients on the locking slot. See the section below for a description of how these are used. </ul> <p> Currently, if a client requests a lock that cannot be granted due to a conflicting lock, SQLITE_BUSY is returned to the caller and either the entire transaction or statement transaction must be rolled back. See <a href=#problems>Problems and Issues</a> below for more details. <h3> 2.1 Single-Process Mode </h3> <p> Single process mode is simpler than multi-process mode because it does not have to deal with runtime client failure - it is assumed that if one client fails mid-transaction the entire process crashes. As a result the only time hot-journal rollback is required in single-process mode is as part of startup. The first client to connect to a database in single-process mode attempts to open and rollback all 16 potential hot journal files. <p> But, in order to support non-blocking "BEGIN READONLY" transactions, it is also in some ways more complicated than multi-process mode. "BEGIN READONLY" support works as follows: <ul> <li> <p>Clients executing "BEGIN READONLY" transactions are not assigned a <client-id>. Instead, they have a transaction-id that is unique within the lifetime of the process. Transaction-ids are assigned using a monotonically increasing function. <li> <p>In single-process mode, writers never spill the cache mid-transaction. Data is only written to the database as part of committing a transaction. <li> <p>As well as writing the contents of overwritten pages out to the journal file, a writer in single-process mode also accumulates a list of buffers containing the original data for each page overwritten by the current transaction in main-memory. <li> <p>When a transaction is to be committed, a writer first obtains a transaction-id (in the same way as a BEGIN READONLY client) and then adds all of its "old data" buffers to a hash table accessible to all database clients. Associated with each hash table entry is the newly assigned transaction-id. It then waits (spin-locks) for all "BEGIN READONLY" read-locks to clear on all pages that will be written out by the transaction. Following this, it commits the transaction as normal (writes out the dirty pages and zeroes the journal file header). <li> <p>When a "BEGIN READONLY" transaction reads a page, it first checks the aforementioned hash table for a suitable entry. A suitable entry is one with the right page-number and a transaction-id greater than that of the "BEGIN READONLY" transaction (i.e. one added to the hash table <i>after</i> the BEGIN READONLY transaction started). If such an entry can be found, the client uses the associated data instead of reading from the db file. Or, if no such entry is found, the client: <ol> <li> Increments the number of BEGIN READONLY read-locks on the page. <li> Reads the contents of the page from the database file. <li> Decrements the number of BEGIN READONLY read-locks on the page. </ol> <p> The mutex used to protect access to the array of locking slots and the shared hash table is relinquished for step 2 above. <li> <p>After each transaction is commited in single-process mode, the client searches the hash table for entries that can be discarded. An entry can be discarded if it has a transaction-id older than any still in use (either by BEGIN READONLY transactions or committers). </ul> <h3> 2.2 Multi-Process Mode </h3> <p> Multi-process mode differs from single-process mode in two important ways: <ul> <li> <p>Individual clients may fail mid-transaction and the system must recover from this. <li> <p>Partly as a consequence of the above, there are no convenient primitives like mutexes or malloc() with which to build complicated data structures like the hash-table used in single-process mode. As a result, there is no support for "BEGIN READONLY" transactions in multi-process mode. </ul> <p> Unlike single-process mode clients, which may be assigned a different client-id for each transaction, clients in multi-process mode are assigned a client-id when they connect to the database and do not relinquish it until they disconnect. As such, a database in multi-process server-mode supports at most 16 concurrent client connections. <p> As well as the array of locking slots, the shared-memory mapping used by clients in multi-process mode contains 16 "client slots". When a client connects, it takes a posix WRITE lock on the client slot that corresponds to its client id. This lock is not released until the client disconnects. Additionally, whenever a client starts a transaction, it sets the value in its client locking slot to 1, and clears it again after the transaction is concluded. <p> This assists with handling client failure mid-transaction in two ways: <ul> <li><p> If client A cannot obtain a lock due to a conflicting lock held by client B, it can check whether or not client B has failed by attempting a WRITE lock on its client locking slot. If successful, then client B must have failed and client A may: <ul> <li> Roll back client B's journal, and <li> By iterating through the entire locking slot array, release all locks held by client B when it failed. </ul> <li><p> When a client first connects and locks its client locking slot, it can check whether or not the previous user of the client locking slot failed mid-transaction (since if it did, the locking slot value will still be non-zero). If it did, the new owner of the client locking slot can release any locks and roll back any hot-journal before proceeding. </ul> <h3> 2.3 Required VFS Support </h3> <p> The server-mode extension requires that the VFS support various special file-control commands. Currently support is limited to the "unix" VFS. <dl> <dt> SQLITE_FCNTL_SERVER_MODE <dd><p> This is used by SQLite to query the VFS as to whether the connection should use single-process server-mode, multi-process server-mode, or continue in legacy mode. <p>SQLite invokes this file-control as part of the procedure for detecting a hot journal (after it has established that there is a file-system entry named <database>-journal and that no other process holds a RESERVED lock). If the <database>-journal directory is present in the file-system and the current VFS takes an exclusive lock on the database file (i.e. is "unix-excl"), then this file-control indicates that the connection should use single-process server-mode. Or, if the directory exists but the VFS does not take an exclusive lock on the database file, that the connection should use multi-proces server-mode. Or, if there is no directory of the required name, that the connection should use legacy mode. <dt> SQLITE_FCNTL_FILEID <dd><p> Return a 128-bit value that uniquely identifies an open file on disk from the VFS. This is used to ensure that all connections to the same database from within a process use the same shared state, even if they connect to the db using different file-system paths. <dt> SQLITE_FCNTL_SHMOPEN <dd> <dt> SQLITE_FCNTL_SHMOPEN2 <dd> <dt> SQLITE_FCNTL_SHMLOCK <dd> <dt> SQLITE_FCNTL_SHMCLOSE <dd> </dl> <h2 id=problems> 3.0 Problems and Issues </h2> <ul> <li> <p>Writer starvation might be the biggest issue. How can it be prevented? <li> <p>Blocking locks of some sort would likely improve things. The issue here is deadlock detection. <li> <p>The limit of 16 concurrent clients in multi-process mode could be raised to 27 (since the locking-slot bits used for BEGIN READONLY locks in single-process mode can be reassigned to support more read/write client read-locks). </ul> <h2> 4.0 Performance Test </h2> <p> The test uses a single table with the following schema: <pre> CREATE TABLE t1(a INTEGER PRIMARY KEY, b BLOB(16), c BLOB(16), d BLOB(400)); CREATE INDEX i1 ON t1(b); |
︙ | ︙ |
Changes to src/server.c.
︙ | ︙ | |||
136 137 138 139 140 141 142 143 144 145 146 147 148 149 | int iCommitId; /* Current commit id (or 0) */ int nAlloc; /* Allocated size of aLock[] array */ int nLock; /* Number of entries in aLock[] */ u32 *aLock; /* Array of held locks */ Server *pNext; /* Next in pCommit or pReader list */ }; struct ServerGlobal { ServerDb *pDb; /* Linked list of all ServerDb objects */ }; static struct ServerGlobal g_server; struct ServerFcntlArg { | > > > | 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 | int iCommitId; /* Current commit id (or 0) */ int nAlloc; /* Allocated size of aLock[] array */ int nLock; /* Number of entries in aLock[] */ u32 *aLock; /* Array of held locks */ Server *pNext; /* Next in pCommit or pReader list */ }; /* ** Global variables used by this module. */ struct ServerGlobal { ServerDb *pDb; /* Linked list of all ServerDb objects */ }; static struct ServerGlobal g_server; struct ServerFcntlArg { |
︙ | ︙ | |||
157 158 159 160 161 162 163 | /* ** Possible values for Server.eTrans. */ #define SERVER_TRANS_NONE 0 #define SERVER_TRANS_READONLY 1 #define SERVER_TRANS_READWRITE 2 | < < < < | 160 161 162 163 164 165 166 167 168 169 170 171 172 173 | /* ** Possible values for Server.eTrans. */ #define SERVER_TRANS_NONE 0 #define SERVER_TRANS_READONLY 1 #define SERVER_TRANS_READWRITE 2 /* ** Global mutex functions used by code in this file. */ static void serverEnterMutex(void){ sqlite3_mutex_enter(sqlite3MutexAlloc(SQLITE_MUTEX_STATIC_APP1)); } static void serverLeaveMutex(void){ |
︙ | ︙ | |||
220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 | p->nClient++; } pNew->pDb = p; serverLeaveMutex(); return rc; } static int serverClientRollback(Server *p, int iClient){ ServerDb *pDb = p->pDb; ServerJournal *pJ = &pDb->aJrnl[iClient]; int bExist = 1; int rc = SQLITE_OK; if( fdOpen(pJ->jfd)==0 ){ bExist = 0; rc = sqlite3OsAccess(pDb->pVfs, pJ->zJournal, SQLITE_ACCESS_EXISTS,&bExist); if( bExist && rc==SQLITE_OK ){ int flags = SQLITE_OPEN_READWRITE|SQLITE_OPEN_MAIN_JOURNAL; rc = sqlite3OsOpen(pDb->pVfs, pJ->zJournal, pJ->jfd, flags, &flags); } | > > > > > > > > | 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 | p->nClient++; } pNew->pDb = p; serverLeaveMutex(); return rc; } /* ** Roll back journal iClient. This is a hot-journal rollback - the ** connection passed as the first argument does not currently have an ** open transaction that uses the journal (although it may have an ** open transaction that uses some other journal). */ static int serverClientRollback(Server *p, int iClient){ ServerDb *pDb = p->pDb; ServerJournal *pJ = &pDb->aJrnl[iClient]; int bExist = 1; int rc = SQLITE_OK; /* If it is not exists on disk but is not already open, open the ** journal file in question. */ if( fdOpen(pJ->jfd)==0 ){ bExist = 0; rc = sqlite3OsAccess(pDb->pVfs, pJ->zJournal, SQLITE_ACCESS_EXISTS,&bExist); if( bExist && rc==SQLITE_OK ){ int flags = SQLITE_OPEN_READWRITE|SQLITE_OPEN_MAIN_JOURNAL; rc = sqlite3OsOpen(pDb->pVfs, pJ->zJournal, pJ->jfd, flags, &flags); } |
︙ | ︙ |