
Overview
Comment: Add b-tree design notes to www/ directory.
SHA1: 8c828db5a64f137f184e830962942ebd6a12b0b3
User & Date: dan 2013-09-19 15:13:21.872
Context
2013-09-19 19:19  Further btree updates. (check-in: 0b68007d89, user: dan, tags: trunk)
2013-09-19 15:13  Add b-tree design notes to www/ directory. (check-in: 8c828db5a6, user: dan, tags: trunk)
2013-09-18 15:46  Fix enough of bt_pager.c that it may be used for testing the b-tree built on top of it. (check-in: 2109619482, user: dan, tags: trunk)
Changes
Added www/bt.wiki.

<title>BT Design Overview</title>
<nowiki>






<div id=start_of_toc></div>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#overview style=text-decoration:none>1. Overview</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#b-tree_database_notes style=text-decoration:none>2. B-Tree Database Notes</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#pager_level_changes style=text-decoration:none>2.1. Pager Level changes</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#wal_file_format_changes style=text-decoration:none>2.1.1. WAL File Format Changes</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#wal_file_wrapping style=text-decoration:none>2.1.1.1. WAL File Wrapping</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#wal_file_headers_and_recovery style=text-decoration:none>2.1.1.2. WAL File Headers and Recovery</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#shm_file_format_changes style=text-decoration:none>2.1.2. SHM File Format Changes</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#shm-file_header style=text-decoration:none>2.1.2.3. Shm-file Header</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#shm-file_hash_tables style=text-decoration:none>2.1.2.4. Shm-File Hash Tables</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#changes_to_locking style=text-decoration:none>2.1.3. Changes to Locking</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#read-only_clients_and_halted_databases style=text-decoration:none>2.1.3.5. Read-only Clients and Halted Databases</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#read-only_clients_and_live_databases style=text-decoration:none>2.1.3.6. Read-only Clients and Live Databases</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#b-tree_level_changes style=text-decoration:none>2.2. B-Tree Level Changes</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#free_page_management style=text-decoration:none>2.2.1. Free Page Management</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#large_records style=text-decoration:none>2.2.2. Large Records</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#b-tree_page_format style=text-decoration:none>2.2.3. B-Tree Page Format</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#merge-tree_database_notes style=text-decoration:none>3. Merge-Tree Database Notes</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#merge-tree_format style=text-decoration:none>3.1. Merge-Tree Format</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#in-memory_tree style=text-decoration:none>3.2. In-Memory Tree</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#use_of_a_background_thread_or_process style=text-decoration:none>3.3. Use of a Background Thread or Process</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#design_summary style=text-decoration:none>4. Design Summary</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#locking style=text-decoration:none>4.1. Locking</a><br>

<div id=end_of_toc></div>

<h1 id=overview>1. Overview</h1>

<p>
  This is an attempt to extend a b-tree design similar to SQLite3 WAL 
  mode to meet the following objectives:

<ol>
  <li><p>To support fast "blind writes", similar to LSM or FTS4. The user
  should be able to select on a per-write basis whether a blind write or a
  traditional b-tree write is used to update the database. In SQLite4, this
  might be used as follows:
    <ul>
      <li><p>The primary key index of tables could be written using b-tree
      writes while an FTS (or other) index attached to the same table uses
      blind writes.

      <li><p>A database used by an indexeddb implementation might write all
      data using blind writes.
    </ul>

  <li><p>To better support deferring work involved in writing to the database
  to a background thread or process. 

  <p>In SQLite, it is possible to do this by turning off auto-checkpoint
  and using a background thread or process to run periodic checkpoints
  on a database. The main problem with this is that if the database is under
  load the WAL file may grow indefinitely. Addressing this would not make 
  writer threads as responsive as they can be with LSM or LevelDB, but would 
  improve things without sacrificing DAM-optimal (disk access model optimal) 
  seek queries.

  <li><p>To support read-only clients. A read-only client may not create files
  or shared-memory regions, write to the same, or take any EXCLUSIVE locks.
</ol>

<h1 id=b-tree_database_notes>2. B-Tree Database Notes</h1>

<p>
  The b-tree database is similar to SQLite in WAL mode, in the sense that:

<ul>
  <li><p> All data is stored in a b-tree structure with fixed size pages.
       Large records are stored using overflow pages.

  <li><p> A physical log file similar to SQLite's WAL file is used. This
       provides single-writer/multiple-reader MVCC.

  <li><p> A temporary rollback log file is used to support nested transactions.

  <li><p> A checkpoint operation copies data from the physical log file into
       the database file itself. Checkpoints may proceed concurrently with
       a single writer and any number of readers.
</ul>

<p>
  The notes below are divided into two sections - the first describing a
  WAL pager module that supports the circular log file, and the second 
  describing the b-tree created on top of it.

<h2 id=pager_level_changes>2.1. Pager Level changes</h2>

<p>
  To support the circular log concept, the formats of the wal and shm files 
  differ slightly from those used by SQLite, as described below.

<h3 id=wal_file_format_changes>2.1.1. WAL File Format Changes</h3>

<p> The file begins with two header blocks, each occupying an entire
    disk sector.

<p> The remainder of the file is divided into "frames", just as the SQLite WAL
file is. Most valid frames contain new versions of database pages, just
as in SQLite. However, a frame may also be a "pointer frame" - a frame that
exists solely to indicate the location of the next frame in the log.

<h4 id=wal_file_wrapping>2.1.1.1. WAL File Wrapping</h4>

<p> Logically, the log always consists of three regions (contiguous runs of
    frames within the physical file), region A, region B and region C. When
    the contents of a log are recovered, regions are read in that order (A,
    then B, then C). New frames are always appended to the end of region C.

<pre>
    |---B---|..|---A---||---C---|......
</pre>

<p> Initially, the log file is empty. All three regions are zero frames in 
size. All three are located at the very start of the log file. As frames are
appended to the log, region C grows, giving:

<pre>
    1)     |---------------C---------------|....
</pre>

<p> If a checkpoint is performed at this point, it reduces the size of region
    C, to (say):

<pre>
    2)     ....................|-----C-----|....
</pre>

<p> Once there is sufficient space between the start of the log file and the
    start of region C, region C is renamed to region A and a new region C
    created at the start of the file:

<pre>
    3)     |---C---|...........|-----A-----|....
</pre>

<p> Usually, or at least hopefully, checkpoints will reduce the size of 
region A to zero before C grows large enough to meet the start of A,
leaving the system back in state (1). But, if it doesn't, region C is 
renamed B and a new C begins immediately following A:

<pre>
    4)     |--------B---------||-----A-----||---C---|...
</pre>

<p> Once in this state, new frames are appended to region C until all data
in both regions A and B has been checkpointed, at which point the system
is back in state (2) above. Before it gets there, the system topology may
pass through state 5:

<pre>
    5)     .....|------B------|.............|-----C-----|...
</pre>
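
<p>The region bookkeeping above can be captured in a few lines of C. The
following is an illustrative sketch only - the structure and function
names are assumptions, not part of the actual bt implementation:

<pre>
  /* Three-region log bookkeeping. aRegion[0], [1] and [2] are regions
  ** A, B and C respectively. An nFrame value of 0 means a region is
  ** empty. */
  typedef struct BtLogRegion BtLogRegion;
  struct BtLogRegion {
    int iFirst;              /* First frame belonging to the region */
    int nFrame;              /* Number of frames in the region */
  };

  /* Return true if the writer may wrap the log - rename region C to A
  ** and start a new region C at the beginning of the file. As described
  ** above, this requires that the log currently consists of a single
  ** region (A and B are empty) and that sufficient checkpointed space
  ** lies between the start of the file and the start of region C. */
  static int btLogCanWrap(BtLogRegion *aRegion, int nSpaceRequired){
    return aRegion[0].nFrame==0
        &amp;&amp; aRegion[1].nFrame==0
        &amp;&amp; aRegion[2].iFirst >= nSpaceRequired;
  }
</pre>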

<h4 id=wal_file_headers_and_recovery>2.1.1.2. WAL File Headers and Recovery</h4>

<p>The WAL file begins with two header blocks - headers 1 and 2 - each 
occupying an entire disk sector. As well as other fields, a header 
block contains:

<ul>
  <li> A 64-bit version number. The version number is incremented 
       each time a new header block is written to the WAL file.
  <li> A checksum, based on the contents of the header block, including 
       the version number.
</ul>

<p>During recovery, both header blocks are read from the WAL file. The
newest block with a valid checksum is used.

<p>The key piece of information in a WAL file header is the first frame
to read during recovery. In an SQLite 3 WAL file, this is always frame 
zero - the frame that immediately follows the header. In this version,
it may be any frame in the file.
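
<p>A header block might therefore resemble the following C structure. This
is a sketch only - the field names and layout are assumptions, not the
actual on-disk format (u32 and u64 are SQLite-style integer typedefs):

<pre>
  typedef unsigned int u32;
  typedef unsigned long long int u64;

  /* One of the two header blocks at the start of the WAL file. */
  typedef struct BtWalHdr BtWalHdr;
  struct BtWalHdr {
    u64 iVersion;            /* Incremented each time a header is written */
    u32 iFirstFrame;         /* First frame to read during recovery */
    u32 aCksum[2];           /* Checksum of the preceding fields */
  };

  /* Recovery reads both headers and uses the newest valid one.
  ** btHdrChecksumOk() is an assumed helper that verifies aCksum. */
  static BtWalHdr *btChooseHdr(BtWalHdr *p1, BtWalHdr *p2){
    int ok1 = btHdrChecksumOk(p1);
    int ok2 = btHdrChecksumOk(p2);
    if( ok1 &amp;&amp; ok2 ) return (p1->iVersion > p2->iVersion) ? p1 : p2;
    if( ok1 ) return p1;
    if( ok2 ) return p2;
    return 0;                /* Neither header valid - log is empty */
  }
</pre>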

<p>When a WAL file is first written (state 1 above), header 1 is written and
the recovery frame pointer set to 0 (first frame in the file). In this case,
the header is written by the writer. This is the only time a WAL header
is updated by a process holding the WRITER lock. All subsequent header 
updates are performed while holding the CHECKPOINTER lock.

<p>After a checkpoint is performed, a checkpointer may write a new header
into the WAL file, indicating a new "first frame" that can be used when
recovering the log file. Once this new header has been synced, it is safe
to overwrite any parts of the log that occur logically before the new "first
frame". The new header is always written to the slot not occupied by the
previous header, so that if a failure occurs while writing, at least one header 
is always valid. If the database is in synchronous=NORMAL or FULL mode, the WAL
file is synced immediately after the header is written. Following that, the
*-shm file is updated so that subsequent writers can see that the WAL file
header has been updated. A writer process may then use this information to
determine whether or not to wrap the log file.
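
<p>A minimal sketch of that checkpointer-side sequence follows. The helper
names (btHdrChecksum, btOsWrite, btOsSync, btShmPublishHdr) and the
SECTOR_SIZE and BT_SYNC_OFF symbols are assumptions:

<pre>
  /* Write a new WAL header after a checkpoint. The caller holds the
  ** CHECKPOINTER lock. iLastSlot is the slot (0 or 1) occupied by the
  ** previous header; pNew->iVersion has already been set to one more
  ** than the previous header's version. */
  static void btWalWriteHdr(BtWalHdr *pNew, int fd, int iLastSlot, int eSync){
    int iSlot = 1 - iLastSlot;       /* Always overwrite the other slot */
    btHdrChecksum(pNew);             /* Compute checksum over new header */
    btOsWrite(fd, iSlot*SECTOR_SIZE, pNew, sizeof(BtWalHdr));
    if( eSync!=BT_SYNC_OFF ){
      btOsSync(fd);                  /* synchronous=NORMAL or FULL only */
    }
    btShmPublishHdr(iSlot);          /* Tell writers via the *-shm file */
  }
</pre>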

<h3 id=shm_file_format_changes>2.1.2. SHM File Format Changes</h3>

<h4 id=shm-file_header>2.1.2.3. Shm-file Header</h4>

<p>In SQLite, the shm-file header consists of three things:

<ul>
  <li> The WalIndexHdr structure (updated by writers).
  <li> The nBackfill field (updated by checkpointers).
  <li> The array of read-lock slots (updated by readers).
</ul>

<p>There are two identical copies of the WalIndexHdr. Both contain a checksum.
When a writer wishes to update the WalIndexHdr, it updates the first copy,
invokes the xShmBarrier() routine, then updates the second. 

<p>When a reader reads the WalIndexHdr, it reads the first copy, invokes
xShmBarrier(), then reads the second. If both operations return identical data
and the checksum matches then the read is considered successful. Unsuccessful
reads are retried for a while, then an attempt is made to determine if there is
an active writer process. If there is not, recovery is run on the WAL file. If
there is, SQLITE_PROTOCOL is returned to the user.

<p>There are two minor problems with this:

<ul>
  <li><p> Even with high retry counts, SQLITE_PROTOCOL has very occasionally
          been observed in the wild.
  <li><p> Requiring that recovery be run if a writer process fails while
          updating the WalIndexHdr causes problems for read-only connections.  
</ul>

<p>The procedure for reading and writing the WalIndexHdr copies can be modified
to address the above as follows:

<ul>
  <li><p>The writer invokes xShmBarrier() before updating the first WalIndexHdr
  as well as before updating the second.

  <li><p>A reader starts by reading the first WalIndexHdr. If the checksum
  matches the read data, the read is considered successful without ever
  examining the second copy. If unsuccessful, an attempt is made to read the
  second copy. Then the first again, and so on, until some configured limit is
  reached or a successful read is made. If a successful read is achieved, the
  xShmBarrier() method is invoked before proceeding. If not, SQLITE_PROTOCOL is
  returned to the user.
</ul>
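
<p>A sketch of this modified read procedure, assuming aHdr[] points to the
two shared-memory copies and that btHdrChecksumOk() verifies a checksum:

<pre>
  #include &lt;string.h>             /* memcpy() */

  /* Attempt to read a consistent WalIndexHdr. Returns 0 on success, or
  ** BT_PROTOCOL if no clean read is achieved within nRetry attempts. */
  static int btReadIndexHdr(WalIndexHdr *aHdr, WalIndexHdr *pOut, int nRetry){
    int i;
    for(i=0; i&lt;nRetry; i++){
      memcpy(pOut, &amp;aHdr[i%2], sizeof(WalIndexHdr)); /* copy 1, 2, 1... */
      if( btHdrChecksumOk(pOut) ){
        xShmBarrier();             /* Hash tables read later are not older */
        return 0;
      }
    }
    return BT_PROTOCOL;
  }
</pre>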

<p>The xShmBarrier() made by the reader and the first of those made by the
writer together ensure that a reader may not see a version of the hash tables
that is older than the WalIndexHdr it reads. The second xShmBarrier() call made
by the writer ... <i>Does what? Makes it more likely that a concurrent reader
will get a clean read? Can this be removed?</i>

<p>The single nBackfill field is replaced by two 32-bit integer fields that
may be updated by checkpointers:

<ul>
  <li><p>iFirstRead - first frame in the log that has not been checkpointed.
  <li><p>iFirstRecover - first frame in the log that has not been checkpointed
         and synced into the database file.
</ul>

<p>After a checkpointer finishes copying data from the log file into the
database, it updates iFirstRead. If it then syncs the database file (or if it
is running in synchronous=OFF mode), it updates iFirstRecover as well.
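
<p>In code, the checkpointer's update of these two fields might look like
the following sketch (all structure and helper names are assumptions):

<pre>
  /* Called after all frames logically prior to frame iDone have been
  ** copied into the database file. */
  static void btCkptUpdateShm(BtShmHdr *pShm, u32 iDone, int eSync){
    pShm->iFirstRead = iDone;        /* Data is in the db file... */
    if( eSync==BT_SYNC_OFF || btOsSyncDb()==0 ){
      pShm->iFirstRecover = iDone;   /* ...and synced to stable storage */
    }
  }
</pre>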

<p>The array of read-lock slots is the same as in SQLite.

<h4 id=shm-file_hash_tables>2.1.2.4. Shm-File Hash Tables</h4>

<p>In SQLite, the shm file contains a series of hash tables - roughly one
hash table for each 4096 frames in the WAL file. Each 32KB hash table is
made up of an ordered list of page numbers (16KB) and a table of 8192
16-bit slots. This approach must be modified for the current design, as
it is not
generally possible to safely remove entries from a hash table while it is in
use.

<p>Instead of a single 16KB table of hash slots, each 16KB ordered list of 
page numbers is followed by two such tables, meaning that for each 4096
frames or part thereof in the WAL file there is a 48KB (instead of 32KB) 
hash table in the shm file.

<p>When the WAL file is first written, the first of each pair of hash slot
tables is used. Each time a writer wraps around to the start of the WAL 
file, it toggles which of the pair is updated (i.e. the first time the
WAL file is wrapped the writer starts using the second of each pair, then
after it is wrapped a second time it uses the first again, and so on).
Since a writer may only wrap around to the start of the WAL file when the
log consists of a single region, this makes it possible to avoid ever
needing to delete individual elements from hash slot tables. It also 
makes the shm file 50% larger.
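
<p>Selecting the table to use from each pair is then a question of parity.
A sketch, assuming the writer keeps a count of the number of times the log
has been wrapped (u16 is a SQLite-style 16-bit typedef; the BtHashGroup
structure is illustrative):

<pre>
  /* Return the table of hash slots to use for hash table iHash. The
  ** first table of the pair is used initially and after every second
  ** wrap, the second table otherwise. */
  static u16 *btHashSlots(BtHashGroup *aGroup, int iHash, int nWrap){
    BtHashGroup *p = aGroup + iHash; /* page-number list + 2 slot tables */
    return (nWrap % 2)==0 ? p->aSlot1 : p->aSlot2;
  }
</pre>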

<h3 id=changes_to_locking>2.1.3. Changes to Locking</h3>

<p>The difference between the current design and SQLite in WAL mode is
that this design supports read-only clients.

<p>Like SQLite WAL mode, this system has essentially two states - live 
and halted. The system is live when there is at least one open database
connection and the contents of the shared-memory region are deemed 
trustworthy. Otherwise, it is halted.

<p>To simplify things a bit, this design combines recovery with connecting
to the database. In SQLite, when a database is opened, the library checks
to see if the new connection is the only one to the database in question.
If it is, the shared-memory region is zeroed, which forces recovery to be
run later on - perhaps by another connection. In the current design,
recovery shall be run immediately upon determining that the current 
connection is the first. Using the locking notation for SQLite, the
connection procedure for read-write clients is therefore:

<pre>
  Lock(DMS1, EXCLUSIVE)      /* blocking lock */
    Lock(DMS2, EXCLUSIVE)    /* non-blocking lock */
    if( successful ){ Run recovery }
    Lock(DMS2, SHARED)       /* "cannot" fail */
  Unlock(DMS1)
</pre>

<p>This makes things simpler by ensuring that recovery never needs to be 
run by a (possibly read-only) client once it is connected to a live system.

<h4 id=read-only_clients_and_halted_databases>2.1.3.5. Read-only Clients and Halted Databases</h4>

<p>In order to read from a halted database, a read-only client runs a process
very similar to database recovery to create a private wal-index in heap 
memory. Since there is usually no WAL file associated with a halted system, 
this is often a no-op. 

<p>In order to safely read from a halted database, a read-only client 
requires the system to guarantee that for the duration of the read 
transaction no read-write client connects and overwrites any database or 
WAL data that the read-only connection may be using. To achieve this, a
read-only client grabs the CHECKPOINTER lock for the duration of its 
transaction (including, if required, the part where it builds a private
wal-index). Holding the CHECKPOINTER lock guarantees that:

<ul>
  <li><p>No checkpoint can be run, ensuring that database pages in
       use by the read-only client remain undisturbed for the duration of the
       transaction.

  <li><p>The logical contents of the WAL file header may not be modified. This
       ensures that if a writer process does wrap around to the start of the
       WAL file, it may not overwrite any WAL frames that may be used by
       the read-only connection.
</ul>

<p>A read-only client may therefore safely read from a database (live 
or halted) provided that it obtains a shared lock on the CHECKPOINTER. If
a wal-index is required, the reader constructs it in heap memory after
obtaining the CHECKPOINTER lock.

<h4 id=read-only_clients_and_live_databases>2.1.3.6. Read-only Clients and Live Databases</h4>

<p>Read-only clients could simply assume that all databases are halted and
safely access them as described in the previous section. However, this is
undesirable as (a) checkpointers and read-only transactions will block each
other, and (b) recovering the contents of a WAL file at the beginning of
each transaction will degrade performance.

<p>Since recovery is never required once a database is live, read-only 
clients may connect to and access a database in much the same way as 
read-write clients. There are still two differences:

<ul>
  <li><p>Since a read-only client may not attempt an EXCLUSIVE lock, the
  connection/disconnection protocol must be changed, and 

  <li><p>A read-only client may not modify the value stored in a read-lock 
  slot. Read-only clients must always use a slot that already contains 
  a suitable value.
</ul>

<p>To deal with the first problem, this design uses a block of N "DMS3" 
locks, where N is relatively large (say 64 or 128), and splits the DMS2
lock into two separate locks: "DMS2/RW" and "DMS2/RO". 

<p>When a read-write client connects to the database, immediately after
obtaining the shared lock on DMS2, it attempts to obtain an EXCLUSIVE lock on
one of the bytes in the DMS3 range. If it cannot (indicating that there are
already N read-write connections), then this is not an error. Read-only clients
use the DMS3 block to detect whether or not there are already one or more
read-write clients connected.

<pre>
  Lock(DMS1, SHARED)            /* blocking lock */
    Lock(DMS3..DMS3+N, SHARED)  /* non-blocking lock */
    if( unsuccessful ){ 
      Lock(DMS2/RO, SHARED)     /* "cannot" fail */
      /* Connection is now connected to a live database */
    }
  Unlock(DMS1)
</pre>

<p>Using the above scheme, it is possible that a read-only connection
incorrectly concludes that a database is halted. This can happen if the
only other connections to the database are either read-only or else 
read-write connections that failed to grab a DMS3 lock when connecting.

<p><i>How is the second problem solved? Can we somehow guarantee that 
  there is always at least one usable slot available? Or a fallback slot 
  locked to indicate that the entire WAL may be in use.</i>

<h2 id=b-tree_level_changes>2.2. B-Tree Level Changes</h2>
<h3 id=free_page_management>2.2.1. Free Page Management</h3>

<i>
  <p> Bitmaps? A linked-list similar to SQLite? Other? Which free-list use
  cases could be improved?  
</i>

<ul>
  <li> Free a page.
  <li> Allocate a page.
  <li> Allocate a block of pages.
  <li> Query to see if a given page is free? Does this help with an 
       incr-vacuum like capability?
</ul>

<p>Page types to track: tree pages, free pages, overflow pages.

<h3 id=large_records>2.2.2. Large Records</h3>

<p> The tree is a b+tree, so there are two types of records to consider:

<ul>
  <li>Key prefixes stored on internal tree pages.
  <li>Real key/value pairs stored on leaf pages.
</ul>

<p> The internal tree is populated by a set of key-prefixes such that for
each pair of adjacent leaf pages there exists exactly one key-prefix within 
the internal tree that is less than or equal to the smallest key on one of 
the leaves and larger than all the keys on the other. If any such key-prefix 
is larger than N bytes in size, it is omitted from the internal tree. When
the key-prefix is required (as part of a seek or rebalancing operation), it
is instead read dynamically from the leaf page containing the smallest key
larger than or equal to the omitted key-prefix. This leaf may be found by
following the existing tree pointers. N is selected so as to guarantee a
minimum internal fanout of (say) 4 or 8.
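
<p>As a concrete example of the bound on N: assuming 4096 byte pages and a
minimum internal fanout of 8, each internal cell (key-prefix, child pointer
and overhead) may occupy at most roughly 4096/8 = 512 bytes, so N would be
a little under 512. In code (the per-cell overhead figure is an assumption):

<pre>
  /* Maximum key-prefix size that still guarantees minFanout cells per
  ** internal page. */
  static int btMaxPrefixSize(int pgsz, int minFanout){
    int nOverhead = 16;      /* varints, child pointer, cell-index entry */
    return (pgsz / minFanout) - nOverhead;
  }
</pre>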

<p>Any real key/value pair that is too large for a single leaf page is
spread across the leaf and one or more overflow pages. Overflow pages are
organized differently to those in SQLite with two goals in mind:

<ul>
  <li> To make random access to offsets within large values more efficient, and
  <li> To make it possible to append data to an existing value without
       rewriting all constituent pages.
</ul>

<i>
  <p> Inode-style overflow pages (instead of a linked list)?
</i>

<i>
  <p> Or a single pointer on the inode to a tree of overflow pages.
</i>

<h3 id=b-tree_page_format>2.2.3. B-Tree Page Format</h3>

<ul>
  <li><p> Make it easy to avoid overreads.
  <li><p> Make binary searches as fast as possible.
</ul>

<p>
  Each b-tree page has a fixed-size header and a variable-sized footer. The
  purpose of the footer is to prevent buffer overreads.

<p><b>B-Tree Nodes</b>

<p>Cell format:

<ul>
  <li> Number of bytes in the key-prefix (nKey), as a varint. Or, if the key-prefix for
       this cell is too large to be stored on an internal node (see above), zero.
  <li> nKey bytes of key data.
  <li> Page number of b-tree page containing keys equal to or smaller than 
       the key-prefix as a varint.
</ul>

<p><b>B-Tree Leaves</b>

<p>Cell format:

<p>There are three types of cells on b-tree leaves, depending on how overflow
pages are used - (a) cells that do not use overflow pages at all, (b) cells
that use overflow pages for the value only, and (c) cells that use overflow
pages for the key and value.

<p>Cell type (a):
<ul>
  <li> Number of bytes in the entry's key (nKey), as a varint.
  <li> nKey bytes of key data.
  <li> Number of bytes in the entry's value (nValue), as a varint.
  <li> nValue bytes of value data.
</ul>

<p>Cell type (b):
<ul>
  <li> Number of bytes in the entry's key (nKey), as a varint.
  <li> nKey bytes of key data.
  <li> Single 0x00 byte.
  <li> Number of bytes of the entry's value (nValue) stored locally, as a varint.
  <li> nValue bytes of value data.
  <li> Number of bytes of the entry's value stored on overflow pages, as a varint.
</ul>

<p>Cell type (c):
<ul>
  <li> Single 0x00 byte.
  <li> Number of bytes of the entry's key (nKey) stored locally, as a varint.
  <li> nKey bytes of key data.
  <li> Number of bytes of the entry's key stored on overflow pages, as a varint.
  <li> Number of bytes in the entry's value, as a varint.
</ul>

<p>Cell types (b) and (c) are followed by an array of pointers to overflow
pages. The overflow data for a single entry is distributed between up to 16
"direct" overflow pages and a single overflow tree. A direct overflow page
is just that - a pointer to an overflow page that contains data. An overflow
"tree" is a tree structure where leaves contain overflow data and nodes 
contain pointers to leaves or sub-trees. All leaves are the same distance
from the root of the tree.

<p>The overflow array that follows cell types (b) and (c) is formatted as
follows:
<ul>
  <li> Single overflow descriptor byte. The least-significant 4 bits contain
       the number of direct overflow pages. The most-significant 4 bits contain 
       the depth of the overflow tree (0 means the root of the tree contains data). 
  <li> The page numbers of the N direct overflow pages, formatted as varints.
  <li> The page number of the root of the overflow tree.
</ul>
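
<p>A sketch of decoding this overflow array follows. btGetVarint() is an
assumed helper that decodes a varint into a u64 and returns the number of
bytes consumed; u8 and u64 are SQLite-style integer typedefs:

<pre>
  typedef struct BtOvflArray BtOvflArray;
  struct BtOvflArray {
    int nDirect;             /* Number of direct overflow pages (0-15) */
    int nDepth;              /* Depth of overflow tree (0 = root is data) */
    u64 aDirect[16];         /* Page numbers of direct overflow pages */
    u64 pgnoRoot;            /* Page number of overflow tree root */
  };

  /* Decode the overflow array stored at a[]. Return bytes consumed. */
  static int btParseOvflArray(const u8 *a, BtOvflArray *p){
    int i, iOff = 0;
    u8 desc = a[iOff++];                 /* Overflow descriptor byte */
    p->nDirect = desc &amp; 0x0F;        /* Least-significant 4 bits */
    p->nDepth = (desc >> 4) &amp; 0x0F;  /* Most-significant 4 bits */
    for(i=0; i&lt;p->nDirect; i++){
      iOff += btGetVarint(&amp;a[iOff], &amp;p->aDirect[i]);
    }
    iOff += btGetVarint(&amp;a[iOff], &amp;p->pgnoRoot);
    return iOff;
  }
</pre>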



<h1 id=merge-tree_database_notes>3. Merge-Tree Database Notes</h1>

<p>
  In general, LSM and other merge-tree databases obtain their performance
  advantages by:

<ol>
  <li> <p>Using a merge-tree format to improve locality of writes.

  <li> <p>Accumulating a number of writes in main-memory before flushing them to
  disk. LSM does this using its "in-memory tree" in shared-memory. LevelDB uses
  a skip-list stored in heap memory. FTS4 uses an in-memory hash table.

  <li> <p>Shifting significant amounts of work to a background thread or process.
  LevelDB uses a dedicated background thread. LSM allows separate database 
  connections opened in the same or different processes to perform "work" on
  the database on demand. FTS4 also allows any connection to perform merge
  operations, but the merging and writing cannot be performed concurrently.
</ol>

<p>
FTS4 builds a merge-tree on top of SQLite's SQL layer. When compared to
LevelDB or LSM, this is limited in two ways:

<ul>
  <li><p>Changes are only accumulated in memory within a single transaction.
  All changes must be flushed to disk by the writer process when the 
  transaction is committed.

  <li><p>Although existing segments may be merged together at any time
  (background work), doing so requires the same lock as for any other write
  to the database.
</ul>

<h2 id=merge-tree_format>3.1. Merge-Tree Format</h2>

<ul>
  <li><p>Data is stored in segments. Each segment is a tightly packed 
  read-only b-tree that fits entirely in a single block. A block is 
  (say) 1MB in size.

  <li><p>Each segment is assigned to a level. Each level contains a 
  set of non-overlapping segments. All segments in level N are newer
  than those in level (N+1).

  <li><p>Segments are linked together using a tree-like data structure
  stored on regular database pages (mixed in with b-tree data pages).
</ul>
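
<p>The per-segment metadata this implies might be as small as the following
sketch (the field names are assumptions; the key range is what keeps the
segments within a level non-overlapping):

<pre>
  typedef struct BtSegment BtSegment;
  struct BtSegment {
    u32 iBlock;              /* Block containing this segment's b-tree */
    int iLevel;              /* Level the segment is assigned to */
    BtKey *pSmallest;        /* Smallest key in the segment */
    BtKey *pLargest;         /* Largest key in the segment */
  };
</pre>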

<h2 id=in-memory_tree>3.2. In-Memory Tree</h2>

<p>This design uses the same multi-version in-memory tree that LSM does. 
The tree structure may be located in heap or shared memory. If shared memory
is used then a separate file (*-shm2 or some such) is used for the data. The 
tree-header used by LSM to identify the latest tree version is combined with
the wal-index header stored in the *-shm file. This makes opening a
read-transaction simpler than in LSM, as there is no need to ensure that
compatible versions of the in-memory and on-disk databases are locked.

<h2 id=use_of_a_background_thread_or_process>3.3. Use of a Background Thread or Process</h2>

<p>The b-tree database described above supports the use of a background thread
or process to perform checkpoints. To achieve LSM/LevelDB type performance,
this would be extended so that:

<ul>
  <li> <p>The user thread (WRITER lock) is responsible for updating the
  in-memory tree and for writing logical log entries to the WAL file.

  <li> <p>The checkpointer (CHECKPOINTER lock) thread is responsible for 
  copying data from the in-memory tree to the database file.

  <li> <p>The checkpointer (CHECKPOINTER lock) thread is responsible for
  merging together existing segments within the database file.
</ul>

<p>The immediate problem is that for b-tree writes, space is allocated
within the database file by the thread holding the WRITER lock. But the
second and third points above would be more easily implemented if the
thread holding the CHECKPOINTER lock were to do so.

<p>Because of this, although a checkpointer does the actual work of copying
data from the in-memory tree to the database and merging existing segments
together, these operations must be scheduled and space allocated for them by a
writer. To do this, a writer appends a special frame to the WAL specifying:

<ul>
  <li><p>The operation to perform. For example, to flush data from the in-memory
       tree to disk, or to merge two or more specified segments together.

  <li><p>The address of space within the database file allocated for the output
       of the operation. For a flush of the in-memory tree, it is not (?) easy
       to determine the amount of space required in advance. In this case an
       upper bound on the space required is determined and space allocated on
       this basis.

       <p>If the operation is a merge of existing segments, then a single 
       block of space in the db file is allocated for the output of the 
       operation. If this is insufficient, merging stops once it is full.
       The remainder of the data in the input segments is handled by a 
       subsequent merge operation.
</ul>

<p>The last page of each block contains a description of its contents. After a
checkpoint has populated a block with the results of a flush or merge operation,
the next writer client to open a transaction recognizes this and reads the
description from the last page of the populated block or blocks. It then
updates the structure that links all populated blocks together (by appending
new versions of pages to the WAL). In other words, flushing the in-memory 
tree or merging blocks together is a three-step operation:

<ol>
  <li><p> A writer process schedules the operation by adding a frame to the WAL.
  <li><p> A checkpointer does the flush or merge operation as part of a checkpoint.
  <li><p> A writer observes that the operation has finished and updates the database
  to link the new segment into the merge-tree.
</ol>
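
<p>The scheduling frame appended in step 1 might carry something like the
following structure (a sketch - the field names and the fixed-size input
array are assumptions):

<pre>
  typedef struct BtScheduleOp BtScheduleOp;
  struct BtScheduleOp {
    int eOp;                 /* BT_SCHED_FLUSH or BT_SCHED_MERGE */
    u32 aInput[8];           /* Merge only: input segments, 0-terminated */
    u32 iOutBlock;           /* First block allocated for the output */
    u32 nOutBlock;           /* Upper bound on output size, in blocks */
  };
</pre>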

<h1 id=design_summary>4. Design Summary</h1>

<h2 id=locking>4.1. Locking</h2>

<p>Locks are divided into two groups - those that are used as part of
connecting to and initializing the system (running recovery), and those 
that are used by reader, writer and checkpointer clients once they are
connected to the database.

<p> Locks used during connection/disconnection:

<p><table style="width:90%;margin:auto" border=1>
  <tr><td style="width:15ex">DMS1
  <td>This slot is used like a mutex to serialize the connection/disconnection
      of read/write clients. 

  <tr><td>DMS2/rw
  <td>Once connected, all read/write clients hold a SHARED lock on this slot
      until they disconnect.

  <tr><td>DMS2/ro
  <td>Once connected, all read-only clients hold a SHARED lock on this slot
      until they disconnect.

  <tr><td>DMS3
  <td>This is a large block of locks (say 64 or more). Upon connection, each
    read/write connection attempts to obtain an EXCLUSIVE lock on one of 
    these slots. If there are already more than 64 read/write connections, this
    may not be possible. In that case, the connection continues without the
    lock.
</table>

<p> Locks used during normal operations:

<p><table style="width:90%;margin:auto" border=1>
  <tr><td style="width:15ex">READER
  <td>An array of (say) 8 slots, each of which is associated with a 32-bit
    integer value. As in SQLite WAL mode.

  <tr><td> WRITER
  <td>An EXCLUSIVE lock on this slot is held when a write transaction is open.

  <tr><td> CHECKPOINTER
  <td>An EXCLUSIVE lock on this slot is held while a checkpoint is underway.
</table>

<p><b>Read-only Connections</b>

<p>To open a read-transaction, a disconnected read-only client does the following:

<pre>
  Lock(DMS3..DMS3+N, SHARED)  /* non-blocking lock */
  if( successful ){
    Unlock(DMS3..DMS3+N)
    Lock(CHECKPOINTER, SHARED)
    if( successful ){
      "recover" the db into private heap memory.
      return SQLITE4_OK
    }
  }
  return SQLITE4_BUSY
</pre>

<p>If SQLITE4_BUSY is returned, then the read-only client can conclude that
the database is now live. It should attempt to connect to it and then
re-attempt opening the read-transaction. To connect to a database:

<pre>
  Lock(DMS1, SHARED)            /* blocking lock */
    Lock(DMS3..DMS3+N, SHARED)  /* non-blocking lock */
    if( unsuccessful ){ 
      Lock(DMS2/RO, SHARED)     /* "cannot" fail */
      /* Connection is now connected to a live database */
    }else{
      Unlock(DMS3..DMS3+N)
      /* Attempt to connect has failed */
    }
  Unlock(DMS1)
</pre>


<p><b>Read/Write Connections</b>

<p>To connect:
<pre>
  Lock(DMS1, EXCLUSIVE)         /* blocking lock */
    Lock(DMS2/RO..DMS2/RW, EXCLUSIVE)
    if( successful ){ 
      Run recovery to set up shared memory
      Unlock(DMS2/RO..DMS2/RW)
    }
    Lock(DMS2/RW, SHARED)       /* "cannot" fail */
    for(i=0; i&lt;64; i++){
      Lock(DMS3+i, EXCLUSIVE)
      if( successful ) break;
    }
  Unlock(DMS1)
</pre>

<p>To disconnect:

<pre>
  Lock(DMS1, EXCLUSIVE)         /* blocking lock */
    Lock(DMS2/RW, EXCLUSIVE)
    if( successful ){ 
      Checkpoint
      Lock(DMS2/RO, EXCLUSIVE)
      if( successful ){ 
        Delete WAL+SHM files 
        Unlock(DMS2/RO)
      }
      Unlock(DMS2/RW)
    }
  Unlock(DMS1)
</pre>

<p>To open a read-transaction:

<pre>
  Read wal-index header.
</pre>

<p>To open a write-transaction:

<p>To commit a write-transaction:





<!--
<hr>

<p>
  As well as the merge-tree file format, LSM style embedded databases derive
  their performance advantages by:

<ol>
  <li> <p>Accumulating a number of writes in main-memory before flushing them to
  disk. LSM does this using its "in-memory tree" in shared-memory. LevelDB uses
  a skip-list stored in heap memory. FTS4 uses an in-memory hash table.

  <li> <p>Shifting significant amounts of work to a background thread or process.
  LevelDB uses a dedicated background thread. LSM allows separate database 
  connections opened in the same or different processes to perform "work" on
  the database on demand. FTS4 also allows any connection to perform merge
  operations, but the merging and writing cannot be performed concurrently.
</ol>

<p>
  The basic idea is for the file to contain two separate tree structures - a 
  b-tree (b+tree) and a merge-tree. In order to query the database, both
  structures are queried and the results merged, with the merge-tree assumed to
  contain newer data than the b-tree. Once enough data has accumulated in the
  merge-tree, it may be merged into the b-tree.
  
<p>
  Blind-writes issued by the application are written to the merge-tree. B-tree
  writes are written directly to the b-tree, unless the merge-tree contains a
  conflicting key, in which case they must be written to the merge-tree.

<p>
  The "merge-tree" is written in the same manner as LSM/FTS4/LevelDB. When
  an application writes to the merge-tree, a copy of the new KV pair or delete
  marker is appended to the database log and an in-memory data structure
  updated. Once sufficient data, perhaps belonging to more than one transaction,
  has accumulated in the in-memory data structure, its contents are written
  out to disk. And so on.

<p>
  The b-tree is written using the same methods as SQLite3 in WAL mode. 
  Modified pages are appended to the log file and a hash table of the
  latest version of each page in the log maintained in-memory. Some parts
  of the merge tree are written using this method as well.

<h1 id=data_structures>4. Data Structures</h1>

<h2 id=log_file>4.1. Log file</h2>

How do we make a wal file into a circular log like LSM uses?

In LSM, there are two ways the log file may be read. From byte offset 0, with
well-known checksum initializer, or from some point midway through the file 
specified by an offset and checksum initializer. 

The log file also features "jump markers" used to wrap the cursor etc.

Key point is that there is some other checksummed record in which a log
offset/checksum may be stored.

So, we either need to do this or find a new "well-known offset" to start 
at.

The obvious point is the start of the database file. Two problems:

  * It changes too often.
  * Blast-radius may change after db is created.

So, elsewhere in the log file? Space reserved at the start of the log?
At the start of the log we reserve a block of size $blastradius. Then 
begin writing the log immediately following this.

When recovering, attempt to read a header from the start of the log file.
If one can be read, it contains an offset to start reading the log at. In
this case proceed. Otherwise, begin searching for a valid header at 
various power-of-two offsets within the log file (i.e. at offset 512, 
then 1024, etc.). This may be optimized to check the sector-size offset
first.

<p><b>Log file topology</b>



<p><b>Hash tables and log file wrapping</b>

<p>In WAL mode, the shm file contains a list of the page numbers corresponding
to the frames in the wal file. And a hash table to aid in searching this list.

<p>How does this change when the log file becomes wrappable...

<p>Is it possible to clear entries from the hash-table/sorted-list at
checkpoint time?

Do we use LSM style CHECKPOINTER/WRITER/WORKER locking here?



<h2 id=merge-tree_data>4.2. Merge-Tree Data</h2>

<p>
  Merge-tree data is stored in segments. A segment is a tightly
  packed read-only b-tree. Each segment is equal to or less than some
  configured size (say 512KB or 2MB - probably not more). The majority of
  segments should be exactly this size. This size is known as the database
  block size.

<p>
  When a new segment is created, either by flushing the in-memory buffer or by
  merging data from two or more existing segments together, it is written
  directly into the database file (bypassing the log file). All that is written
  to the log file is a checksum of the new segment, for use during recovery.
  Even if they are less than a block in size, segments are always written into
  aligned blocks.

<p>
  The data structure used to keep track of all segments within a database is
  updated using the same (WAL-like) methods as the b-tree data. For each segment,
  the data structure stores:

<ul>
  <li> The smallest key it contains,
  <li> The largest key it contains, and
  <li> A pointer to the segment within the database file.
</ul>



<hr>
  
<p>
  The tree structure is similar to a b-tree. Each internal node has N 
  children and N-1 divider keys. The Ith divider key on an internal node
  is a copy of the largest key located within the Ith child sub-tree of 
  the node.
  
<p>
  Other Invariants:
  
<ul>
    <li> All segments on a child node must be older than all overlapping
      intervals on its parent.
  
    <li> For key K, the path between the root node and the leaf that would
      contain K must visit all nodes that contain segments overlapping K.
  
    <li> The above implies that although intervals on a node may overlap with
      other segments on the same node, an ancestor node or a descendent
      node, they may not overlap with segments on a sibling node.
</ul>
  
<p>
  Within a node, the relative age of each segment is stored (i.e. enough
  information to sort all segments in order of age).

<p>
  In some cases, a node may be extended with one or more overflow pages 
  linked into a list, making nodes effectively variably sized.

<p><b>Insert-Segment:</b>

<ol>
  <li><p> Insert the new segment into the root node. If the root node is not
     also a leaf node, immediately attempt to promote the new segment to
     a child node. The segment may be promoted if:

    <ol>
       <li> There are no existing (older) segments on the root node that
         overlap with it, and

       <li> both the start and end key of the new segment may be stored 
         within the same child sub-tree.
    </ol>

    <p>
     Promote the new segment as many times as possible (i.e. until it
     is stored on a leaf or on a node for which one of the two conditions 
     above is not met).

  <li><p> Check if the node selected by step 1 has enough space to accommodate 
     the new segment. If so, no problem.

    <p>
     Otherwise, the node content is redistributed between itself and
     one or two siblings, and a new sibling allocated if necessary.
     This is slightly more complex than a b-tree rebalance, as the
     first and last keys of each segment must remain on the same node.
     Where this makes the content impossible to distribute between
     nodes, the original node is extended with an overflow page.
</ol>

<p><b> Merge-Segments:</b>

<ol>
  <li> From the selected node X (see below), select two or more overlapping 
     segments to merge. 

  <li> Merge the content of the selected segments together into a set of zero or
     more non-overlapping segments. Remove the originals from node X. Then
     "insert" each of the new segments into the sub-tree headed by node X
     (including attempting to promote the new segments etc.). 

  <li> If, following this, node X is over or underfull, rebalance it with
     its siblings in the manner of a b-tree.
</ol>

<p><b> Select-Segment:</b>

<p>
  Each node maintains a count of the number of overlapping segments 
  currently stored on it. Additionally, each pointer to a sub-tree
  on an internal node also stores the maximum value of this metric for 
  all nodes in the sub-tree.

<p>
  This implies that when a node is updated by a blind write operation, all
  nodes between it and the root node may also require modification (how
  often?).

<p>
  This metric, along with some pseudo-random selection when there 
  exist multiple nodes with the same value, can be used to select
  a reasonable place to merge segments together with little effort.

<h2 id=b-tree_data>4.3. B-Tree Data</h2>

<p>
  B-tree KV pairs are stored in leaf pages. If no merge-tree data is ever
  written to a database it is equivalent to a b+tree. The rebalancing and
  other tree manipulations required as a result of b+-tree style insert and
  delete operations are the same as they are for regular b+-tree structures,
  except that care must be taken to keep the start and end keys of any
  segments on the same internal node (see above).

<p>
  If a database is only ever written using blind writes, then it is in 
  optimal form when no two segments in the database overlap. If a database
  contains both b-tree and blind write data, then the blind writes must
  be merged into the b-tree data before the database is in its optimal
  state.

<p>
  With blind write data only (and no merging with b-tree data), segments
  may propagate all the way to leaf nodes. However, if the tree contains
  both segment and b-tree data, then segments that reside on tree nodes
  that are the head of a sub-tree roughly the size of a single segment 
  should be merged with b-tree data at that point, not promoted further 
  down the tree. So, after a node's segments are selected for merging it 
  needs to be possible to determine if the content should be merged with 
  the sub-tree headed by the node as well as with other segments. The
  answer is yes if:

<ul>
  <li> The sub-tree contains no segments (except those on its root node).
  <li> The sub-tree is smaller in size than some threshold based on the 
       segment size (say 2 or 4 times).
</ul>

<p>
  Storing the number of segments of the sub-tree on each node seems Ok. This
  means updating the path between the updated node and the root each time the
  set of segments on a node is modified. Since segments are much larger than
  this path, and updated relatively infrequently, this is acceptable.

<p>
  Maintaining the total size of the sub-tree is more difficult. Can we update
  each time a leaf is added or removed? Sounds like that might be a lot of
  extra updates over a straight b-tree.

<p>
  Can we just estimate the quantity of data based on the number of children
  the node has along with the distance from the tree leaves? If a node has
  nChild children and the sub-tree is nSub elements high, the size of the 
  sub-tree could be estimated as nChild raised to the nSub'th power. This 
  estimate is probably good enough for the test above.


<h1 id=file_format>5. File Format</h1>

<p>
  The data structure is stored on disk in a manner similar to SQLite WAL
  databases. The file-system is populated with:

<ul>
  <li> A database file.
  <li> A write-ahead-log file, containing recently made modifications to the
       database.
  <li> A shared-memory file, the contents of which may be reconstructed by
       parsing the wal file.
</ul>

<p>
  As well as the new page images contained in an SQLite WAL file, this log 
  file also contains KV pairs and delete markers written in blind write mode.

<p>
  Similarly, as well as the hash-tables used to index the page images contained
  in the log file, the shared-memory file also holds a tree structure
  containing all the blind write data from the log file (as it does in LSM and
  LevelDB).

<p>
  As in SQLite3, free space within the database file is tracked on a per-page 
  basis. The data structure used to store the set of free pages stores them 
  in sorted order.

<p>Note: Track free blocks (blocks on which all pages are free) separately?


<h1 id=locking_and_concurrency>6. Locking and Concurrency</h1>

<p>

<h1 id=pager_level_design>7. Pager Level Design</h1>

<p>
  Include management of free-space within the database in this level.

<p>
  The pager level provides two sizes of allocation:
  <ul>
    <li> Page size (1-4 KB), and
    <li> Block size (512KB to 2MB).
  </ul>
</p>
-->

Changes to www/index.wiki.

  *  [./key_encoding.wiki | The Key Encoding]
  *  [./decimal.wiki | Internal representation of numeric values]
  *  [./porting.wiki | Porting an app from SQLite3 to SQLite4]
  *  [./storage.wiki | How to create a new storage engine]
  *  [./lsmusr.wiki  | LSM User Manual]
  *  [./lsmperf.wiki | LSM Benchmarks]
  *  [./lsmapi.wiki  | LSM API Reference]
+ *  [./bt.wiki      | BT Design]