<title>Measuring and Reducing CPU Usage in SQLite</title>
<tcl>hd_keywords {CPU cycles used} {CPU performance measurement}</tcl>



<p>The graph below shows the number of CPU cycles used by SQLite on a
standard workload, for versions of SQLite going back about 10 years.
Recent versions of SQLite use about one-third as many CPU cycles
as older versions.

This article describes how the SQLite developers measure CPU usage,
what those measurements actually mean, and the techniques used by
SQLite developers on their continuing quest to further reduce the
CPU usage of the SQLite library.

<div class="imgcontainer">
<img src="./images/cpu-usage.jpg"></div><br>
Measured using cachegrind on Ubuntu 16.04 on x64 with gcc 5.4.0 and -Os.<br>

<h1>Measuring Performance</h1>

<p>In brief, the CPU performance of SQLite is measured as follows:

<li> Compile SQLite in an as-delivered configuration, without any special
     telemetry or debugging options.
<li> Link SQLite against a test program that runs approximately 30,000
     SQL statements representing a typical workload.
<li> Count the number of CPU cycles consumed using cachegrind.

<h2>Compile Options</h2>

<p>For performance measurement, SQLite is compiled in approximately the same
way as it would be for use in production systems.  The compile-time configuration
is "approximate" in the sense that every production use of SQLite is 
different. Compile-time options used by one system are not necessarily
the same as those used by others.  The key point is that options that 
significantly impact the generated machine code are avoided.  For example,
the -DSQLITE_DEBUG option is omitted because that option inserts thousands
of assert() statements in the middle of performance critical sections of the
SQLite library.  The -pg option (on GCC) is omitted because it causes the
compiler to emit extra profiling instrumentation, which would distort
the very performance being measured.

For performance measurements,
the -Os option is used (optimize for size) rather than -O2 because the
-O2 option creates so much code movement that it is difficult to associate
specific CPU instructions with specific lines of C source code.


The "typical" workload is generated by the
program in the canonical SQLite source tree.  This program strives to
exercise the SQLite library in a way that is typical of real-world
applications.  Of course, every application is different, and so
no test program can exactly mirror the behavior of all applications.

The speedtest1.c program is updated from time to time as the SQLite
developers' understanding of what constitutes "typical" usage evolves.

<p>A shell
script, also in the canonical source tree, is used to run the speedtest1.c
program.  To replicate the performance measurements, collect the following
files into a single directory:
<li> the shell script,
<li> the "speedtest1.c" test program, and
<li> the [amalgamation|SQLite amalgamation] source files "sqlite3.c" and
     "sqlite3.h".
Then run the shell script with "trunk" as its argument.

<h2>Performance Measurement</h2>

<p>Cachegrind is used to
measure performance because it gives answers that are repeatable to 
7 or more significant digits.  In comparison, actual (wall-clock)
run times are scarcely repeatable beyond one significant digit.

<tcl>hd_fragment microopt microoptimizations</tcl>

The high repeatability of cachegrind allows the SQLite developers to
implement and measure "microoptimizations".  A microoptimization is
a change to the code that results in a very small performance increase.
Typical micro-optimizations reduce the number of CPU cycles by 0.1% or
0.05% or even less.  Such improvements are impossible to measure with
real-world timings.  But hundreds or thousands of microoptimizations
add up, resulting in measurable real-world performance gains.

<h1>Performance Measurement Workflow</h1>

<p>As SQLite developers edit the SQLite source code, they run a
shell script to track the performance impact of their changes.  This
script compiles the speedtest1.c program, runs it under cachegrind,
processes the cachegrind output using the cg_anno.tcl TCL
script, then saves the results in a series of text files.
Typical output from the script looks like this:

==8683== I   refs:      <font color="red">1,060,925,768</font>
==8683== I1  misses:       23,731,246
==8683== LLi misses:            5,176
==8683== I1  miss rate:          2.24%
==8683== LLi miss rate:          0.00%
==8683== D   refs:        557,686,925  (361,828,925 rd   + 195,858,000 wr)
==8683== D1  misses:        5,067,063  (  3,544,278 rd   +   1,522,785 wr)
==8683== LLd misses:           57,958  (     16,067 rd   +      41,891 wr)
==8683== D1  miss rate:           0.9% (        1.0%     +         0.8%  )
==8683== LLd miss rate:           0.0% (        0.0%     +         0.0%  )
==8683== LL refs:          28,798,309  ( 27,275,524 rd   +   1,522,785 wr)
==8683== LL misses:            63,134  (     21,243 rd   +      41,891 wr)
==8683== LL miss rate:            0.0% (        0.0%     +         0.0%  )
   text	   data	    bss	    dec	    hex	filename
 523044	   8240	   1976	 <font color="red">533260</font>	  8230c	sqlite3.o
 220507 1007870 7769352 sqlite3.c

<p>The important parts of the output (the parts that the developers pay
the most attention to) are shown in red.
Basically, the developers want to know the size of the compiled SQLite
library and how many CPU cycles were needed to run the performance test.

<p>The output from the cg_anno.tcl script
shows the number of CPU cycles spent on each line of code.
The report is approximately 80,000 lines long.  The following is a brief
snippet taken from the middle of the report to show what it looks like:

         .  SQLITE_PRIVATE int sqlite3BtreeNext(BtCursor *pCur, int *pRes){
         .    MemPage *pPage;
         .    assert( cursorOwnsBtShared(pCur) );
         .    assert( pRes!=0 );
         .    assert( *pRes==0 || *pRes==1 );
         .    assert( pCur-&gt;skipNext==0 || pCur-&gt;eState!=CURSOR_VALID );
   369,648    pCur-&gt;info.nSize = 0;
   369,648    pCur-&gt;curFlags &amp;= ~(BTCF_ValidNKey|BTCF_ValidOvfl);
   369,648    *pRes = 0;
   739,296    if( pCur-&gt;eState!=CURSOR_VALID ) return btreeNext(pCur, pRes);
 1,473,580    pPage = pCur-&gt;apPage&#91;pCur-&gt;iPage&#93;;
 1,841,975    if( (++pCur-&gt;aiIdx&#91;pCur-&gt;iPage&#93;)&gt;=pPage-&gt;nCell ){
     4,340      pCur-&gt;aiIdx&#91;pCur-&gt;iPage&#93;--;
     5,593      return btreeNext(pCur, pRes);
         .    }
   728,110    if( pPage-&gt;leaf ){
         .      return SQLITE_OK;
         .    }else{
     3,117      return moveToLeftmost(pCur);
         .    }
   721,876  }

The numbers on the left are the CPU cycle counts for that line of code,
of course.

The cg_anno.tcl script removes extraneous details from the default 
cachegrind annotation
output so that before-and-after reports can be compared using a 
side-by-side diff to view specific details of how a
micro-optimization attempt affected performance.


<p>The use of the standardized speedtest1.c workload and cachegrind has
enabled significant performance improvement.
However, it is important to recognize the limitations of this approach:

Performance measurements are done with a single compiler (gcc 5.4.0),
optimization setting (-Os), and
on a single platform (Ubuntu 16.04 LTS on x64).  The performance of
other compilers and processors may vary.

The speedtest1.c workload that is being measured tries to be representative
of a wide range of typical uses of SQLite.  But every application is
different.  The speedtest1.c workload might not be a good proxy for the
kinds of activities performed by some applications.  The SQLite developers
are constantly working to improve the speedtest1.c program, to make it
a better proxy for actual SQLite usage.  Community feedback is welcomed.

The cycle counts provided by cachegrind are a good proxy for actual
performance, but they are not 100% accurate.

Only CPU cycle counts are being measured here. 
CPU cycle counts are a good proxy for energy consumption,
but do not necessarily correlate well with real-world timings.
Time spent doing I/O is not reflected in the CPU cycle counts,
and I/O time predominates in many SQLite usage scenarios.