mirror of
https://github.com/luanti-org/luanti.git
synced 2025-10-13 00:25:19 +02:00
Add dummy and LevelDB database backends
This commit is contained in:
213
src/leveldb/doc/impl.html
Normal file
213
src/leveldb/doc/impl.html
Normal file
@@ -0,0 +1,213 @@
|
||||
<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<link rel="stylesheet" type="text/css" href="doc.css" />
|
||||
<title>Leveldb file layout and compactions</title>
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<h1>Files</h1>
|
||||
|
||||
The implementation of leveldb is similar in spirit to the
|
||||
representation of a single
|
||||
<a href="http://labs.google.com/papers/bigtable.html">
|
||||
Bigtable tablet (section 5.3)</a>.
|
||||
However the organization of the files that make up the representation
|
||||
is somewhat different and is explained below.
|
||||
|
||||
<p>
|
||||
Each database is represented by a set of files stored in a directory.
|
||||
There are several different types of files as documented below:
|
||||
<p>
|
||||
<h2>Log files</h2>
|
||||
<p>
|
||||
A log file (*.log) stores a sequence of recent updates. Each update
|
||||
is appended to the current log file. When the log file reaches a
|
||||
pre-determined size (approximately 4MB by default), it is converted
|
||||
to a sorted table (see below) and a new log file is created for future
|
||||
updates.
|
||||
<p>
|
||||
A copy of the current log file is kept in an in-memory structure (the
|
||||
<code>memtable</code>). This copy is consulted on every read so that read
|
||||
operations reflect all logged updates.
|
||||
<p>
|
||||
<h2>Sorted tables</h2>
|
||||
<p>
|
||||
A sorted table (*.sst) stores a sequence of entries sorted by key.
|
||||
Each entry is either a value for the key, or a deletion marker for the
|
||||
key. (Deletion markers are kept around to hide obsolete values
|
||||
present in older sorted tables).
|
||||
<p>
|
||||
The set of sorted tables are organized into a sequence of levels. The
|
||||
sorted table generated from a log file is placed in a special <code>young</code>
|
||||
level (also called level-0). When the number of young files exceeds a
|
||||
certain threshold (currently four), all of the young files are merged
|
||||
together with all of the overlapping level-1 files to produce a
|
||||
sequence of new level-1 files (we create a new level-1 file for every
|
||||
2MB of data.)
|
||||
<p>
|
||||
Files in the young level may contain overlapping keys. However files
|
||||
in other levels have distinct non-overlapping key ranges. Consider
|
||||
level number L where L >= 1. When the combined size of files in
|
||||
level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2,
|
||||
...), one file in level-L, and all of the overlapping files in
|
||||
level-(L+1) are merged to form a set of new files for level-(L+1).
|
||||
These merges have the effect of gradually migrating new updates from
|
||||
the young level to the largest level using only bulk reads and writes
|
||||
(i.e., minimizing expensive seeks).
|
||||
|
||||
<h2>Manifest</h2>
|
||||
<p>
|
||||
A MANIFEST file lists the set of sorted tables that make up each
|
||||
level, the corresponding key ranges, and other important metadata.
|
||||
A new MANIFEST file (with a new number embedded in the file name)
|
||||
is created whenever the database is reopened. The MANIFEST file is
|
||||
formatted as a log, and changes made to the serving state (as files
|
||||
are added or removed) are appended to this log.
|
||||
<p>
|
||||
<h2>Current</h2>
|
||||
<p>
|
||||
CURRENT is a simple text file that contains the name of the latest
|
||||
MANIFEST file.
|
||||
<p>
|
||||
<h2>Info logs</h2>
|
||||
<p>
|
||||
Informational messages are printed to files named LOG and LOG.old.
|
||||
<p>
|
||||
<h2>Others</h2>
|
||||
<p>
|
||||
Other files used for miscellaneous purposes may also be present
|
||||
(LOCK, *.dbtmp).
|
||||
|
||||
<h1>Level 0</h1>
|
||||
When the log file grows above a certain size (1MB by default):
|
||||
<ul>
|
||||
<li>Create a brand new memtable and log file and direct future updates here
|
||||
<li>In the background:
|
||||
<ul>
|
||||
<li>Write the contents of the previous memtable to an sstable
|
||||
<li>Discard the memtable
|
||||
<li>Delete the old log file and the old memtable
|
||||
<li>Add the new sstable to the young (level-0) level.
|
||||
</ul>
|
||||
</ul>
|
||||
|
||||
<h1>Compactions</h1>
|
||||
|
||||
<p>
|
||||
When the size of level L exceeds its limit, we compact it in a
|
||||
background thread. The compaction picks a file from level L and all
|
||||
overlapping files from the next level L+1. Note that if a level-L
|
||||
file overlaps only part of a level-(L+1) file, the entire file at
|
||||
level-(L+1) is used as an input to the compaction and will be
|
||||
discarded after the compaction. Aside: because level-0 is special
|
||||
(files in it may overlap each other), we treat compactions from
|
||||
level-0 to level-1 specially: a level-0 compaction may pick more than
|
||||
one level-0 file in case some of these files overlap each other.
|
||||
|
||||
<p>
|
||||
A compaction merges the contents of the picked files to produce a
|
||||
sequence of level-(L+1) files. We switch to producing a new
|
||||
level-(L+1) file after the current output file has reached the target
|
||||
file size (2MB). We also switch to a new output file when the key
|
||||
range of the current output file has grown enough to overlap more then
|
||||
ten level-(L+2) files. This last rule ensures that a later compaction
|
||||
of a level-(L+1) file will not pick up too much data from level-(L+2).
|
||||
|
||||
<p>
|
||||
The old files are discarded and the new files are added to the serving
|
||||
state.
|
||||
|
||||
<p>
|
||||
Compactions for a particular level rotate through the key space. In
|
||||
more detail, for each level L, we remember the ending key of the last
|
||||
compaction at level L. The next compaction for level L will pick the
|
||||
first file that starts after this key (wrapping around to the
|
||||
beginning of the key space if there is no such file).
|
||||
|
||||
<p>
|
||||
Compactions drop overwritten values. They also drop deletion markers
|
||||
if there are no higher numbered levels that contain a file whose range
|
||||
overlaps the current key.
|
||||
|
||||
<h2>Timing</h2>
|
||||
|
||||
Level-0 compactions will read up to four 1MB files from level-0, and
|
||||
at worst all the level-1 files (10MB). I.e., we will read 14MB and
|
||||
write 14MB.
|
||||
|
||||
<p>
|
||||
Other than the special level-0 compactions, we will pick one 2MB file
|
||||
from level L. In the worst case, this will overlap ~ 12 files from
|
||||
level L+1 (10 because level-(L+1) is ten times the size of level-L,
|
||||
and another two at the boundaries since the file ranges at level-L
|
||||
will usually not be aligned with the file ranges at level-L+1). The
|
||||
compaction will therefore read 26MB and write 26MB. Assuming a disk
|
||||
IO rate of 100MB/s (ballpark range for modern drives), the worst
|
||||
compaction cost will be approximately 0.5 second.
|
||||
|
||||
<p>
|
||||
If we throttle the background writing to something small, say 10% of
|
||||
the full 100MB/s speed, a compaction may take up to 5 seconds. If the
|
||||
user is writing at 10MB/s, we might build up lots of level-0 files
|
||||
(~50 to hold the 5*10MB). This may signficantly increase the cost of
|
||||
reads due to the overhead of merging more files together on every
|
||||
read.
|
||||
|
||||
<p>
|
||||
Solution 1: To reduce this problem, we might want to increase the log
|
||||
switching threshold when the number of level-0 files is large. Though
|
||||
the downside is that the larger this threshold, the more memory we will
|
||||
need to hold the corresponding memtable.
|
||||
|
||||
<p>
|
||||
Solution 2: We might want to decrease write rate artificially when the
|
||||
number of level-0 files goes up.
|
||||
|
||||
<p>
|
||||
Solution 3: We work on reducing the cost of very wide merges.
|
||||
Perhaps most of the level-0 files will have their blocks sitting
|
||||
uncompressed in the cache and we will only need to worry about the
|
||||
O(N) complexity in the merging iterator.
|
||||
|
||||
<h2>Number of files</h2>
|
||||
|
||||
Instead of always making 2MB files, we could make larger files for
|
||||
larger levels to reduce the total file count, though at the expense of
|
||||
more bursty compactions. Alternatively, we could shard the set of
|
||||
files into multiple directories.
|
||||
|
||||
<p>
|
||||
An experiment on an <code>ext3</code> filesystem on Feb 04, 2011 shows
|
||||
the following timings to do 100K file opens in directories with
|
||||
varying number of files:
|
||||
<table class="datatable">
|
||||
<tr><th>Files in directory</th><th>Microseconds to open a file</th></tr>
|
||||
<tr><td>1000</td><td>9</td>
|
||||
<tr><td>10000</td><td>10</td>
|
||||
<tr><td>100000</td><td>16</td>
|
||||
</table>
|
||||
So maybe even the sharding is not necessary on modern filesystems?
|
||||
|
||||
<h1>Recovery</h1>
|
||||
|
||||
<ul>
|
||||
<li> Read CURRENT to find name of the latest committed MANIFEST
|
||||
<li> Read the named MANIFEST file
|
||||
<li> Clean up stale files
|
||||
<li> We could open all sstables here, but it is probably better to be lazy...
|
||||
<li> Convert log chunk to a new level-0 sstable
|
||||
<li> Start directing new writes to a new log file with recovered sequence#
|
||||
</ul>
|
||||
|
||||
<h1>Garbage collection of files</h1>
|
||||
|
||||
<code>DeleteObsoleteFiles()</code> is called at the end of every
|
||||
compaction and at the end of recovery. It finds the names of all
|
||||
files in the database. It deletes all log files that are not the
|
||||
current log file. It deletes all table files that are not referenced
|
||||
from some level and are not the output of an active compaction.
|
||||
|
||||
</body>
|
||||
</html>
|
Reference in New Issue
Block a user