BetrFS: A Right-Optimized Write-Optimized File System
William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet*, Yizheng Jiao,
Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh*, Michael Bender,
Martin Farach-Colton†, Rob Johnson, Bradley C. Kuszmaul‡, and Donald E. Porter
Stony Brook University, *Tokutek Inc., †Rutgers University, and ‡Massachusetts Institute of Technology
Abstract
The Bε-tree File System, or BetrFS (pronounced
"better eff ess"), is the first in-kernel file system to use a
write-optimized index. Write-optimized indexes (WOIs)
are promising building blocks for storage systems be-
cause of their potential to implement both microwrites
and large scans efficiently.
Previous work on WOI-based file systems has shown
promise but has also been hampered by several open
problems, which this paper addresses. For example,
FUSE issues many queries into the file system, su-
perimposing read-intensive workloads on top of write-
intensive ones, thereby reducing the effectiveness of
WOIs. Moving to an in-kernel implementation can ad-
dress this problem by providing finer control of reads.
This paper also contributes several implementation tech-
niques to leverage kernel infrastructure without throttling
write performance.
Our results show that BetrFS provides good perfor-
mance for both arbitrary microdata operations, which in-
clude creating small files, updating metadata, and small
writes into large or small files, and for large sequen-
tial I/O. On one microdata benchmark, BetrFS pro-
vides more than 4× the performance of ext4 or XFS.
BetrFS is an ongoing prototype effort, and requires ad-
ditional data-structure tuning to match current general-
purpose file systems on some operations such as deletes,
directory renames, and large sequential writes. Nonethe-
less, many applications realize significant performance
improvements. For instance, an in-place rsync of the
Linux kernel source realizes roughly 1.6–22× speedup
over other commodity file systems.
1 Introduction
Today’s applications exhibit widely varying I/O patterns,
making performance tuning of a general-purpose file sys-
tem a frustrating balancing act. Some software, such
as virus scans and backups, demands large, sequential
scans of data. Other software requires many small writes
(microwrites). Examples include email delivery, creat-
ing lock files for an editing application, making small
updates to a large file, or updating a file’s atime. The un-
derlying problem is that many standard data structures in
the file-system designer’s toolbox optimize for one case
at the expense of another.
Recent advances in write-optimized indexes (WOI) [4,
8–10, 23, 27, 28] are exciting because they have the po-
tential to implement both efficient microwrites and large
scans. The key strength of the best WOIs is that they can
ingest data up to two orders of magnitude faster than B-
trees while matching or improving on the B-tree’s point-
and range-query performance [4,9].
WOIs have been successful in commercial key-value
stores and databases [2,3,12,17,20,33,34], and previous
research on WOIs in file systems has shown promise [15,
25,31]. However, progress towards a production-quality
write-optimized file system has been hampered by sev-
eral open challenges, which we address in this paper:
Code complexity. A production-quality WOI can eas-
ily be 50,000 lines of complex code, which is difficult
to shoehorn into an OS kernel. Previous WOI file sys-
tems have been implemented in user space.
FUSE squanders microwrite performance.
FUSE [16] issues a query to the underlying file
system before almost every update, superimposing
search-intensive workloads on top of write-intensive
workloads. Although WOIs are no worse for point
queries than any other sensible data structure, writes
are much faster than reads, and injecting needless
point queries can nullify the advantages of write
optimization.
Mapping file system abstractions onto a WOI. We
cannot realize the full potential performance benefits
of write-optimization by simply dropping in a WOI
as a replacement for a B-tree. The schema and use
of kernel infrastructure must exploit the performance
advantages of the new data structure.
This paper describes the Bε-tree File System, or
BetrFS, the first in-kernel file system designed to take
full advantage of write optimization. Specifically,
BetrFS is built using the mature, well-engineered Bε-tree
implementation from Tokutek's Fractal Tree index,
called TokuDB [33]. Our design is tailored to the perfor-
mance characteristics of Fractal Tree indexes, but other-
wise uses them as a black-box key-value store, so many
of our design decisions may be applicable to other write-
optimized file systems.
Experiments show that BetrFS can give up to an order-
of-magnitude improvement to the performance of file
creation, random writes to large files, recursive directory
traversals (such as occur in find, recursive greps, back-
ups, and virus scans), and meta-data updates (such as up-
dating file atime each time a file is read).
The contributions of this paper are:
- The design and implementation of an in-kernel, write-optimized file system.
- A schema, which ensures both locality and fast writes, for mapping VFS operations to a write-optimized index.
- A design that uses unmodified OS kernel infrastructure, designed for traditional file systems, yet minimizes the impact on write optimization. For instance, BetrFS uses the OS kernel cache to accelerate reads without throttling writes smaller than a disk sector.
- A thorough evaluation of the performance of the BetrFS prototype. For instance, our results show almost two orders of magnitude improvement on a small-write microbenchmark and a 1.5× speedup on applications such as rsync.
Our results suggest that a well-designed Bε-tree-based
file system can match or outperform traditional file sys-
tems on almost every operation, some by an order of
magnitude. Comparisons with state-of-the-art file sys-
tems, such as ext4, XFS, zfs, and btrfs, support this
claim. We believe that the few slower operations are not
fundamental to WOIs, but can be addressed with a com-
bination of algorithmic advances and engineering effort
in future work.
2 Motivation and Background
This section summarizes background on the increasing
importance of microwrites, explains how WOIs work,
and summarizes previous WOI-based file systems.
2.1 The microwrite problem
A microwrite is a write operation where the setup time
(i.e. seek time on a conventional disk) exceeds the data-
transfer time. Conventional file-system data structures
force file-system designers to choose between optimizing
for efficient microwrites and efficient scans.
Update-in-place file systems [11, 32] optimize for
scans by keeping related items, such as entries in a di-
rectory or consecutive blocks in a file, near each other.
However, since items are updated in place, update per-
formance is often limited by the random-write latency of
the underlying disk.
B-tree-based file systems store related items logically
adjacent in the B-tree, but B-trees do not guarantee that
logically-adjacent items will be physically adjacent. As
a B-tree ages, leaves become scattered across the disk
due to node splits from insertions and node merges from
deletions. In an aged B-tree, there is little correlation be-
tween the logical and physical order of the leaves, and the
cost of reading a new leaf involves both the data-transfer
cost and the seek cost. If leaves are too small to amor-
tize the seek costs, then range queries can be slow. The
seek costs can be amortized by using larger leaves, but
this further throttles update performance.
At the other extreme, logging file systems [5,7,26,29,
30] optimize for writes. Logging ensures that files are
created and updated rapidly, but the resulting data and
metadata can be spread throughout the log, leading to
poor performance when reading data or metadata from
disk. These performance problems are particularly no-
ticeable in large scans (recursive directory traversals and
backups) that cannot be accelerated by caching.
The microwrite bottleneck creates problems for a
range of applications. HPC checkpointing systems gen-
erate so many microwrites that a custom file system,
PLFS, was designed to efficiently handle them [5] by
exploiting the specifics of the checkpointing workload.
Email servers often struggle to manage large sets of
small messages and metadata about those messages, such
as the read flag. Desktop environments store prefer-
ences and active state in a key-value store (i.e., a reg-
istry) so that accessing and updating keys will not re-
quire file-system-level microdata operations. Unix and
Linux system administrators commonly report 10–20%
performance improvements by disabling the atime op-
tion [14]; maintaining the correct atime behavior in-
duces a heavy microwrite load, but some applications re-
quire accurate atime values.
Microwrites cause performance problems even when
the storage system uses SSDs. In a B-tree-based file sys-
tem, small writes trigger larger writes of entire B-tree
nodes, which can further be amplified to an entire erase
block on the SSD. In a log-structured file system, mi-
crowrites can induce heavy cleaning activity, especially
when the disk is nearly full. In either case, the extra write
activity reduces the lifespan of SSDs and can limit per-
formance by wasting bandwidth.
2.2 Write-Optimized Indexes
In this subsection, we review write-optimized indexes
and their impact on performance. Specifically, we de-
scribe the Bε-tree [9] and why we have selected this data
structure for BetrFS. The best WOIs can dominate B-trees
in performance rather than merely representing a different
trade-off between reads and writes.
Bε-trees. A Bε-tree is a B-tree, augmented with per-
node buffers. New items are inserted in the buffer of the
root node of a Bε-tree. When a node's buffer becomes
full, messages are moved from that node's buffer to one
of its children's buffers. The leaves of the Bε-tree store
key-value pairs, as in a B-tree. Point and range queries
behave similarly to a B-tree, except that each buffer on
the path from root to leaf must also be checked for items
that affect the query.
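To make the structure concrete, the sketch below shows one plausible in-memory layout for a Bε-tree node; the type and field names are ours and do not reflect TokuDB's internal representation.

```c
/* Illustrative sketch of a B^ε-tree node; names and layout are
 * hypothetical and do not mirror TokuDB's implementation. */
#include <stddef.h>

struct msg {                       /* a buffered insert, upsert, or delete */
    char  *key;
    char  *payload;
    size_t payload_len;
};

struct benode {
    int             is_leaf;
    size_t          nchildren;     /* fanout, roughly B^ε                  */
    char          **pivots;        /* nchildren-1 pivot keys               */
    struct benode **children;      /* child pointers (interior nodes only) */
    struct msg     *buffer;        /* pending messages (interior nodes)    */
    size_t          nbuffered;
    struct msg     *pairs;         /* key-value pairs (leaves only)        */
    size_t          npairs;
};

/* Inserts land in the root's buffer; when a buffer fills, its messages are
 * flushed to the child whose key range contains them. A query must check
 * every buffer on the root-to-leaf path for messages affecting its key. */
```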
Bε-trees are asymptotically faster than B-trees, as
summarized in Table 1. To see why, consider a B-tree
with N items and in which each node can hold B keys.
(For simplicity, assume keys have constant size and that
the data associated with each key has negligible size.)
The tree will have fanout B, so its height will be
O(log_B N). Inserts and lookups will therefore require
O(log_B N) I/Os. A range query that returns k items will
require O(log_B N + k/B) I/Os.
For comparison, a Bε-tree with nodes of size B will
have B^ε children, where 0 < ε ≤ 1. Each node will store
one "pivot key" for each child, consuming B^ε space per
node. The remaining B − B^ε space in each node will be
used to buffer newly inserted items. Since the fanout of
the tree is B^ε, its height is O(log_(B^ε) N) = O((1/ε) log_B N).
Consequently, searches will be slower by a factor of 1/ε.
However, each time a node flushes items to one of its
children, it will move at least (B − B^ε)/B^ε ≈ B^(1−ε) ele-
ments. Since each element must be flushed O((1/ε) log_B N)
times to reach a leaf, the amortized cost of inserting
an item is O((log_B N)/(εB^(1−ε))) I/Os. Range queries returning k
items require O((1/ε) log_B N + k/B) I/Os. If we pick, for
example, ε = 1/2, the point and range query costs of a
Bε-tree become O(log_B N) and O(log_B N + k/B), which
are the same as a B-tree, but the insert cost becomes
O(log_B N / √B), which is faster by a factor of √B.
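For reference, the costs derived above can be collected in one place (amortized I/O counts, constants omitted):

```latex
\begin{align*}
\text{B-tree:}\quad
  & O(\log_B N)\ \text{insert},\quad
    O(\log_B N)\ \text{point query},\quad
    O(\log_B N + k/B)\ \text{range query}\\
B^{\varepsilon}\text{-tree:}\quad
  & O\!\Big(\tfrac{\log_B N}{\varepsilon B^{1-\varepsilon}}\Big)\ \text{insert},\quad
    O\!\Big(\tfrac{\log_B N}{\varepsilon}\Big)\ \text{point query},\quad
    O\!\Big(\tfrac{\log_B N}{\varepsilon} + \tfrac{k}{B}\Big)\ \text{range query}\\
\varepsilon = \tfrac{1}{2}:\quad
  & O\!\Big(\tfrac{\log_B N}{\sqrt{B}}\Big)\ \text{insert, with point- and range-query costs matching a B-tree.}
\end{align*}
```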
In practice, however, Bε-trees use much larger nodes
than B-trees. For example, a typical B-tree might
use 4KB or 64KB nodes, compared to 4MB nodes in
Tokutek's Fractal Tree indexes. B-trees must use small
nodes because a node must be completely re-written each
time a new item is added to the database, unlike in Bε-
trees, where writes are batched. Large nodes mean that,
in practice, the height of a Bε-tree is not much larger than
the height of a B-tree on the same data. Thus, the perfor-
mance of point queries in a Bε-tree implementation can
be comparable to point query performance in a B-tree.
Large nodes also speed up range queries, since the data
is spread over fewer nodes, requiring fewer disk seeks to
read in all the data.
To get a feeling for what this speedup looks like,
consider the following example. Suppose a key-value
store holds 1TB of data, with 128-byte keys and records
(key+value) of size 1KB. Suppose that data is logged for
durability, and periodically all updates in the log are ap-
plied to the main tree in batch.
In the case of a B-tree with 4KB nodes, the fanout of
the tree will be 4KB/128B= 32. Thus the non-leaf nodes
can comfortably fit into the memory of a typical server
with 64GB of RAM, but only a negligible fraction of the
1TB of leaves will be cached at any given time. During a
random insertion workload, most updates in a batch will
require exactly 2 I/Os: 1 I/O to read in the target leaf and
1 I/O to write it back to disk after updating its contents.
For comparison, suppose a Bε-tree has a branching fac-
tor of 10 and nodes of size 1MB. Once again, all but the
last level fit in memory. When a batch of logged updates
is applied to the tree, they are simply stored in the tree's
root buffer. Since the root is cached, this requires a sin-
gle I/O. When an internal node becomes full and flushes
its buffer to a non-leaf child, this causes two writes: an
update of the parent and an update of the child. There
are no reads required since both nodes are cached. When
an internal node flushes its buffer to a leaf node, this re-
quires one additional read to load the leaf into memory.
There are 1TB/1MB = 2^20 leaves, so since the tree has
fanout 10, its height is 1 + log_10 2^20 ≈ 7. Each item is
therefore involved in 14 I/Os: it is written and read once
at each level.
However, each flush moves 1MB/10 = 100KB of data,
in other words, 100 items. Thus, the average per-item
cost of flushing an item to a leaf is 14/100 I/Os. Since a B-
tree would require 2 I/Os for each item, the Bε-tree is
able to insert data 2/(14/100) ≈ 14.3 times faster than a
B-tree. As key-value pairs get smaller, say for metadata
updates, this speedup factor grows.
In both cases, a point query requires a single I/O to
read the corresponding leaf for the queried key. Range
queries can be much faster, as the Bε-tree seeks only
once every 1MB vs. once every 4KB in the B-tree.
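The arithmetic in this example is easy to mechanize; the short program below recomputes it under the same assumptions (1KB records, 4KB B-tree leaves, 1MB Bε-tree nodes with fanout 10), as a sanity check rather than a model of any real system.

```c
/* Recompute the back-of-the-envelope insertion costs from the example. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double btree_ios   = 2.0;             /* read + write one leaf per item    */

    double leaves      = pow(2, 20);      /* 1TB of data / 1MB leaves          */
    double height      = round(1.0 + log10(leaves));  /* fanout 10 -> about 7  */
    double chain_ios   = 2.0 * height;    /* each item read+written per level  */
    double items_moved = 100.0;           /* each flush moves 1MB/10 of 1KB items */
    double betree_ios  = chain_ios / items_moved;      /* ~0.14 I/Os per item   */

    printf("B-tree: %.1f I/Os/item, B^e-tree: %.2f I/Os/item, speedup %.1fx\n",
           btree_ios, betree_ios, btree_ios / betree_ios);
    return 0;
}
```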
Because buffered messages are variable length in our
implementation, even with fixed-size nodes, B is not con-
stant. Rather than fix ε, our implementation bounds the
range of pivots per node (B^ε) between 4 and 16.
Upserts. Bε-trees support "upserts," an efficient method
for updating a key-value pair in the tree. When an ap-
plication wants to update the value associated with key
k in the Bε-tree, it inserts a message (k, (f, Δ)) into the
tree, where f specifies a call-back function that can be
used to apply the change specified by Δ to the old value
associated with key k. This message is inserted into the
tree like any other piece of data. However, every time
the message is flushed from one node to a child C, the
Bε-tree checks whether C's buffer contains the old value
v associated with key k. If it does, then it replaces v with
f(v, Δ) and discards the upsert message from the tree. If
an application queries for k before the callback function
is applied, then the Bε-tree computes f(v, Δ) on the fly
while answering the query. This query is efficient be-
cause an upsert for key k always lies on the path from
the root of the Bε-tree to the leaf containing k. Thus, the
upsert mechanism can speed up updates by one to two
orders of magnitude without slowing down queries.
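A minimal sketch of this mechanism follows; the message layout and callback signature are ours, not TokuDB's actual upsert API.

```c
/* Hypothetical upsert message and its resolution; TokuDB's real message
 * encoding and callback interface differ. */
struct upsert {
    const char *key;                               /* key k                 */
    void (*f)(char *old_val, const void *delta);   /* callback f            */
    const void *delta;                             /* change description Δ  */
};

/* Invoked when the message meets the old value v for key k, either while
 * being flushed toward a leaf or lazily while answering a query for k. */
static void apply_upsert(char *v, const struct upsert *u)
{
    u->f(v, u->delta);   /* v becomes f(v, Δ); the message is then discarded */
}
```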
Log-structured merge trees. Log-structured merge
trees (LSM trees) [23] are a WOI with many vari-
ants [28]. They typically consist of a logarithmic num-
ber of indexes (e.g., B-trees) of exponentially increasing
size. Once an index at one level fills up, it is emptied by
merging it into the index at the next larger level.
LSM trees can be tuned to have the same insertion
complexity as a Bε-tree, but queries in a naïvely im-
plemented LSM tree can be slow, as shown in Table 1.
Implementers have developed methods to improve the
query performance, most notably using Bloom filters [6]
for each B-tree. A point query for an element in the data
structure is typically reported as improving to O(log_B N),
thus matching a B-tree. Bloom filters are used in most
LSM tree implementations (e.g., [3, 12, 17, 20]).
Bloom filters do not help in range queries, since the
successor of any key may be in any level. In addition, the
utility of Bloom filters degrades with the use of upserts,
and upserts are key to the performance of BetrFS. To see
why, note that in order to compute the result of a query,
all relevant upserts must be applied to the key-value pair.
If there are many upserts “in flight” at different levels of
the LSM tree, then searches will need to be performed on
each such level. Bloom filters can be helpful to direct us
to the levels of interest, but that does not obviate the need
for many searches, and since the leaves of the different
LSM tree B-trees might require fetching, the search per-
formance can degrade.
BetrFS uses Bε-trees, as implemented in Tokutek's
Fractal Tree indexes, because Fractal Tree indexes are
the only WOI implementation that matches the query
performance of B-trees for all workloads, including
the upsert-intensive workloads generated by BetrFS. In
short, LSMs match B-tree query times in special cases,
and Bε-trees match B-tree query times in general.
3 BetrFS Design
BetrFS is an in-kernel file system designed to take full
advantage of the performance strengths of Bε-trees.
Data Struct.  Insert                  Point Query              Point Query              Range Query
                                      (no Upserts)             (w/ Upserts)
B-tree        log_B N                 log_B N                  log_B N                  log_B N + k/B
LSM           (log_B N)/(εB^(1-ε))    (log_B N)^2/ε            (log_B N)^2/ε            (log_B N)^2/ε + k/B
LSM+BF        (log_B N)/(εB^(1-ε))    log_B N                  (log_B N)^2/ε            (log_B N)^2/ε + k/B
Bε-tree       (log_B N)/(εB^(1-ε))    (log_B N)/ε              (log_B N)/ε              (log_B N)/ε + k/B

Table 1: Asymptotic I/O costs of important operations in
B-trees and several different WOIs. Fractal Tree indexes
simultaneously support efficient inserts, point queries
(even in the presence of upserts), and range queries.
[Figure 1 (diagram omitted): user-space applications call into the kernel VFS; the BetrFS schema maps these calls onto TokuDB's metadata (inode) index and data index; TokuDB runs over the klibc shim and stores its data through a second VFS layer onto ext4 and the disk.]
Figure 1: The BetrFS architecture.
The overall system architecture is illustrated in Figure 1.
The BetrFS VFS schema transforms file-system opera-
tions into efficient Bε-tree operations whenever possible.
The keys to obtaining good performance from Bε-trees
are (1) to use upsert operations to update file system state
and (2) to organize data so that file-system scans can be
implemented as range queries in the Bε-trees. We de-
scribe how our schema achieves these goals in Section 4.
By implementing BetrFS as an in-kernel file system,
we avoid the performance overheads of FUSE, which can
be particularly deleterious for a write-optimized file sys-
tem. We also expose opportunities for optimizing our file
system’s interaction with the kernel’s page cache.
BetrFS’s stacked file-system design cleanly separates
the complex task of implementing write-optimized in-
dexes from block allocation and free-space management.
Our kernel port of TokuDB stores data on an underlying
ext4 file system, but any file system should suffice.
Porting a 45KLoC database into the kernel is a non-
trivial task. We ported TokuDB into the kernel by writ-
ing a shim layer, which we call klibc, that translates
the TokuDB external dependencies into kernel functions
for locking, memory allocation, and file I/O. Section 6
/home
/home/alice
/home/bob
/home/alice/betrfs.pdf
/home/alice/betrfs.tex
/home/bob/betrfs.c
/home/bob/betrfs.ko
Figure 2: Example of the sort order used in BetrFS.
describes klibc and summarizes our experiences and
lessons learned from the project.
4 The BetrFS File-System Schema
Bε-trees implement a key-value store, so BetrFS must
translate file-system operations into key-value opera-
tions. This section presents the BetrFS schema for per-
forming this translation and explains how this schema
takes advantage of the performance strengths of Bε-trees.
4.1 BetrFS Data and Metadata Indexes
BetrFS stores file system data and metadata using two
indexes in the underlying database: a metadata index and
a data index. Since both keys and values may be variable-
sized, BetrFS is able to pack many index entries into each
Bε-tree node.
The metadata index. The BetrFS prototype maintains
an index mapping full pathnames (relative to the mount
point) to file metadata (roughly equivalent to the contents
of struct stat):
path → (size, owner, timestamps, etc.)
The metadata index is designed to support efficient file
creations, deletions, lookups, and directory scans. The
index sorts paths first by the number of slashes, then lex-
icographically. Thus, items within the same directory are
stored consecutively as illustrated in Figure 2. With this
ordering, scanning a directory, either recursively or not,
can be implemented as a range query.
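The ordering described above implies a key comparison along the following lines; this is our reading of the schema, not BetrFS's actual comparator.

```c
#include <string.h>

/* Count '/' separators, i.e., the depth of a path. */
static int depth(const char *p)
{
    int d = 0;
    for (; *p; p++)
        if (*p == '/')
            d++;
    return d;
}

/* Metadata-index key order: shallower paths first, then lexicographic.
 * Entries in the same directory therefore sort together, so directory
 * scans (Figure 2) map onto range queries. */
static int path_key_cmp(const char *a, const char *b)
{
    int da = depth(a), db = depth(b);
    if (da != db)
        return da < db ? -1 : 1;
    return strcmp(a, b);
}
```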
The data index. Though keys and values may be
variable-sized, the BetrFS prototype breaks files into
4096-byte blocks for better integration with the page
cache. Thus, the data index maps (file, offset) tuples to
blocks:
(path, block-number) → data[4096]
Keys in the data index are also sorted lexicographically,
which guarantees that the contents of a file are logically
adjacent and therefore, as explained in Subsection 2.2,
almost always physically adjacent on disk. This enables
FS Operation            Bε-tree Operation
Mkdir                   Upsert
Rmdir                   Upsert
Create                  Upsert
Unlink                  Upsert + Delete data blocks
Truncate                Upsert + Delete data blocks
Setattr (e.g. chmod)    Upsert
Rename                  Copy files
Symlink                 Upsert
Lookup (i.e. lstat)     Point Query
Readlink                Point Query
Readdir                 Range Query
File write              Upsert
File read               Range Query
MMap readpage(s)        Point/Range Query
MMap writepage(s)       Upsert(s)

Table 2: BetrFS implementation strategies for basic file-
system operations. Almost all operations are imple-
mented using efficient upserts, point queries, or range
queries. Unlink, Truncate, and Rename currently scale
with file and/or directory sizes.
file contents to be read sequentially at near disk band-
width. BetrFS implements sparse files by simply omit-
ting the sparse blocks from the data index.
BetrFS uses variable-sized values to avoid zero-
padding the last block of each file. This optimization
avoids the CPU overhead of zeroing out unused regions
of a buffer, and then compressing the zeros away before
writing a node to disk. For small-file benchmarks, this
optimization yields a significant reduction in overhead;
for instance, it improves throughput on TokuBench (§7)
by 50–70%.
4.2 Implementing BetrFS Operations
Favoring blind writes. A latent assumption in much
file system code is that data must be written at disk-
sector granularity. As a result, a small write must
first bring the surrounding disk block into the cache,
modify the block, and then write it back. This pat-
tern is reflected in the Linux page cache helper func-
tion block_write_begin(). BetrFS avoids this read-
modify-write pattern, instead issuing blind writes—
writes without reads—whenever possible.
Reading and writing files in BetrFS. BetrFS imple-
ments file reads using range queries in the data index.
Bε-trees can load the results of a large range query from
disk at effectively disk bandwidth.
BetrFS supports efficient file writes of any size via up-
serts and inserts. Application writes smaller than one 4K
block become messages of the form:
UPSERT(WRITE, (path, n), offset, v, ℓ),
which means the application wrote ℓ bytes of data, v, at
the given offset into block n of the file specified by path.
Upsert messages completely encapsulate a block modifi-
cation, obviating the need for read-modify-write. Writes
of an entire block are implemented with an insert (also
called a put), which is a blind replacement of the block,
and behaves similarly.
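For concreteness, such a message could be encoded roughly as follows; this is a hypothetical layout, since the real encoding is internal to BetrFS and TokuDB.

```c
/* Hypothetical encoding of UPSERT(WRITE, (path, n), offset, v, ℓ); not the
 * actual BetrFS message format. */
#include <stdint.h>
#include <string.h>

struct write_upsert {
    uint64_t block;    /* block number n within the file      */
    uint32_t offset;   /* byte offset of the write in block n */
    uint32_t len;      /* number of bytes written (ℓ)         */
    char     data[];   /* the new bytes v                     */
};

/* Applied when the message meets the old 4096-byte block contents, either
 * during a buffer flush or lazily while answering a read. */
static void apply_write_upsert(char block_data[4096],
                               const struct write_upsert *m)
{
    memcpy(block_data + m->offset, m->data, m->len);
}
```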
As explained in Subsection 2.2, upserts and inserts are
messages inserted into the root node, which percolate
down the tree. By using upserts for file writes, a write-
optimized file system can aggregate many small random
writes into a single large write to disk. Thus, the data can
be committed to disk by performing a single seek and
one large write, yielding an order-of-magnitude boost in
performance for small random writes.
In the case of large writes spanning multiple blocks,
inserts follow a similar path of copying data from the
root to the leaves. The Bε-tree implementation has some
optimizations for large writes to skip intermediate nodes
on the path to a leaf, but they are not aggressive enough
to achieve full disk bandwidth for large sequential file
writes. We leave this issue for future work, as a solu-
tion must carefully address several subtle issues related
to pending messages and splitting leaf nodes.
File-system metadata operations in BetrFS. As shown
in Table 2, BetrFS also converts almost all metadata up-
dates, such as timestamp changes, file creation, and sym-
bolic linking, into upserts.
The only metadata updates that are not upserts in
BetrFS are unlink, truncate, and rename. We now ex-
plain the obstacles to implementing these operations as
upserts, which we leave for future work.
Unlink and truncate can both remove blocks from a
file. BetrFS performs this operation in the simplest pos-
sible way: it performs a range query on the blocks that
are to be deleted, and issues a TokuDB delete for each
such block in the data index. Although TokuDB delete
operations are implemented using upserts, issuing O(n)
upserts can make this an expensive task.
Second, keying by full path makes recursive directory
traversals efficient, but makes it non-trivial to implement
efficient renames. For example, our current implementa-
tion renames files and directories by re-inserting all the
key-value pairs under their new keys and then deleting
the old keys, effectively performing a deep copy of the
file or directory being renamed. One simple solution is
to add an inode-style layer of indirection, with a third
index. This approach is well-understood, and can sacri-
fice some read locality as the tree ages. We believe that
data-structure-level optimizations can improve the per-
formance of rename, which we leave for future work.
Although the schema described above can use upserts
to make most changes to the file system, many POSIX
file system functions specify preconditions that the OS
must check before changing the file system. For exam-
ple, when creating a file, POSIX requires the OS to check
that the file doesn’t already exist and that the user has
write permission on the containing directory. In our ex-
periments, the OS cache of file and directory information
was able to answer these queries, enabling file creation
etc., to run at the full speed of upserts.
Crash consistency. We use the TokuDB transaction
mechanism for crash consistency. TokuDB transactions
are equivalent to full data journaling, with all data and
metadata updates logged to a file in the underlying ext4
file system. Log entries are retired in-order, and no up-
dates are applied to the tree on disk ahead of the TokuDB
logging mechanism. Entries are appended to one of two
in-memory log buffers (16 MB by default). These buffers
are rotated and flushed to disk every second or when a
buffer overflows.
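This suggests a transaction-per-system-call pattern along the lines of the sketch below, written against the Berkeley-DB-style API that TokuDB exposes; the key/value encoding is illustrative, and the real setattr path issues an upsert rather than a whole-record put.

```c
/* Sketch: one transaction per metadata system call, using TokuDB's
 * Berkeley-DB-compatible API. Schema details here are illustrative. */
#include <db.h>
#include <string.h>

static int do_setattr(DB_ENV *env, DB *meta_db,
                      const char *path, const void *stat_buf, size_t stat_len)
{
    DB_TXN *txn;
    DBT key, val;
    int r = env->txn_begin(env, NULL, &txn, 0);
    if (r != 0)
        return r;

    memset(&key, 0, sizeof(key));
    memset(&val, 0, sizeof(val));
    key.data = (void *)path;
    key.size = (u_int32_t)(strlen(path) + 1);
    val.data = (void *)stat_buf;
    val.size = (u_int32_t)stat_len;

    /* Blind update of the metadata record (BetrFS would use an upsert). */
    r = meta_db->put(meta_db, txn, &key, &val, 0);
    if (r != 0) {
        txn->abort(txn);
        return r;
    }
    return txn->commit(txn, 0);   /* appends to the in-memory log buffer */
}
```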
Although exposing a transactional API may be pos-
sible, BetrFS currently uses transactions only as an in-
ternal consistency mechanism. BetrFS generally uses
a single transaction per system call, except for writing
data, which uses a transaction per data block. In our
current implementation, transactions on metadata exe-
cute while holding appropriate VFS-level mutex locks,
making transaction conflicts and aborts vanishingly rare.
Compression. Compression is important to perfor-
mance, especially for keys. Both indexes use full
paths as keys, which can be long and repetitive, but
TokuDB’s compression mitigates these overheads. Us-
ing quicklz [24], the sorted path names in our experi-
ments compress by a factor of 20, making the disk-space
overhead manageable.
The use of data compression also means that there
isn’t a one-to-one correspondence between reading a file-
system-level block and reading a block from disk. A leaf
node is typically 4 MB, and compression can pack more
than 64 file system blocks into a leaf. In our experience
with large data reads and writes, data compression can
yield a boost to file system throughput, up to 20% over
disabling compression.
5 Write-Optimization in System Design
In designing BetrFS, we set the goal of working within
the existing Linux VFS framework. An underlying chal-
lenge is that, at points, the supporting code assumes
that reads are as expensive as writes, and that a read is
necessary for an update in place. The use of write optimization violates
these assumptions, as sub-block writes can be faster than
a read. This section explains several strategies we found
for improving the BetrFS performance while retaining
Linux’s supporting infrastructure.
5.1 Eternal Sunshine of the Spotless Cache
BetrFS leverages the Linux page cache to implement ef-
ficient small reads, avoid disk reads in general, and facil-
itate memory-mapped files. By default, when an appli-
cation writes to a page that is currently in cache, Linux
marks it as dirty and writes it out later. This way, several
application-level writes to a page can be absorbed in the
cache, requiring only a single write to disk. In BetrFS,
however, small writes are so cheap that this optimization
does not always make sense.
In BetrFS, the write system call never dirties a clean
page in the cache. When an application writes to a clean
cached page, BetrFS issues an upsert to the Bε-tree and
applies the write to the cached copy of that page. Thus
the contents of the cache are still in sync with the on-disk
data, and the cached page remains clean.
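A hedged sketch of that write path follows; the helper names are invented, and the real implementation goes through the VFS address-space operations.

```c
/* Sketch of the write path described above: issue an upsert for the written
 * byte range, and patch any clean cached page in place without dirtying it.
 * The helpers below are assumptions, not real kernel or BetrFS functions. */
#include <stdint.h>
#include <string.h>

void  issue_write_upsert(const char *path, uint64_t block,
                         uint32_t off, const void *buf, uint32_t len);
char *find_cached_page(const char *path, uint64_t block);  /* NULL if absent */

void betrfs_small_write(const char *path, uint64_t block,
                        uint32_t off, const void *buf, uint32_t len)
{
    /* 1. Record the update in the write-optimized index (and its log). */
    issue_write_upsert(path, block, off, buf, len);

    /* 2. Keep any cached copy coherent, but leave the page clean so the
     *    page cache never writes it back on its own. */
    char *page = find_cached_page(path, block);
    if (page)
        memcpy(page + off, buf, len);
}
```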
Note that BetrFS’ approach is not always better than
absorbing writes in the cache and writing back the entire
block. For example, if an application performs hundreds
of small writes to the same block, then it would be more
efficient to mark the page dirty and wait until the appli-
cation is done to write the final contents back to disk. A
production version of BetrFS should include heuristics to
detect this case. We found that performance in our proto-
type was good enough without this optimization, though,
so we have not yet implemented it.
The only situation where a page in the cache is dirt-
ied is when the file is memory-mapped for writing. The
memory management hardware does not support fine-
grained tracking of writes to memory-mapped files; the
OS knows only that something within the page of
memory has been modified. Therefore, BetrFS' mmap
implementation uses the default read and write page
mechanisms, which operate at page granularity.
Our design keeps the page cache coherent with disk.
We leverage the page cache for faster warm-cache
reads, but avoid unnecessary full-page writebacks.
5.2 FUSE is Write De-Optimized
We implemented BetrFS as an in-kernel file system be-
cause the FUSE architecture contains several design de-
cisions that can ruin the potential performance benefits
of a write-optimized file system. FUSE has well-known
overheads from the additional context switches and data
marshalling it performs when communicating with user-
space file systems. However, FUSE is particularly dam-
aging to write-optimized file systems for completely dif-
ferent reasons.
FUSE can transform write-intensive workloads into read-
intensive workloads because it issues queries to the user-
space file system before (and, in fact, after) most file sys-
tem updates. For example, FUSE issues GETATTR calls
(analogous to calling stat()) for the entire path of a
file lookup, every time the file is looked up by an ap-
plication. For most in-kernel file systems, subsequent
lookups could be handled by the kernel’s directory cache,
but FUSE conservatively assumes that the underlying file
system can change asynchronously (which can be true,
e.g. in network file systems).
These searches can choke a write-optimized data
structure, where insertions are two orders of magnitude
faster than searches. The TokuFS authors explicitly cite
these searches as the cause of the disappointing perfor-
mance of their FUSE implementation [15].
The TableFS authors identified another source of
FUSE overhead: double caching of inode information in
the kernel [25]. This reduces the cache’s effective hit
rate. For slow file systems, the overhead of a few ex-
tra cache misses may not be significant. For a write-
optimized data structure working on a write-intensive
workload, the overhead of the cache misses can be sub-
stantial.
5.3 Ext4 as a Block Manager
Since TokuDB stores data in compressed nodes, which
can have variable size, TokuDB relies on an underlying
file system to act as a block and free space manager for
the disk. Conventional file systems do a good job of stor-
ing blocks of large files adjacently on disk, especially
when writes to the file are performed in large chunks.
Rather than reinvent the wheel, we stick with this de-
sign in our kernel port of TokuDB. BetrFS represents
tree nodes as blocks within one or more large files on
the underlying file system, which in our prototype is un-
modified ext4 with ordered data mode and direct IO.
We rely on ext4 to correctly issue barriers to the disk
write cache, although disabling the disk’s write cache did
not significantly impact performance of our workloads.
In other words, all BetrFS file system updates, data or
metadata, generally appear as data writes, and an fsync
to the underlying ext4 file system ensures durability of
a BetrFS log write. Although there is some duplicated
work between the layers, we expect that ordered journaling
mode minimizes this, as a typical BetrFS instance spans
11 files from ext4's perspective. That said, these re-
dundancies could be streamlined in future work.
6 Implementation
Rather than do an in-kernel implementation of a
write-optimized data structure from scratch, we ported
TokuDB into the Linux kernel as the most expedient way
to obtain a write-optimized data structure implementa-
tion. Such data structure implementations can be com-
Component Description Lines
VFS Layer Translate VFS hooks to TokuDB queries. 1,987
TokuDB Kernel version of TokuDB. 44,293
(960 lines changed)
klibc Compatibility wrapper. 4,155
Linux Modifications 58
Table 3: Lines of code in BetrFS, by component.
Class ABIs Description
Memory 4 Allocate buffer pages and heap objects.
Threads 24 Pthreads, condition variables, and mutexes.
Files 39 Access database backend files on underly-
ing, disconnected file system.
zlib 7 Wrapper for kernel zlib.
Misc 27 Print errors, qsort, get time, etc..
Total 101
Table 4: Classes of ABI functions exported by klibc.
plex, especially in tuning the I/O and asymptotic merging
behavior.
In this section, we explain how we ported a large por-
tion of the TokuDB code into the kernel, challenges we
faced in the process, lessons learned from the experience,
and future work for the implementation. Table 3 summa-
rizes the lines of code in BetrFS, including the code in-
terfacing the VFS layer to TokuDB, the klibc code, and
minor changes to the Linux kernel code, explained be-
low. The BetrFS prototype uses Linux version 3.11.10.
6.1 Porting Approach
We initially decided the porting was feasible because
TokuDB has very few library requirements and is written
in a C-like form of C++. In other words, the C++ features
used by TokuDB are primarily implemented at compile
time (e.g., name mangling and better type checking), and
did not require runtime support for features like excep-
tions. Our approach should apply to other WOIs, such as
an LSM tree, inasmuch as the implementation follows a
similar coding style.
As a result, we were able to largely treat the TokuDB
code we used as a binary blob, creating a kernel module
(.ko file) from the code. We exported interfaces used by
the BetrFS VFS layer to use C linkage, and similarly de-
clared interfaces that TokuDB imported from klibc to
be C linkage.
We generally minimized changes to the TokuDB code,
and selected imported code at object-file granularity. In
a few cases, we added compile-time macros to eliminate
code paths or functions that would not be used yet re-
quired cumbersome dependencies. Finally, when a par-
ticularly cumbersome user-level API, such as fork, is
used in only a few places, we rewrote the code to use
a more suitable kernel API. We call the resulting set of
dependencies imported by TokuDB klibc.
6.2 The klibc Framework
Table 4 summarizes the ABIs exported by klibc. In
many cases, kernel ABIs were exported directly, such as
memcpy, or were straightforward wrappers for features
such as synchronization and memory allocation. In a few
cases, the changes were more complex.
The use of errno in the TokuDB code presented a
particular challenge. Linux passes error codes as neg-
ative return values, whereas libc simply returns nega-
tive one and places the error code in a per-thread variable
errno. Checks for a negative value and reads of errno
in TokuDB were so ubiquitous that changing the error-
handling behavior was impractical. We ultimately added
an errno field to the Linux task struct; a production
implementation would instead rework the error-passing
code.
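A minimal sketch of the resulting shim, assuming the errno field added to struct task_struct described above (the macro and helper names are ours):

```c
/* klibc-style errno shim: libc reports failures by returning -1 and setting
 * a per-thread errno, while kernel code returns negative error codes.
 * Sketch only; assumes an 'errno' field added to struct task_struct. */
#include <linux/sched.h>

#define klibc_errno (current->errno)

static inline long klibc_ret(long kernel_ret)
{
    if (kernel_ret < 0) {
        klibc_errno = (int)(-kernel_ret);  /* stash the positive error code */
        return -1;                         /* libc convention               */
    }
    return kernel_ret;
}
```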
Although wrapping pthread abstractions in kernel ab-
stractions was fairly straightforward, static initialization
and direct access to pthread structures created problems.
The primary issue is that converting pthread abstractions
to kernel abstractions replaced members in the pthread
structure definitions. Static initialization would not prop-
erly initialize the modified structure. Once the size of
pthread structures changed, we had to eliminate any code
that imported system pthread headers, lest embedded in-
stances of these structures calculate the wrong size.
In reusing ext4 as a block store, we faced some chal-
lenges in creating module-level file handles and paths.
File handles were more straightforward: we were able
to create a module-level handle table and use the pread
(cursor-less) API to ext4 for reads and writes. We
did have to modify Linux to export several VFS helper
functions that accept a struct file directly, rather
than walking the process-level file descriptor table. We
also modified ext4 to accept input for reads with the
O_DIRECT flag that were not from a user-level address.
When BetrFS allocates, opens, or deletes a block store
on the underlying ext4 file system, the module essen-
tially chroots into an ext4 file system disconnected
from the main tree. Because this is kernel code, we
also wish to avoid permission checks based on the cur-
rent process’s credentials. Thus, path operations include
a “context switch” operation, where the current task’s file
system root and credentials are saved and restored.
6.3 Changes to TokuDB
With a few exceptions, we were able to use TokuDB in
the kernel without major modifications. This subsection
outlines the issues that required refactoring the code.
The first issue we encountered was that TokuDB
makes liberal use of stack allocation throughout. One
function allocated a 12KB buffer on the stack! In con-
trast, stack sizes in the Linux kernel are fixed at compile
time, and default to 8KB. In most cases, we were able
to use compile-time warnings to identify large stack al-
locations and convert them to heap allocations with cor-
responding free calls. In cases where these structures were
performance-critical, such as a database cursor, we mod-
ified the TokuDB code to use faster allocation methods,
such as a kernel cache or per-CPU variable. Similarly,
we rewrote several recursive functions to use a loop.
Nonetheless, we found that deep stacks of more modest-
sized frames were still possible, and increased the stack
size to 16 KB. We plan to rein in the maximum stack
size in future work.
Finally, we found a small mismatch between the
behavior of futexes and kernel wait queues that re-
quired code changes. Essentially, recent implemen-
tations of pthread condition variables will not wake
a sleeping thread due to an irrelevant interrupt,
making it safe (though perhaps inadvisable) in user
space not to double-check invariants after returning
from pthread_cond_wait. The Linux-internal equiv-
alents, such as wait_event, can spuriously wake up a
thread in a way that is difficult to distinguish without
re-checking the invariant. Thus, we had to place all
pthread_cond_wait calls in a loop.
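The resulting caller-side pattern is the standard condition-variable loop, sketched here with illustrative names:

```c
/* Every wait is wrapped in a loop that re-checks the invariant, because the
 * kernel-backed emulation of pthread_cond_wait may return spuriously. */
#include <pthread.h>

extern pthread_mutex_t queue_lock;
extern pthread_cond_t  queue_nonempty;
extern int             queue_size;

void wait_for_work(void)
{
    pthread_mutex_lock(&queue_lock);
    while (queue_size == 0)                 /* re-check after every wakeup */
        pthread_cond_wait(&queue_nonempty, &queue_lock);
    pthread_mutex_unlock(&queue_lock);
}
```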
6.4 Future Work and Limitations
The BetrFS prototype is an ongoing effort. The effort
has reached sufficient maturity to demonstrate the power
of write optimization in a kernel file system. However,
there are several points for improvement in future work.
The most useful feature currently missing from the
TokuDB codebase is range upserts; upserts can only be
applied to a single key, or broadcast to all keys. Cur-
rently, file deletion must be implemented by creating a
remove upsert for each data block in a file; the ability to
create a single upsert applied to a limited range would
be useful, and we leave this for future work. The pri-
mary difficulty in supporting such an abstraction is tun-
ing how aggressively the upsert should be flushed down
to the leaves versus applied to point queries on demand;
we leave this issue for future work as well.
One subtle trade-off in organizing on-disk placement
is between rename and search performance. BetrFS keys
files by their path, which currently results in rename
copying the file from one disk location to another. This
can clearly be mitigated by adding a layer of indirection
(i.e., an inode number); however, this is at odds with the
goal of preserving data locality within a directory hierar-
chy. We plan to investigate techniques for more efficient
directory manipulation that preserve locality. Similarly,
our current prototype does not support hard links.
Our current prototype also includes some double
caching of disk data. Nearly all of our experiments
measure cold-cache behavior, so this does not affect
the fidelity of our results. Profligate memory usage is
nonetheless problematic. In the long run, we intend to
better integrate these layers, as well as eliminate emu-
lated file handles and paths.
7 Evaluation
We organize our evaluation around the following questions:
- Are microwrites on BetrFS more efficient than on other general-purpose file systems?
- Are large reads and writes on BetrFS at least competitive with other general-purpose file systems?
- How do other file system operations perform on BetrFS?
- What are the space (memory and disk) overheads of BetrFS?
- Do applications realize better overall performance on BetrFS?
Unless otherwise noted, benchmarks are cold-cache
tests. All file systems benefit equally from hits in the
page and directory caches; we are interested in measur-
ing the efficiency of cache misses.
All experimental results were collected on a Dell Op-
tiplex 790 with a 4-core 3.40 GHz Intel Core i7 CPU, 4
GB RAM, and a 250 GB, 7200 RPM ATA disk. Each
file system used a 4096-byte block size. The system ran
Ubuntu 13.10, 64-bit, with Linux kernel version 3.11.10.
Each experiment compared with several general purpose
file systems, including btrfs, ext4, XFS, and zfs. Er-
ror bars and ± ranges denote 95% confidence intervals.
7.1 Microwrites
We evaluated microwrite performance using both meta-
data and data intensive microbenchmarks. To exercise
file creation, we used the TokuBench benchmark [15] to
create 500,000 200-byte files in a balanced directory tree
with a fanout of 128. The results are shown in Figure 3.
TokuBench also measures the scalability of the file sys-
tem as threads are added; we measured up to 4 threads
since our machine has 4 cores.
BetrFS exhibited substantially higher throughput than
the other file systems. The closest competitor was zfs at
1 thread; as more threads were added, the gap widened
considerably. Compared to ext4, XFS, and btrfs,
BetrFS throughput was an order of magnitude higher.
This performance distinction is attributable to both
fewer total writes and fewer seeks per byte written—i.e.,
better aggregation of small writes. Based on profiling
from blktrace, one major distinction is total bytes writ-
ten: BetrFS writes 4–10× fewer total MB to disk, with
[Figure 3 (chart omitted): TokuBench small file creation, files created per second vs. number of threads (1, 2, 4) for BetrFS, btrfs, ext4, xfs, and zfs.]
Figure 3: Total time to create 500,000 200-byte files, us-
ing 1, 2, and 4 threads. We measure the number of files
created per second. Higher is better.
File System Time (s)
BetrFS 0.17 ± 0.01
ext4 11.60 ± 0.39
XFS 11.71 ± 0.28
btrfs 11.60 ± 0.38
zfs 14.75 ± 1.45
Table 5: Time in seconds to execute 1000 4-byte mi-
crowrites within a 1GiB file. Lower is better.
an order of magnitude fewer total write requests. Among
the other file systems, ext4, XFS, and zfs wrote roughly
the same amount of data, but realized widely varying un-
derlying write throughput. The only file system with a
comparable write throughput was zfs, but it wrote twice
as much data using 12.7× as many disk requests.
To measure microwrites to files, we wrote a custom
benchmark that performs 1,000 random 4-byte writes
within a 1GiB file, followed by an fsync(). Table 5
lists the results. BetrFS was two orders of magnitude
faster than the other file systems.
These results demonstrate that BetrFS improves mi-
crowrite performance by one to two orders of magnitude
compared to current general-purpose file systems.
7.2 Large Reads and Writes
We measured the throughput of sequentially reading and
writing a 1GiB file, 10 blocks at a time. We created
the file using random data to avoid unfairly advantaging
compression in BetrFS. In this experiment, BetrFS bene-
fits from compressing keys, but not data. We note that
with compression and moderately compressible data,
BetrFS can easily exceed disk bandwidth. The results
are illustrated in Figure 4.
In general, most general-purpose file systems can read
[Figure 4 (chart omitted): large sequential I/O throughput in MiB/s for read and write, for BetrFS, btrfs, ext4, xfs, and zfs.]
Figure 4: Large file I/O. We sequentially read and write
1GiB files. Higher is better.
and write at disk bandwidth. In the case of a large se-
quential reads, BetrFS can read data at roughly 85 MiB/s.
This read rate is commensurate with overall disk uti-
lization, which we believe is a result of less aggressive
read-ahead than the other file systems. We believe this
can be addressed by re-tuning the TokuDB block cache
prefetching behavior.
In the case of large writes, the current BetrFS proto-
type achieved just below half of the disk’s throughput.
The reason for this is that each block write must percolate
down the interior tree buffers; a more efficient heuristic
would detect a large streaming write and write directly
to a leaf. As an experiment, we manually forced writes
to the leaf in an empty tree, and found write throughput
comparable to the other file systems. That said, applying
this optimization is somewhat tricky, as there are a num-
ber of edge cases where leaves must be read and rewrit-
ten or messages must be flushed. We leave this issue for
future work.
7.3 Directory Operations
In this section, we measure the impact of the BetrFS de-
sign on large directory operations. Table 6 reports the
time taken to run find, grep -r, mv, and rm -r on the
Linux 3.11.10 source tree, starting from a cold cache.
The grep test recursively searches the file contents for
the string “cpu_to_be64”, and the find test searches for
files named “wait.c”. The rename test renames the entire
kernel source tree, and the delete test does a recursive
removal of the source.
Both the find and grep benchmarks demonstrate the
value of sorting files and their metadata lexicographically
by full path, so that related files are stored near each other
on disk. BetrFS can search directory metadata and file
data one or two orders of magnitude faster than other file
systems, with the exception of grep on btrfs, which is
FS       find           grep            dir rename     delete
BetrFS   0.36 ± 0.06    3.95 ± 0.28     21.17 ± 1.01   46.14 ± 1.12
btrfs    14.91 ± 1.18   3.87 ± 0.94     0.08 ± 0.05    7.82 ± 0.59
ext4     2.47 ± 0.07    46.73 ± 3.86    0.10 ± 0.02    3.01 ± 0.30
XFS      19.07 ± 3.38   66.20 ± 15.99   19.78 ± 5.29   19.78 ± 5.29
zfs      11.60 ± 0.81   41.74 ± 0.64    14.73 ± 1.64   14.73 ± 1.64
Table 6: Directory operation benchmarks, measured in seconds. Lower is better.
FS chmod mkdir open read stat unlink write
BetrFS 4913 ± 0.27 67072 ± 25.68 1697 ± 0.12 561 ± 0.01 1076 ± 0.01 47873 ± 7.7 32142 ± 4.35
btrfs 4574 ± 0.27 24805 ± 13.92 1812 ± 0.12 561 ± 0.01 1258 ± 0.01 26131 ± 0.73 3891 ± 0.08
ext4 4970 ± 0.14 41478 ± 18.99 1886 ± 0.13 556 ± 0.01 1167 ± 0.05 16209 ± 0.2 3359 ± 0.04
XFS 5342 ± 0.21 73782 ± 19.27 1757 ± 0.12 1384 ± 0.07 1134 ± 0.02 19124 ± 0.32 9192 ± 0.28
zfs 36449 ± 118.37 171080 ± 307.73 2681 ± 0.08 6467 ± 0.06 1913 ± 0.04 78946 ± 7.37 18382 ± 0.42
Table 7: Average time in cycles to execute a range of common file system calls. Lower is better.
comparable.
Both the rename and delete tests show the worst-case
behavior of BetrFS. Because BetrFS does not include a
layer of indirection from pathname to data, renaming re-
quires copying all data and metadata to new points in
the tree. We also measured large-file renames, and saw
similarly large overheads—a function of the number of
blocks in the file. Although there are known solutions
to this problem, such as by adding a layer of indirec-
tion, we plan to investigate techniques that can preserve
the appealing lexicographic locality without sacrificing
rename and delete performance.
7.4 System Call Nanobenchmarks
Finally, Table 7 shows timings for a nanobenchmark
that measures various system call times. Because this
nanobenchmark is warm-cache, it primarily exercises the
VFS layer. BetrFS is close to being the fastest file sys-
tem on open, read, and stat. On chmod, mkdir, and
unlink, BetrFS is in the middle of the pack.
Our current implementation of the write system call
appears to be slow in this benchmark because, as men-
tioned in Section 5.1, writes in BetrFS issue an upsert to
the database, even if the page being written is in cache.
This can be advantageous when a page is not written of-
ten, but that is not the case in this benchmark.
7.5 Space Overheads
The Fractal Tree index implementation in BetrFS in-
cludes a cachetable, which caches tree nodes. Cachetable
memory is bounded. BetrFS triggers background flush-
ing when memory exceeds a low watermark and forces
writeback at a high watermark. The high watermark is
currently set to one eighth of total system memory. This
Total BetrFS Disk Usage (GiB)
Input Data   After Writes    After Deletes   After Flushes
4            4.14 ± 0.07     4.12 ± 0.00     4.03 ± 0.12
16           16.24 ± 0.06    16.20 ± 0.00    10.14 ± 0.21
32           32.33 ± 0.02    32.34 ± 0.00    16.22 ± 0.00
64           64.57 ± 0.06    64.59 ± 0.00    34.36 ± 0.18

Table 8: BetrFS disk usage, measured in GiB, after writ-
ing large incompressible files, deleting half of those files,
and flushing Bε-tree nodes.
is configurable, but we found that additional cachetable
memory had little performance impact in our workloads.
No single rule governs BetrFS disk usage, as stale data
may remain in non-leaf nodes after delete, rename, and
overwrite operations. Background cleaner threads at-
tempt to flush pending data from 5 internal nodes per
second. This creates fluctuation in BetrFS disk usage,
but overheads swiftly decline at rest.
To evaluate the BetrFS disk footprint, we wrote sev-
eral large incompressible files, deleted half of those files,
and then initiated a Bε-tree flush. After each operation,
we calculated the BetrFS disk usage using df on the un-
derlying ext4 partition.
Writing new data to BetrFS introduced very little over-
head, as seen in Table 8. For deletes, however, BetrFS
issues an upsert for every file block, which had little im-
pact on the BetrFS footprint because stale data is lazily
reclaimed. After flushing, there was less than 3GiB of
disk overhead, regardless of the amount of live data.
7.6 Application Performance
Figure 5 presents performance measurements for a vari-
ety of metadata-intensive applications. We measured the
time to rsync the Linux 3.11.10 source code to a new di-
[Figure 5 (charts omitted): application benchmarks comparing BetrFS, btrfs, ext4, xfs, and zfs.]
(a) rsync of Linux 3.11.10 source. The data source and destination are within the same partition and file system. Throughput in MB/s, higher is better.
(b) IMAP benchmark, 50% message reads, 50% marking and moving messages among inboxes. Execution time in seconds, lower is better.
(c) Git operations. The BetrFS source repository was git-cloned locally, and a git-diff was taken between two milestone commits. Lower is better.
(d) Unix tar operations. The Linux version 3.11.10 source code was both tared and un-tared. Time is in seconds. Lower is better.
Figure 5: Application benchmarks
rectory on the same file system, using the --in-place
option to avoid temporary file creation (Figure 5a). We
performed a benchmark using version 2.2.13 of the
Dovecot mail server to simulate IMAP client behavior
under a mix of read requests, requests to mark messages
as read, and requests to move a message to one of 10
other folders. The balance of requests was 50% reads,
50% flags or moves. We exercised the git version control
system using a git-clone of the local BetrFS source
tree and a git-diff between two milestone commits
(Figure 5c). Finally, we measured the time to tar and
un-tar the Linux 3.11.10 source (Figure 5d).
BetrFS yielded substantially higher performance on
several applications, primarily applications with mi-
crowrites or large streaming reads or writes. In the case
of the IMAP benchmark, marking or moving messages
is a small-file rename in a large directory—a case BetrFS
handled particularly well, cutting execution time in half
compared to most other file systems. Note that the IMAP
test is a sync-heavy workload, issuing over 26K fsync()
calls over 334 seconds, each forcing a full BetrFS log
flush. rsync on BetrFS realized significantly higher
throughput because writing a large number of modestly
sized files in lexicographic order is, on BetrFS, aggre-
gated into large, streaming disk writes. Similarly, tar
benefited from both improved reads of many files in
lexicographic order, as well as efficient aggregation of
writes. tar on BetrFS was only marginally better than
btrfs, but the execution time was at least halved com-
pared to ext4 and XFS.
The only benchmark significantly worse on BetrFS
was git-clone, which does an lstat on every new file
before creating it—despite cloning into an empty direc-
tory. Here, a slower, small read obstructs a faster write.
For comparison, the rsync --in-place test case illus-
trates that, if an application eschews querying for the ex-
istence of files before creating them, BetrFS can deliver
substantial performance benefits.
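The difference between the two access patterns can be seen with ordinary POSIX calls; the sketch below is not code from git or rsync, but it contrasts a check-then-create pattern, which interposes a small read before every write, with a blind create that lets the file system simply buffer the insert.

    /* Sketch: check-then-create versus blind create. */
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* git-clone-like pattern: lstat() each new path before creating it.
     * The small read can stall behind tree queries even though the
     * subsequent write is cheap to buffer. */
    static int create_check_then_write(const char *path, const void *buf, size_t len)
    {
        struct stat st;
        if (lstat(path, &st) == 0)
            return -1;                /* already exists */
        int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, buf, len);
        close(fd);
        return n < 0 ? -1 : 0;
    }

    /* rsync --in-place-like pattern: no application-level existence
     * query; creates and writes can be aggregated into streaming I/O. */
    static int create_blind(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, buf, len);
        close(fd);
        return n < 0 ? -1 : 0;
    }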
These experiments demonstrate that several real-
world, off-the-shelf applications can benefit from exe-
cuting on a write-optimized file system without appli-
cation modifications. With modest changes to favor
blind writes, applications can perform even better. For
most applications that do not benefit from write optimization,
performance is not harmed or could be recovered with tuning.
8 Related Work
Previous write-optimized file systems. TokuFS [15] is
an in-application library file system, also built using
Bε-trees. TokuFS showed that a write-optimized file system
can support efficient write-intensive and scan-intensive
workloads. TokuFS had a FUSE-based version, but the
authors explicitly omitted disappointing measurements
using FUSE.
KVFS [31] is based on a transactional variation of
an LSM-tree, called a VT-tree. Impressively, the per-
formance of their transactional, FUSE-based file sys-
tem was comparable to the performance of the in-kernel
ext4 file system, which does not support transactions.
One performance highlight was on random-writes, where
they outperformed ext4 by a factor of 2. They also used
stitching to perform well on sequential I/O in the pres-
ence of LSM-tree compaction.
TableFS [25] uses LevelDB to store file-system meta-
data. They showed substantial performance improve-
ments on metadata-intensive workloads, sometimes up to
an order of magnitude. They used ext4 as an object store
for large files, so sequential I/O performance was com-
parable to ext4. They also analyzed the FUSE overhead
relative to a library implementation of their file system
and found that FUSE could cause a 1000× increase in
disk-read traffic (see Figure 9 in their paper).
If these designs were ported to the kernel, we expect
that they would see some, but not all, of the performance
benefits of BetrFS. Because the asymptotic behavior is
better for Bε-trees than LSMs in some cases, we expect
the performance of an LSM-based file system will not be
completely comparable.
Other WOIs. COLAs [4] are an LSM tree variant that
uses fractional cascading [13] to match the performance
of Bε-trees for both insertions and queries, but we are
not aware of any full-featured, production-quality COLA
implementation. xDict [8] is a cache-oblivious WOI with
asymptotic behavior similar to a Bε-tree.
Key-Value Stores. WOIs are widely used in key-value
stores, including BigTable [12], Cassandra [20], HBase
[3], LevelDB [17], TokuDB [33] and TokuMX [34].
BigTable, Cassandra, and LevelDB use LSM-tree vari-
ants. TokuDB and TokuMX use Fractal Tree indexes.
LOCS [35] optimizes LSM-trees for a key-value store
on a multi-channel SSD.
Instead of using WOIs, FAWN [1] writes to a log and
maintains an in-memory index for queries. SILT [22]
further reduces the design’s memory footprint during the
merging phase.
Alternatives to update-in-place. The Write Anywhere
File Layout (WAFL) uses files to store its metadata, giv-
ing it incredible flexibility in its block allocation and lay-
out policies [19]. WAFL does not address the microwrite
problem, however, as its main goal is to provide efficient
copy-on-write snapshots.
Log-structured File Systems and their derivatives [1,
21, 22, 26] are write-optimized in the sense that they
log data, and are thus very fast at ingesting file system
changes. However, they still rely on read-modify-write
for file updates and suffer from fragmentation.
Logical logging is a technique used by some databases
in which operations, rather than the before and after im-
ages of individual database pages, are encoded and stored
in the log [18]. Like a logical log entry, an upsert mes-
sage encodes a mutation to a value in the key-value store.
However, an upsert is a first-class storage object. Upsert
messages reside in B
ε
-tree buffers and are evaluated on
the fly to satisfy queries, or to be merged into leaf nodes.
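As a concrete illustration of the analogy, the sketch below gives the general shape of an upsert message; the field and function names are illustrative assumptions, not structures from the BetrFS sources or the Tokutek Bε-tree implementation it builds on.

    /* Sketch only: an upsert as a first-class message. It records an
     * operation and its arguments, sits in an internal node's buffer, and
     * is applied either on the fly to answer a query or when it is merged
     * into a leaf. */
    #include <stdint.h>
    #include <string.h>

    enum upsert_op {
        UPSERT_WRITE_RANGE,   /* overwrite a byte range within the value */
        UPSERT_DELETE         /* logically remove the key */
    };

    struct upsert_msg {
        enum upsert_op op;
        uint32_t offset;      /* where in the value to apply the mutation */
        uint32_t len;         /* how many bytes */
        uint8_t  payload[];   /* the new bytes, if any */
    };

    static void apply_upsert(uint8_t *value, size_t value_len,
                             const struct upsert_msg *m)
    {
        switch (m->op) {
        case UPSERT_WRITE_RANGE:
            if ((size_t)m->offset + m->len <= value_len)
                memcpy(value + m->offset, m->payload, m->len);
            break;
        case UPSERT_DELETE:
            /* mark the value dead; space is reclaimed when leaves merge */
            break;
        }
    }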
9 Conclusion
The BetrFS prototype demonstrates that write-optimized
indexes are a powerful tool for file-system developers.
In some cases, BetrFS out-performs traditional designs
by orders of magnitude, advancing the state of the art
over previous results. Nonetheless, there are some cases
where additional work is needed, such as further data-
structure optimizations for large streaming I/O and effi-
cient renames of directories. Our results suggest that fur-
ther integration and optimization work is likely to yield
even better performance results.
Acknowledgments
We thank the engineers at Tokutek for developing and
open-sourcing a wonderful Bε-tree implementation from
which we built BetrFS. We thank the anonymous review-
ers and our shepherd, Geoff Kuenning, for their insight-
ful comments on earlier drafts of the work. Vrushali
Kulkarni, Oleksii Starov, Sourabh Yerfule, and Ahmad
Zaraei contributed to the BetrFS prototype. This research
was supported in part by NSF grants CNS-1409238,
CNS-1408782, CNS-1408695, CNS-1405641, CNS-
1149229, CNS-1161541, CNS-1228839, CNS-1408782,
IIS-1247750, CCF-1314547 and the Office of the Vice
President for Research at Stony Brook University.
References
[1] ANDERSEN, D. G., FRANKLIN, J., KAMINSKY, M., PHANISHAYEE, A., TAN, L., AND VASUDEVAN, V. FAWN: A fast array of wimpy nodes. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (2009), pp. 1–14.
[2] APACHE. Accumulo. http://accumulo.apache.org.
[3] APACHE. HBase. http://hbase.apache.org.
[4] BENDER, M. A., FARACH-COLTON, M., FINEMAN, J. T., FOGEL, Y. R., KUSZMAUL, B. C., AND NELSON, J. Cache-oblivious streaming B-trees. In Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA) (2007), pp. 81–92.
[5] BENT, J., GIBSON, G. A., GRIDER, G., MCCLELLAND, B., NOWOCZYNSKI, P., NUNEZ, J., POLTE, M., AND WINGATE, M. PLFS: A checkpoint filesystem for parallel applications. In Proceedings of the ACM/IEEE Conference on High Performance Computing (SC) (2009).
[6] BLOOM, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422–426.
[7] BONWICK, J. ZFS: the last word in file systems. https://blogs.oracle.com/video/entry/zfs_the_last_word_in, Sept. 14, 2004.
[8] BRODAL, G. S., DEMAINE, E. D., FINEMAN, J. T., IACONO, J., LANGERMAN, S., AND MUNRO, J. I. Cache-oblivious dynamic dictionaries with update/query tradeoffs. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2010), pp. 1448–1456.
[9] BRODAL, G. S., AND FAGERBERG, R. Lower bounds for external memory dictionaries. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2003), pp. 546–554.
[10] BUCHSBAUM, A. L., GOLDWASSER, M., VENKATASUBRAMANIAN, S., AND WESTBROOK, J. R. On external memory graph traversal. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2000), pp. 859–860.
[11] CARD, R., TS'O, T., AND TWEEDIE, S. Design and implementation of the Second Extended Filesystem. In Proceedings of the First Dutch International Symposium on Linux (1994), pp. 1–6.
[12] CHANG, F., DEAN, J., GHEMAWAT, S., HSIEH, W. C., WALLACH, D. A., BURROWS, M., CHANDRA, T., FIKES, A., AND GRUBER, R. E. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS) 26, 2 (2008), 4.
[13] CHAZELLE, B., AND GUIBAS, L. J. Fractional cascading: I. A data structuring technique. Algorithmica 1, 1-4 (1986), 133–162.
[14] DOUTHITT, D. Instant 10-20% boost in disk performance: the "noatime" option. http://administratosphere.wordpress.com/2011/07/29/instant-10-20-boost-in-disk-performance-the-noatime-option/, July 2011.
[15] ESMET, J., BENDER, M. A., FARACH-COLTON, M., AND KUSZMAUL, B. C. The TokuFS streaming file system. In Proceedings of the 4th USENIX Workshop on Hot Topics in Storage (HotStorage) (June 2012).
[16] File system in userspace. http://fuse.sourceforge.net/.
[17] GOOGLE, INC. LevelDB: A fast and lightweight key/value database library by Google. http://code.google.com/p/leveldb/.
[18] GRAY, J., AND REUTER, A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[19] HITZ, D., LAU, J., AND MALCOLM, M. File system design for an NFS file server appliance. Tech. rep., NetApp, 1994.
[20] LAKSHMAN, A., AND MALIK, P. Cassandra: A decentralized structured storage system. ACM SIGOPS Operating Systems Review 44, 2 (2010), 35–40.
[21] LEE, C., SIM, D., HWANG, J., AND CHO, S. F2FS: A new file system for flash storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST) (2015).
[22] LIM, H., FAN, B., ANDERSEN, D. G., AND KAMINSKY, M. SILT: A memory-efficient, high-performance key-value store. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), pp. 1–13.
[23] O'NEIL, P., CHENG, E., GAWLICK, D., AND O'NEIL, E. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351–385.
[24] QUICKLZ. Fast compression library for C, C#, and Java. http://www.quicklz.com/.
[25] REN, K., AND GIBSON, G. A. TABLEFS: Enhancing metadata efficiency in the local file system. In USENIX Annual Technical Conference (2013), pp. 145–156.
[26] ROSENBLUM, M., AND OUSTERHOUT, J. K. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10, 1 (Feb. 1992), 26–52.
[27] SEARS, R., CALLAGHAN, M., AND BREWER, E. A. Rose: Compressed, log-structured replication. Proceedings of the VLDB Endowment 1, 1 (2008), 526–537.
[28] SEARS, R., AND RAMAKRISHNAN, R. bLSM: A general purpose log structured merge tree. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (2012), pp. 217–228.
[29] SELTZER, M., BOSTIC, K., MCKUSICK, M. K., AND STAELIN, C. An implementation of a log-structured file system for UNIX. In Proceedings of the USENIX Winter 1993 Conference (San Diego, CA, Jan. 1993), p. 3.
[30] SELTZER, M., SMITH, K. A., BALAKRISHNAN, H., CHANG, J., MCMAINS, S., AND PADMANABHAN, V. File system logging versus clustering: A performance comparison. In Proceedings of the USENIX 1995 Technical Conference (New Orleans, LA, Jan. 1995), p. 21.
[31] SHETTY, P., SPILLANE, R. P., MALPANI, R., ANDREWS, B., SEYSTER, J., AND ZADOK, E. Building workload-independent storage with VT-trees. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST) (2013), pp. 17–30.
[32] SWEENEY, A., DOUCETTE, D., HU, W., ANDERSON, C., NISHIMOTO, M., AND PECK, G. Scalability in the XFS file system. In Proceedings of the 1996 USENIX Technical Conference (San Diego, CA, Jan. 1996), pp. 1–14.
[33] TOKUTEK, INC. TokuDB: MySQL Performance, MariaDB Performance. http://www.tokutek.com/products/tokudb-for-mysql/.
[34] TOKUTEK, INC. TokuMX—MongoDB Performance Engine. http://www.tokutek.com/products/tokumx-for-mongodb/.
[35] WANG, P., SUN, G., JIANG, S., OUYANG, J., LIN, S., ZHANG, C., AND CONG, J. An efficient design and implementation of LSM-tree based key-value store on open-channel SSD. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys '14) (2014), pp. 16:1–16:14.