Building a tiny FUSE filesystem

June 10, 2026

Lately I have been working around sandboxing, storage, and networking, and a lot of that work keeps coming back to files. That makes sense, since Unix has organized itself around "everything is a file" for over fifty years. Your terminal and random number generator are device files you can open and read (/dev/tty, /dev/urandom). Even network sockets, which are created with their own system call rather than opened by path, are read and written through the same interface afterwards.

For this post, I built a small filesystem with a real backing store, enough metadata to behave like a filesystem, and a few deliberate omissions so the code is still readable. magicfs mounts at /magic, but it keeps its own local backing store next to it: names and inode numbers live in metadata.json, while file contents live as plain local files under blobs/. Calling that directory a blob store is a little grandiose, because the blobs are just files with allocated names like blob-000000000001. Still, keeping metadata separate from file contents lets the example cover name lookup, inode stability, write ordering, kernel caching, and what fsync() is asking the filesystem to do.

The full sample code is at github.com/shayonj/magicfs, and if you have Docker, you can run the filesystem with FUSE enabled.

    docker run -it --rm --device /dev/fuse --cap-add SYS_ADMIN shayonj/magicfs
    $ cat /magic/hello.txt
    $ echo "remember the milk" > /magic/notes.txt
    $ find /tmp/magicfs-store -type f

Filesystems as a request loop

When you run cat /magic/hello.txt, cat does not know that JSON metadata and blob files are involved. All it does is call open() and read(). The kernel then resolves the path through the VFS, and the operation eventually lands on the filesystem mounted at /magic.
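To make the application's side of that exchange concrete, here is a minimal sketch of what a cat-like program does, using only the Rust standard library. The function name cat_like is made up for this illustration, and the real cat has more options and error handling; the point is only that the caller sees open() and read() and nothing about the filesystem behind the path.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Read a whole file the way a minimal `cat` might: one open(), then
/// read() calls until the kernel reports end of file.
/// `cat_like` is a hypothetical name used only for this sketch.
pub fn cat_like(path: &Path) -> io::Result<Vec<u8>> {
    // open(2): the kernel resolves the path and picks the filesystem.
    let mut file = File::open(path)?;
    let mut out = Vec::new();
    let mut buf = [0u8; 4096];
    loop {
        match file.read(&mut buf) {
            // read(2) returning 0 bytes means end of file.
            Ok(0) => break,
            Ok(n) => out.extend_from_slice(&buf[..n]),
            // A signal interrupted the call; just retry.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```

The same function works whether the path lives on ext4, tmpfs, or a FUSE mount, which is the property the rest of the post relies on.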
With FUSE, the code that answers those filesystem requests runs in userspace. The kernel driver sends request messages over /dev/fuse, the userspace process replies, and the application that made the system call keeps waiting until the kernel has an answer. The kernel FUSE documentation covers the protocol, and the fuser crate exposes the same operations as Rust trait methods.

The path for a read looks roughly like this:

[Flowchart: the path of a read request]

Here is the log from writing notes.txt, trimmed to the requests involved in opening, truncating, writing, flushing, and releasing the file:

    [magicfs] READDIR ino=1

A filesystem usually has to answer a question about a name before it can answer anything about bytes: does this name exist in this directory, and if it does, which file does it refer to? Linux mostly stops caring about filenames once path lookup is done, because internally it refers to files by inode number. On a disk filesystem, an inode is a record with metadata and pointers to data blocks, while a directory entry maps a name to an inode. That is why a rename can change a path without moving file data, and also why hard links can make the same inode appear under more than one name.

magicfs keeps the directory entry and inode metadata in metadata.json:

    "notes.txt": {

The ordering problem shows up before the read and write handlers do anything with file contents. If a new blob reaches the backing store but metadata.json still points at the old blob, readers keep seeing the old file. If metadata.json points at a blob that never made it to disk, readers see a broken file. magicfs handles the simple case by writing the blob first, then replacing metadata. The metadata replacement follows the usual local-filesystem pattern: the code writes a temporary file, syncs it, renames it over metadata.json, and then syncs the containing directory.
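The write-temp, sync, rename, sync-directory sequence can be sketched with the standard library alone. This is a sketch under assumptions, not magicfs's actual code: the function name replace_metadata and the temporary file name metadata.json.tmp are invented here, and opening a directory with File::open to sync it assumes a Unix-like system such as Linux.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replace `store_dir/metadata.json` with `contents`.
/// The function name and the ".tmp" file name are hypothetical.
pub fn replace_metadata(store_dir: &Path, contents: &[u8]) -> io::Result<()> {
    let tmp_path = store_dir.join("metadata.json.tmp");
    let final_path = store_dir.join("metadata.json");

    // 1. Write the new contents to a temporary file in the same directory,
    //    so the later rename stays within one filesystem.
    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(contents)?;

    // 2. Ask the OS to push the temp file's data and metadata to storage
    //    before any name points at it.
    tmp.sync_all()?;
    drop(tmp);

    // 3. Swap the new file into place; readers see the old file or the
    //    new one, never a half-written mix.
    fs::rename(&tmp_path, &final_path)?;

    // 4. Sync the containing directory so the rename itself is durable.
    //    Opening a directory read-only like this assumes Unix.
    File::open(store_dir)?.sync_all()?;
    Ok(())
}
```

Skipping step 2 risks a durable name that points at missing contents after a crash, and skipping step 4 risks losing the rename even though the new contents were written.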
The temp-file-and-rename pattern avoids half-written JSON, but it is not a journal. Without a recovery pass or a transaction log, the filesystem cannot determine after a crash whether every in-flight metadata update had committed.

File contents as local blobs

For the data path, magicfs stores each committed file version as one immutable blob with an allocated ID. A more complete filesystem would split larger files into chunks and let metadata point at a list of chunks, but one blob per file keeps the code short.

For reads, metadata comes first. Given inode 3, the filesystem finds the entry for notes.txt, reads the blob ID from that entry, opens the corresponding file under blobs/, and returns the byte range the kernel requested.

    inode 3

The example ends up with a small copy-on-write data path. Rewriting one byte of a large file should not require rewriting the whole file, so a more complete implementation would chunk the file, track dirty chunks, write only the changed chunks, and then commit a metadata update that points at the new chunk list. magicfs skips that complexity by assuming the files are small enough to rewrite as a unit.

A shell command like this looks simpler than the filesystem work behind it:

    $ echo "remember the milk" > /magic/notes.txt
    OPEN notes.txt for writing

FUSE also calls the filesystem when a file descriptor closes. Flush is called on each close, and because file descriptors can be duplicated, one open file can see more than one flush. A filesystem can use flush to report delayed write errors, but flush does not mean the same thing as fsync. Release happens later still, when the kernel is done with the open file handle.

For the shell demo, magicfs commits staged bytes on both FLUSH and FSYNC, which makes echo hello > /magic/notes.txt behave the way a person expects. The code still treats fsync as the explicit request for durable file data and metadata.
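The staging and commit behaviour described above can be modelled in memory. The sketch below is an illustration and not the magicfs source: the Store type, its method names, and numbering blobs from 1 are all assumptions, and hash maps stand in for blobs/ and metadata.json. It shows writes landing in a staged buffer, and a commit (what FLUSH or FSYNC would trigger) that stores a new blob first and only then repoints the inode's metadata.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the backing store (hypothetical type).
#[derive(Default)]
pub struct Store {
    blobs: HashMap<String, Vec<u8>>, // blob ID -> immutable contents
    metadata: HashMap<u64, String>,  // inode -> current blob ID
    staged: HashMap<u64, Vec<u8>>,   // inode -> uncommitted bytes
    next_blob: u64,
}

impl Store {
    /// Read the committed contents for an inode (empty if none yet).
    pub fn read(&self, ino: u64) -> Vec<u8> {
        self.metadata
            .get(&ino)
            .and_then(|id| self.blobs.get(id))
            .cloned()
            .unwrap_or_default()
    }

    /// Truncation (as with `>` in a shell): stage an empty file.
    pub fn truncate(&mut self, ino: u64) {
        self.staged.insert(ino, Vec::new());
    }

    /// WRITE: patch the staged copy; committed readers are unaffected.
    pub fn write(&mut self, ino: u64, offset: usize, data: &[u8]) {
        if !self.staged.contains_key(&ino) {
            // First write since the last commit: start from committed bytes.
            let base = self.read(ino);
            self.staged.insert(ino, base);
        }
        let buf = self.staged.get_mut(&ino).unwrap();
        let end = offset + data.len();
        if buf.len() < end {
            buf.resize(end, 0);
        }
        buf[offset..end].copy_from_slice(data);
    }

    /// FLUSH or FSYNC: store the new blob first, then update metadata.
    /// Returns the new blob ID, or None if nothing was staged.
    pub fn commit(&mut self, ino: u64) -> Option<String> {
        let bytes = self.staged.remove(&ino)?;
        self.next_blob += 1;
        let id = format!("blob-{:012}", self.next_blob);
        self.blobs.insert(id.clone(), bytes); // 1. blob
        self.metadata.insert(ino, id.clone()); // 2. metadata
        Some(id)
    }
}
```

Because commit removes the staged buffer, a second flush on a duplicated descriptor finds nothing staged and returns None instead of writing another blob. The old blob stays in the map after a rewrite, which mirrors the orphaned blobs a store like this leaves behind without a cleanup pass.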
A database that calls fsync is asking a more specific question than a shell that happened to close a redirected file. If the backing blob write fails after WRITE already returned success, the filesystem still has to decide where that error can be reported: either through a later fsync or through a close-time error from flush, although plenty of programs are not careful about checking close errors.

For metadata, replacing a file with rename is atomic for readers, but atomic replacement is not the same thing as durability after power loss. If you care that the new metadata.json survives a crash, you need to sync the new file contents and the directory entry that points at it. magicfs handles that for its local store by syncing the temporary metadata file before rename, then syncing the store directory after rename.

In code, those rules show up in the order of blob writes, metadata replacement, flush, and fsync, because the filesystem has to decide which bytes exist, which names point at them, and what an application is allowed to assume after a successful sync.

Caching and stale metadata

FUSE replies can include time-to-live values for names and attributes. Until those TTLs expire, the kernel can answer repeated lookups and getattr calls without asking the userspace process again, which matters because crossing from the kernel into a userspace filesystem on every stat would be expensive.

The same TTL also affects correctness. magicfs uses a one-second TTL, which is fine for a single-process demo. But if another process or another machine can update the same backing store, a reader may see an old file size or an old blob ID until the cache expires, unless the filesystem actively invalidates the kernel's cached state.
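The staleness window is easier to see in a toy model. The sketch below is not how the kernel implements its cache and does not use any FUSE API; AttrCache and its methods are hypothetical names, and time is passed in explicitly as a Duration since some arbitrary start so the behaviour is deterministic. It only shows the rule that a cached attribute is served until its TTL expires or it is explicitly invalidated.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Toy model of a TTL-based attribute cache keyed by inode
/// (hypothetical type, not a kernel or fuser structure).
pub struct AttrCache {
    ttl: Duration,
    // inode -> (cached file size, time at which the entry expires)
    entries: HashMap<u64, (u64, Duration)>,
}

impl AttrCache {
    pub fn new(ttl: Duration) -> Self {
        AttrCache { ttl, entries: HashMap::new() }
    }

    /// Return the size for `ino` at time `now`. `ask_fs` stands in for a
    /// round trip to the userspace filesystem and runs only on a miss
    /// or after the cached entry has expired.
    pub fn getattr(&mut self, ino: u64, now: Duration, ask_fs: impl FnOnce() -> u64) -> u64 {
        if let Some(&(size, expires_at)) = self.entries.get(&ino) {
            if now < expires_at {
                return size; // served from cache, possibly stale
            }
        }
        let size = ask_fs();
        self.entries.insert(ino, (size, now + self.ttl));
        size
    }

    /// Explicit invalidation: drop the entry so the next call asks again.
    pub fn invalidate(&mut self, ino: u64) {
        self.entries.remove(&ino);
    }
}
```

With a one-second TTL, a size that changes in the backing store half a second after the first getattr is not observed until the entry expires or is invalidated, which is the stale-read window described above.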
For file contents, magicfs opens files with FUSE direct I/O so reads come back to the userspace filesystem instead of being served from the page cache. That keeps the example easier to reason about, but it gives up caching and read-ahead that a real filesystem would probably want. The cache policy matters because it changes which file size, inode attributes, and file contents callers are able to observe.

The implementation only supports one directory, and each file is stored as one local blob, so rewriting a byte rewrites the whole file. There is no journal, recovery scan, or cleanup for orphaned blobs left behind by rewrites or unlinks. It also does not implement locking, mmap, extended attributes, a real permission model, sparse files, hard links, symlinks, or multi-client cache invalidation.

The filesystem also does not model the problems that show up when the backing layer is remote. Network failures, remote consistency rules, retries, and authentication all change when reads can succeed, when writes can be retried, and what fsync can honestly report. This example stays on local disk so the post can focus on filesystem calls.

A journal or transaction log would let recovery decide whether a metadata update committed, chunking would avoid rewriting whole files, a garbage collector would find blobs no metadata entry can reach, and better cache invalidation would keep multiple readers from seeing stale metadata for too long.

With FUSE, Linux asks the filesystem a fixed collection of questions, and the implementation can answer from whatever backing store it owns. That means the implementation still has to define lookup, write, flush, fsync, and rename when metadata and file contents are stored somewhere else.

I am working on these filesystem, sandboxing, and storage problems at Tines, along with plenty of adjacent systems work that gets deeper than a blog post can. If that sounds interesting, we are hiring.
last modified June 11, 2026 © 2026 Shayon Mukherjee