Writing Filesystems - VFS and Vnode interfaces

From Genunix

This article has been identified as a draft. It is currently undergoing a community review. Please add your comments to the discussion page.

Do not quote any text on this page! It is still a draft!


The Solaris kernel talks to filesystem drivers through two sets of interfaces: VFS_*() for filesystem instances (mountpoints), and VOP_*() for file (node) instances. This section gives a 100,000ft view of "what filesystem code does". Later sections will cover the details down to sample code level.

Operations on a filesystem on Solaris fall into two categories:

Operations that affect one specific instance (a "mount") of a filesystem type
These are called VFS ops. They provide the backends for system calls like mount(2), umount(2), statvfs(2), or sync(2), and implement some per-instance services that the framework requires.
Operations on "nodes" (in the UNIX mknod sense) within such an instance
Nodes are files, directories, named pipes, devices, ... - whatever can be found by name in a filesystem is a "node". These are called Vnode ops and provide the backend for actual "filesystem I/O", that is, system calls like open(2), close(2), read(2), write(2), stat(2), creat(2) and so on.

Sun introduced this abstraction of virtual filesystem/file operations in SunOS 2.0, and the VFS/Vnode ideas have since spread to most UNIX and UNIX-like operating systems. The exact per-mountpoint (VFS) and per-node (VOP) operations differ, but the general idea of dispatching filesystem/file-related system calls via virtual function tables is found everywhere.

One of the major differences between Solaris and other UNIX/UNIX-like operating systems is that since Solaris 10, VFS and Vnode operations are dynamically bound to the filesystem/file and can be changed at runtime. It is possible in Solaris to revector the VFS interfaces of a given mountpoint elsewhere. It is equally possible to redirect the vnode operations of a given vnode to a new interface, even while the file this vnode belongs to is, for example, held open. Solaris therefore no longer uses statically-initialized function tables, but a template-based mechanism. We'll see the sample code shortly.
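As a rough picture of what "template-based" means, the following user-space sketch builds an operations vector from a list of name/function pairs instead of a statically laid-out table. Everything in it is hypothetical and invented for illustration: the types model_ops_t and op_def_t, the function model_build_ops(), the operation name strings, and the toyfs_* functions are not kernel interfaces. The sketch also assumes that operations missing from the template fall back to a default that reports "not supported"; the defaults the real framework installs may differ per operation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operation signature; real ops take richer arguments. */
typedef int (*op_fn_t)(void *instance);

/* Hypothetical ops vector, standing in for a per-filesystem table. */
typedef struct model_ops {
	op_fn_t mount;
	op_fn_t unmount;
	op_fn_t statvfs;
	op_fn_t sync;
} model_ops_t;

/* One template entry: an operation name and its implementation. */
typedef struct op_def {
	const char *name;
	op_fn_t func;
} op_def_t;

/* Assumed default for every operation the template leaves out. */
static int
op_nosys(void *instance)
{
	(void) instance;
	return (ENOSYS);
}

/* Translate an operation name into its slot; NULL if the name is unknown. */
static op_fn_t *
slot_for(model_ops_t *ops, const char *name)
{
	if (strcmp(name, "mount") == 0)
		return (&ops->mount);
	if (strcmp(name, "unmount") == 0)
		return (&ops->unmount);
	if (strcmp(name, "statvfs") == 0)
		return (&ops->statvfs);
	if (strcmp(name, "sync") == 0)
		return (&ops->sync);
	return (NULL);
}

/*
 * Build an ops vector from a template terminated by a NULL name.
 * Returns 0 on success, EINVAL if the template names an unknown operation
 * (in which case the vector is left partially filled).
 */
int
model_build_ops(const op_def_t *tmpl, model_ops_t *ops)
{
	assert(tmpl != NULL && ops != NULL);
	ops->mount = ops->unmount = ops->statvfs = ops->sync = op_nosys;
	for (; tmpl->name != NULL; tmpl++) {
		op_fn_t *slot = slot_for(ops, tmpl->name);

		if (slot == NULL)
			return (EINVAL);
		*slot = tmpl->func;
	}
	return (0);
}

/* A toy filesystem that implements only two operations. */
static int
toyfs_mount(void *instance)
{
	*(int *)instance = 1;	/* mark the toy instance as mounted */
	return (0);
}

static int
toyfs_statvfs(void *instance)
{
	(void) instance;
	return (0);
}

static const op_def_t toyfs_template[] = {
	{ "mount",   toyfs_mount },
	{ "statvfs", toyfs_statvfs },
	{ NULL,      NULL }
};
```

Calling model_build_ops(toyfs_template, &ops) yields a vector in which ops.mount and ops.statvfs point at the toy implementations, while ops.unmount and ops.sync return ENOSYS. Because the template refers to operations by name rather than by position, the vector layout can gain new slots without invalidating existing templates.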

But first, a short explanation of all VFS/Vnode ops that the Solaris filesystem framework knows about.

VFS Interfaces - filesystem instances

VFS operations are per instance. A filesystem instance is just a different word for a mounted filesystem, together with its associated kernel state. A filesystem driver calls the framework function vfs_setfsops() from its module initialization function to register the per-mountpoint services it provides with the framework. The operation names are listed in <sys/vfs.h>:

VFS operation syscalls Description
VFS_MOUNT() mount(2) Create a filesystem instance. The task of VFS_MOUNT() is to validate that the passed-in pathname can be accessed as a filesystem of this type, and that the on-disk metadata structures, as far as it is necessary to check them at this time, are sane. Its behaviour must adjust to the mount options specified. If these checks pass, the function sets up the 'backend' per-instance data structure and hooks it into vfs_data of the struct vfs * that the framework passes in as an argument.
VFS_UNMOUNT() umount(2), umount2(2) Tears down a filesystem instance. VFS_UNMOUNT() needs to check whether the filesystem can be unmounted (if it's not "busy" or if a forcible unmount is requested), then release the state associated with this filesystem - like cached inactive vnodes and cached metadata - and finally confirm back to the framework that the filesystem instance is no longer valid by setting VFS_UNMOUNTED in vfsp->vfs_flags.
VFS_STATVFS() statvfs(2) The backend for the statvfs(2) system call is provided by VFS_STATVFS(). Fill in a struct statvfs here.
VFS_SYNC() sync(2) This provides an (extended) backend for the sync(2) system call - commit all non-completed I/O to a specific (or to all) filesystem instance(s) of this filesystem type.
VFS_FREEVFS() N/A The filesystem framework calls this some time after a successful call to VFS_UNMOUNT() to notify the filesystem that all state (previously still active vnodes) associated with the filesystem instance is gone now and the filesystem is now allowed to free the per-instance state it hooked into vfs_data during VFS_MOUNT(). All the implementation needs to do here is to call kmem_free() on that.
VFS_ROOT() mount(2) Pathname traversal across mountpoints of this filesystem type needs a way to obtain the vnode_t of the filesystem's root. VFS_ROOT() delivers that.
VFS_VNSTATE() N/A XXX - writeme ...
VFS_MOUNTROOT() N/A Being able to use a filesystem as root (mountpoint "/") requires special support. VFS_MOUNTROOT() is executed early in boot during the first (implicit) mount of the root filesystem. It is also necessary to provide remount support (via the VFSSW_CANREMOUNT flag in vfsdef_t). It is not necessary to provide this entry point unless you wish to be able to host a root filesystem.
VFS_VGET() N/A The life of a vnode on a filesystem begins with the framework's request to the instance: "get me a node for this 'file identifier'". VFS_VGET() does that; it associates per-filesystem state (the 'node') with a vnode_t.

A filesystem will not be really functional unless at least VFS_MOUNT(), VFS_UNMOUNT(), VFS_ROOT(), VFS_STATVFS(), VFS_SYNC(), and VFS_VGET() are implemented. If you wish to support forced unmounts (aka the umount2() system call that accepts the MS_FORCE flag), VFS_FREEVFS() becomes mandatory unless you wish to go "deeply homegrown" ...
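Why forced unmounts make VFS_FREEVFS() necessary can be seen in a small user-space model of the instance lifecycle. This is a sketch under stated assumptions, not kernel code: model_vfs_t, the model_*() functions, and the MODEL_* flags are hypothetical stand-ins, and the model folds the framework's and the filesystem's share of the work into single functions. It assumes the instance carries a hold count: one hold for the mount itself plus one per still-active vnode.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define	MODEL_UNMOUNTED	0x1	/* plays the role of VFS_UNMOUNTED */
#define	MODEL_FORCE	0x1	/* plays the role of MS_FORCE */

/* Hypothetical stand-in for a mounted filesystem instance. */
typedef struct model_vfs {
	int flags;
	int holds;	/* 1 for the mount itself + 1 per active vnode */
	void *data;	/* per-instance state; stands in for vfs_data */
} model_vfs_t;

static int model_freed_count;	/* how often the freevfs step has run */

/* VFS_MOUNT() role: set up per-instance state and take the mount's hold. */
int
model_mount(model_vfs_t *vfsp)
{
	vfsp->data = malloc(64);
	if (vfsp->data == NULL)
		return (ENOMEM);
	vfsp->flags = 0;
	vfsp->holds = 1;
	return (0);
}

/* VFS_FREEVFS() role: all references are gone, release per-instance state. */
static void
model_freevfs(model_vfs_t *vfsp)
{
	free(vfsp->data);
	vfsp->data = NULL;
	model_freed_count++;
}

/* An active vnode takes a hold on its filesystem instance ... */
void
model_hold(model_vfs_t *vfsp)
{
	vfsp->holds++;
}

/* ... and drops it again; the last release triggers the freevfs step. */
void
model_rele(model_vfs_t *vfsp)
{
	assert(vfsp->holds > 0);
	if (--vfsp->holds == 0)
		model_freevfs(vfsp);
}

/*
 * VFS_UNMOUNT() role: refuse if busy unless forced, then mark the instance
 * as unmounted and drop the mount's own hold. With active vnodes left over
 * from a forced unmount, the per-instance state must outlive this call.
 */
int
model_unmount(model_vfs_t *vfsp, int flag)
{
	if (vfsp->holds > 1 && !(flag & MODEL_FORCE))
		return (EBUSY);
	vfsp->flags |= MODEL_UNMOUNTED;
	model_rele(vfsp);
	return (0);
}
```

With one extra hold outstanding, model_unmount(&vfs, 0) returns EBUSY, while model_unmount(&vfs, MODEL_FORCE) succeeds but leaves vfs.data allocated; only the final model_rele() frees it. Without a separate "everything is gone now" callback, a forced unmount would have to either leak that state or free it while vnodes still point at it.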

Vnode Interfaces - operations on files and directories

Just as the VFS ops implement functionality for filesystem instances, the vnode (virtualized node) operations provide the backend for per-node (file, directory or 'special') functionality. As with the VFS interfaces, there is a rough but not 1:1 correspondence between system calls and vnode ops. Also as with the VFS interfaces, the binding between a vnode_t and its operations vector is dynamic: the filesystem can decide at any point in time to replace an existing vnode's ops vector with a different one, just by calling vn_setops(). In addition, to allow versioning of the vnode interfaces, a filesystem no longer initializes vnode ops vectors statically, but uses the same name/value based template mechanism as for the VFS ops vectors: a struct vnodeops is initialized from an fs_operation_def_t array template by calling vn_make_ops(). All vnode ops names are listed in <sys/vnode.h>. The table below gives an overview of most of them. For these as well as for others not listed here, the generic advice holds: Use the source, Luke!
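The dynamic binding between a vnode and its ops vector can be pictured with a minimal user-space model. It is only a sketch: sw_vnode_t, sw_vnodeops_t, sw_vop_read() and sw_setops() are hypothetical names (sw_setops() merely plays the role that vn_setops() has in the kernel), and the scenario of switching an open file over to a "dead" vector that fails every read is an assumed example, not something taken from a particular filesystem.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sw_vnode;

/* Hypothetical ops vector with a single operation. */
typedef struct sw_vnodeops {
	const char *name;
	int (*vop_read)(struct sw_vnode *vp, char *buf, size_t len);
} sw_vnodeops_t;

/* Hypothetical vnode: a pointer to the current vector plus an open count. */
typedef struct sw_vnode {
	const sw_vnodeops_t *ops;
	int open_count;
} sw_vnode_t;

/* Framework side: every call goes through whatever vector is bound now. */
int
sw_vop_read(sw_vnode_t *vp, char *buf, size_t len)
{
	assert(vp != NULL && vp->ops != NULL);
	return (vp->ops->vop_read(vp, buf, len));
}

/* Rebind the vnode to a different vector (the role of vn_setops()). */
void
sw_setops(sw_vnode_t *vp, const sw_vnodeops_t *ops)
{
	vp->ops = ops;
}

/* Normal implementation: hands back one byte of toy data as a string. */
static int
live_read(sw_vnode_t *vp, char *buf, size_t len)
{
	(void) vp;
	if (len < 2)
		return (EINVAL);
	buf[0] = 'x';
	buf[1] = '\0';
	return (0);
}

/* Replacement implementation: the backing store is gone, fail every read. */
static int
dead_read(sw_vnode_t *vp, char *buf, size_t len)
{
	(void) vp;
	(void) buf;
	(void) len;
	return (EIO);
}

static const sw_vnodeops_t live_ops = { "live", live_read };
static const sw_vnodeops_t dead_ops = { "dead", dead_read };
```

For a vnode initialized as { &live_ops, 1 }, sw_vop_read() succeeds until sw_setops(&vn, &dead_ops) is called; after that, the same call on the same vnode returns EIO, while open_count is untouched. Callers never see the switch because they only ever dispatch through vp->ops.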

Vnode operation syscalls Description
VOP_OPEN(), VOP_CLOSE() open(2), close(2), dup(2) These are called by the framework to implement the open(2) and close(2) system calls. Unlike the cb_open() and cb_close() character/block device operations, they are strictly paired, i.e. not close-on-last-close.
VOP_READ(), VOP_WRITE() read(2), write(2) These provide the backend for the read(2) and write(2) system calls (and all their variants, pread(), readv(), and so on).
VOP_GETPAGE(), VOP_PUTPAGE() msync(3C), madvise(3C), memcntl(2) The role of VOP_GETPAGE() and VOP_PUTPAGE() goes far beyond 'mmapped file data sync support'. Their implementation is actually the only place in a Solaris filesystem driver where user data I/O happens. They implement paged I/O, and are often called implicitly by the VM framework, the pageout/fsflush daemons, and the filesystem code itself. Explicit calls to these functions as a consequence of certain system calls are the exception - not the rule.
VOP_CREATE(), VOP_REMOVE() creat(2), remove(3C), unlink(2), mknod(2) These implement simple "node creation/deletion" (for files).
VOP_MKDIR(), VOP_RMDIR() mkdir(2), rmdir(2), mkdirp(3GEN), rmdirp(3GEN) Directory equivalent of VOP_CREATE() / VOP_REMOVE().
VOP_GETATTR(), VOP_SETATTR() stat(2), utimes(2), chmod(2), chown(2), and others ... This function pair queries/sets vnode attributes. Since many system calls end up here, the pair provides a functional superset: each system call implementation restricts struct vattr to the fields that system call actually operates on. See <sys/vnode.h> for an example.
VOP_GETSECATTR(), VOP_SETSECATTR() acl(2) Filesystems that wish to support access control lists (ACLs) need to implement these two functions that deal with security attributes.
VOP_MAP(), VOP_ADDMAP(), VOP_DELMAP() mmap(2), munmap(2) The filesystem tracks memory-map requests (for userspace memory-mapped I/O or for kernel paged I/O) via this set of functions. VOP_MAP() creates the actual mapping, VOP_ADDMAP() does reference counting and copy-on-write, and VOP_DELMAP() tears down mappings. A call to mmap() ends in VOP_MAP(), which indirectly calls VOP_ADDMAP(). A call to munmap() leads to VOP_DELMAP() being executed.
VOP_LINK(), VOP_READLINK(), VOP_SYMLINK() lstat(2), link(2), symlink(2) These provide support for hardlinks and softlinks.
VOP_READDIR(), VOP_LOOKUP() readdir(3C), lots of others ... Filename lookup support (i.e. the ability to perform an ls [ -l ]) is provided by these. Without VOP_LOOKUP(), files cannot even be opened on a filesystem, because the open(2) backends perform filename lookups to retrieve a vnode, and only call VOP_OPEN() to allow the filesystem to reference count. VOP_LOOKUP() calls VFS_VGET() in order to create the filesystem-specific 'node' associated with a filename.
VOP_READDIR(), VOP_LOOKUP() and functions like VOP_CREATE(), VOP_RENAME(), VOP_MKDIR() etc. share a lot of code for parsing directory entries. It's therefore a good idea to implement them on top of a common backend that avoids duplicating large swaths of code. The code samples given later will show one way to do this; unfortunately, apart from ZFS no current Solaris filesystem exploits this opportunity for code sharing ...
VOP_SETFL(), VOP_FRLOCK(), VOP_SHRLOCK() fcntl(2) These provide specific functionality of fcntl(2) - file status flags (F_SETFL and F_GETFL), POSIX record locking (F_SETLK and F_GETLK), and CIFS-style file sharelocks (F_SHARE and F_UNSHARE). Generic implementations are available to be plugged in for those callbacks, if the filesystem supports this behaviour and requires no modifications/adaptations.
VOP_IOCTL() ioctl(2) If the filesystem wishes to provide ioctl() operations, it needs to define the entry vector here.
VOP_PATHCONF() pathconf(2) The pathconf() system call allows queries for basic filesystem capabilities and limits. Since NFS uses VOP_PATHCONF() to detect whether a backing-store filesystem is capable of hosting an NFS export, providing VOP_PATHCONF() is mandatory if you want your filesystem to be exportable via NFS.
VOP_FSYNC() fdsync(2), fdatasync(3RT), fsync(3C) Completes all outstanding I/O on a filesystem node. Required for proper POSIX semantics.
VOP_SPACE() truncate(3C), posix_fallocate(3C) Support for growing/shrinking files in an optimized manner can be implemented via VOP_SPACE().
VOP_ACCESS() access(2) VOP_ACCESS() maps directly to the access(2) system call.
VOP_INACTIVE() N/A One of the most important vnode ops is VOP_INACTIVE(). It's the functional counterpart to VFS_VGET(): the filesystem framework calls VOP_INACTIVE() when the last reference to a vnode is being released, i.e. when the release path finds the vnode's reference count at one. VOP_INACTIVE() is supposed to disassociate cached filesystem state (the 'fs node') from the vnode, and discard all still-held internal references (whether filesystem node hashes, page cache, etc...) from the system.
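The pairing of VFS_VGET() and VOP_INACTIVE() described in the table can be sketched as a user-space reference-counting model. All names here (rc_vnode_t, rc_vget(), rc_hold(), rc_rele(), rc_inactive()) are hypothetical and only mimic the roles of the kernel interfaces. The sketch assumes that the release path does not decrement the final reference itself, but hands the vnode to the filesystem's inactive routine while the count is still one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical vnode: a reference count plus filesystem-private state. */
typedef struct rc_vnode {
	int count;	/* number of references currently held */
	void *fsnode;	/* the 'fs node' associated with this vnode */
} rc_vnode_t;

static int rc_inactive_calls;	/* how often the inactive step has run */

/* VFS_VGET() role: attach fs state and hand out the first reference. */
int
rc_vget(rc_vnode_t *vp)
{
	vp->fsnode = malloc(32);
	if (vp->fsnode == NULL)
		return (ENOMEM);
	vp->count = 1;
	return (0);
}

/* VOP_INACTIVE() role: drop the last reference and detach the fs state. */
static void
rc_inactive(rc_vnode_t *vp)
{
	rc_inactive_calls++;
	vp->count--;
	free(vp->fsnode);
	vp->fsnode = NULL;
}

/* Take an additional reference (for example, for an open file). */
void
rc_hold(rc_vnode_t *vp)
{
	vp->count++;
}

/*
 * Release a reference. If this is the last one, the vnode is passed to
 * the inactive step with the count still at one (assumed behaviour).
 */
void
rc_rele(rc_vnode_t *vp)
{
	assert(vp->count > 0);
	if (vp->count == 1) {
		rc_inactive(vp);
		return;
	}
	vp->count--;
}
```

After rc_vget() and one rc_hold(), the first rc_rele() only decrements the count and leaves the fs node attached; the second one runs the inactive step, which is where the filesystem gets its chance to drop hash entries, cached pages and similar internal references.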

something goes here ...