Writing Filesystems - Mapped I/O Backends

From Genunix

One of the big deficiencies in the Solaris filesystem framework is that there is no framework service function for the glue logic of VOP_GETPAGE() and VOP_PUTPAGE(). This makes these two vnode ops unnecessarily complicated.

For simplicity, this code makes extensive use of functions from the paged vnode support code in vm_pvn.c.

VOP_GETPAGE()

Before we investigate the actual implementation, let's look at the arguments and understand how the framework calls VOP_GETPAGE(), and what it expects the function to do.

The prototype for an implementation of VOP_GETPAGE() can be found in fop_getpage(), and looks like this:

int
fop_getpage(
	vnode_t *vp,
	offset_t off,
	size_t len,
	uint_t *protp,
	page_t **plarr,
	size_t plsz,
	struct seg *seg,
	caddr_t addr,
	enum seg_rw rw,
	cred_t *cr)

The arguments and their meanings are:

vnode_t *vp
This is the vnode for which the framework requests a fault to be handled.
offset_t off, size_t len
Offset and size describe the fault location: the framework requests that the bytes in the range [off, off + len) be brought in.
A special meaning is attributed to len == 0: such a request means 'from off to EOF' (end of file).
uint_t *protp, page_t *pl[], size_t plsz
The VM framework passes arrays for pages and per-page protection information as arguments to VOP_GETPAGE(). The size_t plsz parameter gives the number of entries in both the protp[] and the *pl[] array. protp[] is optional and need not be filled in if it is not provided.
A special case is pl == NULL and plsz == 0, which is used for a readahead request on the requested byte range. On such a request, an implementation may choose to simply return 0 ('success') from VOP_GETPAGE().
The calling framework guarantees that plsz is large enough to accommodate the requested fault range [off, off + len), but it will often be larger than that. XXX - explain purpose !!!!
struct seg *seg
A pointer to the virtual memory subsystem's segment structure. This is the segment mapped to the [off, off + len) fault range.
caddr_t addr
The target virtual address (guaranteed to be valid in kernel mode while VOP_GETPAGE() is processing) where the data is supposed to be written to.
enum seg_rw rw
VOP_GETPAGE() does not only handle read faults. It is also called by the framework on initialization faults, if a page in a mapping is accessed for the first time - even if that access is actually a store. The possible values for enum seg_rw can be found in <vm/seg_enum.h>.
cred_t *cr
Credentials associated with the calling process. XXX - actually ignored ?!
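
To make the off/len convention described above concrete, here is a minimal userspace sketch. It is not framework code: the names sk_request_len() and sk_request_pages() and the 4096-byte page size are assumptions made for illustration only. It shows how a request with len == 0 resolves to 'from off to EOF', and how many pages a given request touches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SK_PAGESIZE	4096u			/* assumed page size */
#define	SK_PAGEOFFSET	(SK_PAGESIZE - 1)

/*
 * Resolve an (off, len) request into the number of bytes it covers.
 * len == 0 means "from off to EOF".
 */
static uint64_t
sk_request_len(uint64_t off, size_t len, uint64_t filesize)
{
	if (len != 0)
		return (len);
	return (off < filesize ? filesize - off : 0);
}

/*
 * Number of pages touched by the half-open byte range [off, off + len).
 */
static uint64_t
sk_request_pages(uint64_t off, size_t len, uint64_t filesize)
{
	uint64_t bytes = sk_request_len(off, len, filesize);
	uint64_t first = off & ~(uint64_t)SK_PAGEOFFSET;
	uint64_t end;

	if (bytes == 0)
		return (0);

	/* Round the end of the range up to the next page boundary. */
	end = (off + bytes + SK_PAGEOFFSET) & ~(uint64_t)SK_PAGEOFFSET;
	return ((end - first) / SK_PAGESIZE);
}
```

For a 10000-byte file, a request with off == 0 and len == 0 covers all 10000 bytes and therefore three pages, the last of which is only partially backed by file data.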

Actual code:

static int
fat_getpage(
	struct vnode *vp,
	offset_t off,
	size_t len,
	uint_t *protp,
	struct page *pl[],
	size_t plsz,
	struct seg *seg,
	caddr_t addr,
	enum seg_rw rw,
	struct cred *cred)
{
	struct fatnode *fip = VTOF(vp);
	struct fatfs *fsp = VFSTOFATFS(vp->v_vfsp);
	int err;

	if (vp->v_flag & VNOMAP) {
		return (ENOSYS);
	}

	ASSERT(off <= FAT_MAXOFFSET_T);
	ASSERT((off & PAGEOFFSET) == 0);

	FAT_ENTER(fsp, FAT_ENTER_SHARED);

	/*
	 * An attempt to fault in pages from beyond the end of the file
	 * must fail if the target is userspace.
	 */
	if ((off + len) > (offset_t)(fip->f_size + PAGEOFFSET) &&
	    seg != segkmap) {
		FAT_EXIT(fsp);
		return (EFAULT);	/* beyond EOF */
	}

	if (protp != NULL)
		*protp = PROT_ALL;

	/*
	 * This is a small optimization. A fault on a single page does not
	 * need to call the iterator.
	 */
	if (len <= PAGESIZE) {
		err = fat_getapage(vp, (u_offset_t)off, len, protp, pl, plsz,
		    seg, addr, rw, cred);
	} else {
		err = pvn_getpages(fat_getapage, vp, off, len, protp,
		    pl, plsz, seg, addr, rw, cred);
	}

	FAT_EXIT(fsp);
	return (err);
}
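
The end-of-file test in fat_getpage() is easy to misread, so here is a small userspace sketch of just that comparison. The function name sk_beyond_eof(), the boolean standing in for 'seg == segkmap' and the 4096-byte page size are assumptions for illustration. Adding PAGEOFFSET to the file size is what allows a fault on the last, partially filled page to succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	SK_PAGESIZE	4096			/* assumed page size */
#define	SK_PAGEOFFSET	(SK_PAGESIZE - 1)

/*
 * Models the check
 *	(off + len) > (offset_t)(f_size + PAGEOFFSET) && seg != segkmap
 * is_kernel_map stands in for "seg == segkmap".
 */
static bool
sk_beyond_eof(int64_t off, int64_t len, int64_t f_size, bool is_kernel_map)
{
	assert((off & SK_PAGEOFFSET) == 0);	/* faults are page aligned */

	return ((off + len) > (f_size + SK_PAGEOFFSET) && !is_kernel_map);
}
```

With a 10000-byte file, a one-page fault at offset 8192 is accepted even though the page extends past byte 10000, while a one-page fault at offset 12288 is rejected unless it comes through the kernel mapping.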

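fat_getpage() thus breaks the fault-in request for the byte range [off, off + len) down into requests to fault in single pages: a request of at most one page goes directly to fat_getapage(), while a larger request is handed to pvn_getpages(), which receives fat_getapage() as its per-page callback.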
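
The idea of splitting a range into per-page callbacks can be sketched in userspace as follows. This is a simplified model and not the implementation of pvn_getpages() in vm_pvn.c: the names sk_getpages(), sk_getapage_fn, struct sk_counter and sk_count_page(), the reduced argument list and the 4096-byte page size are all assumptions. The sketch only shows the control flow, namely one callback per page and a stop at the first error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SK_PAGESIZE	4096u			/* assumed page size */

/* Hypothetical per-page callback; returns 0 or an errno-style code. */
typedef int (*sk_getapage_fn)(uint64_t off, size_t len, void *arg);

/*
 * Call getapage() once for every page in [off, off + len), stopping
 * at the first failure.
 */
static int
sk_getpages(sk_getapage_fn getapage, uint64_t off, size_t len, void *arg)
{
	uint64_t cur = off;
	uint64_t end = off + len;
	int err = 0;

	assert((off % SK_PAGESIZE) == 0);

	while (err == 0 && cur < end) {
		/* The last chunk may be shorter than a full page. */
		size_t chunk = (end - cur < SK_PAGESIZE) ?
		    (size_t)(end - cur) : SK_PAGESIZE;

		err = getapage(cur, chunk, arg);
		cur += SK_PAGESIZE;
	}
	return (err);
}

/* Example callback: counts successful calls, fails at one offset. */
struct sk_counter {
	unsigned calls;
	uint64_t fail_at;
};

static int
sk_count_page(uint64_t off, size_t len, void *arg)
{
	struct sk_counter *c = arg;

	(void) len;
	if (off == c->fail_at)
		return (5);	/* arbitrary nonzero error code */
	c->calls++;
	return (0);
}
```

Passing a three-page range with a callback that never fails results in three calls. If the callback fails on the second page, the iterator returns that error and the third page is never requested.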

XXX - need to explain why this is a good thing
XXX - need to give the getapage sample code !!!!

VOP_PUTPAGE()

The prototype for an implementation of VOP_PUTPAGE() can be found in fop_putpage(), and looks like this:

int
fop_putpage(
	struct vnode *vp,
	offset_t off,
	size_t len,
	int flags,
	struct cred *cr)

Being the counterpart to VOP_GETPAGE(), this vnode operation's primary task is to write dirty pages associated with the given byte range [off, off + len) into the on-disk representation of vnode_t *vp. But like VOP_GETPAGE(), which also does zero-fill and readahead, VOP_PUTPAGE() does more than just write dirty pages out - it must also support invalidation and freeing of pages associated with the vnode in the specified byte range. The parameters for VOP_PUTPAGE() and their possible values are:

vnode_t *vp
The vnode whose cached pages are to be written out, invalidated or freed.
offset_t off, size_t len
Same meaning as with VOP_GETPAGE(), including len == 0 marking a request to flush from off to EOF.
int flags
XXX - explain
struct cred *cr
XXX - as with getpage, unused ...
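
While the flags are not yet described here, the way the fat_putpage() listing in this section turns them into a page lock mode and a lookup style can be isolated in a small userspace sketch. The names sk_lookup_mode(), struct sk_lookup and enum sk_se are made up for illustration, and the SK_B_* bit values are placeholders, not the kernel's real B_* values.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bits for this sketch only, NOT the kernel's B_* values. */
#define	SK_B_ASYNC	0x01
#define	SK_B_INVAL	0x02
#define	SK_B_FREE	0x04

enum sk_se { SK_SE_SHARED, SK_SE_EXCL };

struct sk_lookup {
	enum sk_se se;		/* lock mode to request on the page */
	bool synchronous;	/* true: blocking lookup, false: nowait */
};

/*
 * Same decision as in fat_putpage(): invalidating or non-async requests
 * use the blocking lookup; pages that will be freed or invalidated are
 * locked exclusively.
 */
static struct sk_lookup
sk_lookup_mode(int flags)
{
	struct sk_lookup m;

	if ((flags & SK_B_INVAL) || (flags & SK_B_ASYNC) == 0) {
		m.se = (flags & (SK_B_FREE | SK_B_INVAL)) ?
		    SK_SE_EXCL : SK_SE_SHARED;
		m.synchronous = true;
	} else {
		m.se = (flags & SK_B_FREE) ? SK_SE_EXCL : SK_SE_SHARED;
		m.synchronous = false;
	}
	return (m);
}
```

In this model a plain asynchronous write-back takes a shared lock and uses the non-blocking lookup, whereas an asynchronous request that also invalidates is still treated as synchronous and takes the page exclusively.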

Actual code:

/*
 * Flags are composed of {B_INVAL, B_FREE, B_DONTNEED, B_FORCE}
 * If len == 0, do from off to EOF.
 *
 * The normal cases should be len == 0 & off == 0 (entire vp list),
 * len == MAXBSIZE (from segmap_release actions), and len == PAGESIZE
 * (from pageout).
 *
 */
/*ARGSUSED*/
static int
fat_putpage(
	struct vnode *vp,
	offset_t off,
	size_t len,
	int flags,
	struct cred *cr)
{
	struct fatnode *fip = VTOF(vp);
	struct fatfs *fsp = VFSTOFATFS(vp->v_vfsp);
	page_t *pp;
	int err = 0;
	u_offset_t io_off;
	size_t io_len;
	se_t se;
	int synchronous;

	if (vp->v_flag & VNOMAP)
		return (ENOSYS);

	FAT_ENTER(fsp, FAT_ENTER_SHARED);

	ASSERT(off <= FAT_MAXOFFSET_T);
	ASSERT((off & PAGEOFFSET) == 0);

	/*
	 * An attempt to "flush" data when none is cached, or an attempt
	 * to write data beyond the end of the file, succeeds immediately -
	 * there's nothing for the filesystem to do.
	 */
	if (!vn_has_cached_data(vp) || off >= fip->f_size) {
		FAT_EXIT(fsp);
		return (0);
	}

	if (len == 0) {
		/*
		 * Search the entire vp list for pages >= off
		 */
		err = pvn_vplist_dirty(vp, off, fat_putapage, flags, cr);
		FAT_EXIT(fsp);
		return (err);
	}

	/*
	 * If we are not invalidating, synchronously freeing or writing pages,
	 * use the routine page_lookup_nowait() to prevent reclaiming them from
	 * the free list.
	 */
	if ((flags & B_INVAL) || ((flags & B_ASYNC) == 0)) {
		se = (flags & (B_FREE | B_INVAL)) ? SE_EXCL : SE_SHARED;
		synchronous = 1;
	} else {
		se = (flags & B_FREE) ? SE_EXCL : SE_SHARED;
		synchronous = 0;
	}

	io_off = off;

	while (err == 0 && io_off < MIN(off + len, fip->f_size)) {
		if (synchronous)
			pp = page_lookup(vp, io_off, se);
		else
			pp = page_lookup_nowait(vp, io_off, se);

		/*
		 * Skip just the found page by default. But if it is dirty,
	 * give fat_putapage() the ability to cluster multiple consecutive
		 * pages, and adjust io_len accordingly.
		 */
		io_len = PAGESIZE;

		if (pp && pvn_getdirty(pp, flags))
			err = fat_putapage(vp, pp, &io_off, &io_len, flags, cr);

		io_off += io_len;
	}

	FAT_EXIT(fsp);
	return (err);
}
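
The loop at the end of fat_putpage() relies on the callback being allowed to change io_off and io_len. The following userspace sketch models only that part. Page lookup and the dirty check are left out, and the names sk_putpage_walk(), sk_putapage_fn and sk_cluster2(), as well as the 4096-byte page size, are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SK_PAGESIZE	4096u			/* assumed page size */
#define	SK_MIN(a, b)	((a) < (b) ? (a) : (b))

/*
 * Hypothetical stand-in for fat_putapage(): it may rewrite *io_off and
 * *io_len to describe the (possibly larger) range it actually wrote.
 */
typedef int (*sk_putapage_fn)(uint64_t *io_off, size_t *io_len, void *arg);

/*
 * Walk [off, MIN(off + len, f_size)) and let the callback decide how
 * far each step advances.
 */
static int
sk_putpage_walk(sk_putapage_fn putapage, uint64_t off, size_t len,
    uint64_t f_size, void *arg)
{
	uint64_t io_off = off;
	uint64_t end = SK_MIN(off + len, f_size);
	int err = 0;

	while (err == 0 && io_off < end) {
		size_t io_len = SK_PAGESIZE;	/* default: skip one page */

		err = putapage(&io_off, &io_len, arg);
		io_off += io_len;
	}
	return (err);
}

/* Example callback: pretends to cluster two pages per call. */
static int
sk_cluster2(uint64_t *io_off, size_t *io_len, void *arg)
{
	unsigned *calls = arg;

	(void) io_off;
	*io_len = 2 * SK_PAGESIZE;
	(*calls)++;
	return (0);
}
```

With a callback that clusters two pages at a time, an eight-page range is covered in four calls. The walk is also clamped to the file size, so nothing beyond EOF is handed to the callback.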

to be continued...