Writing Filesystems - Userdata I/O
From Genunix
Note: This article has been identified as a draft and is currently undergoing a community review. Do not quote any text on this page; it is still a draft.
UNIX knows two different methods for performing userdata I/O operations:
- via the read(2) and write(2) system calls and their variants
- using memory-mapped file access, mmap(2).
Solaris filesystems perform the actual I/O to the block device in a codepath common to both, and that has consequences for what a filesystem implementation must look like. This section shows how userdata I/O works, so that the implementation of VOP_READ() and VOP_WRITE() that follows becomes easy to understand.
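Both access methods operate on the same file data, and this can be checked from user space. The following is a minimal sketch, not part of the original article: the function name `mapped_store_visible_to_read()` and the scratch file location under /tmp are assumptions made for illustration. It stores data through a shared mapping, calls msync() so that the result does not depend on how a particular system caches pages, and then reads the data back with pread(2).

```c
#define _XOPEN_SOURCE 700   /* for mkstemp(), pread(), ftruncate() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Store `len` bytes of `fill` into a scratch file through a MAP_SHARED
 * mapping, then read them back with pread(2).
 * Returns 0 if pread() sees the mapped stores, 1 if it does not,
 * and -1 if one of the calls failed.
 */
int
mapped_store_visible_to_read(size_t len, char fill)
{
    char path[] = "/tmp/iodemoXXXXXX";  /* hypothetical scratch file */
    char *map, *buf = NULL;
    int fd, rc = -1;

    assert(len > 0);                    /* mmap() rejects empty mappings */

    if ((fd = mkstemp(path)) < 0)
        return (-1);
    (void) unlink(path);                /* file lives on until close() */

    if ((buf = malloc(len)) == NULL || ftruncate(fd, (off_t)len) != 0)
        goto out;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    memset(map, fill, len);             /* store through the mapping */
    if (msync(map, len, MS_SYNC) == 0 &&
        pread(fd, buf, len, 0) == (ssize_t)len) {
        /* compare what the system call path returned */
        rc = (memcmp(buf, map, len) == 0) ? 0 : 1;
    }
    (void) munmap(map, len);
out:
    free(buf);
    (void) close(fd);
    return (rc);
}
```

Calling `mapped_store_visible_to_read(4096, 'A')` from a small main() is enough to try it. The sketch only demonstrates the user-visible behaviour; it says nothing about how a given filesystem implements it.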
Part 1 - mmap-based I/O
Before we look at VOP_READ() and VOP_WRITE(), though, let's see how mmap-based I/O works and which vnode operations a filesystem must implement to support it. This is easy to observe with DTrace. Try the following C program and the associated D script:
mmap-based I/O demonstration program:

    /*
     * mmaptest.c
     * A simple program to demonstrate mmap-based I/O
     */
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/param.h>      /* PAGESIZE */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <errno.h>

    int main(int argc, char **argv)
    {
        int fd;
        char localbuf[PAGESIZE];
        char *mapbase;

        if (argc < 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return (1);
        }
        if ((fd = open(argv[1], O_RDWR)) < 0) {
            perror("open failed");
            return (1);
        }
        mapbase = mmap(NULL, PAGESIZE,
            PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        /* mmap() reports failure as MAP_FAILED, not NULL */
        if (mapbase == MAP_FAILED) {
            perror("mmap failed");
            close(fd);
            return (1);
        }
        close(fd);          /* the mapping stays valid after close() */
        memcpy(localbuf, mapbase, PAGESIZE);    /* first access to the page */
        sleep(5);
        memset(mapbase, 'A', PAGESIZE);         /* modify the mapped page */
        sleep(5);
        msync(mapbase, PAGESIZE, MS_SYNC);      /* force the write-back */
        sleep(5);
        munmap(mapbase, PAGESIZE);
        return (0);
    }

D script to find filesystem mmap backends:

    #!/usr/sbin/dtrace -s

    /* Mark threads of the test program while they are in a syscall or trap. */
    syscall:::entry, fbt::trap:entry
    /execname == "mmaptest"/
    {
        self->t = 1;
    }

    syscall:::return, fbt::trap:return
    /self->t/
    {
        self->t = 0;
    }

    /* A vnode operation wrapper (fop_*) was entered from a marked thread. */
    fbt::fop_*:entry
    /self->t/
    {
        self->t = 2;
    }

    /* First pcfs function entered after the wrapper: print both stacks. */
    fbt:pcfs::entry
    /self->t == 2/
    {
        stack();
        ustack();
        self->t = 1;
    }
Running this, or modifying it to work on other filesystem types (the D script as shown traces only pcfs), is left as an exercise to the reader. In any case, the important steps are:
C source:
    mapbase = mmap(NULL, PAGESIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
Backend:
    1 42626 pcfs_map:entry
        genunix`fop_map+0x50
        genunix`smmap_common+0x257
        genunix`smmap32+0xaa
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`mmap+0x7
        mmaptest`0x80508a2
    1 42630 pcfs_addmap:entry
        genunix`fop_addmap+0x5c
        genunix`segvn_create+0x2b7
        genunix`as_map_locked+0x1a9
        genunix`as_map+0x5a
        pcfs`pcfs_map+0x13e
        genunix`fop_map+0x50
        genunix`smmap_common+0x257
        genunix`smmap32+0xaa
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`mmap+0x7
        mmaptest`0x80508a2

C source:
    memcpy(localbuf, mapbase, PAGESIZE);
Backend:
    1 42622 pcfs_getpage:entry
        genunix`fop_getpage+0x52
        genunix`segvn_fault+0xdde
        genunix`as_fault+0x61d
        unix`pagefault+0xad
        unix`trap+0xecc
        unix`_cmntrap+0x201
        libc.so.1`memcpy+0xff
        mmaptest`0x80508a2

C source:
    memset(mapbase, 'A', PAGESIZE);
Backend:
    This does not show up!

C source:
    msync(mapbase, PAGESIZE, MS_SYNC);
Backend:
    0 42624 pcfs_putpage:entry
        genunix`fop_putpage+0x3a
        genunix`segvn_sync+0x104
        genunix`as_ctl+0x204
        genunix`memcntl+0x77a
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`memcntl+0x7
        libc.so.1`msync+0x97
        mmaptest`main+0x12e
        mmaptest`0x80508a2

C source:
    munmap(mapbase, PAGESIZE);
Backend:
    0 42632 pcfs_delmap:entry
        genunix`fop_delmap+0x5b
        genunix`segvn_unmap+0x11c
        genunix`as_unmap+0x11e
        genunix`munmap+0x92
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`munmap+0x7
        mmaptest`0x80508a2
    [ ... ]
    0 42624 pcfs_putpage:entry
        genunix`fop_putpage+0x3a
        pcfs`syncpcp+0x43
        pcfs`pc_rele+0x9d
        pcfs`pcfs_inactive+0x7d
        genunix`fop_inactive+0x93
        genunix`vn_rele+0x66
        genunix`segvn_free+0x1f9
        genunix`seg_free+0x40
        genunix`segvn_unmap+0x8e8
        genunix`as_unmap+0x11e
        genunix`munmap+0x92
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`munmap+0x7
        mmaptest`0x80508a2
From this we see that mmap-based I/O uses the following vnode ops, in sequence:
- When a mapping is created, the framework calls VOP_MAP() to pass the request to the filesystem.
- The filesystem's implementation of VOP_MAP() calls as_map() to create a VM segment.
- On completion of that task, the VM framework calls VOP_ADDMAP() to notify the filesystem that the mapping is now 'active'.
- The first pagefault on the new segment, whether caused by a load or a store, requires the data backing the mapping to be brought in from the file on disk. The segment driver handling the fault calls VOP_GETPAGE() to ask the filesystem to do this.
- Further accesses, whether read or write, cause no more calls into the filesystem code until the modified data needs to be synchronized back to disk. This is why the memset() above does not show up in the trace. The synchronization can be triggered by an explicit call to msync(), or by a delayed writeback from fsflush, the kernel daemon that periodically writes dirty pages back to disk, or from the pageout daemon. Such a request makes the framework call VOP_PUTPAGE() in the filesystem.
- Removing the mapping results in the segment driver calling VOP_DELMAP().
One definite conclusion from this is that the filesystem must perform the actual I/O operations in VOP_GETPAGE() and VOP_PUTPAGE() in order to support mmap-based I/O as described above. We will soon see what this code actually looks like.
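The sequence of calls listed above can be sketched as a small user-space toy model. Nothing in it is Solaris kernel code: the `toy_*` names, the two structures and the 16-byte "page" are assumptions made purely for illustration, and the real vnode operations take quite different arguments. The model only counts how often each operation would be invoked.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGESIZE 16

/* Hypothetical stand-in for a vnode: "disk" contents plus call counters. */
struct toy_vnode {
    char disk[TOY_PAGESIZE];
    int nmap, naddmap, ngetpage, nputpage, ndelmap;
};

/* Hypothetical stand-in for a VM segment mapping one page of the file. */
struct toy_seg {
    struct toy_vnode *vp;
    char page[TOY_PAGESIZE];
    int resident;               /* page was brought in by toy_getpage() */
    int dirty;                  /* page was modified since the last sync */
};

static void toy_getpage(struct toy_vnode *vp, char *page)
{
    vp->ngetpage++;
    memcpy(page, vp->disk, TOY_PAGESIZE);       /* "read from disk" */
}

static void toy_putpage(struct toy_vnode *vp, const char *page)
{
    vp->nputpage++;
    memcpy(vp->disk, page, TOY_PAGESIZE);       /* "write to disk" */
}

/* mmap(): the map operation creates the segment, then addmap is notified. */
void toy_map(struct toy_vnode *vp, struct toy_seg *seg)
{
    vp->nmap++;
    seg->vp = vp;               /* plays the role of as_map() */
    seg->resident = 0;
    seg->dirty = 0;
    vp->naddmap++;
}

/* A load or store: only the first access faults and calls toy_getpage(). */
char *toy_access(struct toy_seg *seg, int is_store)
{
    assert(seg->vp != NULL);
    if (!seg->resident) {
        toy_getpage(seg->vp, seg->page);
        seg->resident = 1;
    }
    if (is_store)
        seg->dirty = 1;
    return (seg->page);
}

/* msync() or a delayed writeback: dirty data goes out via toy_putpage(). */
void toy_sync(struct toy_seg *seg)
{
    if (seg->resident && seg->dirty) {
        toy_putpage(seg->vp, seg->page);
        seg->dirty = 0;
    }
}

/* munmap(): flush what is still dirty, then report delmap. */
void toy_unmap(struct toy_seg *seg)
{
    toy_sync(seg);
    seg->vp->ndelmap++;
    seg->vp = NULL;
}
```

Mapping a toy vnode, reading from it, overwriting the page and syncing it yields one getpage call and one putpage call, no matter how many accesses happen in between. Unlike the pcfs trace above, the toy model does not issue a final putpage once the page is clean; that is a simplification of the model.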
Part 2 - I/O via system calls
But first, something that might be a little surprising. Let's change the C program above to perform the same sequence of I/O operations, but use "normal" system calls instead of mmap(). The DTrace script barely changes, but our C source will now look like this:
System call I/O demonstration program:

    /*
     * readwritetest.c
     * A simple program to demonstrate
     * system call-based I/O
     */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/param.h>      /* PAGESIZE */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <errno.h>

    int main(int argc, char **argv)
    {
        int fd;
        char localbuf[PAGESIZE];

        if (argc < 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return (1);
        }
        if ((fd = open(argv[1], O_RDWR)) < 0) {
            perror("open failed");
            return (1);
        }
        (void)read(fd, localbuf, PAGESIZE);
        sleep(5);
        memset(localbuf, 'A', PAGESIZE);
        /* write() continues at the offset where read() stopped */
        (void)write(fd, localbuf, PAGESIZE);
        sleep(5);
        fsync(fd);              /* force the write-back */
        sleep(5);
        close(fd);
        return (0);
    }

D script to find filesystem system call backends:

    #!/usr/sbin/dtrace -s

    /* Mark threads of the test program while they are in a syscall. */
    syscall:::entry
    /execname == "readwritetest"/
    {
        self->t = 1;
    }

    syscall:::return
    /self->t/
    {
        self->t = 0;
    }

    /* A vnode operation wrapper (fop_*) was entered from a marked thread. */
    fbt::fop_*:entry
    /self->t/
    {
        self->t = 2;
    }

    /* First pcfs function entered after the wrapper: print both stacks. */
    fbt:pcfs::entry
    /self->t == 2/
    {
        stack();
        ustack();
        self->t = 1;
    }
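Running this tells us how system call-based I/O works. We see output like the following. The user stacks name the binary rwt, so in that run the execname predicate of the D script had to match the name of the binary actually executed.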
C source:
    (void)read(fd, localbuf, PAGESIZE);
Backend:
    1 42590 pcfs_read:entry
        genunix`fop_read+0x43
        genunix`read+0x2a4
        genunix`read32+0x20
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`_read+0x7
        rwt`main+0x84
        rwt`0x8050862
    1 42622 pcfs_getpage:entry
        genunix`fop_getpage+0x52
        genunix`segmap_fault+0x241
        genunix`as_fault+0x61d
        unix`pagefault+0x226
        unix`trap+0x1596
        unix`_cmntrap+0x201
        unix`kcopy+0x4b
        genunix`uiomove+0x17f
        pcfs`rwpcp+0x4ff
        pcfs`pcfs_read+0x77
        genunix`fop_read+0x43
        genunix`read+0x2a4
        genunix`read32+0x20
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff

C source:
    (void)write(fd, localbuf, PAGESIZE);
Backend:
    1 42594 pcfs_write:entry
        genunix`fop_write+0x43
        genunix`write+0x21d
        genunix`write32+0x20
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`_write+0x7
        rwt`main+0xc2
        rwt`0x8050862
    1 42622 pcfs_getpage:entry
        genunix`fop_getpage+0x52
        genunix`segmap_fault+0x241
        genunix`as_fault+0x61d
        unix`pagefault+0x226
        unix`trap+0x1596
        unix`_cmntrap+0x201
        unix`do_copy_fault_nta+0x35
        genunix`uiomove+0xc8
        pcfs`rwpcp+0x46d
        pcfs`pcfs_write+0x91
        genunix`fop_write+0x43
        genunix`write+0x21d
        genunix`write32+0x20
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`_write+0x7
        rwt`main+0xc2
        rwt`0x8050862
    1 42624 pcfs_putpage:entry
        genunix`fop_putpage+0x3a
        genunix`segmap_release+0x381
        pcfs`rwpcp+0x546
        pcfs`pcfs_write+0x91
        genunix`fop_write+0x43
        genunix`write+0x21d
        genunix`write32+0x20
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`_write+0x7
        rwt`main+0xc2
        rwt`0x8050862

C source:
    fsync(fd);
Backend:
    1 42602 pcfs_fsync:entry
        genunix`fop_fsync+0x31
        genunix`fdsync+0x3b
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`__fdsync+0x7
        libc.so.1`fsync+0x8b
        rwt`main+0xd8
        rwt`0x8050862
    1 42624 pcfs_putpage:entry
        genunix`fop_putpage+0x3a
        pcfs`syncpcp+0x43
        pcfs`pc_nodesync+0x41
        pcfs`pcfs_fsync+0x70
        genunix`fop_fsync+0x31
        genunix`fdsync+0x3b
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`__fdsync+0x7
        libc.so.1`fsync+0x8b
        rwt`main+0xd8
        rwt`0x8050862

C source:
    close(fd);
Backend:
    1 42588 pcfs_close:entry
        genunix`fop_close+0x42
        genunix`closef+0xa1
        genunix`closeandsetf+0x45d
        genunix`close+0x16
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`_close+0x7
        rwt`main+0xee
        rwt`0x8050862
    1 42604 pcfs_inactive:entry
        genunix`fop_inactive+0x93
        genunix`vn_rele+0x66
        genunix`closef+0xc9
        genunix`closeandsetf+0x45d
        genunix`close+0x16
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`_close+0x7
        rwt`main+0xee
        rwt`0x8050862
    1 42624 pcfs_putpage:entry
        genunix`fop_putpage+0x3a
        pcfs`syncpcp+0x43
        pcfs`pc_rele+0x9d
        pcfs`pcfs_inactive+0x7d
        genunix`fop_inactive+0x93
        genunix`vn_rele+0x66
        genunix`closef+0xc9
        genunix`closeandsetf+0x45d
        genunix`close+0x16
        genunix`dtrace_systrace_syscall32+0x11f
        unix`sys_syscall32+0x1ff
        libc.so.1`_close+0x7
        rwt`main+0xee
        rwt`0x8050862
These codepaths clearly show that the implementations of VOP_READ() and VOP_WRITE() do not perform I/O operations themselves. Instead, they use a specific segment driver, segmap, to create temporary VM mappings, and then delegate the actual I/O request to VOP_GETPAGE() and VOP_PUTPAGE(). This happens either by faulting on the mapping (the pagefault taken inside uiomove() in the traces above) or through dedicated calls into segmap functions such as segmap_release().
Phew, that was long. Why do it like this? There are two reasons. The first, avoiding duplicated code, is obvious but on its own might not justify the strange segmap effort. The second is compelling: we want to put userdata into the system's page cache, and it must not matter which codepath populates that cache. We must find the same data there whether we use VOP_READ() or VOP_GETPAGE(). Page cache management therefore more or less forces a common backend for mmap-based and system call-based I/O, and segmap provides it.
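The argument for a common backend can be illustrated with another user-space toy model. As before, this is only a sketch under stated assumptions: `struct toy_file`, the `toy_*` functions and the 16-byte "page" are hypothetical and do not correspond to segmap's real interfaces. The read and write paths and the mapping path all go through the same two routines, so each of them sees the data that the others put into the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGESIZE 16

/* Hypothetical one-page file: backing store, cached page and counters. */
struct toy_file {
    char disk[TOY_PAGESIZE];    /* stands in for the block device */
    char page[TOY_PAGESIZE];    /* stands in for the page cache */
    int cached;                 /* page[] holds valid data */
    int ndevread, ndevwrite;    /* how often the "device" was touched */
};

/* The only routine that reads the device (compare VOP_GETPAGE()). */
static char *toy_getpage(struct toy_file *fp)
{
    if (!fp->cached) {
        memcpy(fp->page, fp->disk, TOY_PAGESIZE);
        fp->cached = 1;
        fp->ndevread++;
    }
    return (fp->page);
}

/* The only routine that writes the device (compare VOP_PUTPAGE()). */
static void toy_putpage(struct toy_file *fp)
{
    assert(fp->cached);
    memcpy(fp->disk, fp->page, TOY_PAGESIZE);
    fp->ndevwrite++;
}

/* read(2) path: get the cached page, then copy out (as uiomove() would). */
size_t toy_read(struct toy_file *fp, char *buf, size_t len)
{
    if (len > TOY_PAGESIZE)
        len = TOY_PAGESIZE;
    memcpy(buf, toy_getpage(fp), len);
    return (len);
}

/* write(2) path: get the cached page, copy in, then write it back. */
size_t toy_write(struct toy_file *fp, const char *buf, size_t len)
{
    if (len > TOY_PAGESIZE)
        len = TOY_PAGESIZE;
    memcpy(toy_getpage(fp), buf, len);
    toy_putpage(fp);
    return (len);
}

/* mmap path: a mapping hands out the very same cached page. */
char *toy_mapped_page(struct toy_file *fp)
{
    return (toy_getpage(fp));
}
```

With a single cached copy there is nothing to keep consistent: data written with toy_write() is immediately visible through the pointer returned by toy_mapped_page(), and a store through that pointer is returned by the next toy_read(), while the "device" is read only once.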
