Writing Linux Device Drivers

Michael K. Johnson
Developer, Red Hat Software
johnsonm@redhat.com

From a talk given at Spring DECUS '95 in Washington, DC.

Copyright (C) 1995 Michael K. Johnson
_________________________________________________________________

Introduction

Naturally, there is far more about writing Linux device drivers than can be covered in 50 minutes. Fortunately, I am not enough of an expert to get bogged down in details, so you stand a chance of getting a helpful overview.

There is some documentation available on writing device drivers for Linux; my own Linux Kernel Hackers' Guide (the KHG) is the main source for beginners. However, the details change from time to time as Linux matures, and many other details simply are not documented yet. This means that I can give you a skeleton for your driver, and give you some advice, but writing the driver may be a little bit of an adventure.

If there is something that you need to do that isn't covered in this introductory tutorial, and which has been overlooked in the KHG, the next option is to look through other device drivers to see how they handle the problem. Chances are good that you are not the first person to encounter that particular problem. It is also likely that if you put some time into looking around and can't figure out what to do, you can find help on the linux-kernel mailing list or on the comp.os.linux.development.system Usenet group.

Overview

Linux is a clone of Unix. As in all versions of Unix, hardware devices are presented to normal programs as "special" files. Therefore, devices implement file semantics within the kernel. Because of this, it is worth taking a short look at how files in general are treated in Linux before attempting to understand how device drivers are written.

Files

The generic filesystems header file, , defines several structures for accessing files.
super_block holds basic information about each filesystem, and super_operations is a structure of pointers to functions which are associated with a filesystem's superblock. Through that structure are reached inode_operations and file_operations, the last defining functions that can be used to access files. In normal filesystems, there is one set of file operations for all files in the filesystem, but they do not attempt to define any operations on device special files. Instead, those devices define their own file operations functions and register their own file_operations structure with the VFS.

The VFS

The VFS is the common abbreviation for the Virtual Filesystem Switch. Generic filesystem operations are handled by generic filesystem code, and only when filesystem-dependent or device-dependent operations need to be done is the code for that specific filesystem or device actually called. The function needed is looked up in the proper instance of one of the *_operations structures and called. The VFS code is kept in the fs/ subdirectory of the Linux kernel source, and the code to the individual filesystems is kept in subdirectories of the fs/ subdirectory.

Operations

What I mean by "operations" may not be very clear at this point. An operation is something that needs to be done as a result of a system call, or buffer cache activity, or because of hardware irregularities. Nearly all operations are caused directly or indirectly by system calls, and so you can think of the VFS as code that translates raw system calls into filesystem operations.

Special files and filesystems

All versions of Unix have device special files. This concept has even been picked up by such primitive operating systems as Microsoft's DOS. However, the VFS is so flexible that not only can special files be created, special filesystems can too. Linux has a filesystem called the proc filesystem, or "procfs", which is essentially a special filesystem.
The files in this filesystem are not stored on disk; they are instead generated on-the-fly from kernel data structures. (No, I'm not way off topic.) These files are very similar to hardware devices, because they generate files from non-file data and present it to the user in the shape of a file. They are, you could say, virtual devices designed to report on the state of the kernel.

File Operations

Knowing this, it should not be surprising that devices "export" their functionality to the VFS by registering a file_operations structure with the VFS. We will see exactly how this is done later. Here's the file_operations structure:

    struct file_operations {
            int (*lseek) (struct inode *, struct file *, off_t, int);
            int (*read) (struct inode *, struct file *, char *, int);
            int (*write) (struct inode *, struct file *, char *, int);
            int (*readdir) (struct inode *, struct file *, struct dirent *, int);
            int (*select) (struct inode *, struct file *, int, select_table *);
            int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
            int (*mmap) (struct inode *, struct file *, struct vm_area_struct *);
            int (*open) (struct inode *, struct file *);
            void (*release) (struct inode *, struct file *);
            int (*fsync) (struct inode *, struct file *);
            int (*fasync) (struct inode *, struct file *, int);
            int (*check_media_change) (dev_t dev);
            int (*revalidate) (dev_t dev);
    };

Some of the names of these function pointers should look suspiciously like system calls with which you are familiar. lseek(), read(), write(), readdir(), select(), ioctl(), mmap(), open(), and fsync() all are called directly or indirectly by the system calls of the same name. release() is called on close() and when a file is closed by a process exiting, or calling exec() when close-on-exec is set on the file. check_media_change() and revalidate() are not really file operations; they are device operations, as can be seen by their arguments.
fasync() is a bit unusual; it is called when fcntl(fd, F_SETFL, FASYNC) (or ~FASYNC) is called; devices implementing this need to be aware when this change is made.

The functions are provided with sensible defaults; most of the time, more than half of the functions are set to NULL because the VFS does the right thing without having to call the driver. You will see this in the skeleton driver presented.

Data Structures

The Linux kernel is monolithic. Only one thread of control created by a system call is active at any time. This means that device drivers do not need to lock their data structures as a general rule. The exception is that interrupt handling routines can run at any time, and data structures that are shared with interrupt handlers do need to be protected.
_________________________________________________________________

The Kernel Interface

By far the easiest way to develop almost any Linux device driver is as a run-time loadable kernel module. Written correctly, these modules can easily be used in their normal form as loadable modules, or re-compiled and linked with the rest of the kernel.

The symbol MODULE is defined whenever a module is being compiled. The symbol __KERNEL__ is always defined when compiling kernel code, even when compiling modules; it is used in the kernel include files so that only kernel code includes certain kernel-specific definitions. This is necessary because the kernel include files are used as part of the standard include file hierarchy.

I will start by presenting a very simple character device driver which implements a simple form of /dev/zero. Note that it does not deal with the memory-management uses of /dev/zero with which some of you may be familiar; this is intended to be a simple example that sends me on as few tangents as possible. All it does is allow writing of any values and reading all zero values.
The code to do reading and writing takes 13 lines total; the rest of this file is a skeleton that anyone writing any device driver will find useful.

    /* Compile with "gcc -O -DMODULE -D__KERNEL__ -c zero.c" */

All Linux kernel code should be compiled with optimization because it requires certain gcc extensions that are only activated with optimization turned on.

    #include

All device drivers should include before including any other file.

    #ifdef MODULE
    #include
    #include
    #else
    #define MOD_INC_USE_COUNT
    #define MOD_DEC_USE_COUNT
    #endif

Kernel symbols that are exported to modules have their names "mangled" in a way similar to C++, so that changes in kernel structures will be noticed. This causes the module not to be loaded, because if a kernel structure is changed, loading an old module that has been compiled with a different version of the structure can damage system integrity. MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT are documented later.

    #include
    #include
    #include /* for verify_area */
    #include /* for -EBUSY */
    #include /* for put_user_byte */

All of these are included by nearly every device driver. Real device drivers will of course include other include files as well.

    static int zero_major;

All modules that dynamically allocate their major number, as this one does (that comes later), need to store their major number somewhere. For this simple driver, I use a static int. In a more complex driver that saves a lot of state, this might be part of a static structure.

    static int read_zero(struct inode * node, struct file * file, char * buf, int count)
    {
            int left;

            if (verify_area(VERIFY_WRITE, buf, count) == -EFAULT)
                    return -EFAULT;
            for (left = count; left > 0; left--) {
                    put_user_byte(0, buf);
                    buf++;
            }
            return count;
    }

Whenever the read() system call is called on this device, read_zero() is called. The put_user_byte() function puts a byte into user memory.
It is not as easy as saying *buf++ = '\000'; because the execution stream is in kernel memory space, and the pointer buf points to user-space memory. read_zero() is expected to return the number of bytes actually written into the read buffer.

Before actually writing the zeros into the buffer provided, we verify that the entire buffer is legal to write in, using a verify_area() call. This prevents us from generating kernel-space faults from the reading process if the reading process passes in a pointer to non-existent memory or a count that makes part of the buffer lie in non-existent memory.

    static int write_zero(struct inode * inode, struct file * file, char * buf, int count)
    {
            return count;
    }

This function is called whenever the write() system call is called on this device. This function is expected to return the number of bytes written. write_zero() ignores its input and returns a successfully completed write operation.

    static int lseek_zero(struct inode * inode, struct file * file, off_t offset, int orig)
    {
            return file->f_pos=0;
    }

The semantics of /dev/zero are unusual. Normally, the lseek function would check to make sure that the seek was in-bounds for the device and set file->f_pos = offset, or return an appropriate error. The reason that we implement this function at all is because lseek would otherwise fail if someone tried to open the device with STDIO in append ("a") mode.

    static int open_zero(struct inode *inode, struct file * file)
    {
            MOD_INC_USE_COUNT;
            return 0;
    }

    static void release_zero(struct inode *inode, struct file * file)
    {
            MOD_DEC_USE_COUNT;
    }

These are included here only because this is a loadable module; there is nothing that needs to be allocated or deallocated when this device is opened or closed. However, if the module is in use, we would like to forbid removing the module, and this allows us to keep track of whether the module is in use or not.
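
The use-count bookkeeping can be illustrated outside the kernel. What follows is a minimal user-space sketch, not kernel code: model_use_count, model_open(), model_release(), and model_remove() are hypothetical names standing in for MOD_INC_USE_COUNT, MOD_DEC_USE_COUNT, and a request to remove the module, and returning -EBUSY for a refused removal is an assumption made for illustration.

```c
#include <errno.h>

/* Hypothetical stand-in for the module's use count. */
static int model_use_count = 0;

/* Plays the role of open_zero(): one more user of the module. */
static int model_open(void)
{
        model_use_count++;      /* models MOD_INC_USE_COUNT */
        return 0;
}

/* Plays the role of release_zero(): one user fewer. */
static void model_release(void)
{
        model_use_count--;      /* models MOD_DEC_USE_COUNT */
}

/* Models a removal request: refused while the device is still open. */
static int model_remove(void)
{
        return model_use_count > 0 ? -EBUSY : 0;
}
```

In this model, a removal attempted between model_open() and the matching model_release() is refused, which is the behaviour the open and release functions above exist to make possible.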

    static struct file_operations zero_fops = {
            lseek_zero,
            read_zero,
            write_zero,
            NULL,           /* no special zero_readdir */
            NULL,           /* no special zero_select */
            NULL,           /* no special zero_ioctl */
            NULL,           /* no special zero_mmap */
            open_zero,
            release_zero,
            NULL,           /* no special fsync */
            NULL,           /* no special fasync */
            NULL,           /* no special check_media_change */
            NULL,           /* no special revalidate */
    };

This is the file_operations structure being filled in to be passed back to the VFS.

    #ifndef MODULE
    long zero_init(long mem_start, long mem_end)
    {
            /* register_chrdev() returns the allocated major on success */
            if ((zero_major = register_chrdev(0, "zero", &zero_fops)) == -EBUSY)
                    printk("unable to get major for zero device\n");
            return mem_start;
    }

If this device is being compiled directly into the kernel, this initialization function will need to be called from somewhere when the kernel is booting. Most character devices are initialized from mem_init() in mem.c. It is allowed to allocate a static block of memory for itself by keeping a pointer to mem_start, adding the amount of memory needed to mem_start, and returning a new pointer. It should not return a pointer greater than mem_end.

N.B.: Allocating memory this way is deprecated, since it makes writing a loadable device driver harder. This is vestigial functionality from when the Linux kernel malloc() was not able to allocate more than 4096 bytes at once.

    #else
    int init_module(void)
    {
            if ((zero_major = register_chrdev(0, "zero", &zero_fops)) == -EBUSY) {
                    printk("unable to get major for zero device\n");
                    return -EIO;
            }
            return 0;
    }

This is a function equivalent to zero_init() which is called when the module is loaded into a running system. Note that it takes zero arguments instead of two. This is because memory management setup has been completed and there is no nice large chunk of memory to grab from. This function should actually be very similar to zero_init(), since grabbing memory is deprecated. They can't be the same because they have to return different values.
However, zero_init() could conceivably be written to call init_module() and init_module() could be included whether or not the driver was being compiled as a module. Choose what looks the cleanest for your own device.

Note that the register_chrdev() function is called with the first argument 0. The first argument can either be a requested major number, in which case the function returns failure (-EBUSY) if that major number is already allocated, or it can be 0, in which case the first available major number greater than 64 (see MAX_BLKDEV and MAX_CHRDEV in ) is allocated and returned, or if all possible slots are taken up, -EBUSY is returned.

    void cleanup_module(void)
    {
            unregister_chrdev(zero_major, "zero");
    }
    #endif

This function is called when a request is made to remove the module from the kernel.
_________________________________________________________________

Character vs. Block Devices

The main difference between character-mode and block-mode devices in the Linux kernel is the way that requests to transfer data are made. As you can see in the device above, character devices have read and write functions that are called directly whenever I/O is required. Block devices generally don't have read or write functions. Instead, they implement a "strategy routine" or "request function" which is called not directly by system calls, but indirectly by the buffer cache.

If a program requests data from a file, the particular filesystem which holds that data determines what block the data is on and requests that block from the buffer cache. That block might be cached, in which case the request is satisfied by the buffer cache, or it might not, in which case the buffer cache creates a request for the device driver to fetch the block from disk and store the data in a buffer. Of course, while finding that data block, the filesystem might have to read other blocks containing directory entries and inodes.
When it needs to read those blocks, it requests them from the buffer cache in the same way it requests any other data block. The buffer cache does not need to know what the blocks are used for.

This indirect approach speeds up disk access considerably, and ends up simplifying some things, as crazy as that sounds. For one thing, the block device driver has very little interaction with user programs; calls to ioctl() are likely to be the most common direct interaction. This means that block device drivers don't have to be as suspicious of their input as character device drivers. By the time a request for a data block has made it to the strategy routine, it has been pretty well checked to make sure it is valid. There is no question of writing to user-space memory and there is no chance that a buggy user-level program passed in bad arguments that need to be checked.

This is fortunate, because other facets of block device drivers are more complicated. There is more infrastructure to be set up than for a simple character device driver. Especially with interrupt-driven block devices, there are opportunities for race conditions that need to be watched out for.

Here is a simple example of what a non-interrupt-driven request function would look like.

    static void do_foo_request(void)
    {
    repeat:
            INIT_REQUEST;

            /* check to make sure that the request is for a valid
               physical device */
            if (!valid_foo_device(CURRENT->dev)) {
                    end_request(0);
                    goto repeat;
            }

            if (CURRENT->cmd == WRITE) {
                    if (foo_write(CURRENT->sector, CURRENT->buffer,
                                  CURRENT->nr_sectors << 9)) {
                            /* successful write */
                            end_request(1);
                            goto repeat;
                    } else {
                            end_request(0);
                            goto repeat;
                    }
            }

            if (CURRENT->cmd == READ) {
                    if (foo_read(CURRENT->sector, CURRENT->buffer,
                                 CURRENT->nr_sectors << 9)) {
                            /* successful read */
                            end_request(1);
                            goto repeat;
                    } else {
                            end_request(0);
                            goto repeat;
                    }
            }
    }

If this looks needlessly complex to you, realize that non-interrupt-driven device drivers do not take full advantage of the infrastructure.
Interrupt-driven drivers, by contrast, only start things going, and then return without calling end_request() at all; the interrupt handler (or handlers) and timeout functions (if any) do that when a request has been satisfied or there has been an error.

Here is an example of a vaguely-defined interrupt-driven device driver.

    static int foo_busy; /* foo_init or init_module sets this to zero */

    static void do_foo_request(void)
    {
            if (foo_busy)
                    /* another request is being processed;
                       this one will automatically follow */
                    return;
            foo_busy = 1;
            foo_initialize_io();
    }

    static void foo_initialize_io(void)
    {
            if (CURRENT->cmd == READ) {
                    SET_INTR(foo_read_intr);
            } else {
                    SET_INTR(foo_write_intr);
            }
            /* send hardware command to start io based on request;
               just a request to read if read and preparing data for
               entire write; write takes more code */
    }

    static void foo_read_intr(void)
    {
            int error=0;
            CLEAR_INTR;

            /* read data from device and put in CURRENT->buffer;
               set error=1 if error
               This is actually most of the function... */

            /* successful if no error */
            end_request(error?0:1);

            if (!CURRENT)
                    /* allow new requests to be processed */
                    foo_busy = 0;

            /* INIT_REQUEST will return if no requests */
            INIT_REQUEST;

            /* Now prepare to do IO on next request */
            foo_initialize_io();
    }

    static void foo_write_intr(void)
    {
            int error=0;
            CLEAR_INTR;

            /* data has been written. error=1 if error */

            /* successful if no error */
            end_request(error?0:1);

            if (!CURRENT)
                    /* allow new requests to be processed */
                    foo_busy = 0;

            /* INIT_REQUEST will return if no requests */
            INIT_REQUEST;

            /* Now prepare to do IO on next request */
            foo_initialize_io();
    }

I cannot fully cover block device drivers within a 50-minute talk, but this should give you an impression of what Linux block device drivers are like, and demonstrate that the request function interface is a flexible, working abstraction. The KHG contains more information, if you wish to explore further.
You could even (gasp) read the source code to a few simple working drivers, such as the ramdisk driver.
_________________________________________________________________

The Hardware Interface

Linux provides convenience routines for accessing I/O ports, hardware interrupts, and DMA channels. The KHG covers most of the functions mentioned here in more detail.

I/O access

The Linux kernel provides a service to register I/O port usage. Before probing for a device in an initialization function, the driver should call check_region(). It takes two arguments: the beginning and length of the range of I/O ports you wish to access. It will return -EBUSY if any of the ports are in use, and 0 otherwise. This will keep you from confusing other devices by reading from or writing to their I/O ports in your attempt to probe for your device.

Then your driver should register the ports it wishes to use with the request_region() function, which takes three arguments: the beginning of the region, the length of the region, and the name of the driver.

When your driver is unloaded, it should call release_region() with two arguments, the beginning and length of the region, to release the region.

To access I/O ports, 12 inline functions are available by including . Six functions read data from ports, and each takes one argument: the port to read from. The other six write data to ports, and they take two arguments, the first being the value to write, and the second being the port to write it to. Each function has a size designation: b stands for byte, w for word (16 bits), and l for long (32 bits). Half of the functions are "pausing" functions that pause briefly after each access; a lot of hardware is a little slow on the uptake when it is being read or written, and is unable to keep up with the CPU. These functions have _p appended to their names. Here is the list: inb(), inb_p(), outb(), outb_p(), inw(), inw_p(), outw(), outw_p(), inl(), inl_p(), outl(), and outl_p().
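
The check/request/release protocol for I/O port regions described earlier in this section can be sketched as a small user-space model. This is not the kernel's implementation: the fixed-size table and every model_* name are hypothetical, and only the return convention (-EBUSY when any port in the range is taken, 0 otherwise) comes from the description above.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_REGIONS 16    /* hypothetical table size */

struct model_region {
        unsigned int from;      /* first port in the region */
        unsigned int num;       /* number of ports; 0 marks a free slot */
        const char *name;       /* driver that owns the region */
};

static struct model_region model_table[MODEL_MAX_REGIONS];

/* Return -EBUSY if [from, from + num) overlaps a registered region. */
static int model_check_region(unsigned int from, unsigned int num)
{
        size_t i;

        for (i = 0; i < MODEL_MAX_REGIONS; i++) {
                const struct model_region *r = &model_table[i];

                if (r->num != 0 &&
                    from < r->from + r->num && r->from < from + num)
                        return -EBUSY;
        }
        return 0;
}

/* Record the region in the first free slot; dropped if the table is full. */
static void model_request_region(unsigned int from, unsigned int num,
                                 const char *name)
{
        size_t i;

        for (i = 0; i < MODEL_MAX_REGIONS; i++) {
                if (model_table[i].num == 0) {
                        model_table[i].from = from;
                        model_table[i].num = num;
                        model_table[i].name = name;
                        return;
                }
        }
}

/* Forget a region registered with the same start and length. */
static void model_release_region(unsigned int from, unsigned int num)
{
        size_t i;

        for (i = 0; i < MODEL_MAX_REGIONS; i++) {
                if (model_table[i].from == from && model_table[i].num == num) {
                        model_table[i].num = 0;
                        return;
                }
        }
}
```

A driver following the text would call the check first, probe the hardware only if the check returns 0, register the range under its own name, and release exactly the same range when it is unloaded.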

Hardware interrupts

request_irq() requests an IRQ from the kernel, and installs an interrupt handler on that IRQ if successful. It takes four arguments:

    unsigned int irq
        is the number of the IRQ being requested.

    void (*handler)(int, struct pt_regs *)
        is a pointer to the interrupt handling function.

    unsigned long flags
        is set to SA_INTERRUPT to request a "fast" interrupt, or 0 to request a normal "slow" interrupt. The process scheduler is not run when returning from a fast interrupt, but it may be run when returning from a slow interrupt.

    const char *device
        is a string containing the name of the device. It is used to give the name of the driver in the /proc/interrupts listing.

Your handler will then be called whenever that particular interrupt occurs. The handler does not need to be re-entrant.

free_irq() frees up an IRQ. It takes one argument: the interrupt number to free.

cli(), which stands for CLear Interrupt enable, disables interrupts temporarily. sti(), SeT Interrupt enable, re-enables them. These are used to prevent race conditions where an interrupt-driven function and a system call (or function called from a system call) access the same data structures.

DMA channels

DMA channels in the PC are peculiar devices. has declarations of functions that can be used to manage DMA operations, as well as some documentation.

While initializing the driver, call request_dma(channel, "name"), where channel is the DMA channel that you wish to allocate, and "name" is the name of the driver. The name will be used to show the owner of the channel in /proc/dma.

When setting up DMA, use a sequence somewhat like the following.

    cli();
    disable_dma(channel);     /* Turn it off */
    clear_dma_ff(channel);    /* Clear pointer flip/flop */
    /* Set DMA mode.  Some of these are defined in
     * dma.h.  Others (such as auto-initialize mode)
     * aren't there but you can either (a) find them
     * in other drivers (the znet Ethernet card driver
     * has a few) or (b) figure out the hex value to
     * plug into the 8237's registers.  Get the specs
     * on the 8237 DMA controller chip if you don't
     * have them already. */
    set_dma_mode(channel, DMA_MODE_READ);
    /* Set transfer address and page bits for your channel */
    set_dma_addr(channel, buffer);
    /* Set transfer size */
    set_dma_count(channel, count);
    enable_dma(channel);
    sti();

You will still have to make the device do the DMA, as well.

Other functions are available for managing DMA depending on what you need to do; all of these functions except for disable_dma(), enable_dma(), request_dma(), and free_dma() should be called with interrupts disabled.

Make sure that you read all the comments in dma.h, as they will help you avoid many possible mistakes in programming DMA. It is probably also worth reading the source code for other drivers that use DMA. Also, read the actual dma_*() function source code which is in and compare it to the documentation for the device for which you are writing a driver to make sure that you understand what you are doing; DMA is probably the easiest hardware programming interface to use incorrectly.
_________________________________________________________________

Kernel convenience routines

Linux provides many routines that device drivers commonly use. The KHG covers most of the functions mentioned here in more detail.

Access to user memory

Linux has 8 functions for moving data between user space and kernel space. Their names are mostly self-explanatory, and they are all declared in . The get_user_*() functions, get_user_byte(), get_user_word(), and get_user_long(), each take one argument, the address from which to fetch data.
The put_user_*() functions, put_user_byte(), put_user_word(), and put_user_long(), each take two arguments, the first being the value to put, and the second being the user-space address at which to put it. Two memcpy()-like functions are also available; memcpy_fromfs(to, from, n) copies n bytes to kernel address to from user address from, and memcpy_tofs(to, from, n) copies n bytes to user address to from kernel address from. The reason the name uses "fs" instead of "user" is that on Linux/i386, the "fs" register is used to point to user space while in kernel mode.

Before accessing memory, use verify_area() to avoid kernel-space segmentation faults in case of error. verify_area() takes three arguments: the first is the type (VERIFY_WRITE or VERIFY_READ), the second is the address at which to start validating, and the third is the number of bytes to validate.

Memory allocation

Several sets of functions are available for memory allocation. The first is kmalloc()/kfree()/kfree_s(). These work much like the malloc()/free() available in the C library, except that the limit on the request size is smaller. Also, kmalloc() takes two arguments instead of one; the first argument is the usual size of the region to allocate, and the second is the "priority". This is one of the GFP_* defines in the file . If the driver can be safely pre-empted, then GFP_KERNEL should be used. If not, or from within an interrupt handler, GFP_ATOMIC should be used. Only use GFP_ATOMIC if absolutely necessary, because it places a larger strain on the memory management system. In order to allocate DMA-able memory, GFP_DMA should be used. This may allow pre-emption to take place, so be careful where you use it.

Do be careful to free everything when you are done using it, because kernel memory is non-swappable, which makes memory leaks more serious than in user-space programs.
Also, be careful not to free memory before you are finished using it, because freeing memory and then continuing to use it will usually cause a kernel fault--and that's if you are lucky. If you are unlucky, it will silently corrupt memory.

Sleeping

The first rule about sleeping is that only kernel code that is called from user code can sleep. Kernel code called from an interrupt handler cannot sleep.

If a device driver needs to sleep on an event, it can call one of several functions that are available for doing so, which work for most instances. However, some drivers need to sleep on multiple events, or do something else to avoid race conditions. In Linux, a task in kernel mode can set its state hint to a sleeping mode and keep executing for a while before calling the scheduler, which schedules another task to run. This is extremely flexible, and is partially covered in the KHG. Several devices use this to good effect, including simple devices like the lp parallel port driver and complex ones like the serial driver.

There are two functions for simple sleeping on an event: sleep_on() and interruptible_sleep_on(). There are two corresponding functions for waking up all processes sleeping on an event: wake_up() and wake_up_interruptible().

Timers

To simply go to sleep for a short time, measured in "jiffies" (hundredths of a second), the following code can be used. jiffies_to_wait determines a minimum length of time to wait. Using this code requires you to include .

    current->state = TASK_INTERRUPTIBLE;
    current->timeout = jiffies + jiffies_to_wait;
    schedule();

While the process is paused, other processes will run, and may well run in kernel space, so do not depend on the state of static data structures remaining the same after the call to schedule().

Timers that act like hardware interrupts are also available. Include and allocate a struct timer_list.
First pass a pointer to your structure to init_timer(), then fill in the expires, data, and function members, then call add_timer() with a pointer to your structure as the argument. expires gives the number of jiffies after which to time out, data gives the argument to pass to the timer handler, and function is a pointer to the timer handler function. When the function is called, it will not be executed in the context of a running process, so it will not be able to access any user-space data. Just like with a hardware interrupt handler, only kernel-space data structures will be available.

It is possible to request multiple timers at once by making a list of these timer structures; read for details. Most of the time, this will not be necessary.

Reporting information

There are several ways to report information. printk() is a kernel version of the libc printf() function which does not handle floating point numbers. It prints to the screen unless a kernel logging daemon such as klogd is running, in which case it is logged to system log files. printk() enables interrupts, and is not safe to call from within cli()/sti()-protected code. Even if it did not explicitly enable interrupts, it causes implicit I/O and might cause pre-emption to occur.

If you need to debug interrupt-disabled code by printing to the screen, use sprintf() to fill a string and use console_print() to display it on the screen. console_print() is not declared in any header file; you will need to declare it with

    extern void console_print(const char *);

before using it. It is defined in drivers/char/console.c.

It is also possible to use gdb to read /proc/kcore to do inspection-only debugging of the kernel. This currently does not work with loadable modules, but a kernel patch is available to allow inspection of loaded modules as well.

Block requests

Block devices do I/O by iterating over a sorted list of requests for I/O.
When a request has been fulfilled, an inline function called end_request() is called, which appropriately handles the request, including waking up any processes sleeping on the request and cleaning up the request list. The INIT_REQUEST; macro is then called; if there are no requests left, it causes the function to exit. Otherwise, it sets up the next request.
_________________________________________________________________

Trademarks

Unix is a trademark of X/Open Pty. Limited. Linux is not a licensee of X/Open Pty. Limited, and is not Unix.

Microsoft is a trademark of Microsoft, Inc.

Acknowledgements

Thanks to Matt Welsh for his help understanding DMA under Linux.