\documentclass{shevek}
\begin{document}
\title{Writing a kernel from scratch}
\author{Bas Wijnen}
\date{\today}
\maketitle

\begin{abstract}
This is a report of the process of writing a kernel from scratch for the cheap (€150) Trendtac laptop. In a following report I shall write about the operating system on top of it. It is written while writing the system, so that no steps are forgotten. Choices are explained and problems (and their solutions) are shown. After reading this, you should have a thorough understanding of the kernel, and (with significant effort) be able to write a similar kernel yourself. This document assumes a working Debian system with root access (for installing packages), and some knowledge about computer architectures. (If you lack that knowledge, you can try to read it anyway and check other sources when you see something new.)
\end{abstract}

\tableofcontents

\section{Hardware details}
The first step in the process of writing an operating system is finding out what system you're going to program for. While most of the work is supposed to be platform-independent, some parts, especially in the beginning, will depend very much on the actual hardware. So I searched the net and found:
\begin{itemize}
\item There's a \textbf{Jz4730} chip inside, which implements most functionality. It has a MIPS core, an OHCI USB host controller (so no USB2), an AC97 audio device, a TFT display controller, an SD card reader, a network device, and lots of general purpose I/O pins, which are used for the LEDs and the keyboard. There are also two PWM outputs, one of which seems to be used with the display. It also has some other features, such as a digital camera controller, which are not used in the design.
\item There's a separate 4-port USB hub inside.
\item There's a serial port which is accessible with a tiny connector inside the battery compartment. It uses TTL signals, so to use it with a PC serial port, the signals must be converted with a MAX232.
That is normal for these boards, so I already have one handy. The main problem in this case is that the connector is an unusual one, so it may take some time until I can actually connect things to the serial port.
\end{itemize}

The first problem is how to write code which can be booted. This seems easy: put a file named \textbf{uimage} on the first partition on an SD card, which must be formatted FAT or ext3, and hold down Fn, left shift and left control while booting. The partition must also not be larger than 32 MB.

The boot program is u-boot, which has good documentation on the web. Also, there is a Debian package named uboot-mkimage, which has the mkimage executable to create images that can be booted using u-boot. uimage should be in this format.

To understand at least something of addresses, it's important to understand the memory model of the MIPS architecture:
\begin{itemize}
\item usermode code will never reference anything in the upper half of the memory (above 0x80000000). If it does, it triggers an address error exception, which an operating system normally reports as a segmentation fault.
\item access in the lower half is paged and can be cached. This is called kuseg when used from kernel code. It will access the same pages as non-kernel code finds there.
\item the upper half is divided into four segments.
\item kseg0 runs from 0x80000000 to 0xa0000000. Access to this memory will access physical memory from 0x00000000 to 0x20000000. It is cached, but not mapped (meaning it accesses physical, not virtual, memory).
\item kseg1 runs from 0xa0000000 to 0xc0000000. It is identical to kseg0, except that it is not cached.
\item kseg2 runs from 0xc0000000 to 0xe0000000. It is mapped like user memory, differently for each process, and can be cached. It is intended for per-address space kernel structures. I shall not use it in my kernel.
\item kseg3 runs from 0xe0000000 to the top. It is mapped and can be cached in the same way as kseg2.
\end{itemize}

U-boot has some standard commands. It can load the image from the SD card at 0x80600000. Even though the Linux image seems to use a different address, I'll go with this one for now.
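The fixed mapping of kseg0 and kseg1 amounts to a little address arithmetic. The following C fragment is only a sketch of that arithmetic: the macro and function names are made up for this illustration and do not come from u-boot or Linux.

\begin{verbatim}
#include <stdint.h>

/* Segment bases as listed above. */
#define KSEG0_BASE 0x80000000u  /* cached, unmapped   */
#define KSEG1_BASE 0xa0000000u  /* uncached, unmapped */
#define KSEG_SIZE  0x20000000u  /* each segment spans 512 MB */

/* Physical address behind a kseg0 or kseg1 address:
 * clear the top three address bits. */
uint32_t kseg_to_phys(uint32_t vaddr)
{
    return vaddr & (KSEG_SIZE - 1);
}

/* Cached kernel address for a physical address below 512 MB. */
uint32_t phys_to_kseg0(uint32_t paddr)
{
    return paddr | KSEG0_BASE;
}

/* Uncached kernel address for the same physical memory. */
uint32_t phys_to_kseg1(uint32_t paddr)
{
    return paddr | KSEG1_BASE;
}
\end{verbatim}

With these definitions, the load address 0x80600000 corresponds to physical address 0x00600000, and the same memory is visible without caching at 0xa0600000.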
\section{Cross-compiler}
The next thing to do is build a cross-compiler so it is possible to try out some things. This shouldn't need to be very complex, but it is. I wrote a separate document about how to do this. Please read that if you don't have a working cross-compiler, or if you would like to install libraries for cross-building more easily.

\section{Making things run}
For loading a program, it must be a binary executable with a header. The header is inserted by mkimage. It needs a load address and an entry point. Initially at least, the load address is 0x80600000. The entry point must be computed from the executable. The easiest way to do this is by making sure that it is the first byte in the executable. The file can then be linked as binary, so without any headers. This is done by giving the \verb+--oformat binary+ switch to ld.

I think the image is loaded without the header, so that can be completely ignored while building. However, it might include it. In that case, the entry point should be 0x40 higher, because that's the size of the header.

\section{The first version of the kernel}
This sounds better than it is. The first version will be able to boot, and somehow show that it did that. Not too impressive at all, and certainly not usable. It is meant to find out if everything I wrote above actually works.

For this kernel I need several things: a program which can boot, and a way to tell the user. As the way to tell the user, I decided to use the caps-lock LED. The display is quite complex to program, I suppose, so I won't even try at this stage. The LED should be easy, especially because Linux can use it too. I copied the code from the Linux kernel patch that seemed to be about the LED, and that gave me the macros \verb+__gpio_as_output+, \verb+__gpio_set_pin+ and \verb+__gpio_clear_pin+. And of course there's \verb+CAPSLOCKLED_IO+, which is the pin to set or clear.

I used these macros in a function I called \verb+kernel_entry+.
In an endless loop, it switches the LED on 1000000 times, then off 1000000 times. If the time required to set the LED is on the order of microseconds, the LED should be blinking on the order of seconds. I tried with 1000 first, but that left the LED on seemingly permanently, so it was apparently way too fast.

This is the code I want to run, but it isn't quite ready for that yet. A C function needs to have a stack when it is called. It is possible that u-boot provides one, but it may also not do that. To be sure, it's best to use some assembly as the real entry point, which sets up the stack and calls the function.

The symbol that ld will use as its entry point must be called \verb+__start+ (on some other architectures with just one underscore). So I created a simple assembly file which defines some stack space and does the setting up. It also sets \$gp to the so-called \textit{global offset table}, and clears the .bss section. This is needed to make compiler-generated code run properly.

Now how to build the image file? This is a problem. The ELF format allows paged memory, which means that simply loading the file may not put everything at its proper address. ld has an option for this, \verb+--omagic+. This is meant for the a.out format, which isn't supported by mipsel binutils, but that doesn't matter. The result is still that the .text section (with the executable code) is first in the file, immediately followed by the .data section. So that means that loading the file into memory at the right address results in all parts of the file being in the proper place. Adding \verb+-Ttext 0x80600000+ makes everything right. However, the result is still an ELF file. So I use objcopy with \verb+-Obinary+ to create a binary file from it.

At this point, I also extract the start address (the location of \verb+__start+) from the ELF file, and use that for building uimage. That way \verb+__start+ no longer needs to be at the first byte of the file.
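To make the role of the load address and the entry point more concrete, here is a sketch in C of the 64-byte header that mkimage puts in front of the binary, following the legacy u-boot image format. The struct, macro and function names are invented for this illustration. All multi-byte fields are stored big-endian in the file, which this sketch does not handle. The helper function shows the two cases discussed earlier: whether or not the header ends up in memory together with the code is an assumption that has to be checked on the real machine.

\begin{verbatim}
#include <stdint.h>

#define UIMAGE_MAGIC    0x27051956u  /* identifies a legacy u-boot image */
#define UIMAGE_NAME_LEN 32

/* Layout of the legacy u-boot image header (64 bytes in total). */
struct uimage_header {
    uint32_t magic;      /* UIMAGE_MAGIC */
    uint32_t header_crc; /* checksum of this header */
    uint32_t time;       /* creation timestamp */
    uint32_t size;       /* size of the image data */
    uint32_t load;       /* load address, e.g. 0x80600000 */
    uint32_t entry;      /* entry point, the address of __start */
    uint32_t data_crc;   /* checksum of the image data */
    uint8_t  os;         /* operating system code */
    uint8_t  arch;       /* cpu architecture code */
    uint8_t  type;       /* image type code */
    uint8_t  comp;       /* compression type code */
    uint8_t  name[UIMAGE_NAME_LEN];
};

/* Entry point for code that starts start_offset bytes into the binary.
 * If the loader copies the header to the load address as well, everything
 * shifts up by the size of the header (0x40). */
uint32_t uimage_entry(uint32_t load, uint32_t start_offset,
                      int header_in_memory)
{
    uint32_t entry = load + start_offset;
    if (header_in_memory)
        entry += (uint32_t)sizeof(struct uimage_header);
    return entry;
}
\end{verbatim}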
Booting from the SD card is as easy as it seemed, except that I first tried an MMC card (which fits in the same slot, and usually works when SD is accepted) and that didn't work. So you really need an SD card.

\section{Context switching}
One very central thing in the kernel is context switching. That is, we need to know how the registers and the memory are organized when a user program is running. In order to understand that, we must know how paging is done. I already found that it is done by coprocessor 0, so now I need to find out how that works.

On the net I found \textit{MIPS32 Architecture for Programmers}, volume III of which is subtitled \textit{The MIPS32 Privileged Resource Architecture}. It explains everything there is to know about things which are not accessible from normal programs. In other words, it is exactly the right book for programming a kernel or device driver using this processor. How nice.

It explains that memory accesses to the lower 2GB are (almost always) mapped through a TLB (translation lookaside buffer). This is an array of some records where virtual to physical address mappings are stored. In case of a TLB miss (the virtual address cannot be found in the table), an exception is generated and the kernel must insert the mapping into the TLB.

This is very flexible, because I get to decide how I write the kernel. I shall use something similar to the hardware implementation of the IBM PC: a page directory which contains links to page tables, with each page table filled with pointers to page information. It is useful to have a direct mapping from virtual address to kernel data as well. There are several ways in which this can be achieved. The two simplest ones each have their own drawback: making a shadow page directory with shadow page tables with links to the kernel structures instead of the pages wastes some memory.
Using only the shadow, and doing a lookup of the physical address in the kernel structure (where it must be stored anyway) wastes some CPU time during the lookup. At this moment I do not know what is more expensive. I'll initially go for the CPU time wasting approach.

\section{Kernel entry}
Now that I have an idea of how a process looks in memory, I need to implement kernel entry and exit. A process is preempted or makes a request, then the kernel responds, and then a process (possibly the same) is started again.

The main problem of kernel entry is to save all registers in the kernel structure which is associated with the thread. In case of the MIPS processor, there is a simple solution: there are two registers, k0 and k1, which cannot be used by the thread. So they can be set before starting the thread, and will still have their values when the kernel is entered again. By pointing one of them to the place to save the data, it becomes easy to perform the save and restore.

As with the bootstrap process, this must be done in assembly. In this case this is because the user stack must not be used, and a C function will use the current stack. It will also mess up some registers before you can save them.

The next problem is how to get the interrupt code at its address. I'll try to load the thing at address 0x80000000. It seems to work, which is good. Linux probably has some reason to do things differently, but if this works, it is the easiest way.

\section{Memory organization}
Now I've reached the point where I need to create some memory structures. To do that, I first need to decide how to organize the memory. There's one very simple rule in my system: everyone must pay for what they use. For memory, this means that a process brings its own memory where the kernel can write things about it. The kernel does not need its own allocation system, because it always works for some process. If the process doesn't provide the memory, the operation will fail.
Memory will be organized hierarchically. It belongs to a container, which I shall call \textit{memory}. The entire memory is the property of another memory, its parent. This is true for all but one, which is the top level memory. The top level memory owns all memory in the system: some of it directly, most of it through other memories.

The kernel will have a list of unclaimed pages. For optimization, it actually has two lists: one with pages containing only zeroes, one with pages containing junk. When idle, the junk pages can be filled with zeroes.

Because the kernel starts at address 0, building up the list of pages is very easy: starting from the first page above the top of the kernel, everything is free space. Initially, all pages are added to the junk list.

\section{The idle task}
When there is nothing to do, an endless loop should be waiting for interrupts. This loop is called the idle task. I use it also to exit bootstrapping, by enabling interrupts after everything is set up as if we're running the idle task, and then jumping to it.

There are two options for the idle task, again with their own drawbacks. The idle task can run in kernel mode. This is easy, because it doesn't need any paging machinery then. However, this means that the kernel must read-modify-write the status register of coprocessor 0, which contains the operating mode, on every context switch. That's quite an expensive operation for such a critical path.

The other option is to run it in user mode. The drawback there is that it needs a page directory and a page table. However, since the code is completely trusted, it may be possible to sneak that in through some unused space between two interrupt handlers. That means there's no fault when accessing some memory owned by others, but the idle task is so trivial that it can be assumed to run without affecting them.

\section{Intermezzo: some problems}
Some problems came up while working.
First, I found that the code sometimes didn't work and sometimes it did. It seemed that it had problems when the functions I called became more complex. Looking at the disassembly, it appears that I didn't fully understand the calling convention used by the compiler. Apparently, it always needs to have register t9 set to the address of the called function. In all compiled code, functions are called as \verb+jalr $t9+. It took quite some time to figure this out, but setting t9 to the called function in my assembly code does indeed solve the problem.

The other problem is that the machine was still doing unexpected things. Apparently, u-boot enables interrupts and handles them. This is not very nice when I'm busy setting up interrupt handlers. So before doing anything else, I first switch off all interrupts by writing 0 to the status register of CP0.

This also reminded me that I need to flush the cache, so that I can be sure everything is correct. For that reason, I need to start at 0xa0000000, not 0x80000000, so that the startup code is not cached. It should be fine to load the kernel at 0x80000000, but jump in at the non-cached location anyway, if I make sure the initial code, which clears the cache, can handle it. After that, I jump to the cached region, and everything should be fine. However, at this moment I first link the kernel at the non-cached address, so I don't need to worry about it.

Finally, I read in the books that k0 and k1 are in fact normal general purpose registers. So while they are by convention used for kernel purposes, and compilers will likely not touch them, the kernel can't actually rely on them not being changed by user code. So I'll need to use a different approach for saving the processor state. The solution is trivial: use k1 as before, but first load it from a fixed memory location. To be able to store k1 itself, a page must be mapped in kseg3 (wired into the TLB), which can then be accessed with a negative index to \$zero.
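The reason a negative index to \$zero ends up in kseg3 is that MIPS loads and stores take a signed 16-bit offset. With \$zero as the base register, the offsets -32768 to -1 address the top 32 kB of the address space, which lies in kseg3 (0xe0000000 and up on MIPS32). No other register needs to be free to reach that page. The arithmetic can be sketched in C as follows; the function names are invented for this illustration.

\begin{verbatim}
#include <stdint.h>

#define KSEG3_BASE 0xe0000000u  /* start of kseg3 on MIPS32 */

/* Effective address of "sw reg, offset($zero)": the 16-bit offset
 * is sign-extended to 32 bits and added to the base, which is 0. */
uint32_t zero_relative_address(int16_t offset)
{
    return (uint32_t)(int32_t)offset;
}

/* Nonzero if the address lies in kseg3. */
int in_kseg3(uint32_t vaddr)
{
    return vaddr >= KSEG3_BASE;
}
\end{verbatim}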
At this point, I was completely startled by crashes depending on seemingly irrelevant changes. After a lot of investigation, I saw that I had forgotten that MIPS jumps have a delay slot, which is executed after the jump, before the first new instruction is executed. I was executing random instructions, which led to random behaviour.

\section{Back to the idle task}
With all this out of the way, I continued to implement the idle task. I hoped to be able to never write to the status register. However, this is not possible. The idle task must be in user mode, and it must call wait. That means it needs the coprocessor 0 usable bit set. This bit must not be set for normal processes, however, or they would be able to change the TLB and all protection would be lost.

However, writing to the status register is not a problem. First of all, it is only needed during a task switch, and task switches aren't as frequent as context switches (every entry to the kernel is a context switch; it is only a task switch when the kernel returns to a different task than the one it was entered from). Furthermore, and more importantly, coprocessor 0 is integrated into the CPU, and writing to it is actually a very fast operation and not something to be avoided at all.

So to switch to user mode, I set up the status register so that it looks like it's handling an exception, set EPC to the address of the idle task, and use eret to ``return'' to it.

\section{Timer interrupts}
This worked well. Now I expected to get a timer interrupt soon after jumping to the idle task. After all, I have set up the compare register, the timer should be running and I enabled the interrupts. However, nothing happened. I looked at the contents of the count register, and found that it was 0. This means that it is not actually counting at all. Looking at the Linux sources, they don't use this timer either, but instead use the CPU-external (but integrated in the chip) timer.
The documentation says that they have a different reason for this than a non-functional cpu timer. Still, it means it can be used as an alternative. Having a timer is important for preemptive multitasking: a process needs to be interrupted in order to be preempted, so there needs to be a periodic interrupt source. \end{document}