% Iris: micro-kernel for a capability-based operating system.
% making-of.tex: Description of the process of writing Iris.
% Copyright 2009 Bas Wijnen <wijnen@debian.org>
%
% This program is free software: you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation, either version 3 of the License, or
% (at your option) any later version.
%
% This program is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with this program. If not, see <http://www.gnu.org/licenses/>.
\documentclass{shevek}
\begin{document}
\title{Writing a kernel from scratch}
\author{Bas Wijnen}
\date{\today}
\maketitle
\begin{abstract}
This is a report of the process of writing a kernel (Iris) from scratch for
the cheap (€150) Trendtac laptop. In a following report I shall write about
the operating system on top of it. This document is written while writing the
system, so that no steps are forgotten. Choices are explained and problems
(and their solutions) are shown. After reading this, you should have a
thorough understanding of Iris, and (with significant effort) be able to write
a similar kernel yourself. This document assumes a working Debian system with
root access (for installing packages), and some knowledge about computer
architectures. (If you lack that knowledge, you can try to read it anyway and
check other sources when you see something new.)
\end{abstract}

\tableofcontents

\section{Hardware details}
The first step in the process of writing an operating system is finding out
which system you're going to program for. While most of the work is
supposed to be platform-independent, some parts, especially in the beginning,
will depend very much on the actual hardware. So I searched the net and found:
\begin{itemize}
\item There's a \textbf{Jz4730} chip inside, which implements most
functionality. It has a MIPS core, an OHCI USB host controller (so no USB2),
an AC97 audio device, a TFT display controller, an SD card reader, a network
device, and lots of general purpose I/O pins, which are used for the LEDs and
the keyboard. There are also two PWM outputs, one of which seems to be used
with the display. It also has some other features, such as a digital camera
controller, which are not used in the design.
\item There's a separate 4-port USB hub inside.
\item There's a serial port which is accessible with a tiny connector inside
the battery compartment. It uses TTL signals, so to use it with a PC serial
port, the signals must be converted with a MAX232. That is normal for these
boards, so I already have one handy. The main problem in this case is that the
connector is an unusual one, so it may take some time until I can actually
connect things to the serial port.
\end{itemize}

The first problem is how to write code which can be booted. This seems easy:
put a file named \textbf{uimage} on the first partition on an SD card, which
must be formatted FAT or ext3, and hold down Fn, left shift and left control
while booting. The partition must also not be larger than 32 MB.

The boot program is u-boot, which has good documentation on the web. Also,
there is a Debian package named uboot-mkimage, which has the mkimage executable
to create images that can be booted using u-boot. uimage should be in this
format.

To understand at least something of addresses, it's important to understand the
memory model of the MIPS architecture:
\begin{itemize}
\item User-mode code can never access anything in the upper half of the
memory (0x80000000 and above). If it tries, it receives a segmentation fault.
\item Access in the lower half is paged and can be cached. This is called
kuseg when used from kernel code. It will access the same pages as non-kernel
code finds there.
\item The upper half is divided into three segments:
\item kseg0 runs from 0x80000000 to 0xa0000000. Access to this memory will
access physical memory from 0x00000000 to 0x20000000. It is cached, but not
mapped (meaning it accesses physical, not virtual, memory).
\item kseg1 runs from 0xa0000000 to 0xc0000000. It is identical to kseg0,
except that it is not cached.
\item kseg2 runs from 0xc0000000 to the top. It is mapped like user memory,
differently for each process, and can be cached. It is intended for
per-address space kernel structures. I shall not use it in Iris.\footnote{I
thought I wouldn't use kseg2. However, I needed to use it for kernel entry
code, as you can read below.}
\end{itemize}
U-boot has some standard commands. It can load the image from the SD card at
0x80600000. Even though the Linux image seems to use a different address, I'll
go with this one for now.
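
The address arithmetic of this memory model can be sketched in C++. This is
only an illustration and not code from Iris; the names \verb+segment_of+ and
\verb+unmapped_to_phys+ are made up for the example. It classifies a virtual
address according to the list above, and shows that the u-boot load address
0x80600000 is simply physical address 0x00600000 seen through kseg0.

\begin{verbatim}
#include <cstdint>

// The segments of the 32-bit MIPS virtual address space listed above.
enum class Segment { kuseg, kseg0, kseg1, kseg2 };

constexpr Segment segment_of(std::uint32_t vaddr) {
    if (vaddr < 0x80000000u) return Segment::kuseg;  // paged, cacheable
    if (vaddr < 0xa0000000u) return Segment::kseg0;  // unmapped, cached
    if (vaddr < 0xc0000000u) return Segment::kseg1;  // unmapped, uncached
    return Segment::kseg2;                           // paged, kernel only
}

// kseg0 and kseg1 are two windows onto the same physical memory
// (0x00000000 to 0x20000000), so the physical address is the virtual
// one with the top three bits cleared.  Only meaningful for addresses
// in kseg0 or kseg1.
constexpr std::uint32_t unmapped_to_phys(std::uint32_t vaddr) {
    return vaddr & 0x1fffffffu;
}

static_assert(segment_of(0x80600000u) == Segment::kseg0,
              "the load address lies in kseg0");
static_assert(unmapped_to_phys(0x80600000u) == 0x00600000u,
              "physical address of the loaded image");
static_assert(unmapped_to_phys(0xa0600000u) == 0x00600000u,
              "same memory through the uncached window");
\end{verbatim}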

\section{Cross-compiler}
The next thing to do is to build a cross-compiler so it is possible to try out
some things. This shouldn't need to be very complex, but it is. I wrote a
separate document about how to do this. Please read that if you don't have a
working cross-compiler, or if you would like to install libraries for
cross-building more easily.

\section{Choosing a language to write in}
Having a cross-compiler, the next thing to do is choose a language. I prefer
to use C++ for most things. I have used C for a previous kernel, though,
because it is more low-level. This time, I decided to try C++. But since I'm
not linking any libraries, I need to avoid things like new and delete. For
performance reasons I also don't use exceptions. They might need library
support as well. So what I use C++ for is classes with member functions, and
default function arguments. I'm not even using these all the time, and the
whole thing is very much like C anyway.

Except for one change I made: I'm using a \textit{pythonic preprocessor} I
wrote. It changes python-style indented code into something a C compiler
accepts. It shouldn't be too hard to understand if you see the kernel source.
Arguments to flow control instructions (if, while, for) do not need
parentheses, but instead have a colon at the end of the line. After a colon at
the end of a line follows a possibly empty indented block, which is put in
braces. Indenting a line with respect to the previous one without a colon
will not do anything: it makes it a continuation. Any line which is not empty
or otherwise special gets a semicolon at the end, so you don't need to type
those. When using both spaces and tabs (which I don't recommend), set the tab
width to 8 spaces.
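
To give an impression of these rules, here is a small made-up function (it
does not come from the kernel source). The comment shows what the indented
input could look like, assuming only the rules described above; the code below
it is the plain C++ those rules would produce. The exact syntax accepted by the
preprocessor may differ in details.

\begin{verbatim}
// Pythonic input (shown as a comment, since it is not valid C++):
//
//   int sum_even_below (int limit):
//       int total = 0
//       int i = 0
//       while i < limit:
//           if i % 2 == 0:
//               total += i
//           ++i
//       return total
//
// Result: parentheses around the conditions, braces around each
// indented block, and a semicolon after every ordinary line.
int sum_even_below (int limit) {
    int total = 0;
    int i = 0;
    while (i < limit) {
        if (i % 2 == 0) {
            total += i;
        }
        ++i;
    }
    return total;
}
\end{verbatim}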

\section{Making things run}
To be loaded, a program must be a binary executable with a header. The
header is inserted by mkimage. It needs a load address and an entry point.
Initially at least, the load address is 0x80600000. The entry point must be
computed from the executable. The easiest way to do this is by making sure
that it is the first byte in the executable. The file can then be linked as
binary, so without any headers. This is done by giving the
\verb+--oformat binary+ switch to ld. I think the image is loaded without the
header, so that can be completely ignored while building. However, it might
include it. In that case, the entry point should be 0x40 higher, because
that's the size of the header.
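
The size of 0x40 can be checked against the layout of the legacy u-boot image
header. The sketch below uses my own field names, not those from the u-boot
sources, and the helper \verb+entry_point+ is hypothetical; it only spells out
the two cases described above (header not loaded, or header loaded in front of
the code).

\begin{verbatim}
#include <cstdint>

// Legacy u-boot image header as written by mkimage.  All multi-byte
// fields are stored big-endian in the file.
struct UImageHeader {
    std::uint32_t magic;        // 0x27051956
    std::uint32_t header_crc;   // CRC32 of the header itself
    std::uint32_t timestamp;    // creation time
    std::uint32_t data_size;    // size of the payload in bytes
    std::uint32_t load_addr;    // where the payload is loaded
    std::uint32_t entry;        // where execution starts
    std::uint32_t data_crc;     // CRC32 of the payload
    std::uint8_t  os;
    std::uint8_t  arch;
    std::uint8_t  type;
    std::uint8_t  compression;
    std::uint8_t  name[32];
};
static_assert(sizeof(UImageHeader) == 0x40, "the header is 64 bytes");

constexpr std::uint32_t kHeaderSize = 0x40;

// Entry point for code whose first instruction is `start_offset`
// bytes into the binary.
constexpr std::uint32_t entry_point(std::uint32_t load_addr,
                                    std::uint32_t start_offset,
                                    bool header_loaded) {
    return load_addr + start_offset + (header_loaded ? kHeaderSize : 0u);
}
\end{verbatim}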

\section{The first version of the kernel}
This sounds better than it is. The first version will be able to boot, and
somehow show that it did that. Not too impressive at all, and certainly not
usable. It is meant to find out if everything I wrote above actually works.

For this kernel I need several things: a program which can boot, and a way to
tell the user. As the way to tell the user, I decided to use the caps-lock
LED. The display is quite complex to program, I suppose, so I won't even try
at this stage. The LED should be easy, especially because Linux can use it
too. I copied the code from the Linux kernel patch that seemed to be about the
LED, and that gave me the macros \verb+__gpio_as_output+, \verb+__gpio_set_pin+
and \verb+__gpio_clear_pin+. And of course there's \verb+CAPSLOCKLED_IO+,
which is the pin to set or clear.

I used these macros in a function I called \verb+kernel_entry+. In an endless
loop, it switches the LED on 1000000 times, then off 1000000 times. If the
time required to set the LED is in the order of microseconds, the LED should be
blinking in the order of seconds. I tried with 1000 first, but that left the
LED on seemingly permanently, so it was apparently way too fast.

This is the code I want to run, but it isn't quite ready for that yet. A C
function needs to have a stack when it is called. It is possible that u-boot
provides one, but it may also not do that. To be sure, it's best to use some
assembly as the real entry point, which sets up the stack and calls the
function.

The symbol that ld will use as its entry point must be called \verb+__start+
(on some other architectures with just one underscore). So I created a simple
assembly file which defines some stack space and does the setting up. It also
sets \$gp to the so-called \textit{global offset table}, and clears the .bss
section. This is needed to make compiler-generated code run properly.

Now how to build the image file? This is a problem. The ELF format allows
paged memory, which means that simply loading the file may not put everything
at its proper address. ld has an option for this, \verb+--omagic+. This is
meant for the a.out format, which isn't supported by mipsel binutils, but that
doesn't matter. The result is still that the .text section (with the
executable code) is first in the file, immediately followed by the .data
section. So that means that loading the file into memory at the right address
results in all parts of the file in the proper place. Adding
\verb+-Ttext 0x80600000+ makes everything right. However, the result is still
an ELF file. So I use objcopy with \verb+-Obinary+ to create a binary file
from it. At this point, I also extract the start address (the location of
\verb+__start+) from the ELF file, and use that for building uimage. That
way, \verb+__start+ no longer needs to be at the first byte of the file.

Booting from the SD card is as easy as it seemed, except that I first tried an
MMC card (which fits in the same slot, and usually works when SD is accepted)
and that didn't work. So you really need an SD card.

\section{Context switching}
One very central thing in the kernel is context switching. That is, we need to
know how the registers and the memory are organized when a user program is
running. In order to understand that, we must know how paging is done. I
already found that it is done by coprocessor 0, so now I need to find out how
that works.

On the net I found the \textit{MIPS32 architecture for programmers}, volume III
of which is sub-titled \textit{the MIPS32 privileged resource architecture}.
It explains everything there is to know about things which are not accessible
from normal programs. In other words, it is exactly the right book for
programming a kernel or device driver using this processor. How nice.

It explains that memory accesses to the lower 2GB are (almost always) mapped
through a TLB (translation lookaside buffer). This is an array of some records
where virtual to physical address mappings are stored. In case of a TLB-miss
(the virtual address cannot be found in the table), an exception is generated
and Iris must insert the mapping into the TLB.

This is very flexible, because I get to decide how I write the kernel. I shall
use something similar to the hardware implementation of the IBM PC: a page
directory which contains links to page tables, with each page table filled with
pointers to page information. It is useful to have a direct mapping from
virtual address to kernel data as well. There are several ways to achieve
this. The two simplest ones each have their own drawback: making a shadow
page directory with shadow page tables with links to the kernel structures
instead of the pages wastes some memory. Using only the shadow, and doing a
lookup of the physical address in the kernel structure (where it must be stored
anyway) wastes some CPU time during the lookup. At this moment I do not know
which is more expensive. I'll initially go for the approach that wastes CPU
time.

\section{Kernel entry}
Now that I have an idea of how a process looks in memory, I need to implement
kernel entry and exit. A process is preempted or makes a request, then Iris
responds, and then a process (possibly the same) is started again.

The main problem of kernel entry is to save all registers in the kernel
structure which is associated with the thread. In case of the MIPS processor,
there is a simple solution: there are two registers, k0 and k1, which cannot be
used by the thread. So they can be set before starting the thread, and will
still have their values when the kernel is entered again.\footnote{This is not
true, see below.} By pointing one of them to the place to save the data, it
becomes easy to perform the save and restore.

As with the bootstrap process, this must be done in assembly. In this case
this is because the user stack must not be used, and a C function will use the
current stack. It will also mess up some registers before you can save them.

The next problem is how to get the interrupt code at its address. I'll try to
load the thing at address 0x80000000. It seems to work, which is good. Linux
probably has some reason to do things differently, but if this works, it is the
easiest way.

\section{Memory organization}
Now I've reached the point where I need to create some memory structures. To
do that, I first need to decide how to organize the memory. There's one very
simple rule in my system: everyone must pay for what they use. For memory,
this means that a process brings its own memory where Iris can write things
about it. Iris does not need her own allocation system, because she always
works for some process. If the process doesn't provide the memory, the
operation will fail.\footnote{There are some functions with \textit{alloc} in
their name. However, they allocate pieces of memory which are owned by the
calling process. Iris never allocates anything for herself, except during
boot.}

Memory will be organized hierarchically. It belongs to a container, which I
shall call \textit{Memory}. The entire Memory is the property of another
Memory, its parent. This is true for all but one, which is the top level
Memory. The top level Memory owns all memory in the system: some of it
directly, most of it through other Memories.

Iris will have a list of unclaimed pages. For optimization, she actually
has two lists: one with pages containing only zeroes, and one with pages
containing junk. When idle, the junk pages can be filled with zeroes.

Because Iris starts at address 0, building up the list of pages is very
easy: starting from the first page above the top of the kernel, everything is
free space. Initially, all pages are added to the junk list.
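
The two lists can be sketched as follows. This is not the Iris
implementation: the names \verb+FreeLists+, \verb+add_junk+, \verb+zero_one+
and \verb+alloc_zeroed+ are invented, and storing the list link inside the
free page itself is an assumption made to keep the example free of any
allocator.

\begin{verbatim}
#include <cstddef>
#include <cstring>

constexpr std::size_t kPageSize = 4096;

// A free page stores the link to the next free page in its own first
// word, so the lists themselves cost no memory.
struct FreePage { FreePage *next; };

struct FreeLists {
    FreePage *zero = nullptr;   // pages known to contain only zeroes
    FreePage *junk = nullptr;   // pages with unknown contents

    // At boot, every page above the kernel is added here.
    void add_junk(void *page) { push(junk, page); }

    // Work for the idle loop: clean one junk page.  Returns false
    // when there was nothing left to clean.
    bool zero_one() {
        FreePage *p = pop(junk);
        if (!p) return false;
        std::memset(p, 0, kPageSize);
        push(zero, p);
        return true;
    }

    // Hand out a zeroed page, cleaning a junk page only if needed.
    void *alloc_zeroed() {
        FreePage *p = pop(zero);
        if (!p) {
            p = pop(junk);
            if (!p) return nullptr;
            std::memset(p, 0, kPageSize);
        }
        p->next = nullptr;  // erase the link: the page is all zeroes
        return p;
    }

private:
    static void push(FreePage *&head, void *page) {
        FreePage *p = static_cast<FreePage *>(page);
        p->next = head;
        head = p;
    }
    static FreePage *pop(FreePage *&head) {
        FreePage *p = head;
        if (p) head = p->next;
        return p;
    }
};
\end{verbatim}

The point of the second list is that \verb+alloc_zeroed+ is cheap whenever the
idle loop has had time to call \verb+zero_one+ beforehand.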

\section{The idle task}
When there is nothing to do, an endless loop should be waiting for interrupts.
This loop is called the idle task. I use it also to exit bootstrapping, by
enabling interrupts after everything is set up as if we're running the idle
task, and then jumping to it.

There are two options for the idle task, again with their own drawbacks. The
idle task can run in kernel mode. This is easy: it doesn't need any paging
machinery then. However, this means that Iris must read-modify-write the
Status register of coprocessor 0, which contains the operating mode, on every
context switch. That's quite an expensive operation for such a critical path.

The other option is to run it in user mode. The drawback there is that it
needs a page directory and a page table. However, since the code is completely
trusted, it may be possible to sneak that in through some unused space between
two interrupt handlers. That means there's no fault when accessing some memory
owned by others (which is a security issue), but the idle task is so trivial
that it can be assumed to run without affecting them.

\section{Intermezzo: some problems}
Some problems came up while working. First, I found that the code sometimes
didn't work and sometimes it did. It seemed that it had problems when the
functions I called became more complex. Looking at the disassembly, it appears
that I didn't fully understand the calling convention used by the compiler.
Apparently, it always needs to have register t9 set to the called function.
In all compiled code, functions are called as \verb+jalr $t9+. It took quite
some time to figure this out, but setting t9 to the called function in my
assembly code does indeed solve the problem.

I also found that every compiled function starts with setting up gp. This is
complete nonsense, since gp is not changed by any code (and it isn't restored
at the end of a function either). I'll report this as a bug to the compiler.
Because it is done for every function, it means a significant performance hit
for any program.

The other problem is that the machine was still doing unexpected things.
Apparently, u-boot enables interrupts and handles them. This is not very nice
when I'm busy setting up interrupt handlers. So before doing anything else, I
first switch off all interrupts by writing 0 to the Status register of CP0.

This also reminded me that I need to flush the cache, so that I can be sure
everything is correct. For that reason, I need to start at 0xa0000000, not
0x80000000, so that the startup code is not cached. It should be fine to load
Iris at 0x80000000, but jump in at the non-cached location anyway, if I
make sure the initial code, which clears the cache, can handle it. After that,
I jump to the cached region, and everything should be fine. However, at this
moment I first link Iris at the non-cached address, so I don't need to
worry about it.\footnote{Actually, it seems that the cache is working fine, and
I'm using the cached address. They are used for kernel entry in any case.}

Finally, I read in the books that k0 and k1 are in fact normal general purpose
registers. So while they are by convention used for kernel purposes, and
compilers will likely not touch them, Iris can't actually rely on them not
being changed by user code. So I'll need to use a different approach for
saving the processor state. The solution is trivial: use k1 as before, but
first load it from a fixed memory location. To be able to store k1 itself, a
page must be mapped in kseg3 (the topmost part of what I called kseg2 above;
the page is wired into the TLB), which can then be accessed with a negative
index to \$zero.
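
The reason this works without any free register is that MIPS loads and stores
add a signed 16-bit offset to a base register, and \$zero is always available
as a base. An instruction such as \verb+lw $k1, -4($zero)+ (the offset is
only an example) therefore reaches the last word of the address space. The
arithmetic, with a helper name of my own choosing:

\begin{verbatim}
#include <cstdint>

// Effective address of a load or store that uses $zero as its base
// register: the 16-bit offset is sign-extended to 32 bits, so
// negative offsets wrap around to the top of the address space.
constexpr std::uint32_t address_from_zero(std::int16_t offset) {
    return static_cast<std::uint32_t>(static_cast<std::int32_t>(offset));
}

static_assert(address_from_zero(-4) == 0xfffffffcu,
              "last word of the address space");
static_assert(address_from_zero(-32768) == 0xffff8000u,
              "lowest address that can be reached this way");
\end{verbatim}

So everything from 0xffff8000 to the top is reachable this way, which is why
the page has to be mapped at the very end of the address space.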

At this point, I was completely startled by crashes depending on seemingly
irrelevant changes. After a lot of investigation, I saw that I had forgotten
that MIPS jumps have a delay slot, which is executed after the jump, before the
first new instruction is executed. I was executing random instructions, which
led to random behaviour.

\section{Back to the idle task}
With all this out of the way, I continued to implement the idle task. I hoped
to be able to never write to the Status register. However, this is not
possible. The idle task must be in user mode, and it must call wait. That
means it needs the coprocessor 0 usable bit set. This bit must not be set for
normal processes, however, or they would be able to change the TLB and all
protection would be lost. However, writing to the Status register is not a
problem. First of all, it is only needed during a task switch, and they aren't
as frequent as context switches (every entry to the kernel is a context switch;
it is a task switch only when the kernel returns to a different task than the
one it was entered from). Furthermore, and more importantly, coprocessor 0 is
integrated into the CPU, and writing to it is actually a very fast operation
and not something to be avoided at all.

So to switch to user mode, I set up the Status register so that it looks like
it's handling an exception, set EPC to the address of the idle task, and use
eret to ``return'' to it.
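
As an illustration, the value written to the Status register could be composed
like this. The bit positions are those of the MIPS32 Status register; the
function names and the choice of which interrupt lines to enable are
assumptions of this sketch, not taken from Iris.

\begin{verbatim}
#include <cstdint>

// Relevant bits of the MIPS32 Status register.
constexpr std::uint32_t kStatusIE  = 1u << 0;   // interrupts enabled
constexpr std::uint32_t kStatusEXL = 1u << 1;   // handling an exception
constexpr std::uint32_t kStatusUM  = 1u << 4;   // user mode once EXL clears
constexpr std::uint32_t kStatusCU0 = 1u << 28;  // coprocessor 0 usable

// Status value to set before eret.  `irq_lines` selects which of the
// eight interrupt lines are unmasked (bits 8 to 15); `allow_cp0` is
// true only for the idle task, which needs it for the wait
// instruction.
constexpr std::uint32_t status_before_eret(bool allow_cp0,
                                           std::uint32_t irq_lines) {
    return kStatusEXL | kStatusUM | kStatusIE
         | ((irq_lines & 0xffu) << 8)
         | (allow_cp0 ? kStatusCU0 : 0u);
}

// What eret does to Status: it clears EXL, after which the processor
// really is in user mode with interrupts enabled.
constexpr std::uint32_t status_after_eret(std::uint32_t status) {
    return status & ~kStatusEXL;
}
\end{verbatim}

While EXL is set the processor stays in kernel mode and ignores interrupts,
whatever the other bits say. That is what makes it safe to prepare the
user-mode state first and only activate it with eret.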

\section{Timer interrupts}
This worked well. Now I expected to get a timer interrupt soon after jumping
to the idle task. After all, I have set up the compare register, the timer
should be running and I enabled the interrupts. However, nothing happened. I
looked at the contents of the count register, and found that it was 0. This
means that it is not actually counting at all.\footnote{I also checked the
random register, which didn't seem to change either. This is a huge
performance problem, but it is easily solved by changing the random register
manually.} Looking at the Linux sources, they don't use this timer either, but
instead use the cpu-external (but integrated in the chip) timer. The
documentation says that they have a different reason for this than a
non-functional cpu timer. Still, it means it can be used as an alternative.

Having a timer is important for preemptive multitasking: a process needs to be
interrupted in order to be preempted, so there needs to be a periodic interrupt
source.

During testing it is not critical to have a timer interrupt. Without it, the
system can still do cooperative multitasking, and all other aspects of the
system can be tested. So I decided to leave the timer interrupts until I'm
going to write the drivers for the rest of the hardware as well.

\section{Invoke}
So now I need to accept calls from programs and handle them. For this, I need
to decide what such a call looks like. It will need to send a capability to
invoke, and a number of capabilities and numbers as arguments. I chose to send
four capabilities (so five in total) and also four numbers. The way to send
these is by setting registers before making a system call. Similarly, when
Iris returns a message, she sets the registers before returning to the program.

I wrote one file with assembly for receiving interrupts and exceptions
(including system calls) and one file with functions called from this assembly
to do most of the work. For syscall, I call an arch-specific\footnote{I split
off all arch-specific parts into a limited number of files. While I am
currently writing Iris only for the Trendtac, I'm trying to make it easy to
port her to other machines later.} invoke function, which reads the message,
puts it in variables, and calls the real invoke function.

The real invoke function analyzes the called capability: if it is in page 0
(which is used by the interrupt handlers, and cannot hold real capabilities),
it must be a kernel-implemented object. If not, it is a pointer to a Receiver.
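
A rough sketch of the message and of this first decision follows. The struct
layout, the names and the way the capability value is compared (as a physical
address, so that it does not matter whether it is given as a kseg0 pointer)
are all assumptions made for the example; the real code may represent
capabilities differently.

\begin{verbatim}
#include <cstdint>

constexpr std::uint32_t kPageSize = 4096;

// What a program passes in registers: the capability to invoke, four
// more capabilities and four numbers.  Capabilities are shown as
// plain 32-bit words here.
struct Message {
    std::uint32_t target;
    std::uint32_t caps[4];
    std::uint32_t data[4];
};

enum class Kind { kernel_object, receiver };

// Page 0 holds the interrupt handlers, so no Receiver can live
// there.  Values in that page are therefore free to identify objects
// implemented by the kernel itself.
constexpr Kind classify(std::uint32_t target) {
    return (target & 0x1fffffffu) < kPageSize ? Kind::kernel_object
                                              : Kind::receiver;
}

static_assert(classify(0x80000010u) == Kind::kernel_object, "page 0");
static_assert(classify(0x80001000u) == Kind::receiver, "first real page");
\end{verbatim}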

Then kernel object calls are handled, and messages to receivers are sent. When
all is done, control is returned to the current process, which may or may not
be the calling process. If it isn't, the processor state is initialized for
the new process by setting the coprocessor 0 usable bit in the Status register
and the ASID bits in the EntryHi register of CP0.

\section{Paging}
While implementing user programs, I needed to think about paging as well. When
a TLB miss occurs, the kernel must have a fast way to reload it. For this,
page tables are needed. On Intel processors, these need to be in the format
that Intel considered useful. On a MIPS processor, the programmer can choose
whatever they want. The Intel format is a page containing the
\textit{directory}, 1024 pointers to other pages. Each of those pages contains
1024 pointers to the actual page. That way, 10 bits of the virtual address
come from the directory, 10 bits from the page table, and 12 from the offset
within the page, leading to a total of 32 bits of virtual memory addressing.

On MIPS, we need 31 bits, because user addresses with the upper bit set will
always result in an address error. So using the same format would waste half of
the page directory. However, it is often useful to have address to mapped page
information as well. For this, a shadow page table structure would be needed.
It seems logical to use the upper half of the directory page for the shadow
directory. However, I chose a different approach: I used the directory for
bits 21 to 30 (as opposed to 22 to 31). Since there are still 12-bit
addressable pages, this leaves 9 bits for the page tables. I split every page
table in two, with the data for EntryLo registers in the lower half, and a
pointer to page information in the upper half of the page. This way, my page
tables are smaller, and I waste less space for mostly empty page tables.
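
The two ways of splitting an address can be written down as follows. The
function names are made up; the bit positions are the ones given in the text.

\begin{verbatim}
#include <cstdint>

// Split used here: bit 31 is never set for user addresses, bits 30
// to 21 select the directory entry, bits 20 to 12 the page table
// entry and bits 11 to 0 the offset within the page.
constexpr std::uint32_t directory_index(std::uint32_t vaddr) {
    return (vaddr >> 21) & 0x3ffu;   // 10 bits
}
constexpr std::uint32_t table_index(std::uint32_t vaddr) {
    return (vaddr >> 12) & 0x1ffu;   // 9 bits
}
constexpr std::uint32_t page_offset(std::uint32_t vaddr) {
    return vaddr & 0xfffu;           // 12 bits
}

// The Intel split, for comparison: 10 + 10 + 12 bits.
constexpr std::uint32_t intel_directory_index(std::uint32_t vaddr) {
    return vaddr >> 22;
}
constexpr std::uint32_t intel_table_index(std::uint32_t vaddr) {
    return (vaddr >> 12) & 0x3ffu;
}

// The highest user address uses the last entry at every level.
static_assert(directory_index(0x7fffffffu) == 0x3ffu, "last directory slot");
static_assert(table_index(0x7fffffffu) == 0x1ffu, "last table slot");
static_assert(page_offset(0x7fffffffu) == 0xfffu, "last byte of the page");
// With the Intel split, user addresses reach only half the directory.
static_assert(intel_directory_index(0x7fffffffu) == 0x1ffu, "half unused");
\end{verbatim}

With 9 bits per table there are 512 entries of 4 bytes each, so the EntryLo
values fill exactly half a page and the 512 pointers to page information fill
the other half.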

To make a TLB refill as fast as possible, I implemented it directly in the
assembly handler. First, I check if k0 and k1 are both zero. If not, I use
the slow handler. If they are, I can use them as temporaries, and simply set
them to zero before returning. Then I read the current directory (which I save
during a task switch), get the proper entry from it, get the page table from
there, get the proper entry from that as well, and put that in the TLB. Having
done that, I reset k0 and k1, and return. No other registers are changed, so
they need not be saved either. If anything unexpected happens (there is no
page table or no page entry at the faulting address), the slow handler is
called, which will fail as well, but it will handle the failure. This is
slightly slower than handling the failure directly, but speed is no issue in
case of such a failure.
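
The real handler is written in assembly, but the lookup it performs can be
sketched in C++. The types and names below are hypothetical. The sketch also
simplifies two things: it treats an entry with a clear valid bit as ``no page
entry'', and it ignores that a MIPS TLB entry actually covers a pair of pages.

\begin{verbatim}
#include <cstdint>

struct PageInfo;   // kernel bookkeeping for a page; opaque here

// One page table: EntryLo values in the lower half, pointers to
// page information in the upper half.  (On the 32-bit target this is
// exactly one page; on a 64-bit host the pointers are larger.)
struct PageTable {
    std::uint32_t entry_lo[512];
    PageInfo *info[512];
};

struct Directory {
    PageTable *tables[1024];
};

constexpr std::uint32_t kEntryLoValid = 1u << 1;

// Fast path of the refill.  Returns true and stores the value for
// the TLB when the mapping exists; returns false when the slow
// handler has to take over.
inline bool fast_refill(const Directory *dir, std::uint32_t vaddr,
                        std::uint32_t &entry_lo) {
    if (!dir || (vaddr & 0x80000000u)) return false;
    const PageTable *table = dir->tables[(vaddr >> 21) & 0x3ffu];
    if (!table) return false;                 // no page table
    std::uint32_t value = table->entry_lo[(vaddr >> 12) & 0x1ffu];
    if (!(value & kEntryLoValid)) return false;   // no page entry
    entry_lo = value;
    return true;
}
\end{verbatim}

Every failure takes the same exit, which matches the idea above: the fast path
only has to recognize that something is wrong, not to find out what.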

While implementing this, I have been searching for a problem for some time. In
the end, I found that the value in the EntryLo registers does not have the
address bits at their normal locations, but shifted down by 6 bits. I was
mapping the wrong page in, and thus got invalid data when it was being used.
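
In other words, the frame number (the physical address without its 12 offset
bits) is stored starting at bit 6 of EntryLo, and the low 6 bits hold flags.
A sketch of the encoding, with names of my own:

\begin{verbatim}
#include <cstdint>

// EntryLo layout on MIPS32: global bit 0, valid bit 1, dirty
// (writable) bit 2, cache attribute bits 3 to 5, frame number from
// bit 6 upward.
constexpr std::uint32_t kEntryLoGlobal = 1u << 0;
constexpr std::uint32_t kEntryLoValid  = 1u << 1;
constexpr std::uint32_t kEntryLoDirty  = 1u << 2;

constexpr std::uint32_t make_entry_lo(std::uint32_t phys,
                                      std::uint32_t flags) {
    // Drop the 12 offset bits, then place the frame number at bit 6:
    // the net effect is a shift of the address by 6 bits.
    return ((phys >> 12) << 6) | (flags & 0x3fu);
}

// The frame at physical address 0x00600000, mapped valid:
static_assert(make_entry_lo(0x00600000u, kEntryLoValid) == 0x00018002u,
              "frame number 0x600 ends up at bit 6");
\end{verbatim}

Writing the unshifted address into the register instead makes the hardware
read a frame number that is 64 times too large, which is the kind of wrong
mapping described above.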

\section{Sharing}
The next big issue is sharing memory. In order to have efficient
communication, it is important to use shared memory. The question is how to
implement it. A Page can be mapped to memory in the address space that owns
it. It can be mapped to multiple locations in that address space. However, I
may remove this feature for performance reasons. It doesn't happen much
anyway, and it is always possible to map the same frame (a page in physical
memory) to multiple virtual addresses by creating multiple Pages.

For sharing, a frame must also be mappable in a different address space. In
that case, an operation must be used which copies or moves the frame from one
Page to another. There is a problem with rights, though: if there is an
operation which allows a frame to be filled into a Page, then the rights of
capabilities to that Page may not be appropriate for the frame. For example,
if I have a frame which I am not allowed to write, and a Page which I am
allowed to write, I should not be able to write to the frame by
transferring it to that Page. So some frame rights must be stored in the
Page, and they must be updated during copy and move frame operations.
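
One way to picture this is to keep a small rights mask next to the frame
pointer and to narrow it on every transfer. The names, the two rights and the
rule of intersecting the masks are assumptions of this sketch; the text above
only requires that the rights travel with the frame.

\begin{verbatim}
#include <cstdint>

enum Rights : std::uint8_t { kRead = 1, kWrite = 2 };

struct Frame;   // a page of physical memory; opaque here

struct Page {
    Frame *frame = nullptr;
    std::uint8_t frame_rights = 0;   // what may be done with the frame
};

// What the holder of a capability with `cap_rights` to a Page may
// really do with the frame in it.
constexpr std::uint8_t effective_rights(std::uint8_t cap_rights,
                                        std::uint8_t frame_rights) {
    return cap_rights & frame_rights;
}

// Share the frame of `from` with `to`.  The frame never gains
// rights: it keeps at most what the sender was allowed to do.
inline void copy_frame(const Page &from, std::uint8_t sender_rights,
                       Page &to) {
    to.frame = from.frame;
    to.frame_rights = effective_rights(sender_rights, from.frame_rights);
}

// Like copy_frame, but the sender's Page is left without a frame.
inline void move_frame(Page &from, std::uint8_t sender_rights, Page &to) {
    copy_frame(from, sender_rights, to);
    from.frame = nullptr;
    from.frame_rights = 0;
}
\end{verbatim}

In the example from the text, a read-only frame that is transferred into a
Page to which I hold a writable capability still has only \verb+kRead+ in
\verb+frame_rights+, so the effective rights stay read-only.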
|
|
|
|
|
|
|
|
Move frame is only an optimization. It allows the receiver to request a
|
|
|
|
personal copy of the data without actually copying anything. The result for
|
|
|
|
the sender is a Page without a frame. Any mappings it has are remembered, but
|
|
|
|
until a new frame is requested, no frame will be mapped at the address. A Page
|
|
|
|
is also able to \textit{forget} its frame, thereby freeing some of its memory
|
2009-06-01 02:12:54 +03:00
|
|
|
quota (if it stops paying for it as well; a paid-for frame costs quota, and is
|
|
|
|
guaranteed to be allocatable at any time).
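
As an illustration of the bookkeeping, here is a hedged sketch in C of a move and a forget operation. All structures and names are hypothetical; in particular, the quota is modelled as a simple counter of paid-for frames, which is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct frame { unsigned users; };

struct memory { unsigned used; };   /* paid-for frames counted against quota */

struct page {
    struct memory *owner;
    struct frame *frame;            /* NULL while the Page has no frame */
    unsigned frame_rights;
    uintptr_t mapped_at;            /* remembered mapping; 0 means unmapped */
    bool paid;                      /* true while the Page pays for a frame */
};

/* Move the frame from src to dst without copying any data. */
void page_move_frame(struct page *dst, struct page *src)
{
    assert(dst != src);
    assert(dst->frame == NULL && src->frame != NULL);
    dst->frame = src->frame;
    dst->frame_rights = src->frame_rights;
    src->frame = NULL;
    src->frame_rights = 0;
    /* src->mapped_at is left alone: the mapping is remembered, but nothing
     * is mapped there until src gets a new frame. */
}

/* Drop the frame.  Quota is only released if the Page stops paying too. */
void page_forget(struct page *p, bool keep_paying)
{
    p->frame = NULL;
    p->frame_rights = 0;
    if (p->paid && !keep_paying) {
        assert(p->owner->used > 0);
        p->owner->used--;
        p->paid = false;
    }
}
```

A Page that forgets its frame but keeps paying holds on to its guarantee that a new frame can be allocated later; only when it stops paying does the counter go down.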
|
|
|
|
|
|
|
|
Another optimization is to specify a minimum number of bytes for a page move.
|
|
|
|
If the page needs to be copied, this reduces the time needed to complete that
|
|
|
|
operation. The rest of the page should not contain secret data: it is possible
|
|
|
|
that the entire page is copied, for example if it doesn't need to be copied,
|
|
|
|
but can be reused.
|
|
|
|
|
|
|
|
\section{Copy on write}
|
|
|
|
Another nice optimization is \textit{copy on write}: a page is shared
|
|
|
|
read-only, and when a page-fault happens, the kernel will copy the contents, so
|
|
|
|
that the other owner(s) don't see the changes. For the moment, I don't
|
2009-06-01 15:26:42 +03:00
|
|
|
implement this. I'm not sure if I want it in Iris at all. It can well be
|
|
|
|
implemented using an exception handler in user space, and may not be used
|
|
|
|
enough to spend kernel space on. But I can change my mind on that later.
|
2009-06-01 02:12:54 +03:00
|
|
|
|
|
|
|
\section{Memory listing}
|
|
|
|
The last thing to do for now is allowing a memory to be listed. That is,
|
|
|
|
having a suitably privileged capability to a Memory should allow a program to
|
|
|
|
see what's in it. In particular, what objects it holds, and where pages are
|
2009-06-28 23:44:44 +03:00
|
|
|
mapped. Probably also what messages are in a receiver's queue. For now, I
|
|
|
|
postponed the actual implementation of this, but I have reserved the code.
|
|
|
|
This is possibly the hardest kernel operation to implement, because a list of
|
|
|
|
items does not have a hard limit on its size. For other operations, it is
|
|
|
|
possible to return a value in a register, or in a page (which needs to be
|
|
|
|
provided by the caller). But in this case, that is not guaranteed to be
|
|
|
|
possible. So I need to think about how to do this.
|
2009-05-29 00:35:27 +03:00
|
|
|
|
2009-06-01 15:26:42 +03:00
|
|
|
\section{A name for the kernel}
|
|
|
|
At this point I am publishing the existence of the kernel, and so I
|
|
|
|
need to give it a name. I like Greek mythology, so I decided to make it a
|
|
|
|
Greek god. Because the kernel is mostly doing communication between programs,
|
|
|
|
while the programs do the real work on the system, I thought of Hermes, the
|
|
|
|
messenger of the gods. However, I don't really like his name, and I want a
|
|
|
|
logo which is furrier than a winged boot or staff. So I chose Iris, who is
|
|
|
|
also a messenger of the gods, but she has a rainbow symbol. This is much nicer for
|
|
|
|
creating a logo.
|
|
|
|
|
2009-06-28 23:44:44 +03:00
|
|
|
\section{Device drivers}
|
|
|
|
It's time to do some real testing of the kernel. So I've read the Linux
|
|
|
|
keyboard driver source, and implemented the same functionality in a boot
|
|
|
|
thread. During kernel load, several boot threads are started. At first, it is
|
|
|
|
just this one.
|
|
|
|
|
|
|
|
The keyboard of the device is like any other keyboard, except that it doesn't
|
|
|
|
have a keyboard controller. So the cpu must do this task itself. A keyboard
|
|
|
|
is built as a matrix of copper wires, organised in rows and columns. Every
|
|
|
|
intersection is a key. Pressing the key makes a connection between the row and
|
|
|
|
the column wire. In the Trendtac, there are 8 rows and 17 columns. All of
|
|
|
|
these lines go to a general purpose input/output pin on the cpu. The keyboard
|
|
|
|
driver sets 0 volt on each column in turn, and reads the rows, which are set as
|
|
|
|
pull-up inputs. If they are not connected, the pull-up makes them return 1.
|
|
|
|
But if the key of the column which is scanned is pressed, the 0 is connected to
|
|
|
|
the row line and 0 is read out. Thus the entire keyboard can be read.
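
The scanning scheme can be sketched in C like this. The hardware is replaced by a small simulation, because the real GPIO register layout is not described here; the structure and function names are made up for the example. Only the matrix size (8 rows, 17 columns) is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

enum { KBD_ROWS = 8, KBD_COLS = 17 };

/* Simulated keyboard: bit r of pressed[c] is set while the key at row r,
 * column c is held down. */
struct kbd_sim {
    uint8_t pressed[KBD_COLS];
    int driven_col;                 /* the column currently set to 0 volt */
};

void sim_drive_column(struct kbd_sim *k, int col)
{
    assert(col >= 0 && col < KBD_COLS);
    k->driven_col = col;
}

/* The rows are pull-up inputs: a row reads 1 unless a pressed key connects
 * it to the column that is driven to 0. */
uint8_t sim_read_rows(const struct kbd_sim *k)
{
    return (uint8_t)~k->pressed[k->driven_col];
}

/* Scan the whole matrix.  Afterwards bit r of state[c] is 1 if the key at
 * row r, column c is pressed. */
void kbd_scan(struct kbd_sim *k, uint8_t state[KBD_COLS])
{
    for (int col = 0; col < KBD_COLS; ++col) {
        sim_drive_column(k, col);
        state[col] = (uint8_t)~sim_read_rows(k);   /* a 0 on a row means pressed */
    }
}
```

In a real driver the two simulation functions would be replaced by writes to and reads from the GPIO registers; the loop itself stays the same.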
|
|
|
|
|
|
|
|
Linux does all this in kernel space. That means it can access the GPIO ports
|
|
|
|
in kseg1 (unmapped and uncached physical memory). In user space, this is not
|
|
|
|
possible. User space programs can only use mapped memory. So the page with
|
|
|
|
the GPIO ports needs to be mapped to the device driver's address space. For
|
|
|
|
this, I added an operation to the thread capability. Not because it has
|
|
|
|
something to do with a thread, but because every process has its own thread
|
|
|
|
capability, so no special other capability is needed. I'm adding some more
|
|
|
|
privileged operations while I'm at it: allocate physical memory to a Page
|
|
|
|
object is what I need here. Make a thread ``privileged'', which means it can
|
|
|
|
use coprocessor 0, and perform these operations. Get a capability for the top
|
|
|
|
memory. Register an interrupt handler. I think these should be enough, but I
|
|
|
|
can always add more, because threads don't need so many operations. I also
|
|
|
|
added a debug operation, which blinks the lock leds. This operation will be
|
|
|
|
removed once the display is working.
|
|
|
|
|
|
|
|
Writing the keyboard driver was as easy as could be expected: I had some
|
|
|
|
problems with the meaning of the bits in the registers (does 1 mean input or
|
|
|
|
output?), and for some reason the above scheme was needed and doing it the other
|
|
|
|
way (scanning the rows and reading the columns) didn't work. But for the rest,
|
|
|
|
it wasn't very troublesome. And I was happy to see that it is indeed possible
|
|
|
|
to address device memory through the tlb (so using a mapping). Had that not
|
|
|
|
been possible, then the device drivers would have been forced into the kernel.
|
|
|
|
|
|
|
|
The resulting keyboard driver uses maximum cpu time, because I don't have a
|
|
|
|
timer interrupt yet, and it flashes the leds when a key is pressed or released,
|
|
|
|
without telling which key it was. The final driver will be much better. Of
|
|
|
|
course it will send messages for key events instead of flashing the leds. It
|
|
|
|
will also be interrupt-driven: when no keys are pressed, all columns will be
|
|
|
|
set to 0, and all rows will be set to input with pull-up enabled and interrupt
|
|
|
|
on falling edge. Then no scanning is required. When a key is pressed, the
|
|
|
|
keyboard will be scanned periodically (on the timer interrupt) until no key is
|
|
|
|
pressed anymore. It is not possible to use an interrupt-driven approach while
|
|
|
|
a key is pressed, because there is no way to set up the lines such that there
|
|
|
|
will always be a change when a key is pressed or released. That's not a
|
|
|
|
problem: scanning doesn't take much time, and when the keyboard is being used,
|
|
|
|
the machine is active anyway. While no keys are pressed it makes sense to
|
|
|
|
minimize power consumption, so then the interrupt-driven approach is more
|
|
|
|
important.
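
The planned behaviour amounts to a two-state machine, sketched below in C. It is an illustration only: the names are hypothetical, and the actual setting up of the GPIO lines and interrupts is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum kbd_mode {
    KBD_WAIT_IRQ,   /* no key down: all columns at 0, row interrupts armed */
    KBD_POLLING     /* at least one key down: scan on every timer tick */
};

struct kbd_driver { enum kbd_mode mode; };

/* A falling edge on any row: some key went down, so start scanning. */
void kbd_on_row_irq(struct kbd_driver *d)
{
    assert(d != NULL);
    d->mode = KBD_POLLING;
}

/* Timer tick.  any_key_down is the result of a full matrix scan, which is
 * only needed (and only looked at) while polling. */
enum kbd_mode kbd_on_timer(struct kbd_driver *d, bool any_key_down)
{
    assert(d != NULL);
    if (d->mode == KBD_POLLING && !any_key_down)
        d->mode = KBD_WAIT_IRQ;   /* idle again: back to interrupts only */
    return d->mode;
}
```

While the driver is waiting for an interrupt, timer ticks cost it nothing, which is where the power saving comes from.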
|
|
|
|
|
|
|
|
\section{Display driver}
|
|
|
|
The next thing to write is a display driver. With a keyboard and a display, it
|
|
|
|
starts to look like a real computer. However, this proved to be a lot harder
|
|
|
|
than I expected.
|
|
|
|
|
|
|
|
First of all, it wasn't entirely clear which part of the Linux driver I needed
|
|
|
|
to copy. It has support for many displays on all kinds of mips devices, and I
|
|
|
|
only want support for the hardware in the machine. After some searching, it
|
|
|
|
seems that the Trendtac uses the ``pmpv1'' settings.
|
|
|
|
|
|
|
|
Now the display consists of two parts: the pixels and the backlight. I started
|
|
|
|
with the easy part, the backlight. It is connected to a pulse-width modulator
|
|
|
|
(pwm) of the cpu. This means that the cpu has some logic to make very fast
|
|
|
|
pulses of well-defined width. Connecting this to a light allows software to
|
|
|
|
set the intensity of the output. This means the backlight can be dimmed.
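
How a counter turns into a dimmable output can be shown with a small model in C. This is a generic sketch of pulse-width modulation, not a description of the registers of this cpu; the names and the choice of a counter that wraps at a fixed period are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The output is high while the counter, which wraps at `period`, is below
 * `duty`.  So duty / period is the fraction of time the light is on. */
bool pwm_output(uint32_t tick, uint32_t period, uint32_t duty)
{
    assert(period > 0 && duty <= period);
    return (tick % period) < duty;
}

/* Count how many of the first `ticks` ticks have the output high. */
uint32_t pwm_high_ticks(uint32_t ticks, uint32_t period, uint32_t duty)
{
    uint32_t high = 0;
    for (uint32_t t = 0; t < ticks; ++t)
        if (pwm_output(t, period, duty))
            ++high;
    return high;
}
```

With a period of 100 ticks and a duty value of 25, the output is high a quarter of the time. If the pulses are fast enough, the eye sees this as a backlight at a quarter of its full intensity.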
|
|
|
|
|
|
|
|
That's nice. Or well, it should be. After copying the Linux code, I can switch
|
|
|
|
the backlight on and off, but the pwm doesn't seem to work. That's the third
|
|
|
|
counter\footnote{a pwm is implemented using a counter to determine the pulse
|
|
|
|
width} that doesn't count: the count register, the random register and the pwm.
|
|
|
|
|
|
|
|
\section{Clocks}
|
|
|
|
Because this didn't feel good, I decided to implement the timer interrupt
|
|
|
|
first. I copied some code for it from Linux and found, as I feared, that it
|
|
|
|
didn't give any interrupts. I suppose the os timer isn't running either.
|
|
|
|
|
2009-05-25 22:52:44 +03:00
|
|
|
\end{document}
|