https://github.com/therealdreg/x86osdev

x86 OS development using Bochs emulator. MIT xv6, JamesM's kernel development tutorials (with some changes) & more
Host: GitHub
URL: https://github.com/therealdreg/x86osdev
Owner: therealdreg
Created: 2022-07-21T10:41:44.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2023-08-11T06:26:32.000Z (almost 3 years ago)
Last Synced: 2025-06-14T04:41:44.601Z (11 months ago)
Topics: bochs, kernel, kernel-development, mit, operating-systems, osdev, x86, xv6, xv6-operating, xv6-os
Language: C++
Homepage: https://rootkit.es/
Size: 35.5 MB
Stars: 82
Watchers: 4
Forks: 10
Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml

README

# x86 OS development using Bochs emulator (x86osdev)

**Prerequisites**: You need to know x86 assembly, C, GCC inline ASM, basic Linux and Windows cmd usage.

# Index

- [x86 OS development using Bochs emulator (x86osdev)](#x86-os-development-using-bochs-emulator-x86osdev)

- [Index](#index)

- [Install Bochs](#install-bochs)

- [Usage and Debug](#usage-and-debug)

- [Bochs Software Magic Breakpoint](#bochs-software-magic-breakpoint)

- [Bochs Input and Output debugger macros - BochsConsolePrintChar and BochsBreak](#bochs-input-and-output-debugger-macros---bochsconsoleprintchar-and-bochsbreak)

- [Advanced Bochs debugging](#advanced-bochs-debugging)

- [How to modify a project - optional step](#how-to-modify-a-project---optional-step)

  - [Windows](#windows)

  - [Debian based](#debian-based)

  - [Compilation normal projects](#compilation-normal-projects)

  - [Compilation initrd projects](#compilation-initrd-projects)

  - [Update floppy.img](#update-floppyimg)

- [boot code - bochs/x86osdev/boot_code/floppy.img](#boot-code---bochsx86osdevboot_codefloppyimg)

  - [Understanding the boot code](#understanding-the-boot-code)

  - [Multiboot](#multiboot)

  - [Back to the code again](#back-to-the-code-again)

  - [Adding some C code](#adding-some-c-code)

  - [C code](#c-code)

  - [More info](#more-info)

- [Screen - bochs/x86osdev/screen/floppy.img](#screen---bochsx86osdevscreenfloppyimg)

  - [Monitor code](#monitor-code)

  - [Moving the cursor](#moving-the-cursor)

  - [Scrolling the screen](#scrolling-the-screen)

  - [Writing a character to the screen](#writing-a-character-to-the-screen)

  - [Clearing the screen](#clearing-the-screen)

  - [Writing a string](#writing-a-string)

  - [Summary](#summary)

  - [Extensions](#extensions)

  - [More info](#more-info-1)

- [GDT and IDT - bochs/x86osdev/gdt_idt/floppy.img](#gdt-and-idt---bochsx86osdevgdt_idtfloppyimg)

  - [GDT - Global Descriptor Table](#gdt---global-descriptor-table)

  - [descriptor_tables.h](#descriptor_tablesh)

  - [descriptor_tables.c](#descriptor_tablesc)

  - [IDT - Interrupt Descriptor Table](#idt---interrupt-descriptor-table)

  - [Faults, traps and exceptions](#faults-traps-and-exceptions)

  - [descriptor_tables.h](#descriptor_tablesh-1)

  - [descriptor_tables.c](#descriptor_tablesc-1)

  - [interrupt.s](#interrupts)

  - [isr.c](#isrc)

  - [isr.h](#isrh)

  - [Testing it out](#testing-it-out)

  - [More info](#more-info-2)

- [IRQs and PIT - bochs/x86osdev/irqs_and_the_pit/floppy.img](#irqs-and-pit---bochsx86osdevirqs_and_the_pitfloppyimg)

  - [IRQ - Interrupt ReQuests](#irq---interrupt-requests)

  - [isr.h](#isrh-1)

  - [isr.c](#isrc-1)

  - [PIT - Programmable Interval Timer](#pit---programmable-interval-timer)

  - [More info](#more-info-3)

- [Paging - bochs/x86osdev/paging/floppy.img](#paging---bochsx86osdevpagingfloppyimg)

  - [Virtual memory](#virtual-memory)

  - [Paging as a concretion of virtual memory](#paging-as-a-concretion-of-virtual-memory)

  - [Page entries](#page-entries)

  - [Page directories and tables](#page-directories-and-tables)

  - [Enabling paging](#enabling-paging)

  - [Page faults](#page-faults)

  - [Putting it into practice](#putting-it-into-practice)

  - [Simple memory management with placement malloc](#simple-memory-management-with-placement-malloc)

  - [Required definitions](#required-definitions)

  - [Frame allocation](#frame-allocation)

  - [Paging code finally](#paging-code-finally)

  - [page fault handler](#page-fault-handler)

  - [Testing](#testing)

  - [More info](#more-info-4)

- [Heap - bochs/x86osdev/heap/floppy.img](#heap---bochsx86osdevheapfloppyimg)

  - [Data structure description](#data-structure-description)

  - [Allocation](#allocation)

  - [Deallocation](#deallocation)

  - [Pseudocode](#pseudocode)

  - [Implementing an ordered list](#implementing-an-ordered-list)

  - [ordered_array.h](#ordered_arrayh)

  - [ordered_map.c](#ordered_mapc)

  - [kheap.h](#kheaph)

  - [kheap.c](#kheapc)

  - [Expansion and contraction](#expansion-and-contraction)

  - [Allocation](#allocation-1)

  - [Freeing](#freeing)

  - [paging.c](#pagingc)

  - [Testing](#testing-1)

  - [More info](#more-info-5)

- [VFS and initrd - bochs/x86osdev/vfs_and_initrd/floppy.img](#vfs-and-initrd---bochsx86osdevvfs_and_initrdfloppyimg)

  - [VFS - Virtual File System](#vfs---virtual-file-system)

  - [Mountpoints](#mountpoints)

  - [fs.h](#fsh)

  - [fs.c](#fsc)

  - [Initial Ramdisk](#initial-ramdisk)

  - [My own solution](#my-own-solution)

  - [Filesystem generator](#filesystem-generator)

  - [Integrating it in to your own OS](#integrating-it-in-to-your-own-os)

  - [initrd.h](#initrdh)

  - [initrd.c](#initrdc)

  - [Loading initrd as a multiboot module](#loading-initrd-as-a-multiboot-module)

  - [Testing it out](#testing-it-out-1)

  - [More info](#more-info-6)

- [Multitasking - bochs/x86osdev/multitasking/floppy.img](#multitasking---bochsx86osdevmultitaskingfloppyimg)

  - [Cloning an address space](#cloning-an-address-space)

  - [Cloning a directory](#cloning-a-directory)

  - [Cloning a table](#cloning-a-table)

  - [Copying a physical frame](#copying-a-physical-frame)

  - [Creating a new stack](#creating-a-new-stack)

  - [Actual multitasking code](#actual-multitasking-code)

  - [Switching tasks](#switching-tasks)

  - [Testing](#testing-2)

  - [Summary](#summary-1)

  - [More info](#more-info-7)

- [User Mode (and syscalls) - bochs/x86osdev/user_mode/floppy.img](#user-mode-and-syscalls---bochsx86osdevuser_modefloppyimg)

  - [Switching to user mode](#switching-to-user-mode)

  - [task.c](#taskc)

  - [Something to watch out for](#something-to-watch-out-for)

  - [System calls](#system-calls)

  - [Task State Segment](#task-state-segment)

  - [descriptor_tables.h](#descriptor_tablesh-2)

  - [descriptor_tables.c](#descriptor_tablesc-2)

  - [gdt.s](#gdts)

  - [System call interface](#system-call-interface)

  - [syscall.h](#syscallh)

  - [syscall.c](#syscallc)

  - [Helper macros](#helper-macros)

  - [What happens when a an interrupt occurs in user mode?](#what-happens-when-a-an-interrupt-occurs-in-user-mode)

  - [Testing](#testing-3)

  - [More info](#more-info-8)

- [Multi core startup - bochs/x86osdev/multi_core_startup/floppy.img](#multi-core-startup---bochsx86osdevmulti_core_startupfloppyimg)

  - [Waking the APs](#waking-the-aps)

  - [Initializing and differentiating the APs](#initializing-and-differentiating-the-aps)

  - [Final notes](#final-notes)

  - [Testing](#testing-4)

  - [Interruptions in multi core](#interruptions-in-multi-core)

  - [More info](#more-info-9)

- [xv6 - bochs/x86osdev/xv6_dregmod/](#xv6---bochsx86osdevxv6_dregmod)

  - [Try it](#try-it)

  - [Debugging with symbols](#debugging-with-symbols)

  - [Compilation and modification](#compilation-and-modification)

  - [More info](#more-info-10)

- [Changelog](#changelog)

  - [New chapters](#new-chapters)

- [More info](#more-info-11)

# Install Bochs

For Windows, everything is included; just download this repo:

* https://github.com/therealdreg/x86osdev/archive/refs/heads/main.zip

For Linux, you must build Bochs with SMP, debugger and debugger GUI support (--enable-smp, --enable-debugger and --enable-debugger-gui):

* https://bochs.sourceforge.io/doc/docbook/user/compiling.html

# Usage and Debug

**WARNING**: Be patient; Bochs is slow.

1. Copy **bochs/x86osdev/Project/floppy.img** to **bochs/**

2. Go to **bochs/**

3. Run **bochsdbg.bat** (on Linux, **./bochsdbg.sh**)

4. Click Start

5. Click Continue (**First Breakpoint**):

![bochs_usage](img/bochs_usage.png)

6. When **"Magic Breakpoint"** text appears click Continue again (**Second Magic Breakpoint**):

![magicbp](img/magicbp.png)

**IMPORTANT**: When you read **"Run"** or **"Run Bochs"**, it means run Bochs from the **bochsdbg.bat** script (on Linux, **./bochsdbg.sh**).

With the **first breakpoint** it's possible to debug the bootloader code from the start.

With the **second magic breakpoint** it's possible to debug the kernel code from the start.

Debug commands: https://bochs.sourceforge.io/doc/docbook/user/internal-debugger.html

# Bochs Software Magic Breakpoint

From our OS Code:

GCC:

```

asm volatile ("xchgw %bx, %bx");

```

NASM:

```

xchg bx, bx

```

# Bochs Input and Output debugger macros - BochsConsolePrintChar and BochsBreak

From our OS Code:

```

//outputs a character to the debug console

#define BochsConsolePrintChar(c) outportb(0xe9, c)

//stops simulation and breaks into the debug console

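// (wrapped in do/while so the two writes stay together after an unbraced if/else)

#define BochsBreak() do { outportw(0x8A00, 0x8A00); outportw(0x8A00, 0x8AE0); } while (0)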

```
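
As a hedged sketch of how the character macro could be extended to whole strings, the helper below sends each byte of a string to port 0xE9. The names `bochs_console_print`, `out8_fn` and `record_out8` are hypothetical and not part of this repository; in a kernel you would pass a wrapper around your real port-output routine instead of the recording stub. Note that Bochs only honours port 0xE9 when the `port_e9_hack` option is enabled in bochsrc.

```c
#include <stddef.h>
#include <stdint.h>

#define BOCHS_E9_PORT 0xE9u

/* Hypothetical hook type: in the kernel this would wrap the OUT instruction. */
typedef void (*out8_fn)(uint16_t port, uint8_t value);

/* Send every character of a NUL-terminated string to the Bochs debug console. */
static void bochs_console_print(out8_fn out8, const char *s)
{
    while (*s != '\0') {
        out8(BOCHS_E9_PORT, (uint8_t)*s);
        s++;
    }
}

/* Host-side stand-in for the port write, so the logic can be tried without
   Bochs: it records the bytes that would have been written to port 0xE9. */
static char   e9_log[64];
static size_t e9_len;

static void record_out8(uint16_t port, uint8_t value)
{
    if (port == BOCHS_E9_PORT && e9_len < sizeof e9_log - 1) {
        e9_log[e9_len++] = (char)value;
        e9_log[e9_len] = '\0';
    }
}
```

Calling `bochs_console_print(record_out8, "boot ok\n")` on a host leaves the text in `e9_log`; inside the guest the same call with a real output routine would make it appear in the Bochs console.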

# Advanced Bochs debugging

Commands supported by port 0x8A00

- **0x8A00**: Used to enable the device. Any I/O to the debug module before this command is sent will simply be ignored.

- **0x8A01**: Selects register 0: Memory monitoring range start address (inclusive)

- **0x8A02**: Selects register 1: Memory monitoring range end address (exclusive)

- **0x8A80**: Enable address range memory monitoring as indicated by register 0 and 1 and clears both registers

- **0x8AE0**: Return to Debugger Prompt. If the debugger is enabled (via --enable-debugger), sending 0x8AE0 to port 0x8A00 after the device has been enabled will return Bochs to the debugger prompt. Basically the same as doing CTRL+C.

- **0x8AE2**: Instruction Trace Disable. If the debugger is enabled (via --enable-debugger), sending 0x8AE2 to port 0x8A00 after the device has been enabled will disable instruction tracing

- **0x8AE3**: Instruction Trace Enable. If the debugger is enabled (via --enable-debugger), sending 0x8AE3 to port 0x8A00 after the device has been enabled will enable instruction tracing

- **0x8AE4**: Register Trace Disable. If the debugger is enabled (via --enable-debugger), sending 0x8AE4 to port 0x8A00 after the device has been enabled will disable register tracing.

- **0x8AE5**: Register Trace Enable. If the debugger is enabled (via --enable-debugger), sending 0x8AE5 to port 0x8A00 after the device has been enabled will enable register tracing. This currently outputs the value of all the registers for each instruction traced. Note: instruction tracing must be enabled to view the register tracing.

- **0x8AFF**: Disable the I/O interface to the debugger and the memory monitoring functions. Note: all accesses must be done using word. Note: reading this register will return 0x8A00 if currently activated, otherwise 0

More info and examples: https://bochs.sourceforge.io/doc/docbook/development/debugger-advanced.html

# How to modify a project - optional step

1. Modify a project **bochs/x86osdev/Project/src**

2. After a modification you must recompile the project

## Windows

1. Install WSL2, open cmd as Administrator:

```

wsl --install

```

2. Reboot

3. Open WSL console:

```

apt-get update

apt-get install nasm build-essential gcc-multilib

```

## Debian based

```

apt-get update

apt-get install nasm build-essential gcc-multilib

```

## Compilation normal projects

Go to **bochs/x86osdev/Project/src/**

```

make clean

make

```

## Compilation initrd projects

1. Go to **bochs/x86osdev/Project/src/**

```

make clean

make

```

2. Go to **bochs/x86osdev/Project/**

```

make clean

make

./make_initrd.sh

```

## Update floppy.img

After compilation you must update floppy.img

Go to **bochs/x86osdev/Project/**

```

./update_image.sh

```

**WARNING**: for the **xv6 project**, update_image.sh updates xv6.img, fs.img and kernel.sym instead.

# boot code - bochs/x86osdev/boot_code/floppy.img

OK, it's time for some code! Although the bulk of our kernel will be written in C, there are certain things we just must use assembly for. One of those things is the initial boot code.

Here we go:

```

;

; boot.s -- Kernel start location. Also defines multiboot header.

; Based on Bran's kernel development tutorial file start.asm

;

MBOOT_PAGE_ALIGN    equ 1<<0    ; Load kernel and modules on a page boundary

MBOOT_MEM_INFO      equ 1<<1    ; Provide your kernel with memory info

MBOOT_HEADER_MAGIC  equ 0x1BADB002 ; Multiboot Magic value

; NOTE: We do not use MBOOT_AOUT_KLUDGE. It means that GRUB does not

; pass us a symbol table.

MBOOT_HEADER_FLAGS  equ MBOOT_PAGE_ALIGN | MBOOT_MEM_INFO

MBOOT_CHECKSUM      equ -(MBOOT_HEADER_MAGIC + MBOOT_HEADER_FLAGS)

[BITS 32]                       ; All instructions should be 32-bit.

[GLOBAL mboot]                  ; Make 'mboot' accessible from C.

[EXTERN code]                   ; Start of the '.text' section.

[EXTERN bss]                    ; Start of the .bss section.

[EXTERN end]                    ; End of the last loadable section.

mboot:

  dd  MBOOT_HEADER_MAGIC        ; GRUB will search for this value on each

                                ; 4-byte boundary in your kernel file

  dd  MBOOT_HEADER_FLAGS        ; How GRUB should load your file / settings

  dd  MBOOT_CHECKSUM            ; To ensure that the above values are correct

   

  dd  mboot                     ; Location of this descriptor

  dd  code                      ; Start of kernel '.text' (code) section.

  dd  bss                       ; End of kernel '.data' section.

  dd  end                       ; End of kernel.

  dd  start                     ; Kernel entry point (initial EIP).

[GLOBAL start]                  ; Kernel entry point.

[EXTERN main]                   ; This is the entry point of our C code

start:

  push    ebx                   ; Load multiboot header location

  ; Execute the kernel:

  cli                         ; Disable interrupts.

  call main                   ; call our main() function.

  jmp $                       ; Enter an infinite loop, to stop the processor

                              ; executing whatever rubbish is in the memory

                              ; after our kernel!

```

## Understanding the boot code

There are actually only a few lines of code in that snippet:

```

push ebx

cli

call main

jmp $

```

The rest of it is all to do with the multiboot header.

## Multiboot

Multiboot is a standard to which GRUB expects a kernel to comply. It is a way for the bootloader to:

1. Know exactly what environment the kernel wants/needs when it boots.

2. Allow the kernel to query the environment it is in.

So, for example, if your kernel needs to be loaded in a specific VESA mode (which is a bad idea, by the way), you can inform the bootloader of this, and it can take care of it for you.

To make your kernel multiboot compatible, you need to add a header structure somewhere in your kernel (actually, the header must be 4-byte aligned and lie entirely within the first 8 KB of the kernel image). Usefully, NASM has a pseudo-instruction that lets us embed specific constants in our code - 'dd'. These lines:

```

dd MBOOT_HEADER_MAGIC

dd MBOOT_HEADER_FLAGS

dd MBOOT_CHECKSUM

dd mboot

dd code

dd bss

dd end

dd start

```

Do just that. The MBOOT_* constants are defined above.

- **MBOOT_HEADER_MAGIC**: A magic number. This identifies the kernel as multiboot-compatible.

- **MBOOT_HEADER_FLAGS**: A field of flags. We ask GRUB to page-align all kernel sections (MBOOT_PAGE_ALIGN) and also to give us some memory information (MBOOT_MEM_INFO). Note that some tutorials also use MBOOT_AOUT_KLUDGE. As we are using the ELF file format, this hack is not necessary, and adding it stops GRUB giving you your symbol table when you boot up.

- **MBOOT_CHECKSUM**: This field is defined such that when the magic number, the flags and this value are added together as 32-bit numbers, the total is zero. It is for error checking.

- **mboot**: The address of the structure that we are currently writing. GRUB uses this to tell if we are expecting to be relocated.

- **code,bss,end,start**: These symbols are all defined by the linker. We use them to tell GRUB where the different sections of our kernel can be located.
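
To make the checksum rule concrete, here is a small host-side sketch of the same arithmetic in C. The function names `mboot_checksum` and `mboot_header_valid` are made up for illustration; only the constants come from boot.s above.

```c
#include <stdint.h>

#define MBOOT_HEADER_MAGIC 0x1BADB002u
#define MBOOT_PAGE_ALIGN   (1u << 0)
#define MBOOT_MEM_INFO     (1u << 1)

/* Checksum that makes magic + flags + checksum wrap to zero in 32 bits. */
static uint32_t mboot_checksum(uint32_t magic, uint32_t flags)
{
    return (uint32_t)(0u - (magic + flags));
}

/* Returns 1 when the magic is right and the three header words sum to zero. */
static int mboot_header_valid(uint32_t magic, uint32_t flags, uint32_t checksum)
{
    return magic == MBOOT_HEADER_MAGIC &&
           (uint32_t)(magic + flags + checksum) == 0u;
}
```

With the flags used in boot.s (MBOOT_PAGE_ALIGN | MBOOT_MEM_INFO, i.e. 0x3) the checksum works out to 0xE4524FFB, since 0x1BADB002 + 0x3 + 0xE4524FFB wraps around to exactly 0.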

On bootup, GRUB will load a pointer to another information structure into the EBX register. This can be used to query the environment GRUB set up for us.

## Back to the code again

So, immediately on bootup, the asm snippet tells the CPU to push the contents of EBX onto the stack (remember that EBX now contains a pointer to the multiboot information structure), disable interrupts (CLI), call our 'main' C function (which we haven't defined yet), then enter an infinite loop.

All is good, but the code won't link yet. We haven't defined main()!

## Adding some C code

Interfacing C code and assembly is dead easy. You just have to know the calling convention used. GCC on x86 uses the cdecl calling convention:

- All parameters to a function are passed on the stack.

- The parameters are pushed right-to-left.

- The return value of a function is returned in EAX.

...so the function call:

```

d = func(a, b, c);

```

Becomes:

```

push dword [c]

push dword [b]

push dword [a]

call func

add esp, 12      ; cdecl: the caller removes the three arguments

mov [d], eax

```

See? Nothing to it! So you can see that, in our asm snippet above, 'push ebx' is actually passing the value of ebx as a parameter to the function main().

## C code

```

// main.c -- Defines the C-code kernel entry point, calls initialisation routines.

// Made for JamesM's tutorials

int main(struct multiboot *mboot_ptr)

{

  // All our initialisation calls will go in here.

  return 0xDEADBABA;

}

```

Here's our first incarnation of the main() function. As you can see, we've made it take one parameter - a pointer to a multiboot struct. We'll define that later (we don't actually need to define it for the code to compile!).
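
For orientation, here is a hedged sketch of what the start of such a structure looks like. The field order follows the Multiboot (version 1) specification, but the struct name `multiboot_info_head`, the macro `MBI_FLAG_MEM` and the helper `mbi_reported_kib` are assumptions for this example and may not match the definitions used later in this repository.

```c
#include <stdint.h>

/* First fields of the Multiboot (version 1) information structure.
   The struct name is hypothetical; the field order follows the specification. */
struct multiboot_info_head {
    uint32_t flags;       /* bit n set => the matching fields are valid  */
    uint32_t mem_lower;   /* KiB of memory below 1 MiB (valid if bit 0)  */
    uint32_t mem_upper;   /* KiB of memory above 1 MiB (valid if bit 0)  */
    uint32_t boot_device; /* BIOS boot device (valid if bit 1)           */
    uint32_t cmdline;     /* physical address of command line (bit 2)    */
    uint32_t mods_count;  /* number of boot modules (bit 3)              */
    uint32_t mods_addr;   /* physical address of the module list (bit 3) */
};

#define MBI_FLAG_MEM (1u << 0)

/* Sum of the lower and upper memory sizes in KiB, or 0 if the bootloader
   did not fill in the memory fields. */
static uint32_t mbi_reported_kib(const struct multiboot_info_head *mbi)
{
    if ((mbi->flags & MBI_FLAG_MEM) == 0)
        return 0;
    return mbi->mem_lower + mbi->mem_upper;
}
```

Because boot.s sets MBOOT_MEM_INFO, a compliant bootloader should set bit 0 of `flags`, so a kernel could call something like `mbi_reported_kib(mboot_ptr)` early in main().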

All the function does is return a constant - 0xDEADBABA. That constant is unusual enough that it should stand out at you when we run the program in a second.

Copy **bochs/x86osdev/boot_code/floppy.img** to the **bochs/** directory and run the Bochs debugger.

You'll see GRUB for a few seconds then the kernel will run. It doesn't actually do anything, so it'll just freeze, saying 'starting up...'.

Press **Break [^C]** and look at the **EAX** value:

![bootcodebreak](img/bootcodebreak.png)

Also, if you open **bochsout.txt**, at the bottom you should see something like:

```

00074621500i[CPU  ] | EAX=deadbaba  EBX=0002d000  ECX=0001edd0 EDX=00000001

00074621500i[CPU  ] | ESP=00067ec8  EBP=00067ee0  ESI=00053c76 EDI=00053c77

00074621500i[CPU  ] | IOPL=0 id vip vif ac vm rf nt of df if tf sf zf af pf cf

00074621500i[CPU  ] | SEG selector     base    limit G D

00074621500i[CPU  ] | SEG sltr(index|ti|rpl)     base    limit G D

00074621500i[CPU  ] |  CS:0008( 0001| 0|  0) 00000000 000fffff 1 1

00074621500i[CPU  ] |  DS:0010( 0002| 0|  0) 00000000 000fffff 1 1

00074621500i[CPU  ] |  SS:0010( 0002| 0|  0) 00000000 000fffff 1 1

00074621500i[CPU  ] |  ES:0010( 0002| 0|  0) 00000000 000fffff 1 1

00074621500i[CPU  ] |  FS:0010( 0002| 0|  0) 00000000 000fffff 1 1

00074621500i[CPU  ] |  GS:0010( 0002| 0|  0) 00000000 000fffff 1 1

00074621500i[CPU  ] | EIP=00100027 (00100027)

00074621500i[CPU  ] | CR0=0x00000011 CR1=0 CR2=0x00000000

00074621500i[CPU  ] | CR3=0x00000000 CR4=0x00000000

00074621500i[CPU  ] >> jmp .+0xfffffffe (0x00100027) : EBFE

```

Notice what the value of EAX is? 0xDEADBABA - the return value of main(). Congratulations, you now have a multiboot compatible assembly trampoline, and you're ready to start printing to the screen!

![genesis_bochs](img/genesis_bochs.png)

## More info

- https://wiki.osdev.org/GRUB

- https://wiki.osdev.org/Multiboot

- https://wiki.osdev.org/GRUB_Legacy

- https://wiki.osdev.org/Bootloader

- https://wiki.osdev.org/Rolling_Your_Own_Bootloader

- https://wiki.osdev.org/Bare_Bones

- https://wiki.osdev.org/Category:Babystep

# Screen - bochs/x86osdev/screen/floppy.img

So, now that we have a 'kernel' that can run and stick itself into an infinite loop, it's time to get something interesting appearing on the screen. Along with serial I/O, the monitor will be your most important ally in the debugging battle.

Your kernel gets booted by GRUB in text mode. That is, it has available to it a framebuffer (area of memory) that controls a screen of characters (not pixels) 80 wide by 25 high. This will be the mode your kernel will operate in until you get into the world of VESA (which will not be covered in this tutorial).

The area of memory known as the framebuffer is accessible just like normal RAM, at address 0xB8000. It is important to note, however, that it is not actually normal RAM. It is part of the VGA controller's dedicated video memory that has been memory-mapped via hardware into your linear address space. This is an important distinction.

The framebuffer is just an array of 16-bit words, each 16-bit value representing the display of one character. The offset from the start of the framebuffer of the word that specifies a character at position x, y is:

```

(y * 80 + x) * 2

```

What's important to note is that the '* 2' is there only because each element is 2 bytes (16 bits) long. If you're indexing an array of 16-bit values, for example, your index would just be y*80+x.
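
The two ways of addressing a cell can be sketched as follows. This is a host-side illustration only; `vga_cell_index` and `vga_byte_offset` are hypothetical helper names, not functions from this project.

```c
#include <stddef.h>

#define VGA_COLS 80
#define VGA_ROWS 25

/* Index of cell (x, y) in an array of 16-bit words. */
static size_t vga_cell_index(size_t x, size_t y)
{
    return y * VGA_COLS + x;
}

/* Byte offset of cell (x, y) from the start of the framebuffer. */
static size_t vga_byte_offset(size_t x, size_t y)
{
    return vga_cell_index(x, y) * 2;
}
```

For the bottom-right cell (x = 79, y = 24) the index is 1999 and the byte offset is 3998, so the whole 80x25 screen occupies 4000 bytes.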

In ASCII (Unicode is not supported in text mode), 8 bits are used to represent a character. That gives us 8 more bits which are unused. The VGA hardware uses these to designate foreground and background colours (4 bits each). The splitting of this 16-bit value is shown in the word format diagram below.

4 bits for a colour code gives us 16 possible colours we can display:

- 0: black 

- 1: blue 

- 2: green 

- 3: cyan 

- 4: red

- 5: magenta 

- 6: brown

- 7: light grey 

- 8: dark grey 

- 9: light blue 

- 10: light green 

- 11: light cyan 

- 12: light red 

- 13: light magenta

- 14: light brown

- 15: white

The VGA controller also has some ports on the main I/O bus, which you can use to send it specific instructions. Among others, it has a control register at 0x3D4 and a data register at 0x3D5. We will use these to instruct the controller to update its cursor position (the flashy underbar thing that tells you where your next character will go).

Word format:

![the_screen_word_format](img/the_screen_word_format.png)
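
As a minimal sketch of that layout (character in the low byte, foreground colour in bits 8-11, background colour in bits 12-15), the helpers below build an attribute byte and a full 16-bit cell. The names `vga_attribute` and `vga_cell` are illustrative assumptions, not identifiers from the monitor code.

```c
#include <stdint.h>

/* Pack two 4-bit colour codes: background in the upper nibble,
   foreground in the lower nibble. */
static uint8_t vga_attribute(uint8_t fg, uint8_t bg)
{
    return (uint8_t)(((bg & 0x0F) << 4) | (fg & 0x0F));
}

/* Combine an ASCII character with its attribute byte into one 16-bit cell. */
static uint16_t vga_cell(char c, uint8_t fg, uint8_t bg)
{
    return (uint16_t)((uint8_t)c | ((uint16_t)vga_attribute(fg, bg) << 8));
}
```

For example, a white (15) 'A' on black (0) is 0x0F41: 0x41 is the ASCII code for 'A' and 0x0F is the attribute byte.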

Firstly, we need a few more commonly-used global functions. common.c and common.h include functions for writing to and reading from the I/O bus, and some typedefs that will make it easier for us to write portable code. They are also the ideal place to put functions such as memcpy/memset etc. I have left them for you to implement! :)

```

// common.h -- Defines typedefs and some global functions.

// From JamesM's kernel development tutorials.

#ifndef COMMON_H

#define COMMON_H

// Some nice typedefs, to standardise sizes across platforms.

// These typedefs are written for 32-bit X86.

typedef unsigned int   u32int;

typedef          int   s32int;

typedef unsigned short u16int;

typedef          short s16int;

typedef unsigned char  u8int;

typedef          char  s8int;

void outb(u16int port, u8int value);

u8int inb(u16int port);

u16int inw(u16int port);

#endif

```
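
If your toolchain ships the freestanding header `<stdint.h>` (GCC does), one possible alternative is to build the same names on top of the fixed-width types and let the compiler check the sizes. This is only a sketch under that assumption, not the project's header; note that `int8_t` is a signed char, whereas the original `s8int` is a plain char.

```c
#include <stdint.h>

/* Same typedef names as common.h, expressed with fixed-width types. */
typedef uint32_t u32int;
typedef int32_t  s32int;
typedef uint16_t u16int;
typedef int16_t  s16int;
typedef uint8_t  u8int;
typedef int8_t   s8int;

/* C11 compile-time checks: the build fails if a size is not what we expect. */
_Static_assert(sizeof(u32int) == 4, "u32int must be 32 bits");
_Static_assert(sizeof(u16int) == 2, "u16int must be 16 bits");
_Static_assert(sizeof(u8int)  == 1, "u8int must be 8 bits");
```

The static assertions cost nothing at run time and catch a mismatched compiler target before the kernel ever boots.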

```

// common.c -- Defines some global functions.

// From JamesM's kernel development tutorials.

#include "common.h"

// Write a byte out to the specified port.

void outb(u16int port, u8int value)

{

    asm volatile ("outb %1, %0" : : "dN" (port), "a" (value));

}

u8int inb(u16int port)

{

   u8int ret;

   asm volatile("inb %1, %0" : "=a" (ret) : "dN" (port));

   return ret;

}

u16int inw(u16int port)

{

   u16int ret;

   asm volatile ("inw %1, %0" : "=a" (ret) : "dN" (port));

   return ret;

}

```

Disassembly of the compiled functions (outb, inb and inw):

```

001003cc <outb>:

  1003cc:       f3 0f 1e fb             endbr32

  1003d0:       55                      push   ebp

  1003d1:       89 e5                   mov    ebp,esp

  1003d3:       83 ec 08                sub    esp,0x8

  1003d6:       e8 29 0c 00 00          call   101004 <__x86.get_pc_thunk.ax>

  1003db:       05 25 1c 00 00          add    eax,0x1c25

  1003e0:       8b 45 08                mov    eax,DWORD PTR [ebp+0x8]

  1003e3:       8b 55 0c                mov    edx,DWORD PTR [ebp+0xc]

  1003e6:       66 89 45 fc             mov    WORD PTR [ebp-0x4],ax

  1003ea:       89 d0                   mov    eax,edx

  1003ec:       88 45 f8                mov    BYTE PTR [ebp-0x8],al

  1003ef:       0f b7 55 fc             movzx  edx,WORD PTR [ebp-0x4]

  1003f3:       0f b6 45 f8             movzx  eax,BYTE PTR [ebp-0x8]

  1003f7:       ee                      out    dx,al

  1003f8:       90                      nop

  1003f9:       c9                      leave

  1003fa:       c3                      ret

001003fb <inb>:

  1003fb:       f3 0f 1e fb             endbr32

  1003ff:       55                      push   ebp

  100400:       89 e5                   mov    ebp,esp

  100402:       83 ec 14                sub    esp,0x14

  100405:       e8 fa 0b 00 00          call   101004 <__x86.get_pc_thunk.ax>

  10040a:       05 f6 1b 00 00          add    eax,0x1bf6

  10040f:       8b 45 08                mov    eax,DWORD PTR [ebp+0x8]

  100412:       66 89 45 ec             mov    WORD PTR [ebp-0x14],ax

  100416:       0f b7 45 ec             movzx  eax,WORD PTR [ebp-0x14]

  10041a:       89 c2                   mov    edx,eax

  10041c:       ec                      in     al,dx

  10041d:       88 45 ff                mov    BYTE PTR [ebp-0x1],al

  100420:       0f b6 45 ff             movzx  eax,BYTE PTR [ebp-0x1]

  100424:       c9                      leave

  100425:       c3                      ret

00100426 <inw>:

  100426:       f3 0f 1e fb             endbr32

  10042a:       55                      push   ebp

  10042b:       89 e5                   mov    ebp,esp

  10042d:       83 ec 14                sub    esp,0x14

  100430:       e8 cf 0b 00 00          call   101004 <__x86.get_pc_thunk.ax>

  100435:       05 cb 1b 00 00          add    eax,0x1bcb

  10043a:       8b 45 08                mov    eax,DWORD PTR [ebp+0x8]

  10043d:       66 89 45 ec             mov    WORD PTR [ebp-0x14],ax

  100441:       0f b7 45 ec             movzx  eax,WORD PTR [ebp-0x14]

  100445:       89 c2                   mov    edx,eax

  100447:       66 ed                   in     ax,dx

  100449:       66 89 45 fe             mov    WORD PTR [ebp-0x2],ax

  10044d:       0f b7 45 fe             movzx  eax,WORD PTR [ebp-0x2]

  100451:       c9                      leave

  100452:       c3                      ret

```

## Monitor code

A simple header file:

```

// monitor.h -- Defines the interface for monitor.c

// From JamesM's kernel development tutorials.

#ifndef MONITOR_H

#define MONITOR_H

#include "common.h"

// Write a single character out to the screen.

void monitor_put(char c);

// Clear the screen to all black.

void monitor_clear();

// Output a null-terminated ASCII string to the monitor.

void monitor_write(char *c);

#endif // MONITOR_H

```

## Moving the cursor

To move the hardware cursor, we must first work out the linear offset of the x,y cursor coordinate. We use the equation above, but without the '* 2': the controller counts in character cells, not bytes, so the offset is y * 80 + x. Next, we have to send this offset to the VGA controller. Its registers are only 8 bits wide, so it accepts the 16-bit location as two bytes. We send the controller's command port (0x3D4) the command 14 to tell it we are sending the high byte, then send that byte to port 0x3D5. We then repeat with the low byte, but send the command 15 instead.

```

// Updates the hardware cursor.

static void move_cursor()

{

   // The screen is 80 characters wide...

   u16int cursorLocation = cursor_y * 80 + cursor_x;

   outb(0x3D4, 14);                  // Tell the VGA board we are setting the high cursor byte.

   outb(0x3D5, cursorLocation >> 8); // Send the high cursor byte.

   outb(0x3D4, 15);                  // Tell the VGA board we are setting the low cursor byte.

   outb(0x3D5, cursorLocation);      // Send the low cursor byte.

}

```
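
To see which two bytes end up on the data port, here is a small host-side sketch of the split. `struct cursor_bytes` and `split_cursor` are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* The two bytes sent to port 0x3D5 after selecting registers 14 and 15. */
struct cursor_bytes {
    uint8_t high;
    uint8_t low;
};

/* Split the linear cursor position (in character cells) into two bytes. */
static struct cursor_bytes split_cursor(uint8_t x, uint8_t y)
{
    uint16_t location = (uint16_t)(y * 80 + x);
    struct cursor_bytes b;

    b.high = (uint8_t)(location >> 8);
    b.low  = (uint8_t)(location & 0xFF);
    return b;
}
```

For the bottom-right corner (x = 79, y = 24) the location is 1999 = 0x07CF, so the high byte is 0x07 and the low byte is 0xCF.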

## Scrolling the screen

At some point we're going to fill up the screen with text. It would be nice if, when we do that, the screen acted like a terminal and scrolled up one line. Actually, this really isn't very difficult to do:

```

// Scrolls the text on the screen up by one line.

static void scroll()

{

   // Get a space character with the default colour attributes.

   u8int attributeByte = (0 /*black*/ << 4) | (15 /*white*/ & 0x0F);

   u16int blank = 0x20 /* space */ | (attributeByte << 8);

   // Row 25 is the end, this means we need to scroll up

   if(cursor_y >= 25)

   {

       // Move the current text chunk that makes up the screen

       // back in the buffer by a line

       int i;

       for (i = 0*80; i < 24*80; i++)

       {

           video_memory[i] = video_memory[i+80];

       }

       // The last line should now be blank. Do this by writing

       // 80 spaces to it.

       for (i = 24*80; i < 25*80; i++)

       {

           video_memory[i] = blank;

       }

       // The cursor should now be on the last line.

       cursor_y = 24;

   }

}

```
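
If you want to try the scrolling logic without booting anything, the same two loops can be run against an ordinary array. The sketch below assumes a caller-supplied buffer standing in for video memory; `scroll_up_one` and its parameters are illustrative names, not part of monitor.c.

```c
#include <stdint.h>

#define COLS 80
#define ROWS 25

/* Scroll a ROWS x COLS text buffer up by one line and blank the last row.
   `buf` stands in for video memory so the routine can be run on a host. */
static void scroll_up_one(uint16_t *buf, uint16_t blank)
{
    int i;

    /* Every cell takes the value of the cell one row below it. */
    for (i = 0; i < (ROWS - 1) * COLS; i++)
        buf[i] = buf[i + COLS];

    /* The last row has no row below it, so fill it with blanks. */
    for (i = (ROWS - 1) * COLS; i < ROWS * COLS; i++)
        buf[i] = blank;
}
```

After one call, the old second row is at the top, the old last row is on row 23, and row 24 holds only the blank cell value.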

## Writing a character to the screen

Now the code gets a little more complex. But, if you look at it, you'll see that most of it is logic as to where to put the cursor next - there really isn't much difficult there.

```

// Writes a single character out to the screen.

void monitor_put(char c)

{

   // The background colour is black (0), the foreground is white (15).

   u8int backColour = 0;

   u8int foreColour = 15;

   // The attribute byte is made up of two nibbles - the lower being the

   // foreground colour, and the upper the background colour.

   u8int  attributeByte = (backColour << 4) | (foreColour & 0x0F);

   // The attribute byte is the top 8 bits of the word we have to send to the

   // VGA board.

   u16int attribute = attributeByte << 8;

   u16int *location;

   // Handle a backspace, by moving the cursor back one space

   if (c == 0x08 && cursor_x)

   {

       cursor_x--;

   }

   // Handle a tab by increasing the cursor's X, but only to a point

   // where it is divisible by 8.

   else if (c == 0x09)

   {

       cursor_x = (cursor_x+8) & ~(8-1);

   }

   // Handle carriage return

   else if (c == '\r')

   {

       cursor_x = 0;

   }

   // Handle newline by moving cursor back to left and increasing the row

   else if (c == '\n')

   {

       cursor_x = 0;

       cursor_y++;

   }

   // Handle any other printable character.

   else if(c >= ' ')

   {

       location = video_memory + (cursor_y*80 + cursor_x);

       *location = c | attribute;

       cursor_x++;

   }

   // Check if we need to insert a new line because we have reached the end

   // of the screen.

   if (cursor_x >= 80)

   {

       cursor_x = 0;

       cursor_y ++;

   }

   // Scroll the screen if needed.

   scroll();

   // Move the hardware cursor.

   move_cursor();

}

```

See? It's pretty simple! The bit that actually does the writing is here:

```

location = video_memory + (cursor_y*80 + cursor_x);

*location = c | attribute;

```

- Set 'location' to point to the linear address of the word corresponding to the current cursor position (see equation above).

- Set the value at 'location' to be the bitwise OR of the character and 'attribute'. Remember that we shifted 'attribute' left 8 bits above, so actually we're just setting 'c' as the lower byte of 'attribute'.
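
The tab handling in monitor_put relies on a small bit trick that is easy to check in isolation. The helper below is a sketch with a made-up name (`next_tab_stop`); it simply wraps the same expression.

```c
#include <stdint.h>

/* Next column that is a multiple of 8 and strictly to the right of x.
   Adding 8 and clearing the low three bits rounds up to the next tab stop. */
static uint8_t next_tab_stop(uint8_t x)
{
    return (uint8_t)((x + 8) & ~(8 - 1));
}
```

Columns 0 to 7 all move to column 8, column 8 moves to 16, and a tab at column 75 yields 80, which the end-of-line check in monitor_put then turns into a new line.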

## Clearing the screen

Clearing the screen is also dead easy. Just fill it with loads of spaces:

```

// Clears the screen, by copying lots of spaces to the framebuffer.

void monitor_clear()

{

   // Make an attribute byte for the default colours

   u8int attributeByte = (0 /*black*/ << 4) | (15 /*white*/ & 0x0F);

   u16int blank = 0x20 /* space */ | (attributeByte << 8);

   int i;

   for (i = 0; i < 80*25; i++)

   {

       video_memory[i] = blank;

   }

   // Move the hardware cursor back to the start.

   cursor_x = 0;

   cursor_y = 0;

   move_cursor();

}

```

## Writing a string

```

// Outputs a null-terminated ASCII string to the monitor.

void monitor_write(char *c)

{

   int i = 0;

   while (c[i])

   {

       monitor_put(c[i++]);

   }

}

```

## Summary

If you put all that code together, you can add a couple of lines to your main.c file:

```

monitor_clear();

monitor_write("Hello, world!");

```

Et voila - a text output function! Not bad for a couple of minutes' work, eh?

## Extensions

Apart from implementing memcpy/memset/strlen/strcmp etc, there are a few other functions that will make life easier for you.

```

void monitor_write_hex(u32int n)

{

   // TODO: implement this yourself!

}

void monitor_write_dec(u32int n)

{

   // TODO: implement this yourself!

}

```

The function names should be pretty self explanatory -- writing in hexadecimal really is required if you're going to check the validity of pointers. Decimal is optional but it's nice to see something in base 10 every once in a while!
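If you get stuck, one possible approach is sketched below. It formats into a caller-supplied buffer instead of printing directly, so it can be tried on the host first. `format_hex` and `format_dec` are hypothetical names, not part of the tutorial code; `monitor_write_hex` and `monitor_write_dec` could then simply pass the buffer to `monitor_write`.

```c
#include <stdint.h>

typedef uint32_t u32int;  /* stand-in for the typedef in common.h */

/* Hypothetical helper: write n as "0x" plus 8 hex digits.
   buf must hold at least 11 bytes. */
void format_hex(u32int n, char *buf)
{
    static const char digits[] = "0123456789ABCDEF";
    int i;

    buf[0] = '0';
    buf[1] = 'x';
    for (i = 0; i < 8; i++)
    {
        /* Take nibbles from most significant to least significant. */
        buf[2 + i] = digits[(n >> (28 - 4 * i)) & 0xF];
    }
    buf[10] = '\0';
}

/* Hypothetical helper: write n in decimal.
   buf must hold at least 11 bytes (10 digits plus the terminator). */
void format_dec(u32int n, char *buf)
{
    char tmp[10];
    int len = 0;
    int i;

    /* Collect digits least significant first; do-while handles n == 0. */
    do
    {
        tmp[len++] = (char)('0' + (n % 10));
        n /= 10;
    } while (n != 0);

    /* Reverse them into the output buffer. */
    for (i = 0; i < len; i++)
    {
        buf[i] = tmp[len - 1 - i];
    }
    buf[len] = '\0';
}
```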

You could also have a scout at the Linux 0.01 code - that has an implementation of vsprintf which is quite neat and tidy. You could copy that function then use it to implement printf(), which will make your life a hell of a lot easier when it comes to debugging.

Copy floppy.img from project_dir/ to bochs/ directory and run Bochs debugger.

![the_screen_screenshot](img/the_screen_screenshot.png)

## More info

- https://wiki.osdev.org/Printing_To_Screen

- https://wiki.osdev.org/Inline_Assembly/Examples

- https://wiki.osdev.org/Inline_Assembly

# GDT and IDT - bochs/x86osdev/gdt_idt/floppy.img

The GDT and the IDT are descriptor tables. They are arrays of flags and bit values describing the operation of either the segmentation system (in the case of the GDT), or the interrupt vector table (IDT).

They are, unfortunately, a little theory-heavy, but bear with it because it'll be over soon!

## GDT - Global Descriptor Table

The x86 architecture has two methods of memory protection and of providing virtual memory - segmentation and paging.

With segmentation, every memory access is evaluated with respect to a segment. That is, the memory address is added to the segment's base address, and checked against the segment's length. You can think of a segment as a window into the address space. The process does not know it's a window; all it sees is a linear address space starting at zero and going up to the segment length.

With paging, the address space is split into (usually 4KB, but this can change) blocks, called pages. Each page can be mapped into physical memory - mapped onto what is called a 'frame'. Or, it can be unmapped. Like this you can create virtual memory spaces.
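The two mechanisms can be modelled in a few lines of host-side C. This is only a toy sketch under simplifying assumptions: `segment_t`, `segment_translate`, `page_number` and `page_offset` are invented names, and real hardware raises a fault rather than returning 0.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u  /* 4KB pages, the usual x86 size */

/* Hypothetical model of a segment: a window starting at base. */
typedef struct
{
    uint32_t base;
    uint32_t limit;  /* highest valid offset inside the segment */
} segment_t;

/* Returns 1 and stores the linear address if offset is inside the
   segment, 0 otherwise. */
int segment_translate(const segment_t *seg, uint32_t offset, uint32_t *linear)
{
    if (offset > seg->limit)
    {
        return 0;
    }
    *linear = seg->base + offset;
    return 1;
}

/* Split a linear address into a page number and an offset within the page. */
uint32_t page_number(uint32_t addr) { return addr / PAGE_SIZE; }
uint32_t page_offset(uint32_t addr) { return addr % PAGE_SIZE; }
```

For example, address 0x12345 lies in page 0x12 at offset 0x345; paging then decides which physical frame (if any) page 0x12 is mapped to.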

Both of these methods have their advantages, but paging is much better. Segmentation is, although still usable, fast becoming obsolete as a method of memory protection and virtual memory. In fact, the x86-64 architecture requires a flat memory model (one segment with a base of 0 and a limit of 0xFFFFFFFF) for some of its instructions to operate properly.

Segmentation is, however, totally in-built into the x86 architecture. It's impossible to get around it. So here we're going to show you how to set up your own Global Descriptor Table - a list of segment descriptors.

As mentioned before, we're going to try and set up a flat memory model. The segment's window should start at 0x00000000 and extend to 0xFFFFFFFF (the end of memory). However, there is one thing that segmentation can do that paging can't, and that's set the ring level.

A ring is a privilege level - zero being the most privileged, and three being the least. Processes in ring zero are said to be running in kernel-mode, or supervisor-mode, because they can use instructions like sti and cli, something which most processes can't. Normally, rings 1 and 2 are unused. They can, technically, access a greater subset of the supervisor-mode instructions than ring 3 can. Some microkernel architectures use these for running server processes, or drivers.

A segment descriptor carries inside it a number representing the ring level it applies to. To change ring levels (which we'll do later on), among other things, we need segments that represent both ring 0 and ring 3.

OK, that was one humungous chunk of theory, let's get into the nitty gritty of implementing this.

One thing I forgot to mention is that GRUB sets a GDT up for you. The problem is that you don't know where that GDT is, or what's in it. So you could accidentally overwrite it, then your computer would triple-fault and reset. Not clever.

In the x86, we have 6 segment registers. Each holds a selector, which for our purposes is an offset into the GDT. They are cs (code segment), ds (data segment), es (extra segment), fs, gs, ss (stack segment). The code segment register must reference a descriptor which is set as a 'code segment'. There is a flag for this in the access byte. The rest should all reference a descriptor which is set as a 'data segment'.

Access byte format:

![gdt_idt_gdt_format_2](img/gdt_idt_gdt_format_2.png)

## descriptor_tables.h

A GDT entry looks like this:

```

// This structure contains the value of one GDT entry.

// We use the attribute 'packed' to tell GCC not to change

// any of the alignment in the structure.

struct gdt_entry_struct

{

   u16int limit_low;           // The lower 16 bits of the limit.

   u16int base_low;            // The lower 16 bits of the base.

   u8int  base_middle;         // The next 8 bits of the base.

   u8int  access;              // Access flags, determine what ring this segment can be used in.

   u8int  granularity;

   u8int  base_high;           // The last 8 bits of the base.

} __attribute__((packed));

typedef struct gdt_entry_struct gdt_entry_t;

```

Most of those fields should be self-explanatory. The format of the access byte is shown in the diagram above, and the format of the granularity byte is shown in the diagram further below.

- **P**: Is segment present? (1 = Yes)

- **DPL**: Descriptor privilege level - Ring 0 - 3.

- **DT**: Descriptor type

- **Type**: Segment type - code segment / data segment.

- **G**: Granularity (0 = 1 byte, 1 = 4kbyte)

- **D**: Operand size (0 = 16bit, 1 = 32bit)

- **0**: Should always be zero.

- **A**: Available for system use (always zero).

To tell the processor where to find our GDT, we have to give it the address of a special pointer structure:

```

struct gdt_ptr_struct

{

   u16int limit;               // The size of the GDT in bytes, minus one.

   u32int base;                // The address of the first gdt_entry_t struct.

}

 __attribute__((packed));

typedef struct gdt_ptr_struct gdt_ptr_t;

```

The base is the address of the first entry in our GDT, the limit being the size of the table minus one (the last valid address in the table).
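Because the processor reads these structures byte by byte, it is worth checking that the compiler really lays them out as 8 and 6 bytes. The sketch below repeats the two structs so it can be compiled on its own with GCC or Clang (the `packed` attribute is a compiler extension); `gdt_limit_for` is a hypothetical helper, not part of the tutorial code.

```c
#include <stdint.h>

/* Stand-ins for the typedefs from common.h (assumed widths). */
typedef uint8_t  u8int;
typedef uint16_t u16int;
typedef uint32_t u32int;

struct gdt_entry_struct
{
    u16int limit_low;
    u16int base_low;
    u8int  base_middle;
    u8int  access;
    u8int  granularity;
    u8int  base_high;
} __attribute__((packed));
typedef struct gdt_entry_struct gdt_entry_t;

struct gdt_ptr_struct
{
    u16int limit;
    u32int base;
} __attribute__((packed));
typedef struct gdt_ptr_struct gdt_ptr_t;

/* Hypothetical helper: the limit is the table size in bytes minus one,
   i.e. the offset of the last valid byte in the table. */
u16int gdt_limit_for(u32int entries)
{
    return (u16int)(sizeof(gdt_entry_t) * entries - 1);
}
```

For the five-entry table used in this chapter, `gdt_limit_for(5)` is 5 * 8 - 1 = 39. Without `packed`, `gdt_ptr_t` would typically be padded to 8 bytes and `lgdt` would read a wrong base address.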

Those struct definitions should go in a header file, descriptor_tables.h, along with a prototype.

```

// Initialisation function is publicly accessible.

void init_descriptor_tables();

```

Granularity byte format:

![gdt_idt_gdt_format_1](img/gdt_idt_gdt_format_1.png)

## descriptor_tables.c

In descriptor_tables.c, we have a few declarations:

```

//

// descriptor_tables.c - Initialises the GDT and IDT, and defines the

// default ISR and IRQ handler.

// Based on code from Bran's kernel development tutorials.

// Rewritten for JamesM's kernel development tutorials.

//

#include "common.h"

#include "descriptor_tables.h"

// Lets us access our ASM functions from our C code.

extern void gdt_flush(u32int);

// Internal function prototypes.

static void init_gdt();

static void gdt_set_gate(s32int,u32int,u32int,u8int,u8int);

gdt_entry_t gdt_entries[5];

gdt_ptr_t   gdt_ptr;

idt_entry_t idt_entries[256];

idt_ptr_t   idt_ptr;

```

Notice the gdt_flush function - this will be defined in an ASM file, and will load our GDT pointer for us.

```

// Initialisation routine - zeroes all the interrupt service routines,

// initialises the GDT and IDT.

void init_descriptor_tables()

{

   // Initialise the global descriptor table.

   init_gdt();

}

static void init_gdt()

{

   gdt_ptr.limit = (sizeof(gdt_entry_t) * 5) - 1;

   gdt_ptr.base  = (u32int)&gdt_entries;

   gdt_set_gate(0, 0, 0, 0, 0);                // Null segment

   gdt_set_gate(1, 0, 0xFFFFFFFF, 0x9A, 0xCF); // Code segment

   gdt_set_gate(2, 0, 0xFFFFFFFF, 0x92, 0xCF); // Data segment

   gdt_set_gate(3, 0, 0xFFFFFFFF, 0xFA, 0xCF); // User mode code segment

   gdt_set_gate(4, 0, 0xFFFFFFFF, 0xF2, 0xCF); // User mode data segment

   gdt_flush((u32int)&gdt_ptr);

}

// Set the value of one GDT entry.

static void gdt_set_gate(s32int num, u32int base, u32int limit, u8int access, u8int gran)

{

   gdt_entries[num].base_low    = (base & 0xFFFF);

   gdt_entries[num].base_middle = (base >> 16) & 0xFF;

   gdt_entries[num].base_high   = (base >> 24) & 0xFF;

   gdt_entries[num].limit_low   = (limit & 0xFFFF);

   gdt_entries[num].granularity = (limit >> 16) & 0x0F;

   gdt_entries[num].granularity |= gran & 0xF0;

   gdt_entries[num].access      = access;

}

```

Let's just analyse that code for a moment. init_gdt initially sets up the gdt pointer structure - the limit is the size of each gdt entry * 5, minus one - we have 5 entries. Why 5? Well, we have a code and data segment descriptor for the kernel, code and data segment descriptors for user mode, and a null entry. The null entry must be present, or bad things will happen.

init_gdt then sets up the 5 descriptors, by calling gdt_set_gate. gdt_set_gate just does some severe bit-twiddling and shifting, and should be self-explanatory with a hard stare at it. Notice that the only thing that changes between the 4 segment descriptors is the access byte - 0x9A, 0x92, 0xFA, 0xF2. You can see, if you map out the bits and compare them to the format diagram above, the bits that are changing are the type and DPL fields. DPL is the descriptor privilege level - 3 for user code and 0 for kernel code. Type specifies whether the segment is a code segment or a data segment (the processor checks this often, and it can be the source of much frustration).
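If mapping the bits out by hand sounds tedious, a small host-side decoder does the same job. This is a sketch that assumes the standard access byte layout from the diagram (P in bit 7, DPL in bits 6-5, DT in bit 4, Type in bits 3-0); `access_fields_t` and `decode_access` are invented names.

```c
#include <stdint.h>

/* Hypothetical container for the decoded access byte fields. */
typedef struct
{
    uint8_t present;  /* P    - bit 7      */
    uint8_t dpl;      /* DPL  - bits 6-5   */
    uint8_t dt;       /* DT   - bit 4      */
    uint8_t type;     /* Type - bits 3-0   */
} access_fields_t;

access_fields_t decode_access(uint8_t access)
{
    access_fields_t f;
    f.present = (access >> 7) & 0x1;
    f.dpl     = (access >> 5) & 0x3;
    f.dt      = (access >> 4) & 0x1;
    f.type    = access & 0xF;
    return f;
}
```

Decoding the four values gives Type 0xA (code) with DPL 0 for 0x9A, Type 0x2 (data) with DPL 0 for 0x92, and the same two types with DPL 3 for 0xFA and 0xF2. Bit 3 of the Type field is what distinguishes code from data.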

Finally, we have our ASM function that will write the GDT pointer.

```

[GLOBAL gdt_flush]    ; Allows the C code to call gdt_flush().

gdt_flush:

   mov eax, [esp+4]  ; Get the pointer to the GDT, passed as a parameter.

   lgdt [eax]        ; Load the new GDT pointer

   mov ax, 0x10      ; 0x10 is the offset in the GDT to our data segment

   mov ds, ax        ; Load all data segment selectors

   mov es, ax

   mov fs, ax

   mov gs, ax

   mov ss, ax

   jmp 0x08:.flush   ; 0x08 is the offset to our code segment: Far jump!

.flush:

   ret

```

   

This function takes the first parameter passed to it (in esp+4), loads the value it points to into the GDT register (using the lgdt instruction), then loads the segment selectors for the data and code segments. Notice that each GDT entry is 8 bytes, and the kernel code descriptor is the second segment, so its offset is 0x08. Likewise the kernel data descriptor is the third, so its offset is 16 = 0x10. Here we move the value 0x10 into the data segment registers ds,es,fs,gs,ss. To change the code segment is slightly different; we must do a far jump. This changes the CS implicitly.
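Strictly speaking, the values 0x08 and 0x10 are segment selectors: the GDT index sits in bits 15-3, bit 2 picks the table (0 = GDT), and bits 1-0 hold the requested privilege level. For ring 0 selectors the low three bits are zero, which is why the selector equals the byte offset. A minimal sketch, with `make_selector` as a hypothetical helper:

```c
#include <stdint.h>

/* Hypothetical helper: build a selector for GDT entry 'index'
   with requested privilege level 'rpl' (0-3). Bit 2 is left at 0,
   meaning "look in the GDT". */
uint16_t make_selector(uint16_t index, uint16_t rpl)
{
    return (uint16_t)((index << 3) | (rpl & 0x3));
}
```

`make_selector(1, 0)` and `make_selector(2, 0)` give the 0x08 and 0x10 used above. The user-mode entries 3 and 4 would be selected with RPL 3, giving 0x1B and 0x23.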

## IDT - Interrupt Descriptor Table

There are times when you want to interrupt the processor. You want to stop it doing what it is doing, and force it to do something different. An example of this is when a timer or keyboard interrupt request (IRQ) fires. An interrupt is like a POSIX signal - it tells you that something of interest has happened. The processor can register 'signal handlers' (interrupt handlers) that deal with the interrupt, then return to the code that was running before it fired. Interrupts can be fired externally, via IRQs, or internally, via the 'int n' instruction. There are very useful reasons for wanting to fire interrupts from software, but that's for another chapter!

The Interrupt Descriptor Table tells the processor where to find handlers for each interrupt. It is very similar to the GDT. It is just an array of entries, each one corresponding to an interrupt number. There are 256 possible interrupt numbers, so 256 must be defined. If an interrupt occurs and there is no entry for it (even a NULL entry is fine), the processor will panic and reset.

## Faults, traps and exceptions

The processor will sometimes need to signal your kernel. Something major may have happened, such as a divide-by-zero, or a page fault. To do this, it uses the first 32 interrupts. It is therefore doubly important that all of these are mapped and non-NULL - else the CPU will triple-fault and reset (bochs will panic with an 'unhandled exception' error).

The special, CPU-dedicated interrupts are shown below.

- 0 - Division by zero exception

- 1 - Debug exception

- 2 - Non maskable interrupt

- 3 - Breakpoint exception

- 4 - 'Into detected overflow'

- 5 - Out of bounds exception

- 6 - Invalid opcode exception

- 7 - No coprocessor exception

- 8 - Double fault (pushes an error code)

- 9 - Coprocessor segment overrun

- 10 - Bad TSS (pushes an error code)

- 11 - Segment not present (pushes an error code)

- 12 - Stack fault (pushes an error code)

- 13 - General protection fault (pushes an error code)

- 14 - Page fault (pushes an error code)

- 15 - Unknown interrupt exception

- 16 - Coprocessor fault

- 17 - Alignment check exception

- 18 - Machine check exception

- 19-31 - Reserved

## descriptor_tables.h

We should add some definitions to descriptor_tables.h:

```

// A struct describing an interrupt gate.

struct idt_entry_struct

{

   u16int base_lo;             // The lower 16 bits of the address to jump to when this interrupt fires.

   u16int sel;                 // Kernel segment selector.

   u8int  always0;             // This must always be zero.

   u8int  flags;               // More flags. See documentation.

   u16int base_hi;             // The upper 16 bits of the address to jump to.

} __attribute__((packed));

typedef struct idt_entry_struct idt_entry_t;

// A struct describing a pointer to an array of interrupt handlers.

// This is in a format suitable for giving to 'lidt'.

struct idt_ptr_struct

{

   u16int limit;

   u32int base;                // The address of the first element in our idt_entry_t array.

} __attribute__((packed));

typedef struct idt_ptr_struct idt_ptr_t;

// These extern directives let us access the addresses of our ASM ISR handlers.

extern void isr0 ();

...

extern void isr31();

```

See? Very similar to the GDT entry and ptr structs. The flags field format is shown below. The lower 5 bits should be constant at 0b01110 - 14 in decimal. The DPL describes the privilege level we expect to be called from - in our case zero, but as we progress we'll have to change that to 3. The P bit signifies the entry is present. Any descriptor with this bit clear will cause an "Interrupt Not Handled" exception.

Flags byte format:

![gdt_idt_idt_format_1](img/gdt_idt_idt_format_1.png)
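To see where the flags value 0x8E used for every gate in this chapter comes from, the byte can be assembled from its fields. This is a small sketch assuming the layout in the diagram (P in bit 7, DPL in bits 6-5, constant 14 in the low five bits); `idt_flags` is an invented name.

```c
#include <stdint.h>

/* Hypothetical helper: compose the flags byte of a 32-bit interrupt gate. */
uint8_t idt_flags(uint8_t present, uint8_t dpl)
{
    return (uint8_t)(((present & 0x1) << 7)   /* P: entry is present        */
                   | ((dpl & 0x3) << 5)       /* DPL: ring allowed to call  */
                   | 0x0E);                   /* constant low bits, 0b01110 */
}
```

A present ring 0 gate is `idt_flags(1, 0)` = 0x8E. Raising the DPL to 3 ORs in 0x60 and gives 0xEE.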

## descriptor_tables.c

We need to modify this file to add our new code.

```

...

extern void idt_flush(u32int);

...

static void init_idt();

static void idt_set_gate(u8int,u32int,u16int,u8int);

...

idt_entry_t idt_entries[256];

idt_ptr_t   idt_ptr;

...

void init_descriptor_tables()

{

  init_gdt();

  init_idt();

}

...

static void init_idt()

{

   idt_ptr.limit = sizeof(idt_entry_t) * 256 -1;

   idt_ptr.base  = (u32int)&idt_entries;

   memset(&idt_entries, 0, sizeof(idt_entry_t)*256);

   idt_set_gate( 0, (u32int)isr0 , 0x08, 0x8E);

   idt_set_gate( 1, (u32int)isr1 , 0x08, 0x8E);

   ...

   idt_set_gate(31, (u32int)isr31, 0x08, 0x8E);

   idt_flush((u32int)&idt_ptr);

}

static void idt_set_gate(u8int num, u32int base, u16int sel, u8int flags)

{

   idt_entries[num].base_lo = base & 0xFFFF;

   idt_entries[num].base_hi = (base >> 16) & 0xFFFF;

   idt_entries[num].sel     = sel;

   idt_entries[num].always0 = 0;

   // We must uncomment the OR below when we get to using user-mode.

   // It sets the interrupt gate's privilege level to 3.

   idt_entries[num].flags   = flags /* | 0x60 */;

}

```

This gets added to gdt.s also:

```

[GLOBAL idt_flush]    ; Allows the C code to call idt_flush().

idt_flush:

   mov eax, [esp+4]  ; Get the pointer to the IDT, passed as a parameter.

   lidt [eax]        ; Load the IDT pointer.

   ret

```

## interrupt.s

Great! We've got code that will tell the CPU where to find our interrupt handlers - but we haven't written any yet!

When the processor receives an interrupt, it saves the contents of the essential registers (instruction pointer, stack pointer, code and stack segments, flags register) to the stack. It then finds the interrupt handler location from our IDT and jumps to it.

Now, just like POSIX signal handlers, you don't get given any information about what interrupt was called when your handler is run. So, unfortunately, we can't just have one common handler, we must write a different handler for each interrupt we want to handle. This is pretty crap, so we want to keep the amount of duplicated code to a minimum. We do this by writing many handlers that just push the interrupt number (hardcoded in the ASM) onto the stack, and call a common handler function.

That's all gravy, but unfortunately, we have another problem - some interrupts also push an error code onto the stack. We can't call a common function without a common stack frame, so for those that don't push an error code, we push a dummy one, so the stack is the same.

```

[GLOBAL isr0]

isr0:

  cli                 ; Disable interrupts

  push byte 0         ; Push a dummy error code (if ISR0 doesn't push its own error code)

  push byte 0         ; Push the interrupt number (0)

  jmp isr_common_stub ; Go to our common handler.

```

That sample routine will work, but 32 versions of that still sounds like a lot of code. We can use NASM's macro facility to cut this down, though:

```

%macro ISR_NOERRCODE 1  ; define a macro, taking one parameter

  [GLOBAL isr%1]        ; %1 accesses the first parameter.

  isr%1:

    cli

    push byte 0

    push %1

    jmp isr_common_stub

%endmacro

%macro ISR_ERRCODE 1

  [GLOBAL isr%1]

  isr%1:

    cli

    push %1

    jmp isr_common_stub

%endmacro

```

We can now make a stub handler function just by doing

```

ISR_NOERRCODE 0

ISR_NOERRCODE 1

...

```

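Much less work, and anything that makes our lives easier is worth doing. A quick look at the Intel manual will tell you that only interrupts 8 and 10-14 inclusive push error codes onto the stack (interrupt 17, the alignment check, pushes one too, but it can only fire in ring 3 code, which we don't have yet). The rest require dummy error codes.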

We're almost there, I promise!

Only 2 more things left to do - one is to create an ASM common handler function. The other is to create a higher-level C handler function.

```

; In isr.c

[EXTERN isr_handler]

; This is our common ISR stub. It saves the processor state, sets

; up for kernel mode segments, calls the C-level fault handler,

; and finally restores the stack frame.

isr_common_stub:

   pusha                    ; Pushes edi,esi,ebp,esp,ebx,edx,ecx,eax

   mov ax, ds               ; Lower 16-bits of eax = ds.

   push eax                 ; save the data segment descriptor

   mov ax, 0x10  ; load the kernel data segment descriptor

   mov ds, ax

   mov es, ax

   mov fs, ax

   mov gs, ax

   call isr_handler

   pop eax        ; reload the original data segment descriptor

   mov ds, ax

   mov es, ax

   mov fs, ax

   mov gs, ax

   popa                     ; Pops edi,esi,ebp...

   add esp, 8     ; Cleans up the pushed error code and pushed ISR number

   sti

   iret           ; pops 5 things at once: CS, EIP, EFLAGS, SS, and ESP

```

This piece of code is our common interrupt handler. It firstly uses the 'pusha' command to push all the general purpose registers on the stack. It uses the 'popa' command to restore them at the end. It also gets the current data segment selector and pushes that onto the stack, sets all the segment registers to the kernel data selector, and restores them afterwards. This won't actually have an effect at the moment, but it will when we switch to user-mode. Notice it also calls a higher-level interrupt handler - isr_handler.

When an interrupt fires, the processor automatically pushes information about the processor state onto the stack. The code segment, instruction pointer, flags register, stack segment and stack pointer are pushed. The IRET instruction is specifically designed to return from an interrupt. It pops these values off the stack and returns the processor to the state it was in originally.

## isr.c

```

//

// isr.c -- High level interrupt service routines and interrupt request handlers.

// Part of this code is modified from Bran's kernel development tutorials.

// Rewritten for JamesM's kernel development tutorials.

//

#include "common.h"

#include "isr.h"

#include "monitor.h"

// This gets called from our ASM interrupt handler stub.

void isr_handler(registers_t regs)

{

   monitor_write("received interrupt: ");

   monitor_write_dec(regs.int_no);

   monitor_put('\n');

}

```

Nothing much to explain here - The interrupt handler prints a message out to the screen, along with the interrupt number it handled. It uses a structure registers_t, which is a representation of all the registers we pushed, and is defined in isr.h:

## isr.h

```

//

// isr.h -- Interface and structures for high level interrupt service routines.

// Part of this code is modified from Bran's kernel development tutorials.

// Rewritten for JamesM's kernel development tutorials.

//

#include "common.h"

typedef struct registers

{

   u32int ds;                  // Data segment selector

   u32int edi, esi, ebp, esp, ebx, edx, ecx, eax; // Pushed by pusha.

   u32int int_no, err_code;    // Interrupt number and error code (if applicable)

   u32int eip, cs, eflags, useresp, ss; // Pushed by the processor automatically.

} registers_t;

```
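The field order in registers_t is not arbitrary: it is the reverse of the push order, because the x86 stack grows downwards and the last value pushed (ds) ends up at the lowest address. The sketch below repeats the struct so it compiles on its own on the host and exposes two constants for checking the layout; `INT_NO_OFFSET` and `REGISTERS_WORDS` are names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32int;  /* stand-in for the typedef in common.h */

typedef struct registers
{
    u32int ds;                                     /* pushed last by the stub   */
    u32int edi, esi, ebp, esp, ebx, edx, ecx, eax; /* pushed by pusha           */
    u32int int_no, err_code;                       /* pushed by the isrN stubs  */
    u32int eip, cs, eflags, useresp, ss;           /* pushed by the processor   */
} registers_t;

enum
{
    /* ds plus the eight pusha registers sit below int_no: 9 * 4 bytes. */
    INT_NO_OFFSET   = offsetof(registers_t, int_no),
    /* 1 + 8 + 2 + 5 = 16 words in total, with no padding expected. */
    REGISTERS_WORDS = sizeof(registers_t) / sizeof(u32int)
};
```

If a field is added, removed or reordered in the assembly stub without updating this struct, `regs.int_no` will silently read the wrong stack slot, so the two must be changed together.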

## Testing it out

Wow, that was a seriously long chapter! Don't get put off, they're not all this length. We just have to do an awful lot here to get anything out of it.

Now we can test it out! Add this to your main() function:

```

asm volatile ("int $0x3");

asm volatile ("int $0x4");

```

Disas:

```

  100063:       cc                      int3

  100064:       cd 04                   int    0x4

```

This causes two software interrupts: 3 and 4. You should see the messages printed out just like in the screenshot below.

Congrats! You've now got a kernel that can handle interrupts, and set up its own segmentation tables (a pretty hollow victory, considering all that code and theory, but unfortunately there's no getting around it!).

Copy floppy.img from project_dir/ to bochs/ directory and run Bochs debugger.

![gdt_idt_bochs](img/gdt_idt_bochs.png)

## More info

- https://wiki.osdev.org/Interrupts

- https://wiki.osdev.org/IDT

- https://wiki.osdev.org/Interrupt_Service_Routines

- https://wiki.osdev.org/Interrupt_Vector_Table

- https://wiki.osdev.org/Global_Descriptor_Table

- https://wiki.osdev.org/Segmentation

- https://wiki.osdev.org/PIC

# IRQs and PIT - bochs/x86osdev/irqs_and_the_pit/floppy.img

In this chapter we're going to be learning about interrupt requests (IRQs) and the programmable interval timer (PIT).

## IRQ - Interrupt ReQuests 

There are several methods for communicating with external devices. Two of the most useful and popular are polling and interrupting.

- **Polling**: Spin in a loop, occasionally checking if the device is ready.

- **Interrupts**: Do lots of useful stuff. When the device is ready it will cause a CPU interrupt, causing your handler to be run.

As can probably be gleaned from my biased descriptions, interrupting is considered better for many situations. Polling has lots of uses - some CPUs may not have an interrupt mechanism, or you may have many devices, or maybe you just need to check so infrequently that it's not worth the hassle of interrupts. At any rate, interrupts are a very useful method of hardware communication. They are used by the keyboard when keys are pressed, and also by the programmable interval timer (PIT).

The low-level concepts behind external interrupts are not very complex. All devices that are interrupt-capable have a line connecting them to the PIC (programmable interrupt controller). The PIC is the only device that is directly connected to the CPU's interrupt pin. It is used as a multiplexer, and has the ability to prioritise between interrupting devices. It is, essentially, a glorified 8-1 multiplexer. At some point, someone somewhere realised that 8 IRQ lines just wasn't enough, and they daisy-chained another 8-1 PIC beside the original. So in all modern PCs, you have 2 PICs, the master and the slave, serving a total of 15 interruptible devices (one line is used to signal the slave PIC).

The other clever thing about the PIC is that you can change the interrupt number it delivers for each IRQ line. This is referred to as remapping the PIC and is actually extremely useful. When the computer boots, the default interrupt mappings are:

- IRQ 0..7 - INT 0x8..0xF

- IRQ 8..15 - INT 0x70..0x77

This causes us somewhat of a problem. The master's IRQ mappings (0x8-0xF) conflict with the interrupt numbers used by the CPU to signal exceptions and faults (see last chapter). The normal thing to do is to remap the PICs so that IRQs 0..15 correspond to ISRs 32..47 (31 being the last CPU-used ISR).
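The remapped scheme boils down to "interrupt number = 32 + IRQ number". A tiny host-side sketch makes the conflict and its fix visible; `IRQ_BASE`, `irq_to_vector` and `vector_is_cpu_exception` are names invented here, not part of the tutorial code.

```c
#include <stdint.h>

#define IRQ_BASE 32  /* first interrupt number after the 32 CPU-reserved ones */

/* Hypothetical helper: interrupt number an IRQ line (0-15) raises after remapping. */
uint8_t irq_to_vector(uint8_t irq)
{
    return (uint8_t)(IRQ_BASE + irq);
}

/* Hypothetical helper: is this interrupt number in the range the CPU
   reserves for its own exceptions and faults (0-31)? */
int vector_is_cpu_exception(uint8_t vector)
{
    return vector < 32;
}
```

With the boot-time mapping, IRQ 0 arrives as interrupt 0x08, which collides with the double fault. After remapping it arrives as interrupt 32, safely outside the reserved range.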

The slave's output is connected to IRQ2 of the master:

![pics](img/pics.png)

The PICs are communicated with via the I/O bus. Each has a command port and a data port:

- Master - command: 0x20, data: 0x21

- Slave - command: 0xA0, data: 0xA1

The code for remapping the PICs is the most difficult and obfuscated. To remap them, you have to do a full reinitialisation of them, which is why the code is so long. If you're interested in what's actually happening, there is a nice description here: https://wiki.osdev.org/PIC

```

static void init_idt()

{

  ...

  // Remap the irq table.

  outb(0x20, 0x11);

  outb(0xA0, 0x11);

  outb(0x21, 0x20);

  outb(0xA1, 0x28);

  outb(0x21, 0x04);

  outb(0xA1, 0x02);

  outb(0x21, 0x01);

  outb(0xA1, 0x01);

  outb(0x21, 0x0);

  outb(0xA1, 0x0);

  ...

  idt_set_gate(32, (u32int)irq0, 0x08, 0x8E);

  ...

  idt_set_gate(47, (u32int)irq15, 0x08, 0x8E);

}

```
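The ten magic writes above become easier to read with names attached. The sketch below is a host-side rewrite of the same sequence, with comments following the usual 8259 initialisation order (ICW1 to ICW4, then the interrupt mask). It uses a recording stand-in, `mock_outb`, instead of the real `outb`, so it can run without hardware; `pic_remap`, `io_log` and the `PIC*` macros are names invented for this example.

```c
#include <stdint.h>

#define PIC1_COMMAND 0x20
#define PIC1_DATA    0x21
#define PIC2_COMMAND 0xA0
#define PIC2_DATA    0xA1

/* Host-side stand-in for outb: records writes instead of touching I/O ports. */
typedef struct { uint16_t port; uint8_t value; } io_write_t;
io_write_t io_log[16];
int io_log_len = 0;

void mock_outb(uint16_t port, uint8_t value)
{
    if (io_log_len < 16)
    {
        io_log[io_log_len].port  = port;
        io_log[io_log_len].value = value;
        io_log_len++;
    }
}

/* Hypothetical helper: reinitialise both PICs with new interrupt offsets. */
void pic_remap(uint8_t master_offset, uint8_t slave_offset)
{
    mock_outb(PIC1_COMMAND, 0x11);        /* ICW1: begin init, ICW4 will follow    */
    mock_outb(PIC2_COMMAND, 0x11);
    mock_outb(PIC1_DATA, master_offset);  /* ICW2: interrupt number for IRQ 0      */
    mock_outb(PIC2_DATA, slave_offset);   /* ICW2: interrupt number for IRQ 8      */
    mock_outb(PIC1_DATA, 0x04);           /* ICW3: slave sits on IRQ2 (bit mask)   */
    mock_outb(PIC2_DATA, 0x02);           /* ICW3: slave's cascade identity is 2   */
    mock_outb(PIC1_DATA, 0x01);           /* ICW4: 8086 mode                       */
    mock_outb(PIC2_DATA, 0x01);
    mock_outb(PIC1_DATA, 0x00);           /* mask register: enable all IRQ lines   */
    mock_outb(PIC2_DATA, 0x00);
}
```

Calling `pic_remap(0x20, 0x28)` produces exactly the port/value pairs used in init_idt. In the kernel, `mock_outb` would simply be replaced by the real `outb`.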

Notice that now we are also setting IDT gates for numbers 32-47, for our IRQ handlers. We must, therefore, also add stubs for these in interrupt.s. Also, though, we need a new macro in interrupt.s - an IRQ stub will have 2 numbers associated with it - its IRQ number (0-15) and its interrupt number (32-47):

```

; This macro creates a stub for an IRQ - the first parameter is

; the IRQ number, the second is the ISR number it is remapped to.

%macro IRQ 2

  global irq%1

  irq%1:

    cli

    push byte 0

    push %2

    jmp irq_common_stub

%endmacro

```

```

...

IRQ   0,    32

IRQ   1,    33

...

IRQ  15,    47

```

We also have a new common stub - irq_common_stub. This is because IRQs behave subtly differently - before you return from an IRQ handler, you must inform the PIC that you have finished, so it can dispatch the next (if there is one waiting). This is known as an EOI (end of interrupt). There is a slight complication though. If the master PIC sent the IRQ (number 0-7), we must send an EOI to the master (obviously). If the slave sent the IRQ (8-15), we must send an EOI to both the master and the slave (because of the daisy-chaining of the two).

First our asm common stub. It is almost identical to isr_common_stub.

```

; In isr.c

[EXTERN irq_handler]

; This is our common IRQ stub. It saves the processor state, sets

; up for kernel mode segments, calls the C-level fault handler,

; and finally restores the stack frame.

irq_common_stub:

   pusha                    ; Pushes edi,esi,ebp,esp,ebx,edx,ecx,eax

   mov ax, ds               ; Lower 16-bits of eax = ds.

   push eax                 ; save the data segment descriptor

   mov ax, 0x10  ; load the kernel data segment descriptor

   mov ds, ax

   mov es, ax

   mov fs, ax

   mov gs, ax

   call irq_handler

   pop ebx        ; reload the original data segment descriptor

   mov ds, bx

   mov es, bx

   mov fs, bx

   mov gs, bx

   popa                     ; Pops edi,esi,ebp...

   add esp, 8     ; Cleans up the pushed error code and pushed ISR number

   sti

   iret           ; pops 5 things at once: CS, EIP, EFLAGS, SS, and ESP

```

Now the C code (goes in isr.c):

```

// This gets called from our ASM interrupt handler stub.

void irq_handler(registers_t regs)

{

   // Send an EOI (end of interrupt) signal to the PICs.

   // If this interrupt involved the slave.

   if (regs.int_no >= 40)

   {

       // Send reset signal to slave.

       outb(0xA0, 0x20);

   }

   // Send reset signal to master. (As well as slave, if necessary).

   outb(0x20, 0x20);

   if (interrupt_handlers[regs.int_no] != 0)

   {

       isr_t handler = interrupt_handlers[regs.int_no];

       handler(regs);

   }

}

```

This is fairly straightforward - if the IRQ was > 7 (interrupt number >= 40), we send a reset signal to the slave. In either case, we send one to the master also.
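The boundary is easy to get wrong by one, so here is the decision on its own as a host-side sketch. It assumes the remapping used in this chapter (IRQ 0 at interrupt 32); `eoi_targets` and the `EOI_*` flags are invented for the example.

```c
#include <stdint.h>

#define IRQ0_VECTOR 32   /* interrupt number of IRQ 0 after remapping */
#define EOI_MASTER  0x1
#define EOI_SLAVE   0x2

/* Hypothetical helper: which PICs need an EOI for this interrupt number? */
unsigned eoi_targets(uint32_t int_no)
{
    unsigned targets = EOI_MASTER;      /* the master always gets one     */
    if (int_no >= IRQ0_VECTOR + 8)      /* IRQ 8-15 came via the slave    */
    {
        targets |= EOI_SLAVE;
    }
    return targets;
}
```

Interrupt 39 (IRQ 7) is the last one that only needs the master; interrupt 40 (IRQ 8) is the first that needs both.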

You may also notice that I have added a small custom handler mechanism, allowing you to register custom interrupt handlers. This can be very useful as an abstraction technique, and will neaten up our code nicely.

Some other declarations are needed:

## isr.h

```

// A few defines to make life a little easier

#define IRQ0 32

...

#define IRQ15 47

// Enables registration of callbacks for interrupts or IRQs.

// For IRQs, to ease confusion, use the #defines above as the

// first parameter.

typedef void (*isr_t)(registers_t);

void register_interrupt_handler(u8int n, isr_t handler);

```

## isr.c

```

isr_t interrupt_handlers[256];

void register_interrupt_handler(u8int n, isr_t handler)

{

  interrupt_handlers[n] = handler;

}

```

And there we go! We can now handle interrupt requests from external devices, and dispatch them to custom handlers. Now all we need is some interrupt requests to handle!

## PIT - Programmable Interval Timer

The programmable interval timer is a chip connected to IRQ0. It can interrupt the CPU at a user-defined rate (between 18.2Hz and 1.1931 MHz). The PIT is the primary method used for implementing a system clock and the only method available for implementing multitasking (switch processes on interrupt).

The PIT has an internal clock which oscillates at approximately 1.1931MHz. This clock signal is fed through a frequency divider http://en.wikipedia.org/wiki/Frequency_divider , to modulate the final output frequency. It has 3 channels, each with its own frequency divider.

- Channel 0 is the most useful. Its output is connected to IRQ0.

- Channel 1 is not very useful and on modern hardware is no longer implemented. It used to control refresh rates for DRAM http://en.wikipedia.org/wiki/DRAM

- Channel 2 controls the PC speaker.

- Channel 0 is the only one of use to us at the moment.

OK, so we want to set the PIT up so it interrupts us at regular intervals, at frequency f. I generally set f to be about 100Hz (once every 10 milliseconds), but feel free to set it to whatever you like. To do this, we send the PIT a 'divisor'. This is the number that it should divide its input frequency (1.1931MHz) by. It's dead easy to work out:

```

divisor = 1193180 Hz / frequency (in Hz)

```
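As a worked example of the formula above, here is a minimal sketch that computes the divisor and rejects requests that would not fit in 16 bits. The helper name `pit_divisor` and the macro `PIT_INPUT_HZ` are hypothetical and not part of the tutorial's code; the input clock value is the 1193180 Hz used in this chapter.

```c
#include <stdint.h>

#define PIT_INPUT_HZ 1193180u  /* input clock frequency used in this tutorial */

/* Hypothetical helper: returns the divisor for the requested frequency,
   or 0 if the request is zero or the divisor does not fit in 16 bits. */
uint32_t pit_divisor(uint32_t frequency_hz)
{
    if (frequency_hz == 0)
        return 0;

    uint32_t divisor = PIT_INPUT_HZ / frequency_hz;

    if (divisor == 0 || divisor > 0xFFFF)
        return 0;

    return divisor;
}
```

For 100Hz this gives 11931 (0x2E9B). A request of 18Hz is rejected because 1193180 / 18 is larger than 65535, which is why the slowest rate mentioned above is about 18.2Hz.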

Also worthy of note is that the PIT has 4 registers in I/O space - 0x40-0x42 are the data ports for channels 0-2 respectively, and 0x43 is the command port.

We'll need a few new files. timer.h has only a declaration in it:

```

// timer.h -- Defines the interface for all PIT-related functions.

// Written for JamesM's kernel development tutorials.

#ifndef TIMER_H

#define TIMER_H

#include "common.h"

void init_timer(u32int frequency);

#endif

```

And timer.c doesn't have much in either:

```

// timer.c -- Initialises the PIT, and handles clock updates.

// Written for JamesM's kernel development tutorials.

#include "timer.h"

#include "isr.h"

#include "monitor.h"

u32int tick = 0;

static void timer_callback(registers_t regs)

{

   tick++;

   monitor_write("Tick: ");

   monitor_write_dec(tick);

   monitor_write("\n");

}

void init_timer(u32int frequency)

{

   // Firstly, register our timer callback.

   register_interrupt_handler(IRQ0, &timer_callback);

   // The value we send to the PIT is the value to divide its input clock

   // (1193180 Hz) by, to get our required frequency. Important to note is

   // that the divisor must be small enough to fit into 16 bits.

   u32int divisor = 1193180 / frequency;

   // Send the command byte.

   outb(0x43, 0x36);

   // Divisor has to be sent byte-wise, so split here into upper/lower bytes.

   u8int l = (u8int)(divisor & 0xFF);

   u8int h = (u8int)( (divisor>>8) & 0xFF );

   // Send the frequency divisor.

   outb(0x40, l);

   outb(0x40, h);

}

```

OK, let's go through this code. Firstly, we have our init_timer function. This tells our interrupt mechanism that we want to handle IRQ0 with the function timer_callback. This will be called whenever a timer interrupt is received. We then calculate the divisor to be sent to the PIT (see theory above). Then, we send a command byte to the PIT's command port. This byte (0x36) sets the PIT to repeating mode (so that when the divisor counter reaches zero it's automatically refreshed) and tells it we want to set the divisor value.

We then send the divisor value. Note that it must be sent as two separate bytes, not as one 16-bit value.
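To make the magic number 0x36 a little less magic, here is a small sketch that builds the command byte from its fields and splits a divisor into the two bytes sent to port 0x40. The macro and function names are hypothetical, chosen only for this illustration; the field layout is that of the PIT command register (channel select, access mode, operating mode, binary/BCD).

```c
#include <stdint.h>

/* Field values for the PIT command byte written to port 0x43. */
#define PIT_CHANNEL0     (0u << 6)  /* bits 7-6: select channel 0 */
#define PIT_ACCESS_LOHI  (3u << 4)  /* bits 5-4: low byte first, then high byte */
#define PIT_MODE_SQUARE  (3u << 1)  /* bits 3-1: mode 3, a repeating square wave */
#define PIT_BINARY       (0u << 0)  /* bit 0: binary counting rather than BCD */

/* Hypothetical helper: the command byte that init_timer sends. */
uint8_t pit_command_byte(void)
{
    return (uint8_t)(PIT_CHANNEL0 | PIT_ACCESS_LOHI | PIT_MODE_SQUARE | PIT_BINARY);
}

/* Hypothetical helpers: the two bytes of the divisor, in sending order. */
uint8_t pit_low_byte(uint32_t divisor)
{
    return (uint8_t)(divisor & 0xFF);
}

uint8_t pit_high_byte(uint32_t divisor)
{
    return (uint8_t)((divisor >> 8) & 0xFF);
}
```

The access mode field is the reason the divisor goes out as two writes: the chip expects the low byte and then the high byte on the same data port.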

When this is done, all we have to do is edit our Makefile and add one line to main.c:

```

init_timer(50); // Initialise timer to 50Hz

```

Copy floppy.img from project_dir/ to the bochs/ directory and run! You should get output like that shown below. Note however that Bochs does not accurately emulate the timer chip, so although your code will run at the correct speed on a real machine, it probably won't in Bochs!

![irqs_and_the_pit_bochs](img/irqs_and_the_pit_bochs.png)

## More info

- https://wiki.osdev.org/Programmable_Interval_Timer

# Paging - bochs/x86osdev/paging/floppy.img

In this chapter we're going to enable paging. Paging serves a twofold purpose - memory protection, and virtual memory (the two being almost inextricably interlinked).

## Virtual memory

If you already know what virtual memory is, you can skip this section.

On Linux, if you create a tiny test program such as

```

int main(int argc, char **argv)

{

  return 0;

}

```

Compile it with:

```

gcc -static -m32 main.c -o main

```

Then run 'objdump -f' on the result. You might find something similar to this:

```

dreg@fr33project:~/test> objdump -f main

main:     file format elf32-i386

architecture: i386, flags 0x00000112:

EXEC_P, HAS_SYMS, D_PAGED

start address 0x08049950

```

Notice the start address of the program is at 0x08049950, which is about 128MB into the address space. It may seem strange, then, that this program will run perfectly on machines with < 128MB of RAM.

What the program is actually 'seeing', when it reads and writes memory, is a virtual address space. Parts of the virtual address space are mapped to physical memory, and parts are unmapped. If you try to access an unmapped part, the processor raises a page fault, the operating system catches it, and in POSIX systems delivers a SIGSEGV signal, whose default action is to terminate the process.

This abstraction is extremely useful. It means that compilers can produce a program that relies on the code being at an exact location in memory, every time it is run. With virtual memory, the process thinks it is at, for example, 0x08049950, but actually it could be at physical memory location 0x1000000. Not only that, but processes cannot accidentally (or deliberately) trample other processes' data or code.

Virtual memory of this type is wholly dependent on hardware support. It cannot be emulated by software. Luckily, the x86 has just such a thing. It's called the MMU (memory management unit), and it handles all memory mappings due to segmentation and paging, forming a layer between the CPU and memory (actually, it's part of the CPU, but that's just an implementation detail).

## Paging as a concretion of virtual memory

Virtual memory is an abstract principle. As such it requires concretion through some system/algorithm. Both segmentation (see chapter 3) and paging are valid methods for implementing virtual memory. As mentioned in chapter 3 however, segmentation is becoming obsolete. Paging is the newer, better alternative for the x86 architecture.

Paging works by splitting the virtual address space into blocks called pages, which are usually 4KB in size. Pages can then be mapped on to frames - equally sized blocks of physical memory.

## Page entries

Each process normally has a different set of page mappings, so that virtual memory spaces are independent of each other. In the x86 architecture (32-bit) pages are fixed at 4KB in size. Each page has a corresponding descriptor word, which tells the processor which frame it is mapped to. Note that because pages and frames must be aligned on 4KB boundaries (4KB being 0x1000 bytes), the least significant 12 bits of the 32-bit word are always zero. The architecture takes advantage of this by using them to store information about the page, such as whether it is present, whether it is kernel-mode or user-mode etc. The layout of this word is shown in the picture below.

The fields in that picture are pretty simple, so let's quickly go through them.

- **P**: Set if the page is present in memory.

- **R/W**: If set, that page is writeable. If unset, the page is read-only. This does not apply when code is running in kernel-mode (unless a flag in CR0 is set).

- **U/S**: If set, this is a user-mode page. Else it is a supervisor (kernel)-mode page. User-mode code cannot write to or read from kernel-mode pages.

- **Reserved**: These are used by the CPU internally and cannot be trampled.

- **A**: Set if the page has been accessed (Gets set by the CPU).

- **D**: Set if the page has been written to (dirty).

- **AVAIL**: These 3 bits are unused and available for kernel-use.

- **Page frame address**: The high 20 bits of the frame address in physical memory.

Page table entry format:

![paging_pte](img/paging_pte.png)
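The same layout can be written down as plain bit masks, which is sometimes easier to reason about than a picture. This is only a sketch: the macro and function names are hypothetical, and the bit positions are those of a 32-bit x86 page table entry (P in bit 0, R/W in bit 1, U/S in bit 2, A in bit 5, D in bit 6, frame address in bits 12-31).

```c
#include <stdint.h>

#define PTE_PRESENT     (1u << 0)    /* P:   page is present in memory */
#define PTE_RW          (1u << 1)    /* R/W: page is writeable */
#define PTE_USER        (1u << 2)    /* U/S: page is accessible from user-mode */
#define PTE_ACCESSED    (1u << 5)    /* A:   set by the CPU on access */
#define PTE_DIRTY       (1u << 6)    /* D:   set by the CPU on write */
#define PTE_FRAME_MASK  0xFFFFF000u  /* high 20 bits: frame address */

/* Hypothetical helper: combine a 4KB-aligned frame address with flag bits. */
uint32_t pte_make(uint32_t frame_addr, uint32_t flags)
{
    return (frame_addr & PTE_FRAME_MASK) | (flags & ~PTE_FRAME_MASK);
}

/* Hypothetical helper: extract the physical frame address from an entry. */
uint32_t pte_frame_addr(uint32_t pte)
{
    return pte & PTE_FRAME_MASK;
}

/* Hypothetical helper: nonzero if the present bit is set. */
int pte_is_present(uint32_t pte)
{
    return (pte & PTE_PRESENT) != 0;
}
```

For example, a present, writeable, kernel-only page mapped to the frame at 0x00123000 is encoded as 0x00123003.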

## Page directories and tables

Possibly you've been tapping on your calculator and have worked out that to generate a table mapping each 4KB page to one 32-bit descriptor over a 4GB address space requires 4MB of memory. Perhaps, perhaps not - but it's true.
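If you haven't been tapping on your calculator, the arithmetic can be sketched in a few lines. The names below are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

#define PAGE_SIZE   4096u  /* 4KB pages */
#define ENTRY_SIZE  4u     /* one 32-bit descriptor per page */

/* Number of 4KB pages in a 4GB (2^32 byte) address space. */
uint32_t pages_in_4gb(void)
{
    return (uint32_t)((1ull << 32) / PAGE_SIZE);
}

/* Bytes needed for a single flat table with one descriptor per page. */
uint32_t flat_table_bytes(void)
{
    return pages_in_4gb() * ENTRY_SIZE;
}
```

That is 1048576 pages, and at 4 bytes each the table comes to 4194304 bytes, i.e. 4MB.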

4MB may seem like a large overhead, and to be fair, it is. If you have 4GB of physical RAM, it's not much. However, if you are working on a machine that has 16MB of RAM, you've just lost a quarter of your available memory! What we want is something progressive, that will take up an amount of space proportionate to the amount of RAM you have.

Well, we don't have that. But Intel did come up with something similar - they use a 2-tier system. The CPU gets told about a page directory, which is a 4KB large table, each entry of which points to a page table. The page table is, again, 4KB large and each entry is a page table entry, described above.

This way, the entire 4GB address space can be covered with the advantage that if a page table has no entries, it can be freed and its present flag unset in the page directory.

2-tier layout:

![page_directory](img/page_directory.png)

Linear address to physical address:

![x86_page_translation_process](img/x86_page_translation_process.png)
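The translation in the picture boils down to slicing a 32-bit linear address into three fields: 10 bits of page directory index, 10 bits of page table index and 12 bits of offset within the page. A minimal sketch, with hypothetical function names:

```c
#include <stdint.h>

/* Bits 31-22: which page directory entry (and so which page table). */
uint32_t pd_index(uint32_t linear)
{
    return linear >> 22;
}

/* Bits 21-12: which entry inside that page table. */
uint32_t pt_index(uint32_t linear)
{
    return (linear >> 12) & 0x3FF;
}

/* Bits 11-0: byte offset inside the 4KB page. */
uint32_t page_offset(uint32_t linear)
{
    return linear & 0xFFF;
}
```

Taking the start address from the objdump example, 0x08049950 lands in directory entry 32, table entry 73, at offset 0x950.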

## Enabling paging

Enabling paging is extremely easy.

1. Copy the location of your page directory into the CR3 register. This must, of course, be the physical address.

2. Set the PG bit in the CR0 register. You can do this by OR-ing with 0x80000000.

## Page faults

When a process does something the memory-management unit doesn't like, a page fault interrupt is thrown. Situations that can cause this are (not complete):

- Reading from or writing to an area of memory that is not mapped (page entry/table's 'present' flag is not set)

- The process is in user-mode and tries to write to a read-only page.

- The process is in user-mode and tries to access a kernel-only page.

- The page table entry is corrupted - the reserved bits have been overwritten.

The page fault interrupt is number 14, and looking at chapter 3 we can see that this throws an error code. This error code gives us quite a bit of information about what happened.

- **Bit 0**: If set, the fault was not because the page wasn't present. If unset, the page wasn't present.

- **Bit 1**: If set, the operation that caused the fault was a write, else it was a read.

- **Bit 2**: If set, the processor was running in user-mode when it was interrupted. Else, it was running in kernel-mode.

- **Bit 3**: If set, the fault was caused by reserved bits being overwritten.

- **Bit 4**: If set, the fault occurred during an instruction fetch.

The processor also gives us another piece of information - the address that caused the fault. This is located in the CR2 register. Beware that if your page fault handler itself causes another page fault exception this register will be overwritten - so save it early!

## Putting it into practice

We're almost ready to start implementing. We will, however, need a few assistant functions first, the most important of which are memory management functions.

## Simple memory management with placement malloc

If you come from a C++ background, you may have heard of 'placement new'. This is a version of new that takes a parameter. Instead of calling malloc, as it normally would, it creates the object at the address specified. We are going to use a very similar concept.

When the kernel is sufficiently booted, we will have a kernel heap active and operational. The way we code heaps, though, usually requires that virtual memory is enabled. So we need a simple alternative to allocate memory before the heap is active.

As we're allocating quite early on in the kernel bootup, we can make the assumption that nothing that is kmalloc()'d will ever need to be kfree()'d. This simplifies things greatly. We can just have a pointer (placement address) to some free memory that we pass back to the requestee then increment. Thus:

```

u32int kmalloc(u32int sz)

{

  u32int tmp = placement_address;

  placement_address += sz;

  return tmp;

}

```

That will actually suffice. However, we have another requirement. When we allocate page tables and directories, they must be page-aligned. So we can build that in:

```

u32int kmalloc(u32int sz, int align)

{

  if (align == 1 && (placement_address & 0x00000FFF)) // If the address is not already page-aligned

  {

    // Align it.

    placement_address &= 0xFFFFF000;

    placement_address += 0x1000;

  }

  u32int tmp = placement_address;

  placement_address += sz;

  return tmp;

}

```

Now, unfortunately, we have one more requirement, and I can't really explain to you why it is required until later in the tutorials. It has to do with when we clone a page directory (when fork()ing processes). At this point, paging will be fully enabled, and kmalloc will return a virtual address. But, we also (bear with me, you'll be glad we did later) need to get the physical address of the memory allocated. Take it on faith for now - it's not much code anyway.

```

u32int kmalloc(u32int sz, int align, u32int *phys)

{

  if (align == 1 && (placement_address & 0x00000FFF)) // If the address is not already page-aligned

  {

    // Align it.

    placement_address &= 0xFFFFF000;

    placement_address += 0x1000;

  }

  if (phys)

  {

    *phys = placement_address;

  }

  u32int tmp = placement_address;

  placement_address += sz;

  return tmp;

}

```

Great. This is all we need for simple memory management. In my code I have actually (for aesthetic purposes) renamed kmalloc to kmalloc_int (for kmalloc_internal). I then have several wrapper functions:

```

u32int kmalloc_a(u32int sz);  // page aligned.

u32int kmalloc_p(u32int sz, u32int *phys); // returns a physical address.

u32int kmalloc_ap(u32int sz, u32int *phys); // page aligned and returns a physical address.

u32int kmalloc(u32int sz); // vanilla (normal).

```

I just feel this interface is nicer than specifying 3 parameters for every kernel heap allocation! These definitions should go in kheap.h/kheap.c.
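For illustration, here is roughly how the four wrappers could sit on top of kmalloc_int. Treat it as a sketch rather than the exact contents of kheap.c: the `u32int` typedef stands in for the one in common.h, and the starting value of `placement_address` is made up so the example can be followed on paper. The alignment test checks the low 12 bits of the address.

```c
#include <stdint.h>

typedef uint32_t u32int;  /* stand-in for the typedef in common.h */

/* Made-up starting value, for illustration only. */
u32int placement_address = 0x00100123;

/* Internal allocator: bump the placement address, optionally page-aligning
   it first and optionally reporting the physical address. */
static u32int kmalloc_int(u32int sz, int align, u32int *phys)
{
    if (align == 1 && (placement_address & 0x00000FFF))
    {
        // Round up to the next 4KB boundary.
        placement_address &= 0xFFFFF000;
        placement_address += 0x1000;
    }
    if (phys)
    {
        *phys = placement_address;
    }
    u32int tmp = placement_address;
    placement_address += sz;
    return tmp;
}

u32int kmalloc_a(u32int sz)                { return kmalloc_int(sz, 1, 0); }
u32int kmalloc_p(u32int sz, u32int *phys)  { return kmalloc_int(sz, 0, phys); }
u32int kmalloc_ap(u32int sz, u32int *phys) { return kmalloc_int(sz, 1, phys); }
u32int kmalloc(u32int sz)                  { return kmalloc_int(sz, 0, 0); }
```

With the made-up start value, a plain kmalloc(0x10) returns 0x00100123, and a following kmalloc_a(0x1000) skips ahead to 0x00101000.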

## Required definitions

paging.h should contain some structure definitions that will make our life easier.

```

#ifndef PAGING_H

#define PAGING_H

#include "common.h"

#include "isr.h"

typedef struct page

{

   u32int present    : 1;   // Page present in memory

   u32int rw         : 1;   // Read-only if clear, readwrite if set

   u32int user       : 1;   // Supervisor level only if clear

   u32int reserved   : 2;   // Reserved bits used by the CPU (bits 3-4)

   u32int accessed   : 1;   // Has the page been accessed since last refresh?

   u32int dirty      : 1;   // Has the page been written to since last refresh?

   u32int unused     : 5;   // Amalgamation of unused and reserved bits (bits 7-11)

   u32int frame      : 20;  // Frame address (shifted right 12 bits)

} page_t;

typedef struct page_table

{

   page_t pages[1024];

} page_table_t;

typedef struct page_directory

{

   /**

      Array of pointers to pagetables.

   **/

   page_table_t *tables[1024];

   /**

      Array of pointers to the pagetables above, but gives their *physical*

      location, for loading into the CR3 register.

   **/

   u32int tablesPhysical[1024];

   /**

      The physical address of tablesPhysical. This comes into play

      when we get our kernel heap allocated and the directory

      may be in a different location in virtual memory.

   **/

   u32int physicalAddr;

} page_directory_t;

/**

  Sets up the environment, page directories etc and

  enables paging.

**/

void initialise_paging();

/**

  Causes the specified page directory to be loaded into the

  CR3 register.

**/

void switch_page_directory(page_directory_t *new);

/**

  Retrieves a pointer to the page required.

  If make == 1, if the page-table in which this page should

  reside isn't created, create it!

**/

page_t *get_page(u32int address, int make, page_directory_t *dir);

/**

  Handler for page faults.

**/

void page_fault(registers_t regs);

```

Note the tablesPhysical and physicalAddr members of page_directory_t. What are they doing there?

The physicalAddr member is actually only for when we clone page directories (not until later in the tutorials). Remember that at that point, the new directory will have an address in virtual memory that is not the same as physical memory. We will need the physical address to tell the CPU if we ever want to switch directories.

The tablesPhysical member is similar. It is a solution to a problem: How do you access your page tables? It may seem simple, but remember that a page directory must hold physical addresses, not virtual ones. And the only way you can read/write to memory is using virtual addresses!

One solution to this problem is to never access your page tables directly, but to map one page table to point back to the page directory, so that by accessing memory at a certain address you can see all your page tables as if they were pages, and all your page table entries as if they were normal integers. The diagram on the right should help to explain. This method is a little counter-intuitive in my opinion and it also wastes 256MB of addressable space, so I prefer another method.

The second method is to, for every page directory, keep 2 arrays. One holding the physical addresses of its page tables (for giving to the CPU), and the other holding the virtual ones (so we can read/write to them). This only gives us an extra overhead of 4KB per page directory, which is not much.

## Frame allocation

If we want to map a page to a frame, we need some way of finding a free frame. Of course, we could just maintain a massive array of 1's and 0's, but that would be extremely wasteful - we don't need 32-bits just to hold 2 values, we can do that with 1 bit. So if we use a bitset http://en.wikipedia.org/wiki/Bitset , we will be using 32 times less space!

If you don't know what a bitset (also called a bitmap) is, you should read the link above. There are only 3 functions a bitset implements - set, test and clear. I have also implemented a function to efficiently find the first free frame from the bitmap. Have a look at it and work out why it is efficient. My implementation of these is below. I'm not going to go through explaining it - this is a general concept and is not kernel related. If you're confused, search google for bitset implementations, and if worst comes to the worst post on the osdev forums https://forum.osdev.org/

```

// A bitset of frames - used or free.

u32int *frames;

u32int nframes;

// Defined in kheap.c

extern u32int placement_address;

// Macros used in the bitset algorithms.

#define INDEX_FROM_BIT(a) (a/(8*4))

#define OFFSET_FROM_BIT(a) (a%(8*4))

// Static function to set a bit in the frames bitset

static void set_frame(u32int frame_addr)

{

   u32int frame = frame_addr/0x1000;

   u32int idx = INDEX_FROM_BIT(frame);

   u32int off = OFFSET_FROM_BIT(frame);

   frames[idx] |= (0x1 << off);

}

// Static function to clear a bit in the frames bitset

static void clear_frame(u32int frame_addr)

{

   u32int frame = frame_addr/0x1000;

   u32int idx = INDEX_FROM_BIT(frame);

   u32int off = OFFSET_FROM_BIT(frame);

   frames[idx] &= ~(0x1 << off);

}

// Static function to test if a bit is set.

static u32int test_frame(u32int frame_addr)

{

   u32int frame = frame_addr/0x1000;

   u32int idx = INDEX_FROM_BIT(frame);

   u32int off = OFFSET_FROM_BIT(frame);

   return (frames[idx] & (0x1 << off));

}

// Static function to find the first free frame.

static u32int first_frame()

{

   u32int i, j;

   for (i = 0; i < INDEX_FROM_BIT(nframes); i++)

   {

       if (frames[i] != 0xFFFFFFFF) // skip words that are completely full.

       {

           // at least one bit is free here.

           for (j = 0; j < 32; j++)

           {

               u32int toTest = 0x1 << j;

               if ( !(frames[i]&toTest) )

               {

                   return i*4*8+j;

               }

           }

       }

   }

   return (u32int)-1; // No free frame was found.

}

```
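To see the index/offset arithmetic on concrete numbers, here is a tiny stand-alone version of the same idea that covers only 64 frames. Everything prefixed with `demo_` is hypothetical and exists only for this sketch; the logic mirrors the functions above.

```c
#include <stdint.h>

#define INDEX_FROM_BIT(a)  ((a) / 32)
#define OFFSET_FROM_BIT(a) ((a) % 32)

/* A small stand-in bitset covering 64 frames (two 32-bit words). */
static uint32_t demo_frames[2];

void demo_set_frame(uint32_t frame_addr)
{
    uint32_t frame = frame_addr / 0x1000;
    demo_frames[INDEX_FROM_BIT(frame)] |= (1u << OFFSET_FROM_BIT(frame));
}

void demo_clear_frame(uint32_t frame_addr)
{
    uint32_t frame = frame_addr / 0x1000;
    demo_frames[INDEX_FROM_BIT(frame)] &= ~(1u << OFFSET_FROM_BIT(frame));
}

int demo_test_frame(uint32_t frame_addr)
{
    uint32_t frame = frame_addr / 0x1000;
    return (demo_frames[INDEX_FROM_BIT(frame)] >> OFFSET_FROM_BIT(frame)) & 1u;
}

/* Returns the index of the first free frame, or (uint32_t)-1 if none. */
uint32_t demo_first_frame(void)
{
    for (uint32_t i = 0; i < 2; i++)
    {
        if (demo_frames[i] == 0xFFFFFFFFu)
            continue;  /* all 32 frames in this word are taken */
        for (uint32_t j = 0; j < 32; j++)
        {
            if (!(demo_frames[i] & (1u << j)))
                return i * 32 + j;
        }
    }
    return (uint32_t)-1;
}
```

The frame at address 0x23000 is frame number 35, which lives in word 1 at bit 3. Comparing a whole word against 0xFFFFFFFF is what lets the search skip 32 used frames at a time.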

Hopefully that code shouldn't cause too many surprises. It's just fancy bit twiddling. We then come to functions to allocate and deallocate frames. Now that we have an efficient bitset implementation, these functions total just a few lines!

```

// Function to allocate a frame.

void alloc_frame(page_t *page, int is_kernel, int is_writeable)

{

   if (page->frame != 0)

   {

       return; // Frame was already allocated, return straight away.

   }

   else

   {

       u32int idx = first_frame(); // idx is now the index of the first free frame.

       if (idx == (u32int)-1)

       {

           // PANIC is just a macro that prints a message to the screen then hits an infinite loop.

           PANIC("No free frames!");

       }

       set_frame(idx*0x1000); // this frame is now ours!

       page->present = 1; // Mark it as present.

       page->rw = (is_writeable)?1:0; // Should the page be writeable?

       page->user = (is_kernel)?0:1; // Should the page be user-mode?

       page->frame = idx;

   }

}

// Function to deallocate a frame.

void free_frame(page_t *page)

{

   u32int frame;

   if (!(frame=page->frame))

   {

       return; // The given page didn't actually have an allocated frame!

   }

   else

   {

       clear_frame(frame*0x1000); // Frame is now free again (clear_frame expects an address).

       page->frame = 0x0; // Page now doesn't have a frame.

   }

}

```

Note that the PANIC macro just calls a global function called panic, with arguments of the message given and the `__FILE__` and `__LINE__` it occurred on. panic prints these out and enters an infinite loop, stopping all execution.
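A hosted sketch of that macro might look like the following. The message format and the helper `panic_format` are assumptions made for this example, and `puts` stands in for the kernel's monitor_write, so don't read it as the exact code in the project.

```c
#include <stdio.h>

/* Hypothetical helper: builds the panic text so it can be printed. */
int panic_format(char *buf, size_t n, const char *message,
                 const char *file, int line)
{
    return snprintf(buf, n, "PANIC(%s) at %s:%d", message, file, line);
}

/* Sketch of the global panic function: print the message, then stop. */
void panic(const char *message, const char *file, int line)
{
    char buf[128];
    panic_format(buf, sizeof buf, message, file, line);
    puts(buf);    /* the kernel would call monitor_write here */
    for (;;) { }  /* never return to the caller */
}

/* The preprocessor fills in the file and line of the call site. */
#define PANIC(msg) panic(msg, __FILE__, __LINE__)
```

Because `__FILE__` and `__LINE__` are expanded where the macro is used, every PANIC("...") reports its own location without any extra typing.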

## Paging code finally

```

void initialise_paging()

{

   // The size of physical memory. For the moment we

   // assume it is 16MB big.

   u32int mem_end_page = 0x1000000;

   nframes = mem_end_page / 0x1000;

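   // The bitset needs one 32-bit word (4 bytes) for every 32 frames.

   frames = (u32int*)kmalloc(INDEX_FROM_BIT(nframes) * sizeof(u32int));

   memset(frames, 0, INDEX_FROM_BIT(nframes) * sizeof(u32int));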

   // Let's make a page directory.

   kernel_directory = (page_directory_t*)kmalloc_a(sizeof(page_directory_t));

   memset(kernel_directory, 0, sizeof(page_directory_t));

   current_directory = kernel_directory;

   // We need to identity map (phys addr = virt addr) from

   // 0x0 to the end of used memory, so we can access this

   // transparently, as if paging wasn't enabled.

   // NOTE that we use a while loop here deliberately.

   // inside the loop body we actually change placement_address

   // by calling kmalloc(). A while loop causes this to be

   // computed on-the-fly rather than once at the start.

   int i = 0;

   while (i < placement_address)

   {

       // Kernel code is readable but not writeable from userspace.

       alloc_frame( get_page(i, 1, kernel_directory), 0, 0);

       i += 0x1000;

   }

   // Before we enable paging, we must register our page fault handler.

   register_interrupt_handler(14, page_fault);

   // Now, enable paging!

   switch_page_directory(kernel_directory);

}

void switch_page_directory(page_directory_t *dir)

{

   current_directory = dir;

   asm volatile("mov %0, %%cr3":: "r"(&dir->tablesPhysical));

   u32int cr0;

   asm volatile("mov %%cr0, %0": "=r"(cr0));

   cr0 |= 0x80000000; // Enable paging!

   asm volatile("mov %0, %%cr0":: "r"(cr0));

}

page_t *get_page(u32int address, int make, page_directory_t *dir)

{

   // Turn the address into an index.

   address /= 0x1000;

   // Find the page table containing this address.

   u32int table_idx = address / 1024;

   if (dir->tables[table_idx]) // If this table is already assigned

   {

       return &dir->tables[table_idx]->pages[address%1024];

   }

   else if(make)

   {

       u32int tmp;

       dir->tables[table_idx] = (page_table_t*)kmalloc_ap(sizeof(page_table_t), &tmp);

       memset(dir->tables[table_idx], 0, 0x1000);

       dir->tablesPhysical[table_idx] = tmp | 0x7; // PRESENT, RW, US.

       return &dir->tables[table_idx]->pages[address%1024];

   }

   else

   {

       return 0;

   }

}

```

Disassembly of switch_page_directory:

```

001019ff <switch_page_directory>:

  1019ff:       f3 0f 1e fb             endbr32

  101a03:       55                      push   ebp

  101a04:       89 e5                   mov    ebp,esp

  101a06:       83 ec 10                sub    esp,0x10

  101a09:       e8 f6 05 00 00          call   102004 <__x86.get_pc_thunk.ax>

  101a0e:       05 f2 15 00 00          add    eax,0x15f2

  101a13:       8b 55 08                mov    edx,DWORD PTR [ebp+0x8]

  101a16:       89 90 2c 00 00 00       mov    DWORD PTR [eax+0x2c],edx

  101a1c:       8b 45 08                mov    eax,DWORD PTR [ebp+0x8]

  101a1f:       05 00 10 00 00          add    eax,0x1000

  101a24:       0f 22 d8                mov    cr3,eax

  101a27:       0f 20 c0                mov    eax,cr0

  101a2a:       89 45 fc                mov    DWORD PTR [ebp-0x4],eax

  101a2d:       81 4d fc 00 00 00 80    or     DWORD PTR [ebp-0x4],0x80000000

  101a34:       8b 45 fc                mov    eax,DWORD PTR [ebp-0x4]

  101a37:       0f 22 c0                mov    cr0,eax

  101a3a:       90                      nop

  101a3b:       c9                      leave

  101a3c:       c3                      ret

```

Right, let's analyse that. First of all, the utility functions.

switch_page_directory does exactly what it says on the tin. It takes a page directory, and switches to it. It does this by moving the address of the tablesPhysical member of that directory into the CR3 register. Remember that the tablesPhysical member is an array of physical addresses. After that it first gets the contents of CR0, then OR-s the PG bit (0x80000000), then rewrites it. This enables paging and flushes the page-directory cache as well.

get_page returns a pointer to the page entry for a particular address. It can optionally be passed a parameter - make. If make is 1, and the page table that the requested page entry should reside in hasn't been created, then it will be created. Otherwise, the function would just return 0. So, if the table has already been assigned, it will look up the page entry and return it. If it hasn't (and make == 1), it will attempt to create it.

It uses our kmalloc_ap function to retrieve a memory block which is page-aligned, and also gets given its physical location. The physical location gets stored in 'tablesPhysical' (after several bits have been set telling the CPU that it is present, writeable, and user-accessible), and the virtual location is stored in 'tables'.

initialise_paging firstly creates the frames bitset, and sets everything to zero using memset. Then it allocates space (which is page-aligned) for a page directory. After that, it allocates frames such that any page access will map to the frame with the same linear address, called identity-mapping. This is done for a small section of the address space, so the kernel code can continue to run as normal. It registers an interrupt handler for page faults (below) then enables paging.

## page fault handler

```

void page_fault(registers_t regs)

{

   // A page fault has occurred.

   // The faulting address is stored in the CR2 register.

   u32int faulting_address;

   asm volatile("mov %%cr2, %0" : "=r" (faulting_address));

   // The error code gives us details of what happened.

   int present   = !(regs.err_code & 0x1); // Page not present

   int rw = regs.err_code & 0x2;           // Write operation?

   int us = regs.err_code & 0x4;           // Processor was in user-mode?

   int reserved = regs.err_code & 0x8;     // Overwritten CPU-reserved bits of page entry?

   int id = regs.err_code & 0x10;          // Caused by an instruction fetch?

   // Output an error message.

   monitor_write("Page fault! ( ");

   if (present) {monitor_write("present ");}

   if (rw) {monitor_write("read-only ");}

   if (us) {monitor_write("user-mode ");}

   if (reserved) {monitor_write("reserved ");}

   monitor_write(") at 0x");

   monitor_write_hex(faulting_address);

   monitor_write("\n");

   PANIC("Page fault");

}

```

Disassembly of page_fault:

```

00101aeb <page_fault>:

  101aeb:       f3 0f 1e fb             endbr32

  101aef:       55                      push   ebp

  101af0:       89 e5                   mov    ebp,esp

  101af2:       53                      push   ebx

  101af3:       83 ec 24                sub    esp,0x24

  101af6:       e8 05 05 00 00          call   102000 <__x86.get_pc_thunk.bx>

  101afb:       81 c3 05 15 00 00       add    ebx,0x1505

  101b01:       0f 20 d0                mov    eax,cr2

  101b04:       89 45 f4                mov    DWORD PTR [ebp-0xc],eax

  101b07:       8b 45 30                mov    eax,DWORD PTR [ebp+0x30]

  101b0a:       83 e0 01                and    eax,0x1

  101b0d:       85 c0                   test   eax,eax

  101b0f:       0f 94 c0                sete   al

  101b12:       0f b6 c0                movzx  eax,al

...

```

All this handler does is print out a nice error message. It gets the faulting address from CR2, and analyses the error code pushed by the processor to glean some information from it.

## Testing

Awesome! You now have code that enables paging and handles page faults! Let's just check it actually works, shall we ...?

```

// main.c

int main(struct multiboot *mboot_ptr)

{

   // Initialise all the ISRs and segmentation

   init_descriptor_tables();

   // Initialise the screen (by clearing it)

   monitor_clear();

   initialise_paging();

   monitor_write("Hello, paging world!\n");

   u32int *ptr = (u32int*)0xA0000000;

   u32int do_page_fault = *ptr;

   return 0;

}

```

This will, obviously, initialise paging, print a string to make sure it's set up right and not faulting when it shouldn't, and then force a page fault by reading location 0xA0000000.

Copy floppy.img from project_dir/ to bochs/ directory and run Bochs debugger.

![paging_bochs](img/paging_bochs.png)

## More info

- https://cirosantilli.com/x86-paging

- https://wiki.osdev.org/Paging

- https://wiki.osdev.org/Page_Frame_Allocation

- https://wiki.osdev.org/Setting_Up_Paging

- https://wiki.osdev.org/Page_Tables

- https://wiki.osdev.org/Memory_Management

- https://wiki.osdev.org/Memory_Management_Unit

# Heap - bochs/x86osdev/heap/floppy.img

In order to be responsive to situations that you didn't envisage at the design stage, and to cut down the size of your kernel, you will need some kind of dynamic memory allocation. The current memory allocation system (allocation by placement address) is absolutely fine, and is in fact optimal for both time and space for allocations. The problem occurs when you try to free some memory, and want to reclaim it (this must happen eventually, otherwise you will run out!). The placement mechanism has absolutely no way to do this, and is thus not viable for the majority of kernel allocations.

As a sidepoint of general terminology, any data structure that provides both allocation and deallocation of contiguous memory can be referred to as a heap (or a pool). There is, as such, no standard 'heap algorithm' - different algorithms are used depending on time/space/efficiency requirements. Our requirements are:

- (Relatively) simple to implement.

- Able to check consistency - debugging memory overwrites in a kernel is about ten times more difficult than in normal apps!

The algorithm and data structures presented here are ones which I developed myself. They are so simple, however, that I am sure others will have used them first. The approach is similar to (though simpler than) Doug Lea's malloc, which is used in the GNU C library.

## Data structure description

The algorithm uses two concepts: blocks and holes. Blocks are contiguous areas of memory containing user data currently in use (i.e. malloc()d but not free()d). Holes are blocks but their contents are not in use. So initially by this concept the entire area of heap space is one large hole.

For every hole there is a corresponding descriptor in an index table. The index table is always ordered ascending by the size of the hole pointed to.

Blocks and holes each contain descriptive data - a header and a footer. The header contains the most information about the block - the footer merely contains a pointer to the header (the reason for the footer will become apparent soon). Pseudocode:

```

typedef struct

{

  u32int magic;  // Magic number, used for error checking and identification.

  u8int is_hole; // 1 if this is a hole, 0 if this is a block.

  u32int size;   // Size of the block, including this and the footer.

} header_t;

typedef struct

{

  u32int magic;     // Magic number, same as in header_t.

  header_t *header; // Pointer to the block header.

} footer_t;

```

Notice that each also has a 'magic number' field. This is for error checking, and later will play a part in our 'free' algorithm. This is just a sentinel number - an unusual number that will stand out from others - much like 0xdeadbaba that we used in chapter 2. In the sample code I've gone for 0x123890AB arbitrarily.

Note also that within this tutorial I will refer to the size of a block as the number of bytes from the start of the header to the end of the footer - so within a block of size x, there will be x - sizeof(header_t) - sizeof(footer_t) user-usable bytes.
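To make the size convention concrete, here is a minimal sketch of how a header and footer could be stamped onto a region of memory. It assumes the standard `uint32_t`/`uint8_t` types stand in for the tutorial's `u32int`/`u8int`; the helper names `usable_bytes`, `footer_of` and `write_block` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_MAGIC 0x123890ABu

typedef struct {
    uint32_t magic;    /* Sentinel used for error checking. */
    uint8_t  is_hole;  /* 1 if this is a hole, 0 if this is a block. */
    uint32_t size;     /* Total size, header and footer included. */
} header_t;

typedef struct {
    uint32_t  magic;   /* Same sentinel as in header_t. */
    header_t *header;  /* Back-pointer to the header. */
} footer_t;

/* Bytes left for the caller inside a block of total size `size`. */
size_t usable_bytes(uint32_t size)
{
    assert(size >= sizeof(header_t) + sizeof(footer_t));
    return size - sizeof(header_t) - sizeof(footer_t);
}

/* The footer occupies the last sizeof(footer_t) bytes of the block. */
footer_t *footer_of(header_t *h)
{
    return (footer_t *)((uint8_t *)h + h->size - sizeof(footer_t));
}

/* Stamp a header and footer onto `size` bytes starting at `start`.
   The caller is assumed to pass a suitably aligned address and size. */
header_t *write_block(void *start, uint32_t size, uint8_t is_hole)
{
    header_t *h = (header_t *)start;
    h->magic   = HEAP_MAGIC;
    h->is_hole = is_hole;
    h->size    = size;

    footer_t *f = footer_of(h);
    f->magic  = HEAP_MAGIC;
    f->header = h;
    return h;
}
```

Because the size covers the header and footer, `footer_of` needs nothing but the header to find the other end of the block.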

The index table with pointers to holes:

![heap_format](img/heap_format.png)

## Allocation

Allocation is straightforward, if a little long-winded. Most of the steps are error-checking and creating new holes to minimise memory leaks.

1. Search the index table to find the smallest hole that will fit the requested size. As the table is ordered, this just entails iterating through until we find a hole which will fit.

    * If we didn't find a hole large enough, then:

      1. Expand the heap.

      2. If the index table is empty (no holes have been recorded) then add a new entry to it.

      3. Else, adjust the last header's size member and rewrite the footer.

      4. To ease the number of control-flow statements, we can just recurse and call the allocation function again, trusting that this time there will be a hole large enough.

2. Decide if the hole should be split into two parts. This will normally be the case - we usually will want much less space than is available in the hole. The only time this will not happen is if there is less free space after allocating the block than the header/footer takes up. In this case we can just increase the block size and reclaim it all afterwards.

3. If the block should be page-aligned, we must alter the block's starting address so that it is aligned, and create a new hole in the newly unused area.

    * If it does not need to be page-aligned, we can just delete the hole from the index.

4. Write the new block's header and footer.

5. If the hole was to be split into two parts, do it now and write a new hole into the index.

6. Return the address of the block + sizeof(header_t) to the user.
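Steps 1 and 2 can be sketched in isolation. The code below is only an illustration under a few assumptions: the index is modelled as a plain array of header pointers sorted by ascending size, page alignment is ignored, and the names `find_smallest_hole`, `should_split` and `allocated_size` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t magic;
    uint8_t  is_hole;
    uint32_t size;     /* Total size, header and footer included. */
} header_t;

typedef struct {
    uint32_t  magic;
    header_t *header;
} footer_t;

/* Bookkeeping bytes that every block or hole carries. */
#define OVERHEAD ((uint32_t)(sizeof(header_t) + sizeof(footer_t)))

/* Step 1: the index is sorted by ascending size, so the first hole that
   fits is also the smallest one that fits. `request` is in user bytes.
   Returns the position in the index, or -1 if nothing is large enough. */
int find_smallest_hole(header_t *const *index, int count, uint32_t request)
{
    uint32_t needed = request + OVERHEAD;
    for (int i = 0; i < count; i++) {
        if (index[i]->size >= needed)
            return i;
    }
    return -1;
}

/* Step 2: split only if the leftover could hold a header and footer
   of its own; otherwise the leftover could never be tracked as a hole. */
int should_split(uint32_t hole_size, uint32_t request)
{
    uint32_t needed = request + OVERHEAD;
    assert(hole_size >= needed);
    return (hole_size - needed) >= OVERHEAD;
}

/* Total size the new block ends up with: exactly what was needed when
   we split, or the whole hole when the leftover is too small. */
uint32_t allocated_size(uint32_t hole_size, uint32_t request)
{
    return should_split(hole_size, request) ? request + OVERHEAD : hole_size;
}
```

When `find_smallest_hole` returns -1, the caller would expand the heap and retry, as described in the sub-steps of step 1.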

## Deallocation

Deallocation (freeing) is a little trickier. As mentioned earlier, this is where the efficiency of a memory-management algorithm is really tested. The problem is effective reclamation of memory. The naive solution would be to change the given block to a hole and enter it back into the hole index. However, if I do this:

```
u32int a = kmalloc(8);  // Allocate 8 bytes: returns 0xC0080000 for sake of argument.
u32int b = kmalloc(8);  // Allocate another 8 bytes: returns 0xC0080008.
kfree(a);               // Release a.
kfree(b);               // Release b.
u32int c = kmalloc(16); // What will this allocation return?
```

Note that in this example the space required for headers and footers has been purposely omitted for readability.

Here we have allocated space for 8 bytes, twice. We then release both of those allocations. With the naive release algorithm we would then end up with two 8-byte sized holes in the index. When the next allocation (for 16 bytes) comes along, neither of those holes can fit it, so the kmalloc() call will return 0xC0080010. This is suboptimal. There are 16 bytes of space free at 0xC0080000, so we should be reallocating that!

The solution to this problem in most cases is a variation on a simple algorithm that I call unification: converting two adjacent holes into one. (Please note that this coining of a term is not from a sense of self-importance, merely from the absence of a standardised name.)

It works thus: when free()ing a block, look at what is immediately to the left (assuming 0-4GB left-to-right) of the header. If that is a footer, which can be discovered from the value of the magic number, then follow the pointer to its header and query whether it is a hole or a block. If it is a hole, we can modify its header's size attribute to take into account both its size and ours, then point our footer to its header. We have thus amalgamated both holes into one (and in this case there is no need to do an expensive insert operation on the index).

That is what I call unifying left. There is also unifying right, which should be performed on free() as well. Here we look at what is directly after the footer. If we find a header there, again identified by its magic number, we check if it is a hole. We can then use its size attribute to find its footer. We rewrite the footer's pointer to point to our header. Then, all that needs to be done is to remove its old entry from the hole index, and add our own.
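The pointer arithmetic behind both directions can be sketched as below. This is an illustration rather than the final implementation: the `heap_start` and `heap_end` bounds checks are an addition of this sketch, the helper names are hypothetical, and index bookkeeping is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_MAGIC 0x123890ABu

typedef struct { uint32_t magic; uint8_t is_hole; uint32_t size; } header_t;
typedef struct { uint32_t magic; header_t *header; } footer_t;

/* The footer occupies the last sizeof(footer_t) bytes of the block. */
footer_t *footer_of(header_t *h)
{
    return (footer_t *)((uint8_t *)h + h->size - sizeof(footer_t));
}

/* Merge hole `h` into the hole directly to its left, if there is one.
   Returns the header that now describes the memory: the left
   neighbour's after a merge, otherwise `h` unchanged. */
header_t *unify_left(header_t *h, uint8_t *heap_start)
{
    assert(h->magic == HEAP_MAGIC && h->is_hole);
    if ((uint8_t *)h - heap_start < (ptrdiff_t)sizeof(footer_t))
        return h;                       /* Nothing lies to our left. */

    footer_t *left_foot = (footer_t *)((uint8_t *)h - sizeof(footer_t));
    if (left_foot->magic != HEAP_MAGIC || !left_foot->header->is_hole)
        return h;                       /* Not a footer, or a block in use. */

    header_t *left = left_foot->header;
    left->size += h->size;              /* The left hole swallows us. */
    footer_of(left)->header = left;     /* Our old footer now belongs to it. */
    return left;
}

/* Absorb the hole directly to the right of `h`, if there is one.
   Returns the absorbed header so the caller can remove it from the
   hole index, or NULL if nothing was merged. */
header_t *unify_right(header_t *h, uint8_t *heap_end)
{
    assert(h->magic == HEAP_MAGIC && h->is_hole);
    uint8_t *next = (uint8_t *)h + h->size;
    if (heap_end - next < (ptrdiff_t)sizeof(header_t))
        return NULL;                    /* We are the last thing in the heap. */

    header_t *right = (header_t *)next;
    if (right->magic != HEAP_MAGIC || !right->is_hole)
        return NULL;                    /* Not a header, or a block in use. */

    h->size += right->size;             /* We swallow the right hole. */
    footer_of(h)->header = h;           /* Its footer now points at us. */
    return right;
}
```

In both cases the surviving footer is found purely from the updated size, which is why the size field is adjusted before the footer is rewritten.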

Note also that in the name of reclaiming space, if we are free()ing the last block in the heap (there are no holes or blocks after us), then we can contract the size of the heap. To avoid this happening constantly, in my implementation I have defined a minimum heap size, below which it will not contract.
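As a rough sketch of the contraction rule, the function below picks the size a heap could shrink to. Everything specific here is an assumption made for illustration: the function name, the `HEAP_MIN_SIZE` value, and the choice to round up to a whole 4 KiB page.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE     0x1000u   /* 4 KiB x86 page. */
#define HEAP_MIN_SIZE 0x70000u  /* Hypothetical floor, chosen for illustration. */

/* Size the heap should have after contracting, given that only
   `wanted` bytes of its current `old_size` bytes are still needed. */
uint32_t contracted_size(uint32_t old_size, uint32_t wanted)
{
    assert(wanted <= old_size);

    /* Assumption: the heap shrinks in whole pages, so round up. */
    uint32_t new_size = (wanted + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);

    /* Never contract below the minimum heap size. */
    if (new_size < HEAP_MIN_SIZE)
        new_size = HEAP_MIN_SIZE;

    /* Contracting must never make the heap larger than it was. */
    if (new_size > old_size)
        new_size = old_size;

    return new_size;
}
```

The floor means that a burst of small frees near the end of the heap does not cause it to shrink and regrow repeatedly.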

Unifying the two allocations in the top diagram into one in the lower diagram:

![unifying](img/unifying.png)

## Pseudocode

1. Find the header by taking the given pointer and subtracting the sizeof(header_t).

2. Sanity checks. Assert that the header's and footer's magic numbers remain intact.

3. Set the is_hole flag in our header to 1.

4. If the thing immediately to our left is a footer:

    * Unify left. In this case, at the end of the algorithm we shouldn't add our header to the hole index (the header we are unifying with is already there!) so set a flag which the algorithm checks later.

5. If the thing immediately to our right is a header:

    * Unify right.

6. If the footer is the last in the heap ( footer_location+sizeof(footer_t) == end_address ):

    * Contract.

7. Insert the header into the hole array unless the flag described in Unify left is set.
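The steps above can be put together in a small host-side sketch. It is a toy under stated assumptions, not the kernel implementation: `toy_heap_t`, `toy_free`, `index_insert` and `index_remove` are hypothetical names, the hole index is a fixed-size array kept sorted by size, and step 6 (contraction) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_MAGIC 0x123890ABu
#define MAX_HOLES  32

typedef struct { uint32_t magic; uint8_t is_hole; uint32_t size; } header_t;
typedef struct { uint32_t magic; header_t *header; } footer_t;

/* Hypothetical toy heap: bounds plus a hole index sorted by ascending size. */
typedef struct {
    uint8_t  *start, *end;
    header_t *holes[MAX_HOLES];
    int       count;
} toy_heap_t;

footer_t *footer_of(header_t *h)
{
    return (footer_t *)((uint8_t *)h + h->size - sizeof(footer_t));
}

void index_remove(toy_heap_t *heap, header_t *h)
{
    for (int i = 0; i < heap->count; i++) {
        if (heap->holes[i] != h)
            continue;
        for (int j = i; j < heap->count - 1; j++)
            heap->holes[j] = heap->holes[j + 1];
        heap->count--;
        return;
    }
}

void index_insert(toy_heap_t *heap, header_t *h)
{
    assert(heap->count < MAX_HOLES);
    int i = heap->count++;
    while (i > 0 && heap->holes[i - 1]->size > h->size) {
        heap->holes[i] = heap->holes[i - 1];  /* Shift larger holes up. */
        i--;
    }
    heap->holes[i] = h;
}

void toy_free(toy_heap_t *heap, void *p)
{
    if (p == NULL)
        return;
    /* 1. Step back from the user pointer to the header. */
    header_t *h = (header_t *)((uint8_t *)p - sizeof(header_t));
    /* 2. Sanity checks on both magic numbers. */
    assert(h->magic == HEAP_MAGIC && footer_of(h)->magic == HEAP_MAGIC);
    /* 3. Turn the block into a hole. */
    h->is_hole = 1;
    int do_add = 1;

    /* 4. Unify left: the left hole is already in the index. */
    if ((uint8_t *)h - heap->start >= (ptrdiff_t)sizeof(footer_t)) {
        footer_t *lf = (footer_t *)((uint8_t *)h - sizeof(footer_t));
        if (lf->magic == HEAP_MAGIC && lf->header->is_hole) {
            lf->header->size += h->size;
            h = lf->header;
            footer_of(h)->header = h;
            do_add = 0;
        }
    }
    /* 5. Unify right: the absorbed hole must leave the index. */
    uint8_t *next = (uint8_t *)h + h->size;
    if (heap->end - next >= (ptrdiff_t)sizeof(header_t)) {
        header_t *r = (header_t *)next;
        if (r->magic == HEAP_MAGIC && r->is_hole) {
            index_remove(heap, r);
            h->size += r->size;
            footer_of(h)->header = h;
        }
    }
    /* 6. Contraction is omitted from this sketch. */
    /* 7. Add ourselves unless a left merge already covers us. */
    if (do_add)
        index_insert(heap, h);
}
```

One design point worth noticing: after a left merge the enlarged hole keeps its old slot in the index, exactly as step 4 describes, so its position may no longer match its new size until it is removed and reinserted.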

## Implementing an ordered list

So now we come to the implementation. As usual I'm going to try and explain the utility datatypes and functions first, and finish up with the allocation/free functions themselves.

The first datatype we need is an implementation of an ordered list. This concept will be used multiple times in your kernel (it is a common requirement) so it is probably a good idea to abstract it, so it can be