ThemeliOS

ThemeliOS (from Greek θεμέλιο — “foundation”) is an experimental capability-based microkernel operating system written in Rust. It is designed from the ground up to do one thing well: run container workloads securely.

What is ThemeliOS?

ThemeliOS is a from-scratch kernel — it does not use or build on top of Linux. It implements its own memory management, process scheduling, inter-process communication, and security model.

The long-term vision is a minimal, immutable OS that:

  • Boots on virtual machines and bare metal
  • Runs OCI-compatible container images
  • Serves as a Kubernetes/K3s worker node
  • Provides hardware-enforced isolation between containers via capabilities
  • Has no SSH, no shell, and no way to “log in” — all management is via API

Why build a new kernel?

Existing container OSes (Bottlerocket, Talos Linux, Flatcar) all use the Linux kernel with a stripped-down userspace. This is practical, but it inherits Linux’s security model — namespaces and cgroups are opt-in isolation bolted onto a kernel designed for general-purpose computing.

ThemeliOS takes the opposite approach: isolation is the default. The capability-based security model means a process has zero access to anything unless explicitly granted. There’s nothing to escape from because there’s no ambient authority to escalate to.

Project status

ThemeliOS is in early development. See the Milestones page for the current roadmap.

License

MIT — Copyright (c) 2026 Rudi MK

Development Setup

This guide walks through setting up a development environment for ThemeliOS on macOS or Linux.

Prerequisites

1. Rust nightly toolchain

ThemeliOS requires Rust nightly because the build relies on unstable features, most notably -Zbuild-std to rebuild core for the bare-metal targets; the kernel itself is #![no_std]/#![no_main] and uses inline assembly and a custom allocator.

The project pins the exact toolchain via rust-toolchain.toml, so you just need rustup installed — it will automatically download the correct nightly version.

Install rustup (if you don’t have it):

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

After cloning the repo, the first cargo command will automatically install the pinned nightly toolchain plus the bare-metal targets (x86_64-unknown-none, aarch64-unknown-none).
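For reference, that behavior is driven by a rust-toolchain.toml along these lines (illustrative only; the repo pins its own exact nightly date and component set):

[toolchain]
channel = "nightly-2025-01-01"   # placeholder; the repo pins a specific nightly
components = ["rust-src"]        # rust-src is required by -Zbuild-std to rebuild core
targets = ["x86_64-unknown-none", "aarch64-unknown-none"]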

You can verify with:

rustup show

You should see a nightly toolchain with the x86_64-unknown-none and aarch64-unknown-none targets listed.

2. QEMU

QEMU emulates the hardware that ThemeliOS runs on. You need qemu-system-x86_64 for the primary amd64 target and optionally qemu-system-aarch64 for arm64.

macOS (Homebrew):

brew install qemu

This installs all QEMU system emulators.

Ubuntu/Debian:

sudo apt install qemu-system-x86 qemu-system-arm

Fedora:

sudo dnf install qemu-system-x86 qemu-system-aarch64

Arch Linux:

sudo pacman -S qemu-full

Verify installation:

qemu-system-x86_64 --version
qemu-system-aarch64 --version

3. xorriso

xorriso creates bootable ISO images. The build pipeline uses it to package the kernel with the Limine bootloader into a hybrid BIOS+UEFI ISO.

macOS (Homebrew):

brew install xorriso

Ubuntu/Debian:

sudo apt install xorriso

Fedora:

sudo dnf install xorriso

4. C compiler (for Limine CLI tool)

The first cargo xtask run downloads and builds the Limine bootloader’s CLI tool, which is a small C program. This requires a C compiler.

  • macOS: Xcode Command Line Tools (xcode-select --install)
  • Linux: gcc or clang (usually pre-installed)

5. mdbook (optional, for building documentation)

cargo install mdbook

Building and running

All build and run commands go through the xtask tool. You never need to invoke cargo build for the kernel directly.

Build the kernel

cargo xtask build

This cross-compiles the kernel for x86_64-unknown-none (the default target).

For arm64:

cargo xtask build --arch arm64

Run in QEMU

cargo xtask run

This builds the kernel, creates a bootable ISO, and launches it in QEMU in headless mode — serial output is piped to your terminal, but no graphical window opens. Press Ctrl+A, X to exit QEMU.

For arm64 (not yet implemented):

cargo xtask run --arch arm64

Build ISO only (without launching QEMU)

cargo xtask iso

This builds the kernel and creates a bootable ISO at target/themelios.iso without launching QEMU. Useful when you want to run QEMU manually with custom flags.
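For example, a manual invocation that mirrors the headless default might look like this (illustrative flags; adjust to taste):

qemu-system-x86_64 -cdrom target/themelios.iso -serial stdio -display none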

Run with QEMU display window

To see the QEMU graphical window (shows the Limine bootloader screen and any framebuffer output):

cargo xtask run --display

This does everything cargo xtask run does but opens a QEMU window instead of running headless. Serial output still goes to your terminal.

Build documentation

cargo xtask docs

This builds both the mdbook (to docs/book/) and the rustdoc API docs.

Shorthand alias

The workspace defines a cargo xt alias, so these also work:

cargo xt build
cargo xt run
cargo xt docs

Project layout

themelios/
├── kernel/          # The kernel crate (#![no_std], bare-metal)
│   └── src/
│       ├── main.rs  # Kernel entry point, module declarations
│       ├── arch/    # Architecture-specific (x86_64, aarch64)
│       ├── mm/      # Memory management
│       ├── sched/   # Scheduler
│       ├── cap/     # Capability system
│       ├── ipc/     # Inter-process communication
│       ├── drivers/ # Device drivers (VirtIO, serial, etc.)
│       ├── fs/      # Filesystem
│       └── net/     # Networking
├── xtask/           # Build tooling (runs on host)
├── docs/            # mdbook documentation
├── .cargo/          # Cargo configuration
└── CLAUDE.md        # Project documentation for AI assistants

IDE setup

VS Code

Install the rust-analyzer extension. It should pick up the workspace configuration automatically.

If rust-analyzer struggles with the #![no_std] kernel crate, you may need to add this to .vscode/settings.json:

{
    "rust-analyzer.cargo.target": "x86_64-unknown-none",
    "rust-analyzer.cargo.buildScripts.enable": true
}

Other editors

Any editor with rust-analyzer LSP support should work. The key setting is ensuring the target is set to x86_64-unknown-none for the kernel crate.

Troubleshooting

“can’t find crate for core”

This means the bare-metal target isn’t installed. Run:

rustup target add x86_64-unknown-none aarch64-unknown-none

Or let rust-toolchain.toml handle it by running any cargo command in the project.

“error: -Zbuild-std is unstable”

You need to be on the nightly toolchain. Check with rustup show — the project’s rust-toolchain.toml should select nightly automatically.

QEMU not found

Make sure QEMU is installed and on your $PATH. See the QEMU installation section above.

Bootloader

ThemeliOS uses the Limine bootloader. This page explains why, how it works, and how it fits into the build pipeline.

Why Limine?

We evaluated several options for booting ThemeliOS:

Option             Pros                                           Cons
Custom UEFI app    Full control                                   Massive effort, x86_64 UEFI only initially
Multiboot2         Simple, QEMU -kernel flag                      BIOS only, no arm64, no UEFI
bootloader crate   Very easy Rust integration                     x86_64 only, no arm64
Limine             BIOS + UEFI, x86_64 + arm64, well-maintained   External dependency

Limine was chosen because:

  1. Multi-architecture: Supports x86_64 and aarch64 (and RISC-V, LoongArch). We need both for our cloud targets.
  2. Multi-firmware: Works on both BIOS (legacy) and UEFI (modern). Cloud platforms use UEFI; QEMU defaults to BIOS.
  3. Higher-half kernel: Limine sets up page tables that map our kernel at 0xffffffff80000000, which is the standard layout for 64-bit kernels.
  4. Clean protocol: The Limine boot protocol gives us a memory map, framebuffer, and other boot info without writing any assembly.
  5. Active maintenance: Regular releases, good documentation.

Cloud compatibility

Limine’s UEFI support means ThemeliOS can boot on:

  • AWS EC2 (Nitro): UEFI supported on most instance types
  • GCP Compute Engine: UEFI supported
  • Azure Gen2 VMs: UEFI
  • Bare metal: UEFI is standard on modern server hardware
  • QEMU/KVM: Both BIOS (default) and UEFI (via OVMF)

The same kernel binary works on all platforms — only the bootloader firmware interface differs, and Limine handles that.

How it works

Boot sequence

  1. Firmware (BIOS or UEFI) loads the Limine bootloader from the boot media
  2. Limine reads limine.conf to find the kernel path and boot protocol
  3. Limine loads the kernel ELF into memory at the addresses specified in the linker script
  4. Limine sets up:
    • 64-bit long mode (x86_64) or EL1 (aarch64)
    • 4-level page tables with identity + higher-half mappings
    • A valid stack
  5. Limine scans the kernel’s .requests ELF section for boot protocol requests
  6. Limine fills in the requests (memory map, framebuffer, etc.)
  7. Limine jumps to the kernel entry point (kmain)

Boot protocol requests

The kernel communicates with Limine through static data structures placed in a special ELF section. These are “requests” — the kernel declares what boot information it needs, and Limine fills in the responses.

use limine::BaseRevision;

// Placed in the .requests ELF section via the linker script
#[used]
#[link_section = ".requests"]
static BASE_REVISION: BaseRevision = BaseRevision::new();
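
Other boot information is requested the same way. For example, a framebuffer request, with names as in the limine 0.5 crate (verify the exact constructors against the pinned version's docs):

use limine::request::FramebufferRequest;

#[used]
#[link_section = ".requests"]
static FRAMEBUFFER_REQUEST: FramebufferRequest = FramebufferRequest::new();

fn framebuffer_info() {
    // get_response() returns Some only after Limine has filled the request.
    if let Some(fb) = FRAMEBUFFER_REQUEST
        .get_response()
        .and_then(|response| response.framebuffers().next())
    {
        let _ = (fb.addr(), fb.width(), fb.height()); // base pointer and dimensions
    }
}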

The linker script places these between start/end markers so Limine knows where to scan:

.data : {
    ...
    KEEP(*(.requests_start_marker))
    KEEP(*(.requests))
    KEEP(*(.requests_end_marker))
}

Configuration file

limine.conf (in the project root) uses the v8 format:

timeout: 0

/ThemeliOS
    protocol: limine
    kernel_path: boot():/boot/themelios

  • timeout: 0 — boot immediately without showing a menu
  • /ThemeliOS — defines a boot entry
  • protocol: limine — use the Limine protocol (not Linux or Multiboot)
  • kernel_path: boot():/boot/themelios — load the kernel from the boot volume

Linker script

The linker script (kernel/linker-x86_64.ld) controls the kernel’s memory layout:

  • Entry point: ENTRY(kmain) — tells the ELF where execution begins
  • Load address: 0xffffffff80000000 — the higher-half virtual address
  • Sections: .text (code), .rodata (constants), .data (mutable data + Limine requests), .bss (zeroed data)

The kernel must be compiled with -Crelocation-model=static to produce a non-PIE executable with fixed addresses that match the linker script.

Build pipeline

The cargo xtask run command handles the full pipeline:

  1. Cross-compile the kernel for x86_64-unknown-none
  2. Download Limine (one-time: git clone of the v8.x-binary branch to target/limine/)
  3. Build Limine CLI (one-time: make compiles limine.c)
  4. Create ISO via xorriso:
    • Copies kernel, Limine files, and limine.conf into an ISO directory structure
    • Creates a hybrid BIOS+UEFI bootable ISO
    • Installs BIOS boot sectors via limine bios-install
  5. Launch QEMU with the ISO attached as a CD-ROM

Limine version

  • Bootloader: v8.x (binary distribution from v8.x-binary branch)
  • Rust crate: limine = "0.5" (boot protocol structures)

The bootloader binaries are cached in target/limine/ and not committed to git.

Architecture Overview

ThemeliOS is a capability-based microkernel. This page explains the high-level design and the reasoning behind key architectural decisions.

Microkernel vs monolithic

In a monolithic kernel (like Linux), drivers, filesystems, and networking all run inside the kernel with full hardware access. A bug in any driver can crash or compromise the entire system.

In a microkernel, only the absolute minimum runs in kernel space:

Kernel space             Userspace
Memory management        Device drivers
Process scheduling       Filesystem
IPC (message passing)    Network stack
Capability enforcement   Container runtime
                         Management API

Everything else runs as isolated userspace processes that communicate via IPC. A buggy driver crashes its own process, not the kernel.

Why microkernel for ThemeliOS? Since we’re building an OS specifically for running untrusted container workloads, minimizing the trusted computing base (the code that can compromise the whole system) is critical. The smaller the kernel, the smaller the attack surface.

Capability-based security

ThemeliOS does not use Linux-style permissions (UID/GID, filesystem permissions) or Linux-style isolation (namespaces, cgroups). Instead, it uses capabilities.

What is a capability?

A capability is an unforgeable token that grants its holder specific permissions on a specific resource. For example:

  • “Read and write to memory region 0x1000–0x2000”
  • “Send messages to IPC endpoint #42”
  • “Access VirtIO block device at MMIO address 0xFE00”

Key properties

  1. No ambient authority: A newly created process has zero capabilities. It can’t do anything until its parent grants it capabilities.

  2. Unforgeable: Capabilities are managed by the kernel. Userspace can’t create them or guess valid ones.

  3. Transferable: Capabilities can be passed between processes via IPC, enabling controlled delegation.

  4. Revocable: A capability can be revoked, immediately cutting off access.

Why not namespaces?

Linux namespaces are “isolation after the fact” — processes start with broad access and namespaces restrict what they can see. Capabilities are “isolation by default” — processes start with nothing and are explicitly granted only what they need.

For a container OS, this means a compromised container literally cannot access resources it wasn’t given capabilities for. There’s no kernel interface to probe, no /proc to read, no syscall to escalate through — the authority simply doesn’t exist.

Inspiration

  • seL4: Formally verified capability microkernel. ThemeliOS borrows its capability model.
  • Fuchsia/Zircon: Google’s capability-based OS. Demonstrates the model works at scale.

Memory model

ThemeliOS uses hardware-enforced memory isolation:

  • Each process runs in its own virtual address space (page tables enforced by the MMU).
  • The kernel has its own address space that userspace cannot access.
  • Shared memory between processes requires explicit capabilities from both sides.

Physical memory management

A frame allocator tracks free physical memory pages (4 KiB). Frames are allocated to:

  • Process page tables
  • Kernel heap
  • Shared memory regions
  • DMA buffers for device drivers

Virtual memory layout

The virtual address space layout will be defined per-architecture, but the general structure is:

0x0000_0000_0000_0000  ┌───────────────────────┐
                       │  Userspace            │
                       │  (per-process)        │
0x0000_7FFF_FFFF_FFFF  └───────────────────────┘
                       ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
                         Non-canonical hole
                       └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
0xFFFF_8000_0000_0000  ┌───────────────────────┐
                       │  Kernel space         │
                       │  (shared, all procs)  │
0xFFFF_FFFF_FFFF_FFFF  └───────────────────────┘

(This is the x86_64 layout; aarch64 is similar but with different conventions.)

IPC

Inter-process communication is the backbone of the microkernel. Since drivers, filesystems, and networking all run in userspace, every system operation involves IPC.

Synchronous message passing

The primary mechanism: a client sends a message to a server and blocks until it gets a reply. This is used for request/response patterns like “read this file” or “send this network packet.”
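
A sketch of what this surface could look like, assuming capability slots are plain indices into the caller's CSpace (a hypothetical API, not the final design):

type CapSlot = usize;

#[derive(Debug)]
enum IpcError {
    InvalidSlot, // slot empty or out of range
    Revoked,     // the endpoint capability was revoked mid-call
}

/// A message is deliberately tiny: a few register-sized words plus an
/// optional capability to transfer. Bulk data travels via shared memory.
struct Message {
    data: [u64; 4],
    cap: Option<CapSlot>,
}

trait SyncIpc {
    /// Client side: send `msg` to the endpoint in `slot`, block for the reply.
    fn call(&self, slot: CapSlot, msg: Message) -> Result<Message, IpcError>;
    /// Server side: block until a request arrives on the endpoint in `slot`.
    fn recv(&self, slot: CapSlot) -> Result<Message, IpcError>;
}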

Performance consideration

IPC overhead is the classic criticism of microkernels. ThemeliOS will address this by:

  • Keeping messages small (pointers to shared memory for bulk data)
  • Using register-based fast-path for small messages
  • Careful cache-aware scheduling of communicating processes

Immutability

The OS root filesystem is read-only. The entire OS image is a single artifact that is booted as-is.

  • Updates: Swap the entire image. No package managers, no apt-get, no partial updates.
  • Configuration: Injected at boot time via cloud-init-style metadata or the management API.
  • Ephemeral state: Container images and runtime state live on a RAM-backed ephemeral layer that is lost on reboot.

This model treats nodes as cattle: if a node is unhealthy, replace it with a fresh one. No debugging on the node, no SSHing in, no manual fixes.

Target platforms

ThemeliOS is designed to run as a virtual machine, with bare-metal support as a secondary goal.

Platform                 Status                 Notes
QEMU/KVM (x86_64)        Primary dev target     Used for all development and testing
QEMU (aarch64)           Secondary dev target   ARM64 support
AWS (EC2)                Future                 Nitro hypervisor
GCP (Compute Engine)     Future                 KVM-based
Azure (VMs)              Future                 Hyper-V
Bare metal (headless)    Future                 Server hardware, no GPU/display

Capability System

This document details the design of ThemeliOS’s capability system — the core security mechanism of the kernel.

Status: Design phase. Implementation begins in Phase 2.

Overview

In ThemeliOS, every resource is accessed through capabilities. A capability is a kernel-managed, unforgeable token that encodes:

  1. Which resource (identified by a kernel object ID)
  2. What operations are permitted (a bitmask of rights)

Capability types

Capability type   Resource                        Example rights
MemoryCap         Physical memory region          Read, Write, Execute, Map
EndpointCap       IPC endpoint                    Send, Receive
ThreadCap         Thread/process                  Start, Stop, Suspend, Resume
DeviceCap         Hardware device (MMIO region)   Read, Write
IRQCap            Interrupt line                  Acknowledge, Bind

Capability spaces

Each process has a capability space (CSpace) — a table mapping local capability slots to kernel objects. A process refers to its capabilities by slot index, not by object ID. The kernel translates slot indices to objects on each syscall.

Process A's CSpace:
  Slot 0 → MemoryCap(region=0x1000, rights=RW)
  Slot 1 → EndpointCap(endpoint=#7, rights=Send)
  Slot 2 → (empty)
  Slot 3 → ThreadCap(thread=#12, rights=Start|Stop)

Process B's CSpace:
  Slot 0 → EndpointCap(endpoint=#7, rights=Receive)
  Slot 1 → MemoryCap(region=0x2000, rights=R)

Process A can send to endpoint #7 (slot 1), and Process B can receive from it (slot 0). Neither can access the other’s memory — they’d need explicit capabilities for that.
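
In code, the structures above might look roughly like this (a design-phase sketch, assuming the bitflags crate for rights masks; none of these names are final):

use bitflags::bitflags;

bitflags! {
    #[derive(Clone, Copy)]
    pub struct Rights: u32 {
        const READ    = 1 << 0;
        const WRITE   = 1 << 1;
        const EXECUTE = 1 << 2;
        const SEND    = 1 << 3;
        const RECEIVE = 1 << 4;
    }
}

type KernelObjectId = u32;

/// A kernel-side capability: which object, with which rights.
#[derive(Clone, Copy)]
struct Capability {
    object: KernelObjectId,
    rights: Rights,
}

/// Per-process capability space: slot index → capability.
struct CSpace {
    slots: [Option<Capability>; 64], // fixed-size table, enough for a sketch
}

impl CSpace {
    /// The translation the kernel performs on every syscall: slot → object.
    /// An empty or out-of-range slot simply doesn't resolve.
    fn lookup(&self, slot: usize) -> Option<Capability> {
        self.slots.get(slot).copied().flatten()
    }
}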

Capability operations

Grant

A parent process can grant a capability to a child process, optionally with reduced rights:

Parent has: MemoryCap(region=X, rights=RWX)
Parent grants child: MemoryCap(region=X, rights=R)

The child gets read-only access. Rights can only be reduced, never elevated.
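
Continuing the sketch above, the reduction rule is just bitwise intersection:

impl Capability {
    /// Derive a capability for a child with at most the parent's rights:
    /// the intersection can never contain a right the parent lacks.
    fn derive(&self, requested: Rights) -> Capability {
        Capability {
            object: self.object,
            rights: self.rights & requested,
        }
    }
}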

Transfer via IPC

Capabilities can be attached to IPC messages. This is how services delegate access:

FileServer receives "open /config" request
FileServer replies with MemoryCap(region=file_data, rights=R)
Client now has read access to the file's memory region

Revoke

The kernel (or a process with the appropriate meta-capability) can revoke a capability, immediately invalidating it. Any future use of the revoked slot returns an error.

Container mapping

In ThemeliOS, a “container” is a group of processes sharing a common set of capabilities. The container’s capability set defines its sandbox:

  • Memory: Only the memory regions granted to it
  • Network: Only the network endpoints it has capabilities for
  • Filesystem: Only the filesystem views it’s been granted
  • IPC: Only the services it has endpoint capabilities for

A container cannot discover or access anything outside its capability set. Unlike Linux containers (where a kernel exploit can escape the namespace), escaping a capability sandbox requires forging a kernel object — which is impossible without a kernel memory corruption bug.

Comparison with Linux isolation

Aspect               Linux (namespaces/cgroups)                ThemeliOS (capabilities)
Default              Access everything, restrict selectively   Access nothing, grant explicitly
Enforcement          Kernel checks on each syscall             No syscall exists without capability
Escape risk          Kernel bugs can bypass namespaces         Requires kernel memory corruption
Resource discovery   Can probe for resources                   Can’t even address unknown resources
Granularity          Per-namespace                             Per-object, per-right

Memory Management

This document describes ThemeliOS’s memory management subsystem design.

Status: Design phase. Implementation begins in Phase 1.

Overview

The memory management (MM) subsystem is responsible for:

  1. Physical frame allocation — tracking which 4 KiB pages of physical RAM are free or in use
  2. Virtual memory — creating and managing page tables for each process
  3. Kernel heap — providing dynamic allocation (alloc-style) for kernel data structures

Physical memory

Boot-time discovery

The bootloader provides a memory map describing which physical address ranges are usable RAM, reserved by firmware, or used for MMIO. The frame allocator uses this map to initialize its free list.

Frame allocator

The frame allocator hands out 4 KiB physical memory frames. Initial implementation will use a bitmap allocator:

  • One bit per physical frame (1 = allocated, 0 = free)
  • Simple, predictable, easy to implement
  • For 4 GiB of RAM: bitmap is 128 KiB (manageable)
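
A minimal sketch of such a bitmap allocator (illustrative only; the real one must also seed itself from the boot memory map and skip reserved regions):

const FRAME_SIZE: usize = 4096;

struct BitmapFrameAllocator {
    bitmap: &'static mut [u8], // 1 bit per frame: 1 = allocated, 0 = free
}

impl BitmapFrameAllocator {
    /// Linear scan for a free frame; returns its physical address.
    fn alloc(&mut self) -> Option<usize> {
        for (byte_idx, byte) in self.bitmap.iter_mut().enumerate() {
            if *byte != 0xFF {
                let bit = (!*byte).trailing_zeros() as usize; // first 0 bit
                *byte |= 1 << bit;
                return Some((byte_idx * 8 + bit) * FRAME_SIZE);
            }
        }
        None // out of physical memory
    }

    fn free(&mut self, phys_addr: usize) {
        let frame = phys_addr / FRAME_SIZE;
        self.bitmap[frame / 8] &= !(1 << (frame % 8));
    }
}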

Later optimization: replace with a buddy allocator for efficient allocation of contiguous multi-frame regions (needed for DMA buffers, large pages).

Capability integration

Physical frames are resources protected by capabilities. When a process requests memory:

  1. Kernel allocates a frame from the free pool
  2. Kernel creates a MemoryCap for that frame
  3. Kernel inserts the capability into the process’s CSpace
  4. Process can now map the frame into its address space using the capability

A process cannot access physical memory it doesn’t have a capability for — the page tables are configured to reflect capability permissions.

Virtual memory

Address space layout (x86_64)

 Lower half (user space, per-process):
   0x0000_0000_0000_0000 - 0x0000_7FFF_FFFF_FFFF

 Upper half (kernel space, shared across all processes):
   0xFFFF_8000_0000_0000 - 0xFFFF_FFFF_FFFF_FFFF
     ├── Physical memory direct map
     ├── Kernel code and data
     ├── Kernel heap
     └── Per-CPU data

Page tables

x86_64 uses 4-level page tables (PML4 → PDPT → PD → PT), each with 512 entries. Each entry is 8 bytes and can point to:

  • The next level table
  • A large page (2 MiB at PD level, 1 GiB at PDPT level)
  • A 4 KiB page (at PT level)
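
Concretely, the hardware slices a virtual address into one 9-bit index per level plus a 12-bit page offset:

/// Split an x86_64 virtual address into its four table indices and offset.
fn table_indices(vaddr: u64) -> (usize, usize, usize, usize, usize) {
    let offset = (vaddr & 0xFFF) as usize;         // bits 0..12
    let pt     = ((vaddr >> 12) & 0x1FF) as usize; // bits 12..21
    let pd     = ((vaddr >> 21) & 0x1FF) as usize; // bits 21..30
    let pdpt   = ((vaddr >> 30) & 0x1FF) as usize; // bits 30..39
    let pml4   = ((vaddr >> 39) & 0x1FF) as usize; // bits 39..48
    (pml4, pdpt, pd, pt, offset)
}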

The kernel manages page tables for each process. When a context switch occurs, the CPU’s CR3 register is loaded with the new process’s PML4 physical address, instantly switching the entire address space.
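
The switch itself is a single register write; a sketch of the unsafe core of a context switch:

use core::arch::asm;

/// Load a new PML4. Writing CR3 also flushes non-global TLB entries,
/// so the old process's translations disappear atomically.
unsafe fn switch_address_space(pml4_phys: u64) {
    asm!("mov cr3, {}", in(reg) pml4_phys, options(nostack, preserves_flags));
}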

aarch64 differences

aarch64 uses a similar 4-level translation table scheme but with different register names (TTBR0/TTBR1 instead of CR3) and different table entry formats. The architecture abstraction layer hides these differences from the rest of the kernel.

Kernel heap

The kernel needs dynamic allocation for data structures like:

  • Process control blocks
  • Capability tables
  • IPC message buffers
  • Driver state

We’ll use the linked_list_allocator crate initially (a simple free-list allocator suitable for #![no_std] kernels), backed by physical frames allocated from the frame allocator.

The kernel heap lives in the upper-half virtual address space and is shared across all contexts (but only accessible from kernel mode).
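
Wiring that up is small. A sketch, where heap_start and heap_size are placeholders for wherever the kernel maps its heap:

use linked_list_allocator::LockedHeap;

#[global_allocator]
static KERNEL_HEAP: LockedHeap = LockedHeap::empty();

/// Called once during MM init, after `heap_start..heap_start + heap_size`
/// has been mapped to frames taken from the frame allocator.
unsafe fn init_kernel_heap(heap_start: *mut u8, heap_size: usize) {
    KERNEL_HEAP.lock().init(heap_start, heap_size);
}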

Memory safety

Rust’s ownership model provides compile-time guarantees against:

  • Use-after-free: The compiler prevents using a frame after it’s been freed
  • Double-free: The compiler prevents freeing a frame twice
  • Data races: Shared mutable access requires synchronization (Mutex, RefCell)

The unsafe keyword is required for raw pointer operations (hardware register access, page table manipulation) — these are confined to small, well-documented blocks.

Milestones

ThemeliOS development is organized into phases. Each phase builds on the previous one and produces a working, testable artifact.

Phase 0 — Boot

Goal: Get the kernel booting on QEMU and printing to the serial console.

Deliverables:

  • Bootloader integration (Limine or UEFI)
  • Architecture-specific early init (x86_64 first)
  • Serial console output (16550 UART on x86_64)
  • “Hello from ThemeliOS” printed on boot
  • cargo xtask run boots the kernel in QEMU end-to-end

What you’ll learn: Bare-metal Rust, the boot process, how hardware/QEMU works at the lowest level.
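
For a taste of what Phase 0's serial output involves, here is a minimal polling write to a 16550 UART at the conventional COM1 port on x86_64 (illustrative; a real driver initializes the UART first and wraps the port in a proper type):

use core::arch::asm;

const COM1: u16 = 0x3F8;

/// Busy-wait until the transmit holding register is empty (LSR bit 5),
/// then write one byte out the serial port.
unsafe fn serial_write_byte(byte: u8) {
    let lsr_port = COM1 + 5; // line status register
    loop {
        let lsr: u8;
        asm!("in al, dx", out("al") lsr, in("dx") lsr_port);
        if lsr & 0x20 != 0 {
            break; // transmitter ready
        }
    }
    asm!("out dx, al", in("al") byte, in("dx") COM1);
}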

Phase 1 — Kernel basics

Goal: A kernel that can manage memory and schedule tasks.

Deliverables:

  • Physical frame allocator (bitmap-based)
  • Virtual memory manager (page table setup, higher-half kernel)
  • Kernel heap allocator
  • Interrupt handling (IDT on x86_64, GIC on aarch64)
  • Timer-driven preemptive scheduler (round-robin)
  • Basic kernel shell over serial (for debugging, will be removed later)
  • aarch64 port of Phase 0 + Phase 1

Phase 2 — Isolation

Goal: Implement the capability system and process isolation.

Deliverables:

  • Capability types and capability space (CSpace)
  • Process creation with isolated address spaces
  • Capability grant, transfer, and revocation
  • Synchronous IPC (message passing between processes)
  • First userspace process (init)

Phase 3 — Storage

Goal: Read from a virtual disk and present a filesystem.

Deliverables:

  • VirtIO block driver (for QEMU’s virtual disk)
  • Read-only filesystem (simple format, possibly custom or FAT)
  • RAM-backed ephemeral writable layer
  • Immutable root image creation tooling

Phase 4 — Networking

Goal: TCP/IP connectivity.

Deliverables:

  • VirtIO network driver
  • Ethernet, ARP, IPv4
  • TCP and UDP
  • Basic socket-like API via capabilities
  • DHCP client

Phase 5 — Containers

Goal: Run OCI container images.

Deliverables:

  • OCI image format parsing and layer unpacking
  • Container lifecycle (create, start, stop, destroy)
  • Container-to-capability mapping (each container gets a capability set)
  • Container networking (virtual interfaces, isolation)
  • Log streaming from containers

Phase 6 — Management

Goal: External API for managing the node.

Deliverables:

  • HTTP or gRPC management API
  • Container management endpoints (create, start, stop, list, logs)
  • Node status and health reporting
  • Configuration injection at boot time
  • No SSH — API is the only interface

Future — Kubernetes

Goal: Serve as a K8s/K3s worker node.

Deliverables (rough):

  • CRI-compatible container runtime
  • kubelet (or custom equivalent)
  • CNI plugin support
  • Node registration with K8s control plane
  • Pod lifecycle management

This phase is explicitly not v1 and will be scoped in detail after Phase 6 is complete.