Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Architecture Overview

ThemeliOS is a capability-based microkernel. This page explains the high-level design and the reasoning behind key architectural decisions.

Microkernel vs monolithic

In a monolithic kernel (like Linux), drivers, filesystems, and networking all run inside the kernel with full hardware access. A bug in any driver can crash or compromise the entire system.

In a microkernel, only the absolute minimum runs in kernel space:

Kernel spaceUserspace
Memory managementDevice drivers
Process schedulingFilesystem
IPC (message passing)Network stack
Capability enforcementContainer runtime
Management API

Everything else runs as isolated userspace processes that communicate via IPC. A buggy driver crashes its own process, not the kernel.

Why microkernel for ThemeliOS? Since we’re building an OS specifically for running untrusted container workloads, minimizing the trusted computing base (the code that can compromise the whole system) is critical. The smaller the kernel, the smaller the attack surface.

Capability-based security

ThemeliOS does not use Linux-style permissions (UID/GID, filesystem permissions) or Linux-style isolation (namespaces, cgroups). Instead, it uses capabilities.

What is a capability?

A capability is an unforgeable token that grants its holder specific permissions on a specific resource. For example:

  • “Read and write to memory region 0x1000–0x2000”
  • “Send messages to IPC endpoint #42”
  • “Access VirtIO block device at MMIO address 0xFE00”

Key properties

  1. No ambient authority: A newly created process has zero capabilities. It can’t do anything until its parent grants it capabilities.

  2. Unforgeable: Capabilities are managed by the kernel. Userspace can’t create them or guess valid ones.

  3. Transferable: Capabilities can be passed between processes via IPC, enabling controlled delegation.

  4. Revocable: A capability can be revoked, immediately cutting off access.

Why not namespaces?

Linux namespaces are “isolation after the fact” — processes start with broad access and namespaces restrict what they can see. Capabilities are “isolation by default” — processes start with nothing and are explicitly granted only what they need.

For a container OS, this means a compromised container literally cannot access resources it wasn’t given capabilities for. There’s no kernel interface to probe, no /proc to read, no syscall to escalate through — the authority simply doesn’t exist.

Inspiration

  • seL4: Formally verified capability microkernel. ThemeliOS borrows its capability model.
  • Fuchsia/Zircon: Google’s capability-based OS. Demonstrates the model works at scale.

Memory model

ThemeliOS uses hardware-enforced memory isolation:

  • Each process runs in its own virtual address space (page tables enforced by the MMU).
  • The kernel has its own address space that userspace cannot access.
  • Shared memory between processes requires explicit capabilities from both sides.

Physical memory management

A frame allocator tracks free physical memory pages (4 KiB). Frames are allocated to:

  • Process page tables
  • Kernel heap
  • Shared memory regions
  • DMA buffers for device drivers

Virtual memory layout

The virtual address space layout will be defined per-architecture, but the general structure is:

0x0000_0000_0000_0000  ┌──────────────────────┐
                        │   Userspace           │
                        │   (per-process)       │
0x0000_7FFF_FFFF_FFFF  └──────────────────────┘
                        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
                          Non-canonical hole
                        └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
0xFFFF_8000_0000_0000  ┌──────────────────────┐
                        │   Kernel space        │
                        │   (shared, all procs) │
0xFFFF_FFFF_FFFF_FFFF  └──────────────────────┘

(This is the x86_64 layout; aarch64 is similar but with different conventions.)

IPC

Inter-process communication is the backbone of the microkernel. Since drivers, filesystems, and networking all run in userspace, every system operation involves IPC.

Synchronous message passing

The primary mechanism: a client sends a message to a server and blocks until it gets a reply. This is used for request/response patterns like “read this file” or “send this network packet.”

Performance consideration

IPC overhead is the classic criticism of microkernels. ThemeliOS will address this by:

  • Keeping messages small (pointers to shared memory for bulk data)
  • Using register-based fast-path for small messages
  • Careful cache-aware scheduling of communicating processes

Immutability

The OS root filesystem is read-only. The entire OS image is a single artifact that is booted as-is.

  • Updates: Swap the entire image. No package managers, no apt-get, no partial updates.
  • Configuration: Injected at boot time via cloud-init-style metadata or the management API.
  • Ephemeral state: Container images and runtime state live on a RAM-backed ephemeral layer that is lost on reboot.

This model treats nodes as cattle: if a node is unhealthy, replace it with a fresh one. No debugging on the node, no SSHing in, no manual fixes.

Target platforms

ThemeliOS is designed to run as a virtual machine, with bare-metal support as a secondary goal.

PlatformStatusNotes
QEMU/KVM (x86_64)Primary dev targetUsed for all development and testing
QEMU (aarch64)Secondary dev targetARM64 support
AWS (EC2)FutureNitro hypervisor
GCP (Compute Engine)FutureKVM-based
Azure (VMs)FutureHyper-V
Bare metal (headless)FutureServer hardware, no GPU/display