SMMU Notes: Background, Internals, and Practical Use
> Source note: this article is based on the Linux IOMMU documentation and the arm-smmu-v3 driver implementation.
Why SMMU exists
On modern SoCs and server platforms, devices such as NICs, GPUs, NPUs, NVMe controllers, and DMA engines access memory directly through DMA.
That creates three immediate problems:
- without address isolation, a device may access memory it should never touch;
- in virtualization, device-visible addresses must be translated into the right guest or host memory regions;
- once many devices are active at the same time, DMA address management becomes too complex to do by hand.
An ARM SMMU is essentially the ARM-world equivalent of an IOMMU:
- it translates DMA addresses, usually from IOVA to physical memory;
- it applies permissions and isolation;
- it enables advanced features such as virtualization and shared virtual address spaces.
In one sentence: the SMMU pulls device DMA into a controlled memory-management model instead of leaving it as raw physical access.
Core concepts
Stream IDs and stream tables
When a request enters the SMMU it usually carries a Stream ID, or SID. The SID is used to find a Stream Table Entry that describes how the request should be handled.
That entry decides:
- which page tables are used;
- whether translation runs in Stage-1, Stage-2, both, or bypass mode;
- which permission checks apply.
You can think of it as the device-side entry point into an address-space context.
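The SID-indexed lookup described above can be sketched as a tiny model. Everything here is illustrative: the struct layout, field names, and the flat table are simplifications, not the SMMUv3 stream table format (which also supports a two-level layout for sparse SID spaces).

```c
#include <stddef.h>
#include <stdint.h>

/* Toy Stream Table Entry: which translation mode applies to a stream,
 * and where its page tables live. Purely illustrative field names. */
enum ste_mode { STE_BYPASS, STE_STAGE1, STE_STAGE2, STE_NESTED };

struct stream_table_entry {
    int valid;             /* has software configured this entry? */
    enum ste_mode mode;    /* which translation stages apply */
    uint64_t pgtable_base; /* root of the page tables to walk */
};

/* A flat (linear) stream table: the SID is a direct index. */
#define NUM_SIDS 64
static struct stream_table_entry stream_table[NUM_SIDS];

/* Look up the STE for an incoming transaction's Stream ID.
 * NULL models what hardware would report as a bad-SID or
 * unconfigured-STE event on its event queue. */
static struct stream_table_entry *ste_lookup(uint32_t sid)
{
    if (sid >= NUM_SIDS || !stream_table[sid].valid)
        return NULL;
    return &stream_table[sid];
}
```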
Context descriptors
In Stage-1 mode, the stream entry points to a Context Descriptor.
That descriptor carries the execution context for translation, including:
- page-table base information such as TTBR and TCR;
- ASID or VMID style identifiers;
- coherency and caching controls.
This is what makes per-process or per-address-space device access possible in more advanced flows such as SVA and PASID-like models.
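To make the per-address-space idea concrete, here is a sketch of a context-descriptor table selected by a PASID-like substream ID, so that each process gets its own page tables and ASID. The struct fields and table shape are assumptions for illustration, not the hardware CD format.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy Context Descriptor: one translation context per address space. */
struct context_descriptor {
    int valid;
    uint16_t asid;   /* tags TLB entries belonging to this address space */
    uint64_t ttbr;   /* page-table base, TTBR-like */
    uint32_t tcr;    /* translation control: granule size, region size, ... */
};

/* A stream's CD table, indexed by a PASID-like substream ID. */
#define NUM_CONTEXTS 16
static struct context_descriptor cd_table[NUM_CONTEXTS];

/* Pick the translation context for one process's DMA. */
static struct context_descriptor *cd_lookup(uint32_t substream_id)
{
    if (substream_id >= NUM_CONTEXTS || !cd_table[substream_id].valid)
        return NULL;
    return &cd_table[substream_id];
}
```

The key design point is the indirection: the stream entry selects a table of contexts rather than a single context, which is exactly what lets one device serve many address spaces.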
Translation stages
- Stage-1 maps a device-visible virtual address (an IOVA or VA) to an intermediate physical address (IPA), or directly to a physical address when Stage-2 is not in use.
- Stage-2 maps a guest-visible IPA to host physical memory.
- Nested translation chains both: the Stage-1 output is fed through Stage-2.
In virtualized systems, Stage-2 is especially important because it defines the security boundary that the hypervisor controls.
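The nested case is just function composition, which a toy model makes obvious. Here each stage is a fixed per-page offset purely for illustration; real stages are multi-level page-table walks.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Toy per-stage mappings: page frame number in -> page frame number out.
 * Stage-1 is guest-controlled, Stage-2 is hypervisor-controlled. */
static uint64_t stage1_map(uint64_t iova_pfn) { return iova_pfn + 0x100; }
static uint64_t stage2_map(uint64_t ipa_pfn)  { return ipa_pfn + 0x8000; }

/* Nested translation: IOVA -> IPA -> PA, page offset preserved. */
static uint64_t translate_nested(uint64_t iova)
{
    uint64_t offset  = iova & (PAGE_SIZE - 1);
    uint64_t ipa_pfn = stage1_map(iova >> PAGE_SHIFT);
    uint64_t pa_pfn  = stage2_map(ipa_pfn);
    return (pa_pfn << PAGE_SHIFT) | offset;
}
```

Note that the guest can only influence the first hop; whatever Stage-1 produces still has to pass through the hypervisor's Stage-2 tables, which is why Stage-2 defines the security boundary.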
Queue-based programming model
SMMUv3 moved heavily toward queues:
- CMDQ for commands from software;
- EVTQ for faults and events from hardware;
- PRIQ for page requests in advanced models.
This queue-based interface makes the software and hardware relationship cleaner and more scalable under concurrency.
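The producer/consumer idea behind CMDQ can be sketched as a simple ring buffer: software advances a producer index as it queues commands, hardware advances a consumer index as it executes them. Index and wrap handling are simplified here; the real queues use register-visible PROD/CONS values with a wrap flag.

```c
#include <stdint.h>

#define CMDQ_ENTRIES 8  /* must be a power of two */

struct cmdq {
    uint64_t entries[CMDQ_ENTRIES];
    uint32_t prod;   /* next slot software will write */
    uint32_t cons;   /* next slot "hardware" will read */
};

static int cmdq_full(const struct cmdq *q)
{
    return q->prod - q->cons == CMDQ_ENTRIES;
}

static int cmdq_empty(const struct cmdq *q)
{
    return q->prod == q->cons;
}

/* Software side: enqueue a command; caller must retry when full. */
static int cmdq_push(struct cmdq *q, uint64_t cmd)
{
    if (cmdq_full(q))
        return -1;
    q->entries[q->prod % CMDQ_ENTRIES] = cmd;
    q->prod++;       /* real driver: update the PROD register */
    return 0;
}

/* "Hardware" side: consume one command in order. */
static int cmdq_pop(struct cmdq *q, uint64_t *cmd)
{
    if (cmdq_empty(q))
        return -1;
    *cmd = q->entries[q->cons % CMDQ_ENTRIES];
    q->cons++;
    return 0;
}
```

The same shape runs in the other direction for EVTQ and PRIQ, with hardware producing and software consuming.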
The Linux view
In Linux, the main SMMUv3 implementation lives in:
- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
The driver work can be summarized like this:
1. discover the SMMU and device topology through ACPI/IORT or Device Tree;
2. build and manage iommu_domain objects;
3. maintain mappings through the generic IOMMU framework;
4. issue invalidation and control commands through CMDQ;
5. process faults and recovery through EVTQ and PRIQ.
The runtime path looks roughly like:
device issues DMA with an IOVA
-> SMMU finds the matching stream entry
-> page tables are walked and permissions checked
-> a physical address is produced
-> memory is accessed
-> faults are reported through event queues if anything fails
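The runtime path above can be condensed into one toy function: find the mapping for the IOVA's page, check permissions, and either produce a physical address or record a fault where the real hardware would post an EVTQ entry. The linear "page table" and fault counter are stand-ins for illustration only.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PERM_R 1
#define PERM_W 2

struct mapping {
    uint64_t iova_pfn, pa_pfn;
    int perms;
    int valid;
};

/* Toy "page table": a small array searched linearly. */
static struct mapping maps[4];
static int fault_count;   /* stands in for the event queue */

/* Translate one DMA access; 0 on success, -1 on fault. */
static int dma_translate(uint64_t iova, int access, uint64_t *pa)
{
    uint64_t pfn = iova >> PAGE_SHIFT;
    for (int i = 0; i < 4; i++) {
        if (!maps[i].valid || maps[i].iova_pfn != pfn)
            continue;
        if ((maps[i].perms & access) != access) {
            fault_count++;   /* permission fault -> event queue */
            return -1;
        }
        *pa = (maps[i].pa_pfn << PAGE_SHIFT)
            | (iova & ((1ULL << PAGE_SHIFT) - 1));
        return 0;
    }
    fault_count++;           /* translation fault -> event queue */
    return -1;
}
```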
Why invalidation matters so much
When mappings change, stale translations inside the SMMU can survive unless the relevant TLB or cache state is invalidated.
If invalidation is wrong or late, the result can be surprisingly painful:
- DMA continues to hit an old physical address;
- permission updates do not take effect when expected;
- virtualized workloads show intermittent corruption or cross-boundary access.
That is why "mapping update plus correct invalidation ordering" is one of the most important practical rules in SMMU work.
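A one-entry "TLB" is enough to show why the ordering matters: once a translation is cached, clearing the page table alone changes nothing until the cached copy is invalidated. The names are illustrative; in the real driver the invalidate step is a CMDQ invalidation command followed by a sync.

```c
#include <stdint.h>

/* One toy translation plus a one-entry "TLB" caching it. */
static uint64_t pte;   /* 0 means not mapped */
static uint64_t tlb;   /* cached copy; 0 means nothing cached */

static uint64_t translate(void)
{
    if (tlb)
        return tlb;    /* cache hit: the page table is NOT consulted */
    tlb = pte;         /* miss: walk the table and cache the result */
    return tlb;
}

/* Models an invalidation command plus completion sync. */
static void tlb_invalidate(void) { tlb = 0; }

/* WRONG teardown: the stale cached entry keeps translating. */
static uint64_t unmap_without_invalidate(void)
{
    pte = 0;
    return translate();   /* still returns the old address */
}

/* RIGHT teardown: invalidate before the old memory can be reused. */
static uint64_t unmap_with_invalidate(void)
{
    pte = 0;
    tlb_invalidate();
    return translate();   /* 0: a clean translation fault, as desired */
}
```

The wrong variant is exactly the "DMA continues to hit an old physical address" failure listed above, which is why unmap paths must not return or reuse the memory until the invalidation has completed.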
Common fault patterns
Typical causes include:
- missing mappings;
- permission mismatches;
- broken invalidation ordering;
- stream IDs that are wired or configured incorrectly;
- virtualization stage mismatches.
When debugging, the fastest path is usually to correlate:
- the fault record itself;
- the active domain and page-table state;
- the device that issued the transaction;
- the most recent mapping or unmapping operations.
Practical use cases
Virtualization
SMMU is a key part of secure device assignment and DMA isolation for guests. Without it, device passthrough becomes much harder to trust.
Multi-tenant acceleration
When accelerators such as GPUs or NPUs are shared between workloads, the SMMU helps ensure that one job cannot read or overwrite another job's memory.
Safe DMA in general-purpose systems
Even outside virtualization, isolating DMA through an IOMMU-style mechanism reduces the blast radius of device or driver bugs.
Shared virtual addressing
In more advanced user-space driven systems, the SMMU becomes part of the path that lets devices participate in process-oriented address spaces.
Engineering advice
- treat invalidation as a correctness issue before treating it as a performance issue;
- map the device topology clearly so you know which stream IDs hit which domains;
- debug faults with full path context rather than only decoding the fault word;
- benchmark carefully when enabling advanced features such as nested translation or SVA;
- remember that most drivers should use DMA and IOMMU abstractions instead of directly thinking in terms of SMMU registers.
Closing thought
SMMU is often invisible when everything works, but it becomes central the moment you care about isolation, virtualization, or reliable DMA at scale.
Understanding its queue model, translation stages, and invalidation rules goes a long way toward making low-level platform debugging much less mysterious.