Linux Performance Tuning in Practice: From Disk and Memory to TLB and Cache
> Source note: distilled from production tuning work and the Linux tooling that is most useful under real load.
Why so many optimizations fail
Performance problems in Linux are rarely isolated to one component. They are usually chain problems.
A few familiar mistakes:
- seeing high `iowait` and tuning only the disk;
- seeing high CPU and assuming more cores will fix everything;
- seeing latency spikes and tuning only the thread pool.
That is how teams end up making one layer faster while the whole system becomes less stable.
Start with the full path
The useful mental model is:
disk -> page cache -> memory reclaim -> page tables / TLB -> CPU cache -> application latency
Read path
application reads through read() or mmap()
-> page cache lookup
-> hit: mostly memory bandwidth and CPU cache behavior
-> miss: block IO to disk or NVMe
-> page-table state must exist
-> CPU address translation is helped by the TLB
-> data is finally served through L1/L2/L3 cache or DRAM
Write path
application writes
-> data lands in page cache as dirty pages
-> background writeback flushes it later
-> if dirty limits are exceeded, foreground threads may be throttled or blocked
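To make the write path concrete, here is a minimal sketch from user space, assuming a throwaway file path: write() only dirties page cache and returns quickly, while fdatasync() is the point where the storage path's latency actually becomes visible to the caller.

```cpp
// Minimal sketch of the write path from user space (file path is illustrative).
// write() only dirties page cache; fdatasync() forces writeback to the device.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/tmp/writeback-demo.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char line[] = "event payload\n";
    for (int i = 0; i < 100000; ++i) {
        // Lands in the page cache as dirty pages; returns long before any disk IO.
        if (write(fd, line, sizeof(line) - 1) < 0) { perror("write"); return 1; }
    }

    // Forces the dirty pages for this file to be written back; this is where
    // storage latency becomes visible to the caller.
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }
    close(fd);
    return 0;
}
```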
Once you see the chain, a lot of symptoms become easier to interpret:
- page-cache pressure increases disk traffic;
- writeback pressure creates latency spikes even when bandwidth is not maxed out;
- page faults increase translation and memory overhead;
- TLB misses and cache misses burn cycles without doing useful work.
Memory is the center of stability
Know what your memory is doing
At a high level Linux memory usually falls into three buckets:
- anonymous memory: heap, stack, process-private memory;
- file-backed memory: page cache and file mappings;
- kernel memory: slabs, socket buffers, and kernel-side structures.
High memory usage is not automatically a problem. High page cache is often healthy. What matters is whether the system starts reclaiming aggressively, swapping, or paying major fault costs.
Metrics that matter
Useful global commands:
vmstat 1
cat /proc/meminfo
sar -B 1
sar -W 1
Useful process commands:
pidstat -r -p <pid> 1
cat /proc/<pid>/status
cat /proc/<pid>/smaps_rollup
What to watch:
- `si`/`so` staying non-zero means swap churn is real;
- `pgmajfault` means a process is paying a very expensive miss;
- `pgscan` and `pgsteal` jumping means reclaim is active;
- `allocstall` often points to direct reclaim hitting foreground work.
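A small sketch of how these counters can be watched programmatically, assuming a 5-second sampling window over /proc/vmstat; exact counter names vary by kernel version (for example `allocstall` versus `allocstall_*`), so this matches on prefixes rather than exact names.

```cpp
// Sketch: report reclaim-related counter deltas from /proc/vmstat between two samples.
#include <chrono>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <thread>

std::map<std::string, long long> read_vmstat() {
    std::map<std::string, long long> out;
    std::ifstream f("/proc/vmstat");
    std::string key;
    long long value;
    while (f >> key >> value) out[key] = value;
    return out;
}

int main() {
    const char* prefixes[] = {"pgmajfault", "pgscan", "pgsteal", "allocstall"};
    auto before = read_vmstat();
    std::this_thread::sleep_for(std::chrono::seconds(5));
    auto after = read_vmstat();
    for (const auto& [key, value] : after) {
        for (const char* p : prefixes) {
            if (key.rfind(p, 0) == 0) {  // key starts with one of the prefixes
                std::cout << key << " +" << (value - before[key]) << " in 5s\n";
            }
        }
    }
    return 0;
}
```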
kswapd versus direct reclaim
These two paths feel very different in production:
- `kswapd` reclaims in the background and is usually less disruptive;
- direct reclaim makes the application thread participate in reclaim, which is exactly how surprising P99 spikes appear.
That is why occasional long-tail pauses under stable traffic often lead back to reclaim pressure rather than pure CPU shortage.
Swap is not "always bad", but it is often bad for latency
vm.swappiness has to match the workload:
- low-latency online services usually want it low, often around 1-10;
- batch workloads may tolerate a higher value.
The important point is that this should be tested under the real workload, not chosen by habit.
Dirty pages and writeback
For write-heavy systems, dirty-page control directly affects tail latency.
Important knobs:
- `vm.dirty_background_ratio` or `vm.dirty_background_bytes`
- `vm.dirty_ratio` or `vm.dirty_bytes`
- `vm.dirty_expire_centisecs`
- `vm.dirty_writeback_centisecs`
The intuition:
- the background threshold starts flushing in the background;
- the hard dirty threshold can throttle or block writers.
If a service is usually fast but occasionally stalls during bursts, dirty-page waves are a strong suspect.
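One application-side way to smooth those waves is to kick writeback for finished ranges during the burst instead of letting the `vm.dirty_*` limits throttle the writer. A minimal sketch, assuming an illustrative file and chunk sizes, using the Linux-specific sync_file_range call:

```cpp
// Sketch: bound the dirty backlog from a write burst by starting writeback
// for completed ranges instead of accumulating dirty pages until the hard limit.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    int fd = open("/tmp/burst-demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    std::vector<char> chunk(1 << 20, 'x');   // 1 MiB per write, illustrative
    const long chunk_count = 512;            // ~512 MiB burst, illustrative
    off_t flushed = 0;

    for (long i = 0; i < chunk_count; ++i) {
        if (write(fd, chunk.data(), chunk.size()) < 0) { perror("write"); return 1; }
        off_t written = (i + 1) * (off_t)chunk.size();

        // Every 64 MiB, ask the kernel to start writeback for the finished range
        // so dirty pages do not pile up until vm.dirty_* limits throttle this thread.
        if (written - flushed >= (64 << 20)) {
            sync_file_range(fd, flushed, written - flushed, SYNC_FILE_RANGE_WRITE);
            flushed = written;
        }
    }
    fdatasync(fd);   // final durability point
    close(fd);
    return 0;
}
```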
Huge pages and the TLB
The TLB caches address translations.
- with 4 KB pages, large memory regions require many more TLB entries;
- with larger pages such as 2 MB, the same TLB can cover far more memory.
That can reduce TLB miss cost, but only when the workload actually benefits from it. Transparent Huge Pages help some sequential or large-working-set systems a lot, while they hurt others through compaction or fragmentation side effects.
The practical rule is simple: test the THP modes (`always`, `madvise`, `never`) against P99, not just average throughput.
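When the `madvise` mode wins, the application opts specific hot regions into huge pages rather than the whole system. A minimal sketch, assuming an anonymous 1 GiB hot region:

```cpp
// Sketch: opt one hot region into transparent huge pages with madvise,
// instead of enabling THP "always" system-wide. Region size is illustrative.
#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t size = 1UL << 30;  // 1 GiB hot region, assumed working set
    void* region = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    // Only effective when /sys/kernel/mm/transparent_hugepage/enabled is set to
    // "madvise" (or "always"); otherwise this hint is a no-op.
    if (madvise(region, size, MADV_HUGEPAGE) != 0) perror("madvise");

    memset(region, 0, size);  // touch the region so pages are actually allocated
    // ... hot data structures live here ...
    munmap(region, size);
    return 0;
}
```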
NUMA locality matters
On multi-socket systems, running threads on one NUMA node while their memory lives on another silently wastes performance.
Useful commands:
numactl --hardware
numastat -p <pid>
What usually helps:
- pin important threads;
- keep memory local when possible;
- design sharded services so compute and memory stay together.
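A minimal sketch of the last point, assuming libnuma is available (link with -lnuma) and that node 0 is the right home for this shard; a real service would pick the node per shard or per worker.

```cpp
// Sketch: keep a worker's compute and memory on the same NUMA node with libnuma.
#include <numa.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    const int node = 0;                 // assumption: this shard belongs on node 0
    const size_t size = 256UL << 20;    // 256 MiB shard, illustrative

    numa_run_on_node(node);                       // schedule this thread on node 0
    void* shard = numa_alloc_onnode(size, node);  // allocate its memory on node 0
    if (!shard) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }

    memset(shard, 0, size);  // first touch also lands on the chosen node
    // ... run the shard's work here, with compute and memory co-located ...
    numa_free(shard, size);
    return 0;
}
```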
Containers change the failure mode
In containerized environments, it is common to see a host with free memory while the service still gets killed. That usually means the cgroup limit is the real boundary.
What to check:
- `memory.current` versus `memory.max`;
- `memory.events` and its `oom_kill` counter;
- whether page cache and anonymous memory are fighting inside the same limit.
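A small sketch of that check from inside the container, assuming a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup; in other setups the cgroup path would need to be resolved from /proc/self/cgroup first.

```cpp
// Sketch: read a service's cgroup v2 memory state (assumed mount point).
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

std::string read_file(const std::string& path) {
    std::ifstream f(path);
    return std::string((std::istreambuf_iterator<char>(f)),
                       std::istreambuf_iterator<char>());
}

int main() {
    const std::string base = "/sys/fs/cgroup";  // assumed cgroup v2 mount point
    std::cout << "memory.current: " << read_file(base + "/memory.current");
    std::cout << "memory.max:     " << read_file(base + "/memory.max");
    // memory.events includes an oom_kill counter; a rising value here explains
    // "the host has free memory but the service still got killed".
    std::cout << "memory.events:\n" << read_file(base + "/memory.events");
    return 0;
}
```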
TLB and CPU cache are often the real CPU story
High CPU usage does not always mean "more computation". It often means bad memory behavior.
TLB misses
Every virtual address access needs translation.
- on a TLB hit, that is cheap;
- on a TLB miss, page-table walks add extra memory work.
Random access across a working set that is too large for effective TLB coverage can degrade performance hard even when the code itself is simple.
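A micro-benchmark sketch of that effect, with an assumed working set of about 1 GiB: the same number of reads, executed sequentially or in shuffled order, behaves very differently once perf stat is pointed at the dTLB and cache miss counters.

```cpp
// Sketch: sequential vs random access over a working set far larger than TLB coverage.
// Run once with no arguments (sequential) and once with any argument (random),
// under perf stat, and compare dTLB-load-misses and cache-misses.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main(int argc, char**) {
    const size_t n = (1UL << 30) / sizeof(uint64_t);  // ~1 GiB working set, assumed
    std::vector<uint64_t> data(n, 1);

    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    bool random_order = (argc > 1);
    if (random_order) std::shuffle(order.begin(), order.end(), std::mt19937_64{42});

    auto start = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (size_t i : order) sum += data[i];   // same work, very different locality
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();

    printf("%s sum=%llu took %lld ms\n", random_order ? "random" : "sequential",
           (unsigned long long)sum, (long long)ms);
    return 0;
}
```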
CPU cache misses
Latency climbs sharply as accesses move from L1 to L2 to L3 and finally to DRAM.
Common causes:
- pointer-heavy data structures;
- large hot objects that spill across cache lines;
- weak locality;
- false sharing between threads.
Useful observation:
perf stat -e cycles,instructions,cache-references,cache-misses,dTLB-load-misses,iTLB-load-misses -p <pid> -- sleep 30
That tells you whether the workload is compute-bound or memory-bound.
Disk tuning only makes sense in context
Read the right metrics
With iostat -x, the most useful fields are usually:
- `r/s` and `w/s`
- `rkB/s` and `wkB/s`
- `await`
- `avgqu-sz`
%util by itself is not enough, especially on modern NVMe devices.
Page-cache hit rate is a performance multiplier
If random reads dominate and the page cache cannot hold the working set, read latency falls back to storage behavior.
That means optimization may involve:
- keeping hot data resident longer;
- changing access patterns;
- adding application-level caching where appropriate.
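One application-level option for the first point is to hint the kernel about the hot range ahead of demand. A minimal sketch, with an assumed path and range; posix_fadvise is only a hint, not a guarantee.

```cpp
// Sketch: ask the kernel to pull a file's hot range into the page cache
// before the read path needs it. Path and range are illustrative assumptions.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/data/hot-index.db", O_RDONLY);   // illustrative path
    if (fd < 0) { perror("open"); return 1; }

    const off_t offset = 0;
    const off_t length = 256L << 20;   // first 256 MiB assumed to be the hot range

    // Prefetch the range into the page cache so later random reads hit memory
    // instead of falling through to storage.
    int rc = posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
    if (rc != 0) fprintf(stderr, "posix_fadvise failed: %d\n", rc);

    close(fd);
    return 0;
}
```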
Write bursts create deceptive failures
Under bursty writes:
- dirty pages accumulate;
- writeback falls behind;
- foreground threads get dragged into the problem.
The system may feel slow long before the disk appears fully saturated.
Concrete optimization techniques
Data layout and code
- keep hot fields close together and move cold fields away from the hot path;
- reduce pointer chasing when arrays or compact layouts can do the job;
- batch work where possible to improve locality and reduce overhead;
- avoid false sharing between threads that update different values on the same cache line;
- use object pools or arenas when allocator churn becomes measurable.
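As one illustration of the last item, a simple bump arena trades per-object frees for a single reset per request or batch. This is a sketch under those assumptions, not a production allocator.

```cpp
// Sketch of a bump arena: short-lived allocations come from one contiguous
// block (good locality), and everything is released with a single reset.
#include <cstddef>
#include <vector>

class Arena {
public:
    explicit Arena(size_t capacity) : buffer_(capacity), offset_(0) {}

    // Hands out aligned chunks; returns nullptr when the arena is exhausted.
    void* allocate(size_t size, size_t alignment = alignof(std::max_align_t)) {
        size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned + size > buffer_.size()) return nullptr;
        offset_ = aligned + size;
        return buffer_.data() + aligned;
    }

    // Called once per request or batch instead of freeing objects one by one.
    void reset() { offset_ = 0; }

private:
    std::vector<std::byte> buffer_;
    size_t offset_;
};
```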
Cache-line alignment
Aligning a structure to a cache line can help, but only in the right cases.
It is usually worth it when:
- multiple threads frequently write to independent counters or queue metadata;
- false sharing is the real bottleneck.
It is usually not worth it when:
- the object is mostly read-only;
- alignment inflates memory usage so much that cache efficiency or TLB pressure gets worse overall.
Example:
```cpp
#include <atomic>  // needed for std::atomic

// Each shard's counter occupies its own cache line, so hot writers never share a line.
struct alignas(64) ShardCounter {
    std::atomic<unsigned long long> value;
    char pad[64 - sizeof(std::atomic<unsigned long long>)];
};
```
The gain here does not come from magic alignment by itself. It comes from preventing unrelated hot writes from bouncing the same cache line across cores.
CPU and scheduling
- keep thread count proportional to the number of cores;
- reduce lock hold time before trying fancy lock-free structures;
- pin critical threads if migration is hurting locality;
- separate interrupts and critical worker threads if they compete on the same cores.
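A minimal sketch of pinning, using the Linux-specific pthread_setaffinity_np (compile with -pthread); the core id is an assumption that would normally come from the service's core map.

```cpp
// Sketch: pin the current thread to one core so it stops migrating and keeps
// its working set warm in that core's private caches.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return false;
    }
    return true;
}

int main() {
    if (pin_current_thread(2)) {   // core 2 picked purely for illustration
        // ... run the latency-critical loop here ...
    }
    return 0;
}
```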
Memory and reclaim
- keep enough headroom for both anonymous memory and page cache;
- tune dirty-page limits for bursty write patterns;
- benchmark THP choices instead of copying defaults from another system;
- monitor reclaim counters continuously if the service is latency-sensitive.
Storage
- convert random IO to more sequential IO where possible;
- batch logs and reduce fsync pressure;
- split heavy write paths from read-sensitive paths when practical;
- choose the scheduler based on the actual storage device rather than habit.
Networking
- tune backlog and socket buffers with the actual traffic pattern in mind;
- distribute IRQ and softirq load sanely;
- cut copies and serialization work where possible because that load eventually becomes CPU and cache pressure too.
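A small sketch of explicit socket buffer sizing; the 4 MiB figure is an illustrative assumption, and the effective value is still capped by net.core.rmem_max.

```cpp
// Sketch: size a socket's receive buffer explicitly instead of relying on defaults.
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int rcvbuf = 4 * 1024 * 1024;  // illustrative: 4 MiB for a high-throughput peer
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) != 0) {
        perror("setsockopt SO_RCVBUF");
    }

    int effective = 0;
    socklen_t len = sizeof(effective);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len);
    // The kernel typically reports about twice the requested value to account
    // for bookkeeping overhead.
    printf("effective receive buffer: %d bytes\n", effective);

    close(fd);
    return 0;
}
```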
A repeatable debugging workflow
1. identify the user-facing symptom: throughput loss, P99 growth, error-rate shift;
2. split the problem into CPU, memory, disk, and network layers;
3. inspect the chain: iostat, vmstat, /proc/vmstat, perf stat;
4. build a causal story instead of a metric scrapbook;
5. change one class of variables at a time;
6. validate under load and keep the rollback path ready.
Example patterns:
- rising `pgmajfault` together with higher `await` usually means page-cache misses are falling through to disk;
- rising `allocstall` together with long-tail spikes often means direct reclaim is hurting foreground latency;
- rising cache misses together with TLB misses usually means locality or working-set shape got worse.
Final thought
The most useful Linux performance work is not about memorizing kernel parameters. It is about seeing the full chain, measuring where time is actually spent, and making the system more predictable rather than just faster in one benchmark.