Linux Performance Tuning in Practice: From Disk and Memory to TLB and Cache

> Source note: distilled from production tuning work and the Linux tooling that is most useful under real load.

Why so many optimizations fail

Performance problems in Linux are rarely isolated to one component. They are usually chain problems.

A few familiar mistakes:

- tuning one layer in isolation, such as raising disk throughput while reclaim is the real bottleneck
- copying sysctl values from a blog post without measuring the workload they were tuned for
- optimizing average throughput while P99 latency quietly gets worse

That is how teams end up making one layer faster while the whole system becomes less stable.

Start with the full path

The useful mental model is:


disk -> page cache -> memory reclaim -> page tables / TLB -> CPU cache -> application latency

Read path


application reads through read() or mmap()
  -> page cache lookup
     -> hit: mostly memory bandwidth and CPU cache behavior
     -> miss: block IO to disk or NVMe
  -> page-table state must exist
  -> CPU address translation is helped by the TLB
  -> data is finally served through L1/L2/L3 cache or DRAM

Write path


application writes
  -> data lands in page cache as dirty pages
  -> background writeback flushes it later
  -> if dirty limits are exceeded, foreground threads may be throttled or blocked

Once you see the chain, a lot of symptoms become easier to interpret:

- iowait spikes can start as cache misses that cascade into disk reads
- "high CPU" can really be stalls on TLB and cache misses rather than extra computation
- sudden write stalls are often dirty-page throttling, not a slow disk

Memory is the center of stability

Know what your memory is doing

At a high level, Linux memory usually falls into three buckets:

- anonymous memory: application heaps and stacks
- page cache: file-backed pages kept around to speed up IO
- kernel memory: slab allocations, page tables, and other bookkeeping

High memory usage is not automatically a problem. High page cache is often healthy. What matters is whether the system starts reclaiming aggressively, swapping, or paying major fault costs.

Metrics that matter

Useful global commands:


vmstat 1
cat /proc/meminfo
sar -B 1
sar -W 1

Useful process commands:


pidstat -r -p <pid> 1
cat /proc/<pid>/status
cat /proc/<pid>/smaps_rollup

What to watch:

- pgscank/s versus pgscand/s in sar -B: kswapd scanning versus direct reclaim
- majflt/s in pidstat -r: major faults mean the process is waiting on disk
- si and so in vmstat: any sustained swap-in or swap-out under load
- VmRSS and VmSwap in /proc/<pid>/status

kswapd versus direct reclaim

These two paths feel very different in production:

- kswapd reclaims asynchronously in the background; applications keep running while it works
- direct reclaim runs inside the allocating thread itself: the allocation stalls until enough pages have been freed

That is why occasional long-tail pauses under stable traffic often lead back to reclaim pressure rather than pure CPU shortage.

Swap is not "always bad", but it is often bad for latency

vm.swappiness has to match the workload:

- latency-sensitive services often run well with low values (1-10), keeping anonymous pages resident at the cost of page cache
- batch and throughput-oriented workloads can tolerate the default (60) or higher, letting cold anonymous pages move out to swap

The important point is that this should be tested under the real workload, not chosen by habit.

Dirty pages and writeback

For write-heavy systems, dirty-page control directly affects tail latency.

Important knobs:

- vm.dirty_background_ratio / vm.dirty_background_bytes: where background writeback starts
- vm.dirty_ratio / vm.dirty_bytes: where foreground writers get throttled
- vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs: how old dirty data may get and how often the flusher wakes

The intuition:

- below the background threshold, writes run at pure memory speed
- past it, flusher threads write back asynchronously
- at the hard limit, the kernel throttles or blocks the writing threads themselves

If a service is usually fast but occasionally stalls during bursts, dirty-page waves are a strong suspect.

Huge pages and the TLB

The TLB caches address translations. With 4 KiB pages, even a large TLB covers only a few megabytes of address space; a 2 MiB huge page lets a single entry cover 512 times more.

That can reduce TLB miss cost, but only when the workload actually benefits. Transparent Huge Pages (THP) help some sequential or large-working-set systems a lot, while hurting others through compaction stalls or fragmentation side effects.

The practical rule is simple: test the THP modes (always, madvise, never) against P99, not just average throughput.

NUMA locality matters

On multi-socket systems, running threads on one NUMA node while their memory lives on another silently wastes performance.

Useful commands:


numactl --hardware
numastat -p <pid>

What usually helps:

- binding a process and its memory together: numactl --cpunodebind=0 --membind=0
- interleaving (numactl --interleave=all) for large shared structures accessed from every node
- watching numa_miss and numa_foreign in numastat for remote-allocation drift

Containers change the failure mode

In containerized environments, it is common to see a host with free memory while the service still gets killed. That usually means the cgroup limit is the real boundary.

What to check:

- the cgroup limit and usage: memory.max and memory.current (cgroup v2)
- memory.stat, to see how much of the usage is reclaimable page cache versus anonymous memory
- memory.events for oom_kill counts, and memory.pressure (PSI) for stall time

TLB and CPU cache are often the real CPU story

High CPU usage does not always mean "more computation". It often means bad memory behavior.

TLB misses

Every virtual address access needs translation.

Random access across a working set too large for effective TLB coverage can degrade performance sharply, even when the code itself is simple.

CPU cache misses

Latency climbs sharply as accesses move from L1 to L2 to L3 and finally to DRAM.

Common causes:

- pointer chasing through linked structures instead of contiguous arrays
- array-of-structs layouts that drag cold fields into every cache line
- false sharing: independent hot variables on the same cache line
- working sets that simply exceed L3

Useful observation:


perf stat -e cycles,instructions,cache-references,cache-misses,dTLB-load-misses,iTLB-load-misses -p <pid> -- sleep 30

That tells you whether the workload is compute-bound or memory-bound.

Disk tuning only makes sense in context

Read the right metrics

With iostat -x, the most useful fields are usually:

- r/s and w/s: the actual request rate
- r_await and w_await: average latency per request, including queue time
- aqu-sz: average queue depth
- %util: how often the device had work in flight

%util by itself is not enough, especially on modern NVMe devices, which serve many requests in parallel and can be far from saturated at "100%".

Page-cache hit rate is a performance multiplier

If random reads dominate and the page cache cannot hold the working set, read latency falls back to storage behavior.

That means optimization may involve:

- sizing memory so the hot working set actually fits in page cache
- reshaping access patterns so reads are more sequential
- advising the kernel with posix_fadvise or madvise before reaching for faster disks

Write bursts create deceptive failures

Under bursty writes:

- dirty pages accumulate faster than writeback can drain them
- the flusher wakes and competes with foreground reads for device bandwidth
- once dirty limits are hit, the writing threads themselves are throttled

The system may feel slow long before the disk appears fully saturated.

Concrete optimization techniques

Data layout and code

Cache-line alignment

Aligning a structure to a cache line can help, but only in the right cases.

It is usually worth it when:

- the structure holds a hot counter or lock written from many cores
- adjacent per-thread slots would otherwise share a cache line

It is usually not worth it when:

- the data is read-mostly or cold
- the padding multiplies the footprint of a large array for no measured gain

Example:


#include <atomic>

struct alignas(64) ShardCounter {
  std::atomic<unsigned long long> value;
  // alignas(64) already rounds sizeof(ShardCounter) up to 64 bytes,
  // so the explicit padding is redundant; it stays here to document intent.
  char pad[64 - sizeof(std::atomic<unsigned long long>)];
};

The gain here does not come from magic alignment by itself. It comes from preventing unrelated hot writes from bouncing the same cache line across cores.

CPU and scheduling

- pin latency-critical threads with taskset or cpuset cgroups to keep caches and the TLB warm
- keep a service's threads and its NIC interrupts on the same NUMA node

Memory and reclaim

- tune vm.swappiness and the dirty limits against the real workload, not by habit
- set cgroup memory limits with headroom for page cache, not just RSS

Storage

- pick the right IO scheduler: none or mq-deadline usually suits NVMe
- match readahead (blockdev --setra) to the access pattern: large for scans, small for random reads

Networking

- size accept backlogs (net.core.somaxconn) for connection bursts
- spread interrupt load across cores with RSS/RPS instead of leaving it all on CPU 0

A repeatable debugging workflow

1. identify the user-facing symptom: throughput loss, P99 growth, error-rate shift
2. split the problem into CPU, memory, disk, and network layers
3. inspect the chain: iostat, vmstat, /proc/vmstat, perf stat
4. build a causal story instead of a metric scrapbook
5. change one class of variables at a time
6. validate under load and keep the rollback path ready

Example patterns:

- P99 spikes with flat CPU: check sar -B for direct reclaim and /proc/vmstat for compaction activity
- periodic stalls on a write-heavy service: watch the Dirty and Writeback counters in /proc/meminfo during a burst
- "the disk got slow" after a deploy: first check whether the page-cache hit rate dropped

Final thought

The most useful Linux performance work is not about memorizing kernel parameters. It is about seeing the full chain, measuring where time is actually spent, and making the system more predictable rather than just faster in one benchmark.
