Linux Performance Tuning in Practice: From Disk and Memory to TLB and Cache
> Source note: distilled from production tuning work and the Linux tooling that is most useful under real load.
Why so many optimizations fail
Performance problems in Linux are rarely isolated to one component. They are usually chain problems.
A few familiar mistakes:
- seeing high `iowait` and tuning only the disk;
- seeing high CPU and assuming more cores will fix everything;
- seeing latency spikes and tuning only the thread pool.
That is how teams end up making one layer faster while the whole system becomes less stable.
Start with the full path
The useful mental model is:
disk -> page cache -> memory reclaim -> page tables / TLB -> CPU cache -> application latency
Read path
application reads through read() or mmap()
-> page cache lookup
-> hit: mostly memory bandwidth and CPU cache behavior
-> miss: block IO to disk or NVMe
-> page-table state must exist
-> CPU address translation is helped by the TLB
-> data is finally served through L1/L2/L3 cache or DRAM
Write path
application writes
-> data lands in page cache as dirty pages
-> background writeback flushes it later
-> if dirty limits are exceeded, foreground threads may be throttled or blocked
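To make the write path concrete, here is a minimal sketch from user space, assuming a throwaway file path: write() only dirties page cache and returns quickly, while fdatasync() is the point where the storage path's latency actually becomes visible to the caller.

```cpp
// Minimal sketch of the write path from user space (file path is illustrative).
// write() only dirties page cache; fdatasync() forces writeback to the device.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/tmp/writeback-demo.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char line[] = "event payload\n";
    for (int i = 0; i < 100000; ++i) {
        // Lands in the page cache as dirty pages; returns long before any disk IO.
        if (write(fd, line, sizeof(line) - 1) < 0) { perror("write"); return 1; }
    }

    // Forces the dirty pages for this file to be written back; this is where
    // storage latency becomes visible to the caller.
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }
    close(fd);
    return 0;
}
```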
Once you see the chain, a lot of symptoms become easier to interpret:
- page-cache pressure increases disk traffic;
- writeback pressure creates latency spikes even when bandwidth is not maxed out;
- page faults increase translation and memory overhead;
- TLB misses and cache misses burn cycles without doing useful work.
Memory is the center of stability
Know what your memory is doing
At a high level Linux memory usually falls into three buckets:
- anonymous memory: heap, stack, process-private memory;
- file-backed memory: page cache and file mappings;
- kernel memory: slabs, socket buffers, and kernel-side structures.
High memory usage is not automatically a problem. High page cache is often healthy. What matters is whether the system starts reclaiming aggressively, swapping, or paying major fault costs.
Metrics that matter
Useful global commands:
vmstat 1
cat /proc/meminfo
sar -B 1
sar -W 1
Useful process commands:
pidstat -r -p <pid> 1
cat /proc/<pid>/status
cat /proc/<pid>/smaps_rollup
What to watch:
- `si`/`so` staying non-zero means swap churn is real;
- `pgmajfault` means a process is paying a very expensive miss;
- `pgscan` and `pgsteal` jumping means reclaim is active;
- `allocstall` often points to direct reclaim hitting foreground work.
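A small sketch of how these counters can be watched programmatically, assuming a 5-second sampling window over /proc/vmstat; exact counter names vary by kernel version (for example `allocstall` versus `allocstall_*`), so this matches on prefixes rather than exact names.

```cpp
// Sketch: report reclaim-related counter deltas from /proc/vmstat between two samples.
#include <chrono>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <thread>

std::map<std::string, long long> read_vmstat() {
    std::map<std::string, long long> out;
    std::ifstream f("/proc/vmstat");
    std::string key;
    long long value;
    while (f >> key >> value) out[key] = value;
    return out;
}

int main() {
    const char* prefixes[] = {"pgmajfault", "pgscan", "pgsteal", "allocstall"};
    auto before = read_vmstat();
    std::this_thread::sleep_for(std::chrono::seconds(5));
    auto after = read_vmstat();
    for (const auto& [key, value] : after) {
        for (const char* p : prefixes) {
            if (key.rfind(p, 0) == 0) {  // key starts with one of the prefixes
                std::cout << key << " +" << (value - before[key]) << " in 5s\n";
            }
        }
    }
    return 0;
}
```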
kswapd versus direct reclaim
These two paths feel very different in production:
- `kswapd` reclaims in the background and is usually less disruptive;
- direct reclaim makes the application thread participate in reclaim, which is exactly how surprising P99 spikes appear.
That is why occasional long-tail pauses under stable traffic often lead back to reclaim pressure rather than pure CPU shortage.
Swap is not "always bad", but it is often bad for latency
vm.swappiness has to match the workload:
- low-latency online services usually want it low, often around 1-10;
- batch workloads may tolerate a higher value.
The important point is that this should be tested under the real workload, not chosen by habit.
Dirty pages and writeback
For write-heavy systems, dirty-page control directly affects tail latency.
Important knobs:
- `vm.dirty_background_ratio` or `vm.dirty_background_bytes`
- `vm.dirty_ratio` or `vm.dirty_bytes`
- `vm.dirty_expire_centisecs`
- `vm.dirty_writeback_centisecs`
The intuition:
- the background threshold starts flushing in the background;
- the hard dirty threshold can throttle or block writers.
If a service is usually fast but occasionally stalls during bursts, dirty-page waves are a strong suspect.
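One application-side way to smooth those waves is to kick writeback for finished ranges during the burst instead of letting the `vm.dirty_*` limits throttle the writer. A minimal sketch, assuming an illustrative file and chunk sizes, using the Linux-specific sync_file_range call:

```cpp
// Sketch: bound the dirty backlog from a write burst by starting writeback
// for completed ranges instead of accumulating dirty pages until the hard limit.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    int fd = open("/tmp/burst-demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    std::vector<char> chunk(1 << 20, 'x');   // 1 MiB per write, illustrative
    const long chunk_count = 512;            // ~512 MiB burst, illustrative
    off_t flushed = 0;

    for (long i = 0; i < chunk_count; ++i) {
        if (write(fd, chunk.data(), chunk.size()) < 0) { perror("write"); return 1; }
        off_t written = (i + 1) * (off_t)chunk.size();

        // Every 64 MiB, ask the kernel to start writeback for the finished range
        // so dirty pages do not pile up until vm.dirty_* limits throttle this thread.
        if (written - flushed >= (64 << 20)) {
            sync_file_range(fd, flushed, written - flushed, SYNC_FILE_RANGE_WRITE);
            flushed = written;
        }
    }
    fdatasync(fd);   // final durability point
    close(fd);
    return 0;
}
```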
Huge pages and the TLB
The TLB caches address translations.
- with 4 KB pages, large memory regions require many more TLB entries;
- with larger pages such as 2 MB, the same TLB can cover far more memory.
That can reduce TLB miss cost, but only when the workload actually benefits from it. Transparent Huge Pages help some sequential or large-working-set systems a lot, while they hurt others through compaction or fragmentation side effects.
The practical rule is simple: test the THP modes (`always`, `madvise`, `never`) against P99, not just average throughput.
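When the `madvise` mode wins, the application opts specific hot regions into huge pages rather than the whole system. A minimal sketch, assuming an anonymous 1 GiB hot region:

```cpp
// Sketch: opt one hot region into transparent huge pages with madvise,
// instead of enabling THP "always" system-wide. Region size is illustrative.
#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t size = 1UL << 30;  // 1 GiB hot region, assumed working set
    void* region = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    // Only effective when /sys/kernel/mm/transparent_hugepage/enabled is set to
    // "madvise" (or "always"); otherwise this hint is a no-op.
    if (madvise(region, size, MADV_HUGEPAGE) != 0) perror("madvise");

    memset(region, 0, size);  // touch the region so pages are actually allocated
    // ... hot data structures live here ...
    munmap(region, size);
    return 0;
}
```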
NUMA locality matters
On multi-socket systems, running threads on one NUMA node while their memory lives on another silently wastes performance.
Useful commands:
numactl --hardware
numastat -p <pid>
What usually helps:
- pin important threads;
- keep memory local when possible;
- design sharded services so compute and memory stay together.
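A minimal sketch of the last point, assuming libnuma is available (link with -lnuma) and that node 0 is the right home for this shard; a real service would pick the node per shard or per worker.

```cpp
// Sketch: keep a worker's compute and memory on the same NUMA node with libnuma.
#include <numa.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    const int node = 0;                 // assumption: this shard belongs on node 0
    const size_t size = 256UL << 20;    // 256 MiB shard, illustrative

    numa_run_on_node(node);                       // schedule this thread on node 0
    void* shard = numa_alloc_onnode(size, node);  // allocate its memory on node 0
    if (!shard) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }

    memset(shard, 0, size);  // first touch also lands on the chosen node
    // ... run the shard's work here, with compute and memory co-located ...
    numa_free(shard, size);
    return 0;
}
```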
Containers change the failure mode
In containerized environments, it is common to see a host with free memory while the service still gets killed. That usually means the cgroup limit is the real boundary.
What to check:
- `memory.current` versus `memory.max`;
- `memory.events` and its `oom_kill` counter;
- whether page cache and anonymous memory are fighting inside the same limit.
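A small sketch of that check from inside the container, assuming a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup; in other setups the cgroup path would need to be resolved from /proc/self/cgroup first.

```cpp
// Sketch: read a service's cgroup v2 memory state (assumed mount point).
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

std::string read_file(const std::string& path) {
    std::ifstream f(path);
    return std::string((std::istreambuf_iterator<char>(f)),
                       std::istreambuf_iterator<char>());
}

int main() {
    const std::string base = "/sys/fs/cgroup";  // assumed cgroup v2 mount point
    std::cout << "memory.current: " << read_file(base + "/memory.current");
    std::cout << "memory.max:     " << read_file(base + "/memory.max");
    // memory.events includes an oom_kill counter; a rising value here explains
    // "the host has free memory but the service still got killed".
    std::cout << "memory.events:\n" << read_file(base + "/memory.events");
    return 0;
}
```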
TLB and CPU cache are often the real CPU story
High CPU usage does not always mean "more computation". It often means bad memory behavior.
TLB misses
Every virtual address access needs translation.
- on a TLB hit, that is cheap;
- on a TLB miss, page-table walks add extra memory work.
Random access across a working set that is too large for effective TLB coverage can degrade performance hard even when the code itself is simple.
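A micro-benchmark sketch of that effect, with an assumed working set of about 1 GiB: the same number of reads, executed sequentially or in shuffled order, behaves very differently once perf stat is pointed at the dTLB and cache miss counters.

```cpp
// Sketch: sequential vs random access over a working set far larger than TLB coverage.
// Run once with no arguments (sequential) and once with any argument (random),
// under perf stat, and compare dTLB-load-misses and cache-misses.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main(int argc, char**) {
    const size_t n = (1UL << 30) / sizeof(uint64_t);  // ~1 GiB working set, assumed
    std::vector<uint64_t> data(n, 1);

    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    bool random_order = (argc > 1);
    if (random_order) std::shuffle(order.begin(), order.end(), std::mt19937_64{42});

    auto start = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (size_t i : order) sum += data[i];   // same work, very different locality
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();

    printf("%s sum=%llu took %lld ms\n", random_order ? "random" : "sequential",
           (unsigned long long)sum, (long long)ms);
    return 0;
}
```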
CPU cache misses
Latency climbs sharply as accesses move from L1 to L2 to L3 and finally to DRAM.
Common causes:
- pointer-heavy data structures;
- large hot objects that spill across cache lines;
- weak locality;
- false sharing between threads.
Useful observation:
perf stat -e cycles,instructions,cache-references,cache-misses,dTLB-load-misses,iTLB-load-misses -p <pid> -- sleep 30
That tells you whether the workload is compute-bound or memory-bound.
Disk tuning only makes sense in context
Read the right metrics
With iostat -x, the most useful fields are usually:
- `r/s` and `w/s`
- `rkB/s` and `wkB/s`
- `await`
- `avgqu-sz`
%util by itself is not enough, especially on modern NVMe devices.
Page-cache hit rate is a performance multiplier
If random reads dominate and the page cache cannot hold the working set, read latency falls back to storage behavior.
That means optimization may involve:
- keeping hot data resident longer;
- changing access patterns;
- adding application-level caching where appropriate.
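One application-level option for the first point is to hint the kernel about the hot range ahead of demand. A minimal sketch, with an assumed path and range; posix_fadvise is only a hint, not a guarantee.

```cpp
// Sketch: ask the kernel to pull a file's hot range into the page cache
// before the read path needs it. Path and range are illustrative assumptions.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/data/hot-index.db", O_RDONLY);   // illustrative path
    if (fd < 0) { perror("open"); return 1; }

    const off_t offset = 0;
    const off_t length = 256L << 20;   // first 256 MiB assumed to be the hot range

    // Prefetch the range into the page cache so later random reads hit memory
    // instead of falling through to storage.
    int rc = posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
    if (rc != 0) fprintf(stderr, "posix_fadvise failed: %d\n", rc);

    close(fd);
    return 0;
}
```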
Write bursts create deceptive failures
Under bursty writes:
- dirty pages accumulate;
- writeback falls behind;
- foreground threads get dragged into the problem.
The system may feel slow long before the disk appears fully saturated.
Concrete optimization techniques
Data layout and code
- keep hot fields close together and move cold fields away from the hot path;
- reduce pointer chasing when arrays or compact layouts can do the job;
- batch work where possible to improve locality and reduce overhead;
- avoid false sharing between threads that update different values on the same cache line;
- use object pools or arenas when allocator churn becomes measurable.
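As one illustration of the last item, a simple bump arena trades per-object frees for a single reset per request or batch. This is a sketch under those assumptions, not a production allocator.

```cpp
// Sketch of a bump arena: short-lived allocations come from one contiguous
// block (good locality), and everything is released with a single reset.
#include <cstddef>
#include <vector>

class Arena {
public:
    explicit Arena(size_t capacity) : buffer_(capacity), offset_(0) {}

    // Hands out aligned chunks; returns nullptr when the arena is exhausted.
    void* allocate(size_t size, size_t alignment = alignof(std::max_align_t)) {
        size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned + size > buffer_.size()) return nullptr;
        offset_ = aligned + size;
        return buffer_.data() + aligned;
    }

    // Called once per request or batch instead of freeing objects one by one.
    void reset() { offset_ = 0; }

private:
    std::vector<std::byte> buffer_;
    size_t offset_;
};
```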
Cache-line alignment
Aligning a structure to a cache line can help, but only in the right cases.
It is usually worth it when:
- multiple threads frequently write to independent counters or queue metadata;
- false sharing is the real bottleneck.
It is usually not worth it when:
- the object is mostly read-only;
- alignment inflates memory usage so much that cache efficiency or TLB pressure gets worse overall.
Example:
```cpp
#include <atomic>  // needed for std::atomic

// Each shard's counter occupies its own cache line, so hot writers never share a line.
struct alignas(64) ShardCounter {
    std::atomic<unsigned long long> value;
    char pad[64 - sizeof(std::atomic<unsigned long long>)];
};
```
The gain here does not come from magic alignment by itself. It comes from preventing unrelated hot writes from bouncing the same cache line across cores.
CPU and scheduling
- keep thread count proportional to the number of cores;
- reduce lock hold time before trying fancy lock-free structures;
- pin critical threads if migration is hurting locality;
- separate interrupts and critical worker threads if they compete on the same cores.
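A minimal sketch of pinning, using the Linux-specific pthread_setaffinity_np (compile with -pthread); the core id is an assumption that would normally come from the service's core map.

```cpp
// Sketch: pin the current thread to one core so it stops migrating and keeps
// its working set warm in that core's private caches.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return false;
    }
    return true;
}

int main() {
    if (pin_current_thread(2)) {   // core 2 picked purely for illustration
        // ... run the latency-critical loop here ...
    }
    return 0;
}
```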
Memory and reclaim
- keep enough headroom for both anonymous memory and page cache;
- tune dirty-page limits for bursty write patterns;
- benchmark THP choices instead of copying defaults from another system;
- monitor reclaim counters continuously if the service is latency-sensitive.
Storage
- convert random IO to more sequential IO where possible;
- batch logs and reduce fsync pressure;
- split heavy write paths from read-sensitive paths when practical;
- choose the scheduler based on the actual storage device rather than habit.
Networking
- tune backlog and socket buffers with the actual traffic pattern in mind;
- distribute IRQ and softirq load sanely;
- cut copies and serialization work where possible because that load eventually becomes CPU and cache pressure too.
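A small sketch of explicit socket buffer sizing; the 4 MiB figure is an illustrative assumption, and the effective value is still capped by net.core.rmem_max.

```cpp
// Sketch: size a socket's receive buffer explicitly instead of relying on defaults.
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int rcvbuf = 4 * 1024 * 1024;  // illustrative: 4 MiB for a high-throughput peer
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) != 0) {
        perror("setsockopt SO_RCVBUF");
    }

    int effective = 0;
    socklen_t len = sizeof(effective);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len);
    // The kernel typically reports about twice the requested value to account
    // for bookkeeping overhead.
    printf("effective receive buffer: %d bytes\n", effective);

    close(fd);
    return 0;
}
```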
A repeatable debugging workflow
1. identify the user-facing symptom: throughput loss, P99 growth, error-rate shift;
2. split the problem into CPU, memory, disk, and network layers;
3. inspect the chain: iostat, vmstat, /proc/vmstat, perf stat;
4. build a causal story instead of a metric scrapbook;
5. change one class of variables at a time;
6. validate under load and keep the rollback path ready.
Example patterns:
- rising `pgmajfault` together with higher `await` usually means page-cache misses are falling through to disk;
- rising `allocstall` together with long-tail spikes often means direct reclaim is hurting foreground latency;
- rising cache misses together with TLB misses usually means locality or working-set shape got worse.
Final thought
The most useful Linux performance work is not about memorizing kernel parameters. It is about seeing the full chain, measuring where time is actually spent, and making the system more predictable rather than just faster in one benchmark.