What performance metrics should SREs monitor on Linux?

Monitor: CPU utilization and context switching, memory (RSS, page faults), disk I/O (latency, throughput), network (bandwidth, packet loss, errors), and process-level metrics (file handles, threads).

How do you diagnose slow systems on Linux?

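Diagnose using: top/htop for CPU and memory, vmstat for run queue, memory, and swap activity, iostat -x for per-device disk latency and utilization, ss (the modern replacement for netstat) for sockets, and perf for CPU profiling. Work through them systematically to rule out one layer at a time.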
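A quick first check before working through those tools is to compare the load average with the CPU count. The sketch below assumes bash, awk, and coreutils' nproc. `load_per_cpu` is a hypothetical helper name, not a standard tool. Linux counts tasks in uninterruptible (D-state) sleep in the load average, so a high value with idle CPUs often points at storage or NFS waits rather than CPU saturation.

```shell
#!/usr/bin/env bash
# Sketch: 1-minute load average divided by the CPU count.
# Linux load includes runnable tasks AND tasks in uninterruptible (D-state) sleep.

load_per_cpu() {
  # $1: loadavg-format file (default /proc/loadavg)
  # $2: CPU count (default: output of nproc)
  local file="${1:-/proc/loadavg}" cpus="${2:-$(nproc)}"
  awk -v cpus="$cpus" '{ printf "%.2f\n", $1 / cpus }' "$file"
}

if [ -r /proc/loadavg ]; then
  echo "1-min load per CPU: $(load_per_cpu)"
fi
```

A result well above 1.0 is worth following up with vmstat. There, the r column shows runnable tasks and the b column shows tasks blocked in uninterruptible sleep.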

What are common Linux tuning mistakes?

Common mistakes: over-tuning kernel parameters without understanding impact, ignoring application-level bottlenecks, not measuring before/after changes, and applying tuning changes during incidents.

Linux Performance Tuning: Proven Techniques Every SRE Must Master


Introduction

Misconfigured or untuned Linux hosts are a frequent contributor to production incidents. Site Reliability Engineers (SREs) must keep systems performing under heavy workloads, which makes Linux performance tuning an essential skill. This guide covers practical techniques for tuning Linux systems to improve reliability, performance, and operational efficiency. Most values below are common starting points rather than universal defaults, so measure before and after each change.

Step-by-Step Linux Optimization Guide

Step 1: Adjust Swappiness for Optimal Memory Management

Check current swappiness:

cat /proc/sys/vm/swappiness

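Lower swappiness so that the kernel prefers reclaiming page cache over swapping out application memory. The default is 60, and 10 is a common starting point for latency-sensitive servers:

sudo sysctl -w vm.swappiness=10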
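Changes made with sysctl -w last only until the next reboot. One way to persist them is a drop-in file under /etc/sysctl.d/, sketched below. The helper name `render_sysctl_dropin` and the file name 90-sre-tuning.conf are hypothetical choices. The function only prints the file contents, so it is safe to run as-is.

```shell
#!/usr/bin/env bash
# Sketch: render sysctl settings as a drop-in file body ("key = value" per line).
# Nothing is written to disk here; pipe the output to sudo tee to install it.

render_sysctl_dropin() {
  local kv
  for kv in "$@"; do
    # Split "key=value" at the first "=" so values containing spaces survive.
    printf '%s = %s\n' "${kv%%=*}" "${kv#*=}"
  done
}

# Preview the drop-in for the swappiness change above.
render_sysctl_dropin vm.swappiness=10
```

To install it, run `render_sysctl_dropin vm.swappiness=10 | sudo tee /etc/sysctl.d/90-sre-tuning.conf` and then `sudo sysctl --system` to reload every configuration file. The same approach works for the other sysctl values in this guide.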

Step 2: Increase File Descriptor Limits

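Check the current soft limit for your shell (ulimit -Hn shows the hard limit):

ulimit -n

Raise the limits for login sessions. pam_limits applies them at the next login:

echo '* soft nofile 65535' | sudo tee -a /etc/security/limits.conf
echo '* hard nofile 65535' | sudo tee -a /etc/security/limits.conf

Services started by systemd do not read limits.conf. For those, set LimitNOFILE=65535 in the unit's [Service] section (for example with sudo systemctl edit <unit>) and restart the service.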
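New limits apply only to processes started after the change, so confirm that the process you care about actually picked them up. The sketch below reads /proc/<pid>/limits and counts the entries in /proc/<pid>/fd. `fd_usage` is a hypothetical helper name, and inspecting another user's process requires root.

```shell
#!/usr/bin/env bash
# Sketch: open file descriptors versus the soft "Max open files" limit for one process.

fd_usage() {
  local pid="${1:-$$}" open soft
  # Each entry in /proc/<pid>/fd is one open descriptor.
  open=$(( $(ls -1 "/proc/$pid/fd" | wc -l) ))
  # Line format: "Max open files  <soft>  <hard>  files"; field 4 is the soft limit.
  soft=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
  echo "$open/$soft"
}

# Demo: inspect this shell.
fd_usage $$
```

For example, running `fd_usage "$(pgrep -o nginx)"` as root prints the open-descriptor count over the soft limit for the oldest nginx process. nginx is only an illustration.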

Step 3: Resource Isolation with cgroups

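Create a memory cgroup limited to 1 GiB. These commands use the libcgroup tools and the cgroup v1 layout; cgroup v2 hosts use a memory.max file instead:

sudo cgcreate -g memory:/critical_service
echo $((1024*1024*1024)) | sudo tee /sys/fs/cgroup/memory/critical_service/memory.limit_in_bytes

Then start the workload inside the group with sudo cgexec -g memory:critical_service <command>.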
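Memory-limit files differ between cgroup v1 and v2, so check which hierarchy a host runs before writing to them. The sketch below uses `stat -fc %T`, which reports cgroup2fs on a unified v2 mount, and maps the version to the matching limit file. `cgroup_version` and `memory_limit_file` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: detect the cgroup hierarchy and locate the memory-limit file for a group.

cgroup_version() {
  # /sys/fs/cgroup is itself a cgroup2 filesystem only on pure v2 (unified) hosts;
  # hybrid hosts mount tmpfs there and keep the memory controller on v1.
  if [ "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)" = "cgroup2fs" ]; then
    echo v2
  else
    echo v1
  fi
}

memory_limit_file() {
  # $1: v1 or v2, $2: cgroup name
  case "$1" in
    v2) echo "/sys/fs/cgroup/$2/memory.max" ;;
    v1) echo "/sys/fs/cgroup/memory/$2/memory.limit_in_bytes" ;;
    *)  echo "unknown cgroup version: $1" >&2; return 1 ;;
  esac
}

ver=$(cgroup_version)
echo "cgroup $ver; the limit for critical_service goes in $(memory_limit_file "$ver" critical_service)"
```

On cgroup v2 hosts managed by systemd, it is usually simpler to let systemd create the cgroup. For example, use `sudo systemctl set-property <unit> MemoryMax=1G` for an existing service, or `sudo systemd-run --scope -p MemoryMax=1G <command>` for an ad-hoc process.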

Step 4: Networking Optimization

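Allow TIME_WAIT sockets to be reused for new outgoing connections, and raise the cap on each listening socket's accept queue:

sudo sysctl -w net.ipv4.tcp_tw_reuse=1
sudo sysctl -w net.core.somaxconn=1024

tcp_tw_reuse only affects outgoing connections and relies on TCP timestamps (net.ipv4.tcp_timestamps, on by default). On kernels 5.4 and later, somaxconn already defaults to 4096, so check the current value first, because writing 1024 there would lower it. The effective accept queue is the smaller of somaxconn and the backlog the application passes to listen(), so the application's setting may need raising too.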
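A raise-only guard prevents you from accidentally lowering a sysctl whose default has grown. An example is net.core.somaxconn, which defaults to 4096 since Linux 5.4. The sketch below prints the command it would run rather than executing it. `raise_only` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: only ever raise a sysctl value, never lower it by accident.

raise_only() {
  # Prints the larger of the current and desired values.
  local current="$1" desired="$2"
  if (( desired > current )); then echo "$desired"; else echo "$current"; fi
}

if [ -r /proc/sys/net/core/somaxconn ]; then
  current=$(< /proc/sys/net/core/somaxconn)
  target=$(raise_only "$current" 1024)
  if [ "$target" != "$current" ]; then
    echo "would run: sudo sysctl -w net.core.somaxconn=$target"
  else
    echo "net.core.somaxconn=$current is already >= 1024; leaving it alone"
  fi
fi
```

The same guard is useful for net.ipv4.tcp_max_syn_backlog, whose default scales with system memory.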

Step 5: Select Appropriate I/O Scheduler

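Check the current scheduler. The active one is shown in brackets; replace sda with your device name, such as nvme0n1:

cat /sys/block/sda/queue/scheduler

Select the deadline scheduler. On current multi-queue (blk-mq) kernels it is named mq-deadline, and fast NVMe devices often do best with none:

echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

This setting is lost at reboot; Step 18 makes it persistent with a udev rule.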
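To survey every disk at once rather than only sda, the sketch below reads each device's scheduler file and extracts the bracketed (active) entry. `active_scheduler` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: list the active I/O scheduler for every block device.
# The kernel marks the active entry in brackets, e.g. "[mq-deadline] kyber bfq none".

active_scheduler() {
  # Print the name between the brackets of a scheduler line.
  sed -n 's/.*\[\([^]]*\)\].*/\1/p' <<< "$1"
}

for f in /sys/block/*/queue/scheduler; do
  [ -r "$f" ] || continue          # glob did not match, or the file is unreadable
  dev=${f#/sys/block/}
  dev=${dev%%/*}
  echo "$dev: $(active_scheduler "$(< "$f")")"
done
```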

Step 6: Real-time Diagnostics with perf

Sample the hottest functions across kernel and user space in real time (add -g to record call graphs):

sudo perf top

Step 7: Disable Transparent Huge Pages (THP)

Check THP status:

cat /sys/kernel/mm/transparent_hugepage/enabled

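Disable THP for both allocation and defragmentation. This is common advice for latency-sensitive workloads such as some databases, but other workloads benefit from THP, so measure first:

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

To keep THP disabled across reboots, add transparent_hugepage=never to the kernel command line.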

Step 8: Enable HugePages

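Reserve explicit HugePages. With the common 2 MiB page size, 1024 pages set aside 2 GiB. Only applications that use HugePages can use that memory, for example through hugetlbfs, mmap with MAP_HUGETLB, or SysV shared memory with SHM_HUGETLB:

sudo sysctl -w vm.nr_hugepages=1024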
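The pages-to-memory arithmetic is easy to get wrong, because nr_hugepages counts pages, not bytes. The sketch below reads Hugepagesize from /proc/meminfo (2048 kB on most x86-64 systems) and converts a target size in GiB to a page count. `hugepages_for` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: convert a memory target in GiB into a vm.nr_hugepages value.

hugepages_for() {
  # $1: memory to reserve in GiB, $2: huge page size in kB
  echo $(( $1 * 1024 * 1024 / $2 ))
}

hugepage_kb=$(awk '/^Hugepagesize:/ { print $2 }' /proc/meminfo 2>/dev/null)
if [ -n "$hugepage_kb" ]; then
  echo "huge page size: ${hugepage_kb} kB; pages for 2 GiB: $(hugepages_for 2 "$hugepage_kb")"
fi
```

After you apply the value, HugePages_Total and HugePages_Free in /proc/meminfo show how many pages the kernel actually reserved. That can be fewer than requested when memory is fragmented.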

Step 9: Tweak Cache Behavior

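Lower the writeback thresholds. With these values, background writeback starts when dirty pages reach 5% of available memory (free plus reclaimable pages). At 15%, processes that generate writes must write back their own dirty data: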

sudo sysctl -w vm.dirty_ratio=15
sudo sysctl -w vm.dirty_background_ratio=5
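Percentages hide how much dirty data can pile up on large hosts. The rough sketch below converts the ratios to kilobytes, using MemAvailable from /proc/meminfo as the base. The kernel actually uses its own count of free plus reclaimable pages, so treat the result as an estimate. `dirty_threshold_kb` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough sketch: approximate the dirty-page thresholds in kB.
# MemAvailable only approximates the kernel's own base for these ratios.

dirty_threshold_kb() {
  # $1: ratio in percent, $2: memory base in kB
  echo $(( $2 * $1 / 100 ))
}

avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo 2>/dev/null)
if [ -n "$avail_kb" ]; then
  echo "background writeback starts near $(dirty_threshold_kb 5 "$avail_kb") kB dirty"
  echo "writers are throttled near $(dirty_threshold_kb 15 "$avail_kb") kB dirty"
fi
```

On a host with 16 GiB available, 15% is roughly 2.4 GiB of dirty data before writers are throttled. If that is too much, vm.dirty_bytes and vm.dirty_background_bytes set absolute thresholds instead. Writing one of them makes its ratio counterpart read as 0.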

Step 10: Optimize IRQ Balancing

Install irqbalance, which distributes hardware interrupts across CPUs, and start it now and at boot (use dnf instead of apt-get on RHEL-family systems):

sudo apt-get install irqbalance
sudo systemctl enable --now irqbalance
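To check whether interrupts are concentrated on a few cores, before or after enabling irqbalance, the sketch below sums the per-CPU columns of /proc/interrupts. `irq_per_cpu` is a hypothetical helper name. The counters are cumulative since boot, so compare two snapshots taken a few seconds apart to see current behavior.

```shell
#!/usr/bin/env bash
# Sketch: total interrupts handled by each CPU, from a /proc/interrupts-format file.

irq_per_cpu() {
  awk '
    NR == 1 {                      # header row: CPU0 CPU1 ...
      n = NF
      for (j = 1; j <= NF; j++) name[j - 1] = $j
      next
    }
    {
      # Per-CPU counts follow the "IRQ:" label in fields 2..n+1;
      # rows with a single total (e.g. ERR, MIS) contribute only one number.
      for (i = 2; i <= n + 1; i++)
        if ($i ~ /^[0-9]+$/) sum[i - 2] += $i
    }
    END { for (c = 0; c < n; c++) printf "%s %.0f\n", name[c], sum[c] }
  ' "${1:-/proc/interrupts}"
}

if [ -r /proc/interrupts ]; then
  irq_per_cpu
fi
```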

Step 11: Network Throughput Optimization

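Raise the maximum number of received packets queued per CPU when an interface delivers them faster than the kernel can process them: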

sudo sysctl -w net.core.netdev_max_backlog=5000
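/proc/net/softnet_stat shows whether that backlog is actually overflowing. Each row is one CPU, with hexadecimal counters. The second column counts packets dropped because the backlog queue was full, which is the case netdev_max_backlog addresses. The third column counts time squeezes, when packet processing ran out of budget; those are tuned with net.core.netdev_budget and netdev_budget_usecs instead. `softnet_drops` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: sum backlog drops and time squeezes across all CPUs.
# Counters are hexadecimal and cumulative since boot; compare snapshots over time.

softnet_drops() {
  local processed dropped squeezed rest total_drop=0 total_squeeze=0
  while read -r processed dropped squeezed rest; do
    total_drop=$(( total_drop + 16#$dropped ))
    total_squeeze=$(( total_squeeze + 16#$squeezed ))
  done < "${1:-/proc/net/softnet_stat}"
  echo "dropped=$total_drop time_squeeze=$total_squeeze"
}

if [ -r /proc/net/softnet_stat ]; then
  softnet_drops
fi
```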

Step 12: Manage TCP SYN Backlog

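Raise the maximum number of remembered connection requests that have not yet completed the handshake (SYN_RECV). The default scales with system memory, so check the current value first: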

sudo sysctl -w net.ipv4.tcp_max_syn_backlog=2048

Step 13: TCP Connection Timeout

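Shorten how long orphaned connections stay in FIN-WAIT-2 (default 60 seconds). This does not shorten TIME_WAIT, whose 60-second duration is fixed in the kernel: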

sudo sysctl -w net.ipv4.tcp_fin_timeout=15

Step 14: Optimize TCP Buffer Sizes

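Raise the buffer ceilings. net.core.rmem_max and wmem_max cap the buffers that applications request explicitly with setsockopt(). TCP autotuning is bounded by the third (maximum) field of net.ipv4.tcp_rmem and tcp_wmem, so raise those as well:

sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"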
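One way to size these ceilings is the bandwidth-delay product (BDP). It is the number of bytes a single connection must keep in flight to fill a path: bandwidth in bytes per second times round-trip time in seconds. The sketch below computes it with integer arithmetic. `bdp_bytes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: bandwidth-delay product in bytes.
# Mbit/s * 1e6 / 8 = bytes/s and ms / 1000 = s, so the combined factor is 125.

bdp_bytes() {
  # $1: link speed in Mbit/s, $2: round-trip time in milliseconds
  echo $(( $1 * $2 * 125 ))
}

bdp_bytes 1000 100   # 1 Gbit/s path with a 100 ms RTT
```

For a 1 Gbit/s path with a 100 ms RTT, the result is 12,500,000 bytes, so a 16777216-byte (16 MiB) ceiling leaves some headroom. A 10 Gbit/s path with the same RTT would need about 125 MB.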

Step 15: Apply tuned-adm Profiles

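Install tuned, start its daemon, and apply a profile. Profiles set sysctls and CPU governors themselves, so they can override manual changes from the other steps:

sudo apt-get install tuned
sudo systemctl enable --now tuned
sudo tuned-adm profile throughput-performance
tuned-adm active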

Step 16: Scheduler Tunables

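Autogroup places the tasks from each session in their own scheduling group, which improves interactive responsiveness on desktops. On servers, note that with autogroup enabled, nice values only take effect relative to other tasks in the same session, so test before relying on it:

sudo sysctl -w kernel.sched_autogroup_enabled=1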

Step 17: Implement zswap

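Enable zswap, a compressed in-memory cache for pages on their way to swap. It has no effect unless swap is configured. zswap is a kernel module parameter, not a sysctl:

echo 1 | sudo tee /sys/module/zswap/parameters/enabled

To enable it at boot, add zswap.enabled=1 to the kernel command line.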
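zswap is controlled through module parameters under /sys/module/zswap/parameters. Listing that directory is a quick way to confirm that zswap is on and to see which compressor and pool limit are in use. `zswap_params` is a hypothetical helper name, and the set of parameters varies between kernel versions.

```shell
#!/usr/bin/env bash
# Sketch: print every zswap module parameter as name=value.

zswap_params() {
  local dir="${1:-/sys/module/zswap/parameters}" f
  for f in "$dir"/*; do
    [ -r "$f" ] || continue        # directory missing (zswap not built) or unreadable
    printf '%s=%s\n' "${f##*/}" "$(< "$f")"
  done
}

zswap_params
```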

Step 18: SSD Optimization with udev

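Create a udev rule that selects mq-deadline for non-rotational SATA/SAS disks whenever they are added or changed. Then reload the rules and apply them to existing devices:

echo 'ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"' | sudo tee /etc/udev/rules.d/60-ssd.rules
sudo udevadm control --reload
sudo udevadm trigger --subsystem-match=block --action=change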

Step 19: Kernel Samepage Merging (KSM)

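Enable KSM. It only scans memory that applications have registered with madvise(MADV_MERGEABLE), such as KVM guest memory, and the scanning itself costs CPU:

echo 1 | sudo tee /sys/kernel/mm/ksm/run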
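Because KSM costs CPU, check whether it saves meaningful memory. According to the kernel's KSM documentation, pages_shared counts the shared pages in use, and pages_sharing counts how many more sites share them, i.e. how many pages are saved. The sketch below multiplies pages_sharing by the page size reported by getconf. `ksm_saved_bytes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: approximate memory saved by KSM, in bytes.

ksm_saved_bytes() {
  # $1: KSM sysfs directory, $2: page size in bytes
  local dir="${1:-/sys/kernel/mm/ksm}" page_size="${2:-$(getconf PAGESIZE)}"
  echo $(( $(< "$dir/pages_sharing") * page_size ))
}

if [ -r /sys/kernel/mm/ksm/pages_sharing ]; then
  echo "KSM is saving roughly $(ksm_saved_bytes) bytes"
fi
```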

Step 20: Regular fstrim

Enable and start the fstrim timer. It periodically discards unused blocks on mounted filesystems that support it (weekly by default):

sudo systemctl enable --now fstrim.timer

Step 21: CPU Governor Adjustment

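Set the performance governor on every CPU. This requires a cpufreq driver, which many cloud VMs do not expose, and the setting is lost at reboot. The throughput-performance tuned profile from Step 15 also sets it:

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor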
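Governor changes can miss CPUs or conflict with other tools such as tuned, so count the governor in use across all CPUs. The sketch below reads each cpuN/cpufreq/scaling_governor file. `governor_summary` is a hypothetical helper name. An empty result usually means the platform does not expose cpufreq at all.

```shell
#!/usr/bin/env bash
# Sketch: count CPUs per cpufreq governor, printed as "governor: count".

governor_summary() {
  local base="${1:-/sys/devices/system/cpu}" f
  # cpu[0-9]* skips sibling directories such as cpufreq and cpuidle.
  for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -r "$f" ] && cat "$f"
  done | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

governor_summary
```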

Automating Performance Tuning

Manual changes drift between hosts and are lost on reboot, so encode tuning in configuration management tools such as Ansible or Chef to keep systems consistent.

Example Ansible Playbook for Performance Tuning

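- hosts: all
  become: true              # sysctl and limits.conf changes need root
  tasks:
    - name: Set vm.swappiness
      ansible.posix.sysctl:  # persists to /etc/sysctl.conf and applies it
        name: vm.swappiness
        value: '10'
        state: present
        reload: true

    - name: Increase file descriptor limits
      ansible.builtin.lineinfile:
        path: /etc/security/limits.conf
        line: "{{ item }}"
        create: true
      loop:
        - '* soft nofile 65535'
        - '* hard nofile 65535'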

Actionable Takeaways: Your Tuning Checklist

Measure a baseline before each change and compare afterward
Adjust kernel parameters and limits (swappiness, file descriptor limits)
Implement cgroups for resource isolation
Optimize networking and the TCP stack
Choose appropriate I/O schedulers
Automate tuning tasks with Ansible or Chef
Monitor continuously using tools like perf
Apply the remaining techniques above (HugePages, writeback, IRQ, and CPU tuning) only where measurements show a need

When you apply these techniques one step at a time and verify each against measurements, they help your infrastructure handle peak loads with better uptime and reliability.
