Load Testing: Finding Your Breaking Point Before Users Do
TL;DR
Load test before launch. Use k6 or Artillery to simulate traffic. Test realistic scenarios, not just homepage. Find your breaking point. Monitor CPU, memory, database connections. Fix bottlenecks before users find them.
We launched our Black Friday sale at midnight. By 12:03am, the site was down. Database connections exhausted. API timeouts everywhere. Users couldn't checkout. We lost $47,000 in sales in 3 minutes.
We'd tested the site, but only with 10 concurrent users. Black Friday brought 2,500. The database connection pool maxed out at 100 connections. Every request hung waiting for a free connection.
Load testing would have found this in 5 minutes. Here's how to load test properly, find your breaking points before users do, and avoid the mistakes that killed our Black Friday.
What Is Load Testing?
Load testing is simulating real user traffic to find performance problems before production.
// Your API might work fine with 10 users
GET /api/products - 120ms ✓
// But what about 1,000 users?
GET /api/products - 45 seconds ✗
// Or 10,000 users?
GET /api/products - timeout ✗
You need to know:
- How many users can your system handle?
- What breaks first? (database, API, memory, CPU)
- How does it fail? (slow? errors? crash?)
- Can it recover? (or does it stay down)
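Why a small connection pool falls over so abruptly can be sketched with a toy queueing model (hypothetical numbers; `worstCaseLatency` is an illustration, not a real API): once concurrency exceeds the pool size, each extra "batch" of waiting requests adds a full service time to the tail latency.

```javascript
// Toy model: worst-case latency when N concurrent requests share a pool of C
// connections. Assumes every query takes the same serviceMs; real systems are
// messier, but the shape (flat, then linear growth past the pool size) holds.
function worstCaseLatency(concurrent, poolSize, serviceMs) {
  // The last request in line waits for every full "batch" ahead of it.
  const batchesAhead = Math.ceil(concurrent / poolSize) - 1;
  return serviceMs + batchesAhead * serviceMs;
}

console.log(worstCaseLatency(10, 100, 120));   // 120  - pool has headroom
console.log(worstCaseLatency(100, 100, 120));  // 120  - exactly full, no queueing
console.log(worstCaseLatency(2500, 100, 120)); // 3000 - 25 batches deep
```

With a pool of 100 and 120ms queries, 10 or 100 users see 120ms, but 2,500 users push the worst case to 3 seconds, which is why a pool-exhaustion failure feels like falling off a cliff rather than gradual slowdown.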
Types of Load Tests
1. Load Test (Normal Traffic)
Simulate expected traffic levels:
// Example: E-commerce site expects 500 concurrent users
// Load test with 500 virtual users for 10 minutes
// Verify: response times < 2s, error rate < 1%
Goal: Verify system handles normal load
2. Stress Test (Find Breaking Point)
Gradually increase load until something breaks:
// Start: 100 users
// Every 30s: add 100 users
// Continue until: errors > 5% or response time > 5s
// Result: System breaks at 1,200 users
Goal: Find maximum capacity
3. Spike Test (Sudden Traffic)
Simulate sudden traffic spike:
// Normal: 200 users
// Spike: Jump to 2,000 users instantly
// Duration: 5 minutes
// Check: Does system handle it or crash?
Goal: Verify the system survives a sudden surge (e.g. hitting the Hacker News or Reddit front page)
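In k6, the spike shape above maps directly onto `stages`. A config sketch using the same hypothetical numbers (tune targets and durations to your own traffic):

```javascript
// k6 config fragment: hold normal load, spike hard, sustain, then recover.
export const options = {
  stages: [
    { duration: '2m', target: 200 },   // normal load
    { duration: '10s', target: 2000 }, // sudden spike
    { duration: '5m', target: 2000 },  // sustain the spike
    { duration: '10s', target: 200 },  // drop back
    { duration: '2m', target: 200 },   // verify the system recovers
  ],
};
```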
4. Soak Test (Sustained Load)
Run at normal load for extended period:
// Load: 500 users
// Duration: 4-24 hours
// Check: Memory leaks, connection leaks, degradation over time
Goal: Find issues that only appear after hours
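Soak-test results are mostly about trends, not single numbers. A crude sketch of the kind of check you might script over periodic memory samples (the warm-up window is an assumption; tune it to your app):

```javascript
// Crude leak heuristic for a soak test: if heap usage still trends strictly
// upward after warm-up, suspect a leak. Samples are MB readings taken at a
// fixed interval during the test.
function looksLikeLeak(samplesMb, warmup = 2) {
  const steady = samplesMb.slice(warmup); // ignore warm-up samples
  let rises = 0;
  for (let i = 1; i < steady.length; i++) {
    if (steady[i] > steady[i - 1]) rises++;
  }
  // "Leaky" if memory rose in every interval after warm-up.
  return rises >= steady.length - 1;
}

console.log(looksLikeLeak([300, 380, 400, 410, 425, 440, 460])); // true - monotonic growth
console.log(looksLikeLeak([300, 380, 400, 395, 402, 398, 401])); // false - plateaued
```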
Load Testing Tools
k6 (My Favorite)
Modern, JavaScript-based, excellent performance:
// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 100 }, // Ramp up to 100 users
    { duration: '5m', target: 100 }, // Stay at 100 users
    { duration: '1m', target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% of requests < 500ms
    http_req_failed: ['rate<0.01'],   // Error rate < 1%
  },
};

export default function () {
  // Simulate user behavior
  const res = http.get('https://api.example.com/products');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1); // Think time between requests
}
# Run test
k6 run load-test.js
# Output:
# execution: local
# vus: 100, duration: 7m
#
# ✓ status is 200
# ✓ response time < 500ms
#
# http_req_duration..........: avg=234ms p(95)=456ms
# http_req_failed............: 0.12%
Artillery (Simple, YAML-Based)
# load-test.yml
config:
  target: 'https://api.example.com'
  phases:
    - duration: 60
      arrivalRate: 10 # 10 new users per second
    - duration: 300
      arrivalRate: 50 # 50 new users per second
scenarios:
  - name: "Browse products"
    flow:
      - get:
          url: "/products"
      - think: 2 # Wait 2 seconds
      - get:
          url: "/products/{{ $randomNumber(1, 1000) }}"
      - think: 1
      - post:
          url: "/cart"
          json:
            productId: "{{ $randomNumber(1, 1000) }}"
            quantity: 1
# Run test
artillery run load-test.yml
# Results:
# Request rate: 47/sec
# Response time: p50=120ms, p95=450ms, p99=890ms
# Errors: 0.5%
Locust (Python-Based)
# locustfile.py
import random

from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)  # Wait 1-3s between requests

    @task(3)  # Weight: 3x more likely than the other tasks
    def browse_products(self):
        self.client.get("/products")

    @task(1)
    def view_product(self):
        product_id = random.randint(1, 1000)
        self.client.get(f"/products/{product_id}")

    @task(1)
    def add_to_cart(self):
        self.client.post("/cart", json={
            "productId": random.randint(1, 1000),
            "quantity": 1
        })

    def on_start(self):
        # Login once when this simulated user starts
        self.client.post("/login", json={
            "username": "test@example.com",
            "password": "password"
        })
# Run with web UI
locust -f locustfile.py --host=https://api.example.com
# Open browser: http://localhost:8089
# Configure: 1000 users, spawn rate 10/sec
# Watch real-time graphs
Realistic Load Test Scenarios
Don't just test the homepage - simulate real user behavior:
E-commerce User Journey
// k6 test
import http from 'k6/http';
import { check, sleep } from 'k6';
import { randomIntBetween } from 'https://jslib.k6.io/k6-utils/1.2.0/index.js';

export default function () {
  // 1. Browse products
  const products = http.get('https://api.example.com/products');
  check(products, { 'products loaded': (r) => r.status === 200 });
  sleep(randomIntBetween(2, 5));

  // 2. View product details
  const productId = products.json()[0].id;
  const product = http.get(`https://api.example.com/products/${productId}`);
  check(product, { 'product loaded': (r) => r.status === 200 });
  sleep(randomIntBetween(3, 8));

  // 3. Add to cart (67% of users)
  if (Math.random() < 0.67) {
    const cart = http.post(
      'https://api.example.com/cart',
      JSON.stringify({ productId: productId, quantity: 1 }),
      { headers: { 'Content-Type': 'application/json' } }
    );
    check(cart, { 'added to cart': (r) => r.status === 200 });
    sleep(randomIntBetween(1, 3));
  }

  // 4. Checkout (15% of users)
  if (Math.random() < 0.15) {
    const checkout = http.post('https://api.example.com/checkout');
    check(checkout, { 'checkout success': (r) => r.status === 200 });
    sleep(1);
  }
}
API Load Test with Authentication
// k6 test with auth
import http from 'k6/http';
import { sleep } from 'k6';

export function setup() {
  // Login once; k6 shares the returned data with every virtual user
  const loginRes = http.post(
    'https://api.example.com/login',
    JSON.stringify({ username: 'test@example.com', password: 'password' }),
    { headers: { 'Content-Type': 'application/json' } }
  );
  return { token: loginRes.json('token') };
}

export default function (data) {
  const params = {
    headers: {
      'Authorization': `Bearer ${data.token}`,
      'Content-Type': 'application/json',
    },
  };

  // Authenticated requests
  http.get('https://api.example.com/profile', params);
  sleep(1);
  http.get('https://api.example.com/orders', params);
  sleep(2);
}
Database-Heavy Operations
// Test expensive queries
export default function () {
  // Search (hits the database hard)
  http.get('https://api.example.com/search?q=javascript');
  sleep(1);

  // Analytics dashboard (complex aggregations)
  http.get('https://api.example.com/dashboard');
  sleep(2);

  // Reports (large data exports)
  http.get('https://api.example.com/reports/monthly');
  sleep(5);
}
Understanding Load Test Results
Key Metrics
// k6 output
http_req_duration..........: avg=234ms min=45ms med=198ms max=2.1s p(90)=345ms p(95)=456ms
http_req_failed............: 0.12%
http_reqs..................: 45230 (150/s)
vus........................: 100
What to watch:
- Response Time (p95, p99)
  p50 (median): 198ms - half of requests are faster than this
  p95: 456ms - 95% of requests are faster than this
  p99: 890ms - 99% of requests are faster than this
  If p95 is acceptable but p99 is bad, 1% of users have a terrible experience - usually the sign of an occasional bottleneck.
- Error Rate
  < 0.1%: Good
  0.1% - 1%: Acceptable
  > 1%: Problem
  > 5%: Critical
- Throughput (Requests/Second)
  Current: 150 req/s
  Target: 200 req/s
  Result: needs optimization
- Resource Usage
  CPU: 85% - high, approaching the limit
  Memory: 2.1GB / 4GB - OK
  Database connections: 98 / 100 - maxed out! ⚠️
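If your tool only reports averages, percentiles are easy to compute from raw latency samples. A minimal sketch using the nearest-rank method (one of several percentile conventions, so results can differ slightly from k6's):

```javascript
// Nearest-rank percentile over raw latency samples (ms).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  // Rank of the p-th percentile, 1-based; clamp to the array bounds.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [45, 90, 120, 150, 198, 240, 310, 456, 890, 2100];
console.log(percentile(latencies, 50)); // 198
console.log(percentile(latencies, 90)); // 890
```

Note how few samples dominate the tail: with only 10 samples, p99 and the max are the same single slow request, which is why tail percentiles need long test runs to be meaningful.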
Finding Bottlenecks
Monitor During Load Test
# Server CPU and memory
htop

-- Database connections
-- PostgreSQL:
SELECT count(*) FROM pg_stat_activity;
-- MySQL:
SHOW STATUS LIKE 'Threads_connected';

-- Slow queries (PostgreSQL, requires the pg_stat_statements extension)
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

// Application metrics (Node.js)
// Heap and RSS memory
process.memoryUsage()
// {
//   rss: 134217728,      // 128 MB
//   heapTotal: 52428800, // 50 MB
//   heapUsed: 41943040,  // 40 MB
//   external: 1024000
// }

// Event loop lag
const start = Date.now();
setImmediate(() => {
  console.log('Lag:', Date.now() - start, 'ms');
  // > 100ms = event loop blocked
});
Common Bottlenecks
1. Database Connection Pool Exhausted
// Symptom: requests hang, then time out after 30s
// Fix: increase the pool size

// Before
const pool = new Pool({
  max: 10 // Too small!
});

// After
const pool = new Pool({
  max: 50, // Sized for expected concurrent queries
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000
});
2. N+1 Queries
// Symptom: response time grows with data size
// The load test reveals 1,000 queries per request

// Before (N+1)
const users = await User.findAll();
for (const user of users) {
  user.posts = await Post.findAll({ where: { userId: user.id } });
}

// After (eager loading)
const users = await User.findAll({
  include: [Post]
});
3. Memory Leak
// Symptom: memory usage grows over time, crashes after hours
// Fix: find and fix the leaks

// Before (leak - global cache is never cleared)
const cache = {};
app.get('/product/:id', (req, res) => {
  cache[req.params.id] = heavyData; // Grows without bound
  res.json(cache[req.params.id]);
});

// After (LRU cache with a size limit)
const LRU = require('lru-cache');
const cache = new LRU({ max: 1000 });
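If you'd rather not pull in a dependency, the LRU (least-recently-used) idea itself is small. A minimal sketch using a `Map`, which iterates its keys in insertion order:

```javascript
// Minimal LRU cache: a Map keeps insertion order, so the first key is the
// least recently used as long as we re-insert on every access.
class MiniLRU {
  constructor(max) {
    this.max = max;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key);      // move the key to the "most recent" end
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.max) {
      // Evict the least recently used entry (first in iteration order).
      this.map.delete(this.map.keys().next().value);
    }
  }
}

const cacheDemo = new MiniLRU(2);
cacheDemo.set('a', 1);
cacheDemo.set('b', 2);
cacheDemo.get('a');       // touch 'a', so 'b' is now least recent
cacheDemo.set('c', 3);    // evicts 'b'
console.log(cacheDemo.get('b')); // undefined
console.log(cacheDemo.get('a')); // 1
```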
4. Blocking Operations
// Symptom: event loop lag, every request slows down

// Before (blocking)
app.get('/report', (req, res) => {
  const data = fs.readFileSync('large-file.csv'); // Blocks the event loop!
  res.send(processData(data));
});

// After (non-blocking)
app.get('/report', async (req, res) => {
  const data = await fs.promises.readFile('large-file.csv');
  res.send(processData(data));
});
5. No Caching
// Symptom: database hammered with the same queries

// Before (no cache)
app.get('/products', async (req, res) => {
  const products = await db.query('SELECT * FROM products');
  res.json(products);
});

// After (cache-aside with a 5-minute TTL)
app.get('/products', async (req, res) => {
  const cached = await redis.get('products');
  if (cached) {
    return res.json(JSON.parse(cached));
  }
  const products = await db.query('SELECT * FROM products');
  await redis.setex('products', 300, JSON.stringify(products));
  res.json(products);
});
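One gap worth knowing in this simple cache-aside pattern: when the key expires under load, every in-flight request misses at once and all of them hit the database together (a cache stampede). A minimal single-flight sketch that coalesces concurrent misses into one query (`singleFlight` is an illustration, not a library API):

```javascript
// Single-flight: concurrent cache misses for the same key share one in-flight
// promise, so an expired entry triggers exactly one database query.
const inFlight = new Map();

async function singleFlight(key, fetchFn) {
  if (inFlight.has(key)) return inFlight.get(key);
  const promise = fetchFn().finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}

// Usage sketch: 100 concurrent misses, one actual "database" call.
let dbCalls = 0;
const fakeDbQuery = async () => { dbCalls++; return 'products'; };

Promise.all(
  Array.from({ length: 100 }, () => singleFlight('products', fakeDbQuery))
).then(() => console.log(dbCalls)); // 1
```

In the handler above you would wrap the database call, e.g. `const products = await singleFlight('products', () => db.query('SELECT * FROM products'));`.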
Load Testing Strategy
1. Start Small
// Don't start with 10,000 users - start with 10 and verify correctness
export const options = {
  vus: 10,
  duration: '1m'
};

// Check:
// - No errors
// - Response times reasonable
// - The test scenario itself is correct
2. Find Baseline
// What's normal performance?
export const options = {
  stages: [
    { duration: '1m', target: 50 },
    { duration: '5m', target: 50 },
  ]
};

// Record:
// p50: 145ms
// p95: 280ms
// p99: 450ms
// Error rate: 0.02%
3. Stress Test (Find Breaking Point)
// Gradually increase until it breaks
export const options = {
  stages: [
    { duration: '2m', target: 100 },
    { duration: '2m', target: 200 },
    { duration: '2m', target: 400 },
    { duration: '2m', target: 800 },
    { duration: '2m', target: 1600 },
  ]
};

// Results:
// 100 users: p95=290ms, errors=0.01% ✓
// 200 users: p95=320ms, errors=0.02% ✓
// 400 users: p95=450ms, errors=0.08% ✓
// 800 users: p95=1.2s, errors=2.1% ✗
// Breaking point: somewhere between 400 and 800 users (~600)
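Reading a table like this can be automated: walk the stages in order and report the first one that violates your thresholds. A sketch using the hypothetical numbers above:

```javascript
// Find the first load stage that breaks the thresholds.
function findBreakingPoint(stages, { maxP95Ms, maxErrorRate }) {
  for (const s of stages) {
    if (s.p95Ms > maxP95Ms || s.errorRate > maxErrorRate) return s.users;
  }
  return null; // nothing broke within the tested range
}

const results = [
  { users: 100, p95Ms: 290, errorRate: 0.0001 },
  { users: 200, p95Ms: 320, errorRate: 0.0002 },
  { users: 400, p95Ms: 450, errorRate: 0.0008 },
  { users: 800, p95Ms: 1200, errorRate: 0.021 },
];
console.log(findBreakingPoint(results, { maxP95Ms: 500, maxErrorRate: 0.01 })); // 800
```

Note this reports the first stage that *failed* (800 users); the true capacity sits somewhere between that and the last passing stage, so bisect with more runs if you need a tighter number.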
4. Optimize and Retest
// Fixed the database pool and added caching - retest at the old breaking point
export const options = {
  vus: 800,
  duration: '10m'
};

// New results:
// p95: 380ms (was 1.2s)
// Errors: 0.05% (was 2.1%)
// Success! ✓
Interpreting Results: Real Example
Before Optimization
Load Test: 500 concurrent users

http_req_duration:
  avg: 1.2s
  p95: 4.5s
  p99: 12.3s
  max: 45s
http_req_failed: 3.2%
throughput: 120 req/s

Database:
  Connections: 10/10 (pool maxed out)
  Slow queries: 847
CPU: 92%
Memory: 3.8GB / 4GB
Problems identified:
- Database pool exhausted (10/10)
- p99 response time is terrible (12.3s)
- Error rate too high (3.2%)
- Slow queries hammering the database
After Optimization
Changes:
- Increased DB pool: 10 → 50
- Added Redis caching
- Fixed N+1 queries
- Added database indexes
Load Test: 500 concurrent users

http_req_duration:
  avg: 185ms (85% faster)
  p95: 340ms (92% faster)
  p99: 580ms (95% faster)
  max: 1.2s (97% faster)
http_req_failed: 0.08% (40x better)
throughput: 420 req/s (3.5x better)

Database:
  Connections: 28/50 (healthy)
  Slow queries: 0
CPU: 45%
Memory: 2.1GB / 4GB
Result: System handles 500 users easily, can likely handle 2000+
Load Testing Checklist
Before Launch:
- [ ] Load test at expected traffic (1.5x normal)
- [ ] Stress test to find breaking point
- [ ] Spike test for sudden traffic
- [ ] Soak test for 4+ hours
- [ ] Monitor database connections during test
- [ ] Monitor memory usage during test
- [ ] Check error logs for issues
- [ ] Verify response times meet SLA
- [ ] Test with realistic user scenarios
- [ ] Test authenticated endpoints
- [ ] Test database-heavy operations
- [ ] Fix all issues found
- [ ] Retest after fixes
Common Mistakes
Mistake 1: Testing Only Homepage
// BAD - only testing one endpoint
export default function () {
  http.get('https://example.com/');
}

// GOOD - a realistic user flow
export default function () {
  http.get('https://example.com/');
  http.get('https://example.com/products');
  http.get('https://example.com/products/123');
  http.post('https://example.com/cart', ...);
}
Mistake 2: No Think Time
// BAD - unrealistic (users don't click instantly)
export default function () {
  http.get('https://example.com/page1');
  http.get('https://example.com/page2');
  http.get('https://example.com/page3');
}

// GOOD - realistic pauses between pages
export default function () {
  http.get('https://example.com/page1');
  sleep(2);
  http.get('https://example.com/page2');
  sleep(3);
  http.get('https://example.com/page3');
  sleep(1);
}
Mistake 3: Testing from One Location
# BAD - only generate load from one machine in one location
k6 run load-test.js

# GOOD - run a distributed test from several regions
# (e.g. k6 Cloud, where load zones are configured in the script's options,
# or any other distributed runner)
k6 cloud load-test.js
Mistake 4: Not Monitoring Backend
// BAD - Only looking at k6 output
// No idea what's failing
// GOOD - Monitor everything
// - Application logs
// - Database connections
// - CPU/memory usage
// - Error tracking (Sentry)
// - APM (New Relic, Datadog)
Mistake 5: Testing Production
// BAD - Load testing production
// Causes real user impact!
// GOOD - Load test staging
// Identical to production
// No real users affected
CI/CD Integration
# .github/workflows/load-test.yml
name: Load Test

on:
  pull_request:
    branches: [main]

jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install k6
        uses: grafana/setup-k6-action@v1
      - name: Deploy to staging
        run: ./deploy-staging.sh
      - name: Run load test
        run: k6 run --out json=results.json load-test.js
      - name: Check thresholds
        run: |
          # Fail if p95 > 500ms or errors > 1%
          python check-results.py results.json
The Bottom Line
Load testing is not optional. It's how you find problems before users do.
Test realistic scenarios - user journeys, not just homepage hits.
Find your breaking point - gradually increase load until something breaks. Fix it.
Monitor everything - CPU, memory, database connections, error logs. Find the bottleneck.
Test before launch - especially before Black Friday, product launches, marketing campaigns.
We lost $47,000 in 3 minutes because we didn't load test. The database pool maxed out at 100 connections. Five minutes of testing would have found it.
Load test your application today. Start with k6 or Artillery. Simulate real user behavior. Find your breaking point. Fix it before your users find it.