Graceful Shutdown: Why Your App Crashes on Deploy

TL;DR

Graceful shutdown drains in-flight requests before stopping. Catch SIGTERM, stop accepting new connections, wait for active requests to finish, then exit. Without it, every deploy drops requests and breaks transactions.

Our deploys were dropping requests. Load balancer logs showed 502 errors every time we pushed code. Users would see errors mid-checkout. I thought it was a load balancer issue. It was us: our app killed itself the moment it received a shutdown signal instead of finishing what it was doing.

Here's how to fix it.

What Graceful Shutdown Actually Does

When your process manager (systemd, Kubernetes, Docker) wants to stop your app, it sends SIGTERM. Without graceful shutdown, your app dies immediately:

# What happens without graceful shutdown
User A: POST /checkout (50ms into a 200ms request)
↓
Deploy happens → SIGTERM sent → Process killed immediately
↓
User A: Connection reset error
Database: Transaction left open
Payment service: Charge deducted, order not created

With graceful shutdown:

# With graceful shutdown
User A: POST /checkout (50ms into a 200ms request)
↓
Deploy happens → SIGTERM received → Stop accepting NEW connections
↓
User A: Request continues and completes normally
↓
No more in-flight requests → Process exits cleanly

The difference is: finish what you started, then quit.

Node.js: Graceful Shutdown

// BAD - default behavior, no graceful shutdown
const app = express();
const server = app.listen(3000);

// Process dies immediately on SIGTERM
// All in-flight requests are dropped

// GOOD - graceful shutdown
const app = express();
const server = app.listen(3000, () => {
    console.log('Server running on port 3000');
});

let isShuttingDown = false;

// Track active connections
const connections = new Set();
server.on('connection', (conn) => {
    connections.add(conn);
    conn.on('close', () => connections.delete(conn));
});

function shutdown(signal) {
    console.log(`Received ${signal}, starting graceful shutdown`);
    isShuttingDown = true;

    // Stop accepting new connections; the callback fires once
    // every existing connection has closed
    server.close(() => {
        console.log('HTTP server closed');
        process.exit(0);
    });

    // Close idle keep-alive connections so server.close() can complete.
    // Don't call socket.end() on everything: that half-closes the
    // response side and truncates active requests. Node 18.2+ has a
    // built-in that only touches idle sockets:
    if (typeof server.closeIdleConnections === 'function') {
        server.closeIdleConnections();
    }

    // Force exit after timeout, destroying anything still open
    setTimeout(() => {
        console.error('Forcing shutdown after timeout');
        for (const conn of connections) conn.destroy();
        process.exit(1);
    }, 30000).unref();
}

process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));

// Reject new requests during shutdown (register this before your routes)
app.use((req, res, next) => {
    if (isShuttingDown) {
        res.set('Connection', 'close');
        return res.status(503).json({ error: 'Server shutting down' });
    }
    next();
});

With async routes

// Track in-flight async operations
let activeRequests = 0;

app.use((req, res, next) => {
    activeRequests++;
    // 'close' fires exactly once per response, finished or aborted.
    // Listening on both 'finish' and 'close' would decrement twice
    // on modern Node and drive the counter negative.
    res.on('close', () => activeRequests--);
    next();
});

async function shutdown(signal) {
    console.log(`${signal} received, shutting down gracefully`);

    // Stop accepting new connections
    server.close();

    // Wait for in-flight requests (poll every 100ms, bail out after 30s
    // so a hung request can't block the deploy forever)
    const deadline = Date.now() + 30000;
    while (activeRequests > 0 && Date.now() < deadline) {
        console.log(`Waiting for ${activeRequests} requests to finish...`);
        await new Promise(resolve => setTimeout(resolve, 100));
    }

    // Cleanup resources
    await database.disconnect();
    await redisClient.quit();

    console.log('Shutdown complete');
    process.exit(0);
}

Go: Graceful Shutdown

Go's standard library makes this straightforward:

// BAD - no graceful shutdown
func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}

// GOOD - graceful shutdown
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", handler)

    server := &http.Server{
        Addr:    ":8080",
        Handler: mux,
    }

    // Start server in goroutine
    go func() {
        log.Println("Server starting on :8080")
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatalf("Server error: %v", err)
        }
    }()

    // Wait for interrupt signal
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
    <-quit

    log.Println("Shutting down...")

    // Give server 30 seconds to finish in-flight requests
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := server.Shutdown(ctx); err != nil {
        log.Fatalf("Forced shutdown: %v", err)
    }

    log.Println("Server stopped")
}

Go's server.Shutdown() stops accepting new connections and waits for active ones to complete. Clean and built-in.

Cleanup with context cancellation

func main() {
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    server := &http.Server{Addr: ":8080", Handler: mux}

    // Start server
    go func() {
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatalf("Server error: %v", err)
        }
    }()

    // Wait for signal
    <-ctx.Done()
    stop() // Reset signal handling

    // Shutdown with timeout
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    server.Shutdown(shutdownCtx)
    db.Close()
    cache.Close()
}

Python: Graceful Shutdown

Gunicorn drains in-flight requests on SIGTERM by itself (bounded by its --graceful-timeout setting, 30 seconds by default); your app only needs cleanup hooks:

# GOOD - with gunicorn (production WSGI server)
# gunicorn.conf.py
def worker_exit(server, worker):
    """Called when a worker exits"""
    db.disconnect()
    cache.disconnect()
    server.log.info("Worker cleanup complete")

# Start with:
# gunicorn -c gunicorn.conf.py -w 4 -b 0.0.0.0:5000 app:app

For FastAPI with uvicorn:

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    await database.connect()
    print("Database connected")

    yield  # App runs here

    # Shutdown (triggered by SIGTERM)
    await database.disconnect()
    print("Database disconnected, shutdown complete")

app = FastAPI(lifespan=lifespan)

Uvicorn's --timeout-graceful-shutdown flag controls how long it waits for in-flight requests.

Kubernetes Adds Complexity

In Kubernetes, shutdown is a multi-step process:

# pod spec
spec:
  terminationGracePeriodSeconds: 60  # Default: 30 seconds
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          # Sleep gives load balancer time to stop routing traffic
          command: ["sleep", "10"]

The sequence when Kubernetes terminates a pod:

1. Pod marked as Terminating
2. In parallel: the preStop hook runs (sleep 10) while the pod is removed from Service endpoints and load balancers stop routing NEW requests (propagation can take 5-10 seconds)
3. After preStop completes: SIGTERM sent to container
4. Your graceful shutdown drains in-flight requests
5. After terminationGracePeriodSeconds: SIGKILL

The preStop sleep is critical. Without it, Kubernetes sends SIGTERM before the load balancer stops routing traffic, so requests keep arriving during shutdown and get dropped.

Common Mistakes

Mistake 1: Timeout that's too long

// BAD - 10 minute timeout blocks deployments
setTimeout(() => process.exit(1), 600000);

// GOOD - 30 seconds is usually plenty
setTimeout(() => {
    console.error('Shutdown timeout exceeded, forcing exit');
    process.exit(1);
}, 30000).unref();

Mistake 2: Not closing database connections

// BAD - connections left open, DB thinks clients still connected
process.on('SIGTERM', () => {
    server.close();
    process.exit(0);
});

// GOOD - cleanup in order
process.on('SIGTERM', async () => {
    // Wait for in-flight requests before touching the connections they use
    await new Promise(resolve => server.close(resolve));
    await db.pool.end();      // Wait for queries to finish
    await redis.quit();       // Close Redis connection
    process.exit(0);
});

Mistake 3: Ignoring SIGINT in development

// Add SIGINT (Ctrl+C) handling too
process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));   // Ctrl+C in dev

Testing Graceful Shutdown

# Start your app
./myapp &
APP_PID=$!

# Send some requests
for i in {1..100}; do
    curl -s http://localhost:8080/slow-endpoint &
done

# Immediately send SIGTERM
kill -TERM $APP_PID

# Check if requests completed or got errors
wait
echo "Done - check for connection reset errors above"

The Bottom Line

Graceful shutdown is table stakes for production applications. Without it, every deploy drops requests, breaks transactions, and frustrates users.

Key points:

  • Catch SIGTERM and SIGINT
  • Stop accepting new connections immediately
  • Wait for in-flight requests to finish (with a timeout)
  • Close database connections and cleanup resources
  • In Kubernetes, add a preStop sleep to account for load balancer lag

The implementation takes 20 lines. The alternative is explaining to your users why checkout fails every time you push code.