
Docker in Production: Handling Startup Failures


You've containerized your app. Compose file looks clean. Works perfectly on your machine.

Then you deploy to production and one of two things happens:

  1. The new image fails to pull. Your container crashes. The old version is gone. The service is down.
  2. The app starts before the database is ready. It crashes with a connection error. It doesn't come back.

These aren't edge cases. They happen to every team deploying Docker in production, usually at the worst possible time. This post gives you the concrete patterns to handle both.


Problem 1: Image Pull Failures on Deploy

Why It Happens

When you run docker compose up or docker pull, Docker contacts the registry to download the image. This can fail for several reasons:

  • Registry is down — Docker Hub has outages; private registries have network issues
  • Authentication expired — registry credentials rotated, token timed out
  • Image tag doesn't exist — typo in tag, image wasn't pushed before deploy triggered
  • Network timeout — slow connection, large image, corporate proxy issue
  • Rate limiting — Docker Hub free tier limits unauthenticated pulls
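
Several of these failures can be caught before deploy by asking the registry whether the tag exists at all, without downloading anything. A minimal sketch in Python — the `image_exists` helper is illustrative, not from any particular tool; it shells out to `docker manifest inspect`, which queries the registry without pulling layers:

```python
import subprocess

def image_exists(image: str, runner=subprocess.run) -> bool:
    """Return True if the registry reports a manifest for `image`.

    `docker manifest inspect` talks to the registry without pulling
    layers, so missing tags and auth failures surface cheaply. The
    `runner` parameter is injectable so the logic can be tested
    without a Docker daemon.
    """
    result = runner(["docker", "manifest", "inspect", image],
                    capture_output=True)
    return result.returncode == 0
```

Run a check like this in CI before the deploy step: if it fails, the pipeline stops while the old containers are still happily serving traffic.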

The dangerous scenario is when you've already stopped the old container before the new image finishes pulling.

The Wrong Way to Deploy

# ❌ DANGEROUS — stops old container first, then tries to pull new one
docker compose down
docker compose pull
docker compose up -d

If docker compose pull fails here, you have no running service. You've turned a deploy failure into an outage.

Strategy 1: Pull Before Stop

Always pull the new image first, verify it exists locally, then swap:

# ✅ Pull new image BEFORE stopping the old container
docker compose pull
 
# Only stop and restart if pull succeeded
if [ $? -eq 0 ]; then
  docker compose up -d
else
  echo "Pull failed — keeping current containers running"
  exit 1
fi

Strategy 2: Use --no-deps and Roll Forward Carefully

# Pull the new image (old containers still running)
docker pull myapp:v2.0.0
 
# Tag it explicitly — never deploy with :latest in production
docker tag myapp:v2.0.0 myapp:current
 
# Recreate only the service with the new image
docker compose up -d --no-deps myapp

Strategy 3: Keep the Previous Image as Fallback

Pin image versions so you can roll back instantly:

# docker-compose.yml
services:
  api:
    image: myregistry.io/myapp:${APP_VERSION}  # never use :latest
# deploy.sh
# Record the image the running container (service "api") was started from
PREVIOUS_VERSION=$(docker inspect --format='{{.Config.Image}}' "$(docker compose ps -q api)" 2>/dev/null)
NEW_VERSION="myregistry.io/myapp:${APP_VERSION}"
 
echo "Pulling $NEW_VERSION..."
docker pull "$NEW_VERSION"
 
if [ $? -ne 0 ]; then
  echo "❌ Pull failed. Staying on $PREVIOUS_VERSION"
  exit 1
fi
 
echo "✅ Pull succeeded. Deploying..."
APP_VERSION=$APP_VERSION docker compose up -d
 
echo "Previous version for rollback: $PREVIOUS_VERSION"

Strategy 4: Configure Pull Retry in Docker Compose

Docker Compose doesn't natively retry pulls, but you can tune the Docker daemon to make pulls more resilient:

// /etc/docker/daemon.json
{
  "max-concurrent-downloads": 3,
  "registry-mirrors": ["https://your-mirror.example.com"]
}

For CI/CD pipelines, add explicit retry logic:

# Retry pull up to 3 times with increasing backoff
pull_with_retry() {
  local image=$1
  local max_attempts=3
  local attempt=1
 
  while [ $attempt -le $max_attempts ]; do
    echo "Attempt $attempt/$max_attempts: pulling $image"
    docker pull "$image" && return 0
 
    wait_time=$((attempt * 15))
    echo "Pull failed. Waiting ${wait_time}s before retry..."
    sleep $wait_time
    attempt=$((attempt + 1))
  done
 
  echo "❌ All pull attempts failed for $image"
  return 1
}
 
pull_with_retry "myregistry.io/myapp:${APP_VERSION}"
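
Note that the schedule above is a linear ramp (15s, then 30s). If you want genuinely exponential backoff with a cap, the delays are easy to precompute; a small sketch in Python (the numbers are illustrative, not a recommendation):

```python
def backoff_schedule(max_attempts: int, base: float = 15.0,
                     cap: float = 120.0) -> list[float]:
    """Exponential backoff delays: base * 2**n, capped at `cap`.

    One delay per failed attempt except the last — after the final
    failure there is nothing left to wait for.
    """
    return [min(base * (2 ** n), cap) for n in range(max_attempts - 1)]

print(backoff_schedule(4))  # [15.0, 30.0, 60.0]
```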

Strategy 5: Use a Local Registry Mirror

For production servers, run a local registry mirror so pulls don't depend on external network:

# registry-mirror/docker-compose.yml
services:
  registry:
    image: registry:2
    ports:
      - "5000:5000"
    environment:
      REGISTRY_PROXY_REMOTEURL: https://registry-1.docker.io
    volumes:
      - registry-data:/var/lib/registry
 
volumes:
  registry-data:

Configure Docker daemon to use it:

// /etc/docker/daemon.json
{
  "registry-mirrors": ["http://localhost:5000"]
}

Now docker pull nginx hits your local mirror first. If the mirror has the image cached, external registry outages don't affect you.

Strategy 6: Pre-pull Images in CI Before Deploy

The safest approach is to verify the image exists and is pullable in your CI pipeline, before the deploy step even starts:

# GitHub Actions example
jobs:
  verify-image:
    runs-on: ubuntu-latest
    steps:
      - name: Pull and verify image
        run: |
          docker pull myregistry.io/myapp:${{ github.sha }}
          echo "Image verified ✅"
 
  deploy:
    needs: verify-image
    runs-on: ubuntu-latest
    steps:
      - name: Deploy
        run: |
          ssh deploy@server "APP_VERSION=${{ github.sha }} ./deploy.sh"

This way, if the image is missing or the registry is down, the pipeline fails before any production containers are touched.


Problem 2: Database Connection Loss on Startup

Why It Happens

Containers start fast. Databases start slow. When you run docker compose up, your application container may be ready to accept connections within 1-2 seconds — but PostgreSQL, MySQL, or MongoDB might take 10-30 seconds to finish initializing.

Without explicit coordination, this is what happens:

t=0s   → docker compose up
t=1s   → api container starts, app code runs
t=1s   → app tries to connect to db: "Connection refused"
t=1s   → app crashes (exit code 1)
t=2s   → Docker restarts api (restart: always)
t=3s   → app tries to connect again: "Connection refused"
t=3s   → app crashes again
...
t=15s  → db finally ready
t=20s  → api eventually connects on retry N

This means your app takes 20+ seconds to become healthy, and the logs fill with connection errors during startup. Without restart: always, the app simply stays down.

The Root Cause

depends_on in Docker Compose only waits for the container to start, not for the service inside to be ready:

# ❌ This does NOT wait for postgres to be ready
services:
  api:
    depends_on:
      - db
  db:
    image: postgres:16

The api container starts as soon as the db container process starts — even if Postgres hasn't finished loading.

Solution 1: Healthcheck + depends_on condition (Best for Compose)

Add a healthcheck to the database service, then wait for it with condition: service_healthy:

services:
  api:
    build: .
    depends_on:
      db:
        condition: service_healthy  # wait until db passes healthcheck
      cache:
        condition: service_healthy
 
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d myapp"]
      interval: 5s      # check every 5 seconds
      timeout: 5s       # fail if no response in 5s
      retries: 10       # mark unhealthy after 10 consecutive failures
      start_period: 10s # don't count failures during first 10s of startup
 
  cache:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

With this config, Docker Compose won't start api until both db and cache pass their healthchecks. No retry logic needed in your app for the startup race condition.
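
When tuning these numbers, keep in mind that they bound how long Compose may hold api back: roughly start_period plus interval × retries before db would be marked unhealthy. A back-of-envelope helper, assuming that simplification (per-probe timeout adds a little on top):

```python
def worst_case_unhealthy_seconds(start_period: float, interval: float,
                                 retries: int) -> float:
    """Rough upper bound before a container is marked unhealthy:
    failures during start_period don't count, then `retries`
    consecutive failed checks spaced `interval` apart.
    (Simplified — per-probe timeout adds a bit on top.)"""
    return start_period + interval * retries

# db settings above: start_period=10s, interval=5s, retries=10
print(worst_case_unhealthy_seconds(10, 5, 10))  # 60
```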

Healthcheck commands for common databases:

# PostgreSQL
pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}
 
# MySQL / MariaDB
mysqladmin ping -h localhost -u root -p${MYSQL_ROOT_PASSWORD}
 
# MongoDB
mongosh --eval "db.adminCommand('ping')"
 
# Redis
redis-cli ping
 
# Elasticsearch
curl -f http://localhost:9200/_cluster/health?wait_for_status=yellow

Solution 2: Retry Logic in Your Application

Healthchecks handle the startup race, but you still need retry logic in your app for two reasons:

  1. The database can go down while the app is running (restart, failover, network blip)
  2. Compose-style healthchecks aren't available in every environment (Kubernetes, bare metal)

Node.js with Prisma:

import { PrismaClient } from '@prisma/client'
 
const prisma = new PrismaClient()
 
async function connectWithRetry(maxAttempts = 10, delayMs = 3000): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await prisma.$connect()
      console.log('✅ Database connected')
      return
    } catch (error) {
      console.error(`❌ DB connection attempt ${attempt}/${maxAttempts} failed:`, (error as Error).message)
 
      if (attempt === maxAttempts) {
        throw new Error(`Failed to connect to database after ${maxAttempts} attempts`)
      }
 
      const backoff = Math.min(delayMs * attempt, 30000) // cap at 30s
      console.log(`Retrying in ${backoff / 1000}s...`)
      await new Promise(resolve => setTimeout(resolve, backoff))
    }
  }
}
 
// On startup
await connectWithRetry()

Python with SQLAlchemy:

import os
import time
import logging
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError
 
logger = logging.getLogger(__name__)
 
def connect_with_retry(database_url: str, max_attempts: int = 10, delay: float = 3.0):
    engine = create_engine(database_url, pool_pre_ping=True)
 
    for attempt in range(1, max_attempts + 1):
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))
            logger.info("✅ Database connected")
            return engine
        except OperationalError as e:
            logger.error(f"❌ DB attempt {attempt}/{max_attempts}: {e}")
 
            if attempt == max_attempts:
                raise
 
            backoff = min(delay * attempt, 30)
            logger.info(f"Retrying in {backoff:.0f}s...")
            time.sleep(backoff)
 
engine = connect_with_retry(os.environ["DATABASE_URL"])

Go:

package db
 
import (
    "database/sql"
    "fmt"
    "log"
    "math"
    "time"
    _ "github.com/lib/pq"
)
 
func ConnectWithRetry(dsn string, maxAttempts int) (*sql.DB, error) {
    var db *sql.DB
    var err error
 
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        db, err = sql.Open("postgres", dsn)
        if err == nil {
            err = db.Ping()
        }
 
        if err == nil {
            log.Println("✅ Database connected")
            return db, nil
        }
 
        log.Printf("❌ DB attempt %d/%d: %v", attempt, maxAttempts, err)
 
        if attempt == maxAttempts {
            break
        }
 
        backoff := math.Min(float64(attempt)*3, 30)
        log.Printf("Retrying in %.0fs...", backoff)
        time.Sleep(time.Duration(backoff) * time.Second)
    }
 
    return nil, fmt.Errorf("failed to connect after %d attempts: %w", maxAttempts, err)
}

Solution 3: Use pool_pre_ping / Connection Validation

Configure your connection pool to test connections before using them. This handles not just startup, but also stale connections after a DB restart:

# SQLAlchemy — validates connection before each use
engine = create_engine(
    DATABASE_URL,
    pool_pre_ping=True,       # test connection before use
    pool_recycle=3600,        # recycle connections every hour
    pool_size=10,
    max_overflow=20,
)

// Prisma — reconnects automatically
const prisma = new PrismaClient({
  datasources: {
    db: { url: process.env.DATABASE_URL }
  }
})
// Prisma handles reconnection internally

// database/sql (Go) — set connection lifetime
db.SetConnMaxLifetime(time.Hour)
db.SetConnMaxIdleTime(30 * time.Minute)
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(10)
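
Conceptually, all three boil down to validate-before-handout: the pool tests a connection when you check it out and silently replaces it if the test fails. A toy illustration of that idea in Python (not SQLAlchemy's actual implementation):

```python
class ToyPool:
    """Toy pool showing the pre-ping idea: validate on checkout,
    discard stale connections, create fresh ones as needed."""

    def __init__(self, factory):
        self._factory = factory   # callable creating a fresh connection
        self._idle = []

    def checkout(self):
        while self._idle:
            conn = self._idle.pop()
            if conn.ping():       # stale? drop it and keep looking
                return conn
        return self._factory()    # nothing healthy idle — make a new one

    def checkin(self, conn):
        self._idle.append(conn)
```

The app never sees the stale connection; the cost is one tiny round-trip per checkout, which is why pre-ping is cheap insurance rather than a performance problem.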

Solution 4: Kubernetes Readiness Probes

In Kubernetes, you don't rely on depends_on — you use readiness probes. A pod is only added to the Service's endpoint list when its readiness probe passes:

# kubernetes deployment
spec:
  containers:
    - name: api
      image: myapp:v1.0.0
      readinessProbe:
        httpGet:
          path: /health/ready   # returns 200 only when DB is connected
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 6
 
      livenessProbe:
        httpGet:
          path: /health/live    # returns 200 if process is alive
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10

Your /health/ready endpoint should check the actual database connection:

// Express health endpoint
app.get('/health/ready', async (req, res) => {
  try {
    await prisma.$queryRaw`SELECT 1`
    res.json({ status: 'ready', db: 'connected' })
  } catch (error) {
    res.status(503).json({ status: 'not ready', db: 'disconnected' })
  }
})
 
app.get('/health/live', (req, res) => {
  res.json({ status: 'alive' })
})

The liveness probe (/health/live) only checks if the process is alive — it should always return 200 as long as the process is running. The readiness probe (/health/ready) checks if the app can serve traffic — it returns 503 when DB is unavailable. This prevents Kubernetes from routing traffic to the pod while DB is down, without killing and restarting the pod unnecessarily.
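
With the probe numbers above, you can estimate how quickly a pod that loses its database stops receiving traffic, and how quickly it comes back. A back-of-envelope helper, assuming simple multiplication is close enough (probe timeout and endpoint propagation add a little; successThreshold isn't set in the manifest above, and 1 is the Kubernetes default):

```python
def readiness_timing(period_seconds: float, failure_threshold: int,
                     success_threshold: int = 1) -> dict:
    """Approximate readiness-probe reaction times."""
    return {
        # DB drops → pod marked NotReady, removed from the Service
        "seconds_until_unready": period_seconds * failure_threshold,
        # DB returns → pod marked Ready again
        "seconds_until_ready_again": period_seconds * success_threshold,
    }

# With the manifest above: periodSeconds=5, failureThreshold=6
print(readiness_timing(5, 6))
```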


Putting It All Together

Here's a production-grade docker-compose.yml that handles both failure modes:

services:
  api:
    image: myregistry.io/myapp:${APP_VERSION:?APP_VERSION is required}
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_healthy
    environment:
      DATABASE_URL: postgresql://app:${DB_PASSWORD}@db:5432/myapp
      REDIS_URL: redis://cache:6379
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health/live"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 15s
    ports:
      - "127.0.0.1:8080:8080"
 
  db:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: ${DB_PASSWORD:?DB_PASSWORD is required}
      POSTGRES_DB: myapp
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d myapp"]
      interval: 5s
      timeout: 5s
      retries: 10
      start_period: 10s
 
  cache:
    image: redis:7-alpine
    restart: unless-stopped
    command: redis-server --appendonly yes
    volumes:
      - redisdata:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
 
volumes:
  pgdata:
  redisdata:

And a deploy script that handles image pull failures:

#!/bin/bash
# deploy.sh
set -e
 
APP_VERSION=${1:?Usage: ./deploy.sh <version>}
IMAGE="myregistry.io/myapp:${APP_VERSION}"
 
echo "🚀 Deploying $IMAGE"
 
# Step 1: Pull new image BEFORE touching running containers
echo "📦 Pulling image..."
MAX_RETRIES=3
for i in $(seq 1 $MAX_RETRIES); do
  docker pull "$IMAGE" && break
 
  if [ $i -eq $MAX_RETRIES ]; then
    echo "❌ Failed to pull image after $MAX_RETRIES attempts. Aborting."
    exit 1
  fi
 
  echo "Retry $i/$MAX_RETRIES in 15s..."
  sleep 15
done
 
echo "✅ Image pulled successfully"
 
# Step 2: Deploy (old containers still running until here)
echo "🔄 Updating service..."
APP_VERSION=$APP_VERSION docker compose up -d --no-deps api
 
echo "⏳ Waiting for api to become healthy..."
timeout 60 bash -c 'until docker compose ps api | grep -q "(healthy)"; do sleep 2; done'
 
echo "✅ Deployment complete"

Quick Reference

Image pull failures:

  • Registry down — pull before stopping old containers; local mirror
  • Auth expired — rotate credentials in CI/CD secrets; verify before deploy
  • Tag missing — pin versions explicitly; verify tag in CI before deploy
  • Network timeout — retry with backoff; local mirror; pre-pull in CI
  • Rate limiting — authenticate with Docker Hub; use a private registry

DB connection failures:

  • DB not ready at startup — healthcheck + condition: service_healthy
  • DB restarts while app runs — pool_pre_ping=True; retry logic in app
  • DB port not yet open — retry with exponential backoff on startup
  • Kubernetes deployment — readiness probe checking /health/ready
  • DB failover/replica switch — connection pool with reconnect; circuit breaker

Summary and Key Takeaways

✅ Always pull the new image before stopping running containers — never the other way around
✅ Pin image versions explicitly in production — :latest makes rollback impossible
✅ Add retry with exponential backoff to pull steps in CI/CD pipelines
✅ A local registry mirror eliminates external registry outages as a deploy risk
✅ depends_on alone does NOT wait for the database to be ready — add a healthcheck
✅ condition: service_healthy in Compose + pg_isready in the healthcheck is the correct pattern
✅ Add startup retry logic in your app as defense-in-depth — healthchecks aren't always available
✅ Use pool_pre_ping=True (or equivalent) so stale connections are detected before use
✅ In Kubernetes: readiness probe on /health/ready (checks DB) + liveness on /health/live (checks process)
✅ The APP_VERSION:? syntax in Compose makes missing env vars a hard error before deploy starts


Related:
Docker Fundamentals — containers, images, core concepts
Docker Compose & Multi-Container Apps — Compose deep dive
Docker Networking & Volumes — networking and storage
Docker & Kubernetes Roadmap — full learning path


Have a production Docker failure story? Feel free to reach out or leave a comment!
