Docker in Production: Handling Startup Failures

You've containerized your app. Compose file looks clean. Works perfectly on your machine.
Then you deploy to production and one of two things happens:
- The new image fails to pull. Your container crashes. The old version is gone. The service is down.
- The app starts before the database is ready. It crashes with a connection error. It doesn't come back.
These aren't edge cases. They happen to every team deploying Docker in production, usually at the worst possible time. This post gives you the concrete patterns to handle both.
Problem 1: Image Pull Failures on Deploy
Why It Happens
When you run docker compose up or docker pull, Docker contacts the registry to download the image. This can fail for several reasons:
- Registry is down — Docker Hub has outages; private registries have network issues
- Authentication expired — registry credentials rotated, token timed out
- Image tag doesn't exist — typo in tag, image wasn't pushed before deploy triggered
- Network timeout — slow connection, large image, corporate proxy issue
- Rate limiting — Docker Hub free tier limits unauthenticated pulls
The dangerous scenario is when you've already stopped the old container before the new image finishes pulling.
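A cheap guard against that scenario is to confirm the new image is actually present on the host before anything gets stopped. A minimal sketch, with a placeholder image name:

```bash
# Sketch: refuse to swap containers unless the image already exists locally
IMAGE="myregistry.io/myapp:v2.0.0"   # placeholder tag

if docker image inspect "$IMAGE" > /dev/null 2>&1; then
  echo "✅ $IMAGE is present locally, safe to swap containers"
else
  echo "❌ $IMAGE is not on this host. Pull it first and keep the old containers running."
  exit 1
fi
```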
The Wrong Way to Deploy
# ❌ DANGEROUS — stops old container first, then tries to pull new one
docker compose down
docker compose pull
docker compose up -d

If docker compose pull fails here, you have no running service. You've turned a deploy failure into an outage.
Strategy 1: Pull Before Stop
Always pull the new image first, verify it exists locally, then swap:
# ✅ Pull new image BEFORE stopping the old container
docker compose pull
# Only stop and restart if pull succeeded
if [ $? -eq 0 ]; then
docker compose up -d
else
echo "Pull failed — keeping current containers running"
exit 1
fi

Strategy 2: Use --no-deps and Roll Forward Carefully
# Pull the new image (old containers still running)
docker pull myapp:v2.0.0
# Tag it explicitly — never deploy with :latest in production
docker tag myapp:v2.0.0 myapp:current
# Recreate only the service with the new image
docker compose up -d --no-deps myapp

Strategy 3: Keep the Previous Image as Fallback
Pin image versions so you can roll back instantly:
# docker-compose.yml
services:
api:
image: myregistry.io/myapp:${APP_VERSION} # never use :latest

# deploy.sh
# Record the image used by the currently running api container (for rollback)
PREVIOUS_VERSION=$(docker inspect --format='{{.Config.Image}}' "$(docker compose ps -q api)" 2>/dev/null)
NEW_VERSION="myregistry.io/myapp:${APP_VERSION}"
echo "Pulling $NEW_VERSION..."
docker pull "$NEW_VERSION"
if [ $? -ne 0 ]; then
echo "❌ Pull failed. Staying on $PREVIOUS_VERSION"
exit 1
fi
echo "✅ Pull succeeded. Deploying..."
APP_VERSION=$APP_VERSION docker compose up -d
echo "Previous version for rollback: $PREVIOUS_VERSION"Strategy 4: Configure Pull Retry in Docker Compose
Docker Compose doesn't natively retry pulls, but you can make pulls more resilient at the daemon level by limiting concurrent downloads and pointing them at a registry mirror:
// /etc/docker/daemon.json
{
"max-concurrent-downloads": 3,
"registry-mirrors": ["https://your-mirror.example.com"]
}

For CI/CD pipelines, add explicit retry logic:
# Retry pull up to 3 times with increasing backoff
pull_with_retry() {
local image=$1
local max_attempts=3
local attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Attempt $attempt/$max_attempts: pulling $image"
docker pull "$image" && return 0
wait_time=$((attempt * 15))
echo "Pull failed. Waiting ${wait_time}s before retry..."
sleep $wait_time
attempt=$((attempt + 1))
done
echo "❌ All pull attempts failed for $image"
return 1
}
pull_with_retry "myregistry.io/myapp:${APP_VERSION}"

Strategy 5: Use a Local Registry Mirror
For production servers, run a local registry mirror so pulls don't depend on external network:
# registry-mirror/docker-compose.yml
services:
registry:
image: registry:2
ports:
- "5000:5000"
environment:
REGISTRY_PROXY_REMOTEURL: https://registry-1.docker.io
volumes:
- registry-data:/var/lib/registry
volumes:
registry-data:

Configure the Docker daemon to use it:
// /etc/docker/daemon.json
{
"registry-mirrors": ["http://localhost:5000"]
}

Now docker pull nginx hits your local mirror first. If the mirror has the image cached, external registry outages don't affect you.
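To confirm the daemon actually picked up the mirror, and to see what it has cached so far, you can check both sides. A quick sketch, assuming the mirror from the Compose file above is listening on localhost:5000:

```bash
# Restart the daemon after editing daemon.json, then verify the mirror is registered
sudo systemctl restart docker
docker info | grep -A1 "Registry Mirrors"

# The registry v2 API lists repositories the mirror has cached so far
curl -s http://localhost:5000/v2/_catalog
```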
Strategy 6: Pre-pull Images in CI Before Deploy
The safest approach is to verify the image exists and is pullable in your CI pipeline, before the deploy step even starts:
# GitHub Actions example
jobs:
verify-image:
runs-on: ubuntu-latest
steps:
- name: Pull and verify image
run: |
docker pull myregistry.io/myapp:${{ github.sha }}
echo "Image verified ✅"
deploy:
needs: verify-image
runs-on: ubuntu-latest
steps:
- name: Deploy
run: |
ssh deploy@server "APP_VERSION=${{ github.sha }} ./deploy.sh"

This way, if the image is missing or the registry is down, the pipeline fails before any production containers are touched.
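If a full pull in CI feels heavy, you can also verify the tag exists without downloading any layers. A sketch using docker manifest inspect; it needs registry credentials for private registries, and the image name and GITHUB_SHA variable here are just examples:

```bash
# Fails fast if the tag was never pushed; no layers are downloaded
if docker manifest inspect "myregistry.io/myapp:${GITHUB_SHA}" > /dev/null; then
  echo "✅ Tag exists in the registry"
else
  echo "❌ Tag not found, refusing to deploy"
  exit 1
fi
```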
Problem 2: Database Connection Loss on Startup
Why It Happens
Containers start fast. Databases start slow. When you run docker compose up, your application container may be ready to accept connections within 1-2 seconds — but PostgreSQL, MySQL, or MongoDB might take 10-30 seconds to finish initializing.
Without explicit coordination, this is what happens:
t=0s → docker compose up
t=1s → api container starts, app code runs
t=1s → app tries to connect to db: "Connection refused"
t=1s → app crashes (exit code 1)
t=2s → Docker restarts api (restart: always)
t=3s → app tries to connect again: "Connection refused"
t=3s → app crashes again
...
t=15s → db finally ready
t=20s → api eventually connects on retry N

This means your app takes 20+ seconds to become healthy, and the logs fill up with connection errors during startup. In production without restart: always, the app just stays down.
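If you suspect this loop on a live host, the restart count and last exit code confirm it quickly. A small diagnostic sketch; the container and service names are placeholders:

```bash
# How many times Docker has restarted the container, and its last exit code
docker inspect -f 'restarts={{.RestartCount}} exit_code={{.State.ExitCode}}' myapp-api-1

# The most recent log lines typically show the connection errors piling up
docker compose logs --tail 20 api
```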
The Root Cause
depends_on in Docker Compose only waits for the container to start, not for the service inside to be ready:
# ❌ This does NOT wait for postgres to be ready
services:
api:
depends_on:
- db
db:
image: postgres:16

The api container starts as soon as the db container process starts — even if Postgres hasn't finished loading.
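A common stopgap, when you can't change the Compose file, is a wait loop in the app container's entrypoint. A rough sketch, assuming nc (netcat) is installed in the image and the database is reachable at db:5432; the start command is a placeholder:

```bash
#!/bin/sh
# entrypoint.sh: block until the db port accepts TCP connections, then start the app
until nc -z db 5432; do
  echo "Waiting for db:5432..."
  sleep 1
done
exec node server.js   # placeholder start command
```

An open port only proves the TCP listener is up; Postgres can accept connections slightly before it's ready to serve queries, which is why the healthcheck approach in Solution 1 below is the better default.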
Solution 1: Healthcheck + depends_on condition (Best for Compose)
Add a healthcheck to the database service, then wait for it with condition: service_healthy:
services:
api:
build: .
depends_on:
db:
condition: service_healthy # wait until db passes healthcheck
cache:
condition: service_healthy
db:
image: postgres:16-alpine
environment:
POSTGRES_USER: app
POSTGRES_PASSWORD: secret
POSTGRES_DB: myapp
healthcheck:
test: ["CMD-SHELL", "pg_isready -U app -d myapp"]
interval: 5s # check every 5 seconds
timeout: 5s # fail if no response in 5s
retries: 10 # mark unhealthy after 10 consecutive failures
start_period: 10s # don't count failures during first 10s of startup
cache:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 5

With this config, Docker Compose won't start api until both db and cache pass their healthchecks. No retry logic needed in your app for the startup race condition.
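You can watch the same health states Compose waits on from the command line, which helps when a service seems stuck in the starting state. The container name here is a placeholder:

```bash
# Current health state: starting, healthy, or unhealthy
docker inspect -f '{{.State.Health.Status}}' myapp-db-1

# Compose shows health in its status output too, e.g. "Up 30 seconds (healthy)"
docker compose ps
```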
Healthcheck commands for common databases:
# PostgreSQL
pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}
# MySQL / MariaDB
mysqladmin ping -h localhost -u root -p${MYSQL_ROOT_PASSWORD}
# MongoDB
mongosh --eval "db.adminCommand('ping')"
# Redis
redis-cli ping
# Elasticsearch
curl -f http://localhost:9200/_cluster/health?wait_for_status=yellow

Solution 2: Retry Logic in Your Application
Healthchecks handle the startup race, but you still need retry logic in your app for two reasons:
- The database can go down while the app is running (restart, failover, network blip)
- Compose-style healthcheck gating (depends_on with condition: service_healthy) isn't available in every environment (Kubernetes, bare metal)
Node.js with Prisma:
import { PrismaClient } from '@prisma/client'
const prisma = new PrismaClient()
async function connectWithRetry(maxAttempts = 10, delayMs = 3000): Promise<void> {
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
await prisma.$connect()
console.log('✅ Database connected')
return
} catch (error) {
console.error(`❌ DB connection attempt ${attempt}/${maxAttempts} failed:`, error.message)
if (attempt === maxAttempts) {
throw new Error(`Failed to connect to database after ${maxAttempts} attempts`)
}
const backoff = Math.min(delayMs * attempt, 30000) // cap at 30s
console.log(`Retrying in ${backoff / 1000}s...`)
await new Promise(resolve => setTimeout(resolve, backoff))
}
}
}
// On startup
await connectWithRetry()

Python with SQLAlchemy:
import os
import time
import logging
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError
logger = logging.getLogger(__name__)
def connect_with_retry(database_url: str, max_attempts: int = 10, delay: float = 3.0):
engine = create_engine(database_url, pool_pre_ping=True)
for attempt in range(1, max_attempts + 1):
try:
with engine.connect() as conn:
conn.execute(text("SELECT 1"))
logger.info("✅ Database connected")
return engine
except OperationalError as e:
logger.error(f"❌ DB attempt {attempt}/{max_attempts}: {e}")
if attempt == max_attempts:
raise
backoff = min(delay * attempt, 30)
logger.info(f"Retrying in {backoff:.0f}s...")
time.sleep(backoff)
engine = connect_with_retry(os.environ["DATABASE_URL"])

Go:
package db
import (
"database/sql"
"fmt"
"log"
"math"
"time"
_ "github.com/lib/pq"
)
func ConnectWithRetry(dsn string, maxAttempts int) (*sql.DB, error) {
var db *sql.DB
var err error
for attempt := 1; attempt <= maxAttempts; attempt++ {
db, err = sql.Open("postgres", dsn)
if err == nil {
err = db.Ping()
}
if err == nil {
log.Println("✅ Database connected")
return db, nil
}
log.Printf("❌ DB attempt %d/%d: %v", attempt, maxAttempts, err)
if attempt == maxAttempts {
break
}
backoff := math.Min(float64(attempt)*3, 30)
log.Printf("Retrying in %.0fs...", backoff)
time.Sleep(time.Duration(backoff) * time.Second)
}
return nil, fmt.Errorf("failed to connect after %d attempts: %w", maxAttempts, err)
}

Solution 3: Use pool_pre_ping / Connection Validation
Configure your connection pool to test connections before using them. This handles not just startup, but also stale connections after a DB restart:
# SQLAlchemy — validates connection before each use
engine = create_engine(
DATABASE_URL,
pool_pre_ping=True, # test connection before use
pool_recycle=3600, # recycle connections every hour
pool_size=10,
max_overflow=20,
)

// Prisma — reconnects automatically
const prisma = new PrismaClient({
datasources: {
db: { url: process.env.DATABASE_URL }
}
})
// Prisma handles reconnection internally

// database/sql — set connection lifetime
db.SetConnMaxLifetime(time.Hour)
db.SetConnMaxIdleTime(30 * time.Minute)
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(10)

Solution 4: Kubernetes Readiness Probes
In Kubernetes, you don't rely on depends_on — you use readiness probes. A pod is only added to the Service's endpoint list when its readiness probe passes:
# kubernetes deployment
spec:
containers:
- name: api
image: myapp:v1.0.0
readinessProbe:
httpGet:
path: /health/ready # returns 200 only when DB is connected
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 6
livenessProbe:
httpGet:
path: /health/live # returns 200 if process is alive
port: 8080
initialDelaySeconds: 15
periodSeconds: 10

Your /health/ready endpoint should check the actual database connection:
// Express health endpoint
app.get('/health/ready', async (req, res) => {
try {
await prisma.$queryRaw`SELECT 1`
res.json({ status: 'ready', db: 'connected' })
} catch (error) {
res.status(503).json({ status: 'not ready', db: 'disconnected' })
}
})
app.get('/health/live', (req, res) => {
res.json({ status: 'alive' })
})

The liveness probe (/health/live) only checks if the process is alive — it should always return 200 as long as the process is running. The readiness probe (/health/ready) checks if the app can serve traffic — it returns 503 when DB is unavailable. This prevents Kubernetes from routing traffic to the pod while DB is down, without killing and restarting the pod unnecessarily.
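When a rollout looks stuck, the probe states tell you which side is failing. A few standard kubectl checks; the label selector and pod name are placeholders:

```bash
# READY 0/1 means the readiness probe is failing even though the pod is Running
kubectl get pods -l app=myapp

# Events at the bottom include readiness/liveness probe failures and their errors
kubectl describe pod myapp-7d9c8b6f4-abcde

# Only pods passing readiness are listed as endpoints behind the Service
kubectl get endpoints myapp
```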
Putting It All Together
Here's a production-grade docker-compose.yml that handles both failure modes:
services:
api:
image: myregistry.io/myapp:${APP_VERSION:?APP_VERSION is required}
restart: unless-stopped
depends_on:
db:
condition: service_healthy
cache:
condition: service_healthy
environment:
DATABASE_URL: postgresql://app:${DB_PASSWORD}@db:5432/myapp
REDIS_URL: redis://cache:6379
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health/live"]
interval: 10s
timeout: 5s
retries: 3
start_period: 15s
ports:
- "127.0.0.1:8080:8080"
db:
image: postgres:16-alpine
restart: unless-stopped
environment:
POSTGRES_USER: app
POSTGRES_PASSWORD: ${DB_PASSWORD:?DB_PASSWORD is required}
POSTGRES_DB: myapp
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U app -d myapp"]
interval: 5s
timeout: 5s
retries: 10
start_period: 10s
cache:
image: redis:7-alpine
restart: unless-stopped
command: redis-server --appendonly yes
volumes:
- redisdata:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 5
volumes:
pgdata:
redisdata:

And a deploy script that handles image pull failures:
#!/bin/bash
# deploy.sh
set -e
APP_VERSION=${1:?Usage: ./deploy.sh <version>}
IMAGE="myregistry.io/myapp:${APP_VERSION}"
echo "🚀 Deploying $IMAGE"
# Step 1: Pull new image BEFORE touching running containers
echo "📦 Pulling image..."
MAX_RETRIES=3
for i in $(seq 1 $MAX_RETRIES); do
docker pull "$IMAGE" && break
if [ $i -eq $MAX_RETRIES ]; then
echo "❌ Failed to pull image after $MAX_RETRIES attempts. Aborting."
exit 1
fi
echo "Retry $i/$MAX_RETRIES in 15s..."
sleep 15
done
echo "✅ Image pulled successfully"
# Step 2: Deploy (old containers still running until here)
echo "🔄 Updating service..."
APP_VERSION=$APP_VERSION docker compose up -d --no-deps api
echo "⏳ Waiting for api to become healthy..."
timeout 60 bash -c 'until docker compose ps api | grep -q "healthy"; do sleep 2; done'
echo "✅ Deployment complete"Quick Reference
Image pull failures:
| Scenario | Solution |
|---|---|
| Registry down | Pull before stopping old containers; local mirror |
| Auth expired | Rotate credentials in CI/CD secrets; verify before deploy |
| Tag missing | Pin versions explicitly; verify tag in CI before deploy |
| Network timeout | Retry with backoff; local mirror; pre-pull in CI |
| Rate limiting | Authenticate with Docker Hub; use private registry |
DB connection failures:
| Scenario | Solution |
|---|---|
| DB not ready at startup | healthcheck + condition: service_healthy |
| DB restarts while app runs | pool_pre_ping=True; retry logic in app |
| DB port not yet open | Retry with exponential backoff on startup |
| Kubernetes deployment | Readiness probe checking /health/ready |
| DB failover/replica switch | Connection pool with reconnect; circuit breaker |
Summary and Key Takeaways
✅ Always pull the new image before stopping running containers — never the other way around
✅ Pin image versions explicitly in production — :latest makes rollback impossible
✅ Add retry with exponential backoff to pull steps in CI/CD pipelines
✅ A local registry mirror eliminates external registry outages as a deploy risk
✅ depends_on alone does NOT wait for the database to be ready — add a healthcheck
✅ condition: service_healthy in Compose + pg_isready in healthcheck is the correct pattern
✅ Add startup retry logic in your app as defense-in-depth — healthchecks aren't always available
✅ Use pool_pre_ping=True (or equivalent) so stale connections are detected before use
✅ In Kubernetes: readiness probe on /health/ready (checks DB) + liveness on /health/live (checks process)
✅ The APP_VERSION:? syntax in Compose makes missing env vars a hard error before deploy starts
Related:
Docker Fundamentals — containers, images, core concepts
Docker Compose & Multi-Container Apps — Compose deep dive
Docker Networking & Volumes — networking and storage
Docker & Kubernetes Roadmap — full learning path
Have a production Docker failure story? Feel free to reach out or leave a comment!