
API Deployment — Operational Notes & Known Issues

This page documents known limitations, bottlenecks, and operational concerns in the current staging deployment. Items are grouped by priority. Each entry includes a recommended fix or mitigation.


High Priority

✅ FIXED (2026-04-27) — API healthcheck added

Was: Caddy declared depends_on: api, but the API container had no healthcheck, so Compose only waited for the container to start, not for the API to be ready. If the API crashed during startup (after RabbitMQ became healthy), Caddy was already forwarding traffic to a dead upstream.

Fix applied in docker-compose.staging.yml:

yaml
api:
  healthcheck:
    test: ["CMD-SHELL", "wget -qO- http://127.0.0.1:3000/health || exit 1"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s

caddy:
  depends_on:
    api:
      condition: service_healthy

Note: Uses 127.0.0.1 rather than localhost. In Alpine-based containers, localhost may resolve to the IPv6 loopback (::1) first, and the check then fails if the API listens only on IPv4.


✅ FIXED (2026-04-27) — RabbitMQ named volume added

Was: No named volume was declared for RabbitMQ, so destroying the container wiped all queue state (in-flight messages, exchange definitions).

Fix applied in docker-compose.staging.yml:

yaml
services:
  rabbitmq:
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq

volumes:
  caddy_data:
  rabbitmq_data:

RabbitMQ password rotation procedure (runbook)

RABBITMQ_DEFAULT_PASS only takes effect when RabbitMQ initialises a new data directory. If RABBITMQ_PASS is rotated in GitHub Secrets, Compose recreates the container — but the Mnesia database on the named volume still holds the old credentials. RabbitMQ starts but the API gets ACCESS_REFUSED.

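When rotating RABBITMQ_PASS, run the following on the droplet. Removing the volume also deletes all queues, exchanges, and in-flight messages, which is acceptable for staging only: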

sh
docker compose -f docker-compose.staging.yml stop rabbitmq
docker compose -f docker-compose.staging.yml rm -f rabbitmq
docker volume rm semalink-api_rabbitmq_data    # wipes old Mnesia DB
docker compose -f docker-compose.staging.yml up -d rabbitmq

Then trigger a full redeploy (or just docker compose -f docker-compose.staging.yml up -d api) so the API reconnects with the new credentials.


Single droplet — single point of failure

What happens: All three containers (caddy, rabbitmq, api) run on one VM. A droplet reboot (DigitalOcean scheduled maintenance, kernel update) or a host-level failure such as memory exhaustion takes down the entire staging environment at once. Staging downtime is tolerable, but this architecture should not be copied for production.

Production mitigation plan:

  • Move to a load-balanced setup with at least two API instances behind the reverse proxy
  • Use a managed RabbitMQ service (CloudAMQP, AWS MQ) for queue durability across restarts
  • Use DigitalOcean Managed PostgreSQL if Neon's serverless cold-start latency becomes noticeable at production traffic levels

Medium Priority

No container resource limits

What happens: Neither api nor rabbitmq has memory or CPU limits set. On a 4 GB droplet, a memory leak in the API can exhaust host memory and trigger the kernel OOM killer. The OOM killer picks its victim by oom_score, which tracks memory usage closely but does not guarantee that the leaking process is the one killed, so RabbitMQ can be the casualty. The API then loses its AMQP connection and crashes, leading to a full stack outage.

Fix: Add resource limits appropriate to the droplet size:

yaml
services:
  api:
    deploy:
      resources:
        limits:
          memory: 1g
  rabbitmq:
    deploy:
      resources:
        limits:
          memory: 512m

Runtime Docker image includes all devDependencies

What happens: The Dockerfile runs npm ci (without --omit=dev) in the runtime stage because drizzle-kit — a devDependency — is needed to run migrations at startup. This inflates the final image to ~500–600 MB instead of the ~150 MB it would be with only production dependencies. Larger images mean slower deploys (longer push/pull times) and a larger attack surface.

Two options to fix this:

Option A — move drizzle-kit to dependencies in package.json. Clean, no build changes needed, but you're shipping a CLI tool into the production image.

Option B — split migrations into a separate Docker target or an init container, so the runtime image can use npm ci --omit=dev:

dockerfile
FROM node:22-alpine AS runtime
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev   # now possible
...

The migration job runs before the API container starts (e.g. as a GitHub Actions step via docker compose run --rm api sh -c "npx drizzle-kit migrate" using the full image, or as a separate migrate service in Compose with restart: "no").
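
As a rough sketch of the separate migrate service mentioned above: the wiring below assumes a hypothetical migrate build stage in the Dockerfile that still contains devDependencies, and assumes the database connection string is passed as DATABASE_URL. Neither the stage name nor the variable wiring is confirmed by the current setup.

```yaml
services:
  migrate:
    build:
      context: .
      target: migrate                 # hypothetical stage name; must still contain drizzle-kit
    command: ["npx", "drizzle-kit", "migrate"]
    environment:
      DATABASE_URL: ${DATABASE_URL}   # assumed variable name
    restart: "no"                     # run once; a failed migration is not retried

  api:
    depends_on:                       # merge with any existing depends_on entries for api
      migrate:
        condition: service_completed_successfully   # api starts only if migrate exits 0
```

With this wiring, docker compose up -d runs the migration first, and a failed migration leaves api unstarted rather than restarting it repeatedly.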


No Docker build layer caching in CI/CD

What happens: There is no shared build cache in CI/CD; builds rely on the droplet's local Docker cache. That cache persists between deploys, so code-only changes rebuild quickly, but any change to package.json or the lockfile invalidates the npm ci layer and forces a full dependency reinstall (~60–120 seconds).

Fix for now: The local cache on the droplet is sufficient for staging. Do not use --no-cache in the deploy script unless you're actively debugging a caching issue.

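Longer-term improvement: Persist the BuildKit cache between CI runs, using a local cache directory restored by the CI cache (shown here), BuildKit's inline cache, or a GitHub Actions cache for the builder stage:

yaml
- name: Build
  run: |
    docker buildx build \
      --cache-from type=local,src=/tmp/.buildx-cache \
      --cache-to type=local,dest=/tmp/.buildx-cache-new,mode=max \
      -t semalink-api:latest .
    # Replace the old cache with the new one so the next run can read it
    # and the cache directory does not grow without bound.
    rm -rf /tmp/.buildx-cache
    mv /tmp/.buildx-cache-new /tmp/.buildx-cache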
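
If the image were built in GitHub Actions rather than on the droplet, the GitHub Actions cache backend could take the place of the local directory. The steps below are a sketch under that assumption. The action versions are illustrative, and the built image would still need to reach the droplet (for example through a registry), which is not shown.

```yaml
- name: Set up Buildx
  uses: docker/setup-buildx-action@v3

- name: Build with GitHub Actions cache
  uses: docker/build-push-action@v6
  with:
    context: .
    tags: semalink-api:latest
    load: true                     # load the result into the runner's local image store
    cache-from: type=gha
    cache-to: type=gha,mode=max    # mode=max also caches intermediate stages
```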

Migration failure causes an infinite restart loop

What happens: restart: unless-stopped on the api container means a failed migration causes Docker to restart the container repeatedly. Each restart re-attempts the migration, generating load and burning RabbitMQ connections on each attempt.

Mitigation options:

Option A — change to restart: on-failure with a maximum retry count:

yaml
api:
  restart: on-failure:3   # give up after 3 failed restarts; plain on-failure retries indefinitely

The same limit can also be written with the extended syntax (use one form, not both):

yaml
api:
  deploy:
    restart_policy:
      condition: on-failure
      max_attempts: 3

Option B — keep unless-stopped (so the container recovers from transient network errors) but add monitoring (see Log Aggregation below) so you get alerted before the loop causes damage.


Lower Priority / Future Planning

Caddy tls internal — not compatible with Cloudflare Full (strict)

What happens: tls internal issues the site certificate from Caddy's built-in local CA, which is not publicly trusted. Cloudflare's Full SSL mode accepts any origin certificate, including self-signed ones; Full (strict) does not, because it requires a certificate from a publicly trusted CA or from Cloudflare Origin CA.

If SSL mode is ever switched to Full (strict) — which provides stronger security — the origin will fail TLS verification and Cloudflare will return a 526 error.

Fix when needed: Replace tls internal with a real certificate:

  • Let's Encrypt (automatic): Remove tls internal entirely. Caddy handles ACME automatically. This requires the origin to be reachable on port 80 from the public internet (for HTTP-01 challenge), which it currently is. The caddy_data volume already persists the issued cert.

    staging-arc.semalink.africa {
        reverse_proxy api:3000
    }
  • Cloudflare Origin CA: Issue an origin certificate in the Cloudflare dashboard (15-year validity, trusted only by Cloudflare's edge). Mount it into the Caddy container and reference it explicitly in the Caddyfile.
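
For the Cloudflare Origin CA option, the mount could be wired roughly as follows. The host paths and file names are placeholders, not taken from the current deployment. The site block in the Caddyfile would then reference the files with Caddy's tls directive, for example tls /etc/caddy/certs/origin.pem /etc/caddy/certs/origin.key in place of tls internal.

```yaml
services:
  caddy:
    volumes:
      # Append these to the existing volume entries for caddy.
      - ./certs/origin.pem:/etc/caddy/certs/origin.pem:ro   # placeholder path: Origin CA certificate
      - ./certs/origin.key:/etc/caddy/certs/origin.key:ro   # placeholder path: private key, keep out of git
```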


Neon free tier — connection limit

What happens: The Neon staging branch is on a free plan with a connection limit (typically 100–500 depending on compute size). If the API opens connections without a pool or leaks them, you'll hit the limit and get FATAL: remaining connection slots are reserved errors.

Verify and fix:

  1. Confirm DATABASE_URL uses the -pooler hostname (e.g. ep-...-pooler.c-3.eu-central-1.aws.neon.tech). This routes through Neon's PgBouncer, which multiplexes many application connections onto a small number of real Postgres connections.

  2. Confirm the API's database client is configured with a reasonable max pool size (e.g. max: 10 in postgres.js or drizzle-orm/postgres-js).

  3. Monitor active connections in the Neon dashboard under Monitoring → Active connections.


Upstash Redis — LRU eviction on free tier

What happens: The Upstash free tier enforces a data size limit (~256 MB). When the limit is reached, Upstash evicts keys using LRU (Least Recently Used). This means old sessions, rate-limit counters, or cache entries can disappear silently.

For staging: Acceptable. The data volume is low.

For production: Upgrade to a paid Upstash tier with a higher data limit and no eviction, or switch to a DigitalOcean Managed Redis cluster where eviction policy is configurable.


No log aggregation

What happens: Logs are only accessible via docker compose logs api on the droplet over SSH. There is no way to query historical logs, set up alerting, or correlate errors across services without manually SSHing in.

Recommended approach: Pipe Docker container logs to a log aggregation service. Options in order of setup complexity:

Service | Setup effort | Free tier
Logtail (Better Stack) | Docker log driver, one line | 1 GB/month
Papertrail | Docker log driver, one line | 50 MB/day
Grafana Loki (self-hosted) | Additional container in Compose | Unlimited (self-hosted)

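Docker log driver config, as a starting point for Logtail (this snippet only keeps local json-file logs; forwarding still requires the syslog driver pointed at Logtail's syslog endpoint):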

yaml
services:
  api:
    logging:
      driver: json-file   # keep local logs as fallback
    # or use the syslog driver pointing to Logtail's syslog endpoint
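
A sketch of the syslog alternative mentioned in the comment above, assuming the provider exposes a TLS syslog endpoint. The address is a placeholder, not a real Logtail endpoint, and the exact ingestion method should be taken from the provider's documentation.

```yaml
services:
  api:
    logging:
      driver: syslog
      options:
        syslog-address: "tcp+tls://logs.example.com:6514"   # placeholder endpoint
        syslog-format: rfc5424
        tag: "api"                                          # label attached to each log line
```

If json-file stays as the driver instead, setting max-size and max-file under options keeps local logs from filling the droplet's disk.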

At minimum, set up log forwarding before production launch.

Internal use only — Sema Link Engineering