API Deployment — Operational Notes & Known Issues
This page documents known limitations, bottlenecks, and operational concerns in the current staging deployment. Items are grouped by priority. Each entry includes a recommended fix or mitigation.
High Priority
✅ FIXED (2026-04-27) — API healthcheck added
Was: Caddy had depends_on: api but there was no healthcheck on the API container. If the API crashed after RabbitMQ became healthy, Caddy was already forwarding traffic to a dead upstream.
Fix applied in docker-compose.staging.yml:

    api:
      healthcheck:
        test: ["CMD-SHELL", "wget -qO- http://127.0.0.1:3000/health || exit 1"]
        interval: 15s
        timeout: 5s
        retries: 3
        start_period: 30s
    caddy:
      depends_on:
        api:
          condition: service_healthy

Note: Uses 127.0.0.1, not localhost. Alpine's minimal resolver doesn't reliably resolve localhost inside containers.
✅ FIXED (2026-04-27) — RabbitMQ named volume added
Was: No named volume declared for RabbitMQ. Container destroy wiped all queue state (in-flight messages, exchange definitions).
Fix applied in docker-compose.staging.yml:

    services:
      rabbitmq:
        volumes:
          - rabbitmq_data:/var/lib/rabbitmq
    volumes:
      caddy_data:
      rabbitmq_data:

RabbitMQ password rotation procedure (runbook)
RABBITMQ_DEFAULT_PASS only takes effect when RabbitMQ initialises a new data directory. If RABBITMQ_PASS is rotated in GitHub Secrets, Compose recreates the container — but the Mnesia database on the named volume still holds the old credentials. RabbitMQ starts but the API gets ACCESS_REFUSED.
When rotating RABBITMQ_PASS, run on the droplet:

    docker compose -f docker-compose.staging.yml stop rabbitmq
    docker compose -f docker-compose.staging.yml rm -f rabbitmq
    docker volume rm semalink-api_rabbitmq_data # wipes old Mnesia DB
    docker compose -f docker-compose.staging.yml up -d rabbitmq

Then trigger a full redeploy (or just docker compose up -d api) so the API reconnects with the new credentials.
Single droplet — single point of failure
What happens: All three containers (caddy, rabbitmq, api) run on one VM. A droplet reboot (DigitalOcean scheduled maintenance, kernel update, OOM kill) takes down the entire staging environment simultaneously. Staging downtime is tolerable, but this architecture should not be copied for production.
Production mitigation plan:
- Move to a load-balanced setup with at least two API instances behind the reverse proxy
- Use a managed RabbitMQ service (CloudAMQP, AWS MQ) for queue durability across restarts
- Use DigitalOcean Managed PostgreSQL if Neon's serverless cold-start latency becomes noticeable at production traffic levels
Medium Priority
No container resource limits
What happens: Neither api nor rabbitmq has memory or CPU limits set. On a 4 GB droplet, a memory leak in the API can exhaust host memory. The kernel OOM killer then picks a victim by its badness score (driven mainly by memory use), which may be RabbitMQ rather than the leaking API. If RabbitMQ is killed, the API loses its AMQP connection and crashes, leading to a full stack outage.
Fix: Add resource limits appropriate to the droplet size:

    services:
      api:
        deploy:
          resources:
            limits:
              memory: 1g
      rabbitmq:
        deploy:
          resources:
            limits:
              memory: 512m

Runtime Docker image includes all devDependencies
What happens: The Dockerfile runs npm ci (without --omit=dev) in the runtime stage because drizzle-kit — a devDependency — is needed to run migrations at startup. This inflates the final image to ~500–600 MB instead of the ~150 MB it would be with only production dependencies. Larger images mean slower deploys (longer push/pull times) and a larger attack surface.
Two options to fix this:
Option A — move drizzle-kit to dependencies in package.json. Clean, no build changes needed, but you're shipping a CLI tool into the production image.
Option B — split migrations into a separate Docker target or an init container, so the runtime image can use npm ci --omit=dev:

    FROM node:22-alpine AS runtime
    WORKDIR /app
    COPY package*.json ./
    # now possible
    RUN npm ci --omit=dev
    ...

The migration job runs before the API container starts (e.g. as a GitHub Actions step via docker compose run --rm api sh -c "npx drizzle-kit migrate" using the full image, or as a separate migrate service in Compose with restart: "no").
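The separate migrate service mentioned above could look roughly like the following. This is a minimal sketch, not the project's actual configuration: the build stage name migrate is hypothetical (a stage that still installs devDependencies), and passing DATABASE_URL through from the environment is an assumption about how the API receives its connection string.

```yaml
services:
  migrate:
    build:
      context: .
      target: migrate          # hypothetical stage that keeps drizzle-kit installed
    command: ["npx", "drizzle-kit", "migrate"]
    restart: "no"              # one-shot job: never restarted, even on failure
    environment:
      DATABASE_URL: ${DATABASE_URL}   # assumed variable name

  api:
    build:
      context: .
      target: runtime          # stage name from the Dockerfile fragment above
    depends_on:
      migrate:
        # api is only started once the migrate container has exited with code 0
        condition: service_completed_successfully
```

With this shape, a failed migration leaves the api service unstarted instead of restarting it, so the failure surfaces in the deploy output.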
No Docker build layer caching in CI/CD
What happens: A docker compose build on the droplet reuses the local Docker cache, which persists between deploys, so code-only changes rebuild quickly. A change to package.json (or the lockfile), however, invalidates the npm ci layer and every layer after it, forcing a full dependency reinstall (~60–120 seconds).
Fix for now: The local cache on the droplet is sufficient for staging. Do not use --no-cache in the deploy script unless you're actively debugging a caching issue.
Longer-term improvement: Use BuildKit's inline cache or a GitHub Actions cache for the builder stage:

    - name: Build
      run: |
        docker buildx build \
          --cache-from type=local,src=/tmp/.buildx-cache \
          --cache-to type=local,dest=/tmp/.buildx-cache-new \
          -t semalink-api:latest .

Migration failure causes an infinite restart loop
What happens: restart: unless-stopped on the api container means a failed migration causes Docker to restart the container repeatedly. Each restart re-attempts the migration, generating load and burning RabbitMQ connections on each attempt.
Mitigation options:
Option A: change to restart: on-failure with a maximum retry count:

    api:
      restart: on-failure:3 # gives up after 3 failed restarts

Plain restart: on-failure has no default cap and keeps retrying indefinitely, so the count must be set explicitly. The same limit in the extended syntax:

    api:
      deploy:
        restart_policy:
          condition: on-failure
          max_attempts: 3

Option B: keep unless-stopped (so the container recovers from transient network errors) but add monitoring (see "No log aggregation" below) so you get alerted before the loop causes damage.
Lower Priority / Future Planning
Caddy tls internal — not compatible with Cloudflare Full (strict)
What happens: tls internal issues a certificate signed by Caddy's built-in local CA, which no public trust store includes. Cloudflare's Full SSL mode accepts such certificates because it does not validate the origin certificate; Full (strict) mode rejects them because it validates the certificate's CA chain.
If SSL mode is ever switched to Full (strict) — which provides stronger security — the origin will fail TLS verification and Cloudflare will return a 526 error.
Fix when needed: Replace tls internal with a real certificate:
- Let's Encrypt (automatic): Remove tls internal entirely. Caddy handles ACME automatically. This requires the origin to be reachable on port 80 from the public internet (for HTTP-01 challenge), which it currently is. The caddy_data volume already persists the issued cert.

      staging-arc.semalink.africa {
          reverse_proxy api:3000
      }

- Cloudflare Origin CA: Issue an origin certificate in the Cloudflare dashboard (15-year validity, trusted only by Cloudflare's edge). Mount it into the Caddy container and reference it explicitly in the Caddyfile.
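For the Cloudflare Origin CA route, the mount could be sketched as below. The host paths under ./certs and the container paths under /etc/caddy/certs are hypothetical; the only requirement is that the Caddyfile's tls directive points at the same container paths.

```yaml
services:
  caddy:
    volumes:
      # Existing mounts (Caddyfile, caddy_data) stay as they are.
      # Hypothetical locations of the origin certificate and key on the droplet:
      - ./certs/origin.pem:/etc/caddy/certs/origin.pem:ro
      - ./certs/origin-key.pem:/etc/caddy/certs/origin-key.pem:ro

# Matching Caddyfile directive inside the site block (replaces tls internal):
#   tls /etc/caddy/certs/origin.pem /etc/caddy/certs/origin-key.pem
```

Mounting both files read-only keeps the private key from being modified inside the container.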
Neon free tier — connection limit
What happens: The Neon staging branch is on a free plan with a connection limit (typically 100–500 depending on compute size). If the API opens connections without a pool or leaks them, you'll hit the limit and get FATAL: remaining connection slots are reserved errors.
Verify and fix:
- Confirm DATABASE_URL uses the -pooler hostname (e.g. ep-...-pooler.c-3.eu-central-1.aws.neon.tech). This routes through Neon's PgBouncer, which multiplexes many application connections onto a small number of real Postgres connections.
- Confirm the API's database client is configured with a reasonable max pool size (e.g. max: 10 in postgres.js or drizzle-orm/postgres-js).
- Monitor active connections in the Neon dashboard under Monitoring → Active connections.
Upstash Redis — LRU eviction on free tier
What happens: The Upstash free tier enforces a data size limit (~256 MB). When the limit is reached, Upstash evicts keys using LRU (Least Recently Used). This means old sessions, rate-limit counters, or cache entries can disappear silently.
For staging: Acceptable. The data volume is low.
For production: Upgrade to a paid Upstash tier with a higher data limit and no eviction, or switch to a DigitalOcean Managed Redis cluster where eviction policy is configurable.
No log aggregation
What happens: Logs are only accessible via docker compose logs api on the droplet over SSH. There is no way to query historical logs, set up alerting, or correlate errors across services without manually SSHing in.
Recommended approach: Pipe Docker container logs to a log aggregation service. Options in order of setup complexity:
| Service | Setup effort | Free tier |
|---|---|---|
| Logtail (Better Stack) | Docker log driver, one line | 1 GB/month |
| Papertrail | Docker log driver, one line | 50 MB/day |
| Grafana Loki (self-hosted) | Additional container in Compose | Unlimited (self-hosted) |
Docker log driver config for Logtail:

    services:
      api:
        logging:
          driver: json-file # keep local logs as fallback
          # or use the syslog driver pointing to Logtail's syslog endpoint

At minimum, set up log forwarding before production launch.
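A slightly fuller sketch of the two options named in the fragment above. The json-file variant caps local log growth with the driver's max-size and max-file options; the sizes shown are assumed values to tune for the droplet's disk. The commented syslog variant shows only the general shape: logs.example.com:6514 is a placeholder, not a real Logtail address, and the actual endpoint must come from the provider.

```yaml
services:
  api:
    logging:
      driver: json-file
      options:
        max-size: "10m"   # assumed value: rotate each log file at 10 MB
        max-file: "5"     # assumed value: keep at most 5 rotated files

    # Alternative: forward over syslog instead of keeping local files.
    # Placeholder address; substitute the endpoint issued by the log provider.
    # logging:
    #   driver: syslog
    #   options:
    #     syslog-address: "tcp+tls://logs.example.com:6514"
    #     tag: "api"
```

Without a size cap, the json-file driver keeps a single log file that grows without bound, which is worth avoiding on a small droplet even before forwarding is set up.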