Architecture Review

This page documents known gaps, risks, and open design questions in the current architecture. Items are grouped by severity. Use this as a working reference — check off items as they are resolved and add new ones as they are discovered.

Summary

#	Issue	Severity	Status
1	Single aggregator — no failover	Critical	Open
2	Credit reservation race condition	Critical	Open
3	Customer webhook delivery unreliable	Critical	Open
4	Agent customer billing logic unspecified	Critical	Open
5	sms.schedule queue is a busy-wait loop	Significant	Resolved in design
6	messages table will become a hotspot	Significant	Open
7	No dead-letter queue strategy	Significant	Open
8	Rate limiting undefined	Significant	Open
9	Content check logic duplicated across two services	Moderate	Open
10	No observability strategy	Moderate	Open
11	No disaster recovery plan	Moderate	Open

Critical

1. Single Aggregator — No Failover

Problem: The entire platform's ability to send SMS depends on one external service, Celcom Africa. If Celcom has an outage, rate-limits the account, or suffers degraded delivery rates, every customer on the platform is affected simultaneously. No delivery-rate SLA can be honoured while a single provider is a single point of failure.

Suggested fix: Contract a second aggregator (Africa's Talking is the natural choice — REST API, East Africa coverage). The SMS Worker should have a routing layer:

Primary: Celcom Africa
Fallback: Africa's Talking — auto-triggered when Celcom's error rate exceeds a threshold over a rolling 60-second window, or toggled manually per account in the Internal Admin App
Enterprise option: dedicated or preferred aggregator route per account
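
The routing rule above can be sketched as a small in-memory router. This is an illustration under stated assumptions, not the SMS Worker's actual code: the class name, the 20% error threshold, and the minimum sample size are hypothetical, and only the 60-second rolling window comes from the text.

```typescript
// Hypothetical sketch: names, the 20% threshold and the minimum sample size
// are illustrative assumptions. Only the 60-second window comes from the review.
type Aggregator = "celcom" | "africas_talking";

interface Outcome {
  atMs: number; // when the send attempt completed
  ok: boolean;  // true if the primary aggregator accepted the message
}

class AggregatorRouter {
  private outcomes: Outcome[] = [];

  constructor(
    private readonly windowMs: number = 60_000,
    private readonly errorRateThreshold: number = 0.2,
    private readonly minSamples: number = 20,
  ) {}

  // Record the result of one send attempt against the primary aggregator.
  record(ok: boolean, nowMs: number): void {
    this.outcomes.push({ atMs: nowMs, ok });
    this.prune(nowMs);
  }

  // A manual per-account override wins; otherwise apply the error-rate rule.
  route(nowMs: number, manualOverride?: Aggregator): Aggregator {
    if (manualOverride !== undefined) return manualOverride;
    this.prune(nowMs);
    // Too few samples in the window: stay on the primary.
    if (this.outcomes.length < this.minSamples) return "celcom";
    const failures = this.outcomes.filter((o) => !o.ok).length;
    const errorRate = failures / this.outcomes.length;
    return errorRate > this.errorRateThreshold ? "africas_talking" : "celcom";
  }

  // Drop outcomes that have fallen out of the rolling window.
  private prune(nowMs: number): void {
    const cutoff = nowMs - this.windowMs;
    this.outcomes = this.outcomes.filter((o) => o.atMs > cutoff);
  }
}
```

One design point the sketch leaves open: once traffic moves to the fallback, the primary records no new outcomes, so a real router would likely need probe sends or a cool-down period to decide when to return to Celcom.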

2. Credit Reservation Race Condition

Problem: The single-send flow checks the balance and then reserves credits as two separate steps. Under concurrent requests from the same account, two requests can both read the same balance, both pass the check, and both reserve — pushing the account into a negative balance. This is a time-of-check / time-of-use (TOCTOU) race that will occur in production under any meaningful load.

Suggested fix: The balance check and deduction must be a single atomic operation. Two options:

Option A — Redis Lua script (preferred for throughput):

lua

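-- KEYS[1] = balance key, ARGV[1] = credits to reserve
-- GET returns false for a missing key; treat that as a zero balance
local balance = tonumber(redis.call('GET', KEYS[1]) or 0)
local cost = tonumber(ARGV[1])
if balance >= cost then
  return redis.call('DECRBY', KEYS[1], ARGV[1])  -- new balance after reservation
else
  return -1  -- insufficient credits
end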

If the return value is -1, reject with 402 Payment Required.

Option B — PostgreSQL atomic UPDATE:

sql

-- $1 = cost, $2 = account id (bound as query parameters)
UPDATE credits
SET balance = balance - $1
WHERE account_id = $2 AND balance >= $1
RETURNING balance;

If zero rows are returned, the balance was insufficient. No separate read required.

Option A is better for high-frequency single sends. Option B is simpler and produces a cleaner audit trail.
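
To make the race concrete, here is a minimal in-memory model of the two patterns. It is a sketch only: the `Ledger` class is hypothetical and stands in for Redis or PostgreSQL, and the interleaving of two requests is written out by hand rather than produced by real concurrency.

```typescript
// In-memory model only. Ledger is a hypothetical stand-in for the credit store.
class Ledger {
  constructor(public balance: number) {}

  // Racy pattern: the check and the deduction are separate steps.
  hasEnough(cost: number): boolean {
    return this.balance >= cost;
  }
  deduct(cost: number): void {
    this.balance -= cost;
  }

  // Atomic pattern: one step that either reserves or refuses, mirroring
  // the -1 return of Option A and the zero-rows result of Option B.
  tryReserve(cost: number): boolean {
    if (this.balance < cost) return false;
    this.balance -= cost;
    return true;
  }
}

// Two requests of 8 credits each against a balance of 10.
const racy = new Ledger(10);
const aPassed = racy.hasEnough(8); // request A checks
const bPassed = racy.hasEnough(8); // request B checks before A has deducted
if (aPassed) racy.deduct(8);
if (bPassed) racy.deduct(8);
const racyBalance = racy.balance; // both passed, balance is now negative

const atomic = new Ledger(10);
const aReserved = atomic.tryReserve(8); // succeeds
const bReserved = atomic.tryReserve(8); // refused: maps to 402 Payment Required
const atomicBalance = atomic.balance;   // never drops below zero
```

With the separate check and deduct, both requests pass and the balance ends at -6. With the single-step reservation, the second request is refused and the balance ends at 2.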

3. Customer Webhook Delivery Unreliable

Problem: The DLR Processor fires customer webhook callbacks inline as part of its processing loop (step 8). If the customer's endpoint times out or returns a 5xx, the webhook is silently dropped. The message status is saved to the database correctly, but the customer's system never receives the push notification it may be relying on — a silent failure for any customer running automated workflows on DLR webhooks (e.g. OTP confirmation flows).

Suggested fix: Webhook delivery should be decoupled from the DLR Processor entirely:

Add a webhooks.outbound queue and a dedicated Webhook Delivery Worker
When the DLR Processor determines a webhook needs firing, it publishes an event to webhooks.outbound instead of making the HTTP call inline
The Webhook Delivery Worker handles the HTTP call with up to 5 retries over 24 hours (exponential backoff)
Each attempt is persisted to a webhook_delivery_attempts table
After all retries are exhausted, the delivery is marked failed — visible to the customer in the dashboard

This also means webhook delivery failures never slow down DLR processing.
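
One possible retry schedule that fits "5 retries over 24 hours" is sketched below. The base delay of 4 minutes and the factor of 4 are assumptions chosen so that the delays sum to just under 24 hours; the source specifies only the retry count, the 24-hour span, and exponential backoff.

```typescript
// Assumed parameters: first retry 4 minutes after the initial failure,
// each later delay 4x the previous one, 5 retries in total.
const BASE_DELAY_MS = 4 * 60 * 1000;
const BACKOFF_FACTOR = 4;
const MAX_RETRIES = 5;

// Delay before retry number `retry` (1-based), or null once retries are exhausted.
function retryDelayMs(retry: number): number | null {
  if (retry < 1 || retry > MAX_RETRIES) return null;
  return BASE_DELAY_MS * Math.pow(BACKOFF_FACTOR, retry - 1);
}

// Full schedule in minutes, measured from the previous attempt.
const scheduleMinutes: number[] = [];
for (let r = 1; r <= MAX_RETRIES; r++) {
  const delay = retryDelayMs(r);
  if (delay !== null) scheduleMinutes.push(delay / 60000);
}

// Total time from the first failure to the last retry, in hours.
const totalHours = scheduleMinutes.reduce((sum, m) => sum + m, 0) / 60;
```

Under these assumptions the delays are 4, 16, 64, 256 and 1024 minutes, about 22.7 hours in total. A `null` from `retryDelayMs` is the point at which the delivery would be marked failed. A production worker would probably add random jitter so that retries for one customer endpoint do not arrive in lockstep.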

4. Agent Customer Billing Logic Unspecified

Problem: An agent's customers draw from the agent's balance, not their own. But the Main API billing flow checks and deducts credits from the requesting account, and an agent's customer account has no balance. The billing module has no documented mechanism for looking up the agent relationship and checking the agent's balance instead. Without one, sends from an agent's customer will either fail (no balance found) or go through without any charge.

Suggested fix: Add a billing_account_id column to the accounts table:

Account type	`billing_account_id` value
Direct customer	`account_id` (self)
Agent's customer	`agent_account_id`

The billing module always operates on billing_account_id, regardless of which account submitted the request. One field handles all cases uniformly — the credit check, reservation, finalisation, and refund logic stays identical for both account types.
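
A minimal sketch of how the billing module might use the new column. The `Account` shape, the in-memory `balances` store, and the `reserveFor` helper are all hypothetical; only the `billing_account_id` idea comes from the proposal above.

```typescript
// Hypothetical shape of an accounts row after adding billing_account_id.
interface Account {
  accountId: string;
  billingAccountId: string; // self for direct customers, the agent for an agent's customers
}

// Balances keyed by billing account. Agent's customers have no entry of their own.
const balances: Record<string, number> = { "agent-1": 500, "direct-7": 40 };

// Reserve credits against whichever account pays, regardless of who submitted the request.
function reserveFor(account: Account, cost: number): boolean {
  const payer = account.billingAccountId;
  const balance = balances[payer];
  if (balance === undefined || balance < cost) return false;
  balances[payer] = balance - cost;
  return true;
}
```

The caller never branches on account type: an agent's customer with `billingAccountId` set to `agent-1` is charged against the agent's 500 credits, while a direct customer is charged against its own balance through the same code path.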

Significant

5. `sms.schedule` Queue Is a Busy-Wait Loop

Problem: The SMS Worker dequeues scheduled jobs, checks send_at, and requeues them if the time hasn't been reached. At scale — thousands of campaigns scheduled for the same window — the queue fills with messages being dequeued and immediately requeued over and over. This wastes CPU, creates unnecessary queue churn, and makes it harder to enforce priority ordering since the worker burns cycles on premature jobs.

Suggested fix: Remove the sms.schedule queue entirely. Use a scheduler table instead:

When a campaign is approved and has a send_at, it is stored in PostgreSQL with status = 'scheduled' — nothing is published to RabbitMQ yet

A lightweight scheduler process (cron job, runs every 30 seconds) queries:

sql

-- Campaigns whose send time has arrived and that have not been published yet
SELECT id FROM campaigns
WHERE send_at <= NOW() AND status = 'scheduled';

For each ready campaign, it publishes jobs to sms.normal and marks the campaign queued

Jobs only enter RabbitMQ when they are actually ready to send. The queue stays clean.
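
The scheduler pass described above can be sketched against an in-memory list. The `Campaign` type and the `publish` callback are hypothetical stand-ins for the campaigns table and for publishing jobs to sms.normal.

```typescript
type CampaignStatus = "scheduled" | "queued";

interface Campaign {
  id: number;
  sendAt: Date;
  status: CampaignStatus;
}

// One scheduler pass: find due campaigns, publish them, mark them queued.
// `publish` stands in for publishing the campaign's jobs to sms.normal.
function schedulerTick(
  campaigns: Campaign[],
  now: Date,
  publish: (campaignId: number) => void,
): number[] {
  const published: number[] = [];
  for (const c of campaigns) {
    if (c.status === "scheduled" && c.sendAt.getTime() <= now.getTime()) {
      publish(c.id);
      c.status = "queued"; // keeps the next tick from publishing it again
      published.push(c.id);
    }
  }
  return published;
}
```

The status change is what makes repeated ticks safe in this single-process model. If more than one scheduler instance could run at once, the select and the status update would need to happen under a row lock in one transaction (for example with PostgreSQL's `SELECT ... FOR UPDATE SKIP LOCKED`) so that two instances do not publish the same campaign.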

6. `messages` Table Will Become a Hotspot

Problem: Every SMS generates 4–5 writes to the messages table across its lifecycle (submitted → queued → sent → delivered/failed/expired). At 100k messages per day this is 400–500k writes/day to a single table, with multiple services contending on the same rows. At higher volume this becomes the primary performance bottleneck in the database.

Suggested fixes (in order of impact):

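Partition by month: PostgreSQL declarative range partitioning on created_at splits the table into monthly partitions. Queries that filter on created_at for recent data never touch historical partitions, and old partitions can be archived and dropped. Note that any primary key or unique constraint on a partitioned table must include the partition key, so created_at has to become part of the key.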

sql

CREATE TABLE messages (...) PARTITION BY RANGE (created_at);
CREATE TABLE messages_2025_04 PARTITION OF messages
  FOR VALUES FROM ('2025-04-01') TO ('2025-05-01');

Append-only event log: Replace status UPDATE writes with inserts into a message_events table (message_id, status, timestamp). Current status = latest event. Eliminates UPDATE lock contention entirely and provides a full audit trail.

Archival policy: Messages older than 90 days are exported to object storage (DigitalOcean Spaces as CSV or Parquet) and removed from PostgreSQL. Most delivery analytics queries don't need individual rows after 30–90 days.
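
For the append-only event log option, the rule "current status = latest event" can be sketched as follows. The in-memory array stands in for the message_events table, and the type and function names are hypothetical.

```typescript
type MessageStatus = "submitted" | "queued" | "sent" | "delivered" | "failed" | "expired";

interface MessageStatusEvent {
  messageId: string;
  status: MessageStatus;
  timestamp: number; // epoch milliseconds
}

// Stand-in for the message_events table: rows are only ever appended.
const messageEvents: MessageStatusEvent[] = [];

function appendEvent(e: MessageStatusEvent): void {
  messageEvents.push(e);
}

// Current status is the status of the most recent event for the message.
function currentStatus(messageId: string): MessageStatus | undefined {
  let latest: MessageStatusEvent | undefined;
  for (const e of messageEvents) {
    if (e.messageId === messageId && (latest === undefined || e.timestamp >= latest.timestamp)) {
      latest = e;
    }
  }
  return latest === undefined ? undefined : latest.status;
}
```

Because nothing is updated in place, writers never contend on an existing row. The trade-off is that reading the current status becomes a "latest row per message" query, which in PostgreSQL would want an index on (message_id, timestamp).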

7. No Dead-Letter Queue Strategy

Problem: If a message in any queue causes a worker to crash (malformed payload, bug in processing logic), RabbitMQ redelivers it and crashes the worker again — indefinitely. A single poison pill message can stall an entire queue, blocking all messages behind it. There is no documented way to detect, inspect, or recover from this.

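Suggested fix: Configure a Dead Letter Exchange (DLX) on every queue with an x-delivery-limit of 5. Note that x-delivery-limit is supported only on quorum queues, so each source queue must be declared as one. Once a message has been redelivered more times than the limit allows, RabbitMQ dead-letters it to the DLX, which routes it to the corresponding dead-letter queue: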

Source queue	Dead-letter queue
`sms.priority`	`dlq.sms.priority`
`sms.normal`	`dlq.sms.normal`
`sms.content`	`dlq.sms.content`
`dlr.inbound`	`dlq.dlr.inbound`
`contacts.validate`	`dlq.contacts.validate`
`webhooks.outbound`	`dlq.webhooks.outbound`

The Internal Admin App should include a DLQ monitor: list stuck messages, inspect the raw payload, and offer Re-publish (fix and retry) or Discard (permanently reject) actions.
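
The queue arguments implied by the table can be generated rather than written by hand. In this sketch the `x-*` keys are standard RabbitMQ queue arguments, while the exchange name `dlx`, the `QueueDeclaration` type, and the helper are hypothetical; the objects would be passed to whatever AMQP client declares the queues.

```typescript
// Source queues from the table above.
const SOURCE_QUEUES: string[] = [
  "sms.priority",
  "sms.normal",
  "sms.content",
  "dlr.inbound",
  "contacts.validate",
  "webhooks.outbound",
];

interface QueueDeclaration {
  queue: string;
  deadLetterQueue: string;
  args: Record<string, string | number>;
}

// Build the declaration for one source queue. "dlx" is an assumed exchange name.
function declarationFor(queue: string, deliveryLimit: number = 5): QueueDeclaration {
  const deadLetterQueue = "dlq." + queue;
  return {
    queue,
    deadLetterQueue,
    args: {
      "x-queue-type": "quorum", // x-delivery-limit is a quorum queue feature
      "x-delivery-limit": deliveryLimit,
      "x-dead-letter-exchange": "dlx",
      "x-dead-letter-routing-key": deadLetterQueue,
    },
  };
}

const declarations: QueueDeclaration[] = SOURCE_QUEUES.map((q) => declarationFor(q));
```

Routing every source queue through one exchange with the dead-letter queue name as the routing key keeps the mapping in the table mechanical: each `dlq.*` queue simply binds to `dlx` with its own name.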

8. Rate Limiting Undefined

Problem: Redis has a ratelimit:{accountId}:{window} key documented, but no actual limits are specified anywhere in the architecture. Without hard limits, a single runaway API client — or a customer with a misconfigured bulk-send loop — can flood sms.normal, drain credits in seconds, and delay every other customer's messages behind theirs, including priority lane traffic.

Suggested fix: Define rate limits explicitly and make them part of the account tier:

Tier	Sustained rate	Burst
Starter	10 msg/sec	50
Growth	100 msg/sec	500
Enterprise	Negotiated	Negotiated

Limits are enforced at the Main API before the credit check or queue publish. Exceeding the limit returns 429 Too Many Requests with a Retry-After header. Limits are configurable per account in the Internal Admin App. Rate limiting should be per-API-key, not just per-account, so one misbehaving key doesn't block all of an account's traffic.
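
A token bucket is one common way to express a sustained rate plus a burst allowance. The sketch below is an in-memory illustration with hypothetical names; it assumes the "Burst" column means bucket capacity, and a real deployment would keep the bucket state in Redis under the documented ratelimit key, one bucket per API key.

```typescript
// In-memory token bucket. Assumes "burst" is the bucket capacity and
// "sustained rate" is the refill rate in tokens per second.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private readonly ratePerSec: number,
    private readonly burst: number,
    nowMs: number,
  ) {
    this.tokens = burst; // start full so an idle client can burst immediately
    this.lastRefillMs = nowMs;
  }

  // Returns 0 if the message is allowed, otherwise the number of whole
  // seconds to put in the Retry-After header of the 429 response.
  tryConsume(nowMs: number): number {
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0;
    }
    return Math.ceil((1 - this.tokens) / this.ratePerSec);
  }
}
```

For the Starter tier this would be `new TokenBucket(10, 50, Date.now())`: the first 50 messages in a burst pass, the 51st is rejected with a Retry-After of 1 second, and capacity then refills at 10 messages per second.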

Moderate

9. Content Check Logic Duplicated Across Two Services

Problem: The sync path (single send) runs content checks inline in the Main API. The async path (bulk campaign) runs them in the SMS Content Worker. Identical logic in two places. When a new blocklist rule is added, it must be updated in both places. When encoding detection behaviour changes, it must be changed in both. They will inevitably drift.

Suggested fix: Extract all content check logic into a shared content-checker module inside the backend monorepo. Both the Main API (sync, function call) and the SMS Content Worker (async, same function call) import it. One implementation, two call sites. Rule updates happen in one place and both paths get them automatically.
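
As an illustration of what the shared module's surface could look like, here is a small sketch with one exported function used by both call sites. The function name, the result shape, and the blocklist format are hypothetical, and the GSM-7 character set below should be checked against the GSM 03.38 tables before being relied on.

```typescript
// GSM 03.38 basic set and common extension characters (verify before use).
const GSM7_BASIC =
  "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?" +
  "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà";
const GSM7_EXTENSION = "^{}\\[~]|€";

interface ContentCheckResult {
  encoding: "GSM-7" | "UCS-2";
  blockedTerms: string[];
  allowed: boolean;
}

// Single implementation, imported by the Main API and the SMS Content Worker.
function checkContent(text: string, blocklist: string[]): ContentCheckResult {
  // Encoding detection: any character outside GSM-7 forces UCS-2.
  let encoding: "GSM-7" | "UCS-2" = "GSM-7";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (GSM7_BASIC.indexOf(ch) === -1 && GSM7_EXTENSION.indexOf(ch) === -1) {
      encoding = "UCS-2";
      break;
    }
  }
  // Blocklist check: case-insensitive substring match.
  const lower = text.toLowerCase();
  const blockedTerms = blocklist.filter((term) => lower.indexOf(term.toLowerCase()) !== -1);
  return { encoding, blockedTerms, allowed: blockedTerms.length === 0 };
}
```

Because the function is pure (text and rules in, result out), the sync path can call it inline and the async worker can call it per queued message without either needing its own copy of the rules.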

10. No Observability Strategy

Problem: Six services, three data stores, one message queue — and nothing in the architecture describes how the team will know when something is broken. This is not a future consideration; it is needed from day one.

Suggested fix: Define a baseline observability stack before launch:

Structured logging:

Use pino in all Node.js services — JSON output, messageId as a correlation field on every log line so a message can be traced from submission through the queue to DLR
Ship logs to Datadog, Logtail, or BetterStack (all have DigitalOcean integrations)

Metrics and alerting — minimum viable set:

Metric	Alert threshold
`sms.priority` queue depth	> 500 messages
SMS Worker error rate	> 5% over 5 min
DLR Processor lag (dlr.inbound depth)	> 1000 messages
DLR latency (sent → delivered) per MNO	> 10 min average
Celcom Africa API error rate	> 2% over 1 min
Postpay account approaching credit limit	< 20% remaining

Tracing: Attach messageId to every log line and every queue message payload so you can grep across all services for a single message's full journey.
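
The correlation idea can be shown without any logging library. The helper below is a hypothetical stand-in for whatever the services' logger emits (it does not use pino's API); the point is only that every line is JSON and carries messageId, so one message's journey can be filtered out of the combined stream.

```typescript
interface LogFields {
  [key: string]: string | number | boolean;
}

// Build one JSON log line that always carries the messageId correlation field.
function logLine(
  service: string,
  messageId: string,
  msg: string,
  fields: LogFields = {},
  nowMs: number = Date.now(),
): string {
  const entry: LogFields = { time: nowMs, service, messageId, msg };
  for (const key of Object.keys(fields)) {
    const value = fields[key];
    if (value !== undefined) entry[key] = value;
  }
  return JSON.stringify(entry);
}

// The equivalent of grepping all services' logs for a single message.
function traceMessage(lines: string[], messageId: string): string[] {
  return lines.filter((line) => (JSON.parse(line) as LogFields).messageId === messageId);
}
```

The same messageId would also travel in each queue message payload, so a worker can attach it to its own log lines without looking anything up.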

11. No Disaster Recovery Plan

Problem: The architecture runs in a single DigitalOcean region. A regional outage takes down the entire platform. DigitalOcean Managed PostgreSQL takes automated daily backups, but there is no documented RPO (acceptable data loss window) or RTO (acceptable downtime). For a billing-sensitive platform where credits and message history are the financial record, this is a meaningful risk.

The DLR Webhook Service is specifically exposed: if it goes down, Celcom Africa's callbacks may be lost depending on Celcom's retry window and policy — which is not documented.

Suggested fix:

Confirm and document Celcom Africa's DLR callback retry policy (how long do they retry, how many attempts). Add this to the aggregator integration docs.
Document the current RPO/RTO based on DigitalOcean's automated backup schedule (daily snapshot = up to 24h RPO, ~30–60 min RTO for a restore).
Add a health check endpoint to the DLR Webhook Service and set up an uptime monitor (Better Uptime or Cloudflare Health Checks). Alert immediately if it goes down — DLR loss is invisible without it.
For a longer-term fix: the DLR Webhook Service is a good candidate for multi-region standby since it is stateless (validate → queue → 200 OK). A secondary instance in another region behind a failover DNS record would be relatively cheap insurance.
