NATS JetStream · Multi-Tenancy · Go · Distributed Systems

Designing Tenant Isolation in a High-Throughput Messaging Platform

How I designed stream-per-domain isolation to prevent tenant cross-contamination in a production AI messaging platform.

March 15, 2026

Overview

I led the design and implementation of core messaging infrastructure for a multi-tenant AI platform. The central challenge: ensuring complete isolation between tenants in a NATS JetStream-based event pipeline, while maintaining high throughput and predictable latency.

The Problem

Early in the platform's lifecycle, all tenant events flowed through a single shared NATS stream. It worked fine with three tenants. With fifteen, we started seeing consumer lag bleed across tenants during busy periods: one tenant's aggressive publishing would fill the shared stream faster than other tenants' consumers could drain it, inflating their lag.

The root issue: a shared stream means shared backpressure.

Key constraints:

  • At-least-once delivery guarantees
  • Sub-second event processing for real-time AI workflows
  • Zero cross-tenant data leakage (hard requirement for compliance)
  • Dynamic tenant provisioning (new tenants could be onboarded at any time)

Architecture Decision: Stream-Per-Domain

The core decision was to assign each tenant their own NATS JetStream stream, namespaced by domain. Each stream gets:

  • Its own retention policy (some domains need 7-day replay, others only 24 hours)
  • Independent consumer groups that can't interfere with other tenants
  • Per-tenant subject prefixes: tenant.<id>.events.*
// createTenantStream provisions a dedicated JetStream stream for a tenant,
// scoped to that tenant's subject prefix.
func createTenantStream(ctx context.Context, js nats.JetStreamContext, tenantID string) error {
    streamName := fmt.Sprintf("TENANT_%s", strings.ToUpper(tenantID))

    _, err := js.AddStream(&nats.StreamConfig{
        Name:      streamName,
        Subjects:  []string{fmt.Sprintf("tenant.%s.>", tenantID)}, // all of this tenant's subjects
        Retention: nats.LimitsPolicy,
        MaxAge:    7 * 24 * time.Hour, // default retention; tuned per domain
        MaxMsgs:   1_000_000,
        Replicas:  3,               // survive a single-node failure
        Discard:   nats.DiscardOld, // shed oldest messages at the limit
    }, nats.Context(ctx))
    return err
}

This gave us full isolation at the NATS layer. A tenant experiencing high throughput couldn't affect another tenant's consumer lag.

Transactional Outbox Pattern

Stream isolation solved the noisy neighbor problem. But it introduced a new one: how do you guarantee a message gets published to the tenant's stream when the publish is part of a larger database transaction? A temporary NATS outage would fail the entire transaction.

The answer is the transactional outbox pattern:

  1. Events are written to an outbox table within the same database transaction as the business event
  2. A background outbox worker polls for pending messages and publishes to the appropriate tenant stream
  3. Successfully published messages are marked published; failures increment a retry counter with exponential backoff
type OutboxMessage struct {
    ID          uuid.UUID
    TenantID    string
    Subject     string // tenant-scoped NATS subject, e.g. tenant.<id>.events.created
    Payload     []byte
    Status      string // "pending" | "published" | "failed"
    CreatedAt   time.Time
    PublishedAt *time.Time // set once the message reaches the tenant stream
}
💡 Tip: Use a SELECT … FOR UPDATE SKIP LOCKED query so multiple outbox workers can claim rows in parallel without blocking each other. Postgres handles the locking efficiently.

Concurrency Controls

With tenant isolation, we also got per-tenant concurrency controls for free. Each tenant's consumer bounds its in-flight (unacknowledged) messages, so a burst from one tenant doesn't degrade another's processing latency.

// createConsumer subscribes to a tenant's subject space with manual acks,
// capping unacknowledged messages so one tenant can't monopolize workers.
func createConsumer(js nats.JetStreamContext, tenantID string) (*nats.Subscription, error) {
    return js.Subscribe(
        fmt.Sprintf("tenant.%s.>", tenantID),
        handleMessage,
        nats.ManualAck(),
        nats.MaxAckPending(500),      // bound in-flight per consumer
        nats.AckWait(30*time.Second), // redeliver if not acked in time
    )
}

Observability

We instrumented the full pipeline with OpenTelemetry:

  • Trace spans from event creation -> outbox write -> NATS publish -> consumer processing
  • Custom metrics: consumer lag per tenant, outbox queue depth, publish error rate
  • Per-tenant SLO tracking via distributed tracing dashboards
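The per-tenant lag metric boils down to a simple derivation from JetStream sequence numbers: the stream's last published sequence minus the consumer's last delivered sequence (both available from the stream and consumer info APIs; JetStream also reports a pending count directly, so this sketch just shows the arithmetic):

```go
package main

import "fmt"

// consumerLag derives a tenant's backlog from JetStream sequence numbers:
// the stream's last published sequence minus the consumer's last delivered
// sequence, guarding against transient races where delivered runs ahead.
func consumerLag(streamLastSeq, consumerDeliveredSeq uint64) uint64 {
	if consumerDeliveredSeq >= streamLastSeq {
		return 0
	}
	return streamLastSeq - consumerDeliveredSeq
}

func main() {
	// e.g. stream at seq 1200, consumer delivered through 950
	fmt.Println(consumerLag(1200, 950)) // prints "250"
}
```

Emitting this gauge per tenant stream is what makes the per-tenant degradation visible that an aggregate metric would mask.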

This paid off when one tenant's consumer stalled on a malformed event: we could trace the exact message and consumer state without grepping logs.

Results

  • Eliminated cross-tenant interference entirely
  • Significant reduction in p95 event processing latency
  • Outbox pattern achieved zero message loss across an extended production run
  • Tenant onboarding fully automated, with stream creation triggered on account provisioning webhook

Lessons Learned

  • Dynamic stream creation is fine. NATS handles hundreds of streams without issue. Don't pre-create everything.
  • Monitor consumer lag per stream, not aggregate. An aggregate metric masks per-tenant degradation.
  • The outbox worker needs its own dead-letter strategy. Eventually-failing messages should move to a separate table and alert, not retry forever.
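The dead-letter strategy from the last point can be sketched as a retry ceiling plus an atomic move to a quarantine table. The attempt limit and table names (outbox, outbox_dead_letter) are illustrative assumptions:

```go
package main

import "fmt"

// maxAttempts is an illustrative ceiling; beyond it a message is moved to
// the dead-letter table (and alerted on) instead of retrying forever.
const maxAttempts = 8

// shouldDeadLetter is checked by the outbox worker after each failed publish.
func shouldDeadLetter(attempts int) bool {
	return attempts >= maxAttempts
}

// deadLetterSQL atomically moves an exhausted row out of the hot outbox
// table in a single statement. Table names are hypothetical.
const deadLetterSQL = `
WITH moved AS (
    DELETE FROM outbox WHERE id = $1 RETURNING *
)
INSERT INTO outbox_dead_letter SELECT * FROM moved`

func main() {
	fmt.Println(shouldDeadLetter(8)) // prints "true": at the ceiling, quarantine and alert
}
```

The CTE keeps the delete and insert in one statement, so a crashed worker can't leave a message in both tables or in neither.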

Trade-offs

The stream-per-tenant model increases NATS server resource usage linearly with tenant count. At the scale we operated this was fine; at significantly higher tenant counts we'd evaluate NATS subject-based namespacing within fewer streams, with application-level isolation. The total infrastructure cost was higher than a single shared stream, but the operational clarity was worth it. When a tenant has a problem, we know exactly which stream to look at.