Thiago Finch
Tags: WebSockets, Fintech, Node.js, Architecture

Pix, WebSockets, and the Architecture of Real-Time Money

Polling Is a Tax

Before the rebuild, Stone's merchant dashboard refreshed transaction data by polling every 30 seconds. For a merchant doing high-volume transactions on Black Friday, this meant the reconciliation screen could be up to 30 seconds stale. Merchants called support. Support escalated.

The obvious fix was to poll faster. We tried 5-second intervals, which multiplied request volume by six. The API team came to talk to us.

Designing the Stream

The real-time architecture we landed on has three components:

1. Event producer: A Node.js service subscribes to Stone's internal Kafka topic for payment events. For each event, it publishes to a Redis channel keyed by merchant ID.

2. WebSocket gateway: A stateless gateway process manages client connections and subscribes each connected client to their merchant's Redis channel. When a message arrives, it fans out to all open sessions for that merchant.

3. Client state machine: The React client keeps a local transaction log capped at a fixed size, in the spirit of a ring buffer. New events are prepended at the head; the oldest fall off the tail. This caps memory usage regardless of session duration.

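// Inside the dashboard component; `ws` is the open WebSocket to the gateway.
const MAX_EVENTS = 500 // cap on the local transaction log
const [events, setEvents] = useState<Transaction[]>([])

ws.onmessage = (e) => {
  const tx: Transaction = JSON.parse(e.data)
  // Prepend the newest event, then drop anything past the cap.
  setEvents(prev => [tx, ...prev].slice(0, MAX_EVENTS))
}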
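
To make the first two components concrete, here is a minimal sketch of the publish and fan-out path. It is an illustration under stated assumptions, not the production code: an in-memory class stands in for Redis pub/sub so the example runs without a server, and every name in it (InMemoryPubSub, MerchantPaymentEvent, channelFor, publishPaymentEvent, attachSession, the "merchant:" channel prefix) is hypothetical.

```typescript
// Hypothetical event shape; the real payload fields are not given in this post.
type MerchantPaymentEvent = { merchantId: string; txId: string; amountCents: number }
type SendFn = (message: string) => void

// In-memory stand-in for Redis pub/sub (an assumption, so the sketch is runnable).
class InMemoryPubSub {
  private channels = new Map<string, Set<SendFn>>()

  // Registers a handler on a channel and returns a function that removes it.
  subscribe(channel: string, handler: SendFn): () => void {
    const handlers = this.channels.get(channel) ?? new Set<SendFn>()
    handlers.add(handler)
    this.channels.set(channel, handlers)
    return () => { handlers.delete(handler) }
  }

  // Delivers a message to every handler on the channel; returns how many received it.
  publish(channel: string, message: string): number {
    const handlers = this.channels.get(channel)
    if (!handlers) return 0
    handlers.forEach(handler => handler(message))
    return handlers.size
  }
}

// Channels are keyed by merchant ID; the prefix is an arbitrary choice.
const channelFor = (merchantId: string): string => `merchant:${merchantId}`

// Event producer: called once per payment event consumed from the upstream topic.
function publishPaymentEvent(bus: InMemoryPubSub, event: MerchantPaymentEvent): number {
  return bus.publish(channelFor(event.merchantId), JSON.stringify(event))
}

// Gateway: each client session joins its merchant's channel.
// `send` stands in for the session's WebSocket send method.
function attachSession(bus: InMemoryPubSub, merchantId: string, send: SendFn): () => void {
  return bus.subscribe(channelFor(merchantId), send)
}
```

With two sessions attached for the same merchant, one call to publishPaymentEvent reaches both, and sessions for other merchants see nothing. The function returned by attachSession is what a gateway would call when the socket closes.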

What Broke in Production

The gateway worked perfectly in staging with synthetic load. On the first Pix-heavy Monday in November 2020, we hit a file descriptor limit: the gateway was opening too many Redis subscription handles per process. The fix was to multiplex: one Redis subscriber per merchant ID per process, shared across all of that merchant's connections, instead of one subscription per connection.
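
The multiplexing idea can be sketched as a small reference-counted registry. This is a sketch under assumptions, not the gateway's actual code: UpstreamSubscriber is a hypothetical minimal interface that a real Redis client would be adapted to, and MerchantMultiplexer and ChannelListener are illustrative names.

```typescript
type ChannelListener = (message: string) => void

// Hypothetical minimal interface over a pub/sub client (not a real library API).
interface UpstreamSubscriber {
  subscribe(channel: string, onMessage: ChannelListener): void
  unsubscribe(channel: string): void
}

class MerchantMultiplexer {
  private upstream: UpstreamSubscriber
  private listeners = new Map<string, Set<ChannelListener>>()

  constructor(upstream: UpstreamSubscriber) {
    this.upstream = upstream
  }

  // Adds one client connection. Only the first connection for a merchant
  // opens an upstream subscription; later ones share it.
  add(merchantId: string, listener: ChannelListener): () => void {
    const existing = this.listeners.get(merchantId)
    if (existing) {
      existing.add(listener)
    } else {
      const created = new Set<ChannelListener>([listener])
      this.listeners.set(merchantId, created)
      // A single upstream handler fans each message out to all local listeners.
      this.upstream.subscribe(merchantId, msg => created.forEach(l => l(msg)))
    }
    return () => this.remove(merchantId, listener)
  }

  // Removes a connection and closes the upstream subscription when none remain.
  private remove(merchantId: string, listener: ChannelListener): void {
    const set = this.listeners.get(merchantId)
    if (!set || !set.delete(listener)) return
    if (set.size === 0) {
      this.listeners.delete(merchantId)
      this.upstream.unsubscribe(merchantId)
    }
  }
}
```

The point of the design is that upstream handles grow with the number of distinct merchants connected to a process, not with the number of open sockets.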

We also hadn't accounted for reconnection storms: when the gateway restarted during deploys, thousands of clients reconnected simultaneously. Exponential backoff with jitter on the client side, combined with connection rate limiting on the gateway, tamed the spike.
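
For the client side, one common way to do this is "full jitter": pick a random delay between zero and an exponentially growing ceiling. The sketch below assumes that variant; the post does not specify which jitter scheme or which constants were used, so backoffDelayMs, connectWithBackoff, the 500 ms base, the 30-second cap, and the attempt limit are all illustrative choices.

```typescript
// Full-jitter exponential backoff: the delay is drawn uniformly from
// [0, min(capMs, baseMs * 2^attempt)). `random` is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt))
  return Math.floor(random() * ceiling)
}

// Retries `connect` (a hypothetical function that resolves once the socket
// is open and rejects on failure), waiting a jittered delay between attempts.
async function connectWithBackoff<T>(
  connect: () => Promise<T>,
  maxAttempts: number = 8,
  sleep: (ms: number) => Promise<void> = ms => new Promise<void>(resolve => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1")
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect()
    } catch (err) {
      lastError = err
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt))
    }
  }
  throw lastError
}
```

The randomness is what breaks up the storm: without it, every client that dropped at the same moment would retry at the same moments too, just further apart.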

The Outcome

The rebuilt dashboard now processes up to 50,000 events per minute per gateway instance, horizontally scaled behind a load balancer. Merchant support tickets related to reconciliation accuracy dropped 74% in the first quarter after launch.

Real-time isn't free, but it's worth paying for when the alternative is letting merchants make decisions on data that is up to 30 seconds old.