Node.js · WebSocket · Networking · From scratch

I built a WebSocket server from raw TCP to actually understand it

Published August 5, 2025 · 7 min read

Why build something that already exists?

Honest answer: because I'd been using WebSockets for years and I still couldn't answer "what actually is a frame?" or "why does masking exist?" I knew the browser API. I knew how to wire up socket.io. I had zero understanding of what was happening between the client and the server.

So I built noServer: a zero-dependency Node.js HTTP and WebSocket server on raw TCP. No express, no ws library, no socket.io — just Node's net module and the RFC.

Here's what I actually learned by doing it.

Raw TCP is where you start

Node's net.createServer() gives you a socket for each connection. That socket is bytes in, bytes out. No protocol, no framing — just a stream.
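A minimal sketch of that starting point, not the actual noServer code. The `createRawServer` name and the callback shape are made up for this example; the server is built but not started, so the snippet has no side effects.

```javascript
const net = require('net');

// Hypothetical helper: builds (but does not start) a TCP server that hands
// every raw chunk to the caller.
function createRawServer(onChunk) {
  return net.createServer((socket) => {
    // 'data' fires with whatever bytes happened to arrive. Chunk boundaries
    // have nothing to do with request or message boundaries.
    socket.on('data', (chunk) => onChunk(socket, chunk));
    socket.on('error', () => socket.destroy());
  });
}

// Usage (the port is arbitrary):
// createRawServer((socket, chunk) => console.log(chunk.length, 'bytes')).listen(8080);
```

Everything else in the post is about turning those unstructured chunks into requests and frames.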

First task: parse the incoming bytes as HTTP to figure out if this is a regular request or an upgrade request. WebSocket starts with a normal HTTP GET that has specific headers. You're basically writing a minimal HTTP parser: read until you hit `\r\n\r\n`, then extract headers.
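Roughly, that parser can look like the sketch below. It assumes the bytes received so far have been accumulated into one Buffer; `parseRequestHead` and the shape of its return value are illustrative names, and it skips validation a real server would need.

```javascript
// Hypothetical helper: returns null until the full header block has arrived.
function parseRequestHead(buffer) {
  const end = buffer.indexOf('\r\n\r\n');
  if (end === -1) return null; // headers incomplete, keep buffering

  const [requestLine, ...headerLines] = buffer.toString('latin1', 0, end).split('\r\n');
  const [method, path, version] = requestLine.split(' ');

  const headers = {};
  for (const line of headerLines) {
    const colon = line.indexOf(':');
    if (colon === -1) continue; // skip malformed lines in this sketch
    // Header names are case-insensitive, so normalise them to lowercase.
    headers[line.slice(0, colon).trim().toLowerCase()] = line.slice(colon + 1).trim();
  }

  // Anything after the blank line is the body (or, after an upgrade, frame data).
  return { method, path, version, headers, rest: buffer.subarray(end + 4) };
}
```

Returning `null` for an incomplete head matters because TCP can deliver the request in several chunks.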

The handshake

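The client asks to upgrade with an ordinary GET request carrying a few specific headers: Upgrade: websocket, Connection: Upgrade, Sec-WebSocket-Version: 13, and a Sec-WebSocket-Key holding 16 random bytes, base64-encoded.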
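For reference, such a request looks roughly like the string below, written as JavaScript so it can be fed straight into a parser. The key is the sample value from RFC 6455; the host, path, and the `isUpgradeRequest` helper are placeholders invented for this example.

```javascript
// Sample upgrade request. Key from the RFC 6455 example; host and path are placeholders.
const sampleUpgradeRequest = [
  'GET /chat HTTP/1.1',
  'Host: example.com',
  'Upgrade: websocket',
  'Connection: Upgrade',
  'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==',
  'Sec-WebSocket-Version: 13',
  '',
  '',
].join('\r\n');

// Hypothetical check over already-parsed headers with lowercased names.
function isUpgradeRequest(method, headers) {
  const connection = (headers['connection'] || '').toLowerCase();
  return (
    method === 'GET' &&
    (headers['upgrade'] || '').toLowerCase() === 'websocket' &&
    // Connection may be a comma-separated list, e.g. "keep-alive, Upgrade".
    connection.split(',').some((token) => token.trim() === 'upgrade') &&
    typeof headers['sec-websocket-key'] === 'string' &&
    headers['sec-websocket-version'] === '13'
  );
}
```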

Your server has to respond with 101 Switching Protocols. The critical part is the Sec-WebSocket-Accept header, which is derived from the client's key.
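The derivation is small enough to show in full. This is a sketch using Node's built-in crypto module; the GUID is the constant from RFC 6455, while `computeAccept` and `buildHandshakeResponse` are names chosen for this example.

```javascript
const crypto = require('crypto');

// Fixed GUID defined by RFC 6455.
const WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// base64(SHA-1(client key + GUID))
function computeAccept(clientKey) {
  return crypto.createHash('sha1').update(clientKey + WS_GUID).digest('base64');
}

// Hypothetical helper: the smallest response that completes the handshake.
function buildHandshakeResponse(clientKey) {
  return [
    'HTTP/1.1 101 Switching Protocols',
    'Upgrade: websocket',
    'Connection: Upgrade',
    `Sec-WebSocket-Accept: ${computeAccept(clientKey)}`,
    '',
    '',
  ].join('\r\n');
}

// With the sample key from the RFC:
// computeAccept('dGhlIHNhbXBsZSBub25jZQ==') → 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='
```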

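The accept value is the base64-encoded SHA-1 hash of the client's Sec-WebSocket-Key with a fixed GUID appended: 258EAFA5-E914-47DA-95CA-C5AB0DC85B11. That magic GUID is hardcoded in RFC 6455. Its purpose is to prove the server actually speaks WebSocket, so a non-WebSocket server can't accidentally accept an upgrade request: if a regular HTTP server just echoed back the key, it wouldn't match the expected SHA-1 + GUID result.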

After the 101 response, the connection is upgraded. No more HTTP. From here it's the WebSocket framing protocol.

Frames: the actual protocol

A WebSocket message is broken into frames. Each frame has a binary header followed by a payload. This is the thing I had no model for before building it.

The header:

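  • Byte 0: bit 7 = FIN (is this the last frame of a message?), bits 4-6 = RSV1-3 (reserved for extensions, normally 0), bits 0-3 = opcode. 0x0 = continuation, 0x1 = text frame, 0x2 = binary frame, 0x8 = close, 0x9 = ping, 0xA = pong.
  • Byte 1: bit 7 = MASK flag, bits 0-6 = payload length (0-125 is the literal length; 126 and 127 mean an extended length follows).
  • Then 0, 2, or 8 bytes of extended length, big-endian: a 16-bit value after 126, a 64-bit value after 127.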
  • Then 4 bytes masking key (if masked).
  • Then the payload.
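The layout above translates almost line for line into code. This is a sketch rather than the real noServer parser: `parseFrame` and its return shape are invented for this example, it ignores the RSV bits, and it leaves the payload masked so that unmasking stays a separate step.

```javascript
// Hypothetical parser: returns null if the buffer does not yet hold a whole frame.
function parseFrame(buffer) {
  if (buffer.length < 2) return null;

  const fin = (buffer[0] & 0x80) !== 0;    // byte 0, bit 7
  const opcode = buffer[0] & 0x0f;         // byte 0, bits 0-3
  const masked = (buffer[1] & 0x80) !== 0; // byte 1, bit 7
  let length = buffer[1] & 0x7f;           // byte 1, bits 0-6
  let offset = 2;

  if (length === 126) {
    if (buffer.length < offset + 2) return null;
    length = buffer.readUInt16BE(offset);
    offset += 2;
  } else if (length === 127) {
    if (buffer.length < offset + 8) return null;
    // Number() is only exact up to 2^53 - 1; a real server would cap sizes long before that.
    length = Number(buffer.readBigUInt64BE(offset));
    offset += 8;
  }

  let maskKey = null;
  if (masked) {
    if (buffer.length < offset + 4) return null;
    maskKey = buffer.subarray(offset, offset + 4);
    offset += 4;
  }

  if (buffer.length < offset + length) return null;
  return {
    fin,
    opcode,
    maskKey,
    payload: buffer.subarray(offset, offset + length), // still masked when maskKey is set
    bytesUsed: offset + length, // lets the caller slice this frame off its buffer
  };
}

// The unmasked "Hello" text frame from RFC 6455:
// parseFrame(Buffer.from([0x81, 0x05, 0x48, 0x65, 0x6c, 0x6c, 0x6f])).payload.toString() → 'Hello'
```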

Parsing this by hand means I now have this entire structure in my head. Reading the RFC was fine. Writing the parser made it permanent.

Masking and why it exists

Client frames MUST be masked. Server frames MUST NOT be masked. This isn't optional.

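Masking is XOR with a 4-byte key: byte i of the payload is XORed with byte (i mod 4) of the key. Applying the same operation a second time restores the original bytes, so masking and unmasking are the same function.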
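In code, the whole mechanism is one loop. A minimal sketch, with `applyMask` as a name picked for this example:

```javascript
// XOR each payload byte with the key byte at position i mod 4.
// The same call both masks and unmasks.
function applyMask(payload, maskKey) {
  const out = Buffer.alloc(payload.length);
  for (let i = 0; i < payload.length; i++) {
    out[i] = payload[i] ^ maskKey[i % 4];
  }
  return out;
}

// The masked "Hello" example from RFC 6455:
// applyMask(Buffer.from([0x7f, 0x9f, 0x4d, 0x51, 0x58]),
//           Buffer.from([0x37, 0xfa, 0x21, 0x3d])).toString() → 'Hello'
```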

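It's not encryption, and it gives your data no confidentiality: the key travels in the frame right next to the payload. Its purpose is to stop attacker-chosen payloads from looking like valid HTTP to intermediate proxies that might cache or modify them (the cache-poisoning attack described in RFC 6455, section 10.3). Because the client picks a fresh random key for every frame, whoever wrote the payload can't predict the bytes on the wire, so a proxy can't be tricked into reading them as HTTP.

Before I built this I'd seen "masking" mentioned and assumed it was there to protect my data. It's not. It's a proxy-confusion prevention mechanism.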
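The other half of the rule is that frames going from server to client are sent unmasked, which makes the sending side simpler than the parsing side. A rough sketch of an encoder for a single unfragmented frame, assuming the payload is already a Buffer; `encodeFrame` is a hypothetical name, not something from noServer or ws.

```javascript
// Hypothetical encoder for one unfragmented, unmasked server-to-client frame.
function encodeFrame(opcode, payload) {
  const length = payload.length;
  let header;
  if (length < 126) {
    header = Buffer.alloc(2);
    header[1] = length; // MASK bit stays 0
  } else if (length <= 0xffff) {
    header = Buffer.alloc(4);
    header[1] = 126; // 16-bit extended length follows
    header.writeUInt16BE(length, 2);
  } else {
    header = Buffer.alloc(10);
    header[1] = 127; // 64-bit extended length follows
    header.writeBigUInt64BE(BigInt(length), 2);
  }
  header[0] = 0x80 | opcode; // FIN set, RSV bits zero
  return Buffer.concat([header, payload]);
}

// encodeFrame(0x1, Buffer.from('Hello')) produces the bytes 81 05 48 65 6c 6c 6f
```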

Fragmentation

A message can span multiple frames. The FIN bit tells you if more frames are coming. If FIN is 0, buffer the payload and wait. If FIN is 1, concatenate everything you've buffered and emit the message.

Continuation frames have opcode 0. I got this wrong the first time and couldn't figure out why messages over a certain size were corrupted. Turned out I was treating the continuation frames as new messages because I forgot to check for opcode 0.
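The reassembly logic, including the opcode 0 check I originally missed, can be sketched like this. It assumes frames have already been parsed and unmasked into `{ fin, opcode, payload }` objects; the `MessageAssembler` class is made up for this example and simply ignores control frames rather than handling them.

```javascript
// Hypothetical reassembler; expects frames shaped like { fin, opcode, payload }.
class MessageAssembler {
  constructor() {
    this.opcode = null; // opcode of the message in progress (0x1 text, 0x2 binary)
    this.chunks = [];
  }

  // Returns { opcode, data } when a message completes, otherwise null.
  push(frame) {
    if (frame.opcode >= 0x8) {
      // Control frames (close, ping, pong) may arrive between fragments and
      // must not disturb the message being assembled.
      return null;
    }
    if (frame.opcode === 0x0) {
      // Continuation: only valid while a message is in progress.
      if (this.opcode === null) throw new Error('continuation without a start frame');
    } else {
      // Only the first frame of a message carries the real opcode.
      if (this.opcode !== null) throw new Error('new message before previous one finished');
      this.opcode = frame.opcode;
    }

    this.chunks.push(frame.payload);
    if (!frame.fin) return null; // FIN = 0: buffer and wait

    const message = { opcode: this.opcode, data: Buffer.concat(this.chunks) };
    this.opcode = null;
    this.chunks = [];
    return message;
  }
}
```

Feeding it a text frame "Hel" with FIN = 0 followed by a continuation frame "lo" with FIN = 1 yields a single text message, "Hello".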

What I know now

The ws library is about 2000 lines. After building noServer I understand what most of those lines are doing. I understand why ping/pong exists (keep-alive and dead connection detection), why the FIN bit exists (large message fragmentation), and why masking is client-only.

I'm still using ws in production. This wasn't about replacing libraries — it was about understanding them.

If you use WebSockets and you've never read RFC 6455, build a toy implementation. You'll understand the protocol in an afternoon in a way that no amount of documentation reading will give you.