Agent Protocol

Most users never need to understand the wire protocol. But when debugging connectivity, planning network architecture, or evaluating the security model, knowing how it works matters.

The basics

Transport: WebSocket over TLS (wss://)
Authentication: Mutual TLS (both sides present certs)
Direction: Outbound-only from the agent
Wire format: JSON frames inside the WebSocket — one frame envelope, request/response correlated by ID

Certificate authority

Every dockmesh server has its own internal CA. On first boot it generates an ECDSA P-256 CA keypair and a matching server leaf cert. Both land as PEM files in the data directory next to the SQLite DB:

File	Purpose
`agents-ca.crt`	CA public certificate, 10-year validity
`agents-ca.key`	CA private key (mode 0400)
`agents-server.crt`	Server cert for the `:8443` listener, 5-year validity
`agents-server.key`	Matching server private key (mode 0400)

No separate database encryption layer protects the CA key — its security relies on filesystem permissions (0700 on the data dir + 0400 on the key file). Back up the data directory to back up the CA.

Enrolment

When you add a host:

Server generates a one-time bootstrap token (cryptographically random, hashed before storage) and stores it on the new agent row in pending status.
The UI shows the install command — curl -fsSL https://<server>/install/agent.sh?token=<token> | sudo bash — which you paste on the target host.
The install script fetches the agent binary from the same server, drops a systemd unit, and starts the agent.
The agent calls the enrolment endpoint with the token. The server matches the hash, sees pending, generates an ECDSA P-256 client keypair, signs a cert with the CA (1-year validity, agent’s UUID as Common Name), and returns cert + key + CA bundle.
The agent writes those into its data dir as agent.crt, agent.key, ca.crt, plus the dial URL into agent.url.
The token is invalidated server-side. The agent now uses its client cert for every subsequent connection.

Token TTL: one-time use, no time-based expiry. If you don’t use it in the next 10 minutes you can still use it next week. Rotate it manually if you suspect it leaked (Agents page → host → Rotate enrolment token).

Connection lifecycle

The agent dials wss://<server>:8443/connect as soon as its service starts:

TLS handshake — agent presents agent.crt, server verifies against the CA, server presents agents-server.crt which the agent verifies against the pinned ca.crt.
WebSocket upgrade — HTTP/1.1 Upgrade.
agent.hello — agent announces itself with its UUID, version, hostname, Docker version, OS info.
server.welcome — server records the connection, marks the host online.
Ready — request frames flow.

On any disconnect the agent reconnects with exponential backoff. Server-side, a pingInterval of 30 seconds + a heartbeatGrace of 60 seconds mean the host flips to offline ~60s after the agent stops responding.

Message framing

The wire format is JSON, not binary. Every frame looks like:

{
  "type": "req.containers.list",
  "id":   "8d21e3f0…",
  "payload": { /* type-specific JSON */ }
}

type — string identifier (e.g. agent.hello, req.containers.list, res.containers.list).
id — opaque correlation id; responses echo the matching request’s id so the server’s waiting goroutine can route the reply.
payload — JSON value; shape depends on type.

The choice of JSON-over-WebSocket (rather than gRPC or a binary protocol) is deliberate: easier to inspect with Wireshark or a proxy, simpler to debug from journalctl, no codegen step. The per-message overhead is negligible for the kind of small control-plane traffic dockmesh sends.

Multiplexing

A single WebSocket carries many concurrent operations. There is no explicit “stream id” channel — multiplexing happens at the frame level: request/response frames share the connection, and longer-running streams (log tails, exec sessions, stats subscriptions) are layered on by sending frames with the same id until the requester sends a close frame.

When you close a log viewer tab in the UI the server sends a close frame; the agent stops tailing the container’s log.

Frame types

dockmesh has ~40 frame types covering lifecycle + every operation. A few representative ones:

Frame	Direction	Purpose
`agent.hello`	agent → server	Initial announcement on connect
`server.welcome`	server → agent	Acknowledgement + initial config
`agent.heartbeat`	agent → server	Sent in response to `server.ping`
`server.ping`	server → agent	Every 30s; missing replies for 60s flip the host offline
`req.containers.list`	server → agent	Ask the agent for its container list
`req.containers.start`	server → agent	Lifecycle action
`req.images.list`	server → agent	Resource listing
`req.volume.browse`	server → agent	List one directory inside a named volume
`req.deploy.stack`	server → agent	Apply a compose project
`req.agent.upgrade`	server → agent	Hot-swap the agent binary

The full list lives in internal/agents/protocol.go in the source tree — that file is the single source of truth.

Cert rotation

Client certs are issued for 1 year. There is no auto-renewal: the agent does not currently send a renewal request as expiry approaches. Long-running agents need to be re-enrolled manually before their cert expires — rotate the token, run the install one-liner again, the cert is replaced. An agent with an expired cert sees TLS handshake failures and is marked offline; no silent breakage. Auto-renewal is on the roadmap.

Revocation

When you remove a host from dockmesh, the agent row is deleted (or marked revoked). On the next handshake the server checks the row in the database and refuses the connection. The agent logs the rejection and exits. No formal CRL or OCSP — the check is a single DB lookup per connection attempt, takes effect immediately.

Agent upgrade

The upgrade controller (see Upgrade Guide) dispatches a req.agent.upgrade frame to drifted agents with the download URL + checksum. The agent downloads the new binary from <server>/install/dockmesh-agent-linux-<arch>, verifies the SHA, writes it atomically, and re-executes. Open operations (log tails, exec) are interrupted but the UI’s WebSocket clients reconnect automatically and resume.

Why outbound-only?

Tradition says “server-to-agent” models pull from a central server. dockmesh flips this for practical reasons:

No inbound firewall holes on agent hosts (NAT, corporate firewalls, home routers all work unchanged)
Works behind CGNAT (common in home labs)
Works from anywhere the agent has internet access (coffee shop, traveling laptop, edge locations)
Agent knows when it has Docker running — no need for server to poll

The tradeoff: the server is reachable to all agents, which means protecting it matters. See Hardening.

Why not gRPC?

dockmesh deliberately picked JSON-over-WebSocket over gRPC:

WebSocket survives corporate proxies that strip HTTP/2 — gRPC does not always
Easier debugging — journalctl, Wireshark, browser DevTools, anything that reads text speaks the protocol
Smaller agent binary (no gRPC runtime)
Performance is more than sufficient for the small control-plane messages dockmesh sends; binary serialisation would be over-engineering at this scale

Troubleshooting protocol issues

See Troubleshooting → Agent won’t connect.

For deep debugging, enable debug logging:

# On the agent
systemctl set-environment DOCKMESH_LOG_LEVEL=debug
systemctl restart dockmesh-agent
journalctl -u dockmesh-agent -f