Troubleshooting

Start here when something’s not working. If this page doesn’t cover your case, check the FAQ or open a GitHub Discussion.

Agent won’t connect

Symptom

Host shows Offline or Connecting… indefinitely.

Checklist

Agent service running?

```
systemctl status dockmesh-agent
journalctl -u dockmesh-agent -n 100
```

Can the agent resolve the server DNS?

```
# On the agent host
getent hosts dockmesh.example.com
```

Can it reach the server port?

```
openssl s_client -connect dockmesh.example.com:8443 -servername dockmesh.example.com
```

Is the certificate valid? If you rotated the CA, re-enroll the agent:

```
# Get a new enrollment token from the UI
dockmesh-agent enroll --server https://dockmesh.example.com --token <new-token>
```

Clock skew? mTLS is sensitive to clock drift. Both sides should run NTP:
```
timedatectl status
```
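
The checklist above can be run as one pass. The following is a minimal sketch, not an official dockmesh tool: the check and agent_checks helper names are made up for illustration, the host and port are the placeholder values used on this page, and the TCP probe assumes bash (for /dev/tcp) plus the coreutils timeout command. It skips the certificate step, since verifying an mTLS setup depends on your CA.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of dockmesh.
SERVER_HOST="${SERVER_HOST:-dockmesh.example.com}"   # placeholder from this page
SERVER_PORT="${SERVER_PORT:-8443}"

# Run a command quietly; print "ok" or "FAIL" next to a label.
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'ok    %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
        return 1
    fi
}

# Run the checks in the same order as the checklist.
agent_checks() {
    check "agent service active" systemctl is-active --quiet dockmesh-agent
    check "DNS resolves $SERVER_HOST" getent hosts "$SERVER_HOST"
    check "TCP $SERVER_HOST:$SERVER_PORT reachable" \
        timeout 5 bash -c "exec 3<>/dev/tcp/$SERVER_HOST/$SERVER_PORT"
    check "clock synchronized (NTP)" \
        sh -c 'timedatectl show -p NTPSynchronized --value | grep -qx yes'
}
```

Call agent_checks on the agent host; the first FAIL line points at the checklist item to investigate.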

Common root causes

Firewall rule change blocking outbound 8443
TLS certificate expired on the server (uncommon; it auto-renews when issued through ACME)
Agent cert revoked (check Hosts → revoked list on server)
Network split between server and agent subnets

Stack deploy fails

Symptom

Deploy logs show an error and the stack status goes to error.

Check the log output first

The streaming deploy log has the real cause. Common ones:

pull access denied: the image is private and registry credentials aren’t configured. See Images → Registry auth.

port is already allocated: another container is using the host port. Find it under Containers → filter by port, then either stop the existing container or change the port in the new stack.

driver failed programming external connectivity: usually means the host ran out of available ports in the ephemeral range, or iptables is misconfigured. Restart the Docker daemon on that host.

network <name> declared as external, but could not be found: the external network doesn’t exist. Create it first (Networks → New network) or remove external: true.

no space left on device: the host disk is full, usually under /var/lib/docker. Prune unused images and volumes via dockmesh, or clean up host logs.
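
Two of the errors above, port is already allocated and no space left on device, can also be confirmed from a shell on the affected host. A minimal sketch, assuming the Docker CLI and POSIX df and awk are available; the function names are illustrative and the 90% threshold is an arbitrary example, not a dockmesh default.

```shell
# List running containers that publish a given host port.
containers_on_port() {
    docker ps --filter "publish=$1" --format '{{.Names}}: {{.Ports}}'
}

# Extract the Capacity column (as an integer) from `df -P` output on stdin.
used_pct_from_df() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the filesystem holding a path is at or above a threshold (default 90%).
check_disk() {
    path=$1
    limit=${2:-90}
    pct=$(df -P "$path" | used_pct_from_df)
    if [ "$pct" -ge "$limit" ]; then
        echo "WARN $path is ${pct}% full"
        return 1
    fi
    echo "ok $path is ${pct}% full"
}
```

For example, containers_on_port 8080 lists whatever already publishes host port 8080, and check_disk /var/lib/docker reports how full the filesystem behind the Docker data directory is.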

SSO login fails

Symptom

Clicking the SSO button sends you to the IdP, you log in, come back, and see “Authentication failed” or get bounced to the login page.

Checklist

Redirect URI matches exactly? The URI in your IdP config must match <your-dockmesh-url>/auth/oidc/callback character-for-character. http vs https, trailing slash, port — all must match.
Clock skew? OIDC tokens have short expiry (usually 60s). If server and IdP clocks differ by more than that, tokens are rejected.
Group claim present? If you use group mappings, the ID token must include the groups claim. Some IdPs require enabling “groups scope” explicitly.
Logs on the dockmesh server:
```
journalctl -u dockmesh | grep -i oidc
```
Look for a specific error such as invalid token signature, missing claim, or discovery failed.
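
The first two checklist items reduce to a string comparison and a subtraction. A rough sketch with illustrative function names; the commented example assumes GNU date and curl, uses a hypothetical IdP URL, and treats the IdP’s HTTP Date header as a stand-in for its clock.

```shell
# Succeeds only if the two URIs are identical character for character.
redirect_uri_matches() {
    [ "$1" = "$2" ]
}

# Absolute difference, in seconds, between two Unix timestamps.
skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo "$diff"
}

# Example (hypothetical IdP URL; needs GNU date and curl):
#   idp_date=$(curl -sI https://idp.example.com | sed -n 's/^[Dd]ate: //p' | tr -d '\r')
#   skew_seconds "$(date +%s)" "$(date -d "$idp_date" +%s)"
```

A trailing slash is enough to make redirect_uri_matches fail, which mirrors how strictly IdPs compare the callback URI.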

Slow UI

Symptom

Pages take seconds to load.

Diagnose

Server load?

```
top  # check dockmesh CPU/mem
iostat  # check disk wait
```

Database size?
```
ls -lh /opt/dockmesh/data/dockmesh.db
```
If it’s larger than 1 GB, consider enabling audit log retention so that old entries are pruned (see Audit Log).

Common fixes

Vacuum the SQLite DB if fragmentation is high:
```
sqlite3 /opt/dockmesh/data/dockmesh.db "VACUUM;"
```
Migrate to PostgreSQL for fleets > 200 hosts. Set DOCKMESH_DB_URL=postgres://....
Reduce stats retention in Settings if disk I/O is the bottleneck.
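
Fragmentation can be estimated before running VACUUM: SQLite reports how many pages sit on the freelist (PRAGMA freelist_count) out of the total (PRAGMA page_count), and that ratio is a rough lower bound on the share of the file VACUUM would reclaim. A minimal sketch, assuming the sqlite3 CLI is installed; the function names are illustrative, and what counts as high is left to you.

```shell
DB="${DB:-/opt/dockmesh/data/dockmesh.db}"   # path used on this page

# Integer percentage of free pages, given freelist and total page counts.
free_page_pct() {
    if [ "$2" -eq 0 ]; then
        echo 0
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

# Ask SQLite for both counts and print the free-page percentage.
db_free_pct() {
    free=$(sqlite3 "$1" 'PRAGMA freelist_count;')
    total=$(sqlite3 "$1" 'PRAGMA page_count;')
    free_page_pct "$free" "$total"
}
```

Run db_free_pct "$DB" on the server; a result near zero suggests VACUUM will not shrink the file much.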

Backup fails

Symptom

Backup job shows failed.

Check

Job log — click the failed run, read the error
Target still reachable? — test in Backups → Targets → [target] → Test connection
Disk space on target — SFTP/NAS with a full disk silently fails
Encryption passphrase known? — restore tests require it; rotating it orphans old backups

Common errors

dial tcp ... i/o timeout — target host is unreachable (firewall? DNS?)
permission denied — credentials have read but not write access on target
pre-backup hook exited 1 — the hook script failed (check the hook command/image)

Stack migration fails

Symptom

Migration aborts partway and the stack is back on the source host.

Diagnose

Pre-flight — did any check fail? Volume size mismatch is common.
Network — bandwidth between source and destination; migrations of 100+ GB volumes can take hours on slow links
Destination disk full mid-transfer — pre-flight checks free space, but if something else fills it up mid-transfer, migration aborts
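
For the bandwidth point, a back-of-the-envelope estimate helps set expectations. The sketch below assumes decimal units (1 GB = 8000 megabits) and a link that sustains its nominal rate, which real transfers rarely do; the function name is illustrative.

```shell
# Estimated transfer time in seconds for size_gb gigabytes over a link_mbps link.
transfer_seconds() {
    size_gb=$1
    link_mbps=$2
    echo $(( size_gb * 8000 / link_mbps ))
}

transfer_seconds 100 100    # prints 8000 (about 2.2 hours)
transfer_seconds 100 1000   # prints 800 (about 13 minutes)
```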

Automatic rollback should leave you in the starting state. If it doesn’t, clean up manually:

```
# On source
docker compose -f /opt/dockmesh/stacks/<host>/<stack>/compose.yaml up -d

# On destination
docker compose -f /opt/dockmesh/stacks/<host>/<stack>/compose.yaml down
```

Alerts not firing

Check

Rule enabled? (Alerts → Rules → check toggle)
Mute rule active? (Alerts → Mutes — any matching mute?)
Channel working? — Settings → Channels → [channel] → Send test
Cooldown? — a recent fire for the same resource suppresses re-alerts

Logs aren’t streaming

Symptom

You open Container → Logs and either nothing shows up or the stream stops after a few seconds.

Fixes

Click Reconnect — WebSocket may have dropped
Check agent version on the host (old agents had a streaming bug fixed in 1.0.0-beta.3)
If you’re behind a corporate proxy, it may be blocking WebSocket connections; contact your network admin

Can’t log in

Symptom

The login page rejects your credentials, or returns:

account temporarily locked — try again in N minutes

Cause

Five failed login attempts in a row trigger a 15-minute lockout per user (default — configurable via auth.lockout_max_attempts and auth.lockout_duration_minutes). This usually comes from:

Browser autofill replaying a stale saved password for the same URL
Copy-paste from a password manager that got the wrong entry
An actual forgotten password
Someone on your network probing with wrong credentials (rare for homelab, more relevant on public-internet deploys)
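
The time-based lockout can be pictured with a little arithmetic. This illustrates the behavior described above and is not dockmesh’s actual implementation; the function name and the idea of a stored lock timestamp are assumptions, and the default of 15 stands in for auth.lockout_duration_minutes.

```shell
# Minutes left on a lockout, rounded up; prints 0 once it has expired.
# Arguments: lock time and current time as Unix timestamps, then duration in minutes.
lockout_remaining_min() {
    locked_at=$1
    now=$2
    duration_min=${3:-15}
    left=$(( locked_at + duration_min * 60 - now ))
    if [ "$left" -le 0 ]; then
        echo 0
    else
        echo $(( (left + 59) / 60 ))
    fi
}
```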

Fixes

Wait 15 minutes — the lockout is time-based, no admin action needed. The login error tells you how long is left.
If you know the password but the lockout is annoying:
```
sudo dockmesh admin unlock --user admin
```
Clears the lockout without touching the password.

If you forgot the password:

```
cd /var/lib/dockmesh    # service's working directory
sudo dockmesh admin reset-password --user admin --password 'NewSecure#2026'
```

Also clears the lockout as a side effect.

If the login page rejects you silently (not a lockout error):
- Delete the saved password for the dockmesh URL in your browser’s password manager, then type the password by hand
- Try an Incognito/Private window — rules out autofill + cookie issues

Prevention

Set a strong, memorable password you type rather than auto-fill
Tune auth.lockout_max_attempts up (e.g. to 10) under Settings → System if you find 5 too strict

Getting help

If none of the above fixes your issue:

GitHub Discussions — searchable, other users can help, answers benefit everyone
GitHub Issues — for bugs (include dockmesh version, OS, minimal reproduction)
Security issues only: security@dockmesh.dev

Always include:

dockmesh version (dockmesh --version)
OS + Docker version
Relevant log snippets (journalctl or in-UI logs)
Steps to reproduce
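
Gathering those details can be scripted. A minimal sketch with illustrative helper names; it assumes the server runs as the dockmesh systemd unit (as in the journalctl example earlier on this page) and that the Docker CLI is on the PATH. Review the output for secrets before posting it publicly.

```shell
# Print a section header followed by a command's output; note when the command fails.
section() {
    title=$1
    shift
    printf '== %s ==\n' "$title"
    "$@" 2>&1 || printf '(command failed or unavailable: %s)\n' "$1"
}

# Collect the details requested above into one report on stdout.
collect_report() {
    section "dockmesh version" dockmesh --version
    section "OS" uname -srm
    section "Docker version" docker version --format '{{.Server.Version}}'
    section "Recent dockmesh logs" journalctl -u dockmesh -n 50 --no-pager
}
```

Running collect_report > report.txt produces a single file you can trim and attach to a Discussion or Issue.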
