Troubleshooting
Start here when something’s not working. If this page doesn’t cover your case, check the FAQ or open a GitHub Discussion.
Agent won’t connect
Symptom
Host shows Offline or Connecting… indefinitely.
Checklist
- Agent service running?
    systemctl status dockmesh-agent
    journalctl -u dockmesh-agent -n 100
- Can the agent resolve the server DNS?
    # On the agent host
    getent hosts dockmesh.example.com
- Can it reach the server port?
    openssl s_client -connect dockmesh.example.com:8443 -servername dockmesh.example.com
- Is the certificate valid? If you rotated the CA, re-enroll the agent:
    # Get a new enrollment token from the UI
    dockmesh-agent enroll --server https://dockmesh.example.com --token <new-token>
- Clock skew? mTLS is sensitive to clock drift. Both sides should run NTP:
    timedatectl status
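The checklist can also be run in one pass from the agent host. The following is a minimal sketch, not something dockmesh ships: `run_check` and `agent_checks` are hypothetical helper names, and the hostname and port are the example values used above. The TLS step only shows that a handshake completes; it does not prove the agent's own certificate is still accepted.

```shell
#!/bin/sh
# Run one check quietly and report PASS or FAIL with a label.
run_check() {
  label=$1; shift
  if "$@" </dev/null >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Walk the checklist in order. SERVER and PORT default to the example values.
agent_checks() {
  SERVER=${1:-dockmesh.example.com}
  PORT=${2:-8443}
  run_check "agent service active" systemctl is-active --quiet dockmesh-agent
  run_check "DNS resolves $SERVER" getent hosts "$SERVER"
  run_check "TLS handshake on $SERVER:$PORT" \
    openssl s_client -connect "$SERVER:$PORT" -servername "$SERVER"
  run_check "clock synchronized" \
    sh -c 'timedatectl status | grep -Eq "synchronized: yes"'
}
```

Calling `agent_checks` with no arguments uses the example server; pass your own host and port as the first and second arguments. The first FAIL line is usually the place to start.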
Common root causes
- Firewall rule change blocking outbound 8443
- TLS cert expired on server (uncommon, auto-renews if ACME)
- Agent cert revoked (check Hosts → revoked list on server)
- Network split between server and agent subnets
Stack deploy fails
Symptom
Deploy logs show an error and the stack status goes to error.
Check the log output first
The streaming deploy log has the real cause. Common ones:
- pull access denied: the image is private and the registry credentials aren’t configured. See Images → Registry auth.
- port is already allocated: another container is using the host port. Find it under Containers → filter by port, then either stop the existing container or change the port in the new stack.
- driver failed programming external connectivity: usually means the host ran out of available ports in the ephemeral range, or iptables is misconfigured. Restart the Docker daemon on that host.
- network <name> declared as external, but could not be found: the external network doesn’t exist. Create it first (Networks → New network) or remove external: true.
- no space left on device: the host disk is full, usually under /var/lib/docker. Prune images and volumes via dockmesh, or clean up host logs.
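If you have saved a deploy log and want a quick pointer to a host-side command, the error strings listed above can be matched with a small helper. This is an illustrative sketch: `deploy_hint` is a hypothetical name, and the suggested commands are standard Docker and coreutils calls run on the affected host, not dockmesh features.

```shell
#!/bin/sh
# Map one line of deploy log output to a host-side command worth running next.
deploy_hint() {
  case "$1" in
    *"pull access denied"*)
      echo "check registry credentials (Images -> Registry auth)" ;;
    *"port is already allocated"*)
      echo "docker ps --filter publish=<port>" ;;
    *"declared as external, but could not be found"*)
      echo "docker network ls" ;;
    *"no space left on device"*)
      echo "df -h /var/lib/docker && docker system df" ;;
    *)
      echo "no known pattern; read the full deploy log" ;;
  esac
}
```

For example, `deploy_hint "Bind for 0.0.0.0:8080 failed: port is already allocated"` prints the `docker ps` filter to run, with `<port>` left for you to fill in.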
SSO login fails
Symptom
Clicking the SSO button sends you to the IdP, you log in, come back, and see “Authentication failed” or get bounced to the login page.
Checklist
- Redirect URI matches exactly? The URI in your IdP config must match <your-dockmesh-url>/auth/oidc/callback character-for-character. http vs https, trailing slash, port: all must match.
- Clock skew? OIDC tokens have short expiry (usually 60s). If server and IdP clocks differ by more than that, tokens are rejected.
- Group claim present? If you use group mappings, the ID token must include the groups claim. Some IdPs require enabling “groups scope” explicitly.
- Logs on the dockmesh server:
    journalctl -u dockmesh | grep -i oidc
  Look for a specific error like invalid token signature, missing claim, or discovery failed.
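The first two checklist items are mechanical enough to script. The sketch below assumes nothing about dockmesh internals: `redirect_uri_matches` and `clock_skew_seconds` are hypothetical helper names, and the 60-second threshold in the comment is the typical value quoted above, not a setting read from your IdP.

```shell
#!/bin/sh
# Succeed only if the two URIs are identical, character for character.
redirect_uri_matches() {
  [ "$1" = "$2" ]
}

# Print the absolute difference, in seconds, between two Unix timestamps.
clock_skew_seconds() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  echo "$diff"
}

# Example calls (values are placeholders):
#   redirect_uri_matches "https://dockmesh.example.com/auth/oidc/callback" \
#                        "https://dockmesh.example.com/auth/oidc/callback/" \
#     || echo "mismatch"          # the trailing slash alone breaks the match
#   clock_skew_seconds "$(date +%s)" "$idp_epoch"   # compare against ~60
```

Here `idp_epoch` stands for the IdP's current time as a Unix timestamp, obtained however your IdP exposes it.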
Slow UI
Symptom
Pages take seconds to load.
Diagnose
- Server load?
    top      # check dockmesh CPU/mem
    iostat   # check disk wait
- Database size?
    ls -lh /opt/dockmesh/data/dockmesh.db
  If it’s > 1 GB, consider enabling audit log retention (see Audit Log).
Common fixes
- Vacuum the SQLite DB if fragmentation is high:
    sqlite3 /opt/dockmesh/data/dockmesh.db "VACUUM;"
- Migrate to PostgreSQL for fleets > 200 hosts. Set DOCKMESH_DB_URL=postgres://....
- Reduce stats retention in Settings if disk I/O is the bottleneck.
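One rough way to judge whether a VACUUM is worth running is the share of free pages in the database file, which SQLite reports through the `freelist_count` and `page_count` pragmas. The sketch below uses the database path from this page; the helper names are hypothetical, and free pages are only a proxy for fragmentation, so treat the number as a hint rather than a threshold dockmesh defines.

```shell
#!/bin/sh
# Percentage of pages that are free, given free and total page counts.
free_page_percent() {
  free=$1; total=$2
  if [ "$total" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( free * 100 / total ))
}

# Read both counts from a database file with the sqlite3 CLI.
db_free_page_percent() {
  db=${1:-/opt/dockmesh/data/dockmesh.db}
  free=$(sqlite3 "$db" "PRAGMA freelist_count;")
  total=$(sqlite3 "$db" "PRAGMA page_count;")
  free_page_percent "$free" "$total"
}
```

A high percentage suggests VACUUM will shrink the file noticeably. Keep in mind that VACUUM rebuilds the database into a temporary copy, so it can need free disk space of up to about twice the size of the original file while it runs.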
Backup fails
Symptom
Backup job shows failed.
- Job log — click the failed run, read the error
- Target still reachable? — test in Backups → Targets → [target] → Test connection
- Disk space on target — SFTP/NAS with a full disk silently fails
- Encryption passphrase known? — restore tests require it; rotating it orphans old backups
Common errors
- dial tcp ... i/o timeout: target host is unreachable (firewall? DNS?)
- permission denied: credentials have read but not write access on target
- pre-backup hook exited 1: the hook script failed (check the hook command/image)
Stack migration fails
Symptom
Migration aborts partway, stack is back on source host.
Diagnose
- Pre-flight: did any check fail? Volume size mismatch is common.
- Network: check bandwidth between source and destination; migrations of 100+ GB volumes can take hours on slow links.
- Destination disk full mid-transfer: pre-flight checks free space, but if something else fills the disk during the transfer, the migration aborts.
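Before retrying a large migration, it can help to confirm free space on the destination yourself. This is a minimal sketch run on the destination host; `free_kb` and `has_room` are hypothetical helper names, and /var/lib/docker in the example call is an assumption about where the volumes will land.

```shell
#!/bin/sh
# Print the free space, in KiB, of the filesystem that holds a path.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the filesystem holding a path has at least the needed KiB free.
has_room() {
  [ "$(free_kb "$1")" -ge "$2" ]
}

# Example: require 100 GiB (104857600 KiB) before starting.
#   has_room /var/lib/docker 104857600 || echo "not enough space on destination"
```

Leaving headroom above the volume size guards against other writers filling the disk while the transfer is in progress.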
Automatic rollback should leave you in the starting state. If it doesn’t, clean up manually:
    # On source
    docker compose -f /opt/dockmesh/stacks/<host>/<stack>/compose.yaml up -d
    # On destination
    docker compose -f /opt/dockmesh/stacks/<host>/<stack>/compose.yaml down
Alerts not firing
- Rule enabled? (Alerts → Rules → check toggle)
- Mute rule active? (Alerts → Mutes — any matching mute?)
- Channel working? — Settings → Channels → [channel] → Send test
- Cooldown? — a recent fire for the same resource suppresses re-alerts
Logs aren’t streaming
Symptom
You open Container → Logs and nothing shows up, or the stream stops after a few seconds.
- Click Reconnect — WebSocket may have dropped
- Check agent version on the host (old agents had a streaming bug fixed in 1.0.0-beta.3)
- If behind a corporate proxy, WebSocket might be stripped — contact your network admin
Can’t log in
Symptom
The login page rejects your credentials, or returns:
account temporarily locked — try again in N minutes
Five failed login attempts in a row trigger a 15-minute lockout per user (default — configurable via auth.lockout_max_attempts and auth.lockout_duration_minutes). This usually comes from:
- Browser autofill replaying a stale saved password for the same URL
- Copy-paste from a password manager that got the wrong entry
- An actual forgotten password
- Someone on your network probing with wrong credentials (rare for homelab, more relevant on public-internet deploys)
- Wait 15 minutes. The lockout is time-based, so no admin action is needed. The login error tells you how long is left.
- If you know the password but the lockout is annoying:
    sudo dockmesh admin unlock --user admin
  Clears the lockout without touching the password.
- If you forgot the password:
    cd /var/lib/dockmesh   # service's working directory
    sudo dockmesh admin reset-password --user admin --password 'NewSecure#2026'
  Also clears the lockout as a side effect.
- If the login page rejects you silently (not a lockout error):
  - Delete the saved password for the dockmesh URL in your browser’s password manager, then type the password by hand
  - Try an Incognito/Private window to rule out autofill and cookie issues
Prevention
- Set a strong, memorable password you type rather than auto-fill
- Tune auth.lockout_max_attempts up (e.g. to 10) under Settings → System if you find 5 too strict
Getting help
If none of the above fixes your issue:
- GitHub Discussions — searchable, other users can help, answers benefit everyone
- GitHub Issues — for bugs (include dockmesh version, OS, minimal reproduction)
- Security issues only: security@dockmesh.dev
Always include:
- dockmesh version (dockmesh --version)
- OS + Docker version
- Relevant log snippets (journalctl or in-UI logs)
- Steps to reproduce
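To gather most of this in one go, a small script can collect the version and log details into a single text block. This is only a sketch: `section` and `support_report` are hypothetical helper names, the `dockmesh` unit name is taken from the journalctl example earlier on this page, and the 50-line log window is an arbitrary choice.

```shell
#!/bin/sh
# Print a heading, then the command's output, or a note if the command fails.
section() {
  title=$1; shift
  echo "== $title =="
  "$@" 2>&1 || echo "(unavailable: $1)"
}

# Collect the details a bug report should include.
support_report() {
  section "dockmesh version" dockmesh --version
  section "OS" uname -sr
  section "Docker version" docker version --format '{{.Server.Version}}'
  section "Recent dockmesh logs" journalctl -u dockmesh -n 50 --no-pager
}
```

Run `support_report > report.txt` and read through the file before posting it, since log lines can contain hostnames or other details you may not want public.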
See also
- FAQ: common conceptual questions
- Hardening: preventive measures
- Upgrade Guide: if the issue started after an upgrade