Troubleshooting

Start here when something’s not working. If this page doesn’t cover your case, check the FAQ or open a GitHub Discussion.

Host shows Offline or Connecting… indefinitely.

  1. Agent service running?
    systemctl status dockmesh-agent
    journalctl -u dockmesh-agent -n 100
  2. Can the agent resolve the server DNS?
    # On the agent host
    getent hosts dockmesh.example.com
  3. Can it reach the server port?
    openssl s_client -connect dockmesh.example.com:8443 -servername dockmesh.example.com
  4. Is the certificate valid? If you rotated the CA, the agent must re-enrol. Get a fresh enrolment token from Agents → host → Rotate enrolment token, then re-run the install one-liner on the agent host (it overwrites the existing cert):
    curl -fsSL "https://<server>/install/agent.sh?token=<new-token>" | sudo bash
  5. Clock skew? mTLS is sensitive to clock drift. Both sides should run NTP:
    timedatectl status

Common causes:

  • Firewall rule change blocking outbound 8443
  • TLS cert expired on the server (uncommon; it auto-renews if you use ACME)
  • Agent revoked on the server (Agents page shows a red revoked badge)
  • Network split between server and agent subnets
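
The numbered checks above can be run in one pass from the agent host. The following is a minimal sketch, not an official dockmesh tool: the function names report and run_checks are illustrative, SERVER and PORT default to the placeholder values used on this page, and the clock check assumes a systemd host where timedatectl show is available.

```shell
#!/bin/sh
# Sketch: run the agent-side connectivity checks in order.
# SERVER/PORT are the placeholder values from this page; override them for your setup.
SERVER="${SERVER:-dockmesh.example.com}"
PORT="${PORT:-8443}"

# Print one result line for a check, based on the exit status passed in.
report() {  # usage: report <label> <exit-status>
  if [ "$2" -eq 0 ]; then
    echo "ok    $1"
  else
    echo "FAIL  $1"
  fi
}

run_checks() {
  systemctl is-active --quiet dockmesh-agent
  report "agent service is active" $?

  getent hosts "$SERVER" >/dev/null
  report "DNS resolves $SERVER" $?

  # TLS connection to the server port; stdin is closed so s_client exits,
  # and timeout stops it hanging when a firewall silently drops packets.
  timeout 10 openssl s_client -connect "$SERVER:$PORT" -servername "$SERVER" \
    </dev/null >/dev/null 2>&1
  report "TLS connection to $SERVER:$PORT" $?

  # timedatectl prints "yes" when the system clock is NTP-synchronised.
  [ "$(timedatectl show -p NTPSynchronized --value 2>/dev/null)" = "yes" ]
  report "clock is NTP-synchronised" $?
}

# Only run the checks when asked to, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then
  run_checks
fi
```

Saved as, say, agent-check.sh, it runs with sh agent-check.sh --run. A FAIL line only tells you which step to look at; the individual commands in the list above give the detail.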

Deploy logs show an error and the stack status goes to error.

The streaming deploy log has the real cause. Common ones:

pull access denied — image is private and the registry credentials aren’t configured. See Images → Registry auth.

port is already allocated — another container is using the host port. Find it: Containers → filter by port. Either stop the existing container or change the port in the new stack.

driver failed programming external connectivity — usually means the host ran out of available ports in the ephemeral range, or iptables is misconfigured. Restart the Docker daemon on that host.

network <name> declared as external, but could not be found — the external network isn’t there. Create it first (Networks → New network) or remove external: true.

no space left on device — the host disk is full. Usually /var/lib/docker — prune images/volumes via dockmesh or clean up host logs.
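
If you have shell access to the affected host, each of these errors has a plain Docker CLI command worth trying first. The sketch below maps a log line to such a command. The function name deploy_hint is hypothetical, and the commands are ordinary Docker or systemd commands, assumed here as host-side equivalents of the UI steps above rather than anything dockmesh itself documents.

```shell
#!/bin/sh
# Sketch: map a deploy-log line to a host-side command worth trying first.
# The docker commands are plain Docker CLI equivalents, not dockmesh features.
deploy_hint() {  # usage: deploy_hint "<log line>"
  case "$1" in
    *"pull access denied"*)
      echo "configure registry credentials (Images -> Registry auth)" ;;
    *"port is already allocated"*)
      echo "docker ps --filter publish=<port>" ;;
    *"driver failed programming external connectivity"*)
      echo "sudo systemctl restart docker" ;;
    *"declared as external, but could not be found"*)
      echo "docker network create <name>" ;;
    *"no space left on device"*)
      echo "docker system df" ;;
    *)
      echo "no known hint; read the full deploy log" ;;
  esac
}
```

For example, deploy_hint "Bind for 0.0.0.0:8080 failed: port is already allocated" prints docker ps --filter publish=<port>; substitute the real port number before running it.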

Clicking the SSO button sends you to the IdP, you log in, come back, and see “Authentication failed” or get bounced to the login page.

  1. Redirect URI matches exactly? The URI in your IdP must match <your-dockmesh-url>/api/v1/auth/oidc/<slug>/callback character-for-character, where <slug> is the provider’s slug from the Authentication page. Scheme (http vs https), trailing slash, port, and slug must all match. The exact URI is shown at the top of the provider form in the UI; copy it from there.
  2. Clock skew? OIDC tokens have short expiry (usually 60s). If server and IdP clocks differ by more than that, tokens are rejected.
  3. Group claim present? If you use group mappings, the ID token must include the groups claim. Some IdPs require enabling “groups scope” explicitly.
  4. Logs on the dockmesh server:
    journalctl -u dockmesh | grep -i oidc
    Look for a specific error such as invalid token signature, missing claim, or discovery failed.
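
Redirect URI mismatches are easy to miss by eye. The sketch below builds the expected URI from the pattern in step 1 and compares it with whatever is registered in the IdP. The helper names callback_uri and uri_matches, and the example URL and slug in the usage note, are hypothetical.

```shell
#!/bin/sh
# Sketch: build the expected OIDC redirect URI and compare it with the IdP's value.
# The URI pattern is the one given in step 1; the helper names are hypothetical.
callback_uri() {  # usage: callback_uri <dockmesh-base-url> <provider-slug>
  base="${1%/}"   # drop one trailing slash so the join does not produce "//"
  echo "$base/api/v1/auth/oidc/$2/callback"
}

# Exact, case-sensitive string comparison, mirroring the
# character-for-character rule from step 1.
uri_matches() {  # usage: uri_matches <expected> <registered-in-idp>
  [ "$1" = "$2" ]
}
```

For example, uri_matches "$(callback_uri https://dockmesh.example.com keycloak)" "https://dockmesh.example.com/api/v1/auth/oidc/keycloak/callback/" returns a non-zero status, because the registered value carries a trailing slash.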

Pages take seconds to load.

  1. Server load?
    top # check dockmesh CPU/mem
    iostat # check disk wait
  2. Database size?
    ls -lh /var/lib/dockmesh/data/dockmesh.db # Linux default
    ls -lh /usr/local/var/dockmesh/data/dockmesh.db # macOS default
    If it’s larger than 1 GB, consider enabling audit log retention (see Audit Log).

Possible fixes:

  • Vacuum the SQLite DB if fragmentation is high:
    systemctl stop dockmesh
    sqlite3 /var/lib/dockmesh/data/dockmesh.db "VACUUM;"
    systemctl start dockmesh
  • Reduce stats retention in Settings if disk I/O is the bottleneck.
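
To judge whether a VACUUM is likely to help, you can look at how many pages SQLite holds on its freelist. This is a rough proxy for reclaimable space, offered as a sketch: free_page_percent and db_report are hypothetical helper names, DB defaults to the Linux path shown above, and the page does not say what percentage counts as high, so treat any threshold as your own judgment.

```shell
#!/bin/sh
# Sketch: report database size and the share of free pages before deciding to VACUUM.
# DB defaults to the Linux path shown above; the helper names are hypothetical.
DB="${DB:-/var/lib/dockmesh/data/dockmesh.db}"

# Integer percentage of pages on the SQLite freelist (space VACUUM can reclaim).
free_page_percent() {  # usage: free_page_percent <freelist_count> <page_count>
  if [ "$2" -le 0 ]; then
    echo 0
  else
    echo $(( $1 * 100 / $2 ))
  fi
}

db_report() {
  size_bytes=$(wc -c < "$DB")
  free=$(sqlite3 "$DB" 'PRAGMA freelist_count;')
  pages=$(sqlite3 "$DB" 'PRAGMA page_count;')
  echo "size: $size_bytes bytes"
  echo "free pages: $(free_page_percent "$free" "$pages")%"
}

# Only query the database when asked to, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then
  db_report
fi
```

Both PRAGMA statements only read from the database, but the VACUUM itself should still be run with the service stopped, as in the commands above.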

Backup job shows failed.

  1. Job log — click the failed run, read the error
  2. Target still reachable? — test in Backups → Targets → [target] → Test connection
  3. Disk space on target — SFTP/NAS with a full disk silently fails
  4. Encryption passphrase known? — restore tests require it; rotating it orphans old backups

Common errors:

  • dial tcp ... i/o timeout — target host is unreachable (firewall? DNS?)
  • permission denied — credentials have read but not write access on target
  • pre-backup hook exited 1 — the hook script failed (check the hook command/image)
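
Because a full disk on the target may not produce a clear error, it can help to check free space on the target directly. The sketch below is one way to do that with standard tools; disk_use_percent is a hypothetical helper, and the SSH user, host, and path in the usage note are placeholders.

```shell
#!/bin/sh
# Sketch: print how full a filesystem is, as an integer percentage.
# Run it on the backup target itself (for example over SSH).
disk_use_percent() {  # usage: disk_use_percent <path>
  # -P forces the POSIX one-line-per-filesystem format; column 5 is "Capacity".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

On a remote target the same check is a one-liner, for example ssh backup@nas.example.com 'df -P /srv/backups'. A value close to 100 in the Capacity column points at check 3 above.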

Migration aborts partway, stack is back on source host.

  1. Pre-flight — did any check fail? Volume size mismatch is common.
  2. Network — bandwidth between source and destination; migrations of 100+ GB volumes can take hours on slow links
  3. Destination disk full mid-transfer — pre-flight checks free space, but if something else fills it up mid-transfer, migration aborts

Automatic rollback should leave you in the starting state. If it doesn’t, clean up manually against the on-disk stack tree (one directory per stack, no per-host subdirectory):

# On the source host — restart the stack from its compose file
docker compose -f /var/lib/dockmesh/stacks/<stack>/compose.yaml up -d
# On the destination host — tear down whatever the migration left behind
docker compose -f /var/lib/dockmesh/stacks/<stack>/compose.yaml down
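
After the manual cleanup it is worth confirming what each host actually runs. The sketch below wraps docker compose ps for a given stack. The helper names compose_file and stack_status are hypothetical, and STACKS_DIR defaults to the directory used in the commands above.

```shell
#!/bin/sh
# Sketch: check what a stack looks like on a host after manual cleanup.
# The directory layout is the one used above; pass your stack's name as the argument.
STACKS_DIR="${STACKS_DIR:-/var/lib/dockmesh/stacks}"

# Path of the compose file for a stack (one directory per stack).
compose_file() {  # usage: compose_file <stack>
  echo "$STACKS_DIR/$1/compose.yaml"
}

# List the stack's containers, including stopped ones, with their state.
stack_status() {  # usage: stack_status <stack>
  docker compose -f "$(compose_file "$1")" ps --all
}
```

Run stack_status with the stack's name on both hosts: the source should list the stack's containers as running, and the destination should list none.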

An alert rule doesn’t fire, or fires without sending a notification.

  1. Rule enabled? Check the row’s enabled toggle under Alerts → Rules.
  2. Rule muted? In the same row, check whether muted_until is in the future (mutes are per-rule, not global).
  3. Channel working? Use Alerts → Channels → row → Send test.
  4. Cooldown still active? A recent fire on the same rule suppresses re-notification for cooldown_seconds.
  5. Container filter actually matches? The rule’s container_filter glob is run against container names; double-check that a pattern such as paperless-* matches what docker ps shows on the affected host.
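
The container filter check can be done from a shell on the affected host. The sketch below rests on a loud assumption: that dockmesh globs behave like shell globs (* and ?), which this page does not confirm. The helper names matches_glob and list_matches are hypothetical.

```shell
#!/bin/sh
# Sketch: test a container_filter pattern against real container names.
# Assumption: dockmesh globs behave like shell globs (* and ?); verify for your version.
matches_glob() {  # usage: matches_glob <name> <pattern>
  case "$1" in
    $2) return 0 ;;   # unquoted on purpose so the pattern is treated as a glob
    *)  return 1 ;;
  esac
}

# Print the names of running containers on this host that match the pattern.
list_matches() {  # usage: list_matches '<pattern>'
  docker ps --format '{{.Names}}' | while IFS= read -r name; do
    if matches_glob "$name" "$1"; then
      echo "$name"
    fi
  done
}
```

Quote the pattern when calling it, as in list_matches 'paperless-*', so the shell does not expand it against files in the current directory. Empty output means the rule has nothing to watch on that host.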

Open Container → Logs, nothing shows up or stops after a few seconds.

  • Click Reconnect — WebSocket may have dropped
  • Check agent version on the host (old agents had a streaming bug fixed in 1.0.0-beta.3)
  • If behind a corporate proxy, WebSocket might be stripped — contact your network admin

The login page rejects your credentials, or returns:

account temporarily locked — try again in N minutes

Five failed login attempts in a row trigger a 15-minute lockout per user (default — configurable via auth.lockout_max_attempts and auth.lockout_duration_minutes). This usually comes from:

  • Browser autofill replaying a stale saved password for the same URL
  • Copy-paste from a password manager that got the wrong entry
  • An actual forgotten password
  • Someone on your network probing with wrong credentials (rare for homelab, more relevant on public-internet deploys)

To regain access:

  1. Wait 15 minutes — the lockout is time-based, no admin action needed. The login error tells you how long is left.

  2. If you know the password but the lockout is annoying:

    sudo dockmesh admin unlock --user admin

    Clears the lockout without touching the password.

  3. If you forgot the password:

    sudo dockmesh admin reset-password --user admin --password 'NewSecure#2026'

    This rewrites the password hash only — it does not clear an active lockout. If the account is also locked, run sudo dockmesh admin unlock --user admin afterwards (or wait the lockout out).

  4. If the login page rejects you silently (not a lockout error):

    • Delete the saved password for the dockmesh URL in your browser’s password manager, then type the password by hand
    • Try an Incognito/Private window — rules out autofill + cookie issues

To avoid repeat lockouts:

  • Set a strong, memorable password you type rather than auto-fill
  • Tune the threshold up (defaults: 5 attempts / 15 min lockout) under Authentication → Password policy if you find it too strict — the settings are auth.lockout_max_attempts and auth.lockout_duration_minutes

If none of the above fixes your issue:

  • GitHub Discussions — searchable, other users can help, answers benefit everyone
  • GitHub Issues — for bugs (include dockmesh version, OS, minimal reproduction)
  • Security issues only: security@dockmesh.dev

Always include:

  • dockmesh version (dockmesh --version)
  • OS + Docker version
  • Relevant log snippets (journalctl or in-UI logs)
  • Steps to reproduce
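
Most of the details listed under "Always include" can be gathered with one script. The following is a sketch under stated assumptions: try and collect_report are hypothetical helper names, the log command assumes the server runs as the dockmesh systemd unit used elsewhere on this page, and the docker version format string is standard Docker CLI usage rather than anything dockmesh-specific.

```shell
#!/bin/sh
# Sketch: gather the details listed above into one text report.
# Assumes a systemd host; "try" and "collect_report" are hypothetical helper names.

# Run a command if it exists; otherwise say so instead of failing.
try() {  # usage: try <command> [args...]
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1
  else
    echo "($1 not available)"
  fi
}

collect_report() {
  echo "== dockmesh version =="
  try dockmesh --version
  echo "== OS =="
  try uname -sr
  echo "== Docker version =="
  try docker version --format '{{.Server.Version}}'
  echo "== recent dockmesh logs =="
  try journalctl -u dockmesh -n 100 --no-pager
}

# Only collect when asked to, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then
  collect_report
fi
```

Saved as, say, report.sh, it can be run with sh report.sh --run > dockmesh-report.txt. Read through the output and remove hostnames, tokens, or other secrets before posting it publicly.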

See also:

  • FAQ: common conceptual questions
  • Hardening: preventive measures
  • Upgrade Guide: if the issue started after an upgrade