Woke up today to the homeserver being unresponsive. Couldn’t SSH, no video out when I connected a monitor, and even the reset button didn’t do anything. Had to hold the power button to shut it down.

/var/log/syslog doesn’t show anything interesting other than the issue happened at just after 4am. Log

2026-02-27T03:55:01.481794-08:00 blackbox CRON[1743418]: (www-data) CMD (/usr/bin/php8.3 /mnt/MONSTERDRIVE/pixelfeddata/pixelfed/artisan schedule:run >> /dev/null 2>&1)
2026-02-27T04:00:00.198504-08:00 blackbox smartd[2126]: Device: /dev/sdd [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
2026-02-27T04:00:00.291853-08:00 blackbox systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
2026-02-27T04:00:00.298344-08:00 blackbox systemd[1]: sysstat-collect.service: Deactivated successfully.
2026-02-27T04:00:00.298523-08:00 blackbox systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
2026-02-27T04:00:00.299608-08:00 blackbox kernel: kauditd_printk_skb: 8 callbacks suppressed
2026-02-27T04:00:00.299613-08:00 blackbox kernel: audit: type=1130 audit(1772193600.298:798916): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
2026-02-27T04:00:00.299615-08:00 blackbox kernel: audit: type=1131 audit(1772193600.298:798917): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
2026-02-27T04:00:01.923610-08:00 blackbox kernel: audit: type=1101 audit(1772193601.922:798918): pid=1744810 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='op=PAM:accounting grantors=pam_permit acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:01.923614-08:00 blackbox kernel: audit: type=1103 audit(1772193601.922:798919): pid=1744810 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='op=PAM:setcred grantors=pam_permit,pam_cap acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:01.923615-08:00 blackbox kernel: audit: type=1006 audit(1772193601.922:798920): pid=1744810 uid=0 subj=unconfined old-auid=4294967295 auid=33 tty=(none) old-ses=4294967295 ses=50544 res=1
2026-02-27T04:00:01.923615-08:00 blackbox kernel: audit: type=1300 audit(1772193601.922:798920): arch=c000003e syscall=1 success=yes exit=2 a0=7 a1=7fff81d75200 a2=2 a3=0 items=0 ppid=2654 pid=1744810 auid=33 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=50544 comm="cron" exe="/usr/sbin/cron" subj=unconfined key=(null)
2026-02-27T04:00:01.923616-08:00 blackbox kernel: audit: type=1327 audit(1772193601.922:798920): proctitle=2F7573722F7362696E2F43524F4E002D66002D50
2026-02-27T04:00:01.924259-08:00 blackbox CRON[1744811]: (www-data) CMD (/usr/bin/php8.3 /mnt/MONSTERDRIVE/pixelfeddata/pixelfed/artisan schedule:run >> /dev/null 2>&1)
2026-02-27T04:00:01.924614-08:00 blackbox kernel: audit: type=1105 audit(1772193601.923:798921): pid=1744810 uid=0 auid=33 ses=50544 subj=unconfined msg='op=PAM:session_open grantors=pam_loginuid,pam_env,pam_env,pam_permit,pam_umask,pam_unix,pam_limits acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:01.925610-08:00 blackbox kernel: audit: type=1110 audit(1772193601.924:798922): pid=1744811 uid=0 auid=33 ses=50544 subj=unconfined msg='op=PAM:setcred grantors=pam_permit,pam_cap acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:02.357616-08:00 blackbox kernel: audit: type=1104 audit(1772193602.356:798923): pid=1744810 uid=0 auid=33 ses=50544 subj=unconfined msg='op=PAM:setcred grantors=pam_permit acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T09:23:35.786375-08:00 blackbox systemd-modules-load[904]: Inserted module 'dm_multipath'

Would something like this be a direct hardware failure? Like a power supply hiccup or something? It happening at 4am coincides with my electric car starting to charge, but the server is on a dedicated 20A circuit and behind a battery backup. I also don’t see any power issues on my Sense monitor at that time though it has limited resolution.

Mainboard is a Supermicro H13SAE-MF and I’m using ECC RAM.

I’ve been running this hardware for over a year and never had this issue, but I’m running out of places to look.

Might be time to finally get IPMI working.

  • empireOfLove2@lemmy.dbzer0.com
    link
    fedilink
    English
    arrow-up
    7
    ·
    4 hours ago

    The reset button is basically just a signal to the CPU/BIOS that it should wipe memory and begin the boot process from scratch. If it was not working, that indicates the CPU was hard locked and not responding to any sort of input, not just an os fault The power button sends an actual trigger signal to the PSU through the ATX connector so it bypasses any mainboard lock.

    Random shit happens, see if it does it again.
    My go to for random stability issues is to always run a full deep memtest to look for bad RAM and then a CPU stress test to see if it’s a random thermal or core issue. More often than not I find stability problems just with these two steps.