Netconsole to the rescue!

My work laptop recently started randomly crashing on me.  The warranty has long since expired, and I’m the IT department for my little company, so I’m on my own.  The symptoms were “hiccups” in mouse movement, followed shortly by a total freeze, followed about 30 seconds later by a panic/reboot.  My first suspicion was a new video driver I’ve been trying.  It’s beta software, so problems aren’t that unusual.  In any case, the crash sequence was leaving no evidence in my system logs, so I had nothing useful to report to the developers.

My Linux kernel is custom-compiled, and although I have many debugging tools compiled in, I didn’t have anything that could save the messages from my dying laptop.  Yesterday, I took the time to dig around in the documentation, and built a new kernel with netconsole turned on.  I configured it to send my console log to my office server.  As luck would have it, my laptop crashed about two minutes after I turned the remote logging on.  And the remote log worked.

Surprise!  It wasn’t a driver error!  My laptop’s dying messages were reporting corrupted transfers between my CPU and my memory chips.  Hardware.  For the specific failure, there are only three possibilities: bad CPU, bad memory, or bad motherboard.  First, I opened the case and swapped the two memory chips.  This appeared to help, as I didn’t have another crash for the rest of the day, nor overnight.  (My Linux install does the virus-scan for my Windows partition every night, ensuring that any virus that does get into my Windows box can’t modify the scanner.)

But I’m not out of the woods, as it did crash one more time today.  I have memory chips on order, so I can definitively rule out memory issues.  If that doesn’t work, I guess I’ll be shopping for a new laptop.

After the fold, I describe how I set up remote logging to accommodate my laptop’s road warrior use case.

When I compiled my new kernel with CONFIG_NETCONSOLE, I also turned on CONFIG_NETCONSOLE_DYNAMIC.  This makes netconsole targets configurable at runtime through configfs.  Details are in the kernel sources under Documentation/networking/netconsole.txt.  To use it, I had to mount configfs, which I did with the following line in /etc/fstab:

configfs /sys/kernel/config configfs defaults 0 0
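The fstab entry only takes effect at the next boot (or the next mount of that entry), so it can be handy to check whether configfs is already mounted before relying on the hook.  What follows is only a sketch and not part of my setup: the helper name configfs_mounted is made up for illustration, and it reads /proc/mounts unless another mounts table is passed as its second argument.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a configfs filesystem is mounted at the
# given mount point according to a mounts table (default: /proc/mounts).
configfs_mounted() {
    mount_point=${1:-/sys/kernel/config}
    mounts_file=${2:-/proc/mounts}
    # Fields in /proc/mounts: device mountpoint fstype options dump pass
    awk -v mp="$mount_point" \
        '$2 == mp && $3 == "configfs" { found = 1 } END { exit !found }' \
        "$mounts_file"
}

# Usage (as root), mounting by hand only when needed:
#   configfs_mounted || mount -t configfs configfs /sys/kernel/config
```

Once the mount is in place and the kernel has CONFIG_NETCONSOLE_DYNAMIC, the directory /sys/kernel/config/netconsole should exist, which is what the hook script tests for.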

I then wrote a hook script for dhcpcd that takes advantage of my customized networks.  My DHCP servers send the NTP server address option, so that Windows boxes will automatically time-sync to my local server.  Most DHCP servers don’t do this.  My new hook script looks for this setting, and assumes that my NTP server will also be the remote logger.  Here’s the script (put it in /lib/dhcpcd/dhcpcd-hooks/):

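# Set up or demolish netconsole on an interface
#
# $1 is the target's configfs directory.  The log server is assumed to be
# the first IPv4 address in the DHCP-supplied NTP server list.
setup_netconsole() {
 local target= rmac=
 # A target must be disabled while its settings are changed
 echo 0 >"$1/enabled"
 target=$(echo "$new_ntp_servers" |tr -s ' ' '\n' |grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' |head -n 1)
 if [[ -n "$target" ]] ; then
  # Wait for one ping (1 second at most) so the ARP cache holds the server's MAC
  ping -c 1 -W 1 "$target" &>/dev/null
  echo "$interface"      >"$1/dev_name"
  echo 6665              >"$1/local_port"
  echo "$new_ip_address" >"$1/local_ip"
  echo 514               >"$1/remote_port"
  echo "$target"         >"$1/remote_ip"
  # Look up only the chosen server, not the whole NTP server list
  rmac=$(arp -n "$target" |grep -E -i -o '([0-9a-f]{2}:){5}[0-9a-f]{2}' |head -n 1)
  if [[ -n "$rmac" ]] ; then
   echo "$rmac"          >"$1/remote_mac"
   echo 1                >"$1/enabled"
  fi
 fi
}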
# Hooks are sourced by dhcpcd-run-hooks, so skip quietly instead of calling
# "exit" (which would stop any later hooks) when netconsole's configfs tree is missing.
if [[ -d /sys/kernel/config/netconsole ]] ; then
 case "$reason" in
  BOUND|INFORM|REBIND|REBOOT|RENEW|TIMEOUT|STATIC)
   [[ -d /sys/kernel/config/netconsole/$interface ]] || mkdir "/sys/kernel/config/netconsole/$interface"
   setup_netconsole "/sys/kernel/config/netconsole/$interface"
   ;;
  PREINIT|EXPIRE|FAIL|IPV4LL|NAK|NOCARRIER|RELEASE|STOP)
   [[ -d /sys/kernel/config/netconsole/$interface ]] && echo 0 >"/sys/kernel/config/netconsole/$interface/enabled"
   ;;
 esac
fi
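The two lookups the hook depends on, picking a log server address out of the NTP server list and finding that server's MAC address in the ARP table, are easy to try out by themselves.  Here is a minimal sketch of both, assuming that dhcpcd passes the NTP servers as a space-separated list and that the ARP table is in the column layout printed by "arp -n".  The function names first_ipv4 and mac_for_ip are made up for illustration and do not appear in the hook.

```shell
#!/bin/sh
# Hypothetical helper: print the first dotted-quad IPv4 address found in a
# space-separated list of servers, or nothing if there is none.
first_ipv4() {
    echo "$1" | tr -s ' ' '\n' \
        | grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' | head -n 1
}

# Hypothetical helper: read "arp -n" style output on stdin and print the MAC
# address recorded for the given IP, or nothing if the entry is missing or
# incomplete.  The exact match on the first column keeps 192.168.1.1 from
# also matching 192.168.1.10.
mac_for_ip() {
    awk -v ip="$1" '$1 == ip' \
        | grep -E -i -o '([0-9a-f]{2}:){5}[0-9a-f]{2}' | head -n 1
}

# Usage:
#   target=$(first_ipv4 "$new_ntp_servers")
#   rmac=$(arp -n | mac_for_ip "$target")
```

If either helper prints nothing, netconsole should stay disabled, since the target needs both a remote IP and a remote MAC before it can send anything useful.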

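On the server, I first used netcat for ad-hoc monitoring.  I’ve now set up permanent monitoring with sysklogd.  Find your syslogd start-up file, and add “-r” to its command line so that syslogd accepts messages from the network; netconsole’s packets arrive on UDP port 514, the remote_port set in the script.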
