I use GNU/Linux Mint 18.3 with kernel version 4.10.0-42. For the past several weeks, every once in a while my system hangs, without any warning signs beforehand (that I have noticed).

I've tried switching kernel versions (have had 4.4.0 and 4.8.0 before), but to no avail.

What can I do to resolve or circumvent this issue?

Additional information

  • My system’s BIOS is “ASUS UEFI BIOS 3016.”
  • My root is on an SSD which has not seen much write action
  • The hangs did not start occurring (immediately) after some change of hardware.
  • They never seem to happen when I'm at the actual computer, always or almost always when I'm away / asleep. But again, not always, i.e. most days this doesn't happen.
  • I run XFCE with on-board graphics, but I also have an nVIDIA GTX 650 Ti not used for graphics (which is idle when these hangs happen). The nVIDIA driver version is 387.26 now.
  • When the hang occurs, the monitor continues displaying the last image, but nothing is responsive. Ctrl+Alt+Fn doesn't work, and the computer doesn't respond to network traffic.

My machine

(I'll add any additional information below as requested.)

/var/log/syslog before and after last hang:

Jan  7 23:09:55 my_pc smartd[966]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 69
Jan  7 23:39:55 my_pc smartd[966]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 69 to 68
Jan  8 00:03:48 my_pc rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="947" x-info="http://www.rsyslog.com"] start
Jan  8 00:03:48 my_pc rsyslogd: rsyslogd's groupid changed to 108
Jan  8 00:03:48 my_pc rsyslogd: rsyslogd's userid changed to 104

/var/log/syslog before and after second-to-last hang:

Jan  7 16:07:49 my_pc smartd[933]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 111 to 112
Jan  7 16:37:49 my_pc smartd[933]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 59
Jan  7 16:37:49 my_pc smartd[933]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 42 to 41
Jan  7 16:37:49 my_pc smartd[933]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 111
Jan  7 17:07:49 my_pc smartd[933]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 69 to 70
Jan  7 17:07:49 my_pc smartd[933]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 111 to 112
Jan  7 17:37:49 my_pc smartd[933]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 111
Jan  7 17:56:58 my_pc systemd[1]: Starting Daily apt download activities...
Jan  7 17:57:04 my_pc systemd[1]: Started Daily apt download activities.
Jan  7 17:58:05 my_pc inadyn[1376]: .
Jan  7 17:58:05 my_pc inadyn[1376]: Checking for IP# change, connecting to ip1.dynupdate.no-ip.com(34.196.162.199)
Jan  7 17:58:05 my_pc inadyn[1376]: No IP# change detected, still at 11.22.33.44
Jan  7 18:07:49 my_pc smartd[933]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 69
Jan  7 19:09:55 my_pc rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="967" x-info="http://www.rsyslog.com"] start
Jan  7 19:09:55 my_pc rsyslogd: rsyslogd's groupid changed to 108
Jan  7 19:09:55 my_pc rsyslogd: rsyslogd's userid changed to 104

The /dev/sdd temperature log records are weird. You see, I don't have an sdd. That is, sda is my SSD, sdb and sdc are magnetic HDDs, and /dev/sr0 is a DVD player. /dev/sdd doesn't even exist as a special file in /dev.

Lines from other logs:

auth.log shows some Chinese IPs trying to SSH into my machine as root, e.g.:

Jan  7 23:39:53 my_pc sshd[19697]: message repeated 3 times: [ Failed password for root from 218.65.30.53 port 51732 ssh2]
Jan  7 23:39:56 my_pc sshd[19697]: Failed password for root from 218.65.30.53 port 51732 ssh2
Jan  7 23:39:59 my_pc sshd[19697]: Failed password for root from 218.65.30.53 port 51732 ssh2
Jan  7 23:39:59 my_pc sshd[19697]: error: maximum authentication attempts exceeded for root from 218.65.30.53 port 51732 ssh2 [preauth]
Jan  7 23:39:59 my_pc sshd[19697]: Disconnecting: Too many authentication failures [preauth]
Jan  7 23:39:59 my_pc sshd[19697]: PAM 5 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=218.65.30.53  user=root

but I don't think this is related, since this is happening after the hang as well. No other lines on any other logs between the disk-related message I mentioned above and the hang.

  • Do the logs contain anything suspicious? – choroba Jan 7 at 19:05
  • Check your system logs before doing anything like—honestly—what you have already done. There is no reason for a Kernel downgrade like this on a modern system. Since you mention these hangs don’t happen while you are working on it, how do you know the system has hung? Meaning you wake up in the morning and see something was amiss? Maybe some power saving/sleep issue? – JakeGould Jan 7 at 19:21
  • @JakeGould: I get back to it and it's hanging, that's how I know... It can't be a sleep issue - I think - since the system is always-on and most nights it doesn't hang. – einpoklum Jan 7 at 19:31
  • @choroba: Sort of, but I'm not sure it's the actual problem or just a red herring, see edit. – einpoklum Jan 7 at 19:42
  • Do you run in graphics / X-windows mode, or console? When it's "hung," do you still have the last screen contents (and just no response to mouse or keyboard)? Or is it a black screen? Can you Ctrl-Alt-Fx switch to other consoles and get any action there? How about network activity -- can you ping the machine from a different box? – Dave M. Jan 7 at 20:34

One of your drives might be getting disconnected and then reconnected, and detected as a new device. In my experience with Linux servers, this sometimes happens when the old device did not disconnect cleanly: the kernel still holds its letter, so on reconnect the drive gets a new one. The cause could be a faulty drive or a loose cable; it also depends on the controller and how it handles the devices.

Since you only find the machine after it has hung and can't poke around to see what happened, I would suggest writing a small bash script that constantly polls the info about all drives and writes it to a file, preferably on a drive you are sure works; otherwise it may not get written if you try writing to the failing drive. The script may be something like:

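#!/bin/bash
# Dump partition, SMART and kernel-log data for each drive.
# Must run as root: fdisk -l and smartctl need raw device access.
# cron's default PATH usually lacks the sbin directories, so set it here.
export PATH=/usr/sbin:/sbin:/usr/bin:/bin

date
echo "Starting device data dump"
# Partition tables of all disks (once, not once per drive)
fdisk -l
for drive in sda sdb sdc sdd
do
    echo "Dumping data for drive ${drive}"
    smartctl -a "/dev/${drive}"
done
# Most recent kernel messages, with human-readable timestamps
dmesg -T | tail -n50
echo "Ended device data dump"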

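Save it as /usr/local/bin/logcommand.sh, make it executable, and put it in root's crontab (smartctl and fdisk need root), running every minute and appending the output to a file. Edit root's crontab with

sudo crontab -e

Crontab line to add:

* * * * * /usr/local/bin/logcommand.sh >> /var/log/disk-problem.log 2>&1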

Afterwards, check what's in the file. You should be able to see sdd's SMART data, such as model, brand and serial number, and compare it with your other drives. If it's one of them disconnecting, there will be a match; if not, you should still be able to get info about that mysterious sdd drive and what it might be.
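To make that comparison quicker, something like the following might help. It is only a sketch: it assumes the log was produced by the script above (so each drive's section starts with "Dumping data for drive ...") and that smartctl printed a "Serial Number:" line, as it normally does for SATA drives. The function name list_serials is made up for illustration.

```shell
#!/bin/bash
# list_serials LOGFILE
# Print each unique "drive serial" pair found in the dump log.
# Assumption: sections start with "Dumping data for drive <name>" and
# smartctl's output contains a "Serial Number:" line for the drive.
list_serials() {
    awk '
        /^Dumping data for drive / { drive = $NF }        # remember current drive
        /^Serial Number:/          { print drive, $NF }   # pair it with its serial
    ' "$1" | sort -u
}
```

Running list_serials /var/log/disk-problem.log after a hang should then show whether sdd carries the same serial number as sda, sdb or sdc.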

Also, check if your dmesg gets written to some file in /var/log. dmesg should print device disconnects and detections.
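As a rough sketch of how to look for those messages: on Ubuntu-based systems such as Mint, rsyslog usually writes kernel messages to /var/log/kern.log, but that path is an assumption about your setup. The function name disk_events and the patterns are mine; they only cover the common "ataN:" link messages and "[sdX]" disk messages, so treat the filter as a starting point.

```shell
#!/bin/bash
# disk_events LOGFILE
# Show kernel log lines that mention an ATA port (e.g. "ata3:" or "ata3.00:")
# or a SCSI disk name in brackets (e.g. "[sdd]").
# Assumption: LOGFILE holds kernel messages, e.g. /var/log/kern.log.
disk_events() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: |\[sd[a-z]+\]' "$1"
}
```

For example, disk_events /var/log/kern.log | tail -n 40 would list the most recent matching lines; link resets or a disk being attached shortly before a hang would point towards a drive or cable problem.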

PS: Also, since your machine is hung when you find it, it's probably your root device which gives you problems, since it holds the base system and without it the machine could not function.

  • Thank you for the detailed answer, and I'll try this. Of course, I'll need to wait until the next hang for results, and there might even be a Heisenberg effect here, but we'll see. – einpoklum Jan 7 at 22:47
  • Check out all your logs, that's always the first thing to do, some may hold the answer ;) Get the date/time when the hang happened and search for it in your logs: what happened before the freeze, the last lines before the death :) – EvilTorbalan Jan 7 at 22:51
  • So, only other thing in the logs is some Chinese trying to hack my box by SSH'ing in as the root user. – einpoklum Jan 8 at 0:02
  • @einpoklum Is root still enabled on your system? Here’s how to end that forever: Just create a new user with sudo privileges, lock the root account and get on with your life. I doubt the hanging is caused by the SSH probing, but it might be worth it to just eliminate stuff like that so you can know, with assurance, that stuff that shows up in the logs is truly worth paying attention to. – JakeGould Jan 8 at 15:58
  • @JakeGould: root login via SSH was never enabled on my system. But I think I'll install fail2ban and maybe change my SSH port number. Still, I'm pretty convinced the hangs are unrelated to this. We'll see what the disk logs say. – einpoklum Jan 8 at 16:09

I don't know if this helps, but I have a similar situation. The system is an Intel NUC running Linux Mint 18.3 (XFCE) with 8 GB RAM and an M.2 SSD, so very similar to the OP's.

My problem only shows when running Thunderbird. I direct all my Thunderbird data to another Linux Mint computer that I use as a server. Small Thunderbird accounts work (just), but the larger ones cause the system to become unstable and Thunderbird doesn't really run at all.

Linux Mint 18.3 (XFCE) is supplied with Linux Kernel 4.10.0-38 and this works fine on my system - Thunderbird works as well as on other systems. However if I upgrade the Linux Kernel to 4.10.0-42 using the inbuilt Mint upgrade package, Thunderbird causes the problems mentioned above.

I must stress that this problem (using the newer Kernel - 4.10.0-42) only happens on my NUC computer - other systems work fine with the upgraded Kernel.

My interim solution is to stick with the 4.10.0-38 Kernel and fully test any upgrades before using.

  • +1 for effort. I'm a bit worried about using earlier kernels, though. – einpoklum Jan 14 at 12:44
  • This doesn't seem to work. I did try switching my kernel version here and there, and not much effect. Also getting this problem with 4.13.x – einpoklum Feb 20 at 12:19
