Linus Torvalds + HN + ECC RAM + rasdaemon

Not_Oles (Hosting Provider, Content Writer)

One of the reasons I am renting a Hetzner AX51 server is that the AX51 is the least expensive server in the AX line that comes configured by default with error-correcting code memory (ECC RAM).

This morning rajesh-s posted on Hacker News (HN) about Linus Torvalds' opinion that "ECC absolutely matters."

The HN discussion was fascinating! All kinds of anecdotal and tech information about ECC and bit flips, their importance or lack of importance, relative costs, effects on hardware and software, influence from Intel and AMD marketing strategies, and use of ECC by various huge companies and in various kinds of equipment.

One HN comment by fortran77 stated that the rate of bit flips is about 1 per gigabyte per month.
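To get a feel for what that quoted rate would mean in practice, here is a rough sketch that just multiplies it out. The rate of 1 flip per gigabyte per month is the figure from the HN comment, not something I measured, and the RAM sizes in the loop are made-up example inputs. The function name `expected_flips_per_month` is only for illustration.

```shell
#!/usr/bin/env bash
# Rough arithmetic only: expected bit flips per month for a given amount of RAM.
# Assumption: about 1 flip per GB per month, as quoted in the HN comment.
expected_flips_per_month() {
    local ram_gb=$1
    local flips_per_gb_per_month=${2:-1}   # defaults to the quoted rate
    echo $(( ram_gb * flips_per_gb_per_month ))
}

# Example RAM sizes (hypothetical inputs, not taken from any particular server).
for gb in 8 64 256; do
    echo "${gb} GB -> about $(expected_flips_per_month "$gb") flips per month"
done
```

If the quoted rate is anywhere near right, the expected count simply scales with the amount of installed memory, which is why the question gets more interesting on bigger boxes.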

gsvelto, a Mozilla engineer, posted in the HN discussion a link to his excellent rasdaemon tutorial.

Whoa! @Not_Oles, the "clueless administrator," had never heard of rasdaemon before! :) What's rasdaemon? As gsvelto said, "rasdaemon tools can be used to monitor ECC memory and report both correctable and uncorrectable memory error." The initial "ras" in the name "rasdaemon" stands for Reliability, Availability and Serviceability (RAS).

I googled around a bit and checked GitHub for any rasdaemon problems with Proxmox. The weather seemed maybe pretty good. So I went ahead and installed rasdaemon on my server. Now maybe I can monitor ECC memory errors and also get those errors logged.

Below is what the install looked like, in case anybody is interested.

I had a fun day! I hope you did too! Greetings from Mexico!

root@hels ~ # apt-get update
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Hit:2 http://deb.debian.org/debian buster InRelease                                                            
Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]                                          
Hit:4 http://download.proxmox.com/debian/ceph-nautilus buster InRelease                                        
Hit:5 http://download.proxmox.com/debian/pve buster InRelease                  
Hit:6 http://mirror.hetzner.de/debian/packages buster InRelease
Get:7 http://mirror.hetzner.de/debian/security buster/updates InRelease [65.4 kB]
Get:8 http://mirror.hetzner.de/debian/packages buster-updates InRelease [51.9 kB]
Fetched 235 kB in 1s (321 kB/s)    
Reading package lists... Done
root@hels ~ # apt-get dist-upgrade
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
root@hels ~ # apt-get install rasdaemon
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libdbd-sqlite3-perl libdbi-perl
Suggested packages:
  libmldbm-perl libnet-daemon-perl libsql-statement-perl
The following NEW packages will be installed:
  libdbd-sqlite3-perl libdbi-perl rasdaemon
0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 1,030 kB of archives.
After this operation, 2,914 kB of additional disk space will be used.
Do you want to continue? [Y/n] 
Get:1 http://mirror.hetzner.de/debian/packages buster/main amd64 libdbi-perl amd64 1.642-1+deb10u1 [775 kB]
Get:2 http://mirror.hetzner.de/debian/packages buster/main amd64 libdbd-sqlite3-perl amd64 1.62-3 [177 kB]
Get:3 http://mirror.hetzner.de/debian/packages buster/main amd64 rasdaemon amd64 0.6.0-1.2 [78.2 kB]
Fetched 1,030 kB in 1s (2,023 kB/s)  
Selecting previously unselected package libdbi-perl:amd64.
(Reading database ... 71868 files and directories currently installed.)
Preparing to unpack .../libdbi-perl_1.642-1+deb10u1_amd64.deb ...
Unpacking libdbi-perl:amd64 (1.642-1+deb10u1) ...
Selecting previously unselected package libdbd-sqlite3-perl:amd64.
Preparing to unpack .../libdbd-sqlite3-perl_1.62-3_amd64.deb ...
Unpacking libdbd-sqlite3-perl:amd64 (1.62-3) ...
Selecting previously unselected package rasdaemon.
Preparing to unpack .../rasdaemon_0.6.0-1.2_amd64.deb ...
Unpacking rasdaemon (0.6.0-1.2) ...
Setting up libdbi-perl:amd64 (1.642-1+deb10u1) ...
Setting up libdbd-sqlite3-perl:amd64 (1.62-3) ...
Setting up rasdaemon (0.6.0-1.2) ...
Created symlink /etc/systemd/system/multi-user.target.wants/ras-mc-ctl.service → /lib/systemd/system/ras-mc-ctl.service.
Created symlink /etc/systemd/system/multi-user.target.wants/rasdaemon.service → /lib/systemd/system/rasdaemon.service.
Processing triggers for man-db (2.8.5-2) ...
root@hels ~ # systemctl enable rasdaemon
root@hels ~ # systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
   Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-01-04 01:44:06 UTC; 2min 42s ago
 Main PID: 15397 (rasdaemon)
    Tasks: 1 (limit: 4915)
   Memory: 10.9M
   CGroup: /system.slice/rasdaemon.service
           └─15397 /usr/sbin/rasdaemon -f -r

Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: ras:extlog_mem_event event enabled
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Enabled event ras:extlog_mem_event
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: ras:extlog_mem_event event enabled
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Listening to events for cpus 0 to 15
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: Enabled event ras:extlog_mem_event
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Recording mc_event events
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Recording aer_event events
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Recording extlog_event events
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Recording mce_record events
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Recording arm_event events
root@hels ~ # man rasdaemon
root@hels ~ # man ras-mc-ctl
root@hels ~ # ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: ASRockRack model B450D4U-V1L
root@hels ~ # ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.
No MCE errors.
root@hels ~ # 
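
For anyone who would rather not run the summary by hand, here is a minimal sketch of a check that could go in a cron job. It only looks for the four "No ... errors" lines shown in the session above. I have not seen what the summary looks like when errors are present, so the sketch treats anything other than those four lines as "needs a look." The function name `summary_is_clean` and the warning message are my own inventions.

```shell
#!/usr/bin/env bash
# Returns 0 when the summary text on stdin contains all four "No ... errors"
# lines seen in the session above, and 1 otherwise.
# Assumption: a clean summary always prints exactly these labels.
summary_is_clean() {
    local text label
    text=$(cat)
    for label in "No Memory errors." "No PCIe AER errors." \
                 "No Extlog errors." "No MCE errors."; do
        # grep -F matches the label as a fixed string; -q suppresses output.
        printf '%s\n' "$text" | grep -qF -- "$label" || return 1
    done
    return 0
}

# Only speak up when ras-mc-ctl is installed and the summary is not clean.
if command -v ras-mc-ctl >/dev/null 2>&1; then
    if ! ras-mc-ctl --summary | summary_is_clean; then
        echo "ras-mc-ctl summary is not clean, take a look" >&2
    fi
fi
```

The check is deliberately strict: if a future rasdaemon version words the summary differently, it will warn rather than silently pass.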

I hope everyone gets the servers they want!

Comments

  • @Not_Oles said: One HN comment by fortran77 stated that the rate of bit flips is about 1 per gigabyte per month.

    Might not be a big deal with docker containers or anything that could restart on failure.
    But would be a disaster if the same happens on a DB server.

  • What about when the DB server is running on Docker? I think the isolation of volumes and immutable images pushes DB servers toward containers.

  • alwyzon (Hosting Provider)
    edited January 2021

    @evnix said:
    Might not be a big deal with docker containers or anything that could restart on failure.
    But would be a disaster if the same happens on a DB server.

    It might still be an issue, as Rowhammer attacks can be used to escalate privileges and thus pretty much render any sandboxing useless. Not an easy attack, and likely to get spotted if you carefully monitor your machines, but one should be aware that it is possible when enough energy is spent on it. Heck, there even was a proof of concept published in 2015 for privilege escalation in web browsers using JavaScript.

    Alwyzon - Virtual Servers in Austria starting at 4,49 €/month (excl. VAT)

  • @alwyzon said: It might still be an issue as Rowhammer attacks can be used to escalate privileges

    You have a point! I did not look at it from an attacker's point of view. Once inside, they have full access to the internal VPN and to config files to extract and connect to.

  • havoc (OG, Content Writer)

    Yeah saw that - was an interesting read. Didn't know one could monitor it.

    I think the relative importance of risks is also worth keeping in mind though. I'm not in charge of a datacenter, so my data losses are more likely to be from a bad config / backup / vulnerability not on point etc. than from Rowhammer or a cosmic bit flip.

    Or sometimes just stupidity... deleted an SSH key the other day. Whoops.

  • Not_Oles (Hosting Provider, Content Writer)

    @havoc said: the relative importance of risks

    When @david and I were setting up my old, now gone, OVH servers for our original giveaway program, we had a mysterious crash. We spent several days trying to figure out what might have happened. I got back into the server and retrieved the logs, but no joy. We then spent several weeks wondering about reliability and trying extended testing.

    There were no more crashes at OVH and there have not been any crashes on the new Hetzner box. Nevertheless, when evaluating relative importance of ECC monitoring, maybe incorporating the time and trouble spent debugging into the comparison metric might be good. :)

    I hope everyone gets the servers they want!

  • havoc (OG, Content Writer)

    @Not_Oles said:

    @havoc said: the relative importance of risks

    When @david and I were setting up my old, now gone, OVH servers for our original giveaway program, we had a mysterious crash. We spent several days trying to figure out what might have happened. I got back into the server and retrieved the logs, but no joy. We then spent several weeks wondering about reliability and trying extended testing.

    There were no more crashes at OVH and there have not been any crashes on the new Hetzner box. Nevertheless, when evaluating relative importance of ECC monitoring, maybe incorporating the time and trouble spent debugging into the comparison metric might be good. :)

    Yeah, when offering a service to others for payment then I'd def expect ECC to be considered.

    Sorry, the above comment wasn't really meant to be dismissive of your use case. I just think Torvalds' point is a little too broad and overstated, in that a big chunk of computers are used for FB and browsing cat pictures... not exactly something you need to guard against cosmic rays (though if you can for no additional cost, sure, why not).

    Thanked by (1)Not_Oles