Hetzner NVMe failing

SagnikS (Hosting Provider, OG)

Heya,

I've got a question. I tried researching it on the interwebz, but that didn't lead anywhere. I have quite a few servers with Hetzner. Around a week back, I got an alert (from a monitoring software) that 3 separate NVMes on 3 dedicated servers had failed. Since they were running in RAID 1, I scheduled disk replacements and that was done. FYI, the % used on all the drives was less than 20%, and one was in single digits. Today I got alerted again: yet another NVMe (it was the "new" NVMe on the same dedicated server) had failed. I'm posting the SMART stats here in case it helps:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.13-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB512HAJQ-00000
Serial Number:                      S3W8NX1M954021
Firmware Version:                   EXA7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            242,402,029,568 [242 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 8991c42437
Local Time is:                      Thu Jan 30 14:14:58 2020 UTC
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    2,827,660 [1.44 TB]
Data Units Written:                 1,561,851 [799 GB]
Host Read Commands:                 2,721,446,668
Host Write Commands:                67,041,131
Controller Busy Time:               2,403
Power Cycles:                       13
Power On Hours:                     184
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    1
Error Information Log Entries:      6
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               45 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          6     2  0x0198  0x4502  0x000    606758440     1     -
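
For anyone reading the dump above: the "Critical Warning: 0x04" value is a bit field, and the "NVM subsystem reliability has been degraded" line is simply bit 2 spelled out. Below is a minimal Python sketch of that decoding. The bit meanings follow the NVMe SMART/Health log as I understand it; the table and function names are made up for illustration.

```python
# Meaning of each bit in the NVMe SMART "Critical Warning" byte (NVMe Log 0x02).
CRITICAL_WARNING_BITS = {
    0: "available spare capacity has fallen below the threshold",
    1: "temperature is outside the configured thresholds",
    2: "NVM subsystem reliability has been degraded",
    3: "media has been placed in read-only mode",
    4: "volatile memory backup device has failed",
}


def decode_critical_warning(value):
    """Return the description of every bit that is set in the warning byte."""
    return [
        text
        for bit, text in CRITICAL_WARNING_BITS.items()
        if value & (1 << bit)
    ]


# Value taken from the SMART dump in this post.
print(decode_critical_warning(0x04))
```

With 0x04 only bit 2 is set, so the spare-capacity, temperature and read-only warnings are not what tripped the health check here.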

Now this is very, very odd. Across the multiple providers I have used around the world, I have never had anything like this. A seemingly new NVMe with less than 200 power-on hours fails? Point to note: the drives that failed were all the exact same model. Might be a bad batch..?

If you guys have any thoughts on this, it would be highly appreciated.
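
The single error log entry above carries a Status value of 0x4502, which can be unpacked the same way. This is a hedged sketch: it assumes smartctl prints the raw 16-bit status word, with the phase tag in bit 0 and the status proper in bits 15:1, and it only names the status codes relevant here. The function and table names are hypothetical.

```python
# Status Code Types and a few media-error status codes, as I understand the NVMe spec.
STATUS_CODE_TYPES = {
    0x0: "Generic Command Status",
    0x1: "Command Specific Status",
    0x2: "Media and Data Integrity Errors",
    0x7: "Vendor Specific",
}
MEDIA_STATUS_CODES = {
    0x80: "Write Fault",
    0x81: "Unrecovered Read Error",
    0x85: "Compare Failure",
}


def decode_nvme_status(raw):
    """Split a raw 16-bit NVMe status word into its individual fields."""
    status = raw >> 1  # assumption: bit 0 is the phase tag, not part of the status
    return {
        "phase_tag": raw & 0x1,
        "status_code": status & 0xFF,
        "status_code_type": (status >> 8) & 0x7,
        "more": bool((status >> 13) & 0x1),
        "do_not_retry": bool((status >> 14) & 0x1),
    }


def describe(raw):
    """Return a short human-readable description of a raw status word."""
    fields = decode_nvme_status(raw)
    sct = fields["status_code_type"]
    sc = fields["status_code"]
    sct_name = STATUS_CODE_TYPES.get(sct, "unknown type")
    sc_name = MEDIA_STATUS_CODES.get(sc, "unknown code") if sct == 0x2 else "not decoded here"
    return "SCT 0x%x (%s), SC 0x%02x (%s)" % (sct, sct_name, sc, sc_name)


print(describe(0x4502))
```

If the assumption about the layout holds, 0x4502 reads as a media-type status with code 0x81, an unrecovered read error at the logged LBA, which would line up with the "Media and Data Integrity Errors: 1" counter.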

Comments

  • InceptionHosting (Hosting Provider, OG)

    Could be a bad batch of disks but also based on a quick bit of research on that error/status could be a controller based issue on the boards, which frankly given your description sounds more likely.

    The disks may be fine.

    https://inceptionhosting.com
    Please do not use the PM system here for Inception Hosting support issues.

  • SagnikS (Hosting Provider, OG)

    @AnthonySmith said:
    Could be a bad batch of disks but also based on a quick bit of research on that error/status could be a controller based issue on the boards, which frankly given your description sounds more likely.

    The disks may be fine.

    AFAIK, Hetzner uses PCIe risers/extenders to plug the NVMe drives in, so it might be something on there.

    Thanked by (1)vimalware
  • InceptionHosting (Hosting Provider, OG)

    Could be. I guess if you have had 3 separate, identical issues, it is likely that others would have seen the same (if they even noticed), and perhaps they have a wider investigation going on.

  • SagnikS (Hosting Provider, OG)

    @AnthonySmith said:
    Could be. I guess if you have had 3 separate, identical issues, it is likely that others would have seen the same (if they even noticed), and perhaps they have a wider investigation going on.

    I've opened a ticket with Hetzner specifically for this (probably should have done that first), I'll update here if I get any useful response from them. Meanwhile, tagging @Hetzner_OL to grab their attention.

    Thanked by (1)Hetzner_OL
  • SagnikS (Hosting Provider, OG)

    :neutral:

    Dear Client
    
    thank you for your inquiry. Please note, that there are no known issues with the regarding NVMes. Hence, we assume an issue, related to the type of usage/software on your servers.
    

    Welp, no luck there, it's just Proxmox running, on ext4, if anyone's interested.

  • MikeA (Hosting Provider, OG)
    edited January 2020

    I got an alert (from a monitoring software) that 3 separate NVMes on 3 dedicated servers had failed.

    What monitoring/alert? If it's Hetrix and you have all checks enabled it will notify you if it's doing anything like checking/resync. What did cat /proc/mdstat show? I've never had a problem with Hetzner's drives.

    ExtraVM - High RAM Specials
    Yours truly.

  • SagnikS (Hosting Provider, OG)

    @MikeA said:

    I got an alert (from a monitoring software) that 3 separate NVMes on 3 dedicated servers had failed.

    What monitoring/alert? If it's Hetrix and you have all checks enabled it will notify you if it's doing anything like checking/resync. What did cat /proc/mdstat show? I've never had a problem with Hetzner's drives.

    It's not Hetrix, I just have a script that checks the SMART stuff. /proc/mdstat was normal, nothing odd on there.
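
For readers wondering what a script that "checks the SMART stuff" might look like, here is a rough Python sketch, not the actual script used here. It assumes smartctl 7.0 or newer (for the --json option) and that the JSON report uses the key names shown below; both should be checked against your own smartctl output. The function names are invented for illustration.

```python
import json
import subprocess


def find_problems(report):
    """Return human-readable problems found in one parsed smartctl JSON report."""
    problems = []
    if report.get("smart_status", {}).get("passed") is False:
        problems.append("overall health self-assessment failed")
    log = report.get("nvme_smart_health_information_log", {})
    if log.get("critical_warning", 0) != 0:
        problems.append("critical warning 0x%02x" % log["critical_warning"])
    if log.get("media_errors", 0) > 0:
        problems.append("%d media/data integrity error(s)" % log["media_errors"])
    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare <= threshold:
        problems.append("available spare %d%% at or below threshold %d%%" % (spare, threshold))
    return problems


def read_report(device):
    """Run smartctl on a device and return the parsed JSON report."""
    # smartctl's exit status is a bit mask, so a non-zero code is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Hand-built report mirroring the values from the SMART dump in the first post.
sample = {
    "smart_status": {"passed": False},
    "nvme_smart_health_information_log": {
        "critical_warning": 4,
        "available_spare": 100,
        "available_spare_threshold": 10,
        "media_errors": 1,
    },
}
for problem in find_problems(sample):
    print(problem)
```

On a real host you would call find_problems(read_report("/dev/nvme0")) for each drive and send the result to whatever alerting you already use.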

  • When in doubt, sue.

    ♻ Amitz day is October 21.
    ♻ Join Nigh sect by adopting my avatar. Let us spread the joys of the end.

  • WSS (OG, Retired)

    @deank said:
    When in doubt, sue.

    Mary, or baby? I love babysue.

    My pronouns are like/subscribe.

  • SagnikS (Hosting Provider, OG)
    edited January 2020

    @MikeA said:

    I got an alert (from a monitoring software) that 3 separate NVMes on 3 dedicated servers had failed.

    What monitoring/alert? If it's Hetrix and you have all checks enabled it will notify you if it's doing anything like checking/resync. What did cat /proc/mdstat show? I've never had a problem with Hetzner's drives.

    By any chance, do you have the same model of NVMe drives (SAMSUNG MZVLB512HAJQ-00000)? And here's the /proc/mdstat output:

    root@node ~ # cat /proc/mdstat 
    Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
    md1 : active raid1 nvme0n1p2[2] nvme1n1p2[0]
          499449152 blocks super 1.2 [2/2] [UU]
          bitmap: 4/4 pages [16KB], 65536KB chunk
    
    md0 : active raid1 nvme0n1p1[2] nvme1n1p1[0]
          523264 blocks super 1.2 [2/2] [UU]
    
    unused devices: <none>
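
The [2/2] [UU] markers above mean the kernel still counts both members of each mirror as active, so the SMART failure has not (yet) kicked a drive out of the array. A small Python sketch of how a script could check that automatically; the function name and the sample constant are made up for illustration, and the parsing only covers the layout shown in this post.

```python
import re

# Matches the "[2/2] [UU]" part of an array's status line.
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 nvme0n1p2[2] nvme1n1p2[0]
      499449152 blocks super 1.2 [2/2] [UU]
      bitmap: 4/4 pages [16KB], 65536KB chunk

md0 : active raid1 nvme0n1p1[2] nvme1n1p1[0]
      523264 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded md array to its member flags, e.g. {"md1": "U_"}."""
    degraded = {}
    current = None
    for raw_line in mdstat_text.splitlines():
        line = raw_line.strip()
        header = re.match(r"(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if status and current is not None:
            flags = status.group(3)
            if "_" in flags:  # an underscore marks a missing or failed member
                degraded[current] = flags
    return degraded


print(degraded_arrays(SAMPLE))
```

For the output pasted above this returns an empty dict, which matches "nothing odd on there": the mdstat view and the SMART view are independent signals, and only SMART is complaining.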
    
  • MikeA (Hosting Provider, OG)
    edited January 2020

    @SagnikS One of my Ryzen 3700X servers has a Toshiba KXG60ZNV1T02, the other has a Samsung MZVLB1T0HALR. i9-9900K has Samsung MZVLB1T0HALR as well. Threadripper has an MZQLB960HAJR.
    So no, not the same ones you have.

    Thanked by (1)SagnikS

  • SagnikS (Hosting Provider, OG)

    @MikeA said:
    @SagnikS One of my Ryzen 3700X servers has a Toshiba KXG60ZNV1T02, the other has a Samsung MZVLB1T0HALR. i9-9900K has Samsung MZVLB1T0HALR as well. Threadripper has an MZQLB960HAJR.
    So no, not the same ones you have.

    Ah gotcha, they're 1TB or more ig. These are 500GB ones.

  • Hetzner_OL (Hosting Provider, OG)

    Sorry, since I'm not a technician myself and I don't have direct access to the information you've shared with our support team, it's a bit difficult to comment on this situation.
    I assume that you shared all the information that you could with our team, including the troubleshooting that you've already tried, right? If not, please do that. Maybe there's something else that will turn up. You could also consider writing a post in our customer Discussion Forum. If other customers with NVMes have had similar issues, they'll let you know, or they'll give you some other ideas to try out. (Many of our oldest clients are from Germany, which is why there is so much German in this Forum, but most readers speak English. Just make sure to share what you've already tried out.) --Katie

    Thanked by (2)SagnikS mfs

    We're Katie and Lea and we'll do our best to answer questions you have about Hetzner Online. We and not our employer are responsible for any horrible puns and dated cultural references.

  • Clouvider (Hosting Provider, OG)
    edited January 2020

    We had major issues with this model of the drive, with frequent failures across a large number of the drives (if they work - they work well, but some fail quickly, badly, and early if they don’t). We have since stopped providing new services with this model.

    Thanked by (3)SagnikS vimalware mfs
  • cybertech (OG, Benchmark King)

    @Clouvider said:
    We had major issues with this model of the drive, with frequent failures across a large number of the drives (if they work - they work well, but some fail quickly, badly, and early if they don’t). We have since stopped providing new services with this model.

    Warning warning NVMe warning!!!!

    I bench YABS 24/7/365 unless it's a leap year.

  • @Clouvider said: We had major issues with this model of the drive

    The samsung ones?

  • SagnikS (Hosting Provider, OG)

    @Clouvider said:
    We had major issues with this model of the drive, with frequent failures across a large number of the drives (if they work - they work well, but some fail quickly, badly, and early if they don’t). We have since stopped providing new services with this model.

    Similar experience with ours too; glad to know it's an issue with the NVMe itself. Just to confirm, it's this one, right: MZVLB512HAJQ?

  • InceptionHosting (Hosting Provider, OG)

    @Clouvider said:
    We had major issues with this model of the drive, with frequent failures across a large number of the drives (if they work - they work well, but some fail quickly, badly, and early if they don’t). We have since stopped providing new services with this model.

    I think that's the ones I had that went bye bye with zero warning iirc?

  • Clouvider (Hosting Provider, OG)

    @AnthonySmith said:

    @Clouvider said:
    We had major issues with this model of the drive, with frequent failures across a large number of the drives (if they work - they work well, but some fail quickly, badly, and early if they don’t). We have since stopped providing new services with this model.

    I think that's the ones I had that went bye bye with zero warning iirc?

    Aye.

    Thanked by (1)vimalware
  • Are you guys talking about our ex?

    Thanked by (1)seriesn

  • SagnikS (Hosting Provider, OG)

    @Hetzner_OL said:
    Sorry, since I'm not a technician myself and I don't have direct access to the information you've shared with our support team, it's a bit difficult to comment on this situation.
    I assume that you shared all the information that you could with our team, including the troubleshooting that you've already tried, right? If not, please do that. Maybe there's something else that will turn up. You could also consider writing a post in our customer Discussion Forum. If other customers with NVMes have had similar issues, they'll let you know, or they'll give you some other ideas to try out. (Many of our oldest clients are from Germany, which is why there is so much German in this Forum, but most readers speak English. Just make sure to share what you've already tried out.) --Katie

    I really don't know if there's anything to troubleshoot at all when an NVMe fails, and I got this response from support:

    thank you for your inquiry. Please note, that there are no known issues with the regarding NVMes. Hence, we assume an issue, related to the type of usage/software on your servers.
    
  • InceptionHosting (Hosting Provider, OG)

    @SagnikS said: thank you for your inquiry. Please note, that there are no known issues with the regarding NVMes. Hence, we assume an issue, related to the type of usage/software on your servers

    I really hate that sort of support.

    Roughly translated: "Dear Customer 104582, I have looked into nothing, I am really just trying to find a reason for this not to be my problem so I can close the ticket"

  • Clouvider (Hosting Provider, OG)
    edited January 2020

    @SagnikS said:

    @Hetzner_OL said:
    Sorry, since I'm not a technician myself and I don't have direct access to the information you've shared with our support team, it's a bit difficult to comment on this situation.
    I assume that you shared all the information that you could with our team, including the troubleshooting that you've already tried, right? If not, please do that. Maybe there's something else that will turn up. You could also consider writing a post in our customer Discussion Forum. If other customers with NVMes have had similar issues, they'll let you know, or they'll give you some other ideas to try out. (Many of our oldest clients are from Germany, which is why there is so much German in this Forum, but most readers speak English. Just make sure to share what you've already tried out.) --Katie

    I really don't know if there's anything to troubleshoot at all when an NVMe fails, and I got this response from support:

    thank you for your inquiry. Please note, that there are no known issues with the regarding NVMes. Hence, we assume an issue, related to the type of usage/software on your servers.
    

    I mean, yeah, I narrowed it down to some of them being particularly sensitive to running hot. So if you had a “less resilient” drive and you hammered it, you would run it hot, and then through your own use you’d destroy it. But hey, this wasn’t happening on the PM961, only on the PM981, so clearly this is not a user-caused issue...
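
Since heat is the suspect here, one thing a monitoring script could track is how close each temperature reading gets to the drive's own warning threshold. The Python sketch below parses the plain-text smartctl output in the format pasted in the first post; the function name and the sample constant are made up, and the regular expressions are an assumption about that text layout only.

```python
import re

# Lines copied from the SMART dump in the first post.
SAMPLE = """\
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius
Temperature:                        36 Celsius
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               45 Celsius
"""


def thermal_headroom(smartctl_text):
    """Return degrees Celsius left between each reading and the warning threshold."""
    warn = re.search(
        r"Warning\s+Comp\. Temp\. Threshold:\s+(\d+) Celsius", smartctl_text
    )
    if warn is None:
        return {}
    limit = int(warn.group(1))
    readings = re.findall(
        r"^(Temperature(?: Sensor \d+)?):\s+(\d+) Celsius",
        smartctl_text,
        flags=re.MULTILINE,
    )
    return {name: limit - int(value) for name, value in readings}


for name, margin in thermal_headroom(SAMPLE).items():
    print("%s: %d C below the warning threshold" % (name, margin))
```

A single reading only shows the temperature at that moment. The "Warning Comp. Temperature Time: 0" counter in the same dump suggests that this particular drive never logged time above its warning threshold, at least by its own accounting.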

    Thanked by (2)SagnikS vimalware
  • I guess I have to put in a request to mix manufacturers when configuring nvme mirrored pools.
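
A quick way to see which mirrors are exposed to a single bad model is to compare the model strings of each array's members. The Python sketch below works on plain dictionaries; the helper name is hypothetical, and the model listed for the second drive is an assumption based on the first post saying the drives were the exact same model. In practice the model strings could come from the "Model Number" line of smartctl.

```python
def same_model_mirrors(arrays, models):
    """Return the names of arrays whose member drives all report the same model."""
    flagged = []
    for name, members in arrays.items():
        member_models = {models[drive] for drive in members}
        if len(members) > 1 and len(member_models) == 1:
            flagged.append(name)
    return flagged


# Array layout from the /proc/mdstat output earlier in the thread.
arrays = {
    "md0": ["nvme0n1", "nvme1n1"],
    "md1": ["nvme0n1", "nvme1n1"],
}
# Assumption: both drives are the model shown in the SMART dump.
models = {
    "nvme0n1": "SAMSUNG MZVLB512HAJQ-00000",
    "nvme1n1": "SAMSUNG MZVLB512HAJQ-00000",
}
print(same_model_mirrors(arrays, models))
```

Mixing manufacturers would leave this list empty, so that a batch or firmware problem in one model is less likely to hit both halves of a mirror at once.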

  • @deank said: Are you guys talking about our ex?

    None of my exes failed as swiftly, badly and early as (allegedly) this NVMe model

    @vimalware said: mix manufacturers

    when you're younger that's something you should absolutely try I guess

  • SagnikS (Hosting Provider, OG)
    edited February 2020

    @vimalware said:
    I guess I have to put in a request to mix manufacturers when configuring nvme mirrored pools.

    Yep, however, a friend of mine was told that Hetzner doesn't have anything in stock other than those Samsung NVMes. :frown:

    Thanked by (1)vimalware
  • jarland (Hosting Provider, OG)

    Looks like the closest I have is:

    2x SAMSUNG MZVLB1T0HALR-00000

    The rest are:

    KXG60ZNV1T02 TOSHIBA

    Looks good so far:

    https://clbin.com/u9u2n

    I've sure put them through a lot. Uptime of 232 days, don't think I've rebooted this machine much.

    Do everything as though everyone you’ll ever know is watching.

  • SagnikS (Hosting Provider, OG)

    @jarland said:
    Looks like the closest I have is:

    2x SAMSUNG MZVLB1T0HALR-00000

    The rest are:

    KXG60ZNV1T02 TOSHIBA

    Looks good so far:

    https://clbin.com/u9u2n

    I've sure put them through a lot. Uptime of 232 days, don't think I've rebooted this machine much.

    Probably something to do with that exact 512GB model.

    Thanked by (1)jarland
  • Clouvider (Hosting Provider, OG)

    @jarland said:
    Looks like the closest I have is:

    2x SAMSUNG MZVLB1T0HALR-00000

    The rest are:

    KXG60ZNV1T02 TOSHIBA

    Looks good so far:

    https://clbin.com/u9u2n

    I've sure put them through a lot. Uptime of 232 days, don't think I've rebooted this machine much.

    This doesn’t affect as many of the 1TB ones; the 256GB and 512GB ones, however, are/were a problem.

    Thanked by (3)jarland vimalware SagnikS
  • edited February 2020

    @AnthonySmith said:

    @SagnikS said: thank you for your inquiry. Please note, that there are no known issues with the regarding NVMes. Hence, we assume an issue, related to the type of usage/software on your servers

    I really hate that sort of support.

    Roughly translated: "Dear Customer 104582, I have looked into nothing, I am really just trying to find a reason for this not to be my problem so I can close the ticket"

    He has to use stronger softwares.

    Hardwares are BIG.
    The bigger the computer, the better.

    Softwares are strong.
    Made by heavy duty programmers.

    Efficiency is through the roof.
    Heating during the winter from servers.

    ???
    Profit is millions.
