Servarica Downtime Explanation
All machines are back up. Received this email. Formatting isn't the best, but pretty interesting read nonetheless!
As some of you know, we promised that we would allow all VMs to switch to a 1 Gbps network by the end of this month. In order to do that, we had to make some changes to our infrastructure, which unfortunately caused some issues, including the one that occurred 2 days ago.
Before explaining what the issue was, let's start with the most important part, which is the actions we took. Then we will explain our infrastructure and how the issue caused the downtime.
We are doing the following to prevent future issues from occurring again:
1- As per our SLA, we are offering all our users, even those not affected by any downtime, 15% of their monthly cost as credit, which has already been applied to your account (if your service is billed annually, you will receive a refund of total invoice / 12 * 0.15).
2- We are in the process of hardening all storage VMs against switching to a read-only file system (see the issue results section for an explanation). We are testing the solution now, and if there are no issues, we will deploy it soon.
Applying the changes will require all VMs to be rebooted; this will be a planned reboot.
3- We are trying to find out exactly why the switches behaved so strangely, and we plan to try to replicate the case so we can find a solution.
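The credit calculation described in step 1 can be sketched roughly like this (a toy illustration with made-up function and parameter names, not Servarica's actual billing code):

```python
def sla_credit(invoice_total: float, billing_cycle_months: int = 1) -> float:
    """15% of one month's cost, per the SLA credit described above.

    For monthly billing this is 15% of the invoice; for annual billing
    the invoice is first divided by 12 to get the monthly cost.
    """
    monthly_cost = invoice_total / billing_cycle_months
    return round(monthly_cost * 0.15, 2)

# A monthly plan at $7/mo gets a $1.05 credit:
print(sla_credit(7.00))        # 1.05
# An annual plan billed at $84/yr gets the same $1.05 (84 / 12 * 0.15):
print(sla_credit(84.00, 12))   # 1.05
```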
Now let's explain our infrastructure, to show that we built a really resilient setup, even though issues still occur from time to time that force us to go offline.
We operate several racks of servers; each rack includes the following types of servers, in varying numbers:
1- Storage servers: these are big servers with many disks that only serve storage to other servers over NFS. No client VMs run on these servers.
Each of these servers is connected through multiple 10 Gbps and/or 40 Gbps NICs.
Each server is connected to more than one 10 Gbps or 40 Gbps switch for redundancy.
The disks in these servers are configured in either mirror mode or raidz2 for redundancy.
2- Storage compute nodes: these are the servers where storage VMs (VPS) live. Each of these servers is connected through multiple 10 Gbps NICs to more than one switch for redundancy.
Since the storage is separated from the compute nodes, we are able to organize all nodes that host storage VMs into pools with at least N+1 redundancy.
That means if we have a pool of 4 servers, we only accept orders that fill 3 servers and leave the equivalent of 1 server free.
This setup allows us to do a lot of maintenance tasks without our users even noticing. For example, we can do minor server upgrades by live-migrating all VMs from a server to the other pool servers, rebooting it, and moving the VMs back once it is up.
The VMs are still distributed across all 4 servers for better performance in normal operation, but whenever we need to free up a server, we can, since we always plan for N+1 redundancy.
3- NVMe and SSD nodes: these are the nodes where NVMe and SSD VMs live. Each of these nodes has NVMe disks, SSDs, or both, depending on the plans we put on it.
These servers are connected to both our data and storage networks through multiple 10 Gbps NICs that are connected to more than one switch.
4- Special servers, like GPU servers and servers where we run our experiments.
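The N+1 capacity rule described under the storage compute nodes can be sketched as follows (a simplified model with illustrative names, not the provider's actual capacity planner):

```python
def sellable_capacity(servers: int, per_server_slots: int, spares: int = 1) -> int:
    """With N+1 redundancy, keep the equivalent of one server free.

    Only (servers - spares) servers' worth of slots may ever be sold,
    even though running VMs are spread across all servers for performance,
    so any one server can be drained for maintenance at any time.
    """
    if servers <= spares:
        raise ValueError("pool too small to reserve a spare")
    return (servers - spares) * per_server_slots

# A pool of 4 servers, each able to host 20 storage VMs,
# only accepts orders filling 3 servers' worth of slots:
print(sellable_capacity(4, 20))  # 60
```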
Power and connectivity per rack:
1- Each rack is connected to two power sources backed by independent UPSes and diesel generators in case one power feed goes down (A+B power). All our equipment is dual-powered and connected to both the A and B power sources.
2- The datacenter's power comes from Hydro Quebec, which generates more than 99% of its electricity from water (clean energy).
3- In every rack, the switches are configured as multi-chassis switches (acting as one virtual switch).
4- Each server is connected to more than one physical switch, so we have no single point of failure throughout our network.
As you can see, we build in redundancy wherever we can: in the network, power, disks, and even the storage compute nodes.
Now let's explain what occurred 2 days ago, in as much detail as possible:
1- We were installing a new router in one of our racks.
2- While installing it, we needed to disconnect one of the switches. That is usually not an issue at all, as all the servers are connected to the switches using LAG, and if one switch goes down, all traffic switches to the other switch with zero interruption.
3- We disconnected the switch, did our work, and reconnected it, and everything seemed normal.
4- After a few minutes, the whole rack connected to that switch's LAG went down.
5- We checked everything for errors, including whether we had removed any cables by mistake during the installation, but everything was in perfect order, yet no traffic was flowing.
6- We troubleshot each component of our network step by step and found that the MLAG (multi-chassis LAG) between the switch we had disconnected and the other one did not become active after we reconnected it. Worse, it was in a strange state that prevented traffic from flowing.
7- We rebooted both switches and the network was back up and running.
8- We are still not sure what put the switches into this strange state, as disconnecting one switch while keeping traffic on the other is tested and done regularly, with zero issues in the past.
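From a server's point of view, a degraded LAG like the one found in step 6 can often be spotted by parsing the bonding status Linux exposes under /proc/net/bonding. A minimal sketch, assuming Linux bonding on the hosts (the interface names and the trimmed file excerpt below are illustrative):

```python
def lag_slave_status(bonding_text: str) -> dict:
    """Parse Linux /proc/net/bonding/<bond> text into {slave_iface: mii_status}.

    Each slave section starts with 'Slave Interface:' and reports its own
    'MII Status:' line; a healthy LAG shows every slave as 'up'.
    """
    status, current = {}, None
    for line in bonding_text.splitlines():
        line = line.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current:
            status[current] = line.split(":", 1)[1].strip()
            current = None
    return status

# Trimmed example where the link to one MLAG member never came back:
bond_sample = """\
Slave Interface: eth0
MII Status: up
Slave Interface: eth1
MII Status: down
"""
print(lag_slave_status(bond_sample))  # {'eth0': 'up', 'eth1': 'down'}
```

On a live host you would pass `open('/proc/net/bonding/bond0').read()` instead of the sample string; any slave not reporting `up` means that leg of the LAG is dead even if the cable looks fine.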
The issue resulted in the following:
1- All NVMe and SSD VMs in the affected rack had their data network down for up to 4 hours.
2- Some big storage VMs whose storage servers were not colocated in the affected rack lost the connection to their disks, and the OS immediately switched the file system on those VMs to read-only (because it could no longer access the remote storage).
3- As a preventive measure, the team rebooted all the storage VMs whose file systems had switched to read-only (they actually rebooted more VMs than that, as it was not always quick or easy to tell whether a VM's storage was in the same rack or not).
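The read-only flip described in result 2 is the standard way Linux guests react to storage errors (e.g. ext4's `errors=remount-ro` behavior). One way to check whether a guest ended up in that state is to scan its mount table; here is a sketch run against a sample /proc/mounts snapshot rather than the live file:

```python
def readonly_mounts(proc_mounts_text: str) -> list:
    """Return mount points whose options include 'ro', i.e. the state
    the affected storage VMs ended up in after losing their remote disks."""
    ro = []
    for line in proc_mounts_text.splitlines():
        parts = line.split()
        # /proc/mounts fields: device, mountpoint, fstype, options, ...
        if len(parts) >= 4 and "ro" in parts[3].split(","):
            ro.append(parts[1])
    return ro

# On a live VM you would pass open('/proc/mounts').read() instead:
mounts_sample = """\
/dev/vda1 / ext4 ro,relatime 0 0
tmpfs /run tmpfs rw,nosuid 0 0
"""
print(readonly_mounts(mounts_sample))  # ['/']
```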
Again, we are sorry for the instability over this last period, and we want to assure you that we will keep doing better in the future.