Below is a timeline of events provided by our data center in Denver, Handy Networks:
- At 10:42 am, a customer reported that this virtual machine was not responsive. Our technical team investigated the issue and determined that the virtual machine required a reboot.
- At 10:44 am, a virtual machine on the “WEHOST2-VPS” Hyper-V cluster was rebooted because it was non-responsive. Upon reboot, it began sending anomalous network traffic, which had a cascading impact.
- At 10:47:43 am, the hypervisor hosting this virtual machine began throwing memory allocation errors related to vRSS network queues. By 10:55:56 am, the hypervisor had over 53 GB of memory allocated.
- At 10:51:30 am, the hypervisor that was providing service to the virtual machine crashed.
- At 10:55 am, the hypervisor recovered and continued to fill up vRSS network queues.
- At 10:56:56 am, fpc1 on dist3.denver2 indicated it was in alarm for DDoS protocol violations. (Note: dist3.denver2 is a Juniper QFX5100 virtual chassis, which should provide N+1 redundancy and high availability. Each node is referred to as an fpcX.)
- At 10:57:21 am, our primary switch stack, dist3.denver2, reported that interface states between fpc0 and fpc1 were unstable.
- At 10:57:33 am, fpc0 went into DDoS alarm as well.
- At 10:58:15 am, fpc1 crashed.
- At 10:59:15 am, fpc0 crashed. At this point, both nodes of our redundant distribution switch cluster were offline and network access was impacted for all clients.
- Within 3-4 minutes, we had senior engineers in front of the crashed switch stack, as well as engineers remotely connecting to other switches via our OOB network to determine their status.
- Over the next 30 or so minutes, fpc0 and fpc1 crashed repeatedly: fpc0 crashed at 11:02 am, 11:12 am, and 11:16 am, and fpc1 crashed at 11:06 am, 11:15 am, and 11:30 am. During these periods, some impacted clients had intermittent network connectivity.
- Our review of the logs from both fpc0 and fpc1 during this time indicates that the virtual chassis never fully converged, which resulted in spanning-tree loops on our network.
- At approximately 11:35 am, we powered down both fpc0 and fpc1 and brought them back online. Unfortunately, they did not come up cleanly.
- At 11:40 am, we began physically isolating fpc1 (both power and network).
- At 11:45 am, once fpc1 was isolated, we rebooted fpc0.
- fpc0 came back online at 11:55 am, but was stuck in “linecard” mode, requiring us to manually remove low-level configuration files and restart certain processes.
- At 12:03 pm, we completed this process.
- At 12:04 pm, we saw interfaces physically coming back online.
- At 12:06 pm, we began receiving UP alerts.
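The virtual-chassis convergence and DDoS-protection alarms described in the timeline can be inspected from the Junos CLI. The following is a minimal sketch of the kind of operational commands involved, assuming a QFX5100 virtual chassis; exact command availability and output format vary by platform and Junos release:

```
show virtual-chassis status                  # member roles (master/backup/linecard) and membership state
show virtual-chassis vc-port                 # health of the virtual-chassis ports linking fpc0 and fpc1
show ddos-protection protocols violations    # which protocol queues tripped the DDoS protection alarm
show spanning-tree interface                 # per-interface STP state, useful when hunting for loops
```

A virtual chassis that never converges will show members flapping between roles (or stuck in "linecard") in the first command, which is consistent with the repeated crashes observed here.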
At this time, we were cautiously optimistic that the immediate trouble was over, but we monitored the situation very carefully for the next 45 minutes while also beginning to look into various logs. After deep research into gigabytes of logs, we were able to definitively determine that the initiating event for this outage was the reboot of a particular VM. We had actually rebooted this same VM last week and experienced an isolated network issue that impacted two of our many Hyper-V clusters. Upon rebooting the VM today, the same general sequence of events occurred, except that the volume of network traffic involved was significantly higher, which had a far larger impact.
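Since the memory exhaustion on the hypervisor was tied to vRSS network queues, one interim mitigation while root cause work continues is to disable vRSS on the suspect VM's network adapter so that a future reboot cannot flood the host's vRSS queues in the same way. A hedged sketch using the standard Hyper-V PowerShell module follows; the VM name is a placeholder, and cmdlet behavior should be verified against your Windows Server version before use:

```powershell
# Placeholder name - substitute the actual guest involved in the incident
$vmName = "SUSPECT-VM"

# Inspect the current vRSS setting on the VM's network adapter(s)
Get-VMNetworkAdapter -VMName $vmName | Format-List VMName, Name, VrssEnabled

# Disable vRSS for this VM's adapters (takes effect without deleting the adapter)
Set-VMNetworkAdapter -VMName $vmName -VrssEnabled $false
```

This trades some guest network throughput for stability, which may be an acceptable interim posture for a VM known to trigger this failure mode.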
At the time the outage began this morning, we were actually in the process of reviewing the logs from the initial event on August 7th for root cause analysis purposes, and the development of a remediation plan.
After Action Plan
We will soon be publishing a series of network maintenance windows to restore redundancy to dist3.denver2 and make other network adjustments that should provide increased reliability going forward. The first maintenance window will likely take place Tuesday, August 13, 2019 starting at 9:00 pm. Further updates will be provided.
The network issue at the Denver data center has been resolved for some time now and all services have been restored. We will post the RFO as soon as it is available from the data center.
There is a data-center-wide network issue in Denver and the NOC team is working on it. Our network engineers are currently investigating an issue at the network level. We are working to fix this and will get back to you with updates shortly.
Apologies for the inconvenience caused.