
RESOLVED: Outage - Denver Data Center


Please find the complete RFO.

This outage was caused by a misordered routing policy, which resulted in the unexpected withdrawal of our default route after one of our external carriers was disabled. The data center network team has updated its procedures to handle similar incidents differently and avoid such issues in the future.

Below, please find our reason for outage and root cause analysis reports related to the network disruption we suffered on January 17, 2019.
Timeline (All times are MT)

3:35 AM – We begin receiving customer reports of latency and packet loss on one of our providers, Hurricane Electric.

3:45 AM – The issue is escalated to our on-call technician. A decision is made to “admin down” our BGP sessions to Hurricane Electric.

3:52 AM – The “admin down” change is committed.

3:57 AM – Our external monitoring platform, NodePing, alerts our Denver-based NOC team and escalation contacts, including our CEO, that connectivity was lost to a large number of network devices.

4:04 AM – The on-call escalation point implements the management escalation plan due to the widespread nature of the outage. Management contacts also begin reaching out to on-site and on-call staff.

4:12 AM – On-site staff confirm that both data center locations appear to be operating normally from a power and cooling perspective and that there are no visible alarms on our networking equipment.

4:16 AM – Facilities staff and senior systems engineers are engaged to assess and triage the issue. On-site staff begin reviewing logs on our core/distribution layer switches at both data centers; no obvious errors are found.

4:47 AM – Physical console access to our core routers is obtained.

4:52 AM – The “admin down” change is reverted and connectivity is restored.

5:05 AM – After confirming all networking alerts were resolved, root cause analysis is initiated.

5:40 AM – Root cause analysis is complete and configuration changes are implemented on an emergency basis to remediate the issue at hand.

5:50 AM – The configuration changes are tested by again setting Hurricane Electric into “admin down” state. No disruption is noted, confirming the change is effective in resolving the underlying condition.

12:17 PM – This report is finalized.

Root Cause Analysis
A review of the logs and configurations on our border routers indicates that a chained series of misconfigurations, when combined with the “admin down” of Hurricane Electric, caused our default route to be withdrawn from all of our top-of-rack switches, disrupting traffic forwarding for substantially all of our clients in both of our data centers.

Technical Background
We have critical network infrastructure in three locations:
• CoreSite DE1 (910 15th Street)
• Our Denver Tech Center Data Center (5350 S. Valentia Way)
• Our Downtown Denver Data Center

These locations are connected via a redundant, physically isolated DWDM fiber network. We interconnect with other network providers at CoreSite and our DTC Data Center, so that we have physical redundancy for our external network connectivity. Historically, this fiber network has been very reliable, and it performed flawlessly during this incident. Our network is architected so that we receive full BGP route tables from our IP transit providers (Zayo, GTT, Hurricane Electric, and Cogent) into our border routers. Our core and distribution switches do not support full BGP routing tables, so we inject a “generated” default route into them, one from each border router.

These default routes are only generated and distributed to downstream devices when a border router is receiving full and complete routes from at least one of our IP transit providers. The policy we have in place typically looks like this:

term provider1 {
    from {
        neighbor x.x.x.x;
        next-hop x.x.x.x;
    }
    then accept;
}
term provider2 {
    from {
        neighbor x.x.x.x;
        next-hop x.x.x.x;
    }
    then accept;
}
term reject {
    then reject;
}
The logic here basically says: “If we have external routes from provider one or provider two, then send a default route to the core/distribution switches at both data center locations.” We have this kind of implementation to compensate for the potential of a dual fiber cut, which could leave one of our border routers in a completely isolated state. If that condition were to occur without the above policies in place, the dead-headed router would continue to advertise our IP space to the global Internet, blackholing traffic depending on the route and creating sub-optimal traffic flows within our network, with additional latency and packet loss internally.
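
For context, a policy like this is typically attached to a “generated” default route under routing-options, so the default only remains active while at least one qualifying transit session is contributing routes that the policy accepts. A minimal sketch of that attachment, in Junos-style syntax with an illustrative policy name and placeholder addressing (not our actual configuration), looks like this:

routing-options {
    generate {
        route 0.0.0.0/0 {
            /* illustrative policy name; the terms shown above decide
               whether any contributing route keeps the default active */
            policy PULL-DEFAULT;
        }
    }
}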

Unfortunately, these policies (which were designed and implemented to protect against network failure) were misconfigured on both of our border routers. The misconfigurations were slightly different on each border router:

Border1:
The ‘reject’ term was placed directly after the ‘he’ (Hurricane Electric) term, ahead of the terms for our other providers. When Hurricane Electric was taken offline, the policy matched the reject term before evaluating any other provider, and border1 stopped distributing a default route into our network.

Border2:
A recently added provider was missing from the policy statement, and the term for a provider that we have had online for quite a while contained a typo in its IP addresses, which made the policy statement ineffective.
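
For illustration only, a corrected version of this policy keeps one term per active transit provider and leaves the catch-all reject as the final term, so the loss of any single provider cannot, by itself, stop the default route from being generated. The term names below simply follow the providers listed earlier and the addresses remain placeholders; this is a sketch, not our production configuration:

term zayo {
    from {
        neighbor x.x.x.x;
        next-hop x.x.x.x;
    }
    then accept;
}
term gtt {
    from {
        neighbor x.x.x.x;
        next-hop x.x.x.x;
    }
    then accept;
}
term he {
    from {
        neighbor x.x.x.x;
        next-hop x.x.x.x;
    }
    then accept;
}
term cogent {
    from {
        neighbor x.x.x.x;
        next-hop x.x.x.x;
    }
    then accept;
}
/* the catch-all reject must remain the final term */
term reject {
    then reject;
}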

After Action Plan
From an IT governance perspective, important questions need to be asked:
• How did two misconfigurations happen in the first place?
• What can be done to prevent such issues from happening again, in terms of people, process, and technology?

The answers to these questions are linked. We have had an open network engineering position for some time. I personally handle most of our network engineering; it is one of the last technical operational responsibilities that I still handle myself. Our strategic plan for the second half of 2018 included hiring a dedicated network engineer, and the position is still open.

Consequently, peer reviews of major configuration changes on our border routers have not been happening effectively. Additionally, we relied on a manual process to generate the policy statements in the first place, which quite clearly failed. To ensure this governance gap is closed going forward, we will be hiring an additional network engineer as soon as we can recruit a qualified candidate.

Additionally, we will be working to make sure that this critical routing policy is generated through a reliable, automated process. In the past we have used Juniper SLAX scripts to perform similar configuration validations, but we are evaluating a number of options to ensure this particular issue does not recur.

Going forward, all staff will also be utilizing a “commit confirmed” methodology for any change on our border routers, even routine, standard changes. Utilizing this feature would have resulted in the automatic rollback of the change that triggered this underlying condition (a brief illustration follows the list below). From an operational perspective, we have also identified a number of shortcomings, which we will work to correct as soon as possible:

1. This outage was so widespread that it impacted our ability to access critical internal IT systems. Quicker access to these systems would have reduced the duration of the outage.

2. Our out-of-band (OOB) access to our routers had not been validated in quite a while, and it turns out that we had major issues using it. Having working access to our OOB network could have reduced the duration of the outage.

3. We are in the process of making our internal applications more resilient by adopting a hybrid, multi-cloud strategy. Unfortunately, this plan has not been fully implemented yet. While certain internal systems remained online, we still experienced operational disruption due to dependencies which had yet to be engineered around.
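
As an illustration of the “commit confirmed” methodology referenced above, a change similar to the original “admin down” (shown here as deactivating a hypothetical BGP group; the device and group names are illustrative) would be committed with an automatic rollback timer and only made permanent once connectivity is verified:

[edit]
user@border1# deactivate protocols bgp group HURRICANE-ELECTRIC

[edit]
user@border1# commit confirmed 5

(verify BGP sessions, default-route generation, and external monitoring; if the
change is not confirmed within 5 minutes, the router rolls it back automatically)

[edit]
user@border1# commit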

Consequently, our after-action plan, in total, is as follows:
1. Accelerate hiring a network engineer.
2. Require peer reviews of all major config changes on our border routers.
3. Automate the creation of the routing policies involved in this issue so that they are always correct and accurate.
4. Resolve issues with OOB access.
5. Accelerate our multi-cloud strategy for internal tools.

Closing
We understand that our customers entrust us to provide reliable and resilient data center and cloud hosting services. We are committed to improving our policies and processes so that this issue and similar issues do not recur in the future. We hope that, at the very least, you appreciate our candor regarding the circumstances of this outage and our commitment to taking definitive action to address some underlying organizational issues which extended the duration of this outage by approximately 15-20 minutes.

Also, to be completely clear, implementing and validating these network policies was entirely my personal responsibility. I could make a number of excuses, but at the end of the day, the reality is that my mistakes contributed greatly to this outage. I am very fortunate to work with dedicated professionals such as Pete, Jeff, Lindsay, and David, who all got woken up in the wee hours of the morning by this situation. We will rectify the issues identified as soon as possible.
Update 1 - 5:59 AM

All servers are now accessible as the underlying issue has been resolved.

Date: 17 January 2019, 5:13 AM

We are experiencing connectivity issues at our Denver Data center. The cause is currently unknown. We are in touch with our team at the data center. We will update this announcement as soon as we have more information.

Thank you for your cooperation,
SoftSys Support Team



© Softsys Hosting