Friday, May 17, 2019

Grid outage autopsy



April Linden has blogged about this week's outage, which left all users unable to connect to Second Life for roughly four hours.

You can see the full blog here: https://community.secondlife.com/blogs/entry/2550-the-road-to-downtime-was-paved-with-good-intentions/

It reads:


Hi Residents!

We had one of the longest periods of downtime in recent memory this week (roughly four hours!), and I want to explain what happened.

This week we were doing much needed maintenance on the network that powers Second Life. The core routers that connect our data center to the Internet were nearing their end-of-life, and needed to be upgraded to make our cloud migration more robust.

Replacing the core routers on a production system that’s in very active use is really tricky to get right. We were determined to do it correctly, so we spent over a month planning all of the things we were going to do, and in what order, including full rollback plans at each step. We even hired a very experienced network consultant to work with us to make sure we had a really good plan in place, all with the goal of interrupting Second Life as little as we could while improving it.

This past Monday was the big day. A few of our engineers (including our network consultant) and I (the team manager) arrived in the data center, ready to go. We were going to be the eyes, ears, and hands on the ground for a different group of engineers who worked remotely to carefully follow the plan we’d laid out. It was my job to communicate what was happening at every step along the way to my fellow Lindens back at the Lab, and also to Residents via the status blog. I did this to allow the engineering team to focus on the task at hand.

Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.

As part of the process of shifting traffic over to the second router, one of our engineers moved a cable to its new home. We knew that there’d be a few seconds of impact, and we were expecting that, but it was quickly clear that something somewhere didn’t work right. There was a moment of sheer horror in the data center when we realized that all traffic out of Second Life had stopped flowing, and we didn’t know why.

After the shock had worn off, we quickly decided to roll back the step that failed, but it was too late. Everyone who was logged into Second Life at the time had been logged out all at once. Concurrency across the grid fell almost instantly to zero. We decided to disable logins grid-wide and restore network connectivity to Second Life as quickly as we could.

At this point we had a quick meeting with the various stakeholders, and agreed that since we were down already, the right thing to do was to press on and figure out what happened so that we could avoid it happening again. We got a hold of a few other folks to communicate with Residents via the status blog, social media, and forums, and I kept up with the internal communication within the Lab while the engineers debugged the issue.

This is why logins were disabled for several hours. We were determined to figure out what had happened and fix the issue, because we very much did not want it to happen again. We’ve engineered our network in a way that any piece can fail without any loss of connectivity, so we needed to dig into this failure to understand exactly what happened.
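
For illustration only, here's a minimal sketch of that redundancy idea, assuming a hypothetical pair of paths to the same service. Real core routers handle this with routing protocols such as VRRP or BGP rather than application code, so treat it purely as an analogy:

```python
"""Toy sketch of "any piece can fail" redundancy (an analogy, not our actual design).

Real core-router redundancy is handled by routing protocols (e.g. VRRP or BGP);
the principle is the same, though: keep more than one path, health-check them,
and send traffic through whichever path is still alive.
"""
import socket

# Hypothetical redundant paths to the same service (placeholder hostnames).
PATHS = [
    ("primary.example.com", 443),
    ("secondary.example.com", 443),
]


def path_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Health check: can we open a TCP connection through this path?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_live_path() -> tuple[str, int]:
    """Return the first healthy path; the service stays reachable unless *all* paths fail."""
    for host, port in PATHS:
        if path_is_up(host, port):
            return (host, port)
    raise RuntimeError("every redundant path is down")
```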

After almost four very intense hours of debugging, the team figured out what went wrong, worked around it, and finished up the migration to the new network gear. We reopened logins, monitored the grid as Residents returned, and went home in the middle of the night completely wiped out.

We’ve spent the rest of this week working with the manufacturer of our network gear to correct the problem, and doing lots of testing. We’ve been able to replicate the conditions that led to the network outage and have tested our equipment to make sure it won’t happen again. (Even they were perplexed at first! It was a very tricky issue.) As of the middle of the week we’ve been able to run a full set of tests, including deliberately disconnecting and shutting down a router, with no impact to the grid at all.
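
For anyone curious what that kind of drill can look like, here's a minimal sketch of a connectivity probe one might run while deliberately shutting down a router. The host and port are placeholders and this isn't our actual tooling; it simply times any window in which connections fail:

```python
"""Minimal connectivity probe for a failover drill (placeholder target, not our real tooling).

Repeatedly attempts TCP connections to a service behind the redundant routers
and reports how long (if at all) connections failed while a router was being
deliberately disconnected.
"""
import socket
import time

TARGET_HOST = "example.com"  # placeholder: a service reached through the routers under test
TARGET_PORT = 443            # placeholder port
INTERVAL = 1.0               # seconds between probes
TIMEOUT = 2.0                # per-connection timeout


def probe(host: str, port: int, timeout: float) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main() -> None:
    outage_started = None
    print(f"Probing {TARGET_HOST}:{TARGET_PORT} every {INTERVAL}s (Ctrl-C to stop)")
    while True:
        if probe(TARGET_HOST, TARGET_PORT, TIMEOUT):
            if outage_started is not None:
                print(f"Connectivity restored after {time.time() - outage_started:.1f}s of loss")
                outage_started = None
        elif outage_started is None:
            outage_started = time.time()
            print("Connectivity lost; timing the gap...")
        time.sleep(INTERVAL)


if __name__ == "__main__":
    main()
```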

Second Life is a really complex distributed system, and it never fails to surprise me. This week was certainly no exception.

I also want to answer a question that’s been asked several times on the forums and other places this week. That question is “why didn’t LL tell us exactly when this maintenance was going to happen?”

As I’ve had to blog about several times in the past, the sad reality is that there are people out there who would use that information with ill intent. For example, we’re usually really good at handling DDoSes, but doing so requires our full capacity to be online. A DDoS hitting at the same time our network maintenance was in progress would have made the downtime much longer than it already was.

We always want what’s best for Second Life. We love SL, too. We have to make careful decisions, even if it comes at the expense of being vague at times. I wish this wasn’t the case, but sadly, it very much is.

We’re really sorry about this week’s downtime. We did everything we possibly could to avoid it, and yet it still happened. I feel terrible about that.

The week was pretty awful, but it does have a great silver lining. Second Life is now up and running with new core routers that are much more powerful than anything we’ve had before, and we’ve had a chance to do a lot of failure testing. It’s been a rough week, but the grid is in better shape as a result.

Thanks for your patience as we recovered from this unexpected event. It’s been really encouraging to see the support some folks have been giving us since the outage. Thank you, you’ve really helped cheer a lot of us up. ❤️
