Showing posts with label Grid. Show all posts

Friday, May 17, 2019

Grid outage autopsy



April Linden has blogged about the outage this week that saw all users unable to connect to Second Life for four hours.

You can see the full blog here: https://community.secondlife.com/blogs/entry/2550-the-road-to-downtime-was-paved-with-good-intentions/

It reads:


Hi Residents!

We had one of the longest periods of downtime in recent memory this week (roughly four hours!), and I want to explain what happened.

This week we were doing much needed maintenance on the network that powers Second Life. The core routers that connect our data center to the Internet were nearing their end-of-life, and needed to be upgraded to make our cloud migration more robust.

Replacing the core routers on a production system that’s in very active use is really tricky to get right. We were determined to do it correctly, so we spent over a month planning all of the things we were going to do, and in what order, including full rollback plans at each step. We even hired a very experienced network consultant to work with us to make sure we had a really good plan in place, all with the goal of interrupting Second Life as little as we could while improving it.

This past Monday was the big day. A few of our engineers (including our network consultant) and myself (the team manager) arrived in the data center, ready to go. We were going to be the eyes, ears, and hands on the ground for a different group of engineers that worked remotely to carefully follow the plan we’d laid out. It was my job to communicate what was happening at every step along the way to my fellow Lindens back at the Lab, and also to Residents via the status blog. I did this to allow the engineering team to focus on the task at hand.

Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.

As part of the process of shifting traffic over to the second router, one of our engineers moved a cable to its new home. We knew that there’d be a few seconds of impact, and we were expecting that, but it was quickly clear that something somewhere didn’t work right. There was a moment of sheer horror in the data center when we realized that all traffic out of Second Life had stopped flowing, and we didn’t know why.

After the shock had worn off we quickly decided to roll back the step that failed, but it was too late. Everyone that was logged into Second Life at the time had been logged out all at once. Concurrency across the grid fell almost instantly to zero. We decided to disable logins grid-wide and restore network connectivity to Second Life as quickly as we could.

At this point we had a quick meeting with the various stakeholders, and agreed that since we were down already, the right thing to do was to press on and figure out what happened so that we could avoid it happening again. We got a hold of a few other folks to communicate with Residents via the status blog, social media, and forums, and I kept up with the internal communication within the Lab while the engineers debugged the issue.

This is why logins were disabled for several hours. We were determined to figure out what had happened and fix the issue, because we very much did not want it to happen again. We’ve engineered our network in a way that any piece can fail without any loss of connectivity, so we needed to dig into this failure to understand exactly what happened.

After almost four very intense hours of debugging, the team figured out what went wrong, worked around it, and finished up the migration to the new network gear. We reopened logins, monitored the grid as Residents returned, and went home in the middle of the night completely wiped out.

We’ve spent the rest of this week working with the manufacturer of our network gear to correct the problem, and doing lots of testing. We’ve been able to replicate the conditions that led to the network outage, and tested our equipment to make sure it won’t happen again. (Even they were perplexed at first! It was a very tricky issue.) As of the middle of the week we’ve been able to do a full set of tests including deliberately disconnecting and shutting down a router without impact to the grid at all.

Second Life is a really complex distributed system, and it never fails to surprise me. This week was certainly no exception.

I also want to answer a question that’s been asked several times on the forums and other places this week. That question is “why didn’t LL tell us exactly when this maintenance was going to happen?”

As I’ve had to blog about several other times in the past, the sad reality is that there are people out there who would use that information with ill intent. For example, we’re usually really good at handling DDoSes, but it requires our full capacity being online to do it. A DDoS hitting at the same time our network maintenance was in progress would have made the downtime much longer than it already was.

We always want what’s best for Second Life. We love SL, too. We have to make careful decisions, even if it comes at the expense of being vague at times. I wish this wasn’t the case, but sadly, it very much is.

We’re really sorry about this week’s downtime. We did everything we possibly could have to try to avoid it, and yet it still happened. I feel terrible about that.

The week was pretty awful, but does have a great silver lining. Second Life is now up and running with new core routers that are much more powerful than anything we’ve had before, and we’ve had a chance to do a lot of failure testing. It’s been a rough week, but the grid is in better shape as a result.

Thanks for your patience as we recovered from this unexpected event. It’s been really encouraging to see the support some folks have been giving us since the outage. Thank you, you’ve really helped cheer a lot of us up. ❤️

Tuesday, February 27, 2018

Grid Issues



The Second Life Grid has been up and down many times over the past 24 hours. April Linden takes up the story:


Hi everyone.

As I’m sure most of y’all have noticed, Second Life has had a rough 24 hours. We’re experiencing outages unlike any in recent history, and I wanted to take a moment and explain what’s going on.

The grid is currently undergoing a large DDoS (Distributed Denial of Service) attack. Second Life being hit with a DDoS attack is pretty routine. It happens quite a bit, and we’re good at handling it without a large number of Residents noticing. However, the current DDoS attacks are at a level that we rarely see, and are impacting the entire grid at once.

My team (the Second Life Operations Team) is working as hard as we can to mitigate these attacks. We’ve had people working round-the-clock since they started, and will continue to do so until they settle down. (I had a very late night, myself!)

Second Life is not the only Internet service that’s been targeted today. My sister and brother opsen at other companies across the country are fighting the same battle we are. It’s been a rough few days on much of the Internet.

We’re really sorry that access to Second Life has been so sporadic over the last day. Trying to combat these attacks has the full attention of my team, and we’re working as hard as we can on it. We’ll keep posting on the Second Life Status Blog as we have new updates.

See you inworld!

April Linden
Second Life Operations Team Lead


Follow the story here: https://community.secondlife.com/blogs/entry/2312-unscheduled-ddos/

Thursday, November 30, 2017

Proving that the Grid is flat



In an effort to scientifically prove that the Grid is flat and not round, a team of London City boffins built an environmentally friendly, dung-powered rocket to take a team of researchers into the stratosphere and beyond.

These brave pioneers traveled to the outer reaches of space, before realizing that physics is limited to 32 prims. As they sat marooned at 1500 meters, eyeing each other up as potential ballast that could be jettisoned, an idea struck them.

Just like Apollo Wosname, perhaps the brave four could return to Earth in their lunar module, which just happened to be a retro sofa from prim solutions.

For a fleeting moment on re-entry to the 'Land of Hub', out of the corner of their eyes, they spotted the horizon, and guess what!? It was flat!

We believe this to be a true proof of concept, and the first ever verified research that absolutely and irrevocably proves for all time that the grid is flat, and not round!

A hero's welcome was waiting for our brave astronauts as they touched down gracefully.