Saturday, May 24, 2014

Linden Lab pledges: New wheel for hamster



It's been a bit of a worry for Linden Lab this week, suffering some of the worst problems on a grid-wide scale for several years.

This week users of Second Life suffered outages, logon suspensions, unrezzed avatars, extreme lag (even for them) and in short, some of the worst on-going performance issues in quite a long time.

In recent years it has become quite a rarity for Linden Lab to hold themselves accountable for any problems caused by bad code or hardware failure.  Even when they accidently removed all megaprims throughout the grid a few years back, leaving gaping holes in every region in the world, it was passed off as something the engineers would work on and fix, without much in the form of apology, much less an explanation.

So it is somewhat a surprise that Landen Linden has written on the Second Life Blog and offered a full account of the chain of events that took the grid to grid-lock.

Hopefully this is the dawn of a new era on open and frank accountability from the Lab.  They will probably find that if they explain what has actually happened, people will be more understanding of this kind of service disaster. 

From Landon Linden, in a posting entitle, The Recent Unpleasantness, here is what happened...

In my first few months at the Lab, we removed more than a hundred major single points of failure in our service, but several major ones still loomed large, the granddaddy of them all being the core MySQL database server. By late Winter 2009 we were suffering from a core database outage a few times each week.

With a lot of hard work and countless long nights we stabilized the service and started making major improvements to the overall stability and performance of Second Life. However, despite our continued improvements, and the relative tranquillity they have created, the spectres of technical debt and single points of failure still loom over our operations. In recent weeks some of them have struck and disrupted Second Life. So much so that I want to explain the outages that have occurred, how we addressed them, and what we are doing going forward.


First, that core MySQL database cluster still exists. It is still the core of many of our central functions. When the write server fails it takes a minimum of thirty minutes to promote a new server into position. The promotion itself is actually relatively quick, but its numerous dependent services must all be taken down and brought back up carefully to ensure that they are all functioning properly.

In the last two months the core MySQL write database has hit two different fatal hardware faults, driving us to temporarily halt most Second Life operations. In some sense, two major write database failures close together is bad luck, but we cannot depend on luck to ensure the reliability of Second Life. In the very near future, we are moving the core MySQL write server to a new hardware class, on which production read servers are already running. Moving the write server will further improve overall database performance and make failures less frequent. It does not of course solve the root single point of failure problem so in the coming days, weeks, and months we will be reducing the impact of database failures even more. This includes continued improvement to the rotation process, extracting more functions out of the core database cluster, and further reducing the number of features that depend on the single write server.

The core MySQL database, however, has not been our only problem recently. A few weeks ago there was a massive distributed denial of service attack on one of our upstream service providers that affected most of their customers, including us, and inhibited the ability of some to use our services. We have since mitigated future potential impact from such an attack by adding an additional provider. There have also been hardware failures in the Marketplace search infrastructure that have impacted that site, a problem that we are continuing to work through. Most seriously though was this week’s four and a half hour long login outage.

On Tuesday morning, users stopped being able to get into Second Life. The root cause was created over ten years ago in a system designed to assign a unique identifier to the hand-off of sessions from login to users’ initial regions. At 7:40AM Pacific Time, that system quietly ran out of possible numbers to assign. It took us four hours to isolate the problem, test a fix, and deploy the change. Users could immediately log in at that point, but it took an additional two hours for systems to settle out. When tens of thousands of users rush back into Second Life following an outage, we have to deliberately throttle some services to prevent further breakage.

Having such a hidden fault in a core service  is unacceptable, so we are doing a thorough review of the login process to determine if there are any more problems like this lurking. Our intent at this point also is to remove the identifier assignment service altogether. It not only was the ultimate source of this outage, but is also one more single point of failure that should have been dispatched long ago.
 

We want to apologize for all of the recent problems and the frustration they have caused. We too are frustrated and are intent on making our service better. Few things give me more pleasure than every day helping to make Second Life a happy and fun place. Thank you for your patience and support. We simply could not have a more devoted user base and for that we owe you better.

Most sincerely,
Landon (Linden)



No comments: