Sunday morning outage


Hello everyone,

My name is Chung Liu, and I recently joined Appcelerator in the role of Director of Cloud Engineering. Our developer site was down for a few hours on Sunday morning (November 15), and I would like to give our users an update on what happened and what we are doing to prevent outages going forward.

Early on Sunday morning, one of the servers that handles HTTP API calls stopped releasing database connections. After a short time, this caused other servers to stop working properly because they couldn’t establish their own connections to the database.
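To illustrate the failure mode described above, here is a minimal sketch of how a single handler that never releases its connections can starve a shared, fixed-size pool. This is not our production code: `BoundedPool`, `PoolExhausted`, `leaky_handler`, and `safe_handler` are hypothetical names, and the pool only counts slots rather than opening real database connections.

```python
import threading
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class BoundedPool:
    """Toy stand-in for a database connection pool with a fixed size."""

    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)

    def acquire(self, timeout=0.1):
        # Wait briefly for a free slot; give up instead of blocking forever.
        if not self._slots.acquire(timeout=timeout):
            raise PoolExhausted("no free database connections")
        return object()  # placeholder for a real connection handle

    def release(self, conn):
        self._slots.release()

    @contextmanager
    def connection(self, timeout=0.1):
        conn = self.acquire(timeout)
        try:
            yield conn
        finally:
            # Runs even if the request handler raises, so the slot is returned.
            self.release(conn)


def leaky_handler(pool):
    # Never released: each call permanently consumes one slot.
    pool.acquire()


def safe_handler(pool):
    with pool.connection():
        pass  # do the work; the connection is released on exit
```

With a pool of size 2, two calls to `leaky_handler` are enough to make every later `safe_handler` call fail with `PoolExhausted`, which mirrors how one misbehaving server affected the rest of the cluster.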

For redundancy and scalability, we run a cluster of load-balanced servers dedicated to handling HTTP API calls. We resolved the problem by removing the misbehaving server from the load balancer pool, which directed HTTP API traffic to the remaining servers in the cluster. Later, we spun up a replacement server instance in Amazon EC2, restoring the cluster to its original number of servers.

This outage also uncovered a bug in our client code for forward and reverse geolocation cloud services, which caused some iPhone applications to crash during the downtime. This issue is being resolved in the upcoming 1.5 release, and we will ensure that this type of problem does not happen again.

We are very sorry for the frustration and lost productivity caused by this downtime. Going forward, we are taking measures to address the root causes of this outage and to prevent other outages in the future. We will continue to improve our monitoring systems and internal processes so that we can detect issues earlier and resolve them more quickly. We will also increase the scope and coverage of our failover tests to verify that our high-availability architecture can withstand a more diverse range of failure scenarios.

If you have any more questions about the outage, please email support@appcelerator.com.

Thanks,

Chung Liu

Director of Cloud Engineering

Appcelerator, Inc.

1 COMMENT

  1. Hiya,

    One way, among many, of preventing cluster nodes from “stealing” all the available database connections is to give each one of them their own database username (username/hostname combo) and limit connections per user.

    It’s an often overlooked way of doing things. You’d obviously want more checks in code, more monitoring and all that jazz, but this is a last-ditch attempt at limiting the problem should all other checks fail for some reason.

    Ste Daniels
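
The per-user limit suggested in the comment above can be sketched as a small simulation. This is only an illustration under stated assumptions: `Database`, `max_total`, `max_per_user`, and the usernames `api1` and `api2` are hypothetical, and a real database would enforce the cap through its own account settings rather than application code.

```python
class Database:
    """Toy model of a database that caps connections globally and per user."""

    def __init__(self, max_total, max_per_user):
        self.max_total = max_total
        self.max_per_user = max_per_user
        self.open_by_user = {}  # username -> number of open connections

    def connect(self, user):
        """Return True if the connection is accepted, False if refused."""
        total = sum(self.open_by_user.values())
        used = self.open_by_user.get(user, 0)
        if used >= self.max_per_user or total >= self.max_total:
            return False
        self.open_by_user[user] = used + 1
        return True


# Assumed setup: each cluster node logs in with its own username.
db = Database(max_total=10, max_per_user=4)

# A leaking node keeps opening connections and never closes them.
leaked = sum(db.connect("api1") for _ in range(100))

# The healthy node can still connect because "api1" was capped.
healthy_ok = db.connect("api2")
```

In this model the leaking node is stopped at its own cap, so the remaining capacity stays available to the other nodes.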