[Editor’s note: This post was originally published on the Architected Availability blog and is republished with permission from the author]
I am currently employed as the chief architect at Lucidchart. In my spare time (literally) I am also the ops guy. All of our servers are running on Amazon’s EC2 cloud. Using the cloud is amazing and frustrating at the same time. Managing hardware, using tape drives, and co-location facilities are all nightmares; on the other hand, so are service outages, network failures, and ephemeral storage drives.
As the CTO of Amazon, Werner Vogels, says, “Everything fails all the time.” I would like to give a report of one such failure: How it happened, what was affected, how we got through it, what I did because of it, and how I’ll never have to deal with it again.
Lucidchart’s Persistence Layer
Let me just stop right here and let you know just how much I like redundancy, automatic fail-over, and availability.
- MySQL Servers. I have 2 clusters of MySQL servers running the Lucidchart site. Each cluster is 2 servers in a master-master configuration. In the application, I have written code to automatically fail over if one of the servers stops responding. I have done the same thing for our Drupal and WordPress installations.
- MongoDB Servers. I have 1 cluster of MongoDB servers running the site. This cluster has 3 servers, each running 3 instances of mongod – one for replSet a, one for replSet b, and one for a config shard. The replication configuration is set up so that server 1 is the master for replSet a, server 2 is the master for replSet b, and server 3 is the fail-over for both replSets. If any of the 3 servers goes down, the other servers will automatically elect another master between themselves, and the site will go on without considerable interruption.
- Event Servers. We have not spent much time on our event server, so it’s currently an unscaleable, unreplicated, unavailable nightmare to manage, but it works as long as the process is restarted every six hours… yikes! It’s on the backlog to update. Because of its shortcomings, we have a low TTL on the DNS entry plus a hot backup. Should the event server have any issues, we just kill the server, switch DNS, and we’re back up and running in no time.
I hope I’ve emphasized enough that our database layer is redundant. This, however, did not prepare us at all for the EBS failure that we encountered last year.
Werner had it right. Something in Amazon’s network layer bugged out last summer, causing some EBS volumes to become unavailable. We were hit; specifically, the data volume for one of our MongoDB servers was not responding. The thing that I couldn’t understand was why MongoDB’s built-in fail-over wasn’t working. I’ve tested the fail-over in a variety of scenarios, and I’ve never seen the process be so unresponsive. Nevertheless, our site was down for about 20 minutes.
The underlying cause came to me while checking connections between the servers. The mongod process was still running, listening on the port, and accepting connections. Those connections would just hang forever. So, it wasn’t a downed server, it was just an impossibly slow server. Even armed with this information, the fix was not as quick as I had hoped. The process was stuck in the kernel waiting for an IOCTL, so killing the process did nothing – even kill -9. It was going to stay that way until EBS was back to normal.
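The difference between a downed server and an impossibly slow one can be sketched with a small probe. This is an illustration only, not the check that was run during the outage: the function name `probe`, the `payload` bytes, and the timeout are hypothetical, and a real check would send a request the service actually answers.

```python
import socket

def probe(host, port, timeout=2.0, payload=b"ping\n"):
    """Classify a TCP service as 'refused', 'hung', 'closed', or 'responding'.

    'refused' means nothing is listening (a downed server).
    'hung' means the connection was accepted but no reply arrived in time,
    which is what a process blocked on a dead volume looks like.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(payload)   # placeholder request, an assumption
            data = sock.recv(1)     # wait for the first byte of any reply
            return "responding" if data else "closed"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "hung"
```

A client that sees "refused" can fail over right away, while one that sees "hung" has already burned its whole timeout, which is why the slow server was so much worse than a dead one.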
After the flurry of Zabbix notifications and ssh sessions was over, my job was to reproduce and fix the issue so it wouldn’t hit again. Since our MongoDB servers had the problems, I jumped straight to a fix for their software. I won’t go into the details here, but I was able to reproduce the network volume failure using NFS shares. I logged a bug, and let the 10gen engineers do their magic.
We have now upgraded our MongoDB servers, and we are not susceptible to that particular bug anymore. Or are we? What if a similar problem happens in our event service? What if our web servers can’t access the log files? What if our MySQL EBS volumes go down? It seems I may not have actually fixed the problem at all.
I thought of another solution, a better one. All of our services listen on TCP ports, which can easily be controlled with iptables at the server level. If I had a process that continually checked a file on the volume, then I could change iptables to reject all traffic on the associated ports while the volume is down. The redundancy we have in our application layer handles downed servers a lot better than servers with quick connection times and horribly slow operation times.
There were a couple of issues:
- If the volume is down, then the process will never actually complete the file check; it will just hang forever.
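As a rough sketch of the iptables side of this idea, the rule for one port can be built and applied as below. The helper names `reject_rule` and `set_ports_blocked` are hypothetical, the choice of a TCP reset as the reject response is an assumption, and applying the rules for real requires root.

```python
import subprocess

def reject_rule(port, action="-I"):
    """Build an iptables command that rejects inbound TCP traffic on a port.

    action is "-I" to insert the rule or "-D" to delete the same rule.
    """
    return ["iptables", action, "INPUT", "-p", "tcp", "--dport", str(port),
            "-j", "REJECT", "--reject-with", "tcp-reset"]

def set_ports_blocked(ports, blocked, run=subprocess.check_call):
    """Insert (blocked=True) or delete (blocked=False) a reject rule per port.

    The caller is expected to call this only on a state change, so rules
    are not inserted twice. `run` is injectable for dry runs and tests.
    """
    action = "-I" if blocked else "-D"
    for port in ports:
        run(reject_rule(port, action))
```

For a dry run, `set_ports_blocked([27017], True, run=print)` prints the command instead of executing it. Rejecting, rather than silently dropping, makes clients fail fast with a refused connection.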
- File system reads are cached. Checking a file’s existence or contents won’t actually tell me whether the volume is working.
As it turns out, the solution was fairly simple. I would have 2 processes: one that checks the disk and sends heartbeats, and one that receives heartbeats and updates iptables. If the disk checker hangs, the heartbeats simply stop, and the second process notices. To get around the file system cache, the disk checker touches a file, which is a write rather than a read. After implementation and testing, it seems to work just as expected. I ran it through the same NFS scenario I used to reproduce the MongoDB bug. We released it immediately, and haven’t had the same problem since. That was back in August 2012.
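The two-process design can be sketched roughly as follows. This is not the released code: every name here (`touch`, `disk_checker`, `Watchdog`, `main`) is hypothetical, the `os.fsync` call is an added assumption to push the write past the page cache, and the callbacks only print where a real deployment would update iptables.

```python
import multiprocessing as mp
import os
import time

def touch(path):
    """Write to the volume; blocks (possibly forever) if the volume is hung."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.utime(fd)   # update timestamps, like touch(1)
        os.fsync(fd)   # assumption: force the change out to the device
    finally:
        os.close(fd)

def disk_checker(path, conn, interval=1.0):
    """Process 1: touch the file, then send a heartbeat, forever."""
    while True:
        touch(path)
        conn.send(time.time())
        time.sleep(interval)

class Watchdog:
    """Process 2 logic: infer volume health from heartbeat arrival."""

    def __init__(self, on_down, on_up, timeout=5.0):
        self.on_down, self.on_up, self.timeout = on_down, on_up, timeout
        self.healthy = True

    def step(self, conn):
        """Wait up to `timeout` for one heartbeat; fire a callback on change."""
        got_beat = conn.poll(self.timeout)
        if got_beat:
            conn.recv()
        if got_beat and not self.healthy:
            self.healthy = True
            self.on_up()
        elif not got_beat and self.healthy:
            self.healthy = False
            self.on_down()
        return self.healthy

def main(path, timeout=5.0):
    """Wire the two processes together; runs until interrupted."""
    recv_end, send_end = mp.Pipe(duplex=False)
    mp.Process(target=disk_checker, args=(path, send_end), daemon=True).start()
    dog = Watchdog(on_down=lambda: print("volume down: reject traffic"),
                   on_up=lambda: print("volume up: accept traffic"),
                   timeout=timeout)
    while True:
        dog.step(recv_end)
```

The point of the split is that the watchdog never touches the volume itself, so it keeps running even when the checker is stuck in the kernel.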
Under normal operation, each application server speaks to one database server or the other, depending on the item hash.
When the disk fails, the application will no longer be able to connect to the preferred database server because the traffic will be rejected by iptables. The client fails over to the other server, and everything is still alright.
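The client-side behavior described here can be sketched as below. This is a guess at the general shape, not Lucidchart’s application code: the function names, the use of SHA-256 for the item hash, and the injectable `connect` parameter are all assumptions.

```python
import hashlib
import socket

def preference_order(item_id, servers):
    """Rotate the server list so the item's preferred server comes first."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[start:] + servers[:start]

def connect_with_failover(item_id, servers, timeout=2.0,
                          connect=socket.create_connection):
    """Try the preferred server, then fall back to the others in order.

    A server whose ports are rejected by iptables fails immediately with
    ConnectionRefusedError (an OSError), so the fallback is fast.
    """
    last_error = None
    for host, port in preference_order(item_id, servers):
        try:
            return connect((host, port), timeout=timeout)
        except OSError as exc:
            last_error = exc
    raise ConnectionError(
        f"no database server reachable for {item_id!r}") from last_error
```

With two servers, every item has a stable preferred server and exactly one fallback, which matches the behavior described above when the preferred server starts rejecting traffic.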
We’ve decided to release this helpful piece of code to the open source community under the Apache 2.0 license. You can find it on GitHub at this URL:
Please take it, use it, and submit updates back. I would love to hear what you think of it or if it saves your bacon.