Wednesday, January 4, 2006

What Happened when the Lights Went Out

As you probably noticed, the Foxcloud Synchronization Server was down for about five days, starting early on the morning of December 31st, 2005. We realize that you may be curious about what happened, so here's a synopsis.

At about 5am on the morning of December 31st, the data center that hosts the Foxcloud Synchronization Server (FSS) lost power. Some of the details are still murky, but our understanding is that it was related to a wider, storm-related power outage. You may have heard about the heavy rains and flooding that affected much of the San Francisco Bay Area during the holiday week.

The power remained out for about 8 hours. When it returned, the FSS attempted to restart itself, but failed midway through its restart process. Here's where it gets a bit complicated. See if you can follow this:

The FSS is based on Cosmo, an open source sharing server that is being developed by the Open Source Applications Foundation, Foxcloud's organizational cousin. Cosmo, in turn, relies on Jackrabbit, an open source content repository package being incubated at Apache. The problem stemmed from a bug in the Jackrabbit code that is executed during restart. Given the holidays and the somewhat complex, distributed nature of the development of this project (we're in San Francisco, but the lead engineer on Cosmo is in New York this week, and the lead Jackrabbit developer lives in Switzerland), it took us until today to isolate the problem and develop a workaround for it.

We're taking steps to reduce the chances of this kind of problem recurring. We're preparing to move to a different data center that can provide us with better guarantees about power reliability and better onsite management of our servers. As part of that move, we'll also be switching to more robust, more fault-tolerant hardware than what we're running on now.

We'd like to be able to promise you that the Foxcloud Service will never go down again, but the reality is that this is young software and we're still working out the kinks. These kinds of outages are an unfortunate part of that process, but we'll be working to make them shorter and less painful. And now, with this blog, we'll be able to communicate with you better about what's happening.

Feel free to drop us a comment to let us know how you think we're doing. We're still excited about Foxmarks and the Foxcloud Service, and hope you are, too.