Archive:Hardware and hosting report/Archives 2005

2005 Q2 Report

But... down times...still happen...

Wikimedia site down on May 13
On Friday, May 13, power supply trouble knocked out the majority of Wikimedia servers. The main network switch, the core file server, and nearly all web and cache servers went down. Because all of the database servers had dual power supplies, they were the only ones to survive the crash. Full site recovery took more than one hour, with some lasting side effects (some hardware needs replacement).

2005 Q1 report

Hardware Report

 
(Image: LCD layers)

January and February saw a number of slowdowns and, on one occasion, a complete shutdown of Wikimedia sites. The causes varied: many individual servers broke (10 machines in the main cluster were fixed in the first quarter alone), and traffic continues to rise. The colocation facility also had a massive power failure in February, leading to two days of downtime and read-only availability. As of the start of Q2, almost all servers are back in action, and the cluster is looking healthy.

Developers have recently started a LiveJournal as a way to communicate with the community about server issues. That's one more feed for your preferred RSS aggregator.

Power outages

There were two major outages in the first quarter. The first outage, around February 21st, was due to a double power failure: two different power supplies to our cluster were switched off at the same time when some of the internal switches in our colocation facility failed. Some databases were corrupted by the sudden loss of power; the surviving database was not completely up to date with the most current server, and it took almost two days for developers to recover all data. In the meantime, the site was restored in read-only mode after a few hours.

The second outage took place on March 16th and was due to human error: one of the master database's hard disks filled up, preventing slaves from being updated. At this point the data cluster had not fully recovered from the previous outage, and there was less than full redundancy among the database slaves. By the time space was made on the disk, the most up-to-date slave was already many hours behind. It took over eight hours of read-only time for the databases to be resynchronized.

Caches installed near Paris

Report from David Monniaux.

 
(Image: Our servers are the three machines in the middle.)

In December 2004, servers donated to the Wikimedia Foundation were installed at the Telecity facility located in Aubervilliers on the outskirts of Paris, France. The network access is donated by French provider Lost Oasis. In January, the software setup was completed; however, various problems then had to be ironed out.

As of April 1, 2005, those machines cache content in English and French, as well as all multimedia content (images, sounds...), for users located in Belgium, France, Germany, Luxembourg, Switzerland, and the United Kingdom (daily stats per country). The caches work as follows: if they hold the requested page in their local memory, they serve it directly; otherwise, they forward the request to the main Florida servers and store the answer while passing it to the browser of the Wikipedia user. Typically, for text content, 80% of accesses are cached (that is, they are served directly); the proportion climbs to 90-95% for image accesses. Due to the current way that the MediaWiki software works, content is cached much more efficiently for anonymous users: essentially, all text pages have to be requested from Florida for logged-in users.
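The behaviour described above can be sketched as a small read-through cache. This is only an illustration in Python, not the caching software actually running on the Paris machines: the class `ReadThroughCache`, the `fetch_from_origin` callback, and the `logged_in` flag are hypothetical names, and the sketch ignores expiry, eviction, and invalidation after edits.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy read-through cache: serve from local memory, else ask the origin."""

    def __init__(self, fetch_from_origin: Callable[[str], str]) -> None:
        # fetch_from_origin stands in for a request to the main Florida servers.
        self._fetch = fetch_from_origin
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str, logged_in: bool = False) -> str:
        # Assumption for illustration: requests from logged-in users always
        # go to the origin and their responses are not stored.
        if logged_in:
            self.misses += 1
            return self._fetch(url)

        # Cache hit: serve directly from local memory.
        if url in self._store:
            self.hits += 1
            return self._store[url]

        # Cache miss: forward to the origin, remember the answer, pass it on.
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body
        return body

    def hit_rate(self) -> float:
        """Fraction of requests served without contacting the origin."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_origin(url: str) -> str:
    """Hypothetical origin server used only for this example."""
    return f"<html>content of {url}</html>"


if __name__ == "__main__":
    cache = ReadThroughCache(fake_origin)
    for _ in range(4):
        cache.get("/wiki/Paris")
    cache.get("/wiki/Paris", logged_in=True)
    print(f"hit rate: {cache.hit_rate():.0%}")
```

The hit rate computed this way corresponds to the "80% of accesses are cached" figure quoted for text content: it is the share of requests that never reach Florida.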

The interest of such caches is twofold:

  • First, they relieve the load on the main Wikimedia Florida servers. We have to buy our bandwidth (network capacity) for Florida, whereas we can get (smaller) bandwidth chunks in other locations.
  • Second, they make browsing much quicker and more responsive, at least for anonymous users. Any access to the Florida servers from Europe may take 100-150 ms round trip; this means that retrieving a complete page may take a significant fraction of a second, even if the servers respond instantaneously. The Paris servers, on the other hand, have much shorter round-trip times from the countries they serve.
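As a rough back-of-the-envelope illustration of the second point: if loading a page needs several round trips one after another, the network delay alone is about the round-trip time multiplied by that number. Only the 100-150 ms range comes from the report; the count of four sequential round trips and the 20 ms figure for a nearby cache are assumptions chosen purely for illustration.

```python
def network_delay_ms(rtt_ms: float, sequential_round_trips: int) -> float:
    """Lower bound on page latency from network delay alone.

    Assumes the servers respond instantaneously and that the round trips
    cannot overlap, so the delays simply add up.
    """
    if rtt_ms < 0 or sequential_round_trips < 0:
        raise ValueError("rtt_ms and sequential_round_trips must be non-negative")
    return rtt_ms * sequential_round_trips


if __name__ == "__main__":
    ASSUMED_ROUND_TRIPS = 4  # hypothetical: connection setup, page, follow-up requests
    scenarios = [
        ("Florida, low estimate", 100),
        ("Florida, high estimate", 150),
        ("Nearby cache (assumed RTT)", 20),
    ]
    for label, rtt in scenarios:
        delay = network_delay_ms(rtt, ASSUMED_ROUND_TRIPS)
        print(f"{label}: {delay:.0f} ms")
```

Under these assumptions a page fetched from Florida spends 400-600 ms on the network before any server time is counted, which matches the "significant fraction of a second" mentioned above.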

The Paris caches serve as a production experiment and test bed for future cache developments, which are currently being studied. We may, for instance, change the caching software in order to reduce the load on the caches (currently, with all the countries they serve, the machines are loaded 80-95%; the machines are, however, quite outdated). We will also look at how to improve efficiency and cache hit rates (it appears that the caches are not fetching data from each other as efficiently as they should).

Blocking of open proxies

Since March 28th, Wikipedia has been automatically blocking edits coming from open proxies. The feature is still in testing; details are being worked out on the Meta-wiki.

Jimmy Wales asks for more developers at FOSDEM

As the opening speaker at the FOSDEM 2005 conference in Brussels, Jimmy Wales appealed to the development community for support with the technical side of running Wikipedia. Analyses of these remarks were published in several places last week.

Announcements

  • Wikimedia server was down for several hours on 17 March.
    On 17 March there was read-only service for several hours after a disk drive used for logging on the master database server became full. Monitoring tools showed only the combined space for all drives, and the last human check of that drive alone showed apparently sufficient space available. Improved monitoring and larger disk drives are being obtained. James Day
    More downtime analysis.
  • Wikimedia servers down on February 22.
    On February 22, two circuit breakers blew, removing power from most of the Wikimedia servers and leading to the loss of all service for several hours, no editing for most of a day and slowness for a week. Full recovery of database robustness took the better part of a month. Uninterruptible power supplies could have reduced the effect of this incident (but not all power incidents; law requires an emergency power-off switch, which has caused outages for other sites, most notably for LiveJournal not long ago). Additional UPS systems will be used for key systems, fire code willing. -James Day
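The 17 March announcement above notes that monitoring showed only the combined space for all drives, which hid a single full logging drive. A minimal sketch of a per-drive check follows, assuming Python's standard `shutil.disk_usage`; the mount points, the sample numbers, and the 10% threshold are hypothetical and not taken from the actual Wikimedia monitoring setup.

```python
import shutil
from typing import Dict, Iterable, List, Tuple

Usage = Tuple[int, int]  # (total_bytes, free_bytes)


def read_usage(paths: Iterable[str]) -> Dict[str, Usage]:
    """Read total and free bytes for each mount point."""
    usage: Dict[str, Usage] = {}
    for path in paths:
        stats = shutil.disk_usage(path)
        usage[path] = (stats.total, stats.free)
    return usage


def combined_free_fraction(usage: Dict[str, Usage]) -> float:
    """Free space summed over all drives, as a fraction of summed capacity."""
    total = sum(t for t, _ in usage.values())
    free = sum(f for _, f in usage.values())
    return free / total if total else 0.0


def drives_below(usage: Dict[str, Usage], min_free_fraction: float) -> List[str]:
    """Drives whose own free fraction is under the threshold."""
    return sorted(
        path
        for path, (total, free) in usage.items()
        if total and free / total < min_free_fraction
    )


if __name__ == "__main__":
    # Hypothetical numbers: a large data drive with room, a small full log drive.
    sample = {"/data": (1000, 600), "/logs": (100, 1)}
    print(f"combined free: {combined_free_fraction(sample):.0%}")
    print("drives below 10% free:", drives_below(sample, 0.10))
```

In the sample data the combined figure looks healthy while the per-drive check still flags the log drive, which is the situation the announcement describes. On a real machine, `read_usage` would supply the numbers instead of the hard-coded sample.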