Power down: recent planned and unplanned cuts

15 December 2008

One of the two diesel generators which started up during the power cut



Power cuts happen for a variety of reasons. ATLAS was able to power down gradually for a planned cut in electricity on Wednesday, November 19th, from 7:00 to 7:10. But cuts are not always planned: during the following two weeks, the main cooling plant got into the habit of shutting down early on Wednesday mornings.

The planned cut

Since 2006, when the Safe Power Network failed to engage during a power outage at CERN, the system team has made a point of performing annual tests to ensure that when the lights go out, a set of diesel generators turns on to keep essential systems running.

“We started the planning for this a week early,” says Martin Jäkel of Operations Plan Maintenance (OPM), coordinator for the power-down and recovery procedures in ATLAS. Since this test had been done last year, many of the procedures could be repeated, but a few new systems were involved.  For instance, the pixel detector was hardly active last year, but now it is fully operational. 

In preparation for the planned cut on Wednesday, OPM began powering down the systems half a day in advance.  The process starts with the detectors themselves.  The electricity supply is gradually ramped down. This takes between half an hour and two hours, depending on the detector.

Then they shut down the computers, between 800 and 1,000 of them, in USA15 and on the surface.

The last system to shut down is the rack infrastructure, which contains PC servers, power supplies, and readout electronics. Racks are turned off via the detector control system, once the detector group has shut down the equipment inside. This can take as little as half an hour.


Racks for the Liquid Argon Calorimeter in USA15



According to Giuseppe Mornacchi, also of OPM, the whole operation has to be done according to a predefined sequence because of the many interdependencies between components. For example, a PC gets switched off before the machine providing the file system.
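As a rough illustration of why the order matters, the dependencies can be written down as a graph and sorted. The sketch below is not the OPM procedure: the component names and the dependencies between them are assumptions made up for the example, and it simply uses Python's standard `graphlib` module to derive a start-up order, then reverses it for the power-down.

```python
from graphlib import TopologicalSorter

# Hypothetical components. Each key maps to the set of things it needs
# in order to run. These names and links are illustrative assumptions,
# not the real ATLAS dependency list.
DEPENDS_ON = {
    "detector": {"readout_pc", "rack_infrastructure"},
    "readout_pc": {"file_server", "rack_infrastructure"},
    "file_server": {"rack_infrastructure"},
    "rack_infrastructure": set(),
}


def startup_order(deps):
    """Return an order in which every component starts after what it needs."""
    return list(TopologicalSorter(deps).static_order())


def shutdown_order(deps):
    """Power down in the reverse of the start-up order."""
    return list(reversed(startup_order(deps)))


if __name__ == "__main__":
    for step, name in enumerate(shutdown_order(DEPENDS_ON), start=1):
        print(step, name)
```

With these assumed dependencies, the PC is switched off before the file server it relies on, and the racks go last, which matches the sequence described above. Running the same list forwards gives the order for bringing everything back up.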

At the time of a cut, two diesel groups at Point 1 start up and maintain the power to the Control Room and a number of critical systems, such as the cryogenics controls. Even with the detectors off, shifters continue to monitor the conditions in the cavern to ensure that nothing is damaged. Meanwhile, generators on the Meyrin site ensure that important parts of the infrastructure – such as the elevators and emergency lights – continue to run during the outage.

Once the Safety Power team was satisfied that the generators had engaged automatically and important systems continued to function, power was soon restored. The power-down sequence is then run in reverse to bring the ATLAS installation up and operational again.

“The interesting part is, of course, to bring up the detector again from complete power-down to stand-alone runs,” says Martin.  Often, a part will refuse to start or a program will fail.  He compares this to turning off and on any piece of electronics in your home a few thousand times.  One of those times, you can expect that something will go wrong.  

“We do not switch it on and off a thousand times, but we switch off a thousand of these at once,” he explains.

The surprise

On Wednesday, November 26th, at 6:25 a.m., a cooling failure unexpectedly shut down ATLAS.  The main cooling plant at Point 1 stopped running, while its control system kept reporting that everything was fine. “It was not a normal failure,” said Giuseppe in an interview two days after the incident. The team spent the next two weeks looking into what had happened.

With the main plant shut down, all the cooling systems ground to a halt.  The Detector Safety System operated as expected: it waited for a pre-programmed time (in most cases about 3 minutes) for cooling to return and then switched off power to protect the specific detectors and electronics.
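The wait-then-switch-off behaviour can be pictured as a simple watchdog. The following is a minimal sketch, not the Detector Safety System's actual logic: the class name, the fixed 180-second grace period (taken from the "about 3 minutes" above) and the rule that power stays off until someone restarts it by hand are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# "About 3 minutes" in most cases, according to the article. The real
# value is set per system; 180 s is only an illustrative assumption.
GRACE_PERIOD_S = 180.0


@dataclass
class CoolingWatchdog:
    """Hypothetical stand-in for one cooling-loss interlock."""

    grace_period_s: float = GRACE_PERIOD_S
    cooling_lost_at: Optional[float] = None
    power_on: bool = True

    def update(self, now: float, cooling_ok: bool) -> bool:
        """Feed one cooling reading taken at time `now` (in seconds).

        Returns True while power is still on, False once it has been cut.
        """
        if not self.power_on:
            # Assumption: once tripped, power stays off until a manual restart.
            return False
        if cooling_ok:
            self.cooling_lost_at = None  # cooling came back in time
        elif self.cooling_lost_at is None:
            self.cooling_lost_at = now  # cooling just failed: start the clock
        elif now - self.cooling_lost_at >= self.grace_period_s:
            self.power_on = False  # grace period expired: cut the power
        return self.power_on
```

Timestamps are passed in rather than read from a clock, so the behaviour can be checked without waiting three minutes: a brief cooling glitch that clears within the grace period leaves the power on, while a failure that lasts the full period trips the interlock.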


One of the detector Front End Cooling Stations in UX15



This shut-down was considerably safer than running without cooling, but cutting the electricity quickly risks causing problems in the detectors.  Again, Martin compares it to a home computer – you’re more likely to see problems if you suddenly unplug it than if you allow it to run shut-down processes.

The cooling failure stopped the detector for the whole day, and would have stopped data-taking for 12-15 hours if it had happened during collisions.  The cause of the fault was tracked down to a (formerly) Uninterruptible Power Supply (UPS) module that was powering the controls of the main cooling plant. 

Even with the source of the problem identified, the plant shut down a second time on Wednesday, December 3rd – again at about half past six in the morning. Given the day and time, it seemed likely that the shutdowns were some hangover from the planned power cut two weeks earlier.

The solution

While the cooling group looked for regularly scheduled interventions that might trigger the main cooling plant to switch off, they also examined their brand of UPS modules. Indeed, the modules have a “self discharge” test mode which was automatically activated at the moment of the first, planned power outage.

In any case, the power source for the controls of the main cooling plant is changing to the normal ATLAS UPS system. Says Giuseppe: “We have now removed this module and have a line to our Uninterruptible Power Supply System ready to be connected.” Linked into the larger system, the cooling plant will no longer suffer these unexpected, recurrent stoppages.

Both Martin and Giuseppe agree that the cooling and ventilation systems have been very stable over the last year, and they are confident that, in spite of the recent problems, that record of stability will continue.

 

Katie McAlpine

ATLAS e-News