Data-ready: the Grid meets its Challenge

16 June 2008

From CERN to the world: lines connect major data centers back to Meyrin, Switzerland


Sure, "robust" is the Grid's middle name, and its many parts passed smaller-scale tests. But when 50 terabytes of data are streaming out of CERN every day, older data moves among the smaller centers, and analysis jobs are running on top of it all, will the Grid be able to keep up?

Throughout May, the Common Computing Readiness Challenge (CCRC) put exactly that to the test by running the computing activities of all four major LHC experiments together. It probed the Grid’s capacity to handle high volumes of data while analysing events at the same time, as well as the speed with which problems could be identified and resolved on the fly.

“We moved 1.4 petabytes of data around the world,” says Simone Campana, the ATLAS Tiers Coordinator, who directed a subset of the ATLAS Computing Operation during the CCRC: a small team of six to seven on-site ATLAS collaborators who worked with the Worldwide LHC Computing Grid and with teams at the Tier-1 and Tier-2 computing centers that receive data distributed from CERN’s Tier-0 site.
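To give a rough sense of scale (this is back-of-the-envelope arithmetic based only on the figure quoted above, not an official ATLAS number), 1.4 petabytes spread over the 31 days of May corresponds to an average transfer rate of

\[ \frac{1.4 \times 10^{6}\ \mathrm{GB}}{31 \times 86\,400\ \mathrm{s}} \approx 0.5\ \mathrm{GB/s}, \]

sustained around the clock for the entire month.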

The Challenge started out slowly, with just 200 megabytes of data per second moving among sites, and gradually ramped up as the month progressed to match the data flow expected when the LHC is running.

The teams kept increasing data flow until they doubled the expected rate of 600 megabytes per second, showing that the Grid could recover if trouble in the system halted the flow from Tier-0 to the large Tier-1 computing centers scattered across 10 countries. “In case we have to catch up with some backlog, we can do it,” Simone explains.
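As a back-of-the-envelope illustration of what that headroom buys (using only the rates quoted in this article), the nominal export rate works out to

\[ 600\ \mathrm{MB/s} \times 86\,400\ \mathrm{s/day} \approx 52\ \mathrm{TB/day}, \]

so a day-long interruption would leave roughly 50 terabytes of data queued at Tier-0. Exporting at double the nominal rate frees an extra 600 megabytes per second, enough to clear such a backlog in about one further day while still keeping up with the newly arriving data.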

At top speed, they achieved rates of 2 gigabytes per second leaving CERN for the Tier-1 sites, which is four times the rate at which data needs to move during normal running, according to Simone.

They also ran 20,000 jobs simultaneously on the Grid, moving data and analysing events at the same time. “We demonstrated that we could fill up the system with jobs and sustain a good rate for many, many days,” says Simone.


Ramping up toward 20,000 jobs


Encountering the occasional glitch in such a broad and decentralised system is inevitable, but the Grid team knows how to handle problems. One that Simone recalls was slow data transfer affecting at least two locations. When the teams began investigating, they traced the source of the trouble to the network; once the people responsible for the network were notified, the pace of data transfer soon returned to normal.

Simone describes these kinds of problems as the most difficult to solve. Starting from a general observation, the team searches for the cause in the top layers of software, closest to the users, excluding possibilities one by one until it reaches the baseline software and hardware.

Despite the meticulous process, those who work on the Grid know how to get to the bottom of problems fast. Data transfer was back up to speed within three hours of identifying the glitch.

The test run was a success in terms of handling data transfer, providing computing power for analysis jobs, and quickly solving problems as they arose. The Grid, having risen to the computing Challenge, is ready to distribute LHC data among the many computing sites.

Still, those working on the Grid will use the time remaining before start-up to test the data-reprocessing systems more thoroughly and anticipate the effects of chaotic data analysis on the system.


Katie McAlpine

ATLAS e-News