Distributed Analysis User Support

25 January 2010

A simplified view of the ATLAS distributed analysis system.

In the era of ATLAS data taking, user support has become a challenging task. With many new people starting on analysis, the task of user support is to ensure that everyone is able to analyse the collision data distributed among hundreds of computing sites worldwide. The Distributed Analysis Support Team (DAST) is a team of shifters forming the front line for all help requests on distributed data analysis.

ATLAS distributed data analysis is based on a complex and evolving system with many components interacting with each other (as seen in the picture above). Users can submit their analysis jobs to the system using one of the two “Frontend” tools available, namely pathena and Ganga. The workload management systems (the “Backends”) handle the matchmaking of analysis jobs to computing resources. Jobs are executed on the “Grids”, which provide the computing resources. The Distributed Data Management (DDM) system, not shown in the picture, is also an important component: it manages data distribution and access at the computing sites. Users also interact directly with the DDM system, both when querying the location of the data and when retrieving the output of their analysis from the remote Grid sites.

The system is evolving to handle most error conditions automatically; however, users still face various complications. Common problems include authentication difficulties, finding the data of interest, unstable or misconfigured computing sites, the complexity of the software used inside the Grid environment (the ATLAS standard code, Athena, and user analysis code) and retrieving job outputs. It is not expected that every user can become an expert in troubleshooting these problems. Although the distributed analysis system has been in use for a long time, user support had previously been provided on a best-effort basis, mainly by the developers of pathena and Ganga. More organized user support was needed in order to ease the load on the developers and to prepare for massive user analysis in the data-taking era. DAST was therefore formed for this purpose in September 2008.

DAST currently consists of ten expert shifters who provide direct support by monitoring a mailing list in the CERN e-groups called the distributed analysis help forum. There are two shifters on duty during working hours, one in the North American time zone and one in the European time zone, together covering 15 hours a day on weekdays. The analysis shift work is fully counted towards ATLAS service work.

The current team members include developers of the distributed analysis tools, site support people, and physicists with good data analysis experience. The overall activity of the team is coordinated by Daniel van der Ster and me, and we are both shifters ourselves.

Various types of problems are posted in the help forum besides the ones directly related to pathena and Ganga. These include problems with running Athena code, difficulties with using the physics analysis tools, site and service problems, data access at the sites, conditions database access by the analysis jobs, data replication and problems with using the DDM end-user tools (dq2 tools). The shifters help users by solving the issues directly or by escalating them to the relevant developers, site administrators and support groups. The forum has been very active from the beginning, with an average of 6-7 new threads per day in 2009. The support team usually responds within minutes to hours. In some cases users provide insufficient information to debug the issue, which extends the lifetime of the thread.

The shifters write a shift report at the end of the week that briefly discusses the open and closed threads in the forum. They meet weekly to go over the shift reports (posted in Indico and available to everyone in ATLAS) and to discuss outstanding issues and possible improvements.

As the restart of LHC operations quickly approached, the team planned to recruit new shifters in anticipation of an increase in user requests. I organized and chaired a training session on October 26-27, 2009. The goal was to attract new people to the support team as well as to review the status of the user support system and improve the documentation.

The training session was held remotely on EVO. Participation was excellent: 21 people expressed interest in joining the training session and 13 wanted to become trainee shifters after the meeting. The material was prepared and presented by the team members. As shown on the agenda, the first day was dedicated to talks giving an overview of the distributed analysis system from a shifter’s point of view. The second day focused on the infrastructure of the support team, providing more detailed information about how the team works. In addition, tips for responding to user requests were presented. The expert shifters have been training five new people since the meeting, and more will join in a few months.

While the ATLAS distributed computing teams put continuous effort into making the many components of the system more automated and robust, we would like to encourage experienced users to join our team and share their experience with others. More user-to-user support in the help forum is also vital. With many new people starting on analysis, users should feel free to help each other without waiting for a response from the shifters. We also hope that our documentation will be put to good use for a head start on distributed data analysis in this exciting year of ATLAS.




Nurcan Ozturk

University of Texas at Arlington