Data-driven background estimation methods for physics analyses

16 June 2008

Prediction for SU3 SUSY events with the background estimated by various data-driven approaches (the transverse mass method for the W and tt backgrounds, and the neutrino replacement method for the Z background) in the no-lepton mode (left) and one-lepton mode (right), for a 10 TeV run with an integrated luminosity of 100 pb⁻¹ (the equivalent of the 2008 run). (Plots provided by Yousuke Kataoka, University of Tokyo.)

This year, at the 43rd Moriond conference, it was clear that everybody was waiting for the LHC to bring a few surprises to our field. But there is still much work to do before we get there, as the CSC notes exercise recently confirmed.

As already reported in several ATLAS e-news articles, there are still many tasks to manage, such as completing the FDR data analysis, tuning the full and fast simulations, speeding up the reconstruction algorithms, and setting priorities for MC simulation, to name but a few. As if this were not enough, we also need to develop "data-driven analyses", which were first tried in the CSC analyses; it is this aspect I will concentrate on here.

The key issue for physics analysis at the LHC is how to handle the huge QCD backgrounds over a signal with relatively small statistics in the individual search channels. To reject those QCD events we need lepton and photon identification algorithms with a very high rejection power against QCD jets, or we must require large missing transverse energy and high-energy multi-jets in the final state. As an example, typical SUSY events with gluinos, or SM Higgs bosons with mH = 150 GeV, are produced with cross-sections of the order of 10 pb, which corresponds to a production rate of one event every two minutes, or 10⁻² Hz, at a luminosity of 10³³ cm⁻²s⁻¹. On the other hand, QCD di-jet events with ET > 100 GeV are produced with a cross-section of about 1 μb, that is, a rate of 1 kHz at the same luminosity. To make a discovery using a simple counting approach, one would therefore need a background rejection of the order of 10⁵ to 10⁸. This means we need an extremely high background rejection while keeping a large acceptance for the signal.
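The rate arithmetic above can be checked in a few lines. This is a minimal sketch using only the numbers quoted in the text; the unit conversions are standard (1 pb = 10⁻³⁶ cm², 1 μb = 10⁻³⁰ cm²):

```python
# Sketch of the rate arithmetic from the text: rate [Hz] = sigma [cm^2] * L [cm^-2 s^-1].

PB = 1e-36    # 1 pb in cm^2
UB = 1e-30    # 1 microbarn in cm^2
LUMI = 1e33   # instantaneous luminosity, cm^-2 s^-1

def rate_hz(sigma_cm2, lumi=LUMI):
    """Event production rate in Hz for a given cross-section."""
    return sigma_cm2 * lumi

signal = rate_hz(10 * PB)  # ~10 pb gluino pair / Higgs production
qcd    = rate_hz(1 * UB)   # ~1 microbarn QCD di-jets with ET > 100 GeV

print(f"signal: {signal:.0e} Hz")               # 1e-02 Hz, one event per ~100 s
print(f"QCD:    {qcd:.0e} Hz")                  # 1e+03 Hz
print(f"needed rejection: {qcd / signal:.0e}")  # 1e+05
```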

Now, there are two motivations for using real data for the background estimation. The first is related to how well Monte Carlo simulations can model the detector performance, and how precisely QCD events are described by perturbation theory. For instance, in SUSY analyses, events with large missing transverse energy and high-energy multi-jets are required to enhance the signal. Large fake missing energy is caused by energy mis-measurements in the calorimeter, and the modelling of high-energy multi-jet events requires several terms of the perturbative expansion; an approximation given by the parton-showering model or a lowest-order multi-leg matrix-element calculation carries a large ambiguity. The estimation therefore cannot rely on MC predictions alone, which is why data-driven analysis is strongly desired: one wants to extract these quantities directly from the data.

The second motivation comes from the difficulty of simulating huge numbers of events. It is almost impossible to generate 10⁷ to 10¹⁰ background events with full simulation to reach a 1% accuracy level in the signal acceptance estimation. For instance, in the Higgs to two photons search, one background contribution comes from QCD jets. When one requires two photons coming from QCD jets, only one jet in 10,000 fakes a photon. One would then need very large numbers of simulated background events (~10⁸) if the fake rates were estimated using MC events alone.
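The size of the required sample follows directly from the quoted fake rate. A minimal sketch of the arithmetic, assuming a flat per-jet fake probability of 10⁻⁴ as in the text:

```python
# Back-of-the-envelope estimate: if only one jet in 10^4 is
# mis-identified as a photon, a QCD di-jet event fakes a di-photon
# pair with probability ~ (1e-4)^2, so populating the fake
# background with full simulation needs enormous samples.

fake_rate = 1e-4                 # P(jet reconstructed as a photon), from the text
p_double_fake = fake_rate ** 2   # both jets must fake a photon

events_per_fake_pair = 1 / p_double_fake
print(f"{events_per_fake_pair:.0e} simulated events per fake di-photon event")  # 1e+08
```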

But the fake photon distribution can be estimated from data. Using a process with a large cross-section, such as di-jet events, one can easily obtain the jet energy distribution with large statistics. Jets that are mis-identified as photons will have the same distribution. The fake photon rate is also needed; it is obtained by counting how many jets are also reconstructed as photons. Note that another approach to estimating the fake photon contribution is to use conversion events, but here the emphasis is on using high-statistics events from data.
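A toy sketch of this procedure is given below. The jet spectrum and the fake rate are invented placeholders (in a real analysis both are measured in di-jet data, and the fake rate generally depends on ET), but the mechanics are as described above: scale the high-statistics jet spectrum by the measured fake rate.

```python
import numpy as np

# Toy sketch of the data-driven fake-photon estimate.
rng = np.random.default_rng(0)

# Placeholder for the jet ET spectrum measured in a high-statistics
# di-jet sample (steeply falling, as for QCD).
jet_et = rng.exponential(scale=40.0, size=1_000_000) + 25.0

# Placeholder per-jet fake rate, counted in data as
# N(jets also passing photon ID) / N(jets).
fake_rate = 1e-4

bins = np.linspace(25, 300, 56)
jet_spectrum, _ = np.histogram(jet_et, bins=bins)

# Predicted fake-photon ET spectrum: same shape as the jets,
# scaled by the measured fake rate.
fake_photon_spectrum = fake_rate * jet_spectrum

print(f"jets: {jet_spectrum.sum()}, "
      f"predicted fake photons: {fake_photon_spectrum.sum():.1f}")
```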

If the signal is a narrow resonance and a simple side-band subtraction works, we do not need to extract the background contribution inside the mass window; we just fit the events. But in most cases the target signal is a complicated object. A typical example is SUSY events, where the effective mass or the missing ET is used to discriminate the signal from the backgrounds. The signal spectrum, however, is normally too broad for side-band subtraction, and in this case we need to know the shape of the background.
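For the narrow-resonance case, the side-band subtraction mentioned above can be sketched as follows; the mass distributions, window edges and event counts are all invented for illustration:

```python
import numpy as np

# Minimal side-band subtraction sketch: a narrow resonance on a
# smooth background, with the background under the peak interpolated
# from the side-bands.
rng = np.random.default_rng(1)

bkg = rng.uniform(100.0, 160.0, size=20_000)       # smooth background (toy)
sig = rng.normal(130.0, 1.5, size=500)             # narrow resonance (toy)
mass = np.concatenate([bkg, sig])

window = (125.0, 135.0)                            # signal window
side_lo, side_hi = (110.0, 125.0), (135.0, 150.0)  # side-bands

n_lo = np.sum((mass > side_lo[0]) & (mass < side_lo[1]))
n_hi = np.sum((mass > side_hi[0]) & (mass < side_hi[1]))

# Interpolate the smooth background into the window, scaling by the
# relative widths of the regions.
width_side = (side_lo[1] - side_lo[0]) + (side_hi[1] - side_hi[0])
bkg_in_window = (n_lo + n_hi) * (window[1] - window[0]) / width_side

n_window = np.sum((mass > window[0]) & (mass < window[1]))
signal_estimate = n_window - bkg_in_window
print(f"estimated signal: {signal_estimate:.0f}")  # close to the 500 injected
```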

The first step is to define a control sample by finding a region of phase space where the signal events do not contribute. Using MC simulation, one needs to find a kinematic variable which is not correlated with the final variable one wants to observe when extracting the background contribution in the signal region. The control samples are taken from real data and are mostly well-known Standard Model processes. Once good variables are found to model the shape of the background, the background in the signal region can be estimated from the control region measured in the data.

As an example, in SUSY inclusive analyses, two of the many available methods are used: the neutrino replacement method and the transverse mass method. When searching for a SUSY process leading to jets plus large missing transverse energy, one of the dominant backgrounds is Z→νν + jets. To estimate its contribution, one can use Z→μμ + jets events: the identified di-muon tracks are declared invisible to emulate neutrinos, and the missing transverse energy is re-calculated for the event. Since the production mechanisms of the two processes are the same, one simply has to correct for the muon reconstruction efficiency and the neutrino branching ratio. The missing transverse energy is thus correctly estimated from Z→μμ events in data, without having to rely on Monte Carlo simulations.
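A minimal sketch of the neutrino replacement step, with invented event contents; the branching ratios are approximate PDG values, and the di-muon efficiency is an assumed placeholder:

```python
import numpy as np

# Sketch of the neutrino replacement method: the two muons of a
# Z->mumu + jets event are treated as invisible, their transverse
# momenta are added back to the missing ET, and the event is weighted
# by the branching-ratio and efficiency corrections.

BR_ZNUNU = 0.20    # BR(Z -> nu nu), approximate PDG value
BR_ZMUMU = 0.0337  # BR(Z -> mu mu), approximate PDG value
EFF_MUMU = 0.85    # assumed di-muon reconstruction efficiency (placeholder)

def emulate_znunu(met_x, met_y, mu1, mu2):
    """Recompute MET with the two muons declared invisible.

    mu1, mu2: (px, py) of the identified muons [GeV];
    met_x, met_y: measured missing-ET components of the Z->mumu event.
    """
    new_x = met_x + mu1[0] + mu2[0]
    new_y = met_y + mu1[1] + mu2[1]
    return np.hypot(new_x, new_y)

# Per-event weight translating the Z->mumu control sample into a
# Z->nunu prediction.
weight = BR_ZNUNU / (BR_ZMUMU * EFF_MUMU)

met = emulate_znunu(5.0, -3.0, (60.0, 20.0), (-40.0, -35.0))
print(f"emulated MET = {met:.1f} GeV, event weight = {weight:.2f}")
```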

For W+jets and tt+jets events, the transverse mass method is used. First, one selects events with lepton + MET + jets, where MET is the missing transverse energy. According to the MC, the transverse mass has no correlation with the final effective-mass distribution, and in the low transverse mass region the tt+jets and W+jets events dominate. The fraction of W to tt events does not depend on the transverse mass cut, so this fraction remains valid in the signal region; we can therefore use the low transverse mass region as a control sample. The shape is drawn from the control sample, and the overall normalization is taken from the low missing-ET region, in which the SUSY signal does not contribute. Backgrounds estimated with these methods are shown in the figure above.
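The logic of the transverse mass method can be sketched on toy events. All distributions and cut values below are invented placeholders; the point is only the two-step mechanic: take the effective-mass shape from the low-mT control region, then fix the normalization in the signal-free low-MET region.

```python
import numpy as np

# Toy sketch of the transverse-mass method.
rng = np.random.default_rng(2)
n = 50_000

# Placeholder W/ttbar-like "data": mT and MET drawn independently of
# the effective mass, as the method assumes (no correlation).
mt   = rng.exponential(40.0, n)   # transverse mass [GeV]
met  = rng.exponential(60.0, n)   # missing ET [GeV]
meff = rng.gamma(4.0, 150.0, n)   # effective mass [GeV]

bins = np.linspace(0, 2000, 41)

# 1) Shape from the low-mT control region (placeholder cut value).
control = mt < 100.0
shape, _ = np.histogram(meff[control], bins=bins, density=True)

# 2) Normalization from the low-MET region, signal-free by assumption:
# the prediction is scaled so its integral matches this event count.
low_met = met < 100.0
norm = low_met.sum()

bin_width = bins[1] - bins[0]
prediction = shape * norm * bin_width  # predicted background per bin

print(f"predicted background events: {prediction.sum():.0f}")
```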

To gain confidence in our results, we need data to tune the Monte Carlo simulation, but the Monte Carlo is also needed to determine how to use the data in data-driven background estimation methods. These techniques became more widely used with the CSC efforts, but more is to come. After all, we still don't have data!


Soshi Tsuno

KEK, Japan