1 Introduction
1.1 Purpose of this thesis
The purpose of this thesis is to propose four novel implementations of univariate kernel density estimation based on TensorFlow [1], TensorFlow Probability [2] and zfit [3], which incorporate insights from recent papers to decrease the computational complexity and therefore the runtime. The newly proposed implementations are then compared to the state of the art of kernel density estimation in Python and shown to be competitive. By leveraging TensorFlow's graph-based computation, the newly proposed methods for calculating a kernel density estimate can benefit from parallelization and efficient computation on CPUs and GPUs.
First, the mathematical theory is summarized in chapter 2, before currently existing implementations of kernel density estimation (KDE) in Python are discussed in chapter 3. In chapter 4 the four novel KDE implementations are proposed, which are then compared against the current state of the art in chapter 5. Finally, the findings are summarized in chapter 6.
1.2 Kernel density estimation
Experimental science is based on accumulating and interpreting data; however, interpreting that data can be difficult. Suppose we are trying to understand some physical system. We might want to find out how probable certain events are, which is described by the probability density function (PDF) of the system. If we already have some knowledge of the system and the underlying mechanisms are well understood, we might be able to guess the shape of the probability density function and define it mathematically as a function depending on some parameters of the system (which might have physical meaning). If we then fit this function to the experimentally obtained data by some goodness-of-fit criterion such as the log-likelihood or \(\chi^2\), we can extract information about the parameters and learn more about the system itself. But what if the underlying mechanisms are too complex to be fully described analytically and the knowledge of the system is too poor to describe the model as an exact mathematical function?
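As a minimal illustration of such a parametric fit (not one of the implementations proposed in this thesis), the following sketch fits a Gaussian model to hypothetical data by maximizing the log-likelihood, i.e. minimizing the negative log-likelihood, using SciPy; the data and starting values are arbitrary choices for the example:

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical measurements; in practice these would come from an experiment.
data = np.random.normal(loc=1.0, scale=2.0, size=500)

def neg_log_likelihood(params):
    # Negative log-likelihood of a Gaussian model with parameters (mu, sigma).
    mu, sigma = params
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# Maximizing the likelihood is equivalent to minimizing its negative logarithm.
result = optimize.minimize(neg_log_likelihood, x0=[0.0, 1.0],
                           bounds=[(None, None), (1e-6, None)])
mu_hat, sigma_hat = result.x
```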
In the particle accelerator at CERN, for instance, a whopping 25 gigabytes of data is recorded per second [4], resulting from many physical interactions that occur almost simultaneously. It is often not feasible to anticipate features of the distribution one observes experimentally, as it is composed of many different distributions resulting from all the different physical interactions. However, through Monte Carlo based simulation, samples can be drawn from the theoretical distribution. These can be used in conjunction with so-called non-parametric methods, which approximate the shape of the sampled distribution without the need for a predefined mathematical model. The simplest non-parametric method is perhaps the histogram. By summing the data up in discrete bins, the underlying probability distribution can be approximated without any prior knowledge of the system itself. However, histograms tend to produce PDFs that depend strongly on bin width and bin positioning, meaning the interpretation of the data can change considerably based on two essentially arbitrary parameters.
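For illustration, a density-normalized histogram can be computed with a single NumPy call; the bin count and data used here are arbitrary choices for the sketch:

```python
import numpy as np

data = np.random.normal(size=1_000)  # stand-in for simulated or measured samples
# With density=True the counts are divided by n * bin_width, so the
# resulting histogram integrates to one and approximates the underlying PDF.
counts, edges = np.histogram(data, bins=30, density=True)
```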
A more sophisticated non-parametric method is kernel density estimation (KDE), which can be viewed as a sort of generalized histogram [5]. In a kernel density estimation, each data point is replaced by a so-called kernel function that specifies how much it influences its neighboring regions. These kernel functions are then summed up to obtain an estimate of the probability density, quite similar to summing up data points inside bins. However, since the kernel functions are centered directly on the data points, KDE circumvents the problem of arbitrary bin positioning [6]. Since KDE still depends on the kernel bandwidth (a measure of the spread of the kernel function) instead of the bin width, one might argue that this is not a major improvement. However, upon closer inspection, one finds that the estimated PDF depends less strongly on the kernel bandwidth than a histogram does on its bin width, and it is much easier to specify rules for an approximately optimal kernel bandwidth than it is for the bin width [7]. Mathematical research has shown that it is possible to compute an approximately optimal bandwidth value, which is not possible for the bin width. Another benefit is that one obtains a smooth distribution by specifying a smooth kernel function, which is often desirable or even expected from theory. Due to this increased robustness, KDE is particularly useful in High-Energy Physics (HEP), where it has been used for confidence level calculations for the Higgs searches at the Large Electron Positron Collider (LEP) [8]. However, there is still room for improvement in terms of accuracy and computation speed, and more sophisticated approaches to kernel density estimation have been proposed for specific areas of application [8].
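Formally, given data points \(x_1, \dots, x_n\), a kernel \(K\) and a bandwidth \(h\), the kernel density estimate is \(\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)\). The following NumPy sketch implements this exact, unoptimized sum with a Gaussian kernel; it is meant purely as an illustration and is not one of the implementations proposed in this thesis:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(grid, data, bandwidth):
    # One kernel per data point, centered on that point and evaluated on the grid;
    # normalizing by n * h makes the estimate integrate to one.
    u = (grid[:, None] - data[None, :]) / bandwidth
    return gaussian_kernel(u).sum(axis=1) / (len(data) * bandwidth)

data = np.random.normal(size=1_000)
grid = np.linspace(-4.0, 4.0, 200)
density = kde(grid, data, bandwidth=0.3)
```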
1.3 zfit and TensorFlow
The basic principle of KDE has been implemented in various programming languages and statistical modeling tools. The standard framework used in HEP that includes KDE is the ROOT/RooFit toolkit written in C++. However, as the amount of experimental data grows (as at CERN), so does the computational burden, and traditional methods of calculating a kernel density estimate become cumbersome. In addition, Python plays an increasingly large role in the natural sciences due to its accessibility and the support by corporations involved in Big Data. To advance research in HEP, zfit, a new alternative to RooFit, was recently proposed. It is implemented on top of TensorFlow, one of the leading Python frameworks for handling large data and high parallelization, allowing a transparent usage of CPUs and GPUs [3].
TensorFlow provides the intuitive accessibility of Python while ensuring speed and efficiency, since the underlying operations are implemented in C++. TensorFlow implements so-called graph-based computation, which means that it builds a graph describing the computation before actually executing it. By analyzing the graph, TensorFlow can optimize the algorithm and schedule parts that can be executed in parallel to run on different CPUs or GPUs, which is especially important for large data and lengthy calculations. Additionally, TensorFlow supports automatic differentiation: every operation in the graph implements its own derivative, and therefore the whole graph can be differentiated using the chain rule. This is especially important for applications where a loss function is minimized using its gradient, such as neural network based computations. Finally, TensorFlow can be used together with other scientific libraries in Python and accepts NumPy [9] arrays as input, although computations outside of the graph do not benefit from TensorFlow's optimizations and automatic differentiation. Being able to calculate a kernel density estimate using TensorFlow and zfit therefore has numerous advantages for large data.
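Both mechanisms can be illustrated with a minimal, hypothetical sketch: tf.function traces a Python function into a TensorFlow graph, and tf.GradientTape records the executed operations so that the gradient with respect to a variable is obtained by automatic differentiation. The data and loss function here are arbitrary placeholders:

```python
import tensorflow as tf

data = tf.constant([0.3, 1.7, -0.5, 2.1])
mu = tf.Variable(0.0)

@tf.function  # traces the Python function into a TensorFlow graph on first call
def loss():
    return tf.reduce_sum((data - mu) ** 2)

with tf.GradientTape() as tape:
    value = loss()

grad = tape.gradient(value, mu)  # d(loss)/d(mu) via automatic differentiation
```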
So far, only exact kernel density estimations exist for TensorFlow. However, as shown in chapter 2, the KDE can be approximated using several mathematical techniques with only a negligible decrease in accuracy, while decreasing the computational complexity substantially.
1.4 Univariate case
The newly proposed implementations in this thesis are limited to the one-dimensional case, since this is the case that is used most often and therefore benefits the most from a decreased runtime. Extending the implementations to the multi-dimensional case is feasible in the future; however, this would require more work, since the APIs are not quite identical to the univariate case. In addition, one would have to ensure that the kernel functions used are themselves multi-dimensional.