6 Summary

After discussing the current mathematical research on kernel density estimation, several new implementations for kernel density estimation using zfit and TensorFlow were proposed. Namely one just using linear binning (ZfitBinned), one using an approach based on a Fast Fourier Transform (ZfitFFT), one based on the improved Sheather-Jones algorithm (ZfitISJ) and one based on specialized kernel functions of the form \(poly(x)\cdot\exp(x)\) (ZfitHofmeyr). Since all proposed methods are based on TensorFlow and zfit, they are optimized for parallel processing on multiple CPUs and GPUs.

An extensive benchmarking showed that the methods ZfitBinned and ZfitFFT are both able to compete with the current state-of-the-art implementations in Python in terms of runtime for a given accuracy. Furthermore those methods showed superior performance both in accuracy as well as runtime for a given accuracy for large sample sizes (\(n \geq 10^8\)). Even for smaller datasets the methods may be favorable in cases where the PDF has to be built only once but is evaluated repeatedly (i.e. log-likelihood fits in zfit). In such cases ZfitFFT proves especially useful, as it only calculates a linear interpolation in the evaluation phase and has therefore the smallest runtime during this phase. Handling large datasets and fast repeated evaluation of the PDF are both cases that are important in high energy physics. Additionally ZfitBinned and ZfitFFT have the benefit of allowing any distribution of the location-scale family that follows the Distribution contract of TensorFlow Probability2 as kernel function, which includes more than twelve distributions at the moment.

For spiky non-normal distributions the method ZfitISJ provides superior accuracy with only a minor increase in runtime, because it computes an approximately optimal bandwidth and does not depend on assuming normally distributed data in doing so. It was able to outperform any other implementation in terms of accuracy in the most cases.

The last method ZfitHofmeyr is an interesting proof of concept, that can in theory compute exact kernel density estimates very fast, however, due to its recursive nature it is poorly suited to be implemented with TensorFlow. After implementing it with C++ as custom TensorFlow operation the practical speed gain was indeed notable, however it was not able to outperform the other implementations based on runtime for a given accuracy. In particular, it failed to approximate some distributions completely for bigger sample sizes, although this might be an artifact of an uncaught overflow error in the C++ implementation. Further investigation would be needed to find a way to mitigate this. For these reasons the current implementation of ZfitHofmeyr proofed insufficient to accurately portray the usefulness of Hofmeyr’s method.

Table 6.1: Comparison between KDE implementations (NR: normal reference rules, namely Scott/Silverman, CV: Cross Validation, ISJ: Improved Sheater Jones according to Botev et al.)
Feature / Library scipy sklearn statsmodels KDEpy Zfit
Number of kernel functions 1 6 7 (6 slow) 9 12
Weighted data points No No Non-FFT Yes Yes
Automatic bandwidth NR None NR,CV NR, ISJ NR, ISJ
Multidimensional No No Yes Yes No(planned)
Supported algorithms Exact Tree Exact, FFT Exact, Tree, FFT Exact, Binned, FFT, ISJ

All proposed implementations restrict themselves to the one-dimensional case as described in section 1.4. So far only KDEpy and Statsmodels implement a multidimensional KDE in Python. Generalization to higher dimensionel kernel density estimation is feasible and would extend its use case even further, although this would require more work. Firstly because the binning routine would have to be extended to the multi-dimensional case and secondly because the improved Sheather-Jones algorithm implementation would have to be adapted. In addition one must ensure that the kernel functions used are multidimensional themselves, however, this is already the case for the most important kernel function, the Gaussian density. In addition, looking deeper in the specialized kernel function based approach might result in an even faster implementation, however this would require substantially more work due to the difficulty of implementing recursion in TensorFlow efficiently.

In conclusion, the newly proposed methods ZfitFFT, ZfitISJ show superior performance for cases where only the PDF evaluation runtime is of importance, especially if the PDF is used for log-likelihood or \(\chi^2\)-squared fitting. Furthermore, the proposed improved Sheather-Jones algorithm implementation shows state-of-the-art accuracy in general while imposing only a minor runtime cost. Finally, for larger datasets with sample size (\(n \geq 10^8\)), the binned as well as the FFT based implementations (ZfitBinned, ZfitFFT) are able to outperform the current state-of-the-art in terms of runtime for a given accuracy. Therefore the newly proposed implementations are optimally suited for kernel density estimation in scientific fields like high energy physics that deal with large datasets.