Why is scipy.stats.gaussian_kde() slower than seaborn.kdeplot() for the same data?

In Python 3.7, I have a numpy array with shape=(2, 34900). This array is a list of coordinates, where index 0 holds the X coordinates and index 1 holds the Y coordinates.
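For reference, here is a minimal sketch of data with the same shape (random values standing in for my real coordinates):

import numpy as np

# Stand-in for the real data: 34900 (x, y) pairs, one row per axis
data = np.random.rand(2, 34900)
print(data.shape)  # (2, 34900)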

When I use seaborn.kdeplot() to visualize the distribution of this data, I get the result in about 5-15 seconds when running on a 7th-generation i5.
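Roughly, the seaborn call looks like this (a sketch; the exact arguments in my script may differ):

import seaborn as sns

# Bivariate KDE plot of the two coordinate rows
sns.kdeplot(data[0], data[1])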

But when I try to run the following piece of code:

import numpy as np
import scipy.stats

# Fit a Gaussian kernel to the data
k = scipy.stats.gaussian_kde(data, bw_method=.3)
# Define the grid
xi, yi = np.mgrid[0:1:2000*1j, 0:1:2000*1j]
# Evaluate the estimated density on the grid
zi = k(np.vstack([xi.flatten(), yi.flatten()]))

which fits a Gaussian kernel to the data and evaluates it on the grid I defined, it takes much longer. I wasn't able to run it on the full array, but on a slice of 140 points it takes about 40 seconds to complete.

The 140-point slice does produce an interesting result, which I was able to visualize using plt.pcolormesh().
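For completeness, this is roughly how I plot it (zi has to be reshaped back to the grid's shape):

import matplotlib.pyplot as plt

# Reshape the flat density values back onto the 2000x2000 grid
plt.pcolormesh(xi, yi, zi.reshape(xi.shape))
plt.show()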

My question is: what am I missing here? If I understand correctly, I'm using scipy.stats.gaussian_kde() to create an estimate of the density function defined by the data. Then I'm evaluating that function over a 2D space and getting its Z component as the result, which I then plot. But how can this process differ from seaborn.kdeplot() so much that it takes that much longer?

Scipy's implementation just loops over each data point, doing this:

for i in range(self.n):
    # distance of data point i to every evaluation point
    diff = self.dataset[:, i, newaxis] - points
    tdiff = dot(self.inv_cov, diff)
    # squared Mahalanobis distance, halved
    energy = sum(diff*tdiff, axis=0) / 2.0
    result = result + exp(-energy)
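That loop makes the cost proportional to (number of data points) × (number of evaluation points), which is a back-of-the-envelope way to see why my grid is the problem:

# Rough operation count (numbers from my test above)
n_data = 140             # points in the slice I tested
n_grid = 2000 * 2000     # evaluation points on my grid
print(n_data * n_grid)   # 560,000,000 kernel evaluations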

1 answer

  • answered 2018-03-13 22:31 ImportanceOfBeingErnest

    In general, seaborn has two ways to calculate the bivariate KDE. If statsmodels is available, it is used; if not, seaborn falls back to scipy.

    The scipy code is similar to what is shown in the question. It uses scipy.stats.gaussian_kde. The statsmodels code uses statsmodels.nonparametric.api.KDEMultivariate.
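    You can check which backend your installation will use via seaborn's internal flag (an implementation detail, so it may change between versions):

    import seaborn.distributions as sd
    # True -> statsmodels backend will be used; False -> scipy fallback
    print(sd._has_statsmodels)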

    However, for a fair comparison we would need to use the same grid size for both methods. Seaborn's default gridsize is 100 points.

    import numpy as np; np.random.seed(42)
    import seaborn.distributions as sd

    N = 34900
    x = np.random.randn(N)
    y = np.random.randn(N)

    # seaborn's defaults: "scott" bandwidth, 100 grid points, cut at 3 bandwidths
    bw, gridsize, cut = "scott", 100, 3
    clip = [(-np.inf, np.inf), (-np.inf, np.inf)]

    f = lambda x, y: sd._statsmodels_bivariate_kde(x, y, bw, gridsize, cut, clip)
    g = lambda x, y: sd._scipy_bivariate_kde(x, y, bw, gridsize, cut, clip)

    If we time those two functions,

    # statsmodels
    %timeit f(x,y)  # 1 loop, best of 3: 16.4 s per loop
    # scipy
    %timeit g(x,y)  # 1 loop, best of 3: 8.67 s per loop

    Scipy is hence twice as fast as statsmodels (the seaborn default). The reason the code in the question takes so long is that it evaluates the density on a 2000×2000 grid (4 million points) instead of seaborn's 100×100 grid (10,000 points).
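    With that in mind, the question's code becomes fast simply by matching seaborn's grid size:

    # Same code as in the question, but on a 100x100 grid
    xi, yi = np.mgrid[0:1:100*1j, 0:1:100*1j]
    zi = k(np.vstack([xi.flatten(), yi.flatten()]))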

    Seeing those results, one would actually be tempted to use scipy instead of statsmodels. Unfortunately, seaborn does not let you choose which one to use, so you need to set the respective internal flag manually.

    import seaborn as sns
    import seaborn.distributions as sd

    sd._has_statsmodels = False  # force the scipy backend
    # kdeplot now uses scipy.stats.gaussian_kde under the hood
    sns.kdeplot(x, y)