Why is it scipy.stats.gaussian_kde() slower than seaborn.kde_plot() for the same data?
In python 3.7, I have this numpy array with shape=(2, 34900). This arrays is a list of coordinates where the index 0 represents the X axis and the index 1 represents the y axis.
When I use seaborn.kde_plot() to make a visualization of the distribution of this data, I'm able to get the result in about 5-15 seconds when running on a i5 7th generation.
But when I try to run the following piece of code:
#Find the kernel for k = scipy.stats.kde.gaussian_kde(data, bw_method=.3) #Define the grid xi, yi = np.mgrid[0:1:2000*1j, 0:1:2000*1j] #apply the function zi = k(np.vstack([xi.flatten(), yi.flatten()]))
which finds the gaussian kernel for this data and applies it to a grid I defined, it takes much more time. I wasn't able to run the full array but when running on a slice with the size of 140, it takes about 40 seconds to complete.
The 140 sized slice does make an interesting result which I was able to visualize using
My question is what I am missing here. If I understood what is happening correctly, I'm using the
scipy.stats.kde.gaussian_kde() to create an estimation of a function defined by the data. Then I'm applying the function to a 2D space and getting it's Z component as result. Then I'm plotting the Z component. But how can this process be any different from
seaborn.kde_plot() that makes the code take so much longer.
Scipy's implementation just goes through each point doing this:
for i in range(self.n): diff = self.dataset[:, i, newaxis] - points tdiff = dot(self.inv_cov, diff) energy = sum(diff*tdiff,axis=0) / 2.0 result = result + exp(-energy)
Seaborn has in general two ways to calculate the bivariate kde. If available, it uses
statsmodels, if not, it falls back to
However, for a fair comparisson we would need to take the same grid size for both methods. The standard gridsize for seaborn is 100 points.
import numpy as np; np.random.seed(42) import seaborn.distributions as sd N = 34900 x = np.random.randn(N) y = np.random.randn(N) bw="scott" gridsize=100 cut=3 clip = [(-np.inf, np.inf), (-np.inf, np.inf)] f = lambda x,y : sd._statsmodels_bivariate_kde(x, y, bw, gridsize, cut, clip) g = lambda x,y : sd._scipy_bivariate_kde(x, y, bw, gridsize, cut, clip)
If we time those two functions,
# statsmodels %timeit f(x,y) # 1 loop, best of 3: 16.4 s per loop # scipy %timeit g(x,y) # 1 loop, best of 3: 8.67 s per loop
Scipy is hence twice as fast as statsmodels (the seaborn default). The reason why the code in the question takes so long is that instead of a grid of size 100, a grid of size 2000 is used.
Seeing those results one would actually be tempted to use scipy instead of statsmodels. Unfortunately it does not allow to choose which one to use. One hence needs to manually set the respective flag.
import seaborn.distributions as sd sd._has_statsmodels = False # plot kdeplot with scipy.stats.kde.gaussian_kde sns.kdeplot(x,y)