**UPDATE 1/15/2014**: This blog is no longer in service.

This post is now located at: http://slendermeans.org/ml4h-ch2-p1.html

Thanks,

-c.






Just wanted to say that it’s really fun to read along with these posts and learn a bit about how to do these things in Python.

One quick note regarding the performance of KDE’s: the naive way to calculate KDE’s using their on-paper mathematical definition is not very computationally efficient when you have a lot of data. (I discovered this for myself when recently coding them up in Julia.) The trick is to realize that the equation for a KDE is actually an instance of convolution, which implies that you can exploit the convolution theorem to speed up computation: transform the data and the kernel into their FFT representations, then do simple multiplication, then undo the FFT to get the full result. This is how R handles computing KDE’s. Not sure if the Python systems are doing this, but if not, that’s the obvious way to improve performance.
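The convolution trick described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not what R or statsmodels literally does; the function name `fft_kde` and its defaults (512 grid points, a 3-bandwidth margin) are made up for the example:

```python
import numpy as np

def fft_kde(x, bw, gridsize=512):
    """Gaussian KDE via the convolution theorem: bin the data onto a
    regular grid, then convolve the bin masses with the kernel using FFTs
    instead of summing one kernel term per data point."""
    lo, hi = x.min() - 3 * bw, x.max() + 3 * bw
    counts, edges = np.histogram(x, bins=gridsize, range=(lo, hi))
    weights = counts / len(x)                  # probability mass per bin
    centers = (edges[:-1] + edges[1:]) / 2
    dx = centers[1] - centers[0]
    # Gaussian kernel sampled at the same spacing, centered on the grid
    offsets = (np.arange(gridsize) - gridsize // 2) * dx
    kernel = np.exp(-0.5 * (offsets / bw) ** 2) / (bw * np.sqrt(2 * np.pi))
    # Circular convolution in O(n log n): ifftshift puts the kernel's center
    # at index 0, and the 3*bw margin makes the wrap-around negligible
    dens = np.fft.ifft(np.fft.fft(weights) * np.fft.fft(np.fft.ifftshift(kernel))).real
    return centers, np.maximum(dens, 0)
```

Compared with the naive double loop over evaluation points and data points, this turns an O(n·m) computation into binning plus an O(m log m) convolution.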

John – thanks for the kind words. You’re of course right about the FFT issue. It was clear that the statsmodels function I was using wasn’t doing that (there’s a note about it in the docstring, even), but I wasn’t sure about the SciPy function, so I didn’t want to get into it in this post.

See Skipper’s comment below: apparently SciPy’s `gaussian_kde` doesn’t use FFT, while there is a statsmodels implementation that does. I just wasn’t using it.

The book is great, by the way. I’m having a lot of fun going through it. I also enjoyed your Julia talk last night — all three talks were really interesting and I think got a lot of folks psyched about the project. (No plans for a “Will it Julia?” series though, since It Probably Won’t.)

I’ve done some exploration of density estimation, and it’s not actually obvious that the dFFT is the right way to compute it, because it has large memory overhead (especially as you move up in dimensionality). A straight loop is incredibly fast anyway (using Rcpp I can do a kernel smoothing on 1 million obs in ~90 ms) and makes for code that is much easier to understand.

Thanks, that’s really interesting to hear, Hadley. I’ll have to do some exploration of the performance advantages of using dFFTs for KDEs, since I’m planning to implement them in Julia in the next few months.

Great work! As you find deficiencies in statsmodels – documentation, speed, etc. – it would be a great help to us if you’d send a note to our mailing list [1] or open a github ticket [2] so we can improve things.

As for the speed of KDE: only our Gaussian kernel uses the FFT, which is *much* faster than the Gaussian KDE code from SciPy. I get a ~1500x speed-up in this case.

SciPy:

```
[~/src/ML_for_Hackers/02-Exploration/]
[18]: timeit density(heights.values)
1 loops, best of 3: 5.09 s per loop
```

Statsmodels:

```
[~/src/ML_for_Hackers/02-Exploration/]
[19]: kde = sm.nonparametric.KDE(heights.values)

[20]: timeit kde.fit()
100 loops, best of 3: 3.49 ms per loop

[21]: plt.plot(kde.support, kde.density)
[21]: []
```

[1] https://groups.google.com/forum/?fromgroups#!forum/pystatsmodels

[2] https://github.com/statsmodels/statsmodels/issues?utf8=%E2%9C%93

Awesome — thanks for this. As I get my bearings a little more, I’ll definitely be more active in submitting issues (and hopefully eventually help fix some).

So, wrt the KDE, it looks like I was just doing it wrong. I was using the `kdensity` function and running it straight on the data. Your procedure is to (1) create a KDE object with the `KDE` function, and (2) run the `fit` method of the KDE to get the estimated densities. I see `KDE.fit()` has a boolean `fft` argument, whereas `kdensity` doesn’t.

And indeed, doing it that way, I get about a 1000x speedup over SciPy’s `gaussian_kde`.

Am I right to think there’s no reason to be using `kdensity`?

Right. `kdensity` is just a helper function, but it’s exposed in case you don’t want the full KDE object (the fast one is `kdensityfft`, if you want it directly). `KDE.fit` just calls these under the hood. You’ll notice that the KDE object is a full-fledged distribution object like the scipy.stats densities: it has `cdf`, `entropy`, `cumhazard`, `icdf`, and `sf` methods. The only reason to use the non-FFT version is if you prefer a kernel other than Gaussian, though in practice the kernel choice doesn’t matter as much as bandwidth selection. We are currently working on data-driven bandwidth selection methods. FFT for non-Gaussian kernels only depends on someone writing out the closed-form solution (there’s a note in the source).

Keep up the good work, and ping us if you have any questions about scipy.stats and statsmodels.

I don’t think R’s ‘range’ function does what you think it does — range(a) just returns c(min(a), max(a)).

OTOH, numpy.ptp does what your my_range function does.

You’re right on both counts. Now noted in the post. Thanks!

I think `diff(range(x))` is the idiomatic way to do this in R.
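To make the correspondence concrete (the array values here are made up for illustration):

```python
import numpy as np

a = np.array([3, 7, 1, 9, 4])

# R's range(a) returns c(min(a), max(a)); diff(range(a)) gives the spread.
lo, hi = a.min(), a.max()    # the analogue of R's range()
spread = np.ptp(a)           # "peak to peak": max - min, all in one call
assert spread == hi - lo == 8
```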

Unless I missed something, it looks like the notebook and script both need an additional import statement, i.e., `import statsmodels.api as sm`.

Sterling:

You’re right. I thought I’d committed that fix. It’s up now. Thanks.



As usual, very educational.

One comment: you can link to the notebooks via nbviewer so that people can view them directly. Example for this chapter – http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/CH2/ch2.ipynb

Thanks. While I didn’t start doing it until the later posts (nbviewer was not set up yet for this one I think), many posts should have nbviewer links. In fact all of them are linked here: https://slendrmeans.wordpress.com/will-it-python/

Great set of posts. Here is one way of dealing with the TypeError thrown by your `my_quantiles` function.

```python
from pandas import Series

# Create a quantiles function to mimic R's
def my_quantiles2(s, prob=(0.0, 0.25, 0.5, 1.0)):
    '''
    Calculate quantiles of a series.

    Parameters:
    -----------
    s : a pandas Series
    prob : a tuple (or other iterable) of probabilities at
        which to compute the quantiles. Converts a single-valued
        prob into a tuple.

    Returns:
    --------
    A pandas Series with the probabilities as an index.
    '''
    # convert a single integer or float to a tuple
    try:
        q = [s.quantile(p) for p in prob]
    except TypeError:
        prob = (prob,)
        q = [s.quantile(p) for p in prob]
    return Series(q, index=prob)
```