Will it Python? Machine Learning for Hackers, Chapter 2, Part 1: Summary stats and density estimators

UPDATE 1/15/2014: This blog is no longer in service.

This post is now located at: http://slendermeans.org/ml4h-ch2-p1.html


This entry was posted in Will it Python and tagged , , . Bookmark the permalink.

17 Responses to Will it Python? Machine Learning for Hackers, Chapter 2, Part 1: Summary stats and density estimators

  1. Just wanted to say that it’s really fun to read along with these posts and learn a bit about how to do these things in Python.

    One quick note regarding the performance of KDE’s: the naive way to calculate KDE’s using their on-paper mathematical definition is not very computationally efficient when you have a lot of data. (I discovered this for myself when recently coding them up in Julia.) The trick is to realize that the equation for a KDE is actually an instance of convolution, which implies that you can exploit the convolution theorem to speed up computation: transform the data and the kernel into their FFT representations, then do simple multiplication, then undo the FFT to get the full result. This is how R handles computing KDE’s. Not sure if the Python systems are doing this, but if not, that’s the obvious way to improve performance.

    • Carl says:

      John – thanks for the kind words. You’re of course right about the FFT issue. It was clear that the statsmodels function I was using wasn’t doing that (there’s a not about it in the docstring even), but I wasn’t sure about the SciPy function, so I didn’t want to get into it in this post.

      See Skipper’s comment below: apparently SciPy’s gaussian_kde doesn’t use FFT, while there is a statsmodels implementation that does. I just wasn’t using it.

      The book is great, by the way. I’m having a lot of fun going through it. I also enjoyed your Julia talk last night — all three talks were really interesting and I think got a lot of folks psyched about the project. (No plans for a Will it Julia? series though, since It Probably Won’t).

    • I’ve done some exploration of density estimation and it’s not actually obvious that dFFT is the right way to compute it because it has large memory overhead (esp. as you move up in dimensionality). A straight loop is incredibly fast anyway (using in Rcpp I can do a kernel smoothing on 1 million obs in ~90 ms) and makes for code that is much easier to understand.

  2. Great work! As you find deficiencies in statsmodels – documentation, speed, etc. – it would be a great help to us if you’d send a note to our mailing list [1] or open a github ticket [2] so we can improve things.

    As for the speed of KDE. Only our Gaussian kernel uses the FFT, which is *much* faster than the Gaussian KDE code from scipy. I get a 1500x speed-up in this case.


    [18]: timeit density(heights.values)
    1 loops, best of 3: 5.09 s per loop


    [19]: kde = sm.nonparametric.KDE(heights.values)

    [20]: timeit kde.fit()
    100 loops, best of 3: 3.49 ms per loop

    [21]: plt.plot(kde.support, kde.density)
    [21]: []

    [1] https://groups.google.com/forum/?fromgroups#!forum/pystatsmodels
    [2] https://github.com/statsmodels/statsmodels/issues?utf8=%E2%9C%93

    • Carl says:

      Awesome — thanks for this. As I get my bearings a little more, I’ll definitely be more active in submitting issues (and hopefully eventually help fix some).

      So, wrt the KDE, it looks like I was just doing it wrong. I was using the kdensity function and running it straight on the data. Your procedure is (1) create a KDE object with the KDE function, (2) run the fit method of the KDE to get the estimated densities. I see KDE.fit() has a boolean fft argument, whereas kdensity doesn’t.

      And indeed, doing it that way, I get about a 1000x speedup over Scipy’s gaussian_kde.

      Am I right to think there’s no reason to be using kdensity?

      • Right. kdensity is just a helper function, but it’s exposed in case you don’t want the full KDE object (the fast one is kdensityfft, if you want it directly). KDE.fit just calls these under the hood. You’ll notice that the KDE object is a full-fledged distribution object like the scipy.stats densities. It has cdf, entropy, cumhazard, icdf, and sf methods. The only reason to use the non-fft version is if you prefer a kernel other than Gaussian. Though in practice the kernel choice doesn’t matter as much as bandwidth selection. We are currently working on data-driven bandwidth selection methods now. FFT for non-Gaussian kernels only depends on someone writing out the closed form solution for this though (there’s a note in the source).

        Keep up the good work, and ping us if you have any questions about scipy.stats and statsmodels.

  3. Nathaniel Smith says:

    I don’t think R’s ‘range’ function does what you think it does — range(a) just returns c(min(a), max(a)).

    OTOH, numpy.ptp does what your my_range function does.

  4. Sterling says:

    Unless I missed something, it looks like the notebook and script both need an additional import statement, i.e., import statsmodels.api as sm.

  5. Pingback: How do you speed up 40,000 weighted least squares calculations? Skip 36,000 of them. | Slender Means

  6. Pingback: Slender Means

  7. tebeka says:

    As usual, very educational.

    One comment, you can refer to the notebooks via nbviewer and then people can see them. Example for this chapter – http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/MLFH/CH2/ch2.ipynb

  8. AJ says:

    Great set of posts. Here is one way of dealing with the TypeException thrown by your my_quantiles function.

    #Create a quantiles function to mimic R’s
    def my_quantiles2(s,prob = (0.0,0.25,0.5,1.0)):
    Calculate quantiles of a series.

    s : a pandas Series
    prob : a tuple (or other iterable) of probabilities at
    which to compute the quantiles. Converts a single valued
    prob into a tuple

    A pandas series with the probabilities as an index.
    # convert single integer or float to tuple
    q = [s.quantile(p) for p in prob]
    except TypeError:
    prob = (prob,)
    q = [s.quantile(p) for p in prob]

    return Series(q,index=prob)

Comments are closed.