I’ve seen the best minds of my generation destroyed by Matlab …

(Note: this is very quick and not well thought out. Mostly a conversation starter as opposed to any real thesis on the subject.)

This post is a continuation of a Twitter conversation here, started when John Myles White poked the hornets’ nest. (Python’s nest? Where do Pythons live?)

jmw_tweet

The gist with John’s code is here.

This isn’t a very thoughtful post. But the conversation was becoming sort of a shootout and my thoughts (half-formed as they are) were a bit longer than a tweet. Essentially, I think the Python performance shootouts–PyPy, Numba, Cython–are missing the point.

The point is, I think, that loops are a crutch. A 3-nested for loop in Julia that increments a counter takes 8 lines of code (1 initialize counter, 3 for statements, 1 increment statement, 3 end statements). Only one of those lines tells me what the code does.

But most scientific programmers learned to code in imperative languages and that style of thinking and coding has become natural. I’ve often seen comments like this:

forloop_tweet

Which I think simply equates readability with familiarity. That isn’t wrong, but it isn’t the whole story.

Anyway, a lot of the responses to John’s code were showing that, hey, you can get fast loops in Python, with either JITing (PyPy, Numba) or Cython. So here are my thoughts:

1. Cython is great. I’ve used it with great success myself. But Julia gives me fast loops while keeping the dynamic typing; i.e., I’m still writing in Julia. Cython is a manifestation of what the Julia developers call the “two-language problem.”  My programmer-productivity happens in the slow, dynamic language, and I swap to a more painful language for critical bottlenecks and glue the two together. Cython is a more pleasant manifestation of the problem, especially since it lets you evolve in an exploratory, piece-meal way from your first language to your second language. But you still end up with code that is nice dynamic-typing and abstractions on the outside; gross static typing and low-level imperative stuff on the inside. (And Cython examples are often clean and simple, but the code can get hairy very quickly.)

2. One of the nice things about the slow for loops in Python and R is that they force you to think about other ways to express your problem. R and Python programmers start thinking about how they can exploit arrays and other ADTs, and higher-order functions to express they’re problem. Avoiding the loop performance hit is the first reason, but then many of them start to realize they like their code better this way. The adjustment is hard at first, but once you get their, it’s hard to go back.

Forget about the Numpy, PyPy, Cython solutions to John’s problem. I think it’s safe to say his original pure Python code would be considered pretty un-Pythonic, to the extent that’s a thing. Python programmers are discouraged from that style of writing-C-in-Python, for both performance reasons, and conceptual reasons. Python programmers just think the alternatives (e.g. list comprehensions) are more expressive and maintainable. They’re not avoiding for loops because they’re slow: they don’t want to write for loops.

3. Maybe Julia is the answer to this problem. Since list comprehensions, higher-order-functions (applies, maps, etc.) wrap imperative loops, and Julia loops are fast, then these things can be written in Julia and be fast.

But that requires some thought about how Julia devs want Julia programmers to program. Julia is great and really promising, and it’s got an opportunity to let scientific programmers really raise their game. But I’d hate the big pitch for Julia to be: hey, you can write fast loops! And it would basically become a refuge for people who never learned to properly code R and are are fed up with slow loops, or for Matlab guys who’s licenses ran out.

Posted in Uncategorized | Tagged , , | 6 Comments

Will it Python? Machine Learning for Hackers, Chapter 7: Numerical optimization with deterministic and stochastic methods

Introduction

Chapter 7 of Machine Learning for Hackers is about numerical optimization. The authors organize the chapter around two examples of optimization. The first is a straightforward least-squares problem like that we’ve encountered already doing linear regressions, and is amenable to standard iterative algorithms (e.g. gradient descent). The second is a problem with a discrete search space, not clearly differentiable, and so lends itself to a stochastic/heuristic optimization technique (though we’ll see the optimization problem is basically artificial). The first problem gives us a chance to play around with Scipy’s optimization routines. The second problem has us hand-coding a Metropolis algorithm; this doesn’t show off much new Python, but it’s fun nonetheless.

The notebook for this chapter is at the github report here, or you can view it online via nbviewer here.

Ridge regression by least-squares

In chapter 6 we estimated LASSO regressions, which added an L1 penalty on the parameters to the OLS loss-function. The ridge regression works the same way, but applies an L2 penalty to the parameters. The ridge regression is a somewhat more straightforward optimization problem, since the L2 norm we use gives us a differentiable loss function.

In this example, we’ll regress weight on height, similar to chapter 5. We can specify the loss (sum of squared errors) function for the ridge regression with the following function in Python:

y = heights_weights['Weight'].values
Xmat = sm.add_constant(heights_weights['Height'], prepend = True)

def ridge_error(params, y, Xmat, lam):
    '''
    Compute SSE of the ridge regression.
    This is the normal regression SSE, plus the
    L2 cost of the parameters.
    '''
    predicted = np.dot(Xmat, params)
    sse = ((y - predicted) ** 2).sum()
    sse += lam * (params ** 2).sum()

    return sse

The authors use R’s optim function, which defaults to the Nelder-Mead simplex algorithm. This algorithm doesn’t use any gradient or Hessian information to optimize the function. We’ll want to try out some gradient methods, though. Even though the functions for these methods will compute numerical gradients and Hessians for us, for the ridge problem these are easy enough to specify explicitly.

def ridge_grad(params, y, Xmat, lam):
    '''
    The gradiant of the ridge regression SSE.
    '''
    grad = np.dot(np.dot(Xmat.T, Xmat), params) - np.dot(Xmat.T, y)
    grad += lam * params
    grad *= 2
    return grad

def ridge_hess(params, y, Xmat, lam):
    '''
    The hessian of the ridge regression SSE.
    '''
    return np.dot(Xmat.T, Xmat) + np.eye(2) * lam

Like the LASSO regressions we worked with in chapter 6, the ridge requires a penalty parameter to weight the L2 cost of the coefficient parameters (called lam in the functions above; lambda is a keyword in Python). The authors assume we’ve already found an appropriate value via cross-validation, and that value is 1.0.

We can now try to minimize the loss function with a couple of different algorithms. First the Nelder-Mead simplex, which should correspond to the authors’ use of optim in R.

# Starting values for the a, b (intercept, slope) parameters
params0 = np.array([0.0, 0.0])

# Nelder-Mead simplex
ridge_fit = opt.fmin(ridge_error, params0, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)

Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 117
         Function evaluations: 221
Solution: a = -340.565, b =    7.565

Now the Newton conjugate-gradient method. We need to give this function a gradient; the Hessian is optional. First without the Hessian:

ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)

Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 3
         Function evaluations: 4
         Gradient evaluations: 11
         Hessian evaluations: 0
Solution: a = -340.565, b =    7.565

Now supplying the Hessian:

ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad,
                         fhess = ridge_hess, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)

Optimization terminated successfully.
         Current function value: 1612442.197636
         Iterations: 3
         Function evaluations: 7
         Gradient evaluations: 3
         Hessian evaluations: 3
Solution: a = -340.565, b =    7.565

Fortunately, we get the same results for all three methods. Supplying the Hessian to the Newton method shaves some time off, but in this simple application, it’s not really worth coding up a Hessian function (except for fun).

Lastly, the BFGS method, supplied with the gradient:


ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad,
 fhess = ridge_hess, args = (y, Xmat, 1))
 print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)

Optimization terminated successfully.
 Current function value: 1612442.197636
 Iterations: 3
 Function evaluations: 7
 Gradient evaluations: 3
 Hessian evaluations: 3
 Solution: a = -340.565, b = 7.565

For this simple problem, all of these methods work well. For more complicated problems, there are considerations which would lead you to prefer one over another, or perhaps to use them in combination. There are also several more methods available, some which allow you to solve constrained optimization problems. Check out the very good documentation. Also note that if you’re not into hand-coding gradients, scipy has a function derivative in its misc module that will compute numerical derivatives. In many cases, the functions will do this automatically if you fail to provide a function to their gradient arguments.

Optimizing on sentences with the Metropolis algorithm

The second example in this chapter is a “code-breaking” exercise. We start with a message “here is some sample text”, which we encrypt using a Ceasar cipher that shifts each letter in the message to the next letter in the alphabet (with Z going to A). We can represent the cipher (or any cipher) in Python with a dict that maps each letter to its encrypted counterpart.

letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
           'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
           'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
           'y', 'z']

ceasar_cipher = {i: j for i, j in zip(letters, letters[1:] + letters[:1])}
inverse_ceasar_cipher = {ceasar_cipher[k]: k for k in ceasar_cipher}

The inverse_ceasar_cipher dict reverses the cipher, so we can get an original message back from one that’s been encrypted by the Ceasar cipher. Based on these structures, let’s make functions that will encrypt and decrypt text.

def cipher_text(text, cipher_dict = ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)

    ciphered = ''.join([cipher_dict.get(x) or x for x in strlist])
    return ciphered

def decipher_text(text, cipher_dict = ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)

    # Invert the cipher dictionary (k, v) -> (v, k)
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}

    deciphered = ''.join([decipher_dict.get(x) or x for x in strlist])
    return deciphered

To decrypt our message, we’ll design a Metropolis algorithm that randomly proposes ciphers, decrypts the message according to the proposed cipher, and see’s how probable that message is based on a lexical database of word frequency in Wikipedia.

The following functions are used to generate proposal ciphers for the Metropolis algorithm. The idea is to randomly generate ciphers and see what text they result in. If the text resulting from a proposed cipher is more likely (according to the lexical database) than the current cipher, we accept the proposal. If it’s not, we accept it wil a probability that is lower the less likely the resulting text is.

The method of generating new proposals is important. The authors use a method that chooses a key (letter) at random from the current cipher, and swaps its with some other letter. For example, if we start with the Ceasar Cipher, our proposal might randomly choose to re-map A to N (instead of B). The proposal would then be the same a the Ceasar Cipher, but with A → N and M → B (since A originally mapped to B and M originally mapped to N). This proposal-generating mechanism is encapsulated in propose_modified_cipher_from_cipher.

This is inefficient in a few ways. First, the letter chosen to modify in the cipher may not even appear in the text, so the proposed cipher won’t modify the text at all and you end up wasting cycles generating a lot of useless proposals. Second, we may end up picking a letter that occurs in a highly likely word, which will increase the probability of generating an inferior proposal.

We’ll suggest another mechanism that, instead of selecting a letter from the current cipher to re-map, will choose a letter amongst the non-words in the current deciphered text. For example, if our current deciphered text is “hello wqrld”, we will only select amongst {w, q, r, l, d} to modify at random. The minimizes the chances that a modified cipher will turn real words into gibberish and produce less likely text. The function propose_modified_cipher_from_text performs this proposal mechanism.

One way to think of this is that it’s analogous to tuning the variance of the proposal distribution in the typical Metropolis algorithm. If the variance is too low, our algorithm won’t efficiently explore the target distribution. If it’s too high, we’ll end up generating lots of lousy proposals. Our cipher proposal rules can suffer from similar problems.

def generate_random_cipher():
    '''
    Randomly generate a cipher dictionary (a one-to-one letter -> letter map).
    Used to generate the starting cipher of the algorithm.
    '''
    cipher = []

    input  = letters
    output = letters[:]
    random.shuffle(output)

    cipher_dict = {k: v for (k, v) in zip(input, output)}

    return cipher_dict

def modify_cipher(cipher_dict, input, new_output):
    '''
     Swap a single key in a cipher dictionary.

     Old: a -> b, ..., m -> n, ...
     New: a -> n, ..., m -> b, ...
     '''
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}
    old_output = cipher_dict[input]

    new_cipher = cipher_dict.copy()
    new_cipher[input] = new_output
    new_cipher[decipher_dict[new_output]] = old_output

    return new_cipher

def propose_modified_cipher_from_cipher(text, cipher_dict,
                                        lexical_db = lexical_database):
    '''
    Generates a new cipher by choosing and swapping a key in the
    current cipher.
    '''
    _          = text # Unused
    input      = random.sample(cipher_dict.keys(), 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)

def propose_modified_cipher_from_text(text, cipher_dict,
                                      lexical_db = lexical_database):

    '''
    Generates a new cipher by choosing a swapping a key in the current
    cipher, but only chooses keys that are letters that appear in the
    gibberish words in the current text.
    '''
    deciphered = decipher_text(text, cipher_dict).split()
    letters_to_sample = ''.join([t for t in deciphered
                                 if lexical_db.get(t) is None])
    letters_to_sample = letters_to_sample or ''.join(set(deciphered))

    input      = random.sample(letters_to_sample, 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)

Next, we need to be able to compute a message’s likelihood (from the lexical database). The log-likelihood of a message is just the sum of the log-likelihoods of each word (one-gram) in the message. If the word is gibberish (i.e., doesn’t occur in the database) it gets a tiny probability set to the smallest floating-point precision.

def one_gram_prob(one_gram, lexical_db = lexical_database):
    return lexical_db.get(one_gram) or np.finfo(float).eps

def text_logp(text, cipher_dict, lexical_db = lexical_database):
    deciphered = decipher_text(text, cipher_dict).split()
    logp = np.array([math.log(one_gram_prob(w)) for w in deciphered]).sum()
    return logp

We can now use these functions in our Metropolis algorithm. Each step in the metropolis algorithm proposes a cipher, deciphers the text according the proposal, and computes the log-likelihood of the deciphered message. If the likelihood of the deciphered message is better under the proposal cipher than the current cipher, we definitely accept that proposal for our next step. If not, we only accept the proposal with a probability based on the relative likelihood of the proposal to the current cipher.

I’ll define this function to take an arbitrary proposal function via the proposal_rule argument. So far, this can be one of the two propose_modified_cipher_from_* functions defined above.

def metropolis_step(text, cipher_dict, proposal_rule, lexical_db = lexical_database):
    proposed_cipher = proposal_rule(text, cipher_dict)
    lp1 = text_logp(text, cipher_dict)
    lp2 = text_logp(text, proposed_cipher)

    if lp2 > lp1:
        return proposed_cipher
    else:
        a = math.exp(lp2 - lp1)
        x = random.random()
        if x < a:
            return proposed_cipher
        else:
            return cipher_dict

To run the algorithm, just wrap the step function inside a loop. There’s no stopping rule for the algorithm, so we have to choose a number of iterations, and hope it’s enough to get us to the optimum. Let’s use 250,000.

message = 'here is some sample text'
ciphered_text = cipher_text(message, ceasar_cipher)
niter = 250000

def metropolis_decipher(ciphered_text, proposal_rule, niter, seed = 4):
    random.seed(seed)
    cipher = generate_random_cipher()

    deciphered_text_list = []
    logp_list = []

    for i in xrange(niter):
        logp = text_logp(ciphered_text, cipher)
        current_deciphered_text = decipher_text(ciphered_text, cipher)

        deciphered_text_list.append(current_deciphered_text)
        logp_list.append(logp)

        cipher = metropolis_step(ciphered_text, cipher, proposal_rule)

    results = DataFrame({'deciphered_text': deciphered_text_list, 'logp': logp_list})
    results.index = np.arange(1, niter + 1)
    return results

First let’s look at the authors’
proposal rule. While they managed to get a reasonable decrypted message in about 50,000 iterations, we’re still reading gibberish after 250,000. As they say in the book, their results are an artefact of a lucky seed value.

results0 = metropolis_decipher(ciphered_text,
                               propose_modified_cipher_from_cipher, niter)
print results0.ix[10000::10000]

	     deciphered_text	        logp
10000	 kudu of feru fyrvbu hush	-86.585205
20000	 wudu of feru fbrkxu hush	-87.124919
30000	 kudu of feru fnrbau hush	-86.585205
40000	 wudu of feru fmrjiu hush	-87.124919
50000	 kudu of feru fyrnbu hush	-86.585205
60000	 kudu of feru fxrnvu hush	-86.585205
70000	 pudu of feru fvrnlu hush	-87.561022
80000	 kudu of feru fvrxgu hush	-86.585205
90000	 kudu of feru fbrvtu hush	-86.585205
100000	 kudu of feru fjrnlu hush	-86.585205
110000	 kudu of feru fprbju hush	-86.585205
120000	 kudu of feru fnrjcu hush	-86.585205
130000	 kudu of feru flrvpu hush	-86.585205
140000	 puku of feru flrvxu hush	-88.028362
150000	 kudu of feru fxrviu hush	-86.585205
160000	 pulu of feru ftrdzu hush	-88.323162
170000	 wuzu of feru flrxdu hush	-89.575925
180000	 kudu of feru firamu hush	-86.585205
190000	 wudu of feru fyrzqu hush	-87.124919
200000	 wudu of feru fnraxu hush	-87.124919
210000	 puku of feru fjrnyu hush	-88.028362
220000	 puku of feru firyau hush	-88.028362
230000	 pudu of feru fkrcvu hush	-87.561022
240000	 kudu of feru ftrwzu hush	-86.585205
250000	 kudu of feru fprxzu hush	-86.585205

Now, let’s try the alternative proposal rule, which only chooses letters from gibberish words when it modifies the current cipher to propose a new one. The algorithm doesn’t find the actual message, but it actually finds a more likely message (according the the lexical database) within 20,000 iterations.

results1 = metropolis_decipher(ciphered_text,
                               propose_modified_cipher_from_text, niter)
print results1.ix[10000::10000]

         deciphered_text	        logp
10000	 were mi isle izlkde text	-68.946850
20000	 were as some simple text	-35.784429
30000	 were as some simple text	-35.784429
40000	 were as some simple text	-35.784429
50000	 were as some simple text	-35.784429
60000	 were as some simple text	-35.784429
70000	 were us some simple text	-38.176725
80000	 were as some simple text	-35.784429
90000	 were as some simple text	-35.784429
100000	 were as some simple text	-35.784429
110000	 were as some simple text	-35.784429
120000	 were as some simple text	-35.784429
130000	 were as some simple text	-35.784429
140000	 were as some simple text	-35.784429
150000	 were us some simple text	-38.176725
160000	 were as some simple text	-35.784429
170000	 were is some sample text	-37.012894
180000	 were as some simple text	-35.784429
190000	 were as some simple text	-35.784429
200000	 were as some simple text	-35.784429
210000	 were as some simple text	-35.784429
220000	 were as some simple text	-35.784429
230000	 were as some simple text	-35.784429
240000	 were as some simple text	-35.784429
250000	 were is some sample text	-37.012894

The graph below plots the likelihood paths of the algorithm for the two proposal rules. The blue line is the log-likelihood of the original message we’re trying to recover.

metropolis_likpaths

Direct calculation of the most likely message.

The Metropolis algorithm is kind of pointless for this application. It’s really just jumping around looking for the most likely phrase. But since the likelihood of a message is just the sum of the log probabilities of the log probabilities of its component words, we just need to look for the most likely words of the lengths of the words of the ciphered message.

If the message at some point is “fgk tp hpdt”, then, if run long enough, the algorithm should just find the most likely three-letter word, the most likely two-letter word, and the most likely four-letter word. But we can look these up directly.

For example, the message we encrypted is ‘here is some sample text’, which has word lengths 4, 2, 4, 6, 4. What’s the most likely message with these word lengths?

def maxprob_message(word_lens = (4, 2, 4, 6, 4), lexical_db = lexical_database):
    db_word_series = Series(lexical_db.index)
    db_word_len    = db_word_series.str.len()
    max_prob_wordlist = []
    logp = 0.0
    for i in word_lens:
        db_words_i = list(db_word_series[db_word_len == i])
        db_max_prob_word = lexical_db[db_words_i].idxmax()
        logp += math.log(lexical_db[db_words_i].max())
        max_prob_wordlist.append(db_max_prob_word)
    return max_prob_wordlist, logp

maxprob_message()
(['with', 'of', 'with', 'united', 'with'], -25.642396806584493)

So, technically, we should have decoded our message to be “with of united with” instead of “here is some sample text”. This is not a shining endorsement of this methodology for decrypting messages.

Conclusion

While it was a fun exercise to code up the Metropolis decrypter in this chapter, it didn’t show off any new Python functionality. The ridge problem, while less interesting, showed off some of the optimization algorithms in Scipy. There’s a lot of good stuff in Scipy’s optimize module, and its documentation is worth checking out.

Posted in Uncategorized | Leave a comment

Will it Python? Machine Learning for Hackers, Chapter 6: Regression models with regularization

Introduction

In my opinion, Chapter 6 is the most important chapter in Machine Learning for Hackers. It introduces the fundamental problem of machine learning: overfitting and the bias-variance tradeoff. And it demonstrates the two key tools for dealing with it: regularization and cross-validation.

It’s also a fun chapter to write in Python, because it lets me play with the fantastic scikit-learn library. scikit-learn is loaded with hi-tech machine learning models, along with convenient “pipeline”-type functions that facilitate the process of cross-validating and selecting hyperparameters for models. Best of all, it’s very well documented.

Fitting a sine wave with polynomial regression

The chapter starts out with a useful toy example–trying to fit a curve to data generated by a sine function over the interval [0, 1] with added Gaussian noise. The natural way to fit nonlinear data like this is using a polynomial function, so that the output, y is a function of powers of the input x. But there are two problems with this.

First, we can generate highly correlated regressors by taking powers of x, leading to noisy parameter estimates. The input x are evenly space numbers on the interval [0, 1]. So x and x2 are going to have a correlation over 95%. Similar with x2 and x3. The solution to this is to use orthogonalized polynomial functions: tranformations of x that, when summed, result in polynomial functions, but are orthogonal (therefore uncorrelated) with each other.

Luckily, we can easily calculate these transformations using patsy. The C(x, Poly) transform computes orthonormal polynomial functions of x, then we’ll extract out various orders of the polynomial. So Xpoly[:, :2] selects out the 0th and 1st order functions, then when summed will give us a first order polynomial (i.e. linear). Similarly Xpoly[: :4] gives us the 0th through 3rd order functions, which sum up to a cubic polynomial.

sin_data = DataFrame({'x' : np.linspace(0, 1, 101)})
sin_data['y'] = np.sin(2 * pi * sin_data['x']) + np.random.normal(0, 0.1, 101)

x = sin_data['x']
y = sin_data['y']
Xpoly = dmatrix('C(x, Poly)')
Xpoly1 = Xpoly[:, :2]
Xpoly3 = Xpoly[:, :4]
Xpoly5 = Xpoly[:, :6]
Xpoly25 = Xpoly[:, :26]

The problem we encounter now is how to choose what order polynomial to fit to the data. Any data can be fit well (i.e. have a high R2) if we use a high enough order polynomial. But we will start to over-fit our data; capturing noise specific to our sample, leading to poor predictions on new data. The graph below shows the fits to the data of a straight line, a 3rd-order polynomial, a 5th-order polynomial, and a 25th-order polynomial. Notice how the last fit gives us all kinds of degrees of freedom to capture specific datapoints, and the excessive “wiggles” look like we’re fitting to noise.

sine_wave_polyfits

In machine learning, this problem is solved with regularization–penalizing large parameter estimates in a way that, hopefully, shrinks down the coefficients on all but the most important inputs. Here’s where scikit-learn shines.

Preventing overfitting with regularization

The penalty parameter in a regularized regression is typically found via cross-validation; for each candidate penalty one repeatedly fits the model on subsets on the data, and the penalty value that gives the best fit across the cross-validation “folds” is chosen. In the book, the authors hand-code up a cross-validation scheme, looping over possible penalties and subsets of the data and recording the MSEs.

In scikit-learn you can usually automate the cross-validation procedure, by one of a couple of ways. Many models have a CVversion, or, if not, you can wrap your model in a function like GridSearchCV which is a convenience function around all the looping and fit-recording entailed in a cross-validation. Here I’ll use the LassoCV function, which performs cross-validation for a LASSO-penalized linear regression.

lasso_model = LassoCV(cv = 15, copy_X = True, normalize = True)
lasso_fit = lasso_model.fit(Xpoly[:, 1:11], y)
lasso_path = lasso_model.score(Xpoly[:, 1:11], y)

The first line sets up the model by specifying some options. The only interesting one here is cv, which specifies how many cross-validation folds to run on each penalty-parameter value. The second line fits the model: here’s I’m going to run a 10th-order polynomial regression, and let the LASSO penalty shrink away all but the most important orders. Finally, lasso_path provides the objective function that our penalty parameter is suppose to optimize in the cross-validations (typically RMSE).

After running the fit() method, LassoCV will provide useful output attributes, including the “optimal” penalty parameter, stored in .alpha_. Note that scikit-learn refers to the penalty parameter as alpha, while R’s glmnet, which the authors use to implement the LASSO model, calls it lambda. I’m more accustomed to the penalty parameter being denoted with lambda myself. Note also that glmnet uses alpha elsewhere.

# Plot the average MSE across folds
plt.plot(-np.log(lasso_fit.alphas_), np.sqrt(lasso_fit.mse_path_).mean(axis = 1))
plt.ylabel('RMSE (avg. across folds)')
plt.xlabel(r'$-\log(\lambda)$')
# Indicate the lasso parameter that minimizes the average MSE across folds.
plt.axvline(-np.log(lasso_fit.alpha_), color = 'red')

lasso_cv_fits_poly

The value of the penalty parameter itself isn’t all that meaningful. So let’s take a look at what the resulting coefficient estimates are when we apply the penalty.

print 'Deg.  Coefficient'
print Series(np.r_[lasso_fit.intercept_, lasso_fit.coef_])

Deg.  Coefficient
0    -0.003584
1    -5.359452
2     0.000000
3     4.689958
4    -0.000000
5    -0.547131
6    -0.047675
7     0.124998
8     0.133224
9    -0.171974
10    0.090685

So the LASSO, after selecting a penalty parameter via cross-validation, results in essentially a 3rd-order polynomial model: y = -5.4x + 4.7x3. This makes sense since, as we saw above, we’d captured the important features of the data by the time we’d fit a 3rd order polynomial.

Predicting O’Reilly book sales using back-cover descriptions

Next I’ll use the same model to tackle some real data. We have the sales ranks of the top-100 selling O’Reilly books. We’d like to see if we use the text on the back-cover description of the book to predict its rank. So the output variable is the rank of the book (reversed so that 100 is the top-selling book, and 1 is the 100th best-selling book), while the input variables are all the terms that appear in these 100 books’ back covers. For each book the value of an input variable is the number of times the term appears on its back cover. Many of the input values will be zero (for example, the term “javascript” will occur many times in a book about javascript, but zero times in every other book).

So the matrix of input variables is just our old friend, the term-document matrix. Creating this (using any of the methods described in the posts for chapter 3 or chapter 4), we can just apply LassoCV again.

lasso_model = LassoCV(cv = 10)
lasso_fit = lasso_model.fit(desc_tdm.values, ranks.values)

Because of the size and nature of the input data, this runs pretty slowly (about 3-5 minutes for me). And, because there seems to be no good prediction model to be had here, the model doesn’t alway converge. If we do get a convergent run, we find the CV procedure wants us to shrink all the coefficients to zero: no input is worth keeping per the LASSO. (Note that since the x-axis in the graph is -log(penalty), moving left on the axis, towards 0, means more regularization.) This is the same result the authors find.

lasso_cv_fits_text

Logistic regression with cross-validation

With the previous model a bust, the authors regroup and try to fit a more simple output variable: a binary indicator of whether the book is in the top-50 sellers or not. Since they’re modeling a 0/1 outcome, they use a logistic regression. Like the linear models we used above, we can also apply regularizers to logistic regression.

In the book, the authors again code up an explicit cross-validation procedure. The notebook for this chapter has some code that replicates their procedure, but here I’ll discuss a version that uses scikit-learn’s GridCV function, which automates the cross-validation procedure for us. (the term “grid” is a little confusing here, since we’re only optimizing over one variable, the penalty parameter; the term “grid” is a little more intuitive in a 2-or-more-dimension search).

clf = GridSearchCV(LogisticRegression(C = 1.0, penalty = 'l1'), c_grid,
                   score_func = metrics.zero_one_score, cv = n_cv_folds)
clf.fit(trainX, trainy)

We initialize the GridCV procedure by telling it:

  • What model we’re using: logistic, with a penalty parameter C initialized at 1.0, using the L1 (LASSO) penalty.
  • A grid/array of parameter value candidates to search over: here values of C.
  • A score function to optimize: before we were using the RMSE of the regression, here we’ll use a correct classification rate,
    given zero_one_score, in scikit-learn’s metrics module.
  • The number of cross-validation folds to performs; this defined elsewhere in the variable n_cv_folds

Then I fit the model on training data (a random subset of 80). After running this, We can check what value it chose for the penalty parameter, C, and what the in-sample error-rate for this value was.

clf.best_params_, 1.0 - clf.best_score_
({'C': 0.29377144516536563}, 0.375)

And again, let’s plot the error rates against values of C to vizualize how regularization affects the model accuracy.

rates = np.array([1.0 - x[1] for x in clf.grid_scores_])
stds   = [np.std(1.0 - x[2]) / math.sqrt(n_cv_folds) for x in clf.grid_scores_]

plt.fill_between(cs, rates - stds, rates + stds, color = 'steelblue', alpha = .4)
plt.plot(cs, rates, 'o-k', label = 'Avg. error rate across folds')
plt.xlabel('C (regularization parameter)')
plt.ylabel('Avg. error rate (and +/- 1 s.e.)')
plt.legend(loc = 'best')
plt.gca().grid()

logistic_cv_errors

After fitting to the training set, we can predict on the test set and and see how accurate the model is on new data using the classification_report function.

print metrics.classification_report(testy, clf.predict(testX))
             precision    recall  f1-score   support

          0       0.78      0.44      0.56        16
          1       0.18      0.50      0.27         4

avg / total       0.66      0.45      0.50        20

And the confusion matrix shows we got 9 instances classified correctly (the diagonal), and 11 incorrectly (the off-diagonal).

print '   Predicted'
print '   Class'
print DataFrame(metrics.confusion_matrix(testy, clf.predict(testX)))
   Predicted
   Class
   0  1
0  7  9
1  2  2

Conclusion

Cross-validation often requires a lot of bookkeeping code. Writing this over and over again for different applications is inefficient and error-prone. So it’s great that scikit-learn has functions that encapsulate the cross-validation process in convenient abstractions/interfaces that do the bookkeeping for you. It also has a wide array of useful, cutting-edge models, and the documentation is not just clear and organized, but also educational: there are lots of examples and exposition that explains how the underlying models work, not just what the API is.

So even though we didn’t build any kick-ass, high-accuracy predictive models here, we did get to explore some fundamental methods in building ML models, and get acquainted with the powerful tools in scikit-learn.

Posted in Uncategorized | 3 Comments

Will it Python? Machine Learning for Hackers, Chapter 5: Linear regression (with categorical regressors)

Introduction

Chapter 5 of Machine Learning for Hackers is a relatively simple exercise in running linear regressions. Therefore, this post will be short, and I’ll only discuss the more interesting regression example, which nicely shows how patsy formulas handle categorical variables.

Linear regression with categorical independent variables

In chapter 5, the authors construct several linear regressions, the last of which is a multi-variate regression descriping the number of page views of top-viewed web sites. The regression is pretty straightforward, but includes two categorical variables: HasAdvertising, which takes values True or False; and InEnglish, which takes values Yes, No and NA (missing).

If we include these variables in the formula, then patsy/statmodels will automatically generate the necessary dummy variables. For HasAdvertising, we get a dummy variable equal to one when the the value is True. For InEnglish, which takes three values, we get two separate dummy variables, one for Yes, one for No, with the missing value serving as the baseline.

model = 'np.log(PageViews) ~ np.log(UniqueVisitors) + HasAdvertising  + InEnglish'
pageview_fit_multi = ols(model, top_1k_sites).fit()
print pageview_fit_multi.summary()

 


                            OLS Regression Results
==============================================================================
Dep. Variable:      np.log(PageViews)   R-squared:                       0.480
Model:                            OLS   Adj. R-squared:                  0.478
Method:                 Least Squares   F-statistic:                     229.4
Date:                Sat, 24 Nov 2012   Prob (F-statistic):          1.52e-139
Time:                        09:50:25   Log-Likelihood:                -1481.1
No. Observations:                1000   AIC:                             2972.
Df Residuals:                     995   BIC:                             2997.
Df Model:                           4
==========================================================================================
                             coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
Intercept                 -1.9450      1.148     -1.695      0.090        -4.197     0.307
HasAdvertising[T.True]     0.3060      0.092      3.336      0.001         0.126     0.486
InEnglish[T.No]            0.8347      0.209      4.001      0.000         0.425     1.244
InEnglish[T.Yes]          -0.1691      0.204     -0.828      0.408        -0.570     0.232
np.log(UniqueVisitors)     1.2651      0.071     17.936      0.000         1.127     1.403
==============================================================================
Omnibus:                       73.424   Durbin-Watson:                   2.068
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               92.632
Skew:                           0.646   Prob(JB):                     7.68e-21
Kurtosis:                       3.744   Cond. No.                         570.
==============================================================================

If we were going to do this without the formula API, we’d have to explicity make these dummies. For comparison, here’s that.

top_1k_sites['LogUniqueVisitors'] = np.log(top_1k_sites['UniqueVisitors'])
top_1k_sites['HasAdvertisingYes'] = np.where(top_1k_sites['HasAdvertising'] == 'Yes', 1, 0)
top_1k_sites['InEnglishYes'] = np.where(top_1k_sites['InEnglish'] == 'Yes', 1, 0)
top_1k_sites['InEnglishNo'] = np.where(top_1k_sites['InEnglish'] == 'No', 1, 0)

linreg_fit = sm.OLS(np.log(top_1k_sites['PageViews']),
                    sm.add_constant(top_1k_sites[['HasAdvertisingYes', 'LogUniqueVisitors',
                                                  'InEnglishNo', 'InEnglishYes']],
                                    prepend = True)).fit()
linreg_fit.summary()
Posted in Data Analysis, Will it Python | Tagged , , | Leave a comment

Will it Python? Machine Learning for Hackers, Chapter 4: Priority e-mail ranking

Introduction

I’m not going to write much about this chapter. In my opinion the payoff-to-effort ratio for this project is pretty low. The algorithm for ranking e-mails is pretty straightforward, but in my opinion seriously flawed. Most of the code in the chapter (and there’s a lot of it) revolves around parsing the text in the files. It’s a good exercise in thinking through feature extraction, but it’s not got a lot of new ML concepts. And from my perspective, there’s not much opportunity to show off any Python goodness. But, I’ll hit a couple of points that are new and interesting.

The complete code is at the Github repo here. and you can read the notebook via nbviewer here.

1. Vectorized string methods in pandas. Back in chapter 1, I groused about lacking vectorized functions for operations on strings or dates in pandas. If it wasn’t a numpy ufunc, you had to use the pandas map() method. That’s changed a lot over the summer, and since pandas 0.9.0, we can call vectorized string methods.

For example, here’s the code in my chapter for program that identifies e-mails that are part of a thread, by looking for “re:”-like prefixes on the subjects.

reply_pattern   = '(re:|re\[\d\]:)'
fwd_pattern = '(fw:|fw[\d]:)'

def thread_flag(s):
    '''
    Returns True if string s matches the thread patterns.
    If s is a pandas Series, returns a Series of booleans.
    '''
    if isinstance(s, basestring):
        return re.search(reply_pattern, s) is not None
    else:
        return s.str.contains(reply_pattern, re.I)

def clean_subject(s):
    '''
    Removes all the reply and forward labeling from a
    string (an e-mail subject) s.
    If s is a pandas Series, returns a Series of cleaned
    strings.
    This will help find the initial message in the thread
    (which won't have any of the reply/forward labeling.
    '''
    if isinstance(s, basestring):
        s_clean = re.sub(reply_pattern, '', s, re.I)
        s_clean = re.sub(fwd_pattern, '', s_clean, re.I)
        s_clean = s_clean.strip()
    else:
        s_clean = s.str.replace(reply_pattern, '', re.I)
        s_clean = s_clean.str.replace(fwd_pattern, '', re.I)
        s_clean = s_clean.str.strip()

    return s_clean

In thread_flag, if the input is a pandas series of e-mail subject lines, then the function will use a vectorized string function, called with .str.contains() to see if a pattern matching a reply-type prefix is in the subject. The function will therefore return a pandas series of booleans, that are True for all the subjects that have a reply pattern, and False for all the subjects that don’t.

The function clean_subjects, if given a pandas Series input, will use the vectorized string methods .str.replace() and .str.strip() to clean the re- and fwd-like patterns out of the subjects.

Notice there are some differences between the naming of pandas string methods and the base string methods or re module functions that perform similar operations on single strings. For example, there’s no contains function in re; we use re.search(). Similarly .str.replace() does what we’d use re.sub() to do on a single string.

2. More term-document matrices In chapter 3 we built a term-document matrix to extract term-frequency features from a set of e-mails. This chapter has a similar exercise, applied to both e-mail messages and their subjects. In the code for that chapter, I built a TDM function that wrapped the term-document matrix function in the textmining package, adding some options that tried to mimic the tdm function in R’s tm package. I use that same function, tdm_df, here. In the post for that chapter, I lamented that I couldn’t find a decent term-document matrix function for Python. The one in textmining was too barebones and I was surprised there was nothing that fit the bill in NLTK.

In comments to that post, Vishal Goklani pointed me to the CountVectorizer function in scikits-learn (in the sklearn.feature_extraction.text module). Despite the rather generic name, this will give you a TDM from a set of documents, returned in the form of a sparse matrix. Here’s quick-and-dirty wrapper function that returns a TDM in the form of a pandas DataFrame.

def sklearn_tdm_df(docs, **kwargs):
    '''
    Create a term-document matrix (TDM) in the form of a pandas DataFrame
    Uses sklearn's CountVectorizer function.

    Parameters
    ----------
    docs: a sequence of documents (files, filenames, or the content) to be
        included in the TDM. See the `input` argument to CountVectorizer.
    **kwargs: keyword arguments for CountVectorizer options.

    Returns
    -------
    tdm_df: A pandas DataFrame with the term-document matrix. Columns are terms,
        rows are documents.
    '''
    # Initialize the vectorizer and get term counts in each document.
    vectorizer = CountVectorizer(**kwargs)
    word_counts = vectorizer.fit_transform(docs)

    # .vocabulary_ is a Dict whose keys are the terms in the documents,
    # and whose entries are the columns in the matrix returned by fit_transform()
    vocab = vectorizer.vocabulary_

    # Make a dictionary of Series for each term; convert to DataFrame
    count_dict = {w: Series(word_counts.getcol(vocab[w]).data) for w in vocab}
    tdm_df = DataFrame(count_dict).fillna(0)
    return tdm_df

# Call the function on e-mail messages. The token_pattern is set so that terms are only
# words with two or more letters (no numbers or punctuation)
message_tdm = sklearn_tdm_df(train_df['message'],
                             stop_words = 'english',
                             charset_error = 'ignore',
                             token_pattern = '[a-zA-Z]{2,}')

3. Timezone issues and rank instability. In the book, the authors compute stats measuring how active threads are. This depends on the time-stamps of the messages, which the authors parse out of the e-mail files. They ignore the time-zone information in the time-stamps, and this seems to create some bugs. For example, the following thread has two e-mails:

Name: [sadev] [bug 840] spam_level_char option change/removal
    734    2002-09-06 10:56:23-07:00
    763    2002-09-06 13:56:19-04:00

If you ignore the timezones, it looks like 763 comes three hours after 734. But looking at the timezones, you can see that 734 actually comes four seconds after 763. So this is a far more active thread than the code in the book calculates.

This sort of issue has a pretty big effect on the ranks of the messages. The rank is just the product of 5 feature weights (based on sender info., thread activity, and term features). Even though the authors scale the individual feature weights (typically with log-scales), by calculating the final rank as a product, you can get big rank difference based on what might seem to be practically similar features (even without any bugs)–for example, in some cases it doesn’t take a big difference to double a feature’s weight, which then doubles the e-mail’s rank.So it seems to me the ranking procedure in the book is not very stable. This is fine, since it’s just meant to be illustrative, but of course you want to be aware of this issue for a more serious exercise.

Conclusion

I didn’t go into much detail here. If you’re interested in seeing a lot of Python and pandas text parsing in action, definitely check out the code.

Posted in Data Analysis, Will it Python | Tagged , , | 3 Comments

Will it Python? ARM Chapter 5: Logistic models of well-switching in Bangladesh

Introduction

The logistic regression we ran for chapter 2 of Machine Learning for Hackers was pretty simple. So I wanted to find an example that would dig a little deeper into statsmodels’s capabilities and the power of the patsy formula language.

So, I’m taking an intermission from Machine Learning for Hackers and am going to show an example from Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models (“ARM”). The chapter has a great example of going through the process of building, interpreting, and diagnosing a logistic regression model. We’ll end up with a model with lots of interactions and variable transforms, which is a great showcase for patsy and the statmodels formula API.

Logistic model of well-switching in Bangladesh

Our data are information on about 3,000 respondent households in Bangladesh with wells having an unsafe amount of arsenic. The data record the amount of arsenic in the respondent’s well, the distance to the nearest safe well (in meters), whether that respondent “switched” wells by using a neighbor’s safe well instead of their own, as well as the respondent’s years of education and a dummy variable indicating whether they belong to a community association.

   switch  arsenic       dist  assoc  educ
1       1     2.36  16.826000      0     0
2       1     0.71  47.321999      0     0
3       0     2.07  20.966999      0    10
4       1     1.15  21.486000      0    12
5       1     1.10  40.874001      1    14
...

Our goal is to model well-switching decision. Since it’s a binary variable (1 = switch, 0 = no switch), we’ll use logistic regression.

The IPython notebook is at the Github repo here, and you can go here to view it on nbviewer. The analysis follows ARM chapter 5.4.

Model 1: Distance to a safe well

For our first pass, we’ll just use the distance to the nearest safe well. Since the distance is recorded in meters, and the effect of one meter is likely to be very small, we can get nicer model coefficients if we scale it. Instead of creating a new scaled variable, we’ll just do it in the formula description using the I() function.

model1 = logit('switch ~ I(dist/100.)', df = df).fit()
print model1.summary()

 


Optimization terminated successfully.
         Current function value: 2038.118913
         Iterations 4
                           Logit Regression Results
==============================================================================
Dep. Variable:                 switch   No. Observations:                 3020
Model:                          Logit   Df Residuals:                     3018
Method:                           MLE   Df Model:                            1
Date:                Sat, 22 Dec 2012   Pseudo R-squ.:                 0.01017
Time:                        13:05:25   Log-Likelihood:                -2038.1
converged:                       True   LL-Null:                       -2059.0
                                        LLR p-value:                 9.798e-11
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept          0.6060      0.060     10.047      0.000         0.488     0.724
I(dist / 100.)    -0.6219      0.097     -6.383      0.000        -0.813    -0.431

Let’s plot this model. We’ll want to jitter the switch data, since it’s all 0/1 and will over-plot.

switch_dist_jittter

Another way to look at this is to plot the densities of distance for switchers and non-switchers. We expect the distribution of switchers to have more mass over short distances and the distribution of non-switchers to have more mass over long distances.

switch_dist_kde

Model 2: Distance to a safe well and the arsenic level of own well

Next, let’s add the arsenic level as a regressor. We’d expect respondents with higher arsenic levels to be more motivated to switch.

model2 = logit('switch ~ I(dist / 100.) + arsenic', df = df).fit()
print model2.summary()

 


Optimization terminated successfully.
         Current function value: 1965.334134
         Iterations 5
                           Logit Regression Results
==============================================================================
Dep. Variable:                 switch   No. Observations:                 3020
Model:                          Logit   Df Residuals:                     3017
Method:                           MLE   Df Model:                            2
Date:                Sat, 22 Dec 2012   Pseudo R-squ.:                 0.04551
Time:                        13:05:29   Log-Likelihood:                -1965.3
converged:                       True   LL-Null:                       -2059.0
                                        LLR p-value:                 1.995e-41
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept          0.0027      0.079      0.035      0.972        -0.153     0.158
I(dist / 100.)    -0.8966      0.104     -8.593      0.000        -1.101    -0.692
arsenic            0.4608      0.041     11.134      0.000         0.380     0.542
==================================================================================

Which is what we see. The coefficients are what we’d expect: the farther to a safe well, the less likely a respondent is to switch, but the higher the arsenic level in their own well, the more likely.

Marginal Effects

To see the effect of these on the probability of switching, let’s calculate the marginal effects at the mean of the data.

model2.margeff(at = 'mean')
array([-0.21806505,  0.11206108])

So, for the mean respondent, an increase of 100 meters to the nearest safe well is associated with a 22% lower probability of switching. But an increase of 1 in the arsenic level is associated with an 11% higher probability of switching.

Class separability

To get a sense of how well this model might classify switchers and non-switchers, we can plot each class of respondent in (distance-arsenic)-space.
We don’t see very clean separation, so we’d expect the model to have a fairly high error rate. But we do notice that the short-distance/high-arsenic region of the graph is mostly comprised switchers, and the long-distance/low-arsenic region is mostly comprised of non-switchers.

dist_arsenic_sep

Model 3: Adding an interaction

It’s sensible that distance and arsenic would interact in the model. In other words, the effect of an 100 meters on your decision to switch would be affected by how much arsenic is in your well.

Again, we don’t have to pre-compute an explicit interaction variable. We can just specify an interaction in the formula description using the : operator.

model3 = logit('switch ~ I(dist / 100.) + arsenic + I(dist / 100.):arsenic',
                   df = df).fit()
print model3.summary()

 

Optimization terminated successfully.
         Current function value: 1963.814202
         Iterations 5
                           Logit Regression Results
==============================================================================
Dep. Variable:                 switch   No. Observations:                 3020
Model:                          Logit   Df Residuals:                     3016
Method:                           MLE   Df Model:                            3
Date:                Sat, 22 Dec 2012   Pseudo R-squ.:                 0.04625
Time:                        13:05:33   Log-Likelihood:                -1963.8
converged:                       True   LL-Null:                       -2059.0
                                        LLR p-value:                 4.830e-41
==========================================================================================
                             coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
Intercept                 -0.1479      0.118     -1.258      0.208        -0.378     0.083
I(dist / 100.)            -0.5772      0.209     -2.759      0.006        -0.987    -0.167
arsenic                    0.5560      0.069      8.021      0.000         0.420     0.692
I(dist / 100.):arsenic    -0.1789      0.102     -1.748      0.080        -0.379     0.022
==========================================================================================

The coefficient on the interaction is negative and significant. While we can’t directly intepret its quantitative effect on switching, the qualitative interpretation gels with our intuition. Distance has a negative effect on switching, but this negative effect is reduced when arsenic levels are high. Alternatively, the arsenic level have a positive effect on switching, but this positive effect is reduced as distance to the nearest safe well increases.

Model 4: Adding education, more interactions, and centering variables

Respondents with more eduction might have a better understanding of the harmful effects of arsenic and therefore may be more likely to switch. Education is in years, so we’ll scale it for more sensible coefficients. We’ll also include interactions amongst all the regressors.

We’re also going to center the variables, to help with interpretation of the coefficients. Once more, we can just do this in the formula, without pre-computing centered variables.

model_form = ('switch ~ center(I(dist / 100.)) + center(arsenic) + ' +
              'center(I(educ / 4.)) + ' +
              'center(I(dist / 100.)) : center(arsenic) + ' +
              'center(I(dist / 100.)) : center(I(educ / 4.)) + ' +
              'center(arsenic) : center(I(educ / 4.))'
             )
model4 = logit(model_form, df = df).fit()
print model4.summary()

 


Optimization terminated successfully.
         Current function value: 1945.871775
         Iterations 5
                           Logit Regression Results
==============================================================================
Dep. Variable:                 switch   No. Observations:                 3020
Model:                          Logit   Df Residuals:                     3013
Method:                           MLE   Df Model:                            6
Date:                Sat, 22 Dec 2012   Pseudo R-squ.:                 0.05497
Time:                        13:05:35   Log-Likelihood:                -1945.9
converged:                       True   LL-Null:                       -2059.0
                                        LLR p-value:                 4.588e-46
===============================================================================================================
                                                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------------------------
Intercept                                       0.3563      0.040      8.844      0.000         0.277     0.435
center(I(dist / 100.))                         -0.9029      0.107     -8.414      0.000        -1.113    -0.693
center(arsenic)                                 0.4950      0.043     11.497      0.000         0.411     0.579
center(I(educ / 4.))                            0.1850      0.039      4.720      0.000         0.108     0.262
center(I(dist / 100.)):center(arsenic)         -0.1177      0.104     -1.137      0.256        -0.321     0.085
center(I(dist / 100.)):center(I(educ / 4.))     0.3227      0.107      3.026      0.002         0.114     0.532
center(arsenic):center(I(educ / 4.))            0.0722      0.044      1.647      0.100        -0.014     0.158
===============================================================================================================

Model assessment: binned residual plots

Plotting residuals to regressors can alert us to issues like nonlinearity or heteroskedasticity. Plotting raw residuals in a binary model isn’t usually informative, so we do some smoothing. Here, we’ll averaging the residuals within bins of the regressor. (A lowess or moving average might also work.)

I’m going to write a function to provide the binned residual data dynamically (and another helper function to plot the data). To create the bins I’m going to use the handy qcut function in pandas, which bins a vector of data into quantiles. Then I’ll use groupby to calculate the bin means and confidence intervals.

def bin_residuals(resid, var, bins):
    '''
    Compute average residuals within bins of a variable.

    Returns a dataframe indexed by the bins, with the bin midpoint,
    the residual average within the bin, and the confidence interval
    bounds.
    '''
    resid_df = DataFrame({'var': var, 'resid': resid})
    resid_df['bins'] = qcut(var, bins)
    bin_group = resid_df.groupby('bins')
    bin_df = bin_group['var', 'resid'].mean()
    bin_df['count'] = bin_group['resid'].count()
    bin_df['lower_ci'] = -2 * (bin_group['resid'].std() /
                               np.sqrt(bin_group['resid'].count()))
    bin_df['upper_ci'] =  2 * (bin_group['resid'].std() /
                               np.sqrt(bin_df['count']))
    bin_df = bin_df.sort('var')
    return(bin_df)

def plot_binned_residuals(bin_df):
    '''
    Plotted binned residual averages and confidence intervals.
    '''
    plt.plot(bin_df['var'], bin_df['resid'], '.')
    plt.plot(bin_df['var'], bin_df['lower_ci'], '-r')
    plt.plot(bin_df['var'], bin_df['upper_ci'], '-r')
    plt.axhline(0, color = 'gray', lw = .5)

arsenic_resids = bin_residuals(model4.resid, df['arsenic'], 40)
dist_resids = bin_residuals(model4.resid, df['dist'], 40)
plt.figure(figsize = (12, 5))
plt.subplot(121)
plt.ylabel('Residual (bin avg.)')
plt.xlabel('Arsenic (bin avg.)')
plot_binned_residuals(arsenic_resids)
plt.subplot(122)
plot_binned_residuals(dist_resids)
plt.ylabel('Residual (bin avg.)')
plt.xlabel('Distance (bin avg.)')

arsenic_dist_bin_resid

Model 5: log-scaling arsenic

The binned residual plot indicates some nonlinearity in the arsenic variable. Note how the model over-estimated for low arsenic and underestimates for high arsenic. This suggests a log transformation or something similar.

We can again do this transformation right in the formula.

model_form = ('switch ~ center(I(dist / 100.)) + center(np.log(arsenic)) + ' +
              'center(I(educ / 4.)) + ' +
              'center(I(dist / 100.)) : center(np.log(arsenic)) + ' +
              'center(I(dist / 100.)) : center(I(educ / 4.)) + ' +
              'center(np.log(arsenic)) : center(I(educ / 4.))'
             )

model5 = logit(model_form, df = df).fit()
print model5.summary()

 


Optimization terminated successfully.
         Current function value: 1931.554102
         Iterations 5
                           Logit Regression Results
==============================================================================
Dep. Variable:                 switch   No. Observations:                 3020
Model:                          Logit   Df Residuals:                     3013
Method:                           MLE   Df Model:                            6
Date:                Sat, 22 Dec 2012   Pseudo R-squ.:                 0.06192
Time:                        13:05:57   Log-Likelihood:                -1931.6
converged:                       True   LL-Null:                       -2059.0
                                        LLR p-value:                 3.517e-52
==================================================================================================================
                                                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------------------------------
Intercept                                          0.3452      0.040      8.528      0.000         0.266     0.425
center(I(dist / 100.))                            -0.9796      0.111     -8.809      0.000        -1.197    -0.762
center(np.log(arsenic))                            0.9036      0.070     12.999      0.000         0.767     1.040
center(I(educ / 4.))                               0.1785      0.039      4.577      0.000         0.102     0.255
center(I(dist / 100.)):center(np.log(arsenic))    -0.1567      0.185     -0.846      0.397        -0.520     0.206
center(I(dist / 100.)):center(I(educ / 4.))        0.3384      0.108      3.141      0.002         0.127     0.550
center(np.log(arsenic)):center(I(educ / 4.))       0.0601      0.070      0.855      0.393        -0.078     0.198
==================================================================================================================

And the binned residual plot for arsenic now looks better.

logarsenic_dist_bin_resid

Model error rates

The pred_table() gives us a confusion matrix for the model. We can use this to compute the error rate of the model.

We should compare this to the null error rates, which comes from a model that just classifies everything as whatever the most prevalent response is. Here 58% of the respondents were switchers, so the null model just classifies everyone as a switcher, and therefore has an error rate of 42%.

print model5.pred_table()
print 'Model Error rate: {0: 3.0%}'.format(
    1 - np.diag(model5.pred_table()).sum() / model5.pred_table().sum())
print 'Null Error Rate: {0: 3.0%}'.format(
    1 - df['switch'].mean())

 

[[  568.   715.]
 [  387.  1350.]]
Model Error rate:  36%
Null Error Rate:  42%

Conclusion

So this was a more in-depth example of running a logistic regression with statsmodels and the formula API. Unlike last time, when we were just specifying the variables in the model, here we used the formula language to apply transforms and create interactions. I really love this: it drastically reduces the number of steps between thinking up a model and fitting it.

Posted in Uncategorized | 4 Comments

Will it Python? Machine Learning for Hackers, Chapter 2, Part 2: Logistic regression with statsmodels

Introduction

I last left chapter 2 of Maching Learning for Hackers (a long time ago), running some kernel density estimators on height and weight data (see here. The next part of the chapter plots a scatterplot of weight vs. height and runs a lowess smoother through it. I’m not going to write any more about the lowess function in statsmodels. I’ve discussed some issues with it (i.e. it’s slow) here. And it’s my sense that the lowess API, as it is now in statsmodels, is not long for this world. The code is all in the IPython notebooks in the Github repo and is pretty straightforward.

Patsy and statsmodels formulas

What I want to skip to here is the logistic regressions the authors run to close out the chapter. Back in the spring, I coded up the chapter in this notebook. At this point, there wasn’t really much cohesion between pandas and statsmodels. You’d end up doing data exploration and munging with pandas, then pulling what you needed out of dataframes into numpy arrays, and passing those arrays to statsmodels. (After writing seemingly needless boilerplate code like X = sm.add_constant(X, prepend = True). Who’s out there running all these regressions without constant terms, such that it makes sense to force the use to explicitly add a constant vector to the data matrix?)

Over the summer, though, something quite cool happened. patsy brought a formula interface to Python, and it got integrated into a number components of statsmodels. Skipper Seabold’s Pydata presentation is a good overview and demo. In a nutshell, statsmodels now talks to your pandas dataframes via an expressive “formula” description of your model.

For example, imagine we had a dataframe, df, with variables x1, x2, and y. If we wanted to regress y on x1 and x2 with the standard statmodels API, we’d code something like the following:

Xmat = sm.add_constant(df[['x1', 'x2']].values, prepend = True)
yvec = df['y'].values
ols_model = OLS(yvec, Xmat).fit()

Which is tolerable with short variable names. Once you start using longer names or need more RHS variables it becomes a mess. With patsy and the formula API, you just have:

ols_model = ols('y ~ x1 + x2', df = df).fit()

Which is just as simple as using lm in R. You can also specify variable transformations and interactions in the formula, without needing to pre-compute variable for them. It’s pretty slick.

All of this is still brand new, and largely undocumented, so proceed with caution. But I’ve gotten very excited incorporating it into my code. Stuff I wrote just 5 or 6 months ago looks clunky and outdated.

So I’ve updated the IPython notebook for chapter 2, here, to incorporate the formula API. That’s what I’ll discuss in the rest of the post.

Logistic regression with formulas in statmodels

The authors run a logistic regression to see if they can use a person’s height and weight to determine their gender. I’m not really sure why you’d run such a model (or how meaningful it is once you run it, given how co-linear height and weight are), but it’s easy enough for illustrating how to mechanically run a logistic regression and use it to linearly separate groups.

The dataset contains variables Height, Weight, and Gender. The latter is a string encoded either Male or Female. To run a logistic regression, we’ll want to transform this to a numerical 0/1 variable. We can do this a number of ways, but I’ll use the map method.

heights_weights['Male'] = heights_weights['Gender'].map({'Male': 1, 'Female': 0})

The statstmodels.formula.api module has a number of functions, including ols, logit, and glm. If we import logit from the module we can run a logistic regression easily.

male_logit = logit(formula = 'Male ~ Height + Weight', df = heights_weights).fit()
print male_logit.summary()

Optimization terminated successfully.
         Current function value: 2091.297971
         Iterations 8
                           Logit Regression Results
==============================================================================
Dep. Variable:                   Male   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9997
Method:                           MLE   Df Model:                            2
Date:                Thu, 20 Dec 2012   Pseudo R-squ.:                  0.6983
Time:                        14:41:33   Log-Likelihood:                -2091.3
converged:                       True   LL-Null:                       -6931.5
                                        LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.6925      1.328      0.521      0.602        -1.911     3.296
Height        -0.4926      0.029    -17.013      0.000        -0.549    -0.436
Weight         0.1983      0.005     38.663      0.000         0.188     0.208
==============================================================================

Just for fun, we can also run the logistic regression via a GLM with a binomial family and logit link. This is similar to how I’d run it in R.

male_glm_logit = glm('Male ~ Height + Weight', df = heights_weights,
                     family = sm.families.Binomial(sm.families.links.logit)).fit()
print male_glm_logit.summary()

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                   Male   No. Observations:                10000
Model:                            GLM   Df Residuals:                     9997
Model Family:                Binomial   Df Model:                            2
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -2091.3
Date:                Thu, 20 Dec 2012   Deviance:                       4182.6
Time:                        14:41:37   Pearson chi2:                 9.72e+03
No. Iterations:                     8
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.6925      1.328      0.521      0.602        -1.911     3.296
Height        -0.4926      0.029    -17.013      0.000        -0.549    -0.436
Weight         0.1983      0.005     38.663      0.000         0.188     0.208
==============================================================================

And we get the same results, which is reassuring.

Now we can use the coefficients to plot a separating line in height-weight space.

logit_pars = male_logit.params
intercept = -logit_pars['Intercept'] / logit_pars['Weight']
slope =  -logit_pars['Height'] / logit_pars['Weight']

Let’s plot the data, color-coded by sex, and the separating line.

fig = plt.figure(figsize = (10, 8))
# Women points (coral)
plt.plot(heights_f, weights_f, '.', label = 'Female',
         mfc = 'None', mec='coral', alpha = .4)
# Men points (blue)
plt.plot(heights_m, weights_m, '.', label = 'Male',
         mfc = 'None', mec='steelblue', alpha = .4)
# The separating line
plt.plot(array([50, 80]), intercept + slope * array([50, 80]),
         '-', color = '#461B7E')
plt.xlabel('Height (in.)')
plt.ylabel('Weight (lbs.)')
plt.legend(loc='upper left')

logit_hw_sex_separate

Conclusion

There are several more examples using Patsy formulas with statsmodels functions in later chapters. If you’re accustomed to R’s formula notation, the transition from running models in R to running models in statsmodels is easy. One of the annoying things in Python versus R is the need to pull arrays out of pandas dataframes, because the functions you want to apply to the data (say estimating models, or plotting) don’t interface with the dataframe, but instead numpy arrays. It’s not terrible, but it adds a layer of friction in the analysis. So it’s great that statsmodels is starting to integrate well with pandas.

Posted in Data Analysis, Will it Python | Tagged , , | 4 Comments

Will it Python? Machine Learning for Hackers, Chapter 3: Naive Bayes Text Classification

Introduction

I realize I haven’t blogged about the rest of chapter 2 yet. I’ll get back to that, but chapter 3 is on my mind today. If you haven’t seen them yet, IPython notebooks up to chapter 9 are all up in the Github repo. To view them online, you can check the links on this page.

Chapter 3 is about text classification. The authors build a classifier that will identify whether an e-mail is spam or not (“ham”) based on the content of the e-mail’s message. I won’t go into much detail on how the Naive Bayes classifier they use works (beyond what’s evident in the code). The theory is described well in the book and many other places. I’m just going to discuss implementation, assuming you know how the classifier works in theory. The Python code for this project relies heavily on the NLTK (Natural Language Toolkit) package, which is a comprehensive library that includes functions for doing NLP and text analysis, as well as an array of benchmark text corpora to use them on. If you want to go deep into this stuff, two good resources are:

Two versions of the program

I’ve coded up two different versions of this chapter. The first, here, tries to follow the book relatively closely. The general procedure they use is:

  1. Parse and tokenize the e-mails
  2. Create a term-document matrix of the e-mails
  3. Calculate features of the training e-mails using the term-document matrix
  4. Train the classifier on these features
  5. Test the classifier on other sets of spam and ham e-mails

I’m not going to discuss this version in much detail, but you should take a look at the notebook if you’re interested. Two big takeaways from this are:

Python lacks a good term-document matrix tool. I was surprised to find that NLTK, which has so much functionality including helper functions like FreqDist, doesn’t have a function for making term-document matrices similar to the tdm function in R’s tm package. There is a Python module called textmining (which you can install with pip) that does have a term-document matrix function, but it’s pretty rudimentary. What you’ll see in this chapter is that I’ve coded up a term-document matrix function that uses the one in textmining but adds some bells and whistles, and returns the TDM as a (typically sparse) pandas dataframe.

The authors’ classifier suffers from numerical errors. The Naive Bayes classifier calcalates the probability that a message is spam by calculating the probability that the message’s terms occur in a spam message. So if the message is just “buy viagra”, and “buy” occurs in 75% of the training spam, and “viagra” occurs in 50% of the training spam, then the classifier assigns this a ‘spam’ probability of .75 * .50 = 37.5%. The problem with this calculation is that there are typically many terms, and the probabilities are often small, so their product can end up smaller than machine precision and underflow to zero. The way around this is to take the sum of the log probabilities (so log(.75) + log(.25)). The authors don’t do this, though, and it’s apparent that they end up with underflow errors. See, for example, the code output on page 89. This is also what leads to them having essentially the same error rates for “hard” ham as they do for “easy” ham in the tables on pages 89 and 92. Once you fix this problem, it turns out the classifier is actually much better for spam and easy ham than it appears in the book, but it’s way worse for hard ham.

I’m going to focus on the second version of the program, though, in the notebook called ch3_nltk.ipynb. You can view it online here.In this version, I use NLTK’s built-in NaiveBayesClassifier function, and avoid creating the TDM (which isn’t really used for much in the original code anyway).

Building a Naive Bayes spam classifier with NLTK

I’ll follow the same logic as the program from chapter 3, but I’ll do so with a workflow more suited to NLTK’s functions. So instead of creating a term-document matrix, and building my own Naive Bayes classifier, Ill build a features → label association for each training e-mail, and feed a list of these to NLTK’s NaiveBayesClassifier function.

Extracting word features from the e-mail messages

The program begins with some simple code that loads the e-mail files from the directories, extracts the “message” or body of the e-mail, and loads all those messages into a list. This follows the book’s code pretty closely, and we end up with training and testing lists of spam, easy ham, and hard ham. The training data will be the e-mails in the training directories for spam and easy ham. (So, like in the book, we’re not training on any hard ham.)

Each e-mail in our classifier’s training data will have a label (“spam” or “ham”) and a feature set. For this application, we’re just going to use a feature set that is just a set of the unique words in the e-mail. Below, I’ll turn this into a dictionary to feed into the NaiveBayesClassifier, but first, let’s get the set.

Note: This is a similar to a “bag-of-words” model, in that it doesn’t care about word order or other semantic information. But a “bag-of-words” usually considers the frequency of the word within the document (like a histogram of the words), whereas we’re only concerned with whether it’s in an e-mail, not how often it occurs.

Parsing and tokenizing the e-mails

I’m going to use NLTK’s wordpunct_tokenize function to break the message into tokens. This splits tokens at white space and (most) punctuation marks, and returns the punctuation along with the tokens on each side. So "I don't know. Do you?" becomes ["I", "don","'", "t", "know", ".", "Do", "you", "?"].

If you look through some of the training e-mails in train_spam_messages and train_ham_messages, you’ll notice a few features that make extracting words tricky.

First, there are a couple of odd text artefacts. The string ’3D’ shows up in strange places in HTML attributes and other places, and we’ll remove these. Furthermore there seem to be some mid-word line wraps flagged with an ‘=’ where the word is broken across lines. For example, the word ‘apple’ might be split across lines like ‘app=\nle’. We want to strip these out so we can recover ‘apple’. We’ll want to deal with all these first, before we apply the tokenizer.

Second, there’s a lot of HTML in the messages. We’ll have to decide first whether we want to keep HTML info in our set of words. If we do, and we apply wordpunct_tokenize to some HTML, for example:


"<HEAD></HEAD><BODY><!-- Comment -->"

would tokenize to:

["<", "HEAD", "></", "HEAD", "><", "BODY", "><!--", "Comment", "-->"]

So if we drop the punctuation tokens, and get the unique set of what remains, we’d have {"HEAD", "BODY", "Comment"}, which seems like what we’d want. For example, it’s nice that this method doesn’t make, <HEAD> and </HEAD> separate words in our set, but just captures the existence of this tag with the term "HEAD". It might be a problem that we won’t distinguish between the HTML tag <HEAD> and “head” used as an English word in the message. But for the moment I’m willing to bet that sort of conflation won’t have a big effect on the classifier.

If we don’t want to count HTML information in our set of words, we can set strip_html to True, and we’ll take all the HTML tags out before tokenizing.

Lastly we’ll strip out any “stopwords” from the set. Stopwords are highly common, therefore low information words, like “a”, “the”, “he”, etc. Below I’ll use stopwords, downloaded from NLTK’s corpus library, with a minor modifications to deal with this. (In other programs I’ve used the stopwords exported from R’s tm package.)

Note that because our tokenizer splits contractions (“she’ll” → “she”, “ll”), we’d like to drop the ends (“ll”). Some of these may be picked up in NLTK’s stopwords list, others we’ll manually add. It’s an imperfect, but easy solution. There are more sophisticated ways of dealing with this which are overkill for our purposes.

Tokenizing, as perhaps you can tell, is a non-trivial operation. NLTK has a host of other tokenizing functions of varying sophistication, and even lets you define your own tokenizing rule using regex.

def get_msg_words(msg, stopwords = [], strip_html = False):
    '''
    Returns the set of unique words contained in an e-mail message. Excludes
    any that are in an optionally-provided list.

    NLTK's 'wordpunct' tokenizer is used, and this will break contractions.
    For example, don't -&gt; (don, ', t). Therefore, it's advisable to supply
    a stopwords list that includes contraction parts, like 'don' and 't'.
    '''

    # Strip out weird '3D' artefacts.
    msg = re.sub('3D', '', msg)

    # Strip out html tags and attributes and html character codes,
    # like '&amp;nbsp;'  and '&amp;lt;'.
    if strip_html:
        msg = re.sub('&lt;(.|\n)*?&gt;', ' ', msg)
        msg = re.sub('&amp;\w+;', ' ', msg)

    # wordpunct_tokenize doesn't split on underscores. We don't
    # want to strip them, since the token first_name may be informative
    # moreso than 'first' and 'name' apart. But there are tokens with long
    # underscore strings (e.g. 'name_________'). We'll just replace the
    # multiple underscores with a single one, since 'name_____' is probably
    # not distinct from 'name___' or 'name_' in identifying spam.
    msg = re.sub('_+', '_', msg)

    # Note, remove '=' symbols before tokenizing, since these
    # sometimes occur within words to indicate, e.g., line-wrapping.
    msg_words = set(wordpunct_tokenize(msg.replace('=\n', '').lower()))

    # Get rid of stopwords
    msg_words = msg_words.difference(stopwords)

    # Get rid of punctuation tokens, numbers, and single letters.
    msg_words = [w for w in msg_words if re.search('[a-zA-Z]', w) and len(w) &gt; 1]

    return msg_words

Making a (features, label) list

The NaiveBayesClassifier function trains on data that’s of the form [(features1, label1), features2, label2), ..., (featuresN, labelN)] where featuresi is a dictionary of features for e-mail i and labeli is the label for e-mail i (spam or ham).

The function features_from_messages iterates through the messages creating this list, but calls an outside function to create the features for each e-mail. This makes the function modular in case we decide to try out some other method of extracting features from the e-mails besides the set of word. It then combines the features to the e-mail’s label in a tuple and adds the tuple to the list.

The word_indicator function calls get_msg_words() to get an e-mail’s words as a set, then creates a dictionary with entries {word: True} for each word in the set. This is a little counter-intuitive (since we don’t have {word: False} entries for words not in the set) but NaiveBayesClassifier knows how to handle it.

def features_from_messages(messages, label, feature_extractor, **kwargs):
    '''
    Make a (features, label) tuple for each message in a list of a certain,
    label of e-mails ('spam', 'ham') and return a list of these tuples.

    Note every e-mail in 'messages' should have the same label.
    '''
    features_labels = []
    for msg in messages:
        features = feature_extractor(msg, **kwargs)
        features_labels.append((features, label))
    return features_labels

def word_indicator(msg, **kwargs):
    '''
    Create a dictionary of entries {word: True} for every unique
    word in a message.

    Note **kwargs are options to the word-set creator,
    get_msg_words().
    '''
    features = defaultdict(list)
    msg_words = get_msg_words(msg, **kwargs)
    for  w in msg_words:
            features[w] = True
    return features

Training and evaluating the classifier

With those functions defined, we can apply them to the training and testing spam and ham messages.

def make_train_test_sets(feature_extractor, **kwargs):
    '''
    Make (feature, label) lists for each of the training
    and testing lists.
    '''
    train_spam = features_from_messages(train_spam_messages, 'spam',
                                        feature_extractor, **kwargs)
    train_ham = features_from_messages(train_easyham_messages, 'ham',
                                       feature_extractor, **kwargs)
    train_set = train_spam + train_ham

    test_spam = features_from_messages(test_spam_messages, 'spam',
                                       feature_extractor, **kwargs)

    test_ham = features_from_messages(test_easyham_messages, 'ham',
                                      feature_extractor, **kwargs)

    test_hardham = features_from_messages(test_hardham_messages, 'ham',
                                          feature_extractor, **kwargs)

    return train_set, test_spam, test_ham, test_hardham

Notice that the training set we’ll use to train the classifier combines both the spam and easy ham training sets (since we need both types of e-mail to train it).

Finally, let’s write a function to train the classifier and check how accurate it is on the test data.

def check_classifier(feature_extractor, **kwargs):
    '''
    Train the classifier on the training spam and ham, then check its accuracy
    on the test data, and show the classifier's most informative features.
    '''

    # Make training and testing sets of (features, label) data
    train_set, test_spam, test_ham, test_hardham = \
        make_train_test_sets(feature_extractor, **kwargs)

    # Train the classifier on the training set
    classifier = NaiveBayesClassifier.train(train_set)

    # How accurate is the classifier on the test sets?
    print ('Test Spam accuracy: {0:.2f}%'
       .format(100 * nltk.classify.accuracy(classifier, test_spam)))
    print ('Test Ham accuracy: {0:.2f}%'
       .format(100 * nltk.classify.accuracy(classifier, test_ham)))
    print ('Test Hard Ham accuracy: {0:.2f}%'
       .format(100 * nltk.classify.accuracy(classifier, test_hardham)))

    # Show the top 20 informative features
    print classifier.show_most_informative_features(20)

The function also prints out the results of NaiveBayesClassifiers‘s handy show_most_informative_features method. This shows which features are most unique to one label or another. For example, if “viagra” shows up in 500 of the spam e-mails, but only 2 of the “ham” e-mails in the training set, then the method will show that “viagra” is one of the most informative features with a spam:ham ratio of 250:1.

So how do we do? I’ll check two versions. The first uses the HTML info in the e-mails in the classifier:

check_classifier(word_indicator, stopwords = sw)

Test Spam accuracy: 98.71%
Test Ham accuracy: 97.07%
Test Hard Ham accuracy: 13.71%
Most Informative Features
                   align = True             spam : ham    =    119.7 : 1.0
                      tr = True             spam : ham    =    115.7 : 1.0
                      td = True             spam : ham    =    111.7 : 1.0
                   arial = True             spam : ham    =    107.7 : 1.0
             cellpadding = True             spam : ham    =     97.0 : 1.0
             cellspacing = True             spam : ham    =     94.3 : 1.0
                     img = True             spam : ham    =     80.3 : 1.0
                 bgcolor = True             spam : ham    =     67.4 : 1.0
                    href = True             spam : ham    =     67.0 : 1.0
                    sans = True             spam : ham    =     62.3 : 1.0
                 colspan = True             spam : ham    =     61.0 : 1.0
                    font = True             spam : ham    =     61.0 : 1.0
                  valign = True             spam : ham    =     60.3 : 1.0
                      br = True             spam : ham    =     59.6 : 1.0
                 verdana = True             spam : ham    =     57.7 : 1.0
                    nbsp = True             spam : ham    =     57.4 : 1.0
                   color = True             spam : ham    =     54.4 : 1.0
                  ff0000 = True             spam : ham    =     53.0 : 1.0
                  ffffff = True             spam : ham    =     50.6 : 1.0
                  border = True             spam : ham    =     49.6 : 1.0

The classifier does a really good job for spam and easy ham, but it’s pretty miserable for hard ham. This may be because hard ham messages tend to be HTML-formatted while easy ham messages aren’t. Note how much the classifier relies on HTML information–nearly all the most informative features are HTML-related.

If we try just using the text of the messages, without the HTML information, we lose a tiny bit of accuracy in identifying spam but do much better with the hard ham.

check_classifier(word_indicator, stopwords = sw, strip_html = True)

Test Spam accuracy: 96.64%
Test Ham accuracy: 98.64%
Test Hard Ham accuracy: 56.05%
Most Informative Features
                    dear = True             spam : ham    =     41.7 : 1.0
                     aug = True              ham : spam   =     38.3 : 1.0
              guaranteed = True             spam : ham    =     35.0 : 1.0
              assistance = True             spam : ham    =     29.7 : 1.0
                  groups = True              ham : spam   =     27.9 : 1.0
                mailings = True             spam : ham    =     25.0 : 1.0
               sincerely = True             spam : ham    =     23.0 : 1.0
                    fill = True             spam : ham    =     23.0 : 1.0
                mortgage = True             spam : ham    =     21.7 : 1.0
                     sir = True             spam : ham    =     21.0 : 1.0
                 sponsor = True              ham : spam   =     20.3 : 1.0
                 article = True              ham : spam   =     20.3 : 1.0
                  assist = True             spam : ham    =     19.0 : 1.0
                  income = True             spam : ham    =     18.6 : 1.0
                     tue = True              ham : spam   =     18.3 : 1.0
                   mails = True             spam : ham    =     18.3 : 1.0
                     iso = True             spam : ham    =     17.7 : 1.0
                   admin = True              ham : spam   =     17.7 : 1.0
                  monday = True              ham : spam   =     17.7 : 1.0
                    earn = True             spam : ham    =     17.0 : 1.0

Check out the most informative features; they make a lot of sense. Note mostly spammers address you with “Dear” and “Sir” and sign off with “Sincerely,”. (Probably those Nigerian princes; they tend to be polite.) Other spam flags that gel with our intuition are “guaranteed”, “mortgage”, “assist”, “assistance”, and “income.”

Conclusion

So we’ve built a simple but decent spam classifier with just a tiny amount of code. NLTK provides a wealth of tools for doing this sort of thing more seriously including ways to extract more sophisticated features and more complex classifiers.

Posted in Data Analysis, Will it Python | Tagged , , | 6 Comments

Better typography for IPython notebooks

(Warning: ignorant rant coming up)

Like everyone else who’s ever used it, I love the IPython notebook. It’s not only an awesomely productive environment to work in, it’s also the most powerful weapon in the Python evangelist’s arsenal (suck it, Matlab).

I also think it’s not hard to imagine a world where scientific papers are all just literate programs. And the notebook is probably one of the best tools for literate programming around in any language. The intregration of markdown and LaTeX/MathJax into the notebook is just fantastic.

But it does have one weakness as a literate programming tool. The default typography is ugly as sin.

There are several issues, but two major ones are easily fixable.

Long lines

By far the biggest issue is that the text and input cells extend to 100% of the window width. Most people keep their browser windows open wider than is comfortable reading width, so you end up with long hard-to-read lines of text in the markdown cells.

And for the code, it would be nice to have the code cell discourage you from long lines. The variable width cells don’t. I’m an 80-character anal retentive, and even I have trouble in the notebook getting a sense of when a line is too long.

When you write a script in a text editor, there’s lots of previous code in the viewable window, so your eye gets a sense of the ‘right-margin’ of the code. (Not to mention many editors will indicate the 80- or whatever-character column, so you know exactly when to break). But in the notebook, your code is typically broken up into smaller blocks, and those blocks are interspersed with output and other cells. It’s hard to get a visual sense of the right margin.

Ugly fonts

Text and markdown cells are typically rendered in Helvetica or Arial. Helvetica is a fine font, obviously, but it’s not really suitable for paragraphs of text (how many books, magazines, newspapers, or academic papers do you see with body text typeset in Helvetica?). And combined with the small size and long lines makes it hard to read and just plain ugly. I don’t think I have to say anything about Arial.

The way I use the notebook–with markdown cells used for long stretches of explanatory text and result interpretation–it’s better to have the text cells render in a serif font. This way it stands out from the code and output cells more. Serif fonts also have more distinctive italics, and integrate better with LaTeX/MathJax math.

Code cells and interpreter output cells render in whatever your default monospace font is. That’s typically Courier or Courier New. This is fine, but really, this is the 21st century–we can do a lot better.

Update: one more thing

I realize I’ve made one other change that I think is important. The default ordered list in the notebook uses roman numerals (I, II, III, …). I almost always want arabic numerals (1, 2, 3, …) instead. We can change this in the file renderedhtml.css with


.rendered_html ol {list-style:decimal; margin: 1em 2em;}

(Also check the comments for other, and typically better ways to make changes.) You can also modify sub-levels ol ol, ol ol ol, etc. Ideally I’d like to have nested numbers 1.1, 1.1.1, but this isn’t straightforward so I haven’t implemented it. If anyone has tips, I’d be thrilled to hear them.

Fixing it (locally, at least)

(Warning: I don’t know what I’m doing. Don’t make any of these changes, or any others, without backing up the files first.)

(Update: Matthias Bussonnier has an informative post showing the right way to make these changes. If you make the CSS changes I describe below, do it the way he advises, not through the files I describe here.)

The notebook is served through the browser, so its frontend is basically just HTML, Javascript, and CSS. The typography and appearance of the notebook is nearly all driven by CSS files located where IPython is stored on your system. This will differ based on your OS and your Python distribution. On my mac, with the AnacondaCE distribution, the stylesheets are located in /Users/cvogel/anaconda/lib/python2.7/site-packages/IPython/frontend/html/notebook/static. There are several subfolders there, including one called /css and /codemirror. You can also take a look at the stylesheet files by firing up a notebook, and using your browser’s inspector. If your browser (e.g. Chrome) lets you edit stylesheets on the fly in the inspector, you can try out changes relatively safely.

Here are the edits I’ve made on my system to address the issues above. First, in the /css folder, in the file called notebook.css

1. Set code input cells to be narrower (code that runs past the width will be invisible). I try to set this for about 80 characters plus some buffer. There’s not way to set width as number of characters in CSS, so you may have to experiment to see what ex-widths works with your font.

div.input {
width: 105ex; /* about 80 chars + buffer */
...
}

2. Fixing markdown/text cells. I make changes to the font, width, and linespacing. I’m using Charis SIL, a font based on the classic Bitstream Charter, and freely available here. Shortening the lines and adding some line space (120% to 150% of point size is usually a good range) for legibility.

div.text_cell {
width: 105ex /* instead of 100%, */
...
}
div.text_cell_render {
/*font-family: "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif;*/
font-family: "Charis SIL", serif; /* Make non-code text serif. */
line-height: 145%; /* added for some line spacing of text. */
width: 105ex; /* instead of 'inherit' for shorter lines */
...
}

3. Add styles to specify sizes for headers.

/* Set the size of the headers */
div.text_cell_render h1 {
font-size: 18pt;
}
div.text_cell_render h2 {
font-size: 14pt;
}

Then, in the /codemirror/lib subfolder, there’s a file called codemirror.css. In here we can change the font used for code, both input and interpreter output. I’m using Consolas.


.CodeMirror {
 font-family: Consolas, monospace;
 }

Obviously these changes only affect notebooks you view on your local machine, and whoever views your notebooks on their own machine, or on nbviewer will see the default style.

Here are before and after shots of these changes:

ipynb_unstyled

ipynb_styled

Fixing it (globally?)

So this is all cute right? And it’s nice that we can do some customizations to the notebook, but, you know, big deal.

I’d argue this is actually more important than just aesthetic tinkering. The IPython notebook is becoming a one-stop-shop for exploration, collaboration, publication, distribution, and replication in data analysis. Like I said above, I think it’s not unreasonable that notebooks could replace a large class of scientific papers. But to do that, it has to perform as well as all the fragmented tools that researchers are currently using. Otherwise, people are going to keep pasting their code and results into Word and Latex documents. In other words, the notebook has to work not just as an interactive environment, but also as a static document. The IPython team realizes this, which is why tools like nbconvert exist.

People are doing amazing things in the notebook. The typography should encourage people to read them, and not just serve as suped-up comments.

Tools are often strongly associated with aesthetic characteristics that are only peripheral to the tool itself. ggplot can make charts that look however you want, but when people think of ggplot, they think of the gray background and the Color Brewer palette. And while main selling point of ggplot is its abstraction of the graph-making process, I think it was the distinctive and attractive style of its graphs that made it catch on so successfully. On the opposite end of the spectrum, when people think of Stata graphics, they think of this, and wince. And Latex will typeset documents with whatever crazy font you want, but in everyone’s mind, Latex <-> Computer Modern (for better or worse). Design defaults are important: they’re marketing and they encourage good habits by your users. It’d be a shame to have it be that people think of the IPython notebook and picture long lines of small, single-spaced Helvetica Neue.

It’s an insanely powerful tool. It’d be awesome if it were beautiful too, and that goal seems eminently do-able.

Posted in Uncategorized | Tagged , | 14 Comments

How do you speed up 40,000 weighted least squares calculations? Skip 36,000 of them.

Despite having finished all the programming for Chapter 2 of MLFH a while ago, there’s been a long hiatus since the first post on that chapter.

(S)lowess

Why the delay? The second part of the code focuses on two procedures: lowess scatterplot smoothing, and logistic regression. When implementing the former in statsmodels, I found that it was running dog slow on the data–in this case a scatterplot of 10,000 height-vs.-weight points. Indeed, for these 10,000 points, lowess, run with the default parameters, required about 23 seconds. After importing modules and defining variables according to my IPython notebook, we can run timeit on the function:


%timeit -n 3 lowess.lowess(heights, weights)

This results in

 3 loops, best of 3: 42.6 s per loop 

on the machine I’m writing this on  (a Windows laptop with a 2.67 GHz i5 processor; timings are faster, but still in the 30 sec. range on my 2.5 GHz i7 Macbook).

An R user–or really a user of any other statistical package–is going to be confused here. We’re all used to lowess being a relatively instantaneous procedure. It’s an oft-used option for graphics packages like Lattice and ggplot2 — and it doesn’t take 20-30 seconds to generate a plot with a lowess curve superimposed. So what’s the deal? Is something wrong with the statsmodels implementation?

The naive lowess algorithm

Short answer: no. Long answer: yeah, kinda. Let’s start by looking at the lowess algorithm in general, sticking to the 2-D y-vs.-x scatterplot case. (I don’t really find multi-dimensional lowess useful anyway; maybe others put it to frequent use. If so, I’d like to hear about it).

Let’s say we have data {x1, … xnand {y1, …, yn}. The idea is to fit a set of values {y*1, …, y*n} where each is the prediction at xfrom a weighted regression using a fixed neighborhood of points around xi. The weighting scheme puts less weight on points that are far from xi. The regression can be linear, or polynomial, but linear is typical, and lowess procedures that use polynomials with more than 2 degrees are rare.

After we get this first set of fits, we usually run the regressions a few more times, each time modifying the weights to take into account residuals from the previous fit. These “robustifying” iterations apply successively less weight to outlying points in the data, reducing their influence on the final curve.

Here’s the recipe:

  1. Select the number of neighbors, k, to use in each local regression, and the number of robustifying iterations.
  2. Sort the data, both and y, by the order of the x-values.
  3. For each xi in {x1, … xn}:
    1. Find the k points nearest to xi (the neighborhood).
    2. Calculate the weights for each xin the neighborhood. This requires:
      1. Calculating the distance between each xj and xi and applying a weighting function to these distances.
      2. Take the weights calculated from the previous fit’s residuals (if this is not the first fit) and multiply them by the distance weights.
    3. Run a regression of the yjs on the xjs in the neighborhood, using the weights calculated in part B above. Predict y*i.
  4. Calculate the residuals from this fitted series of {y*1, …, y*n}, and compute a weight from each of them.
  5. Repeat 3 and 4 for the specified number of robustifying iterations.

Clearly, this is an expensive procedure. For 10,000 points and 3 robustifying iterations (which is the default in R and statsmodels), you’re calculating weights and running regressions 40,000 times (1 initial fit + 3 robustifying iterations).  Running R’s lm.fit (which is the lean, fast engine under lm) 40,000 times costs about 11 seconds. Add on all the costs from weight calculations–which will happen 40,000 * k times, since a weight needs to be calculated for each point’s neightbor–and it’s not surprising that the statsmodels version is as slow as it is. It is an inherently expensive algorithm.

Cheating our way to a faster lowess

The question is, why is R’s lowess so fast? The answer is that R–and most other implementations, going back to Clevelands lowess.f Fortan program–don’t perform lowess calculations on all that data.

If you look at the R help file for lowess, you’ll see that in addition to the parameters we’d expect–the data x and y; a parameter to determine the size of the neighborhood; and the number of robustifying iterations–there’s an argument called delta.

The idea behind delta is the following: xi that are close together aren’t very interesting. If we’ve already calculated y*i from the neighborhood of data around xi, and |xi+1xi| < delta, then we don’t really need to calculate y*i+1. It’s bound to be near y*i.

Instead let’s go out to an xj that’s farther away from x– say the farthest one still within delta distance. Let’s fit another weighted regression here. All those points in between–within that delta distance–can be approximated by a line going between the two regression fits we made.   Then, just keep skipping along in these delta-sized steps–back-filling the predictions by linear interpolation as we go–until the end of the data.

How much work have we saved ourselves? Assume as above 10,000 points and 4 iterations. If the x‘s are uniformly distributed along the axis, and we take delta to be 0.01 * (max(x) - min(x)) (which is the default value in R), then we’re only running 100 regressions per iteration, or 400 overall. Compared to the 40,000 that statsmodels is running, we can see why R is much faster. It’s cheating!

This kind of approximating is fine, really. It’s just assuming that, if our model is y = f(x) + e and f(x) is what we’re trying to estimate with lowess, we can take the linear approximation of it in small neighborhoods.

Implementing a faster lowess in Python

Algorithms for lowess written in low level languages aren’t hard to find. In addition to Cleveland’s Fortran implementation, there’s also a C version used by R (which is basically a direct translation of Cleveland’s, but without all the pesky commenting to let you know what it’s doing).

The statsmodel version though, is nicely organized–broken into sub-functions with  clear names, and exploiting vectorized operations. But it’s slowness is not because it doesn’t exploit the delta trick. It also runs some expensive operations, like a call to SciPy’s lstsq function in each tight loop.

So, in addition to adding the delta trick, we’d like to speed up those calculations in the tight loop (part 3 in the list above) as much as possible. Luckily, Cython lets us split the difference.

My Cython version of lowess is in my github repo, here, in the file cylowess.py. There’s also an IPython notebook demonstrating it in action, and files comprising a testing suite, comparing its output to R’s.

Let’s take a look at some real squiggly data to see how it works. The Silverman motorcycle collision data, which is available as mcycle in R’s MASS package, is great test data for non-parametric curve fitting procedures. In addition to not having any simple parametric shape, it’s got some edge case issues that can cause problems, like repeated x-values.

This plot compares my lowess implementation with statsmodels’ and R’s:

The aggregate difference between R’s lowess and mine?

print 'R and New Lowess MAD: %5.2e' % np.mean(np.abs(r_lowess['y'] - new_lowess[:, 1]))

 

R and New Lowess MAD: 1.62e-13

So it looks like it works.

Now let’s look at some timings. I’ll create some test data: 10,000 points, where x is uniformly distributed on [0, 20], and y = sin(x) + N(0, 0.5).

Statsmodel’s lowess:

%timeit -n 3 smlw.lowess(y, x)
3 loops, best of 3: 22.8 s per loop

The new Cythonized lowess:

%timeit -n 3 cyl.lowess(y, x)
3 loops, best of 3: 10.8 s per loop

This is without the delta trick. Skimming the fat off of those tight-looped operations and Cythonizing them cut the run time in half. 11 seconds still sucks, though, so let’s see what delta gets us.

delta = (x.max() - x.min()) * 0.01
%timeit -n 3 cyl.lowess(y, x, delta = delta)
3 loops, best of 3: 125 ms per loop

Much better. That’s the kind of time skipping 36,000 weighted least-squares calculations will save you. Given that this is some curvy data, is all this linear interpolation acceptable? I’ll re-run both with a better level of the frac parameter; the default is 2/3, but I’ll reduce it to 1/10 to use smaller neighborhoods in the regression and allow for more curvature. Here’s the plot:

sm_lowess = smlw.lowess(y, x, frac = 0.1)
new_lowess = cyl.lowess(y, x, frac = 0.1, delta = delta)

Which looks just as good as the non-interpolated version, but doesn’t leave you twiddling your thumbs.

Conclusion

After all this, we have a version of lowess that’s competitive with R’s lowess function. R also has a much richer loess function, for which there’s no real statmodels equivalent. loess is a full-blown class from which one can make predictions and compute confidence intervals, among other things. It also allows for fitting a higher-dimensional surface, not just a curve. But I have a day job, so that’s all for some other time. This kind of simple lowess is typically enough for most needs.

With this obsessive compulsive diversion into the guts of lowess out of the way, I’ll wrap up Chapter 2 of MLFH in my next post.

Posted in Data Analysis | Tagged , | 1 Comment