Will it Python? Machine Learning for Hackers, Chapter 4: Priority e-mail ranking

UPDATE 1/15/2014: This blog is no longer in service.

This post is now located at: http://slendermeans.org/ml4h-ch4.html


This entry was posted in Data Analysis, Will it Python.

3 Responses to Will it Python? Machine Learning for Hackers, Chapter 4: Priority e-mail ranking

  1. Miki Tebeka says:

    Once again, a great read. See a few comments below.

    One of the main advantages of using Python is its “batteries included” standard library. In this case, the email package will do the heavy lifting of parsing email messages, and also of parsing dates and addresses (see email.utils.parsedate_tz and email.utils.parseaddr).
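
A minimal sketch of the tip above, using only the standard library. The sample message and the helper name `parse_email_text` are made up for illustration and are not taken from the original post.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import mktime_tz, parseaddr, parsedate_tz

# Made-up message, for illustration only.
RAW = """\
From: Jane Doe <Jane@Example.com>
To: list@example.com
Subject: Re: Meeting notes
Date: Mon, 02 Sep 2002 10:15:30 -0400

Thanks, see you then.
"""


def parse_email_text(raw):
    """Return (date, sender, subject, body) for one plain-text message."""
    msg = message_from_string(raw)

    # parsedate_tz returns a 10-tuple, or None if the header cannot be parsed.
    date_tuple = parsedate_tz(msg["Date"] or "")
    date = None
    if date_tuple is not None:
        # mktime_tz applies the UTC offset and returns a POSIX timestamp.
        date = datetime.fromtimestamp(mktime_tz(date_tuple), tz=timezone.utc)

    # parseaddr splits "Name <addr>" into (name, address).
    _, sender = parseaddr(msg["From"] or "")

    # For a non-multipart message, get_payload() returns the body as a string.
    body = msg.get_payload()
    return date, sender.lower(), msg["Subject"], body


date, sender, subject, body = parse_email_text(RAW)
print(date.isoformat(), sender, subject)
```

Multipart messages would need extra handling (for example, walking the parts with `msg.walk()`), which this sketch leaves out.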

    In `make_email_df` you can use DataFrame.from_records instead of building the columns:
    email_df = DataFrame.from_records([parse_email(f) for f in file_list], columns=…)
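
A small, self-contained sketch of the `DataFrame.from_records` suggestion. The `parse_email` stand-in, the file names, and the column names are all hypothetical placeholders; the real function in the post may return different fields.

```python
from pandas import DataFrame


def parse_email(path):
    """Hypothetical stand-in: a real version would read and parse the file at `path`."""
    return ("2002-09-02", "jane@example.com", "Meeting notes", "body of " + path)


file_list = ["msg1.txt", "msg2.txt"]                 # hypothetical file names
columns = ["date", "sender", "subject", "message"]   # assumed column names

# Each tuple returned by parse_email becomes one row of the frame.
email_df = DataFrame.from_records(
    [parse_email(f) for f in file_list], columns=columns
)
print(email_df.shape)
```

The point of the design is that each parsed message stays together as one record, rather than being split into per-column lists that have to be kept in sync.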

    scikit-learn has a utility for splitting data into training and test sets; see cross_validation.train_test_split (model_selection.train_test_split in scikit-learn 0.18 and later).

    Email threading can be done with the ‘In-Reply-To’ message header.
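
One way the threading idea could look, as a rough sketch under simplifying assumptions: every message carries a `Message-ID`, and `In-Reply-To` holds at most one ID. The function name `group_threads` and the sample messages are invented for illustration.

```python
from collections import defaultdict
from email import message_from_string

# Made-up messages, for illustration only.
MESSAGES = [
    "Message-ID: <1@example.com>\nSubject: Lunch?\n\nAnyone free?\n",
    "Message-ID: <2@example.com>\nIn-Reply-To: <1@example.com>\n"
    "Subject: Re: Lunch?\n\nYes.\n",
    "Message-ID: <3@example.com>\nIn-Reply-To: <2@example.com>\n"
    "Subject: Re: Lunch?\n\nMe too.\n",
    "Message-ID: <4@example.com>\nSubject: Build broken\n\nSee log.\n",
]


def group_threads(raw_messages):
    """Group Message-IDs by the root message of their reply chain."""
    parent_of = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg["Message-ID"] or "").strip()
        if not msg_id:
            continue  # cannot place a message that has no ID
        parent_of[msg_id] = (msg["In-Reply-To"] or "").strip() or None

    def root(msg_id):
        # Follow In-Reply-To links upward; `seen` guards against cycles.
        seen = set()
        while parent_of.get(msg_id) and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parent_of[msg_id]
        return msg_id

    threads = defaultdict(list)
    for msg_id in parent_of:
        threads[root(msg_id)].append(msg_id)
    return dict(threads)


threads = group_threads(MESSAGES)
for root_id, members in threads.items():
    print(root_id, len(members))
```

If a reply points at a message that is not in the collection, the missing parent's ID is used as the thread key, so orphaned replies to the same message still land together.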

    • Carl says:

      Great tips, thanks! If you’re ever inclined to fork the GitHub repo, I’d love to see improved implementations, and I’ll link to them here.

  2. Josh Hemann says:

    A little late, but… This is a fantastic series of posts; I have lost count of how many friends I have passed them on to.

    One quick note about your TF-IDF comment, “…I lamented that I couldn’t find a decent term-document matrix function for Python”

    Have you seen gensim[1]? It has a lot of batteries included for topic modeling, including a TF-IDF class[2]. I have used this for a couple of projects and it definitely made life easier.

    [1] http://radimrehurek.com/gensim/
    [2] http://radimrehurek.com/gensim/models/tfidfmodel.html

Comments are closed.