Introduction to Topic Modeling in Python

PyTexas 2015

by Christine Doig

Introduction

About me

Data Scientist at Continuum Analytics

Barcelona & Austin

http://chdoig.github.com

@ch_doig

About Continuum Analytics

Free Python distribution: Anaconda

Open source: conda, blaze, dask, bokeh, numba...

Proud sponsor of PyTexas, PyData, SciPy, PyCon, Europython...

We are hiring!

http://continuum.io

About this talk

  • Introduction
  • Topic Modeling
  • LDA Algorithm
  • Python libraries
  • Pipelines
  • Other algorithms
  • Additional resources
  • http://chdoig.github.com/pytexas2015-topic-modeling

    Topic Modeling

    Topic Modeling Applications

    Building the NYT Recommendation Engine: from keywords, through collaborative filtering, to Collaborative Topic Modeling

    Definitions

  • A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents [1]
  • Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of texts [2]
  • Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together [3]

  • [1] - http://en.wikipedia.org/wiki/Topic_model
  • [2] - http://www.cs.princeton.edu/~blei/topicmodeling.html
  • [3] - http://mallet.cs.umass.edu/topics.php

    Characteristics

    Diagram

    LDA

    LDA vs LDA (Latent Dirichlet Allocation vs Linear Discriminant Analysis)

    LDA Plate notation


  • http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
  • Parameters and variables

    Understanding LDA

    LDA algorithm

    Iterative algorithm

    1. Initialize parameters
    2. Initialize topic assignments randomly
    3. Iterate
      • For each word in each document:
      • Resample topic for word, given all other words and their current topic assignments
    4. Get results
    5. Evaluate model
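    The resampling loop above can be sketched as a toy collapsed Gibbs sampler. Everything here (the three-document corpus, K, alpha, beta, the iteration count) is a made-up minimal example, not the talk's code:

```python
import random
from collections import defaultdict

random.seed(1)

# Toy corpus and settings (illustrative values only)
docs = [["apple", "banana", "apple"],
        ["dog", "cat", "dog"],
        ["apple", "cat", "banana"]]
K = 2                     # number of topics      (1. initialize parameters)
alpha, beta = 0.1, 0.01   # Dirichlet hyperparameters
vocab = sorted({w for doc in docs for w in doc})
V = len(vocab)

# 2. Initialize topic assignments randomly, and count them
z = [[random.randrange(K) for _ in doc] for doc in docs]
ndk = defaultdict(int)    # document-topic counts
nkw = defaultdict(int)    # topic-word counts
nk = defaultdict(int)     # total words per topic
for di, doc in enumerate(docs):
    for wi, w in enumerate(doc):
        t = z[di][wi]
        ndk[di, t] += 1; nkw[t, w] += 1; nk[t] += 1

# 3. Iterate: resample each word's topic given all other assignments
for _ in range(50):
    for di, doc in enumerate(docs):
        for wi, w in enumerate(doc):
            t = z[di][wi]
            ndk[di, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
            # "which topics occur in this document?" x "which topics like word w?"
            weights = [(ndk[di, k] + alpha) * (nkw[k, w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            t = random.choices(range(K), weights)[0]
            z[di][wi] = t
            ndk[di, t] += 1; nkw[t, w] += 1; nk[t] += 1
```

    Steps 4-5 (get results, evaluate) would then read the topic-word distributions off `nkw` and `nk`.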

    Initialize parameters

    Initialize topic assignments randomly

    Iterate

    Resample topic for word, given all other words and their current topic assignments

  • Which topics occur in this document?
  • Which topics like the word X?

    Get results

    Evaluate model

    Hard: Unsupervised learning. No labels.

    Human-in-the-loop

  • Word intrusion [1]: for each trained topic, take its first ten words, substitute one of them with another, randomly chosen word (the intruder!), and see whether a human can reliably tell which one it was. If so, the trained topic is topically coherent (good); if not, the topic has no discernible theme (bad) [2]

  • Topic intrusion: Subjects are shown the title and a snippet from a document. Along with the document they are presented with four topics. Three of those topics are the highest probability topics assigned to that document. The remaining intruder topic is chosen randomly from the other low-probability topics in the model [1]

  • [1] - http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf
    [2] - http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html
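    Building one word-intrusion question can be mocked up in a few lines; the topic words and intruder pool below are invented for illustration:

```python
import random

random.seed(0)

def word_intrusion_question(topic_words, intruder_pool, n=5):
    """Build one word-intrusion question: the top n words of a trained
    topic plus one randomly chosen intruder, shuffled together."""
    top = topic_words[:n]
    intruder = random.choice([w for w in intruder_pool if w not in top])
    options = top + [intruder]
    random.shuffle(options)
    return options, intruder

# Invented topic and intruder pool for illustration
topic = ["game", "team", "season", "player", "coach", "score"]
pool = ["senate", "budget", "election"]
options, intruder = word_intrusion_question(topic, pool)
# A human judge is then asked to spot the intruder among `options`.
```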

    Evaluate model

    Metrics

  • Cosine similarity: split each document into two parts, and check that topics of the first half are similar to topics of the second half; halves of two different documents should be mostly dissimilar [1]

  • [1] - http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html
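    A minimal sketch of that check, using hand-made topic distributions in place of real model output:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hand-made topic distributions standing in for inferred ones:
first_half  = [0.7, 0.2, 0.1]   # topics of one half of a document
second_half = [0.6, 0.3, 0.1]   # topics of the other half (should match)
other_doc   = [0.1, 0.1, 0.8]   # half of an unrelated document

intra = cosine(first_half, second_half)  # high: same document
inter = cosine(first_half, other_doc)    # low: different documents
```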

    Evaluate model

    More Metrics [1]:

  • Size (# of tokens assigned)
  • Within-doc rank
  • Similarity to corpus-wide distribution
  • Locally-frequent words
  • Co-doc Coherence
  • [1] - http://mimno.infosci.cornell.edu/slides/details.pdf

    Python libraries

    Warning: Current LDA in scikit-learn refers to Linear Discriminant Analysis!


    [1] - https://de.dariah.eu/tatom/topic_model_python.html

    [2] - http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html

    Gensim

    
    import gensim
    # load id->word mapping (the dictionary)
    id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
    # load corpus iterator
    mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
    # extract 100 LDA topics, using 20 full passes (batch mode, no online updates)
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word,
                                          num_topics=100, update_every=0, passes=20)

  • http://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation

    Graphlab

    
    import graphlab as gl
    docs = gl.SArray('http://s3.amazonaws.com/dato-datasets/nytimes')
    m = gl.topic_model.create(docs,
                              num_topics=20,       # number of topics
                              num_iterations=10,   # algorithm parameters
                              alpha=.01, beta=.1)  # hyperparameters

  • https://dato.com/products/create/docs/generated/graphlab.topic_model.create.html

    lda

    
    import lda
    X = lda.datasets.load_reuters()  # example Reuters corpus bundled with the package
    model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
    model.fit(X)  # model.fit_transform(X) is also available

  • http://pythonhosted.org//lda/

    sklearn.decomposition.LatentDirichletAllocation

    
    from sklearn.decomposition import NMF, LatentDirichletAllocation
    X = ...
    lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5, random_state=0)
    lda.fit(X)

  • scikit-learn LDA example

    Pipeline

    Preprocessing

    Vector Space

    Model
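    A bare-bones sketch of the first two pipeline stages (preprocessing and vector space), in plain Python with a made-up two-document corpus; a real pipeline would use gensim's Dictionary or scikit-learn's CountVectorizer:

```python
import re
from collections import Counter

# Made-up two-document corpus (illustration only)
docs = ["The cat sat on the mat.", "Dogs and cats are pets."]
stopwords = {"the", "on", "and", "are"}

# Preprocessing: lowercase, tokenize, drop stopwords
tokenized = [[w for w in re.findall(r"[a-z]+", doc.lower()) if w not in stopwords]
             for doc in docs]

# Vector space: assign each word an id, represent docs as bag-of-words counts
vocab = {w: i for i, w in enumerate(sorted({w for doc in tokenized for w in doc}))}
bow = [Counter(vocab[w] for w in doc) for doc in tokenized]
```

    The model stage then consumes the bag-of-words vectors.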

    Gensim Models

    Scikit-learn example

    
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    
    # Initialize variables
    n_samples = 2000
    n_features = 1000
    n_topics = 10
    
    dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                                 remove=('headers', 'footers', 'quotes'))
    data_samples = dataset.data[:n_samples]
    
    # use tf (raw term count) features for the LDA model
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                                    stop_words='english')
    tf = tf_vectorizer.fit_transform(data_samples)
    
    lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                    learning_method='online', learning_offset=50.,
                                    random_state=0)
    lda.fit(tf)

  • http://scikit-learn.org/dev/auto_examples/applications/topics_extraction_with_nmf_lda.html
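    After fitting, scikit-learn's LDA exposes the topic-word weights as `lda.components_` (a topics x terms array). Reading off the top words per topic can be sketched like this; the tiny weight matrix and vocabulary below are invented stand-ins for the fitted model:

```python
def top_words(components, feature_names, n_top=3):
    """Return the n_top highest-weighted words for each topic row."""
    topics = []
    for row in components:
        best = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:n_top]
        topics.append([feature_names[i] for i in best])
    return topics

# Invented 2-topic x 5-term weight matrix standing in for lda.components_
components = [[5.0, 0.2, 3.1, 0.1, 0.3],
              [0.1, 4.2, 0.2, 3.9, 0.4]]
names = ["space", "game", "nasa", "team", "ball"]
topics = top_words(components, names)
```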
    Evaluation - Visualization

    LDAVis

    https://github.com/cpsievert/LDAvis, https://github.com/bmabey/pyLDAvis

    Resources

  • http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

  • http://miriamposner.com/blog/very-basic-strategies-for-interpreting-results-from-the-topic-modeling-tool/

  • http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/

  • https://beta.oreilly.com/ideas/topic-models-past-present-and-future

    Resources

    IPython notebooks explaining Dirichlet Processes, HDPs, and Latent Dirichlet Allocation, Timothy Hopper
  • https://github.com/tdhopper/notes-on-dirichlet-processes

  • Visualizing Topic Models, Data Science Summit & Dato Conference 2015 (video by Ben Mabey)

    Questions?

    Slides:
    http://chdoig.github.com/pytexas2015-topic-modeling

    Email: christine.doig@continuum.io

    Twitter: ch_doig