Machine Learning with Python:
Topic Modeling

Austin ACM SIGKDD

by Christine Doig

This talk

  • Introduction
  • Topic Modeling
  • LDA
  • Python libraries
  • Pipelines
  • Other algorithms
  • Additional resources
  • http://chdoig.github.com/acm-sigkdd-topic-modeling

    Introduction

    Machine Learning with Python course

  • April 22: GETTING STARTED WITH PYTHON ML, by Chance Coble
  • April 29: HOW TO CLASSIFY WITH REAL-WORLD EXAMPLES, by Roger Huang
  • May 06: CLUSTERING – FINDING RELATED POSTS, by Misty Nodine
  • May 13: TOPIC MODELING, by Christine Doig
  • May 20: CLASSIFICATION, by Mark Landry
  • May 27: CLASSIFICATION II – SENTIMENT ANALYSIS, by Jennifer Davis
  • June 03: REGRESSION – RECOMMENDATIONS, by Jessica Williamson
  • June 10: REGRESSION – RECOMMENDATIONS IMPROVED, by Omar Odibat
  • June 17: COMPUTER VISION – PATTERN RECOGNITION, by Ed Solis
  • June 24: DIMENSIONALITY REDUCTION, by Jessica Williamson
  • July 01: DIMENSIONALITY REDUCTION II, by Robert Chong
  • July 08: BIG(GER) DATA, by Chance Coble
    Topic Modeling

    Definitions

  • A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents [1]
  • Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of texts [2]
  • Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together [3]

  • [1] - http://en.wikipedia.org/wiki/Topic_model
    [2] - http://www.cs.princeton.edu/~blei/topicmodeling.html
    [3] - http://mallet.cs.umass.edu/topics.php

    Characteristics

    Diagram

    LDA

    LDA vs LDA (Latent Dirichlet Allocation vs Linear Discriminant Analysis)

    LDA Plate notation


  • http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
    Parameters and variables

    Understanding LDA

    LDA algorithm

    Iterative algorithm

    1. Initialize parameters
    2. Initialize topic assignments randomly
    3. Iterate
      • For each word in each document:
      • Resample topic for word, given all other words and their current topic assignments
    4. Get results
    5. Evaluate model
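The steps above can be sketched as a toy collapsed Gibbs sampler in plain Python. This is an illustration only: the function name and the tiny corpus below are invented, and real implementations (gensim, MALLET) are far more efficient and careful.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=1):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})                     # vocabulary size
    doc_topic = [[0] * n_topics for _ in docs]                # per-doc topic counts
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # per-topic word counts
    topic_total = [0] * n_topics
    # 2. Initialize topic assignments randomly
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            k = z[di][wi]
            doc_topic[di][k] += 1; topic_word[k][w] += 1; topic_total[k] += 1
    # 3. Iterate: resample each word's topic given all other assignments
    for _ in range(n_iter):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                k = z[di][wi]
                doc_topic[di][k] -= 1; topic_word[k][w] -= 1; topic_total[k] -= 1
                # "which topics occur in this document?" times
                # "which topics like the word w?"
                weights = [(doc_topic[di][t] + alpha)
                           * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                           for t in range(n_topics)]
                r = rng.uniform(0, sum(weights))
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                else:                      # float round-off fallback
                    k = n_topics - 1
                z[di][wi] = k
                doc_topic[di][k] += 1; topic_word[k][w] += 1; topic_total[k] += 1
    return z, doc_topic, topic_word

# Invented toy corpus:
docs = [["apple", "banana", "apple", "fruit"],
        ["dog", "cat", "dog", "pet"],
        ["banana", "fruit", "apple"]]
z, doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
```

The per-topic weight is the product of the two questions on the later slide: how much the document likes each topic, and how much each topic likes the word.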

    Initialize parameters

    Initialize topic assignments randomly

    Iterate

    Resample topic for word, given all other words and their current topic assignments


  • Which topics occur in this document?
  • Which topics like the word X?
  • When to stop?

    Get results

    Evaluate model

    Hard: Unsupervised learning. No labels.

    Human-in-the-loop

  • Word intrusion [1]: for each trained topic, take its first ten words, substitute one of them with another, randomly chosen word (the intruder!), and see whether a human can reliably tell which one it was. If so, the trained topic is topically coherent (good); if not, the topic has no discernible theme (bad) [2]

  • Topic intrusion: Subjects are shown the title and a snippet from a document. Along with the document they are presented with four topics. Three of those topics are the highest probability topics assigned to that document. The remaining intruder topic is chosen randomly from the other low-probability topics in the model [1]

  • [1] - http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf
    [2] - http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html
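As a rough illustration, assembling one word-intrusion question takes only a few lines of Python. The topic word list and vocabulary below are invented for the example:

```python
import random

def word_intrusion_question(topic_top_words, vocab, rng):
    """Mix one randomly chosen out-of-topic word (the intruder)
    into a topic's top words and shuffle the result."""
    outside = [w for w in vocab if w not in topic_top_words]
    intruder = rng.choice(outside)
    shown = list(topic_top_words) + [intruder]
    rng.shuffle(shown)
    return shown, intruder

# Invented topic and vocabulary:
topic = ["game", "team", "season", "player", "coach"]
vocab = topic + ["senate", "tax", "protein", "galaxy"]
shown, intruder = word_intrusion_question(topic, vocab, random.Random(42))
```

A human judge is then shown `shown` and asked to pick the odd one out; if judges find the intruder reliably, the topic is coherent.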


    Evaluate model

    Metrics

  • Cosine similarity: split each document into two parts, and check that the topics of the first half are similar to the topics of the second half; halves of two different documents should be mostly dissimilar [1]

  • [1] - http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html
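A minimal sketch of the check, assuming you already have per-half topic distributions from a trained model (the vectors below are made-up stand-ins for real model output):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two topic-probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Made-up topic distributions standing in for real model output:
first_half_a  = [0.7, 0.2, 0.1]   # first half of document A
second_half_a = [0.6, 0.3, 0.1]   # second half of document A
half_b        = [0.1, 0.1, 0.8]   # half of an unrelated document B

intra = cosine(first_half_a, second_half_a)  # same doc: should be high
inter = cosine(first_half_a, half_b)         # different docs: should be low
```

A model that passes this check assigns consistent topics within a document without collapsing all documents onto the same topics.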


    Evaluate model

    More Metrics [1]:

  • Size (# of tokens assigned)
  • Within-doc rank
  • Similarity to corpus-wide distribution
  • Locally-frequent words
  • Co-doc Coherence
  • [1] - http://mimno.infosci.cornell.edu/slides/details.pdf
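The co-doc coherence idea can be sketched directly from Mimno et al.'s formula: top words of a good topic should co-occur in documents. The mini-corpus below is invented, and the sketch assumes every top word occurs in at least one document:

```python
from math import log

def codoc_coherence(top_words, docs):
    """Co-document coherence for one topic's ordered top words.
    Assumes each top word occurs in at least one document."""
    def df(*words):  # document frequency: docs containing all given words
        return sum(1 for d in docs if all(w in d for w in words))
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += log((df(top_words[i], top_words[j]) + 1) / df(top_words[j]))
    return score

# Invented mini-corpus (each document as a set of words):
docs = [{"apple", "banana", "fruit"}, {"apple", "banana"},
        {"dog", "cat"}, {"dog", "cat", "pet"}]
coherent = codoc_coherence(["apple", "banana"], docs)
mixed = codoc_coherence(["apple", "dog"], docs)
```

Words that appear together in documents score near zero or above; words that never co-occur drag the score down.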

    Python libraries


    Warning: LDA in scikit-learn refers to Linear Discriminant Analysis!

    scikit-learn implements alternative algorithms, e.g. NMF (Non-negative Matrix Factorization) [1][2]


    [1] - https://de.dariah.eu/tatom/topic_model_python.html

    [2] - http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html
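To show the idea behind NMF, here is a tiny pure-Python sketch of Lee-Seung multiplicative updates factorizing a made-up document-term matrix into nonnegative document-topic (W) and topic-term (H) factors. In practice use `sklearn.decomposition.NMF` instead:

```python
import random

def nmf(Vm, k, n_iter=200, seed=0):
    """Tiny NMF via Lee-Seung multiplicative updates: Vm ~ W x H."""
    rng = random.Random(seed)
    n, m = len(Vm), len(Vm[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-9
    def product():  # current reconstruction W x H
        return [[sum(W[i][t] * H[t][j] for t in range(k)) for j in range(m)]
                for i in range(n)]
    for _ in range(n_iter):
        WH = product()
        for t in range(k):            # update H (topic-term weights)
            for j in range(m):
                num = sum(W[i][t] * Vm[i][j] for i in range(n))
                den = sum(W[i][t] * WH[i][j] for i in range(n)) + eps
                H[t][j] *= num / den
        WH = product()
        for i in range(n):            # update W (document-topic weights)
            for t in range(k):
                num = sum(H[t][j] * Vm[i][j] for j in range(m))
                den = sum(H[t][j] * WH[i][j] for j in range(m)) + eps
                W[i][t] *= num / den
    return W, H

# Invented 4x4 document-term matrix with two obvious "topics":
V = [[2, 2, 0, 0], [2, 2, 0, 0], [0, 0, 3, 3], [0, 0, 3, 3]]
W, H = nmf(V, k=2)
approx = [[sum(W[i][t] * H[t][j] for t in range(2)) for j in range(4)]
          for i in range(4)]
err = sum((V[i][j] - approx[i][j]) ** 2 for i in range(4) for j in range(4))
```

The multiplicative form keeps every entry of W and H nonnegative, which is what lets the rows of H be read as additive "topics".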

    Gensim

    
    >>> import gensim
    # load id->word mapping (the dictionary)
    >>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
    # load corpus iterator
    >>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
    # extract 100 LDA topics, using 20 full passes (batch mode), no online updates
    >>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, passes=20)
    						

  • http://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation
    GraphLab

    
    >>> import graphlab as gl
    >>> docs = gl.SArray('http://s3.amazonaws.com/dato-datasets/nytimes')
    >>> m = gl.topic_model.create(docs,
                                  num_topics=20,       # number of topics
                                  num_iterations=10,   # algorithm parameters
                                  alpha=.01, beta=.1)  # hyperparameters
                            

  • https://dato.com/products/create/docs/generated/graphlab.topic_model.create.html
    lda

    
    >>> import lda
    >>> X = lda.datasets.load_reuters()
    >>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
    >>> model.fit(X)  # model.fit_transform(X) is also available
    				

  • http://pythonhosted.org//lda/
    Pipeline

    Preprocessing

    Vector Space

    More

    Other algorithms

  • Hierarchical LDA
  • Dynamic Topic Models: https://github.com/mihurtado/dtm_gensim
  • NMF
    Additional resources

  • http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

  • http://miriamposner.com/blog/very-basic-strategies-for-interpreting-results-from-the-topic-modeling-tool/

  • http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/

  • https://beta.oreilly.com/ideas/topic-models-past-present-and-future
    Questions?

    Slides:
    http://chdoig.github.com/acm-sigkdd-topic-modeling

    Email: christine.doig@continuum.io

    Twitter: ch_doig