http://chdoig.github.com/acm-sigkdd-topic-modeling
Iterative algorithm
Hard: Unsupervised learning. No labels.
Human-in-the-loop
Metrics
More Metrics [1]:
Warning: LDA in scikit-learn refers to Linear Discriminant Analysis!
scikit-learn implements alternative algorithms, e.g. NMF (Non-negative Matrix Factorization) [1][2]
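To make concrete what NMF computes, here is a minimal pure-Python sketch. It assumes the classic multiplicative-update rules, which are not necessarily what scikit-learn runs by default; the toy count matrix, the vocabulary and the helper names (`matmul`, `transpose`, `nmf`) are made up for illustration.

```python
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (docs x terms) by W (docs x topics) times H (topics x terms).

    All entries of W and H stay non-negative because each update only
    multiplies by a ratio of non-negative numbers.
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    H = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # Update H:  H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        numer = matmul(Wt, V)
        denom = matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(h_row, a_row, b_row)]
             for h_row, a_row, b_row in zip(H, numer, denom)]
        # Update W:  W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        numer = matmul(V, Ht)
        denom = matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(w_row, a_row, b_row)]
             for w_row, a_row, b_row in zip(W, numer, denom)]
    return W, H


# Hypothetical toy corpus: rows are documents, columns are term counts.
vocab = ["goal", "match", "team", "vote", "election", "party"]
V = [
    [2, 1, 3, 0, 0, 0],
    [1, 2, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 2],
    [0, 0, 0, 1, 2, 2],
]

W, H = nmf(V, n_topics=2)
for k, row in enumerate(H):
    top = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)[:3]
    print("topic", k, [vocab[j] for j in top])
```

Each row of `H` is a topic (weights over terms) and each row of `W` gives one document's topic weights. In scikit-learn the same kind of factorization is available as `sklearn.decomposition.NMF`.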
>>> import gensim
# load id->word mapping (the dictionary)
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
# load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
# extract 100 LDA topics using 20 full passes in batch mode (no online updates)
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, passes=20)
>>> import graphlab as gl
# graphlab was imported as gl, so the alias must be used here
>>> docs = gl.SArray('http://s3.amazonaws.com/dato-datasets/nytimes')
>>> m = gl.topic_model.create(docs,
...                           num_topics=20,       # number of topics
...                           num_iterations=10,   # algorithm parameters
...                           alpha=.01, beta=.1)  # hyperparameters
>>> import lda
>>> X = lda.datasets.load_reuters()
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X) # model.fit_transform(X) is also available
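The `lda` package fits the model with collapsed Gibbs sampling, and `n_iter` is the number of sampling sweeps. The sketch below shows what one such iterative scheme might look like in plain Python. It is an illustration under stated assumptions, not the package's implementation: the function `gibbs_lda`, the toy documents and the hyperparameter values are all hypothetical.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, eta=0.01, seed=1):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids.

    Returns the document-topic and topic-word count matrices of the final sample.
    """
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]                 # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                                  # tokens per topic

    # Start from a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + eta) / (nk[t] + vocab_size * eta)
                    for t in range(n_topics)
                ]
                # Resample the topic and add the token back.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return ndk, nkw


# Hypothetical toy corpus: each document is a list of word ids into vocab.
vocab = ["goal", "match", "team", "vote", "election", "party"]
docs = [[0, 1, 2, 0, 2], [1, 2, 0, 1], [3, 4, 5, 3], [4, 5, 5, 3, 4]]

ndk, nkw = gibbs_lda(docs, n_topics=2, vocab_size=len(vocab))
for k, counts in enumerate(nkw):
    top = sorted(range(len(vocab)), key=lambda w: counts[w], reverse=True)[:3]
    print("topic", k, [vocab[w] for w in top])
```

Because there are no labels, the sampler only ever sees co-occurrence counts; the topics it returns still need a human to interpret them.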
Slides:
http://chdoig.github.com/acm-sigkdd-topic-modeling
Email: christine.doig@continuum.io
Twitter: ch_doig