Free Python distribution: Anaconda
Open source: conda, blaze, dask, bokeh, numba...
Proud sponsor of PyGotham, PyData, SciPy, PyCon, EuroPython...
We are hiring!
Building the NYT Recommendation Engine: From keywords over collaborative filtering to Collaborative Topic Modeling
Iterative algorithm
Hard: Unsupervised learning. No labels.
Human-in-the-loop
Metrics
More Metrics [1]:
Warning: in scikit-learn, the abbreviation LDA refers to Linear Discriminant Analysis, not Latent Dirichlet Allocation!
scikit-learn implements alternative algorithms, e.g. NMF (Non-negative Matrix Factorization) [1][2]
import gensim
# load id->word mapping (the dictionary)
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
# load corpus iterator
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
# extract 100 LDA topics in batch mode: 20 full passes, no online updates
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, passes=20)
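The gensim snippet above takes two inputs: an id->word mapping and a corpus of sparse document vectors. A plain-Python sketch of those two structures may help; the toy documents and the names `word2id` and `corpus` are assumptions for illustration, and this builds raw counts rather than the TF-IDF weights stored in the file loaded above.

```python
from collections import Counter

# Toy documents (made up for illustration).
docs = [
    "human machine interface for computer applications",
    "graph of trees and graph minors",
]
tokenized = [d.split() for d in docs]

# Assign each distinct word an integer id in order of first appearance.
word2id = {}
for tokens in tokenized:
    for t in tokens:
        word2id.setdefault(t, len(word2id))

# The id->word mapping (the "dictionary").
id2word = {i: w for w, i in word2id.items()}

# Each document becomes a sorted list of (word_id, count) pairs.
corpus = [
    sorted(Counter(word2id[t] for t in tokens).items())
    for tokens in tokenized
]

for doc in corpus:
    print([(id2word[i], n) for i, n in doc])
```

A list of `(word_id, weight)` pairs per document is the sparse format gensim models iterate over, which is why the corpus can be streamed from disk instead of held in memory.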
import graphlab as gl
docs = gl.SArray('http://s3.amazonaws.com/dato-datasets/nytimes')
m = gl.topic_model.create(docs,
                          num_topics=20,       # number of topics
                          num_iterations=10,   # algorithm parameters
                          alpha=.01, beta=.1)  # hyperparameters
import lda
X = lda.datasets.load_reuters()
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X) # model.fit_transform(X) is also available
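Once a model is fitted, the usual next step is to look at the top words per topic. The helper below is a minimal sketch, not part of the lda package: `top_words` is a hypothetical function, and the vocabulary and weights are toy values. With the lda package, the fitted model's `topic_word_` array and the vocabulary from `lda.datasets.load_reuters_vocab()` could be passed in instead.

```python
def top_words(topic_word, vocab, n=3):
    """Return the n highest-weighted words for each topic.

    topic_word: one row of word weights per topic.
    vocab: sequence of words, aligned with the columns of topic_word.
    """
    result = []
    for weights in topic_word:
        # Rank word indices by descending weight.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n]])
    return result


# Toy stand-ins for a fitted model's topic-word distribution (hypothetical values).
vocab = ["bank", "money", "river", "water", "loan"]
topic_word = [
    [0.31, 0.35, 0.02, 0.03, 0.29],  # a "finance"-like topic
    [0.25, 0.02, 0.38, 0.33, 0.02],  # a "river"-like topic
]

topics = top_words(topic_word, vocab, n=3)
for k, words in enumerate(topics):
    print("Topic {}: {}".format(k, " ".join(words)))
```

Note that "bank" ranks highly in both toy topics, the kind of ambiguity a human reviewer would want to see when labeling topics.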
https://github.com/ContinuumIO/topik
... automating the pipeline
from topik.run import run_model
run_model('data.json', field='abstract', model='lda_online', r_ldavis=True, output_file=True)
Slides:
http://chdoig.github.com/pygotham-topic-modeling
Email: christine.doig@continuum.io
Twitter: ch_doig