Topic Modeling

Misc

  1. Word cloud for topic modeling

  2. (TopSBM) Topic modeling with stochastic block models, TopSBM

NMF (Non-Negative Matrix Factorization)

  1. Non-negative Matrix factorization (NMF)

  2. Medium article about LDA and NMF (Non-negative Matrix Factorization) + code

LSA (TFIDF + SVD)

  1. A very good article about LSA (TF-IDF × SVD), pLSA, LDA, and lda2vec, including code and an explanation of Dirichlet probability. Lda2vec code

LDA (Latent Dirichlet Allocation)

  1. (LDA) Latent Dirichlet Allocation

  2. Note that the acronym LDA is already taken by the algorithm above (Latent Dirichlet Allocation)!

  3. Latent Dirichlet Allocation (LDA) - This algorithm takes a group of documents (anything made up of text) and returns a number of topics (each made up of a number of words) most relevant to these documents.

  4. Medium article about LDA and NMF (Non-negative Matrix Factorization) + code

  5. If LDA groups two topics together, we can influence the algorithm to make those two topics separable - this is called semi-supervised Guided LDA (see the sketch below).
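A minimal sketch of the semi-supervised Guided LDA idea, assuming the `guidedlda` package and its `GuidedLDA` class with `seed_topics` / `seed_confidence` arguments; the documents and seed words below are toy examples, not from the linked article.

```python
# Guided LDA: seed words nudge two otherwise-confusable topics apart.
import numpy as np
import guidedlda  # hypothetical dependency: pip install guidedlda
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the court ruled on the new tax law",
        "parliament passed the budget and tax reform",
        "the team won the championship game",
        "the striker scored twice in the final"]

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()   # guidedlda works on a doc x word count matrix
word2id = vec.vocabulary_

# Anchor topic 0 to politics words and topic 1 to sports words.
seed_topic_list = [["tax", "law", "parliament"], ["game", "championship", "striker"]]
seed_topics = {word2id[w]: t for t, words in enumerate(seed_topic_list) for w in words if w in word2id}

model = guidedlda.GuidedLDA(n_topics=2, n_iter=100, random_state=7, refresh=20)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)  # higher confidence = stronger nudge

vocab = np.array(vec.get_feature_names_out())
for t, dist in enumerate(model.topic_word_):
    print(f"topic {t}:", vocab[np.argsort(dist)][-5:][::-1])
```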

  6. LDA tutorials plus code (I used this to build my own classes) - uses the gensim Mallet wrapper, which doesn't work with pyLDAvis, so use this to fix it

  7. Tutorial on LDA/NMF on Medium - using a tf-idf matrix as input!

  8. One of the best explanations of tf-idf vs. BOW for LDA/NMF - term frequency for LDA, tf-idf for NMF, though tf-idf can still be used for top-k word selection and visualization in LDA; important paper

  9. LDA is a probabilistic generative model that generates documents by sampling a topic for each word and then a word from the sampled topic. The generated document is represented as a bag of words.

    NMF is, in its general definition, the search for two matrices W and H such that W*H ≈ V, where V is an observed matrix. The only requirement for those matrices is that all their elements must be non-negative.

    From the above definitions it is clear that in LDA only bag-of-words frequency counts can be used, since a vector of reals makes no sense (did we create a word 1.2 times?). On the other hand, we can use any non-negative representation for NMF, and in the example tf-idf is used.

    As far as choosing the number of iterations: for NMF in scikit-learn I don't know the stopping criterion, although I believe it is the relative improvement of the loss function falling below a threshold, so you'll have to experiment. For LDA I suggest manually checking the improvement of the log likelihood on a held-out validation set and stopping when it falls under a threshold. The rest of the parameters depend heavily on the data, so I suggest, as suggested by @rpd, that you do a parameter search. To sum up, LDA can only generate frequencies and NMF can generate any non-negative matrix (a sketch follows).
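A minimal sklearn sketch of the distinction above, with toy documents: LDA is fed raw term counts (BOW), NMF is fed tf-idf, and both return a doc-topic matrix plus a topic-term matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = ["cats chase mice", "dogs chase cats",
        "stocks fell on rate fears", "markets rallied as rates fell"]

# LDA: frequency counts only (a word cannot occur 1.2 times)
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
W_lda = lda.fit_transform(counts)          # doc x topic; lda.components_ is topic x term

# NMF: any non-negative matrix works, tf-idf is the usual choice
tfidf = TfidfVectorizer().fit_transform(docs)
nmf = NMF(n_components=2, max_iter=500, random_state=0)  # stops on tol or max_iter
W_nmf = nmf.fit_transform(tfidf)           # doc x topic; nmf.components_ is topic x term
```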

  10. How to measure the variance for LDA and NMF against PCA: variance-score the transformation and inverse transformation of the data, testing 1, 2, 3, 4 PCs/topics/NMF components (a sketch follows).
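A hedged sketch of the variance idea in item 10, not the linked article's exact code: transform, reconstruct, and compute an explained-variance score for a growing number of components. sklearn's LatentDirichletAllocation has no inverse_transform, so reconstructing via the components matrix is an approximation chosen here.

```python
import numpy as np
from sklearn.decomposition import PCA, NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import explained_variance_score

docs = ["cats chase mice", "dogs chase cats",
        "stocks fell on rate fears", "markets rallied as rates fell"]
X = CountVectorizer().fit_transform(docs).toarray().astype(float)

for n in (1, 2, 3, 4):
    for name, model in [("PCA", PCA(n_components=n)),
                        ("NMF", NMF(n_components=n, max_iter=500, random_state=0)),
                        ("LDA", LatentDirichletAllocation(n_components=n, random_state=0))]:
        W = model.fit_transform(X)
        # Use inverse_transform where it exists (PCA, NMF); otherwise W @ components_ (LDA).
        X_hat = model.inverse_transform(W) if hasattr(model, "inverse_transform") else W @ model.components_
        print(name, n, round(explained_variance_score(X.ravel(), X_hat.ravel()), 3))
```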

  11. What is LDA?

    1. It is natively unsupervised; it uses a joint-probability method to find topics (the user has to pass the number of topics to the LDA API). If the input to LDA is a "doc × word" matrix, it transforms it into two matrices:

    2. Doc × topic

    3. Word × topic

    4. Further, if labels are available, you can feed the "doc × topic" matrix into a supervised algorithm (see the sketch after this list).
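A minimal sklearn sketch of the pipeline in item 11, with a hypothetical toy corpus and labels: the doc × topic matrix from LDA is fed to a supervised classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = ["cats chase mice", "dogs chase cats",
        "stocks fell on rate fears", "markets rallied as rates fell"]
labels = [0, 0, 1, 1]                       # hypothetical labels for the supervised step

X = CountVectorizer().fit_transform(docs)   # doc x word
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)            # doc x topic
word_topic = lda.components_.T              # word x topic

clf = LogisticRegression().fit(doc_topic, labels)  # supervised model on doc x topic
```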

  12. Medium on LDA, explains the random probabilistic nature of LDA

  13. Machinelearningplus on LDA in sklearn - a great read, don't forget to read the Mallet article.

  14. Medium on LSA, pLSA, LDA, and lda2vec - high-level and theoretical, not clear

  15. Medium on LDA, some historical references and general high-level how-to-use examples.

  16. Incredibly useful response on LDA grid search params and about LDA expectations. Must read.

  17. LDA vs pLSA, talks about sampling from a distribution of distributions in LDA

  18. Blog post on topic modelling - has some text about overfitting, which goes undiscussed in many places.

  19. Perplexity vs coherence on held-out unseen data: not okay and okay, respectively, due to how the metrics are measured, i.e., read the formulas. Also this and this

  20. Jupyter notebook on Hacker News LDA topic modelling - missing code?

  21. Jupyter notebook comparing k-means, LDA, SVD, and NMF - the advice is to keep NMF or another method as a baseline to measure against LDA.

  22. Medium on GuidedLDA - switching from LDA to a variation of it that is guided by the researcher / data

  23. The best topic modelling explanation, including usages and insights; a great read, with code - shows how to find similar docs by topic in gensim, and how to transform unseen documents and do similarity using sklearn (see the sketch after this list):

    1. Text classification – Topic modeling can improve classification by grouping similar words together in topics rather than using each word as a feature

    2. Recommender Systems – Using a similarity measure we can build recommender systems. If our system recommends articles to readers, it will recommend articles with a topic structure similar to those the user has already read.

    3. Uncovering Themes in Texts – Useful for detecting trends in online publications for example

    4. A Form of Tagging - If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable labels and use different heuristics to convert the weighted topics to a set of tags.

    5. Topic Modelling for Feature Selection - Sometimes LDA can also be used as a feature-selection technique. Take a text classification problem where the training data contains category-wise documents: run LDA on each category's set of documents, then remove topic terms that are common across categories; the remaining terms are the best features for each category.
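A minimal sklearn sketch of the "transform unseen documents and do similarity" use case from item 23; the toy corpus and the choice of cosine similarity are illustrative assumptions, not the linked article's code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["cats chase mice", "dogs chase cats",
              "stocks fell on rate fears", "markets rallied as rates fell"]
vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)
train_topics = lda.transform(X_train)

# Unseen document: reuse the fitted vectorizer, then transform with the fitted LDA.
new_doc = ["the dog chased the cat"]
new_topics = lda.transform(vec.transform(new_doc))

# Cosine similarity in topic space finds the most similar training documents.
sims = cosine_similarity(new_topics, train_topics)[0]
print(sorted(zip(sims, train_docs), reverse=True)[:2])
```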

  24. Another great article about LDA, including the algorithm and its parameters!! And Parameters of LDA (a gensim sketch follows this list)

    1. Alpha and Beta Hyperparameters – alpha represents document-topic density and beta represents topic-word density. The higher the value of alpha, the more topics documents are composed of; the lower the value, the fewer topics per document. Likewise, the higher the beta, the more words from the corpus each topic is composed of; the lower the beta, the fewer words.

    2. Number of Topics – Number of topics to be extracted from the corpus. Researchers have developed approaches to obtain an optimal number of topics using the Kullback-Leibler divergence score. I will not discuss this in detail, as it is too mathematical. For understanding, one can refer to this[1] original paper on the use of KL divergence.

    3. Number of Topic Terms – Number of terms composing a single topic. It is generally decided according to the requirement: if the problem statement is about extracting themes or concepts, a higher number is recommended; if it is about extracting features or terms, a low number is recommended.

    4. Number of Iterations / Passes – Maximum number of iterations allowed to the LDA algorithm for convergence.
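A minimal gensim sketch of the parameters listed above; the toy corpus and the particular alpha / eta / passes values are illustrative, not recommendations.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cats", "chase", "mice"], ["dogs", "chase", "cats"],
         ["stocks", "fell", "on", "rate", "fears"],
         ["markets", "rallied", "as", "rates", "fell"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,       # number of topics to extract
    alpha=0.1,          # document-topic density (lower -> fewer topics per doc)
    eta=0.01,           # topic-word density (gensim's name for beta)
    passes=10,          # passes over the whole corpus
    iterations=100,     # max inference iterations per chunk of documents
    random_state=0,
)
print(lda.show_topics(num_words=5))
```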

  25. Ways to improve LDA (a filtering sketch follows this list):

    1. Reduce the dimensionality of the document-term matrix

    2. Frequency filter

    3. POS filter

    4. Batch wise LDA
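A minimal gensim sketch of the frequency-filter idea from the list above; the thresholds are illustrative, and the POS filter is only hinted at in a comment.

```python
from gensim.corpora import Dictionary

texts = [["cats", "chase", "mice"], ["dogs", "chase", "cats"],
         ["stocks", "fell", "on", "rate", "fears"],
         ["markets", "rallied", "as", "rates", "fell"]]
dictionary = Dictionary(texts)

# Frequency filter: drop very rare and very common terms to shrink the
# document-term matrix before fitting LDA (thresholds are illustrative).
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=50000)
corpus = [dictionary.doc2bow(t) for t in texts]

# POS filter (optional, hypothetical): keep only nouns/adjectives with a tagger
# such as spaCy, e.g. [t.lemma_ for t in nlp(doc) if t.pos_ in {"NOUN", "ADJ"}].
```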

  26. History of LDA - by the French guy

  27. Topic modelling with LDA and NMF on Medium - has a very good, simple example with probabilities

  28. The best code article on LDA/MALLET, and using sklearn (using clustering to get the group of sentences in each topic)

  29. Exploring Topic Coherence over Many Models and Many Topics - LDA, NMF, SVD, using UMass and UCI coherence measures

  30. Paper: topic modelling and event identification from Twitter data; compares LDA vs NMI (NMF?) and uses coherence for analysis

  31. What is Wrong with Topic Modeling? (and How to Fix it Using Search-based SE) - LDADE's tunings dramatically reduce topic instability.

  32. (didn't read) NTM - neural topic modeling using embedding spaces, with GitHub code

Mallet LDA

  1. Difference between LDA and Mallet - The inference algorithms in Mallet and gensim are indeed different: Mallet uses Gibbs sampling, which is more precise than gensim's faster, online variational Bayes. There is a way to get comparable performance by increasing the number of passes (a wrapper sketch follows item 2 below).

  2. Alpha and beta in Mallet: contribution

    1. The default for alpha is 5.0 divided by the number of topics. You can think of this as five "pseudo-words" of weight on the uniform distribution over topics. If the document is short, we expect to stay closer to the uniform prior. If the document is long, we would feel more confident moving away from the prior.

    2. With hyperparameter optimization, the alpha value for each topic can be different. They usually become smaller than the default setting.

    3. The default value for beta is 0.01. This means that each topic has a weight on the uniform prior equal to the size of the vocabulary divided by 100. This seems to be a good value. With optimization turned on, the value rarely changes by more than a factor of two.
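A hedged sketch of the Mallet route from items 1-2, assuming a local Mallet installation and a gensim version below 4.0 (where the LdaMallet wrapper lives in gensim.models.wrappers); mallet_path and the toy corpus are placeholders.

```python
from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet  # removed in gensim >= 4.0

texts = [["cats", "chase", "mice"], ["dogs", "chase", "cats"],
         ["stocks", "fell", "on", "rate", "fears"],
         ["markets", "rallied", "as", "rates", "fell"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

mallet_path = "/path/to/mallet-2.0.8/bin/mallet"  # placeholder path to the Mallet binary
lda_mallet = LdaMallet(
    mallet_path,
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    optimize_interval=10,  # turn on Mallet's hyperparameter optimization, so per-topic
                           # alphas can move away from the default 5.0 / num_topics
    iterations=1000,       # Gibbs sampling iterations
)
print(lda_mallet.show_topics(num_words=5))

# For pyLDAvis, gensim's malletmodel2ldamodel can convert the wrapper to a
# regular LdaModel (the fix referenced in item 6 of the LDA list above).
```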

Visualization

  1. How to interpret topics using pyLDAvis (a code sketch follows this list): Let’s interpret the topic visualization. Notice how topics are shown on the left while words are on the right. Here are the main things you should consider:

    1. Larger topics are more frequent in the corpus.

    2. Topics closer together are more similar, topics further apart are less similar.

    3. When you select a topic, you can see the most representative words for the selected topic. This measure can be a combination of how frequent or how discriminant the word is. You can adjust the weight of each property using the slider.

    4. Hovering over a word will adjust the topic sizes according to how representative the word is for the topic.

    5. pyLDAvis - what am I looking at? By spaCy. There are a lot of moving parts in the visualization. Here's a brief summary:

      1. On the left, there is a plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map)

      2. The plot is rendered in two dimensions according to a multidimensional scaling (MDS) algorithm. Topics that are generally similar should appear close together on the plot, while dissimilar topics should appear far apart.

      3. The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.

      4. An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.

      5. On the right, there is a bar chart showing top terms.

      6. When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.

      7. When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter λ, which can be adjusted with a slider above the bar chart.

        1. Setting the λ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.

        2. Setting λ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic, i.e., terms that occur only in this topic and do not occur in other topics.

        3. Setting λ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.

        4. Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.

    6. A more detailed explanation of the pyLDAvis visualization can be found here. Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's LdaMulticore object and pyLDAvis' visualization, you have to dig through the terms manually.
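A short sketch of producing the visualization described above from a fitted gensim model with pyLDAvis; the variable names (lda, corpus, dictionary) are assumed to come from an earlier gensim fit, and the module is pyLDAvis.gensim_models in recent releases (pyLDAvis.gensim in older ones).

```python
import pyLDAvis
import pyLDAvis.gensim_models

# lda, corpus, dictionary: a fitted gensim LdaModel, its BOW corpus, and its Dictionary
# (e.g., from the gensim sketch in the LDA section above).
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_vis.html")   # or pyLDAvis.display(vis) in a notebook

# The relevance slider implements
#   r(w, t | lambda) = lambda * log p(w|t) + (1 - lambda) * log( p(w|t) / p(w) )
# which is the lambda-weighted ranking described in the list above.
```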

COHERENCE (Topic)

  1. Paper: UMass, UCI, NPMI, C_V, C_P, etc. - Exploring the Space of Topic Coherence Measures

  2. Paper: Evaluating topic coherence - Abstract: Topic models extract representative word sets—called topics—from word counts in documents without requiring any semantic annotations. Topics are not guaranteed to be well interpretable, therefore, coherence measures have been proposed to distinguish between good and bad topics. Studies of topic coherence so far are limited to measures that score pairs of individual words. For the first time, we include coherence measures from scientific philosophy that score pairs of more complex word subsets and apply them to topic scoring.

Conclusion: The results of the first experiment show that if we are using the one-any, any-any and one-all coherences directly for optimization they are leading to meaningful word sets. The second experiment shows that these coherence measures are able to outperform the UCI coherence as well as the UMass coherence on these generated word sets. For evaluating LDA topics any-any and one-any coherences perform slightly better than the UCI coherence. The correlation of the UMass coherence and the human ratings is not as high as for the other coherences.

  1. Code: Evaluating topic coherence, using gensim's u_mass or c_v parameter - To conclude, there are many other approaches to evaluate topic models, such as perplexity, but it is a poor indicator of the quality of the topics. Topic visualization is also a good way to assess topic models. The topic coherence measure is a good way to compare different topic models based on their human interpretability. The u_mass and c_v topic coherences capture the optimal number of topics by giving the interpretability of these topics a number called the coherence score (a sweep sketch follows this list).

  2. Formulas: UCI vs UMASS

  3. Advice for coherence, then Good vs bad model (50 vs 1 iterations) measuring u_mass coherence - 2nd code - “In your data we can see that there is a peak between 0-100 and a peak between 400-500. What I would think in this case is that "does ~480 topics make sense for the kind of data I have?" If not, you can just do an np.argmax for 0-100 topics and trade-off coherence score for simpler understanding. Otherwise just do an np.argmax on the full set.”

  4. Topic stability metric, a novel method, compared against Jaccard, Spearman, silhouette: Measuring LDA Topic Stability from Clusters of Replicated Runs
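A sketch of the coherence-based selection described in items 1 and 3 above: score candidate topic counts with gensim's CoherenceModel (u_mass or c_v) and pick the peak with np.argmax; the toy corpus and candidate counts are illustrative.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["cats", "chase", "mice"], ["dogs", "chase", "cats"],
         ["stocks", "fell", "on", "rate", "fears"],
         ["markets", "rallied", "as", "rates", "fell"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

topic_counts = [2, 3, 4]
scores = []
for k in topic_counts:
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=10, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")  # or "u_mass"
    scores.append(cm.get_coherence())

# Peak coherence; in practice, trade the peak off against simpler, fewer topics.
best_k = topic_counts[int(np.argmax(scores))]
print(dict(zip(topic_counts, scores)), "best:", best_k)
```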

LDA2VEC

  1. “if you want to rework your own topic models that, say, jointly correlate an article’s topics with votes or predict topics over users then you might be interested in lda2vec.”

  2. Moody’s Slide Share (excellent read)

  3. Example on GitHub in Jupyter

TOP2VEC

  1. Topic modeling with DistilBERT on Medium, BERTopic!, c-TF-IDF, UMAP, HDBSCAN, merging similar topics, visualization, BERTopic (same method as the above). A sketch follows.
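A minimal BERTopic sketch matching the pipeline described above (sentence embeddings, then UMAP, HDBSCAN, and c-TF-IDF); the document list is a toy placeholder, and real corpora of hundreds of documents or more are needed for meaningful topics.

```python
from bertopic import BERTopic

docs = ["cats chase mice", "dogs chase cats and play fetch",
        "stocks fell on rate fears", "markets rallied as rates fell",
        "the central bank raised rates again", "investors worried about inflation",
        "the striker scored twice in the final", "the team won the championship game"]

# Defaults use sentence-transformer embeddings -> UMAP -> HDBSCAN -> c-TF-IDF.
topic_model = BERTopic(min_topic_size=2)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())              # per-topic sizes and top c-TF-IDF words
# topic_model.merge_topics(docs, [[0, 1]])       # merging similar topics, as mentioned above
```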
