https://github.com/jhlau/topic_interpretability
Computation of the semantic interpretability of topics produced by topic models.
- Host: GitHub
- URL: https://github.com/jhlau/topic_interpretability
- Owner: jhlau
- Created: 2013-05-28T06:40:05.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2017-04-19T05:52:46.000Z (over 7 years ago)
- Last Synced: 2024-08-03T18:20:58.011Z (5 months ago)
- Language: Roff
- Homepage:
- Size: 10.9 MB
- Stars: 179
- Watchers: 11
- Forks: 43
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-topic-models - topic_interpretability - Computation of the semantic interpretability of topics produced by topic models [:page_facing_up:](https://aclanthology.org/E14-1056.pdf) (Models / Latent Dirichlet Allocation (LDA) [:page_facing_up:](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf))
README
This package contains the scripts and various python tools for computing the semantic
interpretability of topics via: (1) the word intrusion task; (2) PMI/NPMI/LCP-based observed
coherence.

Updates
=======
* 2016-10-31: updated ComputeObservedCoherence to compute mean coherence over multiple top-N words; e.g. using option "-t 5 10 15 20" means it will compute coherence for the top-5/10/15/20 words and then take the mean over the 4 values. Our latest study found that using multiple top-N words improves performance (see "The Sensitivity of Topic Coherence Evaluation to Topic Cardinality" in [Other Related Papers](#other-related-papers)).

Directory Structure and Files
=============================
* ComputeObservedCoherence.py: computes the topic observed coherence (pairwise PMI/NPMI/LCP)
* ComputeWordCount.py: samples the word and word pair occurrences based on a reference corpus.
* ComputeWordIntrusion.py: computes the model precision of the word intrusion task.
* data: contains the input files (topics and intruder words).
* GenSVMInput.py: generates the feature file for SVM.
* ref_corpus: contains the reference corpus.
* results: contains the computed results for the topics.
* run-oc.sh: the main script for computing the observed coherence.
* run-wi.sh: the main script for running the word intrusion task.
* SplitSVM: splits the feature file generated by GenSVMInput.py to do 10-fold cross validation.
* svm_rank: contains the svm program and input feature files.
* wordcount: contains the word counts sampled by ComputeWordCount.py.

Running the System
==================
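For orientation before the steps below, here is a minimal sketch of what an NPMI-based observed coherence score, averaged over several top-N cutoffs, might look like. It is not the code in ComputeObservedCoherence.py: the function names, the dictionary-based count tables, the choice to score unseen words or pairs as 0, and the use of a plain mean over word pairs are all assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, Sequence


def npmi(count_a: int, count_b: int, count_ab: int, total: int) -> float:
    """Normalised PMI of two words from occurrence counts.

    Assumption: counts are numbers of documents (or windows) out of `total`.
    """
    if count_a == 0 or count_b == 0 or count_ab == 0:
        return 0.0  # assumption: unseen words or pairs contribute 0
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    if p_ab == 1.0:
        return 1.0  # both words occur everywhere; avoids division by zero
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_coherence(
    words: Sequence[str],
    top_n: int,
    word_counts: Dict[str, int],
    pair_counts: Dict[FrozenSet[str], int],
    total: int,
) -> float:
    """Mean pairwise NPMI over the top-N words of one topic."""
    scores = [
        npmi(
            word_counts.get(a, 0),
            word_counts.get(b, 0),
            pair_counts.get(frozenset((a, b)), 0),
            total,
        )
        for a, b in combinations(words[:top_n], 2)
    ]
    return sum(scores) / len(scores) if scores else 0.0


def mean_coherence(
    words: Sequence[str],
    top_ns: Sequence[int],
    word_counts: Dict[str, int],
    pair_counts: Dict[FrozenSet[str], int],
    total: int,
) -> float:
    """Average the coherence over several top-N cutoffs (cf. "-t 5 10 15 20")."""
    values = [
        topic_coherence(words, n, word_counts, pair_counts, total) for n in top_ns
    ]
    return sum(values) / len(values)
```

Calling `mean_coherence(topic_words, [5, 10, 15, 20], word_counts, pair_counts, total)` would mirror the idea behind the "-t 5 10 15 20" option: one coherence value per cutoff, then the mean of the four.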
Pairwise PMI/NPMI/LCP observed coherence:
* Generate the topic file and put it in data/
* Set up the parameters in run-oc.sh
* Execute run-oc.sh

Word intrusion:
* Generate the topic file (with intruder words) and the intruder word file and put them in data/
* Set up the parameters in run-wi.sh
* Execute run-wi.sh

Input Format
============
* Topic file: one line per topic (displaying top-N words).
* Topic file with intruder word: one line per topic including the intruder word.
* Intruder word file: one line per intruder word (each line corresponds to the topic of the same
line number).

Examples are given in data/
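As an illustration of the formats above, the following sketch reads a topic file and an intruder word file and pairs them up by line number. The function names and the consistency check (that each intruder word appears in its topic line) are assumptions for this example, not part of the package.

```python
from typing import List, Tuple


def parse_topics(text: str) -> List[List[str]]:
    """One topic per line; words are separated by white space."""
    return [line.split() for line in text.splitlines()]


def pair_with_intruders(
    topic_text: str, intruder_text: str
) -> List[Tuple[List[str], str]]:
    """Pair each topic (with intruder) with the intruder word on the same line."""
    topics = parse_topics(topic_text)
    intruders = [line.strip() for line in intruder_text.splitlines()]
    if len(topics) != len(intruders):
        raise ValueError(
            f"{len(topics)} topic lines but {len(intruders)} intruder lines"
        )
    pairs = []
    for line_no, (words, intruder) in enumerate(zip(topics, intruders), start=1):
        # Assumed sanity check: the intruder should be one of the topic's words.
        if intruder not in words:
            raise ValueError(f"line {line_no}: intruder {intruder!r} not in topic")
        pairs.append((words, intruder))
    return pairs


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the two files.
    sample_topics = "space nasa launch orbit banana\nteam game season player orbit\n"
    sample_intruders = "banana\norbit\n"
    for words, intruder in pair_with_intruders(sample_topics, sample_intruders):
        print(intruder, words)
```

Reading the real files would only require passing the contents of the two files in data/ as strings, for example via `open(path, encoding="utf-8").read()`.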
Reference Corpus
================
Parallel processing for sampling the word counts can be achieved by splitting the reference corpus
into multiple partitions. The format of the reference corpus is one line per document, and the words
should be tokenised (separated by white space). Best results are achieved by lemmatising the
reference corpus (and the document collection on which the topic model is run). An example reference
corpus is given in the package.

Output
======
* Debug OFF (in ComputeObservedCoherence.py/ComputeWordIntrusion.py): one score per line, each score corresponds to the topic of the same line
* Debug ON (in ComputeObservedCoherence.py/ComputeWordIntrusion.py): scores, topics and intruder words (for the word intrusion task only) are displayed

Note
====
The sampling of word counts works for multi-word topics (i.e. topics with phrases/collocations). Use
the underscore symbol ("_") to concatenate the phrases/collocations. E.g. Topic 1: hello_world this_is_a_collocation apple orange banana durian

Licensing
=========
* MIT license - http://opensource.org/licenses/MIT.

Publications
------------
#### Original Paper
* Jey Han Lau, David Newman and Timothy Baldwin (2014). Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), Gothenburg, Sweden, pp. 530–539.

#### Other Related Papers
* David Newman, Jey Han Lau, Karl Grieser and Timothy Baldwin (2010). Automatic Evaluation of Topic
Coherence. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North
American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los Angeles,
USA, pp. 100–108.
* Jey Han Lau, Timothy Baldwin and David Newman (2013). On Collocations and Topic Models. ACM
Transactions on Speech and Language Processing 10(3), pp. 10:1–10:14.
* Jey Han Lau and Timothy Baldwin (2016). The Sensitivity of Topic Coherence Evaluation to Topic Cardinality. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), San Diego, USA, pp. 483–487.