Code to compute topic coherence for several topic cardinalities and aggregate scores across them
https://github.com/jhlau/topic-coherence-sensitivity


Host: GitHub
URL: https://github.com/jhlau/topic-coherence-sensitivity
Owner: jhlau
License: mit
Created: 2016-03-29T01:39:12.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2023-08-07T02:33:40.000Z (over 1 year ago)
Last Synced: 2024-08-03T18:20:57.837Z (8 months ago)
Language: Python
Homepage:
Size: 63.5 KB
Stars: 22
Watchers: 4
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-topic-models - topic-coherence-sensitivity - Code to compute topic coherence for several topic cardinalities and aggregate scores across them [:page_facing_up:](https://aclanthology.org/N16-1057.pdf) (Models / Latent Dirichlet Allocation (LDA) [:page_facing_up:](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf))

README

This repository contains the code and dataset described in the publication "The Sensitivity of Topic Coherence Evaluation to Topic Cardinality".

Running the System
==================

* The code depends on jhlau/topic_interpretability, so check out the repository: https://github.com/jhlau/topic_interpretability

* Use **run-wordcount.sh** to collect word co-occurrence statistics between topic words

* If doing word intrusion, use **run-wi.sh**; the script will:

 * generate SVM features based on word count features

 * train an SVM rank model to predict intruder words

* If doing NPMI, use **run-npmi.sh**; the script will:

 * compute topic coherence using word count features

* Both scripts will aggregate coherence scores over different cardinalities and print them at the end

* Note: an example toy dataset is given in example_data. To test, execute **run-wordcount.sh** followed by **run-[npmi/wi].sh**
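The NPMI path described above can be sketched in a few lines of Python. This is a minimal illustration rather than the repository's code: the function names, the in-memory count dictionaries, the default cardinalities, the handling of word pairs that never co-occur (scored as -1, the limit of NPMI), and the use of a plain mean to aggregate across cardinalities are all assumptions made for the example.

```python
import math
from itertools import combinations


def npmi(pair_count, count_a, count_b, total):
    """Normalised PMI of two words, from co-occurrence counts over `total` windows."""
    if pair_count == 0:
        return -1.0  # assumption: use the limiting value when words never co-occur
    p_ab = pair_count / total
    if p_ab == 1.0:
        return 1.0  # both words occur in every window; avoids dividing by log(1) = 0
    p_a = count_a / total
    p_b = count_b / total
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_npmi(words, word_counts, pair_counts, total):
    """Mean pairwise NPMI over the given topic words (needs at least two words)."""
    scores = [
        npmi(pair_counts.get(frozenset((a, b)), 0), word_counts[a], word_counts[b], total)
        for a, b in combinations(words, 2)
    ]
    return sum(scores) / len(scores)


def aggregate_over_cardinalities(topic_words, word_counts, pair_counts, total,
                                 cardinalities=(5, 10, 15, 20)):
    """Score the top-N words for each N, then average the scores (assumed aggregation)."""
    per_n = {
        n: topic_npmi(topic_words[:n], word_counts, pair_counts, total)
        for n in cardinalities
    }
    return per_n, sum(per_n.values()) / len(per_n)
```

Here `pair_counts` is keyed by `frozenset((word_a, word_b))` so that lookup does not depend on word order. In the actual pipeline the counts come from the word count step, not from hand-built dictionaries.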

Scripts
=======

* run-wordcount.sh: runs topic_interpretability/ComputeWordCount.py to collect word statistics

* run-wi.sh: computes topic coherence using word intrusion

* run-npmi.sh: computes topic coherence using NPMI

Mechanical Turk Annotations
===========================

The coherence ratings of topics collected via mturk are in mturk_annotation/annotations.csv (tab-delimited).

Description of columns:

* domain: domain of topic (wiki or news)

* topic: top-20 words of the topic

* top-N: top-N average rating (e.g. top-5 means only the top 5 of the 20 words are presented when collecting the rating)
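As a rough sketch of how the annotation file could be read, the snippet below averages one rating column per domain using Python's standard `csv` module. It assumes the file has a header row whose column names match the descriptions above (`domain`, `topic`, and rating columns such as `top-5`); the helper name `mean_rating_by_domain` and the sample rows are hypothetical and are not taken from the real data.

```python
import csv
import io
from collections import defaultdict


def mean_rating_by_domain(tsv_file, rating_column):
    """Average `rating_column` per domain from a tab-delimited file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in reader:
        sums[row["domain"]] += float(row[rating_column])
        counts[row["domain"]] += 1
    return {domain: sums[domain] / counts[domain] for domain in sums}


# Made-up rows in the assumed layout, for illustration only.
sample = io.StringIO(
    "domain\ttopic\ttop-5\ttop-10\n"
    "wiki\talpha beta gamma\t2.0\t1.5\n"
    "wiki\tdelta epsilon zeta\t3.0\t2.5\n"
    "news\teta theta iota\t1.0\t2.0\n"
)
means = mean_rating_by_domain(sample, "top-5")
```

To run it on the real file, pass an open file handle instead of the `StringIO` object, for example `open("mturk_annotation/annotations.csv", newline="")`, and adjust the column names if the header differs.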

Processed Corpus (News and Wiki)
================================

* [2016-naacl-topic-cardinality/news_wiki.tgz](https://unimelbcloud-my.sharepoint.com/:f:/g/personal/jeyhan_lau_unimelb_edu_au/EgzpOQsDqjJIqN8Pd0DksgUBGXr6oW4NX1csPPWBjYFr-Q?e=3dqgGg)

Publication
-----------

* Jey Han Lau and Timothy Baldwin. The Sensitivity of Topic Coherence Evaluation to Topic Cardinality. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), San Diego, California, to appear.
