https://github.com/dhchenx/tm-eval

A toolkit to quickly evaluate model goodness over number of topics
https://github.com/dhchenx/tm-eval

topic-modeling-analysis

Last synced: about 1 year ago
JSON representation

A toolkit to quickly evaluate model goodness over number of topics

Host: GitHub
URL: https://github.com/dhchenx/tm-eval
Owner: dhchenx
Created: 2022-06-12T19:07:44.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-06-12T19:09:32.000Z (about 4 years ago)
Last Synced: 2025-03-03T13:39:49.170Z (over 1 year ago)
Topics: topic-modeling-analysis
Language: Python
Homepage:
Size: 74.2 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          ## Topic Modeling Evaluation

A toolkit to quickly evaluate model goodness over number of topics

### Metrics

Coherence measure to be used. 

- Fastest method - 'u_mass', 'c_uci' also known as `c_pmi`. 

- For 'u_mass' corpus should be provided, if texts is provided, it will be converted to corpus using the dictionary. 

- For 'c_v', 'c_uci' and 'c_npmi' `texts` should be provided (`corpus` isn't needed)

### Examples

Example 1: estimate metrics for one topic model with specific number of topics

```python

from tm_eval import *

# load a dictionary with document key and its term list split by ','.

input_file = "datasets/covid19_symptoms.pickle"

output_folder = "outputs"

model_name = "symptom"

num_topics = 10

# run

results = evaluate_all_metrics_from_lda_model(input_file=input_file, 

                                              output_folder=output_folder,

                                              model_name=model_name, 

                                              num_topics=num_topics)

print(results)

```

Example 2: find model goodness change over number of topics

```python

from tm_eval import *

if __name__=="__main__":

    # start configure

    # load a dictionary (key,value) with document id as key and its term list combined by ',' as value.

    input_file = "datasets/covid19_symptoms.pickle"

    output_folder = "outputs"

    model_name = "symptom"

    start=2

    end=5

    # end configure

    # run and explore

    list_results = explore_topic_model_metrics(input_file=input_file, 

                                               output_folder=output_folder,

                                               model_name=model_name,

                                               start=start,

                                               end=end)

    # summarize results

    show_topic_model_metric_change(list_results,save=True,

                                   save_path=f"{output_folder}/metrics.csv")

    # plot metric changes

    plot_tm_metric_change(csv_path=f"{output_folder}/metrics.csv",

                          save=True,save_folder=output_folder)

```

### Output results

![c_v](https://dhchenx.github.io/projects/tm-eval/c_v.jpg)

![u_mass](https://dhchenx.github.io/projects/tm-eval/u_mass.jpg)

![c_npmi](https://dhchenx.github.io/projects/tm-eval/c_npmi.jpg)

![c_uci](https://dhchenx.github.io/projects/tm-eval/c_uci.jpg)

### License

The `tm-eval` toolkit is provided by [Donghua Chen](https://github.com/dhchenx) with MIT License.

### References

1. [Topic Modeling in Python: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0)

2. [Evaluate Topic Models: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/dhchenx/tm-eval

Awesome Lists containing this project

README