https://github.com/abeusher/malletwrapper

A Python wrapper for MALLET topic modeling
https://github.com/abeusher/malletwrapper
Last synced: 6 months ago
JSON representation
A Python wrapper for MALLET topic modeling
Host: GitHub
URL: https://github.com/abeusher/malletwrapper
Owner: abeusher
Created: 2020-05-02T21:34:32.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2018-02-11T22:34:22.000Z (over 8 years ago)
Last Synced: 2025-06-25T20:07:55.876Z (about 1 year ago)
Size: 27.3 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # MalletWrapper

A Python wrapper for [MALLET](http://mallet.cs.umass.edu/about.php). Topic modeling only, for now.

## Mallet Installation Instructions (Mac)

1. [Download](http://mallet.cs.umass.edu/download.php) the Mallet directory

2. [Download](http://www.oracle.com/technetwork/java/javase/downloads/index.html) JDK

3. [Install](https://brew.sh) Homebrew

4. Install Ant via ```brew install ant```

```

cd /dir/containing/mallet-2.x.x

ant

```

## MalletWrapper Usage

```python

from MalletWrapper import Mallet

model = Mallet('/Users/mikeronayne/mallet-2.0.8/')

model.import_dir(input='/Users/mikeronayne/mallet-2.0.8/sample-data/web/en')

model.train_topics()

print(model.topic_keys) # see output_topic_keys parameter in Train Topics documentation

print(model.doc_topics) # see output_doc_topics parameter in Train Topics documentation

print(model.word_weights) # see topic_word_weights_file parameter in Train Topics documentation

```

```python

# {topic # (int): {dirichlet parameter: float, words: list}, ... }

{0: {'dirichlet': 0.5, 'words': ['rings', 'are', 'were', 'norway', 'ring', 'dust', 'only', 'number', 'may', 'moons', 'narrow', 'uranian', 'particles', 'discovered', 'uranus', 'survived', 'some', 'saga', 'including', 'system']}, ... }

# {document # (int): {document name: e.g. file path (str), topics: {topic # (int): weight (float), ... }}, ... }

{0: {'name': 'file:/Users/mikeronayne/mallet-2.0.8/sample-data/web/en/elizabeth_needham.txt', 'topics': {7: 0.3105263157894737, 6: 0.3, 1: 0.19473684210526315, 8: 0.07894736842105263, 0: 0.03684210526315789, 9: 0.02631578947368421, 2: 0.02631578947368421, 3: 0.015789473684210527, 5: 0.005263157894736842, 4: 0.005263157894736842}}, ... }

# {topic # (int): {word (str): weight (float)}, ... }

{0: {'elizabeth': 0.01, 'needham': 0.01, 'died': 0.01, 'may': 3.01, 'also': 0.01, 'known': 0.01, 'mother': 0.01, 'was': 0.01, 'english': 0.01, 'procuress': 0.01, 'and': 0.01, 'brothel-keeper': 1.01, 'th-century': 0.01, 'london': 0.01, ... }, ... }

```

## MalletWraper Documentation

### Constructor

```python

Mallet(mallet_dir, memory=1)

```

Parameter | Type | Description | Default

--- | --- | --- | ---

mallet_dir | str | File path of Mallet-2.x.x directory

memory | int, float | Maximum gigabytes of memory to allocate to Mallet | 1

### Importing Data

#### Via Directory

```python

import_dir(**kwargs)

```

Parameter | Type | Description | Default

--- | --- | --- | ---

input | str, list | The directories containing text files to be classified, one directory per class | null

preserve_case | bool | If true, do not force all strings to lowercase. | False

replacement_files | str, list | Files containing string replacements, one per line: 'A B [tab] C' replaces A B with C; 'A B' replaces A B with A_B | null

deletion_files | str, list | Files containing strings to delete after replacements but before tokenization (ie multiword stop terms) | null

remove_stopwords | bool | If true, remove a default list of common English "stop words" from the text. | False

stoplist_file | str | Instead of the default list, read stop words from a file, one per line. Implies ```remove_stopwords``` | null

extra_stopwords | str | Read whitespace-separated words from this file, and add them to either the default English stoplist or the list specified by ```stoplist_file```. | null

stop_pattern_file | str | Read regular expressions from a file, one per line. Tokens matching these regexps will be removed. | null

skip_header | bool | If true, in each document, remove text occurring before a blank line. This is useful for removing email or UseNet header | False

skip_html | bool | If true, remove text occurring inside <...>, as in HTML or SGML. | False

gram_sizes | int, str | Include among the features all n-grams of sizes specified. For example, to get all unigrams and bigrams, use ```gram_sizes='1,2'```. This option occurs after the removal of stop words, if removed. | 1

encoding | str | Character encoding for input file | UTF-8

token_regex | str | Regular expression used for tokenization. Example: ```[\p{L}\p{N}_]+\|[\p{P}]+``` (unicode letters, numbers and underscore OR all punctuation) | ```\p{L}[\p{L}\p{P}]+\p{L}```

print_output | bool | If true, print a representation of the processed data to standard output. This option is intended for debugging. | False

#### Via File

```python

import_file(**kwargs)

```

Parameter | Type | Description | Default

--- | --- | --- | ---

input | str | The file containing data to be classified, one instance per line | null

line_regex | str | Regular expression containing regex-groups for label, name and data. | ```^(\S*)[\s,]*(\S*)[\s,]*(.*)$```

name | int | The index of the group containing the instance name. Use 0 to indicate that the name field is not used. | 1

data |int | The index of the group containing the data. | 3

remove_stopwords | bool | If true, remove a default list of common English "stop words" from the text. | False

replacement_files | str, list | Files containing string replacements, one per line: 'A B [tab] C' replaces A B with C; 'A B' replaces A B with A_B | null

deletion_files | str, list | Files containing strings to delete after replacements but before tokenization (ie multiword stop terms) | null

stoplist_file | str | Instead of the default list, read stop words from a file, one per line. Implies ```remove_stopwords``` | null

extra_stopwords | str | Read whitespace-separated words from this file, and add them to either the default English stoplist or the list specified by ```stoplist_file```. | null

stop_pattern_file | str | Read regular expressions from a file, one per line. Tokens matching these regexps will be removed. | null

preserve_case | bool | If true, do not force all strings to lowercase. | False

encoding | str | Character encoding for input file | UTF-8

token_regex | str | Regular expression used for tokenization. Example: ```[\p{L}\p{N}_]+\|[\p{P}]+``` (unicode letters, numbers and underscore OR all punctuation) | ```\p{L}[\p{L}\p{P}]+\p{L}```

print_output | bool | If true, print a representation of the processed data to standard output. This option is intended for debugging. | False

### Train Topics

```python

train_topics(**kwargs)

```

Parameter | Type | Description | Default

--- | --- | --- | ---

input_model | str | The filename from which to read the binary topic model. The ```input``` option is ignored. By default this is null, indicating that no file will be read. | null

input_state | str | The filename from which to read the gzipped Gibbs sampling state created by ```output_state```. The original input file must be included, using ```input```. By default this is null, indicating that no file will be read. | null

output_model | str | The filename in which to write the binary topic model at the end of the iterations. By default this is null, indicating that no file will be written. | null

output_state | str | The filename in which to write the Gibbs sampling state after at the end of the iterations. By default this is null, indicating that no file will be written. | null

output_model_interval | int | The number of iterations between writing the model (and its Gibbs sampling state) to a binary file. You must also set the output_model to use this option, whose argument will be the prefix of the filenames. | 0

output_state_interval | int | The number of iterations between writing the sampling state to a text file. You must also set the ```output_state``` to use this option, whose argument will be the prefix of the filenames. | 0

inferencer_filename | str | A topic inferencer applies a previously trained topic model to new documents. By default this is null, indicating that no file will be written. | null

evaluator_filename | str | A held-out likelihood evaluator for new documents. By default this is null, indicating that no file will be written. | null

output_topic_keys | str | The filename in which to write the top words for each topic and any Dirichlet parameters. By default this is null, indicating that no file will be written. | null

num_top_words | int | The number of most probable words to print for each topic after model estimation. | 20

show_topics_interval | int | The number of iterations between printing a brief summary of the topics so far. | 50

topic_word_weights_file | str | The filename in which to write unnormalized weights for every topic and word type. By default this is null, indicating that no file will be written. | null

word_topic_counts_file | str | The filename in which to write a sparse representation of topic-word assignments. By default this is null, indicating that no file will be written. | null

diagnostics_file | str | The filename in which to write measures of topic quality, in XML format. By default this is null, indicating that no file will be written. | null

xml_topic_report | str | The filename in which to write the top words for each topic and any Dirichlet parameters in XML format. By default this is null, indicating that no file will be written. | null

xml_topic_phrase_report | str | The filename in which to write the top words and phrases for each topic and any Dirichlet parameters in XML format. By default this is null, indicating that no file will be written. | null

output_topic_docs | str | The filename in which to write the most prominent documents for each topic, at the end of the iterations. By default this is null, indicating that no file will be written. | null

num_top_docs | int | When writing topic documents with ```output_topic_docs```, report this number of top documents. | 100

output_doc_topics | str | The filename in which to write the topic proportions per document, at the end of the iterations. By default this is null, indicating that no file will be written. | null

doc_topics_threshold | float | When writing topic proportions per document with ```output_doc_topics```, do not print topics with proportions less than this threshold value. | 0.0

doc_topics_max | int | When writing topic proportions per document with ```output_doc_topics```, do not print more than INTEGER number of topics. A negative value indicates that all topics should be printed. | -1

num_topics | int | The number of topics to fit. | 10

num_threads | int | The number of threads for parallel training. | 1

num_iterations | int | The number of iterations of Gibbs sampling. | 1000

num_icm_iterations | int | The number of iterations of iterated conditional modes (topic maximization). | 0

no_inference | bool | Do not perform inference, just load a saved model and create a report. Equivalent to ```num_iterations``` 0. | False

random_seed | int | The random seed for the Gibbs sampler. Default is 0, which will use the clock. | 0

optimize_interval | int | The number of iterations between reestimating dirichlet hyperparameters. | 0

optimize_burn_in | int | The number of iterations to run before first estimating dirichlet hyperparameters. | 200

use_symmetric_alpha | bool | Only optimize the concentration parameter of the prior over document-topic distributions. This may reduce the number of very small, poorly estimated topics, but may disperse common words over several topics. | False

alpha | float | SumAlpha parameter: sum over topics of smoothing over doc-topic distributions. alpha_k = [this value] / [num topics] | 5.0

beta | float | Beta parameter: smoothing parameter for each topic-word. beta_w = [this value] | 0.01

## Future Improvements

 * Provide interface to move away from file reading (e.g. no extra stopwords file)

 * Better error handling, especially checking for bad inputs

 * Classification
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/abeusher/malletwrapper

Awesome Lists containing this project

README