Code for Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention (AAAI18)
https://github.com/thunlp/auto_cliwc


Host: GitHub
URL: https://github.com/thunlp/auto_cliwc
Owner: thunlp
License: mit
Created: 2017-11-13T06:17:54.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2018-03-06T05:12:44.000Z (almost 7 years ago)
Last Synced: 2024-05-21T04:10:01.938Z (8 months ago)
Topics: sememe
Language: Python
Homepage:
Size: 6.32 MB
Stars: 141
Watchers: 9
Forks: 44
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE


# Auto LIWC
The code for **Chinese LIWC Lexicon Expansion via Hierarchical Classification
of Word Embeddings with Sememe Attention** (AAAI18).

## Datasets
The `datasets` folder contains two datasets.

1. `HowNet.txt` is a Chinese knowledge base with annotated word-sense-sememe information.
2. `sc_liwc.dic` is the Chinese LIWC lexicon. This is a revised version of
the original C-LIWC file. The original contains part-of-speech (POS)
categories such as _verb_, _adverb_, and _auxverb_. We believe it is
more accurate to obtain POS information from a POS tagging program when
analyzing a given text, so we removed the POS categories in our experiments.
In addition, the hierarchical structure of the original C-LIWC differs slightly
from that of the English LIWC, so we adjusted it to follow the English LIWC.
For the exact meaning of each category, see the [C-LIWC website](https://cliwc.weebly.com/318672103521517312162354529031214503353920363.html)
and the [LIWC dictionary comparison page](https://liwc.wpengine.com/compare-dictionaries/).
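
To explore the lexicon, a small reader can help. The sketch below assumes that `sc_liwc.dic` follows the conventional LIWC `.dic` layout: a category header between two `%` lines, followed by one word per line with its category ids. That layout is not confirmed here, so check the file before relying on it. The function name `parse_liwc_dic` and the sample entries are hypothetical.

```python
def parse_liwc_dic(lines):
    """Parse LIWC-style .dic lines into (categories, lexicon).

    Assumed layout (not verified against sc_liwc.dic):
        %
        <category id> <category name>
        %
        <word> <category id> [<category id> ...]
    """
    categories = {}  # category id -> category name
    lexicon = {}     # word -> list of category ids
    percent_seen = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("%"):
            percent_seen += 1
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # skip lines that carry no id or no category
        if percent_seen == 1:
            categories[parts[0]] = parts[1]
        elif percent_seen >= 2:
            lexicon[parts[0]] = parts[1:]
    return categories, lexicon


# Made-up sample entries, for illustration only.
sample = ["%", "1 funct", "2 pronoun", "%", "我 1 2", "的 1"]
categories, lexicon = parse_liwc_dic(sample)
```

To read the real file, pass an open file object instead of the sample list, for example `open("datasets/sc_liwc.dic", encoding="utf-8")`. The encoding is also an assumption.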

Please note that the above dataset files are for academic and educational use **only**.
They are **not** for commercial use. If you have any questions, please contact us
before downloading the datasets.

Because the embedding file is too large to distribute, we release only the code for
training the word embeddings. See `word2vec.py` for details.

## Run
Run the following command for training and testing:

`python3 train_liwc.py`

If the datasets are in a different folder, please change the path
[here](https://github.com/thunlp/Auto_CLIWC/blob/master/train_liwc.py#L30).

The current code generates a different training and testing split on every run.
To reproduce the results in the paper, load `train.bin` and `test.bin`
from `bin_data` using `pickle`.
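
Loading the saved split could look like the minimal sketch below. The helper name `load_splits` is hypothetical, and the structure of the unpickled objects is not documented in this README, so inspect what comes back before wiring it into `train_liwc.py`.

```python
import pickle
from pathlib import Path


def load_splits(bin_dir="bin_data"):
    """Return the objects stored in train.bin and test.bin under bin_dir."""
    bin_dir = Path(bin_dir)
    splits = []
    for name in ("train.bin", "test.bin"):
        with open(bin_dir / name, "rb") as f:
            # pickle can execute arbitrary code: only load files you trust.
            splits.append(pickle.load(f))
    return tuple(splits)
```

Calling `train, test = load_splits()` from the repository root returns the two stored objects. Printing `type(train)` is a quick way to see how the data is laid out.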

## Dependencies

- TensorFlow == 1.4.0
- SciPy == 0.19.0
- NumPy == 1.13.1
- Scikit-learn == 0.18.1
- Gensim == 2.0.0

## Cite
If you use the code, please cite this paper:

_Xiangkai Zeng, Cheng Yang, Cunchao Tu, Zhiyuan Liu, Maosong Sun.
Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word
Embeddings with Sememe Attention. The 32nd AAAI Conference on Artificial
Intelligence (AAAI 2018)._