# Taming Pretrained Transformers for XMC problems

This is a README for the experimental code of the following paper
>[Taming Pretrained Transformers for eXtreme Multi-label Text Classification](https://arxiv.org/abs/1905.02331)

>Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, Inderjit Dhillon

>KDD 2020

## Updates (2021-04-27)
The latest implementation of X-Transformer (faster training with stronger performance) is available at [PECOS](https://github.com/amzn/pecos); feel free to try it out!

## Installation

### Dependencies via Conda Environment

```bash
conda env create -f environment.yml
source activate pt1.2_xmlc_transformer
pip install -e .
python setup.py install --force
```


**Notice:** the following examples are executed under the ```(pt1.2_xmlc_transformer)``` conda virtual environment.
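
A quick sanity check after installation (a minimal sketch; it assumes the editable install above makes the ```xbert``` package importable):

```python
# Verify that the conda environment and the xbert package are set up correctly.
import torch    # the environment pins PyTorch 1.2
import xbert    # installed by `pip install -e .` above

print(torch.__version__)
```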

## Reproduce Evaluation Results in the Paper
We demonstrate how to reproduce the evaluation results in our paper
by downloading the raw dataset and pretrained models.

### Download Dataset (Eurlex-4K, Wiki10-31K, AmazonCat-13K, Wiki-500K)
Change directory into the ```./datasets``` folder, then download and unzip each dataset:

```bash
cd ./datasets
bash download-data.sh Eurlex-4K
bash download-data.sh Wiki10-31K
bash download-data.sh AmazonCat-13K
bash download-data.sh Wiki-500K
cd ../
```

Each dataset contains the following files (see the loading sketch after this list):
- ```label_map.txt```: each line is the raw text of the label
- ```train_raw_text.txt, test_raw_text.txt```: each line is the raw text of the instance
- ```X.trn.npz, X.tst.npz```: instance's embedding matrix (either sparse TF-IDF or fine-tuned dense embedding)
- ```Y.trn.npz, Y.tst.npz```: instance-to-label assignment matrix
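
After downloading, the files above can be loaded and inspected as follows (a minimal sketch, using Eurlex-4K as an example):

```python
# Load and inspect one downloaded dataset (Eurlex-4K as an example).
import scipy.sparse as smat

X_trn = smat.load_npz("datasets/Eurlex-4K/X.trn.npz")  # instance features, (N_trn, D)
Y_trn = smat.load_npz("datasets/Eurlex-4K/Y.trn.npz")  # instance-to-label matrix, (N_trn, L)
with open("datasets/Eurlex-4K/label_map.txt") as f:
    labels = f.read().splitlines()

print(X_trn.shape, Y_trn.shape, len(labels))  # the label dimension L should equal len(labels)
```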

### Download Pretrained Models (processed data, Indexing codes, fine-tuned Transformer models)
Change directory into the ```./pretrained_models``` folder, then download and unzip the models for each dataset:

```bash
cd ./pretrained_models
bash download-models.sh Eurlex-4K
bash download-models.sh Wiki10-31K
bash download-models.sh AmazonCat-13K
bash download-models.sh Wiki-500K
cd ../
```
Each folder has the following structure:
- ```proc_data```: a sub-folder containing: X.{trn|tst}.{model}.128.pkl, C.{label-emb}.npz, L.{label-emb}.npz
- ```pifa-tfidf-s0```: a sub-folder containing indexer and matcher
- ```pifa-neural-s0```: a sub-folder containing indexer and matcher
- ```text-emb-s0```: a sub-folder containing indexer and matcher

### Evaluate Linear Models
Given the provided indexing codes (label-to-cluster assignments), train/predict linear models, and evaluate with Precision/Recall@k:

```bash
bash eval_linear.sh ${DATASET} ${VERSION}
```

- ```DATASET```: the dataset name such as Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K.
- ```VERSION```: v0=sparse TF-IDF features. v1=sparse TF-IDF features concatenated with dense fine-tuned XLNet embeddings.

The evaluation results should be located at
```./results_linear/${DATASET}.${VERSION}.txt```

### Evaluate Fine-tuned X-Transformer Models
Given the provided indexing codes (label-to-cluster assignments) and the fine-tuned Transformer models, train/predict the ranker of the X-Transformer framework, and evaluate with Precision/Recall@k:

```bash
bash eval_transformer.sh ${DATASET}
```

- ```DATASET```: the dataset name such as Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K.

The evaluation results should be located at
```./results_transformer/${DATASET}.final.txt```

## Running X-Transformer on customized datasets
The X-Transformer framework consists of 9 configurations (3 label embeddings times 3 model types).
For simplicity, we show 1 of the 9 here, using ```LABEL_EMB=pifa-tfidf``` and ```MODEL_TYPE=bert```.

We will use Eurlex-4K as an example. In the ./datasets/Eurlex-4K folder, we assume the following files are provided (a sketch for preparing them follows the list):

- ```X.trn.npz```: the instance TF-IDF feature matrix for the train set. The data type is scipy.sparse.csr_matrix of size (N_trn, D_tfidf), where N_trn is the number of train instances and D_tfidf is the number of features.
- ```X.tst.npz```: the instance TF-IDF feature matrix for the test set. The data type is scipy.sparse.csr_matrix of size (N_tst, D_tfidf), where N_tst is the number of test instances and D_tfidf is the number of features.
- ```Y.trn.npz```: the instance-to-label matrix for the train set. The data type is scipy.sparse.csr_matrix of size (N_trn, L), where N_trn is the number of train instances and L is the number of labels.
- ```Y.tst.npz```: the instance-to-label matrix for the test set. The data type is scipy.sparse.csr_matrix of size (N_tst, L), where N_tst is the number of test instances and L is the number of labels.
- ```train_raw_texts.txt```: The raw text of the train set.
- ```test_raw_texts.txt```: The raw text of the test set.
- ```label_map.txt```: the label's text description.
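
To make the expected formats concrete, here is a minimal, hypothetical sketch that builds toy versions of these files with numpy/scipy (in practice the TF-IDF features would come from a vectorizer fitted on your own corpus, e.g. scikit-learn's ```TfidfVectorizer```):

```python
# Toy sketch of the on-disk formats expected for a customized dataset.
import numpy as np
import scipy.sparse as smat
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["regulation on fishing quotas", "directive on data protection"]
label_texts = ["fisheries", "data protection", "environment"]

# X.trn.npz: sparse TF-IDF feature matrix of shape (N_trn, D_tfidf)
X_trn = TfidfVectorizer().fit_transform(train_texts)
smat.save_npz("X.trn.npz", smat.csr_matrix(X_trn, dtype=np.float32))

# Y.trn.npz: sparse instance-to-label matrix of shape (N_trn, L)
Y_trn = smat.csr_matrix(np.array([[1, 0, 0],
                                  [0, 1, 0]], dtype=np.float32))
smat.save_npz("Y.trn.npz", Y_trn)

# train_raw_texts.txt and label_map.txt: one entry per line
with open("train_raw_texts.txt", "w") as f:
    f.write("\n".join(train_texts) + "\n")
with open("label_map.txt", "w") as f:
    f.write("\n".join(label_texts) + "\n")
```

The test-side files (```X.tst.npz```, ```Y.tst.npz```, ```test_raw_texts.txt```) follow the same layout; reuse the vectorizer fitted on the training texts when building ```X.tst.npz```.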

Given those input files, the pipeline can be divided into three stages: Indexer, Matcher, and Ranker.

### Indexer
In stage 1, we will do the following
- (1) construct label embedding
- (2) perform hierarchical 2-means and output the instance-to-cluster assignment matrix
- (3) preprocess the input and output for training Transformer models.

**TLDR**: we combine and summarize (1), (2), (3) into two scripts: ```run_preprocess_label.sh``` and ```run_preprocess_feat.sh```. See the more detailed explanation below.

(1) To construct label embedding,
```bash
OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
mkdir -p ${PROC_DATA_DIR}
python -m xbert.preprocess \
--do_label_embedding \
-i ${DATA_DIR} \
-o ${PROC_DATA_DIR} \
-l ${LABEL_EMB} \
-x ${LABEL_EMB_INST_PATH}
```

- ```DATA_DIR```: ./datasets/Eurlex-4K
- ```PROC_DATA_DIR```: ./save_models/Eurlex-4K/proc_data
- ```LABEL_EMB```: pifa-tfidf (you can also try text-emb or pifa-neural if you have fine-tuned instance embeddings)
- ```LABEL_EMB_INST_PATH```: ./datasets/Eurlex-4K/X.trn.npz

This should yield ```L.${LABEL_EMB}.npz``` in the ```PROC_DATA_DIR```.
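
For intuition, the ```pifa-tfidf``` label embedding (Positive Instance Feature Aggregation) roughly amounts to summing and l2-normalizing the TF-IDF features of each label's positive training instances. A minimal sketch of that idea (not the repo's ```xbert.preprocess``` implementation), using the Eurlex-4K layout above:

```python
# PIFA-style label embeddings: aggregate each label's positive instances'
# TF-IDF rows, then l2-normalize. Illustrative sketch only.
import scipy.sparse as smat
from sklearn.preprocessing import normalize

X = smat.load_npz("datasets/Eurlex-4K/X.trn.npz")   # (N_trn, D_tfidf)
Y = smat.load_npz("datasets/Eurlex-4K/Y.trn.npz")   # (N_trn, L)

# Y.T @ X sums the feature vectors of the positive instances of every label.
L_emb = normalize(Y.T.dot(X), norm="l2", axis=1)    # (L, D_tfidf)
smat.save_npz("L.pifa-tfidf.sketch.npz", smat.csr_matrix(L_emb))
```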

(2) To perform hierarchical 2-means,
```bash
SEED_LIST=( 0 1 2 )
for SEED in "${SEED_LIST[@]}"; do
    LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
    INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
    python -u -m xbert.indexer \
        -i ${PROC_DATA_DIR}/L.${LABEL_EMB}.npz \
        -o ${INDEXER_DIR} --seed ${SEED}
done
```
This should yield ```code.npz``` in the ```INDEXER_DIR```.
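
Conceptually, ```xbert.indexer``` partitions the labels by recursively bisecting the label embeddings with 2-means. The sketch below illustrates that idea with scikit-learn; it is not the repo's implementation and omits the balanced-split constraint:

```python
# Illustrative hierarchical 2-means over label embeddings (unbalanced version).
import numpy as np
import scipy.sparse as smat
from sklearn.cluster import KMeans

def hierarchical_2means(emb, depth, seed=0):
    """Recursively bisect the rows of `emb`; return one cluster id per row."""
    n = emb.shape[0]
    if depth == 0 or n <= 1:
        return np.zeros(n, dtype=int)
    split = KMeans(n_clusters=2, random_state=seed).fit_predict(emb)
    idx_l, idx_r = np.where(split == 0)[0], np.where(split == 1)[0]
    left = hierarchical_2means(emb[idx_l], depth - 1, seed)
    right = hierarchical_2means(emb[idx_r], depth - 1, seed)
    codes = np.empty(n, dtype=int)
    codes[idx_l] = left
    codes[idx_r] = right + (left.max() + 1 if left.size else 0)
    return codes

L_emb = smat.load_npz("save_models/Eurlex-4K/proc_data/L.pifa-tfidf.npz")
codes = hierarchical_2means(L_emb, depth=6)   # roughly 2**6 clusters
print(np.bincount(codes))                     # cluster sizes
```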

(3) To preprocess input and output for Transformer models,
```bash
SEED=0
LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
python -u -m xbert.preprocess \
--do_proc_label \
-i ${DATA_DIR} \
-o ${PROC_DATA_DIR} \
-l ${LABEL_EMB_NAME} \
-c ${INDEXER_DIR}/code.npz
```
This should yield the instance-to-cluster matrices ```C.trn.npz``` and ```C.tst.npz``` in the ```PROC_DATA_DIR```.

Next, preprocess the raw text into the input features expected by the Transformer models:

```bash
OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
python -u -m xbert.preprocess \
--do_proc_feat \
-i ${DATA_DIR} \
-o ${PROC_DATA_DIR} \
-m ${MODEL_TYPE} \
-n ${MODEL_NAME} \
--max_xseq_len ${MAX_XSEQ_LEN} \
|& tee ${PROC_DATA_DIR}/log.${MODEL_TYPE}.${MAX_XSEQ_LEN}.txt
```
- ```MODEL_TYPE```: bert (or roberta, xlnet)
- ```MODEL_NAME```: bert-large-cased-whole-word-masking (or roberta-large, xlnet-large-cased)
- ```MAX_XSEQ_LEN```: maximum number of tokens; we set it to 128

This should yield ```X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pt``` and ```X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pt``` in the ```PROC_DATA_DIR```.
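
For reference, these files essentially contain the tokenized, truncated, and padded text. A conceptual sketch with the modern Hugging Face tokenizer API is below (the repo itself pins transformers v2.2.0, whose tokenizer API differs):

```python
# Conceptual view of --do_proc_feat: token ids padded/truncated to MAX_XSEQ_LEN.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased-whole-word-masking")
with open("datasets/Eurlex-4K/train_raw_texts.txt") as f:
    lines = [line.rstrip("\n") for line in f]

enc = tokenizer(lines[:4], max_length=128, truncation=True,
                padding="max_length", return_tensors="pt")
print(enc["input_ids"].shape)  # torch.Size([4, 128])
```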

### Matcher
In stage 2, we will do the following
- (1) train deep Transformer models to map instances to the induced clusters
- (2) output the predicted cluster scores and the fine-tuned instance embeddings

**TLDR**: ```run_transformer_train.sh```. See the more detailed explanation below.

(1) Assume we have 8 Nvidia V100 GPUs. To train the models,
```bash
MODEL_DIR=${OUTPUT_DIR}/${INDEXER_NAME}/matcher/${MODEL_NAME}
mkdir -p ${MODEL_DIR}
```
```bash
python -m torch.distributed.launch \
--nproc_per_node 8 xbert/transformer.py \
-m ${MODEL_TYPE} -n ${MODEL_NAME} --do_train \
-x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
-c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
-o ${MODEL_DIR} --overwrite_output_dir \
--per_device_train_batch_size ${PER_DEVICE_TRN_BSZ} \
--gradient_accumulation_steps ${GRAD_ACCU_STEPS} \
--max_steps ${MAX_STEPS} \
--warmup_steps ${WARMUP_STEPS} \
--learning_rate ${LEARNING_RATE} \
--logging_steps ${LOGGING_STEPS} \
|& tee ${MODEL_DIR}/log.txt
```
- ```MODEL_TYPE```: bert (or roberta, xlnet)
- ```MODEL_NAME```: bert-large-cased-whole-word-masking (or roberta-large, xlnet-large-cased)
- ```PER_DEVICE_TRN_BSZ```: 16 if using Nvidia V100 (or set to 8 if using Nvidia 2080Ti)
- ```GRAD_ACCU_STEPS```: 2 if using Nvidia V100 (or set to 4 if using Nvidia 2080Ti); see the effective batch size note after this list
- ```MAX_STEPS```: set to 1,000 for Eurlex-4K; adjust depending on your dataset
- ```WARMUP_STEPS```: set to 100 for Eurlex-4K; adjust depending on your dataset
- ```LEARNING_RATE```: set to 5e-5 for Eurlex-4K; adjust depending on your dataset
- ```LOGGING_STEPS```: set to 100
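
With the V100 settings above, the effective training batch size is ```nproc_per_node``` × ```PER_DEVICE_TRN_BSZ``` × ```GRAD_ACCU_STEPS``` = 8 × 16 × 2 = 256; the 2080Ti settings (8 × 8 × 4) keep the same effective batch size of 256.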

(2) To generate predictions and instance embeddings,
```bash
GPID=0,1,2,3,4,5,6,7
PER_DEVICE_VAL_BSZ=32
```
```bash
CUDA_VISIBLE_DEVICES=${GPID} python -u xbert/transformer.py \
-m ${MODEL_TYPE} -n ${MODEL_NAME} \
--do_eval -o ${MODEL_DIR} \
-x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
-c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
-x_tst ${PROC_DATA_DIR}/X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
-c_tst ${PROC_DATA_DIR}/C.tst.${INDEXER_NAME}.npz \
--per_device_eval_batch_size ${PER_DEVICE_VAL_BSZ}
```
This should yield the following output in the ```MODEL_DIR``` (a quick loading check follows the list):
- ```C_trn_pred.npz``` and ```C_tst_pred.npz```: model-predicted cluster scores
- ```trn_embeddings.npy``` and ```tst_embeddings.npy```: fine-tuned instance embeddings
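
A quick way to sanity-check these outputs (a minimal sketch; the paths follow the Eurlex-4K / ```pifa-tfidf-s0``` / ```bert-large-cased-whole-word-masking``` example used elsewhere in this README):

```python
# Inspect the matcher outputs: predicted cluster scores and fine-tuned embeddings.
import numpy as np
import scipy.sparse as smat

model_dir = ("save_models/Eurlex-4K/pifa-tfidf-s0/matcher/"
             "bert-large-cased-whole-word-masking")
C_tst_pred = smat.load_npz(f"{model_dir}/C_tst_pred.npz")  # (N_tst, num_clusters)
tst_emb = np.load(f"{model_dir}/tst_embeddings.npy")       # (N_tst, hidden_dim)
print(C_tst_pred.shape, tst_emb.shape)
```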

### Ranker
In stage 3, we will do the following
- (1) train linear rankers to map instances and predicted cluster scores to label scores
- (2) output top-k predicted labels

**TLDR**: ```run_transformer_predict.sh```. See the more detailed explanation below.

(1) To train linear rankers,
```bash
LABEL_NAME=pifa-tfidf-s0
MODEL_NAME=bert-large-cased-whole-word-masking
OUTPUT_DIR=save_models/${DATASET}/${LABEL_NAME}
INDEXER_DIR=${OUTPUT_DIR}/indexer
MATCHER_DIR=${OUTPUT_DIR}/matcher/${MODEL_NAME}
RANKER_DIR=${OUTPUT_DIR}/ranker/${MODEL_NAME}
mkdir -p ${RANKER_DIR}
```
```bash
python -m xbert.ranker train \
-x1 ${DATA_DIR}/X.trn.npz \
-x2 ${MATCHER_DIR}/trn_embeddings.npy \
-y ${DATA_DIR}/Y.trn.npz \
-z ${MATCHER_DIR}/C_trn_pred.npz \
-c ${INDEXER_DIR}/code.npz \
-o ${RANKER_DIR} -t 0.01 \
-f 0 --mode ranker
```

(2) To predict the final top-k labels,
```bash
PRED_NPZ_PATH=${RANKER_DIR}/tst.pred.npz
python -m xbert.ranker predict \
-m ${RANKER_DIR} -o ${PRED_NPZ_PATH} \
-x1 ${DATA_DIR}/X.tst.npz \
-x2 ${MATCHER_DIR}/tst_embeddings.npy \
-y ${DATA_DIR}/Y.tst.npz \
-z ${MATCHER_DIR}/C_tst_pred.npz \
-f 0 -t noop
```

This should yield the predicted top-k labels in ```tst.pred.npz```, at the path specified by ```PRED_NPZ_PATH```.
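
If you want to score the predictions yourself, here is a minimal Precision@k sketch. It assumes ```tst.pred.npz``` stores a sparse (N_tst × L) score matrix, which is an assumption about the output format rather than something documented here:

```python
# Minimal Precision@k over the predicted label scores and the ground truth.
import numpy as np
import scipy.sparse as smat

def precision_at_k(pred, truth, k=5):
    pred, truth = smat.csr_matrix(pred), smat.csr_matrix(truth)
    hits = 0
    for i in range(pred.shape[0]):
        row = pred.getrow(i)
        topk = row.indices[np.argsort(row.data)[::-1][:k]]  # top-k scored labels
        hits += np.isin(topk, truth.getrow(i).indices).sum()
    return hits / (k * pred.shape[0])

pred = smat.load_npz("save_models/Eurlex-4K/pifa-tfidf-s0/ranker/"
                     "bert-large-cased-whole-word-masking/tst.pred.npz")
truth = smat.load_npz("datasets/Eurlex-4K/Y.tst.npz")
print("P@5:", precision_at_k(pred, truth))
```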

## Acknowledgements

Some portions of this repo are borrowed from the following repos:
- [transformers(v2.2.0)](https://github.com/huggingface/transformers)
- [liblinear](https://github.com/cjlin1/liblinear)
- [TRMF](https://github.com/rofuyu/exp-trmf-nips16)