https://github.com/castorini/castor

PyTorch deep learning models for text processing

Host: GitHub
URL: https://github.com/castorini/castor
Owner: castorini
License: apache-2.0
Created: 2017-03-22T14:46:05.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2019-04-08T00:07:28.000Z (over 7 years ago)
Last Synced: 2025-03-02T13:11:20.343Z (over 1 year ago)
Topics: deep-learning
Language: Python
Homepage: http://castor.ai/
Size: 1.12 MB
Stars: 176
Watchers: 19
Forks: 56
Open Issues: 28
Metadata Files:
- Readme: README.md
- License: LICENSE

# Castor

This is the common repo for deep learning models implemented in PyTorch by the Data Systems Group at the University of Waterloo.

## Models

### Predictions Over One Input Text Sequence

These models have moved to https://github.com/castorini/hedwig

### Predictions Over Two Input Text Sequences

For paraphrase detection, question answering, etc.

+ [SM-CNN](./sm_cnn/): Siamese CNN for ranking texts [(Severyn and Moschitti, SIGIR 2015)](https://dl.acm.org/citation.cfm?id=2767738)

+ [MP-CNN](./mp_cnn/): Multi-Perspective CNN [(He et al., EMNLP 2015)](http://anthology.aclweb.org/D/D15/D15-1181.pdf)

+ [NCE](./nce/): Noise-Contrastive Estimation for answer selection applied on SM-CNN and MP-CNN [(Rao et al., CIKM 2016)](https://dl.acm.org/citation.cfm?id=2983872)

+ [VDPWI](./vdpwi): Very-Deep Pairwise Word Interaction NNs for modeling textual similarity [(He and Lin, NAACL 2016)](http://www.aclweb.org/anthology/N16-1108)

+ [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers

Each model directory has a `README.md` with further details.
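
To make the IDF baseline concrete, here is a minimal sketch of IDF-weighted term overlap in plain Python. It is not the code in `idf_baseline/`: the whitespace tokenization, the `log(N / df)` weighting, and the function names `compute_idf` and `idf_overlap` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def compute_idf(documents):
    """Map each term to log(N / df), where df counts the documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def idf_overlap(question, answer, idf):
    """Sum the IDF weights of the terms shared by the question and the answer."""
    shared = set(question.lower().split()) & set(answer.lower().split())
    return sum(idf.get(term, 0.0) for term in shared)


# Toy candidate answers, made up for this example.
candidates = [
    "the capital of france is paris",
    "the weather is nice today",
    "france borders spain and italy",
]
idf = compute_idf(candidates)
question = "what is the capital of france"

# Rank candidates by descending overlap score.
ranked = sorted(candidates, key=lambda a: idf_overlap(question, a, idf), reverse=True)
```

In this toy example `ranked[0]` is the first candidate, because it shares the rarer terms `capital` and `of` with the question in addition to the common ones.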

## Setting up PyTorch

**If you are an internal Castor contributor using GPU machines in the lab, follow the instructions [here](./docs/internal-instructions.md).**

Castor is designed for Python 3.6 and [PyTorch](https://pytorch.org/) 0.4.

PyTorch recommends [Anaconda](https://www.anaconda.com/distribution/) for managing your environment.

We recommend creating a custom environment as follows:

```
$ conda create --name castor python=3.6
$ source activate castor
```

Then install PyTorch as follows:

```
$ conda install pytorch torchvision -c pytorch
```

Other Python packages we use can be installed via pip:

```
$ pip install -r requirements.txt
```
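
Since Castor targets Python 3.6 and PyTorch 0.4, it can help to confirm what the environment actually provides once installation finishes. The following is a sketch only: the helper names `major_minor`, `same_release`, and `report` are hypothetical and not part of Castor, and the check compares just the `major.minor` part of each version.

```python
import sys

TARGET_PYTHON = "3.6"  # version named in this README
TARGET_TORCH = "0.4"   # version named in this README


def major_minor(version):
    """Return the leading (major, minor) pair of a dotted version string."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def same_release(found, target):
    """True when both versions share the same major.minor release."""
    return major_minor(found) == major_minor(target)


def report():
    """Print the installed Python and PyTorch versions next to the targets."""
    python_found = "{}.{}.{}".format(*sys.version_info[:3])
    print("Python", python_found, "(target {})".format(TARGET_PYTHON),
          "ok" if same_release(python_found, TARGET_PYTHON) else "differs")
    try:
        import torch  # imported lazily so the check still runs without PyTorch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("PyTorch", torch.__version__, "(target {})".format(TARGET_TORCH),
          "ok" if same_release(torch.__version__, TARGET_TORCH) else "differs")


if __name__ == "__main__":
    report()
```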

The code depends on NLTK data (e.g., stopwords), which you'll have to download. Run the Python interpreter and type the commands:

```python
>>> import nltk
>>> nltk.download()
```
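
Called without arguments, `nltk.download()` opens NLTK's interactive downloader, which can be awkward on a remote machine. As a sketch, the named resource can instead be fetched non-interactively. `stopwords` is the only resource this README names, so the `RESOURCES` table below is an assumption and may need extending. The helper `ensure_resources` and its `find`/`download` parameters are hypothetical names that default to `nltk.data.find` and `nltk.download`.

```python
# Maps the lookup path used by nltk.data.find to the package name used by nltk.download.
# Assumption: only "stopwords" is listed because it is the only resource the README names.
RESOURCES = {"corpora/stopwords": "stopwords"}


def ensure_resources(resources=RESOURCES, find=None, download=None):
    """Download each missing NLTK resource and return the packages that were fetched."""
    if find is None or download is None:
        import nltk  # imported lazily; only needed for the default callables
        find = find or nltk.data.find
        download = download or nltk.download
    fetched = []
    for path, package in resources.items():
        try:
            find(path)  # raises LookupError when the resource is absent
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched
```

Calling `ensure_resources()` inside the `castor` environment fetches `stopwords` only if it is not already present.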

Finally, run the following inside the `utils` directory to build the `trec_eval` tool for evaluating certain datasets.

```bash
$ ./get_trec_eval.sh
```

## Data and Pre-Trained Models

**If you are an internal Castor contributor using GPU machines in the lab, follow the instructions [here](./docs/internal-instructions.md).**

To take full advantage of the code here, clone these two other repos:

+ [`Castor-data`](https://git.uwaterloo.ca/jimmylin/Castor-data): embeddings, datasets, etc.

+ [`Castor-models`](https://git.uwaterloo.ca/jimmylin/Castor-models): pre-trained models

Organize your directory structure as follows:

```
.
├── Castor
├── Castor-data
└── Castor-models
```

For example (using HTTPS):

```bash
$ git clone https://github.com/castorini/Castor.git
$ git clone https://git.uwaterloo.ca/jimmylin/Castor-data.git
$ git clone https://git.uwaterloo.ca/jimmylin/Castor-models.git
```
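
Once the clones finish, a short script can confirm that the three repositories sit side by side as shown in the directory layout. This is a sketch, not part of Castor: `missing_repos` is a hypothetical helper, and it assumes the clone directories kept their default names.

```python
from pathlib import Path

# Directory names taken from the layout shown in this README.
EXPECTED_DIRS = ("Castor", "Castor-data", "Castor-models")


def missing_repos(workspace):
    """Return the expected repo directories that are absent under workspace."""
    root = Path(workspace)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the parent directory that holds the clones.
    missing = missing_repos(".")
    print("missing:", ", ".join(missing) if missing else "none")
```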

After cloning the Castor-data repo, you need to unzip embeddings and run data pre-processing scripts. You can choose to follow instructions under each dataset and embedding directory separately, or just run the following script in Castor-data to do all of the steps for you:

```bash
$ ./setup.sh
```
