https://github.com/castorini/castor
PyTorch deep learning models for text processing
- Host: GitHub
- URL: https://github.com/castorini/castor
- Owner: castorini
- License: apache-2.0
- Created: 2017-03-22T14:46:05.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2019-04-08T00:07:28.000Z (about 7 years ago)
- Last Synced: 2025-03-02T13:11:20.343Z (over 1 year ago)
- Topics: deep-learning
- Language: Python
- Homepage: http://castor.ai/
- Size: 1.12 MB
- Stars: 176
- Watchers: 19
- Forks: 56
- Open Issues: 28
Metadata Files:
- Readme: README.md
- License: LICENSE
# Castor
This is the common repo for deep learning models implemented in PyTorch by the Data Systems Group at the University of Waterloo.
## Models
### Predictions Over One Input Text Sequence
Moved to https://github.com/castorini/hedwig
### Predictions Over Two Input Text Sequences
For paraphrase detection, question answering, etc.
+ [SM-CNN](./sm_cnn/): Siamese CNN for ranking texts [(Severyn and Moschitti, SIGIR 2015)](https://dl.acm.org/citation.cfm?id=2767738)
+ [MP-CNN](./mp_cnn/): Multi-Perspective CNN [(He et al., EMNLP 2015)](http://anthology.aclweb.org/D/D15/D15-1181.pdf)
+ [NCE](./nce/): Noise-Contrastive Estimation for answer selection applied on SM-CNN and MP-CNN [(Rao et al., CIKM 2016)](https://dl.acm.org/citation.cfm?id=2983872)
+ [VDPWI](./vdpwi): Very-Deep Pairwise Word Interaction NNs for modeling textual similarity [(He and Lin, NAACL 2016)](http://www.aclweb.org/anthology/N16-1108)
+ [IDF Baseline](./idf_baseline/): IDF overlap between question and candidate answers
Each model directory has a `README.md` with further details.
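
As a rough illustration of what "IDF overlap" means for the baseline above, here is a minimal sketch. It is not the code in `idf_baseline/`: the function names, the `log(N / df)` weighting, and the choice to compute IDF over the candidate answers themselves are all assumptions made for the example.

```python
import math
from collections import Counter


def idf_table(documents):
    """Map each term to log(N / df), where df is the number of documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def idf_overlap(question, answer, idf):
    """Sum the IDF weights of the distinct terms shared by question and answer."""
    shared = set(question) & set(answer)
    return sum(idf.get(term, 0.0) for term in shared)


def rank_answers(question, answers):
    """Return candidate indices sorted from highest to lowest overlap score."""
    idf = idf_table(answers)  # assumption: IDF is estimated from the candidates
    scores = [idf_overlap(question, answer, idf) for answer in answers]
    return sorted(range(len(answers)), key=lambda i: scores[i], reverse=True)
```

Inputs are pre-tokenized lists of strings. Rare shared terms contribute more to the score than common ones, which is the whole idea behind the baseline.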
## Setting up PyTorch
**If you are an internal Castor contributor using GPU machines in the lab, follow the instructions [here](./docs/internal-instructions.md).**
Castor is designed for Python 3.6 and [PyTorch](https://pytorch.org/) 0.4.
PyTorch recommends [Anaconda](https://www.anaconda.com/distribution/) for managing your environment.
We'd recommend creating a custom environment as follows:
```bash
$ conda create --name castor python=3.6
$ source activate castor
```
Then install PyTorch and torchvision:
```bash
$ conda install pytorch torchvision -c pytorch
```
Other Python packages we use can be installed via pip:
```bash
$ pip install -r requirements.txt
```
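
Since Castor targets Python 3.6 and PyTorch 0.4, it can help to confirm what actually got installed before running a model. The following is a minimal sketch of such a check; `version_prefix_matches` and `check_environment` are hypothetical helpers written for this example, not part of Castor.

```python
import importlib
import sys


def version_prefix_matches(version, expected):
    """Return True when `version` starts with the dotted components of `expected`."""
    wanted = expected.split(".")
    return version.split(".")[: len(wanted)] == wanted


def check_environment(python_expected="3.6", torch_expected="0.4"):
    """Collect human-readable warnings about version mismatches (empty list means OK)."""
    warnings = []
    python_version = "{}.{}.{}".format(*sys.version_info[:3])
    if not version_prefix_matches(python_version, python_expected):
        warnings.append(
            "Python {} found, {} expected".format(python_version, python_expected)
        )
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        warnings.append("PyTorch is not installed")
    else:
        if not version_prefix_matches(torch.__version__, torch_expected):
            warnings.append(
                "PyTorch {} found, {} expected".format(torch.__version__, torch_expected)
            )
    return warnings
```

Running `print(check_environment())` inside the activated `castor` environment should print an empty list when both versions match.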
The code depends on NLTK data (e.g., stopwords), which you'll have to download. Run the Python interpreter and type the commands:
```python
>>> import nltk
>>> nltk.download()
```
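
Calling `nltk.download()` with no arguments opens an interactive downloader. On a headless machine, a non-interactive variant may be more convenient. The sketch below fetches a single package only if it is missing; `ensure_nltk_resource` is a hypothetical helper, and `stopwords` is used because it is the only resource named above. Other NLTK packages may also be needed.

```python
def ensure_nltk_resource(resource_path="corpora/stopwords", package="stopwords"):
    """Download an NLTK package only if its resource is not already installed.

    Returns True if the package was downloaded, False if it was already present.
    """
    import nltk  # imported lazily so this file loads even before NLTK is installed

    try:
        nltk.data.find(resource_path)  # raises LookupError when the resource is absent
        return False
    except LookupError:
        if not nltk.download(package, quiet=True):
            raise RuntimeError("could not download NLTK package: " + package)
        return True
```

Calling `ensure_nltk_resource()` once after installing the requirements is enough for the stopwords list; pass a different `resource_path` and `package` for other resources.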
Finally, run the following inside the `utils` directory to build the `trec_eval` tool for evaluating certain datasets.
```bash
$ ./get_trec_eval.sh
```
## Data and Pre-Trained Models
**If you are an internal Castor contributor using GPU machines in the lab, follow the instructions [here](./docs/internal-instructions.md).**
To take full advantage of the code here, clone these two other repos:
+ [`Castor-data`](https://git.uwaterloo.ca/jimmylin/Castor-data): embeddings, datasets, etc.
+ [`Castor-models`](https://git.uwaterloo.ca/jimmylin/Castor-models): pre-trained models
Organize your directory structure as follows:
```
.
├── Castor
├── Castor-data
└── Castor-models
```
For example (using HTTPS):
```bash
$ git clone https://github.com/castorini/Castor.git
$ git clone https://git.uwaterloo.ca/jimmylin/Castor-data.git
$ git clone https://git.uwaterloo.ca/jimmylin/Castor-models.git
```
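
To confirm that the three repos ended up side by side as shown in the directory tree above, a small check along these lines may be useful. This is only a sketch: `missing_repos` is a hypothetical helper, and it assumes you pass the parent directory that should contain all three clones.

```python
from pathlib import Path

# Directory names taken from the layout shown above.
EXPECTED_REPOS = ("Castor", "Castor-data", "Castor-models")


def missing_repos(workspace, expected=EXPECTED_REPOS):
    """Return the names of expected repository directories absent from `workspace`."""
    root = Path(workspace)
    return [name for name in expected if not (root / name).is_dir()]
```

For example, `missing_repos("..")` run from inside `Castor` returns an empty list when the sibling repos are in place, and otherwise lists the directories that still need to be cloned.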
After cloning the Castor-data repo, you need to unzip the embeddings and run the data pre-processing scripts. You can either follow the instructions in each dataset and embedding directory separately, or run the following script in Castor-data to do all of the steps for you:
```bash
$ ./setup.sh
```