{"id":13585146,"url":"https://github.com/castorini/hedwig","last_synced_at":"2025-04-04T22:04:31.921Z","repository":{"id":37601728,"uuid":"174762036","full_name":"castorini/hedwig","owner":"castorini","description":"PyTorch deep learning models for document classification","archived":false,"fork":false,"pushed_at":"2023-07-21T16:23:08.000Z","size":24475,"stargazers_count":595,"open_issues_count":33,"forks_count":125,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-03-28T21:02:50.991Z","etag":null,"topics":["deep-learning","document-classification","pytorch"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/castorini.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-10T00:44:37.000Z","updated_at":"2025-03-17T09:21:57.000Z","dependencies_parsed_at":"2024-11-06T03:03:23.676Z","dependency_job_id":"80ef84bb-206b-4759-adb1-9c9c4080c4f5","html_url":"https://github.com/castorini/hedwig","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/castorini%2Fhedwig","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/castorini%2Fhedwig/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/castorini%2Fhedwig/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/castorini%2Fhedwig/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ca
storini","download_url":"https://codeload.github.com/castorini/hedwig/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247256110,"owners_count":20909240,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","document-classification","pytorch"],"created_at":"2024-08-01T15:04:45.888Z","updated_at":"2025-04-04T22:04:31.876Z","avatar_url":"https://github.com/castorini.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://github.com/karkaroff/hedwig/blob/bellatrix/docs/hedwig.png\" width=\"360\"\u003e\n\u003c/p\u003e\n\nThis repo contains PyTorch deep learning models for document classification, implemented by the Data Systems Group at the University of Waterloo.\n\n## Models\n\n+ [DocBERT](models/bert/): BERT for Document Classification [(Adhikari et al., 2019)](https://arxiv.org/abs/1904.08398v1)\n+ [Reg-LSTM](models/reg_lstm/): Regularized LSTM for document classification [(Adhikari et al., NAACL 2019)](https://cs.uwaterloo.ca/~jimmylin/publications/Adhikari_etal_NAACL2019.pdf)\n+ [XML-CNN](models/xml_cnn/): CNNs for extreme multi-label text classification [(Liu et al., SIGIR 2017)](http://nyc.lti.cs.cmu.edu/yiming/Publications/jliu-sigir17.pdf)\n+ [HAN](models/han/): Hierarchical Attention Networks [(Yang et al., NAACL 2016)](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)\n+ [Char-CNN](models/char_cnn/): Character-level Convolutional Network [(Zhang et al., 
NIPS 2015)](http://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf)\n+ [Kim CNN](models/kim_cnn/): CNNs for sentence classification [(Kim, EMNLP 2014)](http://www.aclweb.org/anthology/D14-1181)\n\nEach model directory has a `README.md` with further details.\n\n## Setting up PyTorch\n\nHedwig is designed for Python 3.6 and [PyTorch](https://pytorch.org/) 0.4.\nPyTorch recommends [Anaconda](https://www.anaconda.com/distribution/) for managing your environment.\nWe recommend creating a custom environment as follows:\n\n```\n$ conda create --name castor python=3.6\n$ source activate castor\n```\n\nThen install PyTorch as follows:\n\n```\n$ conda install pytorch=0.4.1 cuda92 -c pytorch\n```\n\nOther Python packages we use can be installed via pip:\n\n```\n$ pip install -r requirements.txt\n```\n\nThe code depends on NLTK data (e.g., stopwords), which you'll have to download.\nRun the Python interpreter and type the commands:\n\n```python\n\u003e\u003e\u003e import nltk\n\u003e\u003e\u003e nltk.download()\n```\n\n## Datasets\n\nThere are two ways to download the Reuters, AAPD, and IMDB datasets, along with word2vec embeddings:\n\nOption 1. Our [Wasabi](https://wasabi.com/)-hosted mirror:\n\n```bash\n$ wget http://nlp.rocks/hedwig -O hedwig-data.zip\n$ unzip hedwig-data.zip\n```\n\nOption 2. 
Our school-hosted repository, [`hedwig-data`](https://git.uwaterloo.ca/jimmylin/hedwig-data):\n\n```bash\n$ git clone https://github.com/castorini/hedwig.git\n$ git clone https://git.uwaterloo.ca/jimmylin/hedwig-data.git\n```\n\nNext, organize your directory structure as follows:\n\n```\n.\n├── hedwig\n└── hedwig-data\n```\n\nAfter cloning the hedwig-data repo, you need to unzip the embeddings and run the preprocessing script:\n\n```bash\ncd hedwig-data/embeddings/word2vec \ntar -xvzf GoogleNews-vectors-negative300.tgz\n```\n\n**If you are an internal Hedwig contributor using the machines in the lab, follow the instructions [here](docs/internal-instructions.md).**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcastorini%2Fhedwig","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcastorini%2Fhedwig","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcastorini%2Fhedwig/lists"}