{"id":13676309,"url":"https://github.com/malteos/aspect-document-embeddings","last_synced_at":"2025-12-30T05:15:36.866Z","repository":{"id":37439999,"uuid":"470583235","full_name":"malteos/aspect-document-embeddings","owner":"malteos","description":"Code, dataset \u0026 models for the paper Specialized Document Embeddings for Aspect-based Similarity of Research Papers (#JCDL2022)","archived":false,"fork":false,"pushed_at":"2022-05-30T07:40:13.000Z","size":279,"stargazers_count":11,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-11T18:40:57.777Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2203.14541","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/malteos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-03-16T12:56:46.000Z","updated_at":"2023-06-03T02:07:46.000Z","dependencies_parsed_at":"2022-08-18T20:30:43.615Z","dependency_job_id":null,"html_url":"https://github.com/malteos/aspect-document-embeddings","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/malteos%2Faspect-document-embeddings","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/malteos%2Faspect-document-embeddings/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/malteos%2Faspect-document-embeddings/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/malteos%2Faspect-document-embeddings/manifests","owner_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/owners/malteos","download_url":"https://codeload.github.com/malteos/aspect-document-embeddings/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251450656,"owners_count":21591407,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T13:00:22.457Z","updated_at":"2025-12-30T05:15:36.829Z","avatar_url":"https://github.com/malteos.png","language":"Jupyter Notebook","readme":"# Specialized Document Embeddings for Aspect-based Similarity of Research Papers\n\nThis repository contains the supplemental materials for the JCDL2022 paper **Specialized Document Embeddings for Aspect-based Similarity of Research Papers** \n[(PDF on ArXiv)](https://arxiv.org/abs/2203.14541).\nTrained models and datasets can be downloaded from [GitHub releases](https://github.com/malteos/aspect-document-embeddings/releases) \nand [🤗 Huggingface model hub](https://huggingface.co/malteos/aspect-scibert-task).\n\n## Demo\n\n[Try your own papers on 🤗 Huggingface spaces.](https://huggingface.co/spaces/malteos/aspect-based-paper-similarity)\n\n## How to use the pretrained models\n\nWe provide a SciBERT-based model for each of the three aspects: \n🎯 [malteos/aspect-scibert-task](https://huggingface.co/malteos/aspect-scibert-task),\n🔨 [malteos/aspect-scibert-method](https://huggingface.co/malteos/aspect-scibert-method),\n🏷️ [malteos/aspect-scibert-dataset](https://huggingface.co/malteos/aspect-scibert-dataset).\nTo use these models, you need to install 🤗 Transformers first via `pip install 
transformers`.\n\n```python\nimport torch\nfrom transformers import AutoTokenizer, AutoModel\n\n# load model and tokenizer (replace with `aspect-scibert-method` or `aspect-scibert-dataset`)\ntokenizer = AutoTokenizer.from_pretrained('malteos/aspect-scibert-task')\nmodel = AutoModel.from_pretrained('malteos/aspect-scibert-task')\n\npapers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},\n          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]\n\n# concatenate title and abstract\ntitle_abs = [d['title'] + ': ' + (d.get('abstract') or '') for d in papers]\n\n# preprocess the input\ninputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors=\"pt\", max_length=512)\n\n# inference\noutput = model(**inputs)\n\n# mean-pool the token-level embeddings (weighted by the attention mask) to get document-level embeddings\nembeddings = torch.sum(\n    output.last_hidden_state * inputs['attention_mask'].unsqueeze(-1), dim=1\n) / torch.clamp(torch.sum(inputs['attention_mask'], dim=1, keepdim=True), min=1e-9)\n```\n\n## Requirements\n\n- Python 3.7\n- CUDA GPU (for Transformers)\n\n## Installation\n\nCreate a new virtual environment for Python 3.7 with Conda:\n\n```bash\nconda create -n aspect-document-embeddings python=3.7\nconda activate aspect-document-embeddings\n```\n\nClone the repository and install the dependencies:\n\n```bash\ngit clone https://github.com/malteos/aspect-document-embeddings\ncd aspect-document-embeddings\npip install -r requirements.txt\n```\n\n## Datasets\n\nThe datasets are compatible with [Huggingface datasets](https://github.com/huggingface/datasets) and are downloaded automatically.\nTo create the datasets directly from the [Papers With Code data](https://github.com/paperswithcode/paperswithcode-data), run the following commands:\n\n```bash\n# Download PWC files (for the paper, we downloaded the files on 
2020-10-27)\nwget https://paperswithcode.com/media/about/papers-with-abstracts.json.gz\nwget https://paperswithcode.com/media/about/evaluation-tables.json.gz\nwget https://paperswithcode.com/media/about/methods.json.gz\n\n# Build dataset\npython -m paperswithcode.dataset save_dataset \u003cinput_dir\u003e \u003coutput_dir\u003e\n```\n\n## Experiments\n\nTo reproduce our experiments, follow these steps:\n\n### Generic embeddings\n\nAvg. fastText\n```bash\n# Train fastText word vectors\n./data_cli.py train_fasttext paperswithcode_aspects ./output/pwc\n\n# Build avg. fastText document vectors\n./sbin/paperswithcode/avg_fasttext.sh\n```\n\nSciBERT\n```bash\n./sbin/paperswithcode/scibert_mean.sh\n```\n\nSPECTER\n```bash\n./sbin/paperswithcode/specter.sh\n```\n\n### Retrofitted embeddings\n\nFor retrofitting, we use [Explicit Retrofitting](https://github.com/codogogo/explirefit).\nPlease follow their instructions to install it and update `EXPLIREFIT_DIR` in the shell scripts accordingly.\nThen, you can run these scripts:\n\n```bash\n# Create constraints from the dataset\n./sbin/paperswithcode/explirefit_prepare.sh\n\n# Train retrofitting models\n./sbin/paperswithcode/explirefit_avg_fasttext.sh\n./sbin/paperswithcode/explirefit_specter.sh\n./sbin/paperswithcode/explirefit_scibert_mean.sh\n\n# Generate and evaluate retrofitted embeddings\n./sbin/paperswithcode/explirefit_convert_and_evaluate.sh\n```\n\n### Transformers\n\n```bash\n# SciBERT\n./sbin/paperswithcode/pairwise/scibert.sh\n\n# SPECTER\n./sbin/paperswithcode/specter_fine_tuned.sh\n\n# Sentence-SciBERT\n./sbin/paperswithcode/sentence_transformer_scibert.sh\n```\n\n## Evaluation\n\nAfter generating the document representations for all aspects and systems, the results can be computed and viewed with a Jupyter notebook. 
\nFigures and tables from the paper are part of the notebook.\n\n```bash\n# Run evaluations for all systems\n./eval_cli.py reevaluate\n\n# Open the notebook for tables and figures\njupyter notebook evaluation.ipynb\n\n# Open the notebook for sample recommendations\njupyter notebook samples.ipynb\n```\n\n## How to cite\n\nIf you are using our code or data, please cite [our paper](https://arxiv.org/abs/2203.14541):\n\n```bibtex\n@InProceedings{Ostendorff2022,\n  title = {Specialized Document Embeddings for Aspect-based Similarity of Research Papers},\n  booktitle = {Proceedings of the {ACM}/{IEEE} {Joint} {Conference} on {Digital} {Libraries} ({JCDL})},\n  author = {Ostendorff, Malte and Blume, Till and Ruas, Terry and Gipp, Bela and Rehm, Georg},\n  year = {2022},\n}\n```\n\n## License\n\nMIT\n","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmalteos%2Faspect-document-embeddings","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmalteos%2Faspect-document-embeddings","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmalteos%2Faspect-document-embeddings/lists"}