{"id":13736671,"url":"https://github.com/kuutsav/information-retrieval","last_synced_at":"2025-05-08T12:33:32.470Z","repository":{"id":50475975,"uuid":"493948580","full_name":"kuutsav/information-retrieval","owner":"kuutsav","description":"Neural information retrieval / Semantic search / Bi-encoders","archived":true,"fork":false,"pushed_at":"2023-08-05T13:04:15.000Z","size":7010,"stargazers_count":164,"open_issues_count":1,"forks_count":20,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-08-04T03:06:54.516Z","etag":null,"topics":["information-retrieval","machine-learning","nlp","semantic-search"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kuutsav.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-05-19T06:31:32.000Z","updated_at":"2024-07-07T17:10:50.000Z","dependencies_parsed_at":"2024-02-06T05:50:48.700Z","dependency_job_id":null,"html_url":"https://github.com/kuutsav/information-retrieval","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuutsav%2Finformation-retrieval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuutsav%2Finformation-retrieval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuutsav%2Finformation-retrieval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuutsav%2Finformation-retrieval/manifests","owner_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/owners/kuutsav","download_url":"https://codeload.github.com/kuutsav/information-retrieval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224732129,"owners_count":17360416,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["information-retrieval","machine-learning","nlp","semantic-search"],"created_at":"2024-08-03T03:01:26.238Z","updated_at":"2024-11-15T04:32:17.536Z","avatar_url":"https://github.com/kuutsav.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook","Paper implementations"],"sub_categories":["Other libraries:"],"readme":"# Information Retrieval\n\n\u003e Information Retrieval is the process through which a computer system can respond to a user's query for text-based information\n\u003e on a specific topic. IR was one of the first and remains one of the most important problems in the domain of natural\n\u003e language processing (NLP) - [stanford cs276](https://web.stanford.edu/class/cs276/)\n\nThis repo contains tutorials covering the breadth of techniques currently available for IR.\n\nAlong with IR techniques, we will also cover:\n- Techniques/metrics for evaluating IR models.\n- Approximate Nearest Neighbor techniques used for indexing and searching dense vectors\n     (used for many dense retrieval techniques).\n- Vector databases and other relevant info.\n\n\n## Tutorials\n\n1. 
**Classic Information Retrieval aka \"The Inverted Index\"** [[Notebook](./1_classic_ir_inverted_index.ipynb)]\n\n    IR in its most basic form answers the question \"how relevant is a given *document* for a *query*\". The challenge is\n    that we don't have just 1 document but potentially millions or billions of documents. So the key challenge is: how\n    can we efficiently find this \"needle in the haystack\", that is, the \"relevant *documents* for a *query*\".\n\n2. **Evaluation metrics** [[Notebook](./2_evaluation_metrics_ir.ipynb)]\n\n    **Binary**: MRR, MAP@k; **Graded**: nDCG@k.\n    The idea behind these evaluations is to quantitatively compare multiple IR models. Typically we have a labelled dataset where\n    queries are mapped to relevant documents. The relevance labels can be either graded or non-graded (binary). For example, a\n    graded relevance score could be on a scale of 0-5 with 5 being the most relevant.\n\n3. **Dense representations and Finetuning BERT for IR / Semantic search** [[Notebook](./3_finetuning_bert_for_ir.ipynb)]\n\n    Sparse representation of texts using one-hot vectors is very limited. We look at ways to learn dense representations of text,\n    from count-based methods like LSA (TF-IDF + SVD) to Word2Vec to RNNs. Finally we look at how transformers are used in the IR\n    setting.\n\n4. **Finetuning Sentence BERT (SBERT) with Multiple Negatives Ranking loss** [[Notebook](./4_finetuning_sbert_with_mnr.ipynb)]\n\n    We look at a better way to finetune Bi-Encoders using MNR loss. It needs less data and training to achieve better\n    results.\n\n5. **Finetuning a Cross-Encoder** [[Notebook](./5_finetuning_cross_encoder.ipynb)]\n\n    We will look at Cross-Encoders: how they differ from Bi-Encoders, how to train them, and when to use them.\n\n6. 
**Multilingual SBERT** [[Notebook](./6_multilingual_sbert.ipynb)]\n\n    We see how knowledge distillation can be used to train a Multilingual Student sentence encoder using a Teacher model which has\n    been finetuned for STS tasks.\n\n7. **Unsupervised training of SBERT - TSDAE** [[Notebook](./7.1_unsupervised_training_tsdae.ipynb)]\n\n    We finally shift our attention to unsupervised techniques to train encoders for STS tasks with no labeled data. Here we look\n    into TSDAE - Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning.\n\n8. **Unsupervised training of SBERT - TSDAE (pytorch version)** [[Notebook](./7.2_unsupervised_training_tsdae_pytorch.ipynb)]\n\n9. **Unsupervised training of SBERT - SimCSE** [[Notebook](./8_unsupervised_training_simcse.ipynb)]\n    \n    We will look into SimCSE, a simple contrastive learning framework that works with both unlabeled and labeled data.\n\n10. **Unsupervised training of SBERT - GPL** [[Notebook](./9_unsupervised_training_gpl.ipynb)]\n\n    We will look into GPL, Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuutsav%2Finformation-retrieval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkuutsav%2Finformation-retrieval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuutsav%2Finformation-retrieval/lists"}