{"id":16913004,"url":"https://github.com/aborg-dev/information_retrieval_class","last_synced_at":"2026-05-16T00:31:55.618Z","repository":{"id":16589848,"uuid":"19344185","full_name":"aborg-dev/Information_Retrieval_Class","owner":"aborg-dev","description":"SHAD Information Retrieval Class materials ","archived":false,"fork":false,"pushed_at":"2015-05-14T09:35:33.000Z","size":520,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-20T19:37:40.123Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aborg-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-05-01T12:49:07.000Z","updated_at":"2015-04-07T01:38:23.000Z","dependencies_parsed_at":"2022-09-24T08:21:14.928Z","dependency_job_id":null,"html_url":"https://github.com/aborg-dev/Information_Retrieval_Class","commit_stats":null,"previous_names":["aborg-dev/information_retrieval_class"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aborg-dev/Information_Retrieval_Class","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborg-dev%2FInformation_Retrieval_Class","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborg-dev%2FInformation_Retrieval_Class/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborg-dev%2FInformation_Retrieval_Class/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborg-dev%2FInformation_Retrieval_Class/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/aborg-dev","download_url":"https://codeload.github.com/aborg-dev/Information_Retrieval_Class/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborg-dev%2FInformation_Retrieval_Class/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261416226,"owners_count":23155035,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T19:11:50.465Z","updated_at":"2026-05-16T00:31:55.562Z","avatar_url":"https://github.com/aborg-dev.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Information Retrieval Course [![Build Status](https://travis-ci.org/IIoTeP9HuY/Information_Retrieval_Class.png?branch=master)](https://travis-ci.org/IIoTeP9HuY/Information_Retrieval_Class)\n\nInformation Retrieval YSDA course programming assignments repository\nhttp://shad.yandex.ru/\n\nIntroduction\n============\n\nThese are my solutions for programming exercises from YSDA Information Retrieval classes. 
Written in Python/C++.\n\nRequirements\n============\n* gcc \u003e= 4.8 or clang \u003e= 3.4\n* cmake\n* boost \u003e= 1.55\n* glib2\n* libglibmm-2.4\n* libxml++2.6\n* libtidy\n\nUsage tutorial\n============\n\nBuild everything using:\n```bash\n    mkdir build \u0026\u0026 cd build\n    cmake ..\n    make\n```\n\n####Part1, Download wiki and analyze webgraph\n\n#####Download wiki\n```bash\n    crawler http://simple.wikipedia.org/ -o wiki -t 8\n```\nThis will download the whole domain to the folder \"wiki\", preserving its hierarchical structure, and\nfill the file \"ready_urls.txt\" with the downloaded URLs.\n\n#####Flatten hierarchical structure to simplify parsing\n```bash\n    flatten --urlsList ready_urls.txt --urlsDir wiki --outDir flat_wiki --urlsMapping urls\n```\nThis will create the folder \"flat_wiki\" with files 1.html, 2.html, ...\nand the file \"urls\" with the mapping 1.html`\u003ctab\u003e`url_1, ...\n\n#####Strip hypertext tags and extract text from webpages\n```bash\n    extract --urlsDir flat_wiki --outDir text_wiki --urlsMapping urls\n```\nThis will fill the folder \"text_wiki\" with files 1.txt, 2.txt, ... 
corresponding to the text extracted\nfrom 1.html, 2.html, ...\nIt will also produce the file token_frequency containing lines token_1`\u003ctab\u003e`frequency_1 ...\n\n#####Finally, analyze webgraph\n```bash\n    flat_webgraph --path flat_wiki --urlMapping urls --domain http://simple.wikipedia.org/ --start_page http://simple.wikipedia.org/wiki/Main_Page\n```\nThis will create files:\n* \"distances\" - distance from start_page to every other page\n* \"in_out_stats\" - in-degree and out-degree for each page\n* \"pageranks\" - pagerank for each page (damping factor = 0.85)\n\n####Part2, Find duplicates among documents\n\nThe next steps let you find duplicates among the downloaded pages using a technique called \"Simhashing\".\n\n#####Preprocess data a bit\n\nDuring this step you should remove all frequent and wiki-specific words from the documents.\nThere are more general ways to do this using word frequency and scoring, but in this case\nit's easier to run a simple sed command :)\n\n```bash\n    cp -r text_wiki text_wiki_clean \u0026\u0026 cd text_wiki_clean\n    find . -type f -exec gsed -n -i \"/Navigation menu/q;p\" {} \\;\n```\nThis will cut each file at the first line containing the string \"Navigation menu\", removing everything from that line to the end of the file.\nThat's where the extracted wiki-specific words are located.\n\n#####Build simhash signatures\n\nNow we can build simhashes for the documents:\n```bash\n    Simhash -b --path=text_wiki_clean --dest=results\n```\nThis will produce the file \"simhashes\" in the following format: **url**  **length\\_in\\_words**  **simhash**\n\nFor example: http://simple.wikipedia.org/wiki/Nathalia_Dill 383 11074093965332231517\n\n#####Cluster documents:\n\nTo cluster documents we need to specify the maximum allowed simhash distance between documents\nin a cluster. 
In this case we will use \"-s 5\", which means that documents whose simhashes differ in no more than\n5 bit positions will be considered similar.\n```bash\n    Simhash -f -s 5 --dest=results\n```\nThis will create files:\n* \"clusters_5\" - contains the list of found clusters\n* \"clusters_5_sizes\" - contains the sizes of found clusters\n\nCollaboration Policy\n==========\n\nI open-sourced this code because I believe it can help people learn something new and improve their skills.\nUse it as your conscience dictates, but I encourage you not to copy-paste these sources and to use\nthem only for educational purposes :)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faborg-dev%2Finformation_retrieval_class","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faborg-dev%2Finformation_retrieval_class","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faborg-dev%2Finformation_retrieval_class/lists"}