{"id":17937904,"url":"https://github.com/luizppa/web-crawler","last_synced_at":"2025-08-17T22:12:47.748Z","repository":{"id":55928577,"uuid":"319082610","full_name":"luizppa/web-crawler","owner":"luizppa","description":"A web crawler that collects and indexes web pages. Made with chilkat and gumbo parser.","archived":false,"fork":false,"pushed_at":"2021-11-17T23:11:49.000Z","size":4240,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-04T22:02:46.369Z","etag":null,"topics":["chilkat","cpp","crawler","webcrawler"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luizppa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-06T16:48:32.000Z","updated_at":"2024-04-16T06:42:41.000Z","dependencies_parsed_at":"2022-08-15T09:40:32.638Z","dependency_job_id":null,"html_url":"https://github.com/luizppa/web-crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/luizppa/web-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luizppa%2Fweb-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luizppa%2Fweb-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luizppa%2Fweb-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luizppa%2Fweb-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luizppa","download_url":"https://codeload.github.com/luizppa/web-crawler/t
ar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luizppa%2Fweb-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270914170,"owners_count":24667085,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-17T02:00:09.016Z","response_time":129,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chilkat","cpp","crawler","webcrawler"],"created_at":"2024-10-28T23:08:17.238Z","updated_at":"2025-08-17T22:12:47.681Z","avatar_url":"https://github.com/luizppa.png","language":"C++","readme":"# Web Crawler\n\n![](docs/query.gif)\n\nThis is a simplified search engine built with [Chilkat's CkSpider](https://www.chilkatsoft.com/), [Gumbo Parser](https://github.com/google/gumbo-parser) and [RapidJSON](https://github.com/Tencent/rapidjson/). The software will collect a given number of web pages and build an index for information retrieval over that collection. The default number of pages the crawler will try to visit before halting is 100000 (one hundred thousand). 
You can change this value by modifying the ```PAGES_TO_COLLECT``` constant located in ```main.cpp```.\n\n* [Installing](#installing)\n* [Usage](#usage)\n* [Example](#example)\n\n## Installing\n\n[Installing Chilkat for cpp](https://www.chilkatsoft.com/downloads_CPP.asp)\n\nAfter installing Chilkat:\n\n```\n$ sudo make install\n$ make\n```\n\nThis will install [Gumbo Parser](https://github.com/google/gumbo-parser) and build the project (you may need to run ```sudo ldconfig``` afterwards), creating an executable file within ```build/```.\n\n## Usage\n\nTo run the application, you can either use ```make run``` (to run with sample inputs; this will automatically crawl with a predefined seed at ```input/seed``` and build the index for those pages) or use ```./build/web-crawler``` with custom options.\n\nThe available options are:\n\n* ```-c [SEED_FILE]``` replacing ```[SEED_FILE]``` with the path to the file containing your seeds; see [examples](#example). This will start the crawling process with ```[SEED_FILE]``` as seed.\n* ```-i [COLLECTION_PATH - optional]``` replacing ```[COLLECTION_PATH]``` with the path where your HTML collection is stored, or leaving it blank; by default, the collection path is ```output/collection.jl```. This will build an index for the documents present at ```[COLLECTION_PATH]``` and an index for the vocabulary of the collection. Two output files, ```briefing.doc.idx``` and ```index.idx``` (the index for the documents and the vocabulary, respectively), will be created at ```output/```.\n* ```-l [VOCABULARY_INDEX_PATH]``` which will load the vocabulary index file at ```[VOCABULARY_INDEX_PATH]``` into memory (careful there).\n* ```-q [VOCABULARY_INDEX_PATH] [COLLECTION_INDEX_PATH]``` where both ```[VOCABULARY_INDEX_PATH]``` and ```[COLLECTION_INDEX_PATH]``` are optional; however, if ```[COLLECTION_INDEX_PATH]``` is provided, ```[VOCABULARY_INDEX_PATH]``` must be provided as well. This will open the CLI for performing queries. 
Defaults are ```./output/index.idx``` and ```./output/briefing.doc.idx```.\n\nThe documents in the collection are indexed in batches; by default, the maximum batch size is 4096, as defined in ```include/indexer.hpp```. If the batch size is too big, the application will consume a large amount of RAM; however, if it is too small, the execution time and disk usage may increase. [This document](https://github.com/LuizPPA/web-crawler/blob/master/docs/Information_Retrieval_Assignment_4.pdf) presents a chart roughly illustrating how memory consumption scales with batch size.\n\n\u003e *Be cautious when increasing the maximum batch size, as large batches require a lot of RAM; e.g., indexing 60000 documents at once can consume over 5GB. Also, make sure you have enough storage space.*\n\nThe output will be a ```collection.jl``` file, containing the collected documents as a list of JSON objects separated by line breaks, and an ```index.idx``` file, containing the inverted index for that collection. The JSON for each document is an object with two keys, _url_ and _html\\_content_, holding the document's URL on the web and its HTML content, respectively. 
Each line of the inverted index represents a term in the following format:\n\n_term n d\u003csub\u003e1\u003c/sub\u003e n\u003csub\u003ed1\u003c/sub\u003e p\u003csub\u003e1,d1\u003c/sub\u003e p\u003csub\u003e2,d1\u003c/sub\u003e ... p\u003csub\u003endn,dn\u003c/sub\u003e_\n\nWhere _term_ is the indexed word, n is the number of documents where the term is present, d\u003csub\u003ei\u003c/sub\u003e is the i-th document where the term is present, n\u003csub\u003edi\u003c/sub\u003e is the number of times the term appears in d\u003csub\u003ei\u003c/sub\u003e, and p\u003csub\u003ej,di\u003c/sub\u003e is the position of the j-th occurrence of the term in d\u003csub\u003ei\u003c/sub\u003e.\n\n## Example\n\nYour seed file should be a list of URLs where the crawler will start visiting, separated by line breaks, like:\n\n```\nufmg.br\nkurzgesagt.org\nwww.cam.ac.uk\nwww.nasa.gov\ngithub.com\nmedium.com\nwww.cnnbrasil.com.br\ndisney.com.br\nen.wikipedia.org\n```\n\nThe output collection will be formatted as below (input collections should follow the same format):\n\n```\n{\"url\": \"www.document1.com\", \"html_content\": \"\u003chtml\u003e document 1's html content... \u003c/html\u003e\"}\n{\"url\": \"www.document2.com\", \"html_content\": \"\u003chtml\u003e document 2's html content... \u003c/html\u003e\"}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluizppa%2Fweb-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluizppa%2Fweb-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluizppa%2Fweb-crawler/lists"}