{"id":19449354,"url":"https://github.com/aavache/llmwebcrawler","last_synced_at":"2025-10-23T20:08:24.939Z","repository":{"id":200316184,"uuid":"697703736","full_name":"Aavache/LLMWebCrawler","owner":"Aavache","description":"A Web Crawler based on LLMs implemented with Ray and Huggingface. The embeddings are saved into a vector database for fast clustering and retrieval. Use it for your RAG.","archived":false,"fork":false,"pushed_at":"2023-10-15T12:57:39.000Z","size":21,"stargazers_count":92,"open_issues_count":0,"forks_count":10,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-28T09:51:28.836Z","etag":null,"topics":["api","distributed-computing","fastapi","huggingface","large-language-models","llm","machine-learning","milvus","nlp","pydantic","python","rag","ray","raylib","transformer","vector-database","webcrawler","webcrawling"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Aavache.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-28T09:52:05.000Z","updated_at":"2025-03-13T21:25:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"ccc1828b-d8b1-40e0-b45a-ee97bc28c0ff","html_url":"https://github.com/Aavache/LLMWebCrawler","commit_stats":null,"previous_names":["aavache/llmwebcrawler"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aavache%2FLLMWebCrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aavache%2FLLMWebCrawler/tags",
"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aavache%2FLLMWebCrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aavache%2FLLMWebCrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Aavache","download_url":"https://codeload.github.com/Aavache/LLMWebCrawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248966675,"owners_count":21190819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","distributed-computing","fastapi","huggingface","large-language-models","llm","machine-learning","milvus","nlp","pydantic","python","rag","ray","raylib","transformer","vector-database","webcrawler","webcrawling"],"created_at":"2024-11-10T16:32:06.389Z","updated_at":"2025-10-23T20:08:24.832Z","avatar_url":"https://github.com/Aavache.png","language":"Python","readme":"# LLM-based Web Crawler\n\nAn scalable web crawler, here a list of the feature of this crawler:\n\n* This service can crawl recursively the web storing links it's text and the corresponding text embedding.\n* We use a large language model (e.g Bert) to obtain the text embeddings, i.e. a vector representation of the text present at each webiste.\n* The service is scalable, we use Ray to spread across multiple workers.\n* The entries are stored into a vector database. 
Vector databases are ideal for saving and retrieving samples according to their vector representations.\n\nBy saving the representations in a vector database, you can retrieve similar pages according to how close their vectors are. This is critical for a search engine to retrieve the most relevant results.\n\n# CLI\n\nRun the crawler from the terminal:\n\n```sh\n$ python cli_crawl.py --help\n\noptions:\n  -h, --help            show this help message and exit\n  -u INITIAL_URLS [INITIAL_URLS ...], --initial-urls INITIAL_URLS [INITIAL_URLS ...]\n  -lm LANGUAGE_MODEL, --language-model LANGUAGE_MODEL\n  -m MAX_DEPTH, --max-depth MAX_DEPTH\n```\n\n# API\n\nHost the API with `uvicorn` and `FastAPI`.\n\n```sh\nuvicorn api_app:app --host 0.0.0.0 --port 80\n```\n\nTake a look at the example in `start_api_and_head_node.sh`. Note that the Ray head node needs to be initialized first.\n\n# Large Language Model\n\nFor our use case, we simply use the [BERT](https://arxiv.org/abs/1810.04805) model implemented by [Huggingface](https://huggingface.co/) to extract embeddings from the web text. More precisely, we use [bert-base-uncased](https://huggingface.co/bert-base-uncased). Note that the code is model-agnostic, and new models can be registered with a few lines of code; take a look at `llm/best.py`.\n\n# Saving crawled data\n\nWe use [Milvus](https://milvus.io/) as our main database management software. 
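The retrieval idea behind storing embeddings can be sketched in a few lines of plain Python (a toy illustration only; the helper names and example URLs below are hypothetical and not from this repository, and a real deployment would let Milvus do this search at scale):\n\n```py\n# Toy illustration: pages are retrieved by how close their embedding\n# vectors are to a query vector, here measured with cosine similarity.\nimport math\n\ndef cosine_similarity(a, b):\n    # Cosine similarity between two embedding vectors.\n    dot = sum(x * y for x, y in zip(a, b))\n    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))\n    return dot / norm\n\ndef most_similar(query, pages):\n    # pages maps URL -> embedding; return the URL closest to the query.\n    return max(pages, key=lambda url: cosine_similarity(query, pages[url]))\n\npages = {\n    'https://example.com/cats': [0.9, 0.1, 0.0],\n    'https://example.com/dogs': [0.8, 0.2, 0.1],\n    'https://example.com/tax-law': [0.0, 0.1, 0.9],\n}\nprint(most_similar([0.85, 0.15, 0.05], pages))  # prints the closest page\n```\n\n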
We use a vector-style database due to its inherent capability of searching and saving entries based on vector representations (embeddings).\n\n## Milvus lite\n\nStart your standalone Milvus server as follows; I suggest using a terminal multiplexer such as `tmux`:\n\n```sh\ntmux new -s milvus\nmilvus-server\n```\n\nTake a look under `scripts/` to see some of the basic requests to Milvus.\n\n## Docker compose\n\nYou can also use the official `docker compose` template:\n\n```sh\ndocker compose --file milvus-docker-compose.yml up -d\n```\n\n# Parallel computation\n\nWe use [Ray](https://docs.ray.io/en/latest/ray-core/examples/gentle_walkthrough.html), a great Python framework for distributed and parallel processing. Ray follows the master-worker paradigm, where a `head` node dispatches tasks to be executed by the connected workers.\n\n## Start the head and the worker nodes in Ray\n\n### Head node\n\n1. Set up the head node\n\n```sh\nray start --head\n```\n\n2. Connect your program to the head node\n\n```py\nimport ray\n\n# Connect to the head\nray.init(\"auto\")\n```\n\nTo stop a Ray node:\n```sh\nray stop\n```\n\nOr check its status:\n```sh\nray status\n```\n\n### Worker node\n\n1. Initialize the worker node, pointing it at the head node's address\n\n```sh\nray start --address=<head-node-address>\n```\n\nThe worker node does not need a copy of the code, since the head node serializes and submits both the arguments and the implementation to the workers.\n\n## Future features\n\nThe current implementation is a PoC. Many improvements can be made:\n* [Important] New API endpoint to search for similar URLs given a text query.\n* Optimize search and the API.\n* Adding new LLM models and new chunking strategies with popular libraries, e.g. 
[LangChain](https://www.langchain.com/).\n* Storing more features in the vector DB, perhaps generated summaries.\n\n## Contributing\n\nAll issues and PRs are welcome 🙂.\n\n## References\n\n* [Ray Documentation](https://docs.ray.io/en/latest/ray-core/examples/gentle_walkthrough.html)\n* [Milvus](https://milvus.io/)\n* [FastAPI](https://fastapi.tiangolo.com/)\n* [Huggingface](https://huggingface.co/)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faavache%2Fllmwebcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faavache%2Fllmwebcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faavache%2Fllmwebcrawler/lists"}