{"id":20189481,"url":"https://github.com/jackfsuia/chats-crawler","last_synced_at":"2025-07-09T13:35:12.420Z","repository":{"id":234949196,"uuid":"789790439","full_name":"jackfsuia/chats-crawler","owner":"jackfsuia","description":"Discourse chat data crawling and on-the-way parsing straight for LLM instruction finetuning.  论坛数据爬取和解析，直接用于对话微调。","archived":false,"fork":false,"pushed_at":"2024-05-06T04:42:57.000Z","size":351,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-13T18:41:51.929Z","etag":null,"topics":["crawler","fine-tuning","finetune-llm","gpt","html-css-javascript","instruction-tuning","llm","llm-training","llms","nlp","nlp-parsing","parser"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jackfsuia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-21T15:07:21.000Z","updated_at":"2024-06-15T16:23:57.000Z","dependencies_parsed_at":"2024-05-06T05:36:15.144Z","dependency_job_id":null,"html_url":"https://github.com/jackfsuia/chats-crawler","commit_stats":null,"previous_names":["jackfsuia/chats-crawler"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackfsuia%2Fchats-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackfsuia%2Fchats-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackfsuia%2Fchats-crawler/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackfsuia%2Fchats-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jackfsuia","download_url":"https://codeload.github.com/jackfsuia/chats-crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241622601,"owners_count":19992504,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","fine-tuning","finetune-llm","gpt","html-css-javascript","instruction-tuning","llm","llm-training","llms","nlp","nlp-parsing","parser"],"created_at":"2024-11-14T03:37:38.310Z","updated_at":"2025-03-03T07:15:37.207Z","avatar_url":"https://github.com/jackfsuia.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/logo.PNG\" width=\"50%\" \u003e\n\u003c/p\u003e\n\u003cdiv align=\"center\"\u003e\n    \n[![GitHub Code License](https://img.shields.io/github/license/jackfsuia/chats-crawler)](LICENSE)\n\nEnglish | [简体中文](README_zh.md)\n\u003c/div\u003e\n\u003c!-- # chats-crawler --\u003e\n\n[**Discourse-based websites**](https://github.com/discourse/discourse) chat data crawling and on-the-way parsing straight for LLM instruction finetuning. Data include the texts, images (crucial for multimodal finetuning) and links. 
Support for websites beyond Discourse-based ones is planned.\n\n## Table of Contents\n\n- [Quick Start](#quick-start)\n- [Examples](#examples)\n- [Notice](#notice)\n- [Future Work](#future-work)\n- [License](#license)\n- [Citation](#citation)\n- [Acknowledgement](#acknowledgement)\n## Quick Start\nClone the repository:\n```bash\ngit clone https://github.com/jackfsuia/chats-crawler.git \u0026\u0026 cd chats-crawler\n```\nThen install the dependencies:\n```bash\nnpm i\n```\nBefore crawling, please read the [Notice](#notice). Configure the target website in [config.ts](config.ts): edit the `url` and `rex` properties to match your needs, i.e., replace both occurrences of `https://discuss.pytorch.org` there with your target [**Discourse-based**](https://github.com/discourse/discourse) website. [**Discourse-based**](https://github.com/discourse/discourse) websites generally look like this:\n\n\u003cimg src=\"assets/discourse.PNG\" width=\"61%\"\u003e\n\nTo start crawling, run:\n```bash\nnpm start\n```\nThat's all! The Discourse chat data are saved in `storage/datasets/default` as .json files, and the images in `storage/datasets/imgs`.\n## Examples\nLet's say we are crawling https://discuss.pytorch.org. We should edit [config.ts](config.ts) as follows:\n```\n...\n url: \"https://discuss.pytorch.org/\",\n...\nrex: \"https://discuss.pytorch.org/t/[^/]+/[0-9]+$\",\n```\nOne of the chat pages we crawl might be this one:\n\n\u003cimg src=\"assets/conversation.PNG\" width=\"61%\"\u003e\n\nThen, in one of the .json files in `storage/datasets/default`, the `\"conversations\"` property will be:\n```\n\u003c# ztf-ucasTengfei Zhang #\u003e:\nHow to delete a Tensor in GPU to free up memory？\nI can get a Tensor in GPU by Tensor.cuda(), but it just returns a copy in GPU. I wonder how can I delete this Tensor in GPU? I try to delete it with “del Tnesor” but it doesn’t work.\n\n\n              Quote:\"\n                Could you show a minimum example? 
The following code works for me for PyTorch 1.1.0:\nimport torch\na = torch.zero(300000000, dtype=torch.int8, device='cuda')\nb = torch.zero(300000000, dtype=torch.int8, device='cuda')\n# Check GPU memory using nvidia-smi\ndel a\ntorch.cuda.empty_cache()\n# Check GPU memo…\n              \"\n\n\u003c# smth #\u003e:\ndel Tensor will delete it from GPU memory. Why do you think it doesn’t work?\n\u003c# ztf-ucasTengfei Zhang #\u003e:\nThank you very much!\nI loaded an OrderedDict of pre-trained weights to gpu by torch.load(), then used a for loop to delete its elements, but there was no change in gpu memory.\nBesides, it is strange that there was no change in gpu memory even I deleted the OrderedDict of pre-trained weights.\nPytorch version is 0.4.0.2\n...\n```\n`\u003c# ztf-ucasTengfei Zhang #\u003e` and `\u003c# smth #\u003e` are the two posters' names. They are formatted this way so you can easily map them onto a chat template for instruction-finetuning LLMs (e.g., replace `\u003c# smth #\u003e` with `\u003cassistant\u003e` and `\u003c# ztf-ucasTengfei Zhang #\u003e` with `\u003cuser\u003e`). Images interspersed in the text are downloaded and saved in `storage/datasets/imgs` under a new FILENAME, and replaced in place with `\"[img FILENAME]\"`. Links interspersed in the text are replaced in place with `\"[link LINK]\"`. All other elements are deleted.\n\n *Do you like this repo? Give us a :star:*\n## Notice\nMake sure for yourself that the crawling is **legal**; check the website's robots.txt if you're not sure. We are not responsible for any legal risks or issues.\n\n## Future Work\n- Support automatic OCR of image data to text, inserted among the original text data. 
It makes the data complete in text form, and also saves space if OCR runs during crawling rather than after it.\n  \n## License\n\nchats-crawler is licensed under the MIT License found in the [LICENSE](LICENSE) file in the root directory of this repository.\n\n## Citation\n\nIf this work is helpful, please kindly cite as:\n\n```bibtex\n@article{chats-crawler,\n  title={chats-crawler: discourse chat data crawling and parsing for LLM instruction finetuning.}, \n  author={Yannan Luo},\n  year={2024},\n  url={https://github.com/jackfsuia/chats-crawler}\n}\n```\n## Acknowledgement\n\nLearned a lot from [gpt-crawler](https://github.com/BuilderIO/gpt-crawler) and [crawlee](https://github.com/apify/crawlee). Thanks for their wonderful work.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjackfsuia%2Fchats-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjackfsuia%2Fchats-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjackfsuia%2Fchats-crawler/lists"}