https://github.com/jackfsuia/chats-crawler

Discourse chat data crawling and on-the-way parsing straight for LLM instruction finetuning. 论坛数据爬取和解析，直接用于对话微调。
https://github.com/jackfsuia/chats-crawler

crawler fine-tuning finetune-llm gpt html-css-javascript instruction-tuning llm llm-training llms nlp nlp-parsing parser

Last synced: 4 months ago
JSON representation

Discourse chat data crawling and on-the-way parsing straight for LLM instruction finetuning. 论坛数据爬取和解析，直接用于对话微调。

Host: GitHub
URL: https://github.com/jackfsuia/chats-crawler
Owner: jackfsuia
License: mit
Created: 2024-04-21T15:07:21.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-05-06T04:42:57.000Z (about 1 year ago)
Last Synced: 2025-01-13T18:41:51.929Z (6 months ago)
Topics: crawler, fine-tuning, finetune-llm, gpt, html-css-javascript, instruction-tuning, llm, llm-training, llms, nlp, nlp-parsing, parser
Language: TypeScript
Homepage:
Size: 343 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        


    





    

[![GitHub Code License](https://img.shields.io/github/license/jackfsuia/chats-crawler)](LICENSE)

English | [简体中文](README_zh.md)



[**Discourse-based websites**](https://github.com/discourse/discourse) chat data crawling and on-the-way parsing straight for LLM instruction finetuning. Data include the texts, images (crucial for multimodal finetuning) and links. Will support more than Discourse-based websites soon.

## Table of Contents

- [Quick Start](#quick-start)

- [Examples](#examples)

- [Notice](#notice)

- [Future Work](#future-work)

- [License](#license)

- [Citation](#citation)

- [Acknowledgement](#acknowledgement)

## Quick Start

Run

```bash

git clone https://github.com/jackfsuia/chats-crawler.git && cd chats-crawler

```

Then install the requirements, run

```bash

npm i

```

Before crawling, please read the [Notice](#Notice). Config the target website at [config.ts](config.ts), edit the `url` and `rex` properties to match your needs, i.e., replace the two `https://discuss.pytorch.org`s there with your target [**Discourse-based**](https://github.com/discourse/discourse) website. A [**Discourse-based**](https://github.com/discourse/discourse) website basically all looks like this:



To start crawling, run

```bash

npm start

```

That's all! The discourse chat data are saved at `storage/datasets/default` as .json files, and the images at `storage/datasets/imgs`.

## Examples

Lets say we crawling https://discuss.pytorch.org. We should edit the [config.ts](config.ts) as:

```

...

 url: "https://discuss.pytorch.org/",

...

rex: "https://discuss.pytorch.org/t/[^/]+/[0-9]+$",

```

One of the chat page we have crawled might be this one:



then at one of the .json files in `storage/datasets/default`, the `"conversations"` property inside will be

```

<# ztf-ucasTengfei Zhang #>:

How to delete a Tensor in GPU to free up memory？

I can get a Tensor in GPU by Tensor.cuda(), but it just returns a copy in GPU. I wonder how can I delete this Tensor in GPU? I try to delete it with “del Tnesor” but it doesn’t work.

              Quote:"

                Could you show a minimum example? The following code works for me for PyTorch 1.1.0:

import torch

a = torch.zero(300000000, dtype=torch.int8, device='cuda')

b = torch.zero(300000000, dtype=torch.int8, device='cuda')

# Check GPU memory using nvidia-smi

del a

torch.cuda.empty_cache()

# Check GPU memo…

              "

<# smth #>:

del Tensor will delete it from GPU memory. Why do you think it doesn’t work?

<# ztf-ucasTengfei Zhang #>:

Thank you very much!

I loaded an OrderedDict of pre-trained weights to gpu by torch.load(), then used a for loop to delete its elements, but there was no change in gpu memory.

Besides, it is strange that there was no change in gpu memory even I deleted the OrderedDict of pre-trained weights.

Pytorch version is 0.4.0.2

...

```

`<# ztf-ucasTengfei Zhang #>` and `<# smth #>` are the two posters' names, and are formatted this way for you to easily template it to instruction-finetune LLMs (e.g., maybe replace `<# smth #>` with ``, and `<# ztf-ucasTengfei Zhang #>` with ``, etc.). If there are images interspersed in the texts, they will not only be downloaded and saved in `storage/datasets/imgs` with a new FILENAME, but also replaced in place with `"[img FILENAME]"` in texts. If there are links interspersed in the texts, they will be replaced in place with `"[link LINK]"` in texts. All the other elements are deleted.

 *Do you like this repo? Give us a :star:*

## Notice

Make sure by yourself the crawling is **legal**, check the website's robots.txt if you're not sure. We are not responsible for any law risks and issues.

## Future Work

- Support image data auto OCR to texts, then inserted among original texts data. It makes the data complete in text form, and save some space too if OCR happens when on the crawling, not post crawling.

  

## License

chats-crawler is licensed under the MIT License found in the [LICENSE](LICENSE) file in the root directory of this repository.

## Citation

If this work is helpful, please kindly cite as:

```bibtex

@article{chats-crawler,

  title={chats-crawler: discourse chat data crawling and parsing for LLM instruction finetuning.}, 

  author={Yannan Luo},

  year={2024},

  url={https://github.com/jackfsuia/chats-crawler}

}

```

## Acknowledgement

Learned a lot from [gpt-crawler](https://github.com/BuilderIO/gpt-crawler) and [crawlee](https://github.com/apify/crawlee). Thanks for their wonderful works.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jackfsuia/chats-crawler

Awesome Lists containing this project

README