https://github.com/apify/aidevworld2023
How to get clean web data for chatbots and LLMs slides and supporting materials.
- Host: GitHub
- URL: https://github.com/apify/aidevworld2023
- Owner: apify
- Created: 2023-10-22T18:40:17.000Z (over 2 years ago)
- Default Branch: materials
- Last Pushed: 2023-12-15T06:00:01.000Z (over 2 years ago)
- Last Synced: 2025-02-16T16:19:53.033Z (over 1 year ago)
- Language: JavaScript
- Size: 33.9 MB
- Stars: 3
- Watchers: 4
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# How to get clean web data for chatbots and LLMs
This is the companion repository for [Ondra Urban](https://github.com/mnmkng)'s AI Dev World 2023 talk, "How to get clean web data for chatbots and LLMs".
The presentation slides are available here in [PDF](./how-to-get-clean-web-data-for-chatbots-and-llms.pdf) and [PPTX](how-to-get-clean-web-data-for-chatbots-and-llms.pptx) formats.
## Chatbot examples
To run the chatbot examples, install Node.js, then install the dependencies with:
```bash
npm install
```
The chatbots need your OpenAI API key in the `OPENAI_API_KEY` environment variable. Export it in your shell, or set it any other way you prefer:
```bash
export OPENAI_API_KEY=your-api-key
```
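If you want to confirm the variable is visible to Node.js before starting a chatbot, a small pre-flight check along these lines may help. This is only a sketch: `requireEnv` is a hypothetical helper and is not part of this repository or of the chatbot scripts. It assumes only that the key is read from `process.env`.

```javascript
// Hypothetical pre-flight helper; NOT part of this repository.
// Returns the trimmed value of an environment variable,
// or throws an error that names the missing variable.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(
      `Environment variable ${name} is not set. Try: export ${name}=your-api-key`
    );
  }
  return value.trim();
}

// Report whether the key is present without printing the key itself.
try {
  requireEnv('OPENAI_API_KEY');
  console.log('OPENAI_API_KEY is set.');
} catch (err) {
  console.error(err.message);
}
```

Saved to a file of your choosing and run with `node`, this prints a confirmation when the key is exported and a hint when it is not. The optional second argument exists only so that the check can be exercised with a plain object instead of the real environment.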
Finally, run them with:
```bash
node tesla-chatbot.js
```
```bash
node bmw-chatbot.js
```
For more information on how they work, see the [Crawlee](https://crawlee.dev) and [LangChain JS](https://js.langchain.com) documentation.
## Useful links
- [Apify](https://apify.com)
- [Website Content Crawler](https://apify.com/apify/website-content-crawler)
- [Crawlee](https://crawlee.dev)
- [LangChain](https://www.langchain.com)
- [Mozilla Readability](https://github.com/mozilla/readability)
- [Scrapy](https://scrapy.org)
- [Puppeteer](https://pptr.dev)
- [Playwright](https://playwright.dev)
- [Selenium](https://www.selenium.dev)
- [curl impersonate](https://github.com/lwthiker/curl-impersonate)
- [Got Scraping](https://github.com/apify/got-scraping)