https://github.com/apify/aidevworld2023

How to get clean web data for chatbots and LLMs slides and supporting materials.

Host: GitHub
URL: https://github.com/apify/aidevworld2023
Owner: apify
Created: 2023-10-22T18:40:17.000Z (over 2 years ago)
Default Branch: materials
Last Pushed: 2023-12-15T06:00:01.000Z (over 2 years ago)
Last Synced: 2025-02-16T16:19:53.033Z (over 1 year ago)
Language: JavaScript
Size: 33.9 MB
Stars: 3
Watchers: 4
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

README

# How to get clean web data for chatbots and LLMs

This is a companion repository to go with [Ondra Urban](https://github.com/mnmkng)'s AI Dev World 2023 talk: How to get clean web data for chatbots and LLMs.

The presentation slides are available here in [PDF](./how-to-get-clean-web-data-for-chatbots-and-llms.pdf) and [PPTX](how-to-get-clean-web-data-for-chatbots-and-llms.pptx) formats.

## Chatbot examples

To run the chatbot examples, install Node.js and then install the dependencies with:

```bash
npm install
```

To run the chatbots, export your OpenAI API key as an environment variable, or set `OPENAI_API_KEY` in whatever other way suits your environment:

```bash
export OPENAI_API_KEY=your-api-key
```
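A missing key usually only shows up once the first API request fails, so it can help to check for it up front. The snippet below is a minimal sketch of such a check in Node.js. The `requireEnv` helper is hypothetical and not part of this repository; the only name taken from the instructions above is `OPENAI_API_KEY`.

```javascript
// Hypothetical helper (not part of this repository): returns the value of an
// environment variable, or throws if it is missing or blank.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value.trim();
}

// Example usage: report the problem early instead of on the first API call.
try {
  requireEnv('OPENAI_API_KEY');
  console.log('OPENAI_API_KEY is set.');
} catch (err) {
  console.error(err.message);
}
```

The second parameter exists only so the check can be exercised against a plain object instead of the real `process.env`.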

Finally, run them with:

```bash
node tesla-chatbot.js
```

```bash
node bmw-chatbot.js
```

For more information on how they work, see the [Crawlee](https://crawlee.dev) and [LangChain JS](https://js.langchain.com) documentation.

## Useful links

- [Apify](https://apify.com)

- [Website Content Crawler](https://apify.com/apify/website-content-crawler)

- [Crawlee](https://crawlee.dev)

- [LangChain](https://www.langchain.com)

- [Mozilla Readability](https://github.com/mozilla/readability)

- [Scrapy](https://scrapy.org)

- [Puppeteer](https://pptr.dev)

- [Playwright](https://playwright.dev)

- [Selenium](https://www.selenium.dev)

- [curl impersonate](https://github.com/lwthiker/curl-impersonate)

- [Got Scraping](https://github.com/apify/got-scraping)
