Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/systemfsoftware/youtube-autocomplete-scraper
YouTube AutoComplete Scraper - An Apify actor that scrapes YouTube's search suggestions with intelligent deduplication using pglite and trigram similarity matching. Perfect for content research, SEO, and trend analysis.
actor apify autocomplete crawler deduplication pglite scraper search similarity suggestions trigram youtube youtube-api
Last synced: about 8 hours ago
- Host: GitHub
- URL: https://github.com/systemfsoftware/youtube-autocomplete-scraper
- Owner: systemfsoftware
- License: mit
- Created: 2024-12-14T10:46:12.000Z (28 days ago)
- Default Branch: master
- Last Pushed: 2025-01-06T00:47:03.000Z (5 days ago)
- Last Synced: 2025-01-06T01:31:04.937Z (5 days ago)
- Topics: actor, apify, autocomplete, crawler, deduplication, pglite, scraper, search, similarity, suggestions, trigram, youtube, youtube-api
- Language: TypeScript
- Homepage:
- Size: 165 KB
- Stars: 2
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# YouTube AutoComplete Scraper
A TypeScript library for scraping YouTube's autocomplete suggestions with intelligent deduplication.
## Features
- Scrapes YouTube's autocomplete API to get search suggestions
- Uses pglite for efficient similarity filtering
- Removes near-duplicate suggestions using trigram similarity
- Configurable similarity threshold
- TypeScript support
- Ready to deploy on the Apify platform
## Installation
```bash
git clone https://github.com/systemfsoftware/youtube-autocomplete-scraper.git
cd youtube-autocomplete-scraper
pnpm install
```
## Usage
There are two ways to use this scraper:
### 1. Local Development
Run the scraper locally by setting the required environment variables and using `pnpm start`:
```bash
# Set your input
export INPUT='{"query": "how to make"}'
# Run the scraper
pnpm start
```
The scraper will output results to the console and save them in the `apify_storage` directory.
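As a rough illustration of what happens to the `INPUT` value, the sketch below parses a JSON string and fills in optional fields. The `parseInput` helper and the `DEFAULTS` object are hypothetical names, and the default values are borrowed from the example input in this README; the actor's real parsing and defaults may differ.

```typescript
// Mirrors the Input interface documented in this README.
interface Input {
  query: string;
  similarityThreshold?: number;
  maxResults?: number;
  language?: string;
  region?: string;
}

// Hypothetical defaults, taken from the README's example input.
// The actor's actual defaults are not documented here.
const DEFAULTS = {
  similarityThreshold: 0.7,
  maxResults: 100,
  language: "en",
  region: "US",
};

// Parse a raw JSON string (such as the INPUT environment variable),
// require a non-empty `query`, and fall back to DEFAULTS for the rest.
function parseInput(raw: string): Required<Input> {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("INPUT must be a JSON object");
  }
  const input = parsed as Partial<Input>;
  if (typeof input.query !== "string" || input.query.trim() === "") {
    throw new Error('INPUT must contain a non-empty "query" string');
  }
  const resolved: Required<Input> = {
    query: input.query,
    similarityThreshold: input.similarityThreshold ?? DEFAULTS.similarityThreshold,
    maxResults: input.maxResults ?? DEFAULTS.maxResults,
    language: input.language ?? DEFAULTS.language,
    region: input.region ?? DEFAULTS.region,
  };
  // The README documents the threshold as a value in the range 0-1.
  if (resolved.similarityThreshold < 0 || resolved.similarityThreshold > 1) {
    throw new Error("similarityThreshold must be between 0 and 1");
  }
  return resolved;
}
```

In a Node.js process this could be called as `parseInput(process.env.INPUT ?? "{}")`, which would reject the empty object because `query` is required.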
### 2. Deploy to Apify
This scraper is designed to run on the Apify platform. To deploy:
1. Push this code to your Apify actor
2. Set the input JSON in the Apify console:
```json
{
  "query": "how to make",
  "similarityThreshold": 0.7,
  "maxResults": 100,
  "language": "en",
  "region": "US"
}
```
## How it Works
Under the hood, this scraper does a few key things:
1. **API Querying**: Makes requests to YouTube's internal autocomplete API endpoint to get raw suggestions
2. **Deduplication**: Uses pglite (a lightweight Postgres implementation) to filter out near-duplicate results:
   - Converts suggestions to trigrams (3-character sequences)
   - Calculates similarity scores between suggestions using trigram matching
   - Filters out suggestions that are too similar based on a configurable threshold
   - For example, "how to cook pasta" and "how to cook noodles" might be considered unique, while "how to make pancake" and "how to make pancakes" would be filtered as duplicates
3. **Result Processing**: Cleans and normalizes the suggestions before returning them
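The deduplication step can be approximated without a database. The sketch below is a plain TypeScript stand-in for trigram similarity, not the pglite query the scraper actually runs. It assumes the usual `pg_trgm` conventions (lowercasing, padding each word with two leading spaces and one trailing space, and scoring shared trigrams over total distinct trigrams). The function names `trigrams`, `similarity`, and `dedupe` are hypothetical, and treating a score at or above the threshold as a duplicate is an assumption.

```typescript
// Build the set of distinct trigrams for a string, loosely following pg_trgm.
// Simplification: only ASCII letters and digits count as word characters.
function trigrams(text: string): Set<string> {
  const result = new Set<string>();
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  for (const word of words) {
    const padded = `  ${word} `; // two leading spaces, one trailing space
    for (let i = 0; i + 3 <= padded.length; i++) {
      result.add(padded.slice(i, i + 3));
    }
  }
  return result;
}

// Similarity = shared trigrams / distinct trigrams across both strings (0 to 1).
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Greedy filter: keep a suggestion only if it is not too similar to one
// that was already kept. Earlier suggestions win over later ones.
function dedupe(suggestions: string[], threshold = 0.7): string[] {
  const kept: string[] = [];
  for (const candidate of suggestions) {
    const isDuplicate = kept.some((k) => similarity(k, candidate) >= threshold);
    if (!isDuplicate) kept.push(candidate);
  }
  return kept;
}
```

Under these assumptions, "how to make pancake" and "how to make pancakes" share 18 of 20 distinct trigrams (a score of 0.9) and collapse into one entry at a threshold of 0.7, while "how to cook pasta" and "how to cook noodles" score roughly 0.46 and are both kept. The greedy pass compares every candidate against every kept item, which is fine for a few hundred suggestions but grows quadratically.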
## Input Schema
The scraper accepts the following input parameters:
```typescript
interface Input {
  query: string                // The search query to get suggestions for
  similarityThreshold?: number // How similar suggestions need to be to be considered duplicates (0-1)
  maxResults?: number          // Maximum number of suggestions to return
  language?: string            // Language code for suggestions
  region?: string              // Region code for suggestions
}
```
## Output
The scraper outputs an array of unique autocomplete suggestions. Results are saved to the default dataset in Apify storage and can be accessed via the Apify API or console.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
MIT