https://github.com/lucasmsa/vscodethemes-scrapper
Scraping the vscode-themes website and filling a database with its themes' images in different language variations
- Host: GitHub
- URL: https://github.com/lucasmsa/vscodethemes-scrapper
- Owner: lucasmsa
- License: mit
- Created: 2023-05-21T22:50:26.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-05-30T17:20:16.000Z (almost 2 years ago)
- Last Synced: 2025-03-30T05:04:34.442Z (about 1 month ago)
- Topics: jest, nodejs, puppeteer, puppeteer-screenshot, s3, scraping, typescript
- Language: TypeScript
- Size: 1.06 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🕸️ Vscode Themes Crawler
This is a web scraping project designed to collect images from the [Vscode Themes website](https://vscodethemes.com/). It processes the site's pages, retrieves the theme images, and uploads them to an Amazon S3 bucket. It is built with Node.js and TypeScript, using Puppeteer for the web scraping tasks.

## Purpose
The main goal of this project is to fetch VS Code theme images in each of the 7 programming languages available on the website. It cycles through the pages of the Vscode Themes website, retrieves the themes' images, processes them, and stores them in a specified S3 bucket, with a separate folder per theme. This project is the data-collection stage of a larger project that will use the images to train a machine learning model.

## Technologies
The following technologies were used for this project:

- **Node.js** with **TypeScript**
- **Puppeteer**: For web scraping tasks, to manipulate the webpage and extract data
- **AWS SDK**: To interact with Amazon Web Services, particularly to upload files to an S3 bucket
- **Jest**: Used as the testing framework; all of the application's features were unit tested
- **ESLint** and **Prettier**: Used to maintain consistency and format the code
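The crawl-and-upload flow described above needs a consistent object-key layout so each theme gets its own S3 folder with one image per language. The sketch below illustrates that naming logic; all names here (`slugify`, `buildS3Key`, the list of languages, the key layout) are illustrative assumptions, not the repository's actual API.

```typescript
// Languages the Vscode Themes site can render a preview in
// (assumed list; the README only says there are 7).
const LANGUAGES = [
  "javascript", "css", "html", "python", "go", "java", "cpp",
] as const;

type Language = (typeof LANGUAGES)[number];

/** Normalize a theme name into a safe folder segment. */
function slugify(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

/** Build the S3 key: one folder per theme, one image per language. */
function buildS3Key(themeName: string, language: Language): string {
  return `${slugify(themeName)}/${language}.png`;
}
```

With this layout, `buildS3Key("One Dark Pro", "python")` yields `one-dark-pro/python.png`, so all screenshots of a theme land under one prefix in the bucket.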
## How to run
- First, you need to set up the environment variables. Create a `.env` file in the root directory with the following structure (also described in the `.env.example` file):

```
AWS_ACCESS_KEY=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_REGION=your_region
BUCKET_NAME=your_bucket_name
```

- Then, run the following commands (if you don't have yarn installed, you can use npm instead) to install and run the project:
```
$ yarn install
$ yarn run dev
```

## Testing
To run the tests, use the following command:
```
$ yarn run test
```

### Final Thoughts
---
This project is a good example of how you can combine various technologies to build a simple yet effective web crawler. It's also a great starting point for anyone looking to dive deeper into the world of web scraping or to learn more about interacting with Amazon Web Services.
### License
---
[MIT License](https://opensource.org/license/mit/)