https://github.com/lucasmsa/vscodethemes-scrapper
Scraping the vscode-themes website and filling a database with its themes' images in different language variations
- Host: GitHub
- URL: https://github.com/lucasmsa/vscodethemes-scrapper
- Owner: lucasmsa
- License: mit
- Created: 2023-05-21T22:50:26.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-05-30T17:20:16.000Z (almost 2 years ago)
- Last Synced: 2025-03-30T05:04:34.442Z (about 1 month ago)
- Topics: jest, nodejs, puppeteer, puppeteer-screenshot, s3, scraping, typescript
- Language: TypeScript
- Size: 1.06 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🕸️ Vscode Themes Crawler
This is a web scraping project designed to collect images from the [Vscode Themes website](https://vscodethemes.com/). It processes the site's pages, retrieves the theme images, and uploads them to an Amazon S3 bucket. It is built with Node.js and TypeScript, using Puppeteer for the web scraping tasks.

## Purpose
The main goal of this project is to fetch VS Code theme images in each of the 7 programming languages available on the website. It cycles through the pages of the Vscode Themes website, retrieves the themes' images, processes them, and stores them in a specified S3 bucket, with a separate folder per theme. This project is the data-collection stage of a larger project that will use the images to train a machine learning model.

## Technologies
The following technologies were used for this project:

- **Node.js** with **TypeScript**
- **Puppeteer**: For web scraping tasks, to manipulate the webpage and extract data
- **AWS SDK**: To interact with Amazon Web Services, particularly to upload files to an S3 bucket
- **Jest**: Used as the testing framework; all of the application's features were unit tested
- **ESLint** and **Prettier**: Used to maintain consistency and format the code
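The crawl-and-upload flow described above needs a consistent object-key layout so each theme gets its own S3 folder with one image per language. The sketch below illustrates that naming logic; all names here (`slugify`, `buildS3Key`, the list of languages, the key layout) are illustrative assumptions, not the repository's actual API.

```typescript
// Languages the Vscode Themes site can render a preview in
// (assumed list; the README only says there are 7).
const LANGUAGES = [
  "javascript", "css", "html", "python", "go", "java", "cpp",
] as const;

type Language = (typeof LANGUAGES)[number];

/** Normalize a theme name into a safe folder segment. */
function slugify(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

/** Build the S3 key: one folder per theme, one image per language. */
function buildS3Key(themeName: string, language: Language): string {
  return `${slugify(themeName)}/${language}.png`;
}
```

With this layout, `buildS3Key("One Dark Pro", "python")` yields `one-dark-pro/python.png`, so all screenshots of a theme land under one prefix in the bucket.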
## How to run
- First, you need to set up the environment variables. Create a `.env` file in the root directory with the following structure (also described in the `.env.example` file):

```
AWS_ACCESS_KEY=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_REGION=your_region
BUCKET_NAME=your_bucket_name
```

- Then, run the following commands (if you don't have yarn installed, you can use npm instead) to install and run the project:
```
$ yarn install
$ yarn run dev
```

## Testing
To run the tests, use the following command:
```
$ yarn run test
```

### Final Thoughts
---
This project is a good example of how you can combine various technologies to build a simple yet effective web crawler. It's also a great starting point for anyone looking to dive deeper into the world of web scraping or to learn more about interacting with Amazon Web Services.
### License
---
[MIT License](https://opensource.org/license/mit/)