https://github.com/gridaco/contents-crawler

# contents-crawler
A fully customizable web content crawler for collecting ML datasets.

## Packages

- text crawler (general-purpose text crawler)
- classified text crawler (crawls text by the element that contains it: buttons, input placeholders, etc.)
- image crawler
- screenshot crawler
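
To illustrate what "classified text" means here, the following is a minimal sketch, not code from this repository. It assumes a classified item is a `(category, text)` pair and only handles two categories, button labels and input placeholders. It uses Python's standard-library `html.parser` instead of Scrapy selectors so that it runs without extra dependencies; the names `ClassifiedTextParser` and `extract_classified_text` are hypothetical.

```python
from html.parser import HTMLParser


class ClassifiedTextParser(HTMLParser):
    """Hypothetical helper: collects (category, text) pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.items = []         # collected (category, text) tuples
        self._button_depth = 0  # greater than 0 while inside a <button>
        self._button_text = []  # text fragments of the current button

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._button_depth += 1
        elif tag in ("input", "textarea"):
            # A valueless attribute is reported as None, so fall back to "".
            placeholder = (dict(attrs).get("placeholder") or "").strip()
            if placeholder:
                self.items.append(("placeholder", placeholder))

    def handle_data(self, data):
        # Only keep text that appears inside a <button>.
        if self._button_depth:
            self._button_text.append(data)

    def handle_endtag(self, tag):
        if tag == "button" and self._button_depth:
            self._button_depth -= 1
            if self._button_depth == 0:
                # Join fragments and collapse runs of whitespace.
                text = " ".join("".join(self._button_text).split())
                if text:
                    self.items.append(("button", text))
                self._button_text = []


def extract_classified_text(html):
    """Return a list of (category, text) pairs found in an HTML string."""
    parser = ClassifiedTextParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling `extract_classified_text` on a page's HTML returns the pairs in document order, which could then be written out as labeled rows of a dataset.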

## Contribution
This project follows the general Bridged contributing guidelines.

## Development
The crawlers are built with [Scrapy](https://github.com/scrapy/scrapy) on Python 3.
Selenium support is planned for collecting screenshots and for crawling client-side rendered apps.

## Run it on your own
(WIP) A tutorial will be provided soon.

## No, I just want the ready-made dataset
Go to [ui-dataset](https://github.com/bridgedxyz/ui-dataset) for an ML-ready dataset.