https://github.com/gridaco/contents-crawler

A fully customizable web content crawler for collecting ML datasets.

Host: GitHub
URL: https://github.com/gridaco/contents-crawler
Owner: gridaco
License: mit
Created: 2020-12-13T13:42:46.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2020-12-26T15:52:34.000Z (over 5 years ago)
Last Synced: 2024-05-23T00:01:56.432Z (about 2 years ago)
Language: Python
Size: 7.81 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# contents-crawler
A fully customizable web content crawler for collecting ML datasets.

## packages

- text crawler (general-purpose text crawler)
- classified text crawler (crawls text by UI role: button labels, input placeholders, etc.)
- image crawler
- screenshot crawler
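
To make the "classified text" idea concrete, here is a minimal sketch of pulling button labels and input placeholders out of a page and tagging each string with its role. It is not the project's Scrapy-based implementation: it uses only Python's standard `html.parser`, and the names `ClassifiedTextParser`, `classify_text`, and the category labels `button` and `input_placeholder` are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class ClassifiedTextParser(HTMLParser):
    """Collects (category, text) pairs for UI text found in an HTML page.

    Hypothetical helper for illustration; not part of contents-crawler.
    """

    def __init__(self):
        super().__init__()
        self.items = []          # collected (category, text) pairs
        self._in_button = False  # True while inside a <button> element
        self._buffer = []        # text fragments of the current button

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "button":
            self._in_button = True
            self._buffer = []
        elif tag == "input":
            # Attribute values may be None (e.g. <input placeholder>).
            placeholder = (attrs.get("placeholder") or "").strip()
            if placeholder:
                self.items.append(("input_placeholder", placeholder))

    def handle_data(self, data):
        # Keep text only while inside a button, including nested tags.
        if self._in_button:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "button" and self._in_button:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.items.append(("button", text))
            self._in_button = False


def classify_text(html):
    """Return a list of (category, text) pairs found in the HTML string."""
    parser = ClassifiedTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    sample = (
        '<form><input type="text" placeholder="Email address">'
        "<button> Sign <b>up</b> </button></form>"
    )
    for category, text in classify_text(sample):
        print(category, text, sep="\t")
```

Keeping the category next to each string is what separates this output from a plain text crawl: a dataset built this way can distinguish a button label from body copy.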

## Contribution
This project follows the general Bridged contributing guidelines.

## Development
The crawlers are built on [Scrapy](https://github.com/scrapy/scrapy) and Python 3.
Selenium is planned for later, to collect screenshots and to support client-side rendered apps.

## Run it on your own
(WIP) A tutorial will be provided soon.

## No, I just want the ready-made dataset.
Go to [ui-dataset](https://github.com/bridgedxyz/ui-dataset) for the ML-ready dataset.
