https://github.com/gridaco/contents-crawler
a fully customizable web contents crawler for collecting ml dataset
- Host: GitHub
- URL: https://github.com/gridaco/contents-crawler
- Owner: gridaco
- License: mit
- Created: 2020-12-13T13:42:46.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2020-12-26T15:52:34.000Z (over 5 years ago)
- Last Synced: 2024-05-23T00:01:56.432Z (about 2 years ago)
- Language: Python
- Size: 7.81 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# contents-crawler
A fully customizable web content crawler for collecting ML datasets.
## packages
- text crawler (general text crawler)
- classified text crawler (crawls text contained in buttons, input placeholders, etc.)
- image crawler
- screenshot crawler
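
To illustrate what "classified text" means in the list above, here is a minimal sketch that pulls button labels and input placeholders out of an HTML string and tags each with its category. It is not code from this repository: it uses the standard library's `html.parser` instead of Scrapy so that it runs without dependencies, and the names `ClassifiedTextParser` and `extract_classified_text` are hypothetical.

```python
from html.parser import HTMLParser


class ClassifiedTextParser(HTMLParser):
    """Collects (category, text) pairs for button labels and input placeholders.

    Hypothetical helper for illustration; not part of contents-crawler.
    """

    def __init__(self):
        super().__init__()
        self.items = []          # collected (category, text) pairs
        self._in_button = False  # True while inside a <button> element
        self._buffer = []        # text fragments of the current button

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._in_button = True
            self._buffer = []
        elif tag in ("input", "textarea"):
            # A valueless placeholder attribute yields None and is skipped.
            placeholder = dict(attrs).get("placeholder")
            if placeholder:
                self.items.append(("placeholder", placeholder.strip()))

    def handle_endtag(self, tag):
        if tag == "button" and self._in_button:
            # Join fragments from nested tags and collapse whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.items.append(("button", text))
            self._in_button = False

    def handle_data(self, data):
        if self._in_button:
            self._buffer.append(data)


def extract_classified_text(html):
    """Return a list of (category, text) pairs found in the HTML string."""
    parser = ClassifiedTextParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling `extract_classified_text(page_html)` on a fetched page returns pairs such as `("button", ...)` and `("placeholder", ...)`, which could then be written out as labeled dataset rows. In a Scrapy-based crawler the same extraction would more likely be done with selectors inside a spider's `parse` callback.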
## Contribution
Follows the general Bridged contributing guidelines.
## Development
The crawlers are powered by [Scrapy](https://github.com/scrapy/scrapy) and Python 3.
Selenium will be added later for collecting screenshots and supporting client-side rendered apps.
## Run it on your own
(WIP) A tutorial will be provided soon.
## No, I just want the ready data set.
Go to [ui-dataset](https://github.com/bridgedxyz/ui-dataset) for an ML-ready dataset.