https://github.com/tylershin/uri-luda
- Host: GitHub
- URL: https://github.com/tylershin/uri-luda
- Owner: TylerShin
- Created: 2018-06-20T03:09:00.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T02:15:00.000Z (over 3 years ago)
- Last Synced: 2025-03-03T16:21:30.143Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 4.88 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
# Luda Project
Collect and archive Luda's photos.
## User scenario
- Crawl Luda's photos starting from a seed URL page.
- The seed URL should be easy for a user to change.
- Archive the photos to a local drive or S3.
- An admin page shows the photos and statistics about the crawling.
## Design specs
### Crawler
- Crawler should parse and execute JavaScript so it can read SPA websites.
- There should be a second crawler that parses only HTML, for cases where speed matters.
- Crawler should cope with the blocking logic of the target webpage, so it shouldn't crawl too fast or run too many parallel instances.
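
The throttling requirement above could be sketched roughly as follows. This is only an illustration under assumptions: `PoliteFetcher`, `max_parallel`, and `min_interval` are hypothetical names that do not come from this project, and the real fetch function (plain HTML or a JavaScript-rendering browser) is passed in from outside.

```python
from __future__ import annotations

import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


class PoliteFetcher:
    """Wraps a fetch function with a concurrency cap and a minimum delay.

    Hypothetical helper: the names and defaults here are assumptions.
    """

    def __init__(self, fetch: Callable[[str], str], max_parallel: int = 2,
                 min_interval: float = 1.0) -> None:
        self._fetch = fetch
        self._slots = threading.Semaphore(max_parallel)  # caps parallel requests
        self._min_interval = min_interval                # seconds between starts
        self._lock = threading.Lock()
        self._next_start = 0.0  # monotonic time when the next request may start

    def _wait_for_turn(self) -> None:
        # Reserve the next start time under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_start)
            self._next_start = start + self._min_interval
        time.sleep(max(0.0, start - now))

    def get(self, url: str) -> str:
        with self._slots:
            self._wait_for_turn()
            return self._fetch(url)

    def get_all(self, urls: Iterable[str], workers: int = 8) -> list[str]:
        # Results come back in the same order as the input URLs.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(self.get, urls))
```

Either crawler variant could be plugged in as the `fetch` argument, so the HTML-only crawler and the JavaScript-executing one would share the same limits.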
### Detector
- Detector should find and grab images on the webpage.
- Detector should know whether the main subject of the photo is Luda or not.
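
A minimal sketch of the "find images" half of the Detector, using only the Python standard library. `ImageCollector` and `find_images` are hypothetical names, and the sketch only reads `<img src>` attributes; deciding whether Luda is the main subject would need a separate image classifier, which is not shown here.

```python
from __future__ import annotations

from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects absolute image URLs from <img src="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.image_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            url = urljoin(self.base_url, src)
            if url not in self.image_urls:  # skip repeats on the same page
                self.image_urls.append(url)


def find_images(html: str, base_url: str) -> list[str]:
    collector = ImageCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.image_urls
```

Images loaded through CSS backgrounds, `srcset`, or JavaScript would be missed by this approach and would have to come from the JavaScript-executing crawler.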
### Archiver
- Archiver should know whether the same photo already exists in the local (or S3) drive.
- To satisfy the spec above, Archiver should identify the same photo by its content, not just by file name.
- If the same photo exists, Archiver will keep the better one (normally the bigger one).
- If the same photo doesn't exist, Archiver just saves the photo.
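
The Archiver rules above might look like this in outline. Everything here is an assumption for illustration: `Archiver`, `StoredPhoto`, and `sha256_fingerprint` are hypothetical names, an in-memory dictionary stands in for the local drive or S3, and byte length stands in for "better".

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class StoredPhoto:
    name: str
    data: bytes


def sha256_fingerprint(data: bytes) -> str:
    """Exact-content fingerprint; a resized copy gets a different value."""
    return hashlib.sha256(data).hexdigest()


class Archiver:
    """Keeps one photo per fingerprint, preferring the larger file."""

    def __init__(self, fingerprint: Callable[[bytes], str] = sha256_fingerprint) -> None:
        self._fingerprint = fingerprint
        self._photos: Dict[str, StoredPhoto] = {}  # stand-in for disk or S3

    def save(self, name: str, data: bytes) -> str:
        """Return 'saved', 'replaced', or 'skipped'."""
        key = self._fingerprint(data)
        existing = self._photos.get(key)
        if existing is None:
            self._photos[key] = StoredPhoto(name, data)
            return "saved"
        if len(data) > len(existing.data):
            # Same photo, bigger file: keep the new one.
            self._photos[key] = StoredPhoto(name, data)
            return "replaced"
        return "skipped"

    def photos(self) -> list[StoredPhoto]:
        return list(self._photos.values())
```

With the default SHA-256 fingerprint only byte-identical files count as the same photo, so the "replaced" branch is never reached. Matching the same photo at different sizes would require passing in a perceptual fingerprint function instead.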
### Admin page
- WIP
## What we've done
- Created README.md