https://github.com/tylershin/uri-luda

Host: GitHub
URL: https://github.com/tylershin/uri-luda
Owner: TylerShin
Created: 2018-06-20T03:09:00.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2022-12-08T02:15:00.000Z (over 3 years ago)
Last Synced: 2025-03-03T16:21:30.143Z (over 1 year ago)
Language: Jupyter Notebook
Size: 4.88 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md

# Luda Project

Collect and archive Luda's photos.

## User scenario

- Crawl Luda's photos from a seed URL page.
- The seed URL should be easy for a user to change.
- Archive the photos to a local drive or S3.
- An admin page shows the photos and statistics about the crawl.

## Design specs

### Crawler

- The crawler should parse and execute JavaScript so that it can read SPA websites.
- There should also be a second crawler that parses only HTML, because executing JavaScript is slow.
- The crawler should cope with the target site's blocking logic, so it shouldn't crawl too fast or run too many parallel instances.
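
The "not too fast" requirement could be met with a per-host throttle. The following is a minimal sketch, not the project's implementation: the class name `PoliteThrottle`, the `min_delay` parameter, and the 2-second default are all assumptions made for illustration. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlparse


class PoliteThrottle:
    """Hypothetical helper: enforce a minimum delay between requests to one host."""

    def __init__(
        self,
        min_delay: float = 2.0,  # assumed default, tune per target site
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the host of `url` may be requested again.

        Returns the number of seconds slept.
        """
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            elapsed = self._clock() - last
            delay = max(0.0, self.min_delay - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the time of this request for the next call.
        self._last_request[host] = self._clock()
        return delay
```

A crawler would call `throttle.wait(url)` right before each fetch. Limiting the number of parallel instances is a separate concern and could be handled by capping the worker count.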

### Detector

- The detector should find and grab images on the web page.
- The detector should determine whether the main subject of a photo is Luda.
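
The first detector requirement, finding images on a page, can be sketched with Python's standard `html.parser` module. This is an illustration under assumptions: the names `ImageCollector` and `find_images` are hypothetical, and only plain `<img src>` attributes are handled (no `srcset`, CSS backgrounds, or lazy-loading attributes). Deciding whether a photo actually shows Luda would need a face-recognition model and is not covered here.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Hypothetical helper: collect absolute URLs of <img> tags, without duplicates."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.image_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if not src:
            return  # <img> without a usable src attribute
        # Resolve relative paths against the page URL.
        absolute = urljoin(self.base_url, src)
        if absolute not in self.image_urls:
            self.image_urls.append(absolute)


def find_images(html: str, base_url: str) -> List[str]:
    """Return image URLs found in `html`, in document order."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.image_urls
```

The HTML string would come from either crawler: the HTML-only one directly, or the JavaScript-executing one after the page has rendered.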

### Archiver

- The archiver should know whether the same photo already exists on the local drive or in S3.
- To do that, the archiver must be able to tell whether two photos are the same, not just by comparing file names.
- If the same photo already exists, the archiver keeps the better one (normally the larger one).
- If the same photo doesn't exist, the archiver simply saves the photo.
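
The save-or-replace rules above can be sketched as follows. This is a hedged illustration, not the project's code: `Archiver`, `StoredPhoto`, and `sha256_fingerprint` are hypothetical names, an in-memory dictionary stands in for the local drive or S3, and "better" is assumed to mean "more bytes". The default SHA-256 fingerprint only matches byte-identical files, so recognizing a resized copy of the same photo would require swapping in a perceptual hash via the `fingerprint` parameter.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class StoredPhoto:
    name: str
    data: bytes


def sha256_fingerprint(data: bytes) -> str:
    """Exact-content fingerprint; matches byte-identical files only."""
    return hashlib.sha256(data).hexdigest()


class Archiver:
    """Hypothetical archiver keyed by content fingerprint, not file name."""

    def __init__(
        self, fingerprint: Callable[[bytes], str] = sha256_fingerprint
    ) -> None:
        self._fingerprint = fingerprint
        # In-memory stand-in for the local drive or S3 bucket.
        self._photos: Dict[str, StoredPhoto] = {}

    def save(self, name: str, data: bytes) -> str:
        """Store a photo and return 'saved', 'replaced', or 'skipped'."""
        key = self._fingerprint(data)
        existing = self._photos.get(key)
        if existing is None:
            self._photos[key] = StoredPhoto(name, data)
            return "saved"
        # Same photo already archived: keep whichever copy is larger.
        if len(data) > len(existing.data):
            self._photos[key] = StoredPhoto(name, data)
            return "replaced"
        return "skipped"
```

With the default fingerprint, saving the same bytes twice under different names returns `"saved"` and then `"skipped"`, which covers the "not just by a file name" requirement for exact duplicates.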

### Admin page

- WIP

## What we've done

- Created README.md
