https://github.com/ashutoshvarma/ggsipu_results_crawler
Fully automated solution for extraction & archiving of data from GGSIPU result PDFs
- Host: GitHub
- URL: https://github.com/ashutoshvarma/ggsipu_results_crawler
- Owner: ashutoshvarma
- License: gpl-3.0
- Created: 2020-07-07T15:27:56.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-12-15T08:24:45.000Z (over 1 year ago)
- Last Synced: 2024-12-28T00:42:50.785Z (5 months ago)
- Language: Python
- Homepage:
- Size: 79.1 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# **grc.py** - _the uncompromising results crawler_
> Inspired by the [ggsipu-notice-tracker](https://github.com/ggsipu-usict/ggsipu-notice-tracker), _grc_ aims to automate the extraction and archiving of data from GGSIPU result PDFs.
## How it Works?
It scrapes and processes the _new_ result PDFs from the results website (`RESULTS_URL`) and records the last processed PDF (`LAST_JSON`) for future reference.

For PDF processing, the [ggsipu_result](https://github.com/ashutoshvarma/ggsipu_result) module is used, and the extracted data is passed to specialized classes inherited from the `BaseDump` class, which upload or archive the data. Currently the only dumps are Firebase Realtime Database (for JSON data) and Firebase Cloud Storage (for students' images).

## Requirements
Python >= 3.8 is required.

To install the requirements:
```
pip install -r requirements.txt
```

## How to Use?
There are two ways to use _grc_:
- Local - `python grc.py`
- In Server/CI - `bash start.sh`

### Local - `python grc.py`
Since grc.py uses Firebase as its backend, you need to define two environment variables for authentication:
- `FIREBASE_CONFIG` - Firebase options; must have `databaseURL` and `storageBucket` set in it. Read more [here](https://firebase.google.com/docs/admin/setup#initialize-without-parameters).
- `GOOGLE_APPLICATION_CREDENTIALS` - For authenticating with Google Cloud. Read more [here](https://cloud.google.com/docs/authentication/production#providing_credentials_to_your_application).

### CI/Server/Container - `bash start.sh`
`start.sh` is a wrapper script that runs grc.py in an isolated environment where the file system may be temporary (ephemeral), such as Heroku, CI servers, or containers.

It loads the last PDF info from a [git repo](https://github.com/GGSIPUResultTracker/ggsipu_results_archive), starts grc.py, and then uploads the updated last PDF details back to the git repo.
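The state carried between runs is what lets grc.py skip PDFs it has already processed, so on an ephemeral file system it has to be restored before the run and pushed back afterwards. Below is a minimal sketch of that bookkeeping, not the actual grc.py code: the file name, the `{"url": ...}` layout, the helper names, and the newest-first listing are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import List, Optional

# Hypothetical state file name and layout; the real LAST_JSON format
# used by grc.py may differ.
STATE_FILE = Path("last.json")


def load_last(path: Path) -> Optional[str]:
    """Return the URL of the last processed PDF, or None on a first run."""
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8")).get("url")


def select_new(listing: List[str], last: Optional[str]) -> List[str]:
    """Return the entries that precede `last` in a newest-first listing."""
    if last is None or last not in listing:
        # Nothing recorded (or the record is stale): process everything.
        return list(listing)
    return listing[: listing.index(last)]


def save_last(path: Path, url: str) -> None:
    """Record the newest processed PDF for the next run."""
    path.write_text(json.dumps({"url": url}), encoding="utf-8")
```

In a wrapper like `start.sh`, the equivalent of `load_last` would run after fetching the state file from the archive repo, and the equivalent of `save_last` just before committing it back.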
Like grc.py itself, it requires Firebase and GitHub authentication details, passed through environment variables:
- `GCLOUD_KEY` - Contents of Google Cloud Auth Key file (`GOOGLE_APPLICATION_CREDENTIALS`).
- `ARCHIVE_GIT_REPO` - GitHub repo in which to save the last PDF details, for example `ashutoshvarma/results_archive`
- `ARCHIVE_GIT_BRANCH` - Git branch for `ARCHIVE_GIT_REPO`
- `GIT_OAUTH_TOKEN` - GitHub auth token with push rights to `ARCHIVE_GIT_REPO`
- `FIREBASE_CONFIG`

## Extra Configuration
See 'GLOBAL OPTIONs' in grc.py
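
Before a local run, it can help to confirm that the two Firebase-related environment variables described earlier are usable. The sketch below is not part of grc.py: `check_env` is a hypothetical helper, and it assumes `FIREBASE_CONFIG` holds either a JSON string or a path to a JSON file, and that `GOOGLE_APPLICATION_CREDENTIALS` is a path to a key file.

```python
import json
import os
from typing import List, Mapping


def check_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in the Firebase-related variables."""
    problems = []

    raw = env.get("FIREBASE_CONFIG", "")
    if not raw:
        problems.append("FIREBASE_CONFIG is not set")
    else:
        try:
            if raw.lstrip().startswith("{"):
                # Assumption: a value starting with "{" is inline JSON.
                config = json.loads(raw)
            else:
                # Otherwise treat the value as a path to a JSON file.
                with open(raw, encoding="utf-8") as fh:
                    config = json.load(fh)
        except (OSError, ValueError) as exc:
            problems.append(f"FIREBASE_CONFIG is not readable JSON: {exc}")
        else:
            if not isinstance(config, dict):
                problems.append("FIREBASE_CONFIG is not a JSON object")
            else:
                for key in ("databaseURL", "storageBucket"):
                    if not config.get(key):
                        problems.append(f"FIREBASE_CONFIG is missing {key}")

    creds = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    if not creds:
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not os.path.isfile(creds):
        problems.append("GOOGLE_APPLICATION_CREDENTIALS does not point to a file")

    return problems


if __name__ == "__main__":
    for problem in check_env(os.environ):
        print("warning:", problem)
```

Taking the environment as a mapping rather than reading `os.environ` directly keeps the check easy to exercise with a plain dictionary.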