https://github.com/evgeniradev/page_scraper
An Elixir-based page scraper app built on Phoenix. It detects changes on a given web page and logs them to a database.
elixir phoenix scraper
- Host: GitHub
- URL: https://github.com/evgeniradev/page_scraper
- Owner: evgeniradev
- Created: 2019-09-18T19:47:33.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2023-01-05T22:46:29.000Z (over 3 years ago)
- Last Synced: 2025-02-21T19:14:16.554Z (over 1 year ago)
- Topics: elixir, phoenix, scraper
- Language: Elixir
- Size: 279 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 9
Metadata Files:
- Readme: README.md
README
# PageScraper
An Elixir-based page scraper app built on Phoenix.
It detects changes on a given web page and logs them to a database.
It uses Selenium and the Hound package to load pages in a Chrome session.
Currently, the app uses a single Chrome session, which slows polling when more than one page is polled at the same time.
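The README does not say how changes are detected. As an illustration only, the sketch below assumes a fingerprint comparison: each poll hashes the page source and compares the digest with the one stored from the previous poll. The module name `ChangeDetectionSketch` and its functions are hypothetical and are not taken from the app's code; in the app itself the page source would come from the Hound-driven Chrome session rather than from a plain string.

```elixir
defmodule ChangeDetectionSketch do
  @moduledoc """
  Hypothetical sketch, not the app's actual code: compares two snapshots
  of a page's source by their SHA-256 fingerprints.
  """

  # Returns a lowercase hex-encoded SHA-256 digest of the page source.
  def fingerprint(page_source) when is_binary(page_source) do
    :sha256
    |> :crypto.hash(page_source)
    |> Base.encode16(case: :lower)
  end

  # First poll: there is no previous fingerprint to compare against.
  def detect(nil, new_source), do: {:initial, fingerprint(new_source)}

  # Later polls: report a change only when the fingerprint differs.
  def detect(previous_fingerprint, new_source) do
    case fingerprint(new_source) do
      ^previous_fingerprint -> :unchanged
      current -> {:changed, current}
    end
  end
end

# Usage: feed each poll's page source together with the last stored fingerprint.
{:initial, stored} = ChangeDetectionSketch.detect(nil, "<h1>v1</h1>")
:unchanged = ChangeDetectionSketch.detect(stored, "<h1>v1</h1>")
{:changed, _new_fingerprint} = ChangeDetectionSketch.detect(stored, "<h1>v2</h1>")
```

Comparing digests rather than full sources keeps the stored state small, at the cost of not knowing what changed; an app that logs the change itself would also need to keep or diff the previous source.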
## Installation
Use [Docker](https://docs.docker.com/) to run the app.
Run the setup command below to build the containers, create a new database, and run the migrations. Note that this command drops any existing database.
```
$ ./setup.sh
```
Start the app in development mode:
```
$ ./start.sh
```
Finally, load [http://localhost](http://localhost) in your browser.
## Running the tests
```
$ ./test.sh
```
## Details
To use the options below, create a `.env` file in the app's root directory.
To specify a time zone, add the following environment variable to the file:
```
# Default: Europe/London
TZ=your_time_zone
```
To limit the number of logged changes per page, add the following environment variable to the file:
```
# Default: 100
PAGE_CHANGES_LIMIT=100
```
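How the app consumes `PAGE_CHANGES_LIMIT` is not described here. The following is a minimal sketch of one plausible reading, assuming the value is read from the environment with a fallback of 100 and that only the newest entries are kept once the limit is reached. The module `ChangesLimitSketch` and both of its functions are hypothetical.

```elixir
defmodule ChangesLimitSketch do
  @moduledoc """
  Hypothetical sketch, not the app's actual code: reads the per-page
  limit from the environment and trims a list of logged changes to it.
  """

  # Reads PAGE_CHANGES_LIMIT, falling back to the documented default of 100.
  # Raises if the variable is set to something that is not an integer.
  def limit do
    "PAGE_CHANGES_LIMIT"
    |> System.get_env("100")
    |> String.to_integer()
  end

  # Keeps at most `max` entries of a newest-first list (assumed ordering).
  def trim(changes, max) when is_list(changes) and is_integer(max) do
    Enum.take(changes, max)
  end
end

# Usage: with PAGE_CHANGES_LIMIT=2, only the two newest changes survive.
System.put_env("PAGE_CHANGES_LIMIT", "2")
[:newest, :newer] =
  ChangesLimitSketch.trim([:newest, :newer, :oldest], ChangesLimitSketch.limit())
```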
## To-do list
* Add a health check to the `page_scraper_selenium_chrome` Docker container
* Display live workers' status using channels/WebSockets
* Improve and finish off tests
* Improve frontend/design
* Implement a way to get page status before pulling page source
* Implement pagination
* Implement multiple Chrome sessions