https://github.com/evgeniradev/page_scraper

An Elixir-based page scraper app built on Phoenix. It detects changes on a given web page and logs them to a database.

Host: GitHub
URL: https://github.com/evgeniradev/page_scraper
Owner: evgeniradev
Created: 2019-09-18T19:47:33.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2023-01-05T22:46:29.000Z (over 3 years ago)
Last Synced: 2025-02-21T19:14:16.554Z (over 1 year ago)
Topics: elixir, phoenix, scraper
Language: Elixir
Size: 279 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 9
Metadata Files:
- Readme: README.md

# PageScraper

An Elixir-based page scraper app built on Phoenix.
It detects changes on a given web page and logs them to a database.
It uses Selenium and the Hound package to load pages in a Chrome session.
Currently, the app uses a single Chrome session, which slows polling when more than one page is polled at the same time.
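
To illustrate what "detecting a change" can mean in practice, here is a minimal sketch that fingerprints a page's source and compares it against the previously stored fingerprint. It is not taken from the PageScraper codebase: the `ChangeDetectionSketch` module and its functions are hypothetical, and the app itself may compare pages differently.

```elixir
defmodule ChangeDetectionSketch do
  # Hypothetical module for illustration only; not part of PageScraper.

  # Returns a lowercase hex SHA-256 digest of the page source.
  def fingerprint(page_source) when is_binary(page_source) do
    :sha256
    |> :crypto.hash(page_source)
    |> Base.encode16(case: :lower)
  end

  # Compares freshly fetched source against the last stored fingerprint.
  # Returns :unchanged, or {:changed, new_fingerprint} so the caller can log it.
  def check(page_source, previous_fingerprint) do
    current = fingerprint(page_source)

    if current == previous_fingerprint do
      :unchanged
    else
      {:changed, current}
    end
  end
end
```

Storing a digest rather than the full source keeps the comparison cheap, at the cost of not knowing what changed, only that something did.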

## Installation

Use [Docker](https://docs.docker.com/) to run the app.

Run the setup command below to build the containers, create a new database, and run the migrations. Note that this command drops any existing database.
```
$ ./setup.sh
```

Start the app in development mode:
```
$ ./start.sh
```

Finally, load [http://localhost](http://localhost) in your browser.

## Running the tests

```
$ ./test.sh
```

## Details
Create a `.env` file in the app's root directory to set the options below.

To specify a time zone, add the following environment variable to the file (default: `Europe/London`):
```
TZ=your_time_zone
```

To limit the number of logged changes kept per page, add the following environment variable to the file (default: `100`):
```
PAGE_CHANGES_LIMIT=100
```
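
As a rough sketch of how such a limit could be read and applied, the snippet below parses `PAGE_CHANGES_LIMIT` from the environment and falls back to 100 when the variable is missing or invalid. The `ChangesLimitSketch` module is hypothetical, and it assumes that the variable reaches the running app's environment and that the limit means "keep only the newest N changes"; the app's actual behaviour may differ.

```elixir
defmodule ChangesLimitSketch do
  # Hypothetical module for illustration only; not part of PageScraper.
  @default_limit 100

  # Reads PAGE_CHANGES_LIMIT, falling back to the default when the
  # variable is unset, empty, non-numeric, or not positive.
  def configured_limit do
    case Integer.parse(System.get_env("PAGE_CHANGES_LIMIT", "")) do
      {n, ""} when n > 0 -> n
      _ -> @default_limit
    end
  end

  # Keeps the first `limit` entries, assuming the list is ordered newest first.
  def trim(changes, limit) when is_list(changes) do
    Enum.take(changes, limit)
  end
end
```

For example, `ChangesLimitSketch.trim(changes, ChangesLimitSketch.configured_limit())` would drop everything beyond the configured number of entries.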

## To-do list

* Add a Healthcheck to the page_scraper_selenium_chrome docker container
* Display live workers' status using channels/WebSockets
* Improve and finish off tests
* Improve frontend/design
* Implement a way to get page status before pulling page source
* Implement pagination
* Implement multiple Chrome sessions
