https://github.com/tomashubelbauer/globus
Scrapes the Globus PDF catalogue using Puppeteer
globus pdf-scraping puppeteer puppeteer-firefox
- Host: GitHub
- URL: https://github.com/tomashubelbauer/globus
- Owner: TomasHubelbauer
- Created: 2019-07-23T11:36:17.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2022-04-14T20:23:48.000Z (over 2 years ago)
- Last Synced: 2024-05-02T03:55:00.845Z (8 months ago)
- Topics: globus, pdf-scraping, puppeteer, puppeteer-firefox
- Language: JavaScript
- Homepage:
- Size: 25.3 MB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Globus
Globus provides a PDF catalog of offers for a given period. This is a good exercise in PDF scraping.
Puppeteer cannot be used on its own, because it cannot navigate to a PDF. The PDF viewer in Chrome
is a native component which is not available in Chromium. One would have to use PDF.js, either inside
or outside of Puppeteer, to get the job done.
Playwright Firefox can be used instead, as Firefox uses PDF.js internally when navigating to PDF documents.
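
As a rough illustration of the "PDF.js outside of Puppeteer" route, the sketch below reads text items and their coordinates straight from a PDF file in Node. It is not code from this repository. It assumes the `pdfjs-dist` package is installed and that its Node-friendly build lives at `pdfjs-dist/legacy/build/pdf.mjs` (the path differs between versions). The names `toPositionedText` and `extractTextItems` are made up for this example.

```javascript
// Convert one PDF.js text item into a plain record with page coordinates.
// PDF.js exposes the item's text as `str` and its placement as a 6-element
// transform matrix whose last two entries are the x and y translation.
function toPositionedText(item, pageNumber) {
  return {
    page: pageNumber,
    text: item.str,
    x: item.transform[4],
    y: item.transform[5],
    width: item.width,
    height: item.height,
  };
}

// Read every non-empty text item, with coordinates, from a PDF on disk.
// Assumption: `pdfjs-dist` is installed; the import path depends on its version.
async function extractTextItems(pdfPath) {
  const { readFile } = await import('node:fs/promises');
  const pdfjs = await import('pdfjs-dist/legacy/build/pdf.mjs');
  const data = new Uint8Array(await readFile(pdfPath));
  const doc = await pdfjs.getDocument({ data }).promise;
  const items = [];
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    const page = await doc.getPage(pageNumber);
    const content = await page.getTextContent();
    for (const item of content.items) {
      // Marked-content entries carry no `str`, so they are skipped here.
      if (typeof item.str === 'string' && item.str.trim() !== '') {
        items.push(toPositionedText(item, pageNumber));
      }
    }
  }
  return items;
}
```

Note that PDF coordinates have their origin at the bottom left of the page, so `y` grows upward, unlike coordinates read from a browser.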
## To-Do
### Use Playwright Firefox to scrape the PDF instead
Like in https://github.com/TomasHubelbauer/albert
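
A minimal sketch of what this could look like, assuming the `playwright` package is installed and that the Firefox viewer renders PDF text as `span` elements inside a `.textLayer` container (as PDF.js's viewer normally does). The function names `normalizeSpan` and `scrapePdfWithFirefox` are hypothetical, and this is not taken from the linked repository.

```javascript
// Trim the text and round the coordinates of one raw span record.
function normalizeSpan(raw) {
  return {
    text: raw.text.trim(),
    x: Math.round(raw.left),
    y: Math.round(raw.top),
    width: Math.round(raw.width),
    height: Math.round(raw.height),
  };
}

// Open a PDF URL in Playwright Firefox and read the rendered text spans.
async function scrapePdfWithFirefox(pdfUrl) {
  const { firefox } = await import('playwright');
  const browser = await firefox.launch();
  try {
    const page = await browser.newPage();
    await page.goto(pdfUrl);
    // Assumption: the built-in viewer exposes text as spans in `.textLayer`.
    await page.waitForSelector('.textLayer span');
    const raw = await page.$$eval('.textLayer span', spans =>
      spans.map(span => {
        const { left, top, width, height } = span.getBoundingClientRect();
        return { text: span.textContent, left, top, width, height };
      })
    );
    return raw.map(normalizeSpan).filter(item => item.text !== '');
  } finally {
    await browser.close();
  }
}
```

The viewer renders pages lazily, so a real script would probably need to scroll through the document to get the text of every page.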
### Use text and image coordinates to group them into clusters by proximity
From those clusters, recognize the ones which look like an item and parse out data from
the texts by their position relative to one another (vertically: name, price, …;
only a handful of variations of these data points in various orders exist).
### Generate an HTML page for visualizing what the script associated in clustering
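
The clustering and its visualization could fit together roughly as sketched below. This is only an illustration under stated assumptions: items are `{ text, x, y }` records in browser-style coordinates (y grows downward), "proximity" means a plain distance threshold applied transitively, and the names `clusterByProximity`, `escapeHtml` and `renderClustersHtml` are invented for this example.

```javascript
// Group items whose anchor points lie within `maxDistance` of each other,
// transitively (single-linkage clustering via flood fill).
function clusterByProximity(items, maxDistance) {
  const clusters = [];
  const visited = new Set();
  for (let start = 0; start < items.length; start++) {
    if (visited.has(start)) continue;
    visited.add(start);
    const cluster = [];
    const queue = [start];
    while (queue.length > 0) {
      const current = queue.pop();
      cluster.push(items[current]);
      for (let other = 0; other < items.length; other++) {
        if (visited.has(other)) continue;
        const dx = items[current].x - items[other].x;
        const dy = items[current].y - items[other].y;
        if (Math.hypot(dx, dy) <= maxDistance) {
          visited.add(other);
          queue.push(other);
        }
      }
    }
    // Order texts top to bottom so name, price, etc. can be read by position.
    cluster.sort((a, b) => a.y - b.y);
    clusters.push(cluster);
  }
  return clusters;
}

// Escape characters that would otherwise be interpreted as markup.
function escapeHtml(text) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };
  return text.replace(/[&<>"]/g, ch => map[ch]);
}

// Render each cluster as an absolutely positioned box at its top-left corner.
function renderClustersHtml(clusters) {
  const boxes = clusters.map((cluster, index) => {
    const left = Math.min(...cluster.map(item => item.x));
    const top = Math.min(...cluster.map(item => item.y));
    const lines = cluster.map(item => escapeHtml(item.text)).join('<br>');
    return `<div class="cluster" style="left:${left}px;top:${top}px" title="Cluster ${index}">${lines}</div>`;
  });
  return `<!DOCTYPE html>
<style>.cluster{position:absolute;border:1px solid red;padding:2px;font:12px sans-serif}</style>
${boxes.join('\n')}`;
}
```

The returned string can be written to an `.html` file and opened in a browser to check whether the boxes line up with the items in the catalog. A suitable `maxDistance` would have to be found by experiment.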
### Set up GitHub Actions and run the extractor in a workflow using a scheduled trigger