Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/blvckbytes/digiscrapper
Multithreaded scraping tool for downloading currently free schoolbooks from a digital library.
Last synced: 21 days ago
- Host: GitHub
- URL: https://github.com/blvckbytes/digiscrapper
- Owner: BlvckBytes
- Created: 2020-03-25T19:57:46.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2022-09-01T23:22:21.000Z (over 2 years ago)
- Last Synced: 2023-03-04T10:46:08.708Z (almost 2 years ago)
- Language: Java
- Size: 94.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# DigiScrapper
## Disclaimer
All of the books I scraped are available for **free** at the current time (25.03.2020), so everything I downloaded for myself (**not** for sharing purposes) is completely legal. If you come across this at a later point in time, the offer of free books might no longer be active, or the whole site's architecture might have changed, and thus my code won't work anymore.

## Short and simple - What is it?
It's a multithreaded scraper tool which allowed me to download the whole library off of *digi4school.at* in about 2.5 days, yielding over 2500 books.

## Introduction
There is an online schoolbook library called *Digi4School* hosted at [digi4school.at](https://digi4school.at), which only contains books in the German language. So, if you don't understand German, the books will probably be of little interest to you. Still, the code shows how I managed to scrape books automatically, so you might want to check it out.

## How did it come to this?
The platform released all books available for free because of the current *COVID-19 situation*, which in my opinion is a very kind act. In order to maybe find a few new interesting books about IT, I decided to scrape all books in an automated process, to have a little something to read during this isolation period. The site offers a search bar which needs at least three letters to yield results, so I loop over all combinations of a three-letter lowercase string and put all token links into a map, which automatically deduplicates the keys. This is what it looks like:
![Searchbar](readme_images/searchbar.png)
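A minimal sketch of that token enumeration, assuming a hypothetical search endpoint (`SEARCH_URL`) and a hypothetical regex for the token links — the real query parameters and response markup would have to be taken from the site itself:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenCollector {

    // Hypothetical search endpoint and link pattern; the real values depend
    // on the site's markup at the time of scraping.
    private static final String SEARCH_URL = "https://digi4school.at/br/suche?title=";
    private static final Pattern TOKEN_LINK = Pattern.compile("/token/([a-f0-9]+)");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Map from token ID to the query that found it; duplicate tokens
        // returned by different queries simply collapse onto the same key.
        Map<String, String> tokens = new LinkedHashMap<>();

        // Enumerate every three-letter lowercase query: aaa, aab, ..., zzz
        for (char a = 'a'; a <= 'z'; a++) {
            for (char b = 'a'; b <= 'z'; b++) {
                for (char c = 'a'; c <= 'z'; c++) {
                    String query = "" + a + b + c;
                    HttpRequest request = HttpRequest
                            .newBuilder(URI.create(SEARCH_URL + query))
                            .GET().build();
                    String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

                    // Collect every token link found in the result page
                    Matcher m = TOKEN_LINK.matcher(body);
                    while (m.find())
                        tokens.putIfAbsent(m.group(1), query);
                }
            }
        }

        System.out.println("Unique tokens found: " + tokens.size());
    }
}
```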
## How it works

As I've already described above: I use the search bar to get a unique list of all tokens. A token gets used like this: *https://digi4school.at/token/tokenID*. Once I had a CSV with the format tokenID;Booktitle, I started downloading all pages. A page on this platform is an SVG vector graphic with embedded image tags for images and shadows. At the time of writing, there are **2578** books available, which resulted in a total of **211GB** of downloaded files.
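The CSV then drives the actual download. Below is a rough sketch of how the multithreaded part could be structured; the per-page URL pattern and the fixed pool size are assumptions, and the session/LTI handling described in the next section is omitted for brevity:

```java
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BookDownloader {

    public static void main(String[] args) throws Exception {
        // CSV produced by the token collection step: tokenID;Booktitle
        List<String> lines = Files.readAllLines(Path.of("books.csv"));

        // One worker per book keeps the pipeline busy without opening an
        // unbounded number of connections at once.
        ExecutorService pool = Executors.newFixedThreadPool(8);

        for (String line : lines) {
            String[] parts = line.split(";", 2);
            String tokenId = parts[0];
            String title = parts[1];
            pool.submit(() -> downloadBook(tokenId, title));
        }

        pool.shutdown();
    }

    static void downloadBook(String tokenId, String title) {
        try {
            Path dir = Files.createDirectories(Path.of("books", title.replaceAll("[^\\w ]", "_")));

            // Placeholder: the real last page number is parsed from the
            // book's navigator (see the next section).
            int lastPage = 1;

            for (int page = 1; page <= lastPage; page++) {
                // Hypothetical per-page URL pattern; each page is one SVG.
                try (InputStream in = URI
                        .create("https://a.digi4school.at/ebook/" + tokenId + "/" + page + ".svg")
                        .toURL().openStream()) {
                    Files.copy(in, dir.resolve(page + ".svg"));
                }
            }
        } catch (Exception e) {
            System.err.println("Failed to download " + title + ": " + e.getMessage());
        }
    }
}
```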
### Redeeming a token

Since this offer is anonymous, you don't need to log in or register in order to use it. When you call the token URL, it creates a session for you on which the token gets activated; it's probably a temporary one. So I read out the session data from the headers and keep it in my program for all further processing. Before this session is opened, you have to pass a 2-stage LTI confirmation, which basically is a *display: none;* form and a script tag which posts it to a given URL. Easy to do in Java, no issue. Once the token is activated, I parse out the last page number from the navigator on the frontend and then just loop from 1 to *
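A hedged sketch of that redemption step, not the original implementation: open the token URL, keep the anonymous session cookie, and re-post the hidden form fields to their action URL. The two-hop loop, the form structure, and the use of Jsoup for HTML parsing are all assumptions:

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TokenRedeemer {

    public static void main(String[] args) throws Exception {
        // The CookieManager keeps the anonymous session cookie that the token
        // gets activated on, so it is sent with every follow-up request.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();

        String tokenId = "0123456789abcdef"; // placeholder token ID from the CSV
        String html = get(client, "https://digi4school.at/token/" + tokenId);

        // The LTI confirmation is a hidden form that a script would normally
        // auto-submit; here its fields are re-posted manually. Two hops are
        // assumed, matching the "2-stage" confirmation described above.
        for (int stage = 0; stage < 2; stage++)
            html = submitHiddenForm(client, html);

        // 'html' should now be the book viewer; the last page number would be
        // parsed out of its navigator before looping over all pages.
        System.out.println("Viewer page size: " + html.length());
    }

    static String get(HttpClient client, String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    static String submitHiddenForm(HttpClient client, String html) throws Exception {
        Document doc = Jsoup.parse(html);        // org.jsoup:jsoup dependency
        Element form = doc.selectFirst("form");

        // Re-encode all input fields of the hidden form as a urlencoded body
        String body = form.select("input").stream()
                .map(i -> URLEncoder.encode(i.attr("name"), StandardCharsets.UTF_8) + "="
                        + URLEncoder.encode(i.attr("value"), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));

        // Note: a relative form action would need to be resolved against the
        // page URL before building the request.
        HttpRequest req = HttpRequest.newBuilder(URI.create(form.attr("action")))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```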