Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/wragge/trove-newspapers
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/wragge/trove-newspapers
- Owner: wragge
- License: gpl-3.0
- Created: 2011-12-13T01:31:29.000Z (about 13 years ago)
- Default Branch: master
- Last Pushed: 2012-10-03T23:11:03.000Z (over 12 years ago)
- Last Synced: 2024-05-02T00:21:49.758Z (8 months ago)
- Language: JavaScript
- Homepage:
- Size: 563 KB
- Stars: 7
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README-DO_HARVEST.txt
- License: LICENSE.txt
README
===============================================================================
USING THE HARVESTER
===============================================================================

QUICK START:
1. Open the harvest.ini file in a text editor
2. Insert your harvest options as indicated and save the file.
3. Run do_harvest.py (double click in Windows)

IN MORE DETAIL
Using the do_harvest.py script you can initiate a harvest of Trove's newspaper database.
Depending on the configuration options you supply the script creates:
* a CSV file containing the details of articles - [your filename]
* a zip containing the text contents of articles - [your filename]_text.zip
* a zip containing pdfs of articles - [your filename]_pdf.zip

The script receives its configuration values either from the command line, or by reading the
harvest.ini file.

SETTING HARVEST.INI
The harvest.ini file is well-documented; just enter the required values where indicated.

Once harvest.ini is set you can simply run do_harvest.py. In Windows you can just double click it.
In Linux you'll probably need to cd to the directory containing the script and then run it from the
terminal:

python do_harvest.py

RUNNING FROM COMMAND LINE
The script can be run from the command line with the following arguments:

-q (or --query) [full url of Trove newspapers search]
-f (or --filename) [file and path name for the CSV output]
-t (or --text) Create a zip file containing the text of articles
-p (or --pdf) Create a zip file containing pdfs of articles
-s (or --start) The result number to start at.

Example:

python do_harvest.py -q http://trove.nla.gov.au/newspaper/result?exactPhrase=inclement+wragge -f /home/wragge/trove-output.csv -t -p

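As a rough illustration of how the flags listed above fit together, here is a minimal argparse sketch. It is not taken from do_harvest.py itself: the function name build_parser and the choice to leave --start unset by default are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser for the flags described in this README (illustrative only)."""
    parser = argparse.ArgumentParser(prog="do_harvest.py")
    parser.add_argument("-q", "--query",
                        help="full url of Trove newspapers search")
    parser.add_argument("-f", "--filename",
                        help="file and path name for the CSV output")
    parser.add_argument("-t", "--text", action="store_true",
                        help="create a zip file containing the text of articles")
    parser.add_argument("-p", "--pdf", action="store_true",
                        help="create a zip file containing pdfs of articles")
    # Assumption: no default start value is known, so it is left as None here.
    parser.add_argument("-s", "--start", type=int, default=None,
                        help="the result number to start at")
    return parser


# Parse the same arguments as the example command, passed as a list.
args = build_parser().parse_args([
    "-q", "http://trove.nla.gov.au/newspaper/result?exactPhrase=inclement+wragge",
    "-f", "/home/wragge/trove-output.csv",
    "-t", "-p",
])
print(args.query, args.filename, args.text, args.pdf, args.start)
```

When typing the command in a shell, it may be safer to put the search url in quotes, since some shells treat the ? character as a wildcard.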
If you're using Windows you'll have to make sure that the location of your Python
installation is included in your Windows path variable.

RESTARTING A FAILED HARVEST
If for some reason a harvest fails, you can restart it where it left off.
In most cases, the script will write an error file ([your filename]_error.txt),
explaining what happened and telling you what to do next.

This error file will include the number of the last completed record.
Simply insert this as the 'start' value in harvest.ini (or include it on the command line
with the -s flag).

If for some reason the error file wasn't created, open up the CSV file and look at the
last row number. Use this value minus one as the start value for the script. This will
ensure any text and pdf files are properly saved. You might also want to delete the last
row of the CSV to avoid duplication.
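If you'd rather not open the CSV by hand, the manual steps above could be sketched in Python as follows. This is only a sketch and is not part of the harvester: the function names restart_value and drop_last_row are hypothetical, and it assumes that "row number" means the row's position in the file, counting from 1 as a spreadsheet would show it.

```python
import csv


def restart_value(csv_path):
    """Return the number of the last row in the CSV, minus one.

    Assumption: rows are numbered by position in the file, starting at 1.
    """
    with open(csv_path, newline="") as handle:
        last_row_number = sum(1 for _ in csv.reader(handle))
    # Never return a negative start value for an empty file.
    return max(last_row_number - 1, 0)


def drop_last_row(csv_path):
    """Rewrite the CSV without its last row, to avoid a duplicate on restart."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
    with open(csv_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows[:-1])
```

Call restart_value first and note the result, then call drop_last_row, since removing the row changes the count. It is worth keeping a copy of the CSV before rewriting it.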