Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/thomasgpadilla/webscraping
sample scripts for webscraping
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/thomasgpadilla/webscraping
- Owner: thomasgpadilla
- License: cc0-1.0
- Created: 2015-12-29T22:55:25.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2016-08-10T17:24:47.000Z (over 8 years ago)
- Last Synced: 2024-08-03T01:38:37.618Z (6 months ago)
- Language: Python
- Homepage:
- Size: 34.2 KB
- Stars: 2
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# webscraping
Sample scripts for web scraping.
Typically tied to use cases, but generally extensible to other projects.
Scripts help build URL lists to feed to wget for responsible bulk downloading (e.g. with rate limiting).
**fdr_item_urls**: scrape item URLs matching a given type from the Franklin D. Roosevelt Master Speech File and write the URLs to a TXT or CSV file
**frus_section_pdf_urls**: scrape all volume URLs from the University of Wisconsin-Madison's Foreign Relations of the United States, follow each volume URL, navigate down to each section of each volume, scrape the PDF URLs, and write them to a TXT file
**frus_section_parent_volume**: scrape all volume URLs from the University of Wisconsin-Madison's Foreign Relations of the United States, follow each volume URL, navigate down to each section, scrape the title field (= parent volume of the section), and write it to a TXT file
**frus_section_title**: scrape all volume URLs from the University of Wisconsin-Madison's Foreign Relations of the United States, follow each volume URL, navigate down to each section of each volume, and scrape the itemmd field (= title of the section)
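The general pattern these scripts share (collect matching links from a page, then write one URL per line for wget) might be sketched as follows. This is a minimal illustration using only the Python standard library, not the code in this repository; the names `LinkCollector`, `extract_urls`, and `write_url_list`, the `.pdf` suffix filter, and the example base URL in the usage note are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags ending in a given suffix.

    Hypothetical helper for illustration; not part of the original scripts.
    """

    def __init__(self, base_url, suffix):
        super().__init__()
        self.base_url = base_url
        self.suffix = suffix.lower()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Case-insensitive suffix match, e.g. ".pdf" also matches ".PDF".
        if href and href.lower().endswith(self.suffix):
            # Resolve relative links against the page they were found on.
            self.urls.append(urljoin(self.base_url, href))


def extract_urls(html, base_url, suffix=".pdf"):
    """Return matching URLs from an HTML string, de-duplicated in page order."""
    parser = LinkCollector(base_url, suffix)
    parser.feed(html)
    parser.close()
    return list(dict.fromkeys(parser.urls))


def write_url_list(urls, path):
    """Write one URL per line, the format wget reads with --input-file."""
    with open(path, "w", encoding="utf-8") as fh:
        for url in urls:
            fh.write(url + "\n")
```

Given a page's HTML as a string, `write_url_list(extract_urls(html, "http://example.org/vol/"), "urls.txt")` would produce a file that can be downloaded politely with, for example, `wget --wait=2 --random-wait --input-file=urls.txt`.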
contact: [email protected]