https://github.com/daviddavo/blogspot-crawler
Crawler for blogspot and blogger with beautifulsoup
crawler hacktoberfest python
- Host: GitHub
- URL: https://github.com/daviddavo/blogspot-crawler
- Owner: daviddavo
- License: mit
- Created: 2021-04-03T11:37:51.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2021-04-03T18:22:07.000Z (about 5 years ago)
- Last Synced: 2025-03-17T06:21:56.378Z (over 1 year ago)
- Topics: crawler, hacktoberfest, python
- Language: Python
- Homepage:
- Size: 4.88 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Blogspot Crawler
A simple crawler that uses Beautiful Soup 4 and requests to obtain every post
in a Blogger/Blogspot website by following the "next page" link.
It downloads only the post body, in HTML format, and creates a Jekyll file
with the title and tags. As a result, the entire blog dump is very small.
In the future, it will download images and jekyllify the HTML output.
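To illustrate the Jekyll file described above, here is a minimal sketch of how a post's title, tags, and HTML body could be combined into a file with YAML front matter. The function name `render_jekyll_post` and the front matter keys are assumptions made for illustration; they are not taken from the project's source.

```python
import json


def render_jekyll_post(title, tags, body_html):
    """Return the text of a Jekyll post: YAML front matter, then the HTML body.

    Hypothetical helper, not part of blogspotCrawler.py.
    """
    # JSON double-quoted strings and arrays are accepted as YAML flow values,
    # so json.dumps gives safe quoting for titles containing ':' or quotes.
    front_matter = [
        "---",
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"tags: {json.dumps(list(tags), ensure_ascii=False)}",
        "---",
    ]
    return "\n".join(front_matter) + "\n" + body_html.strip() + "\n"


if __name__ == "__main__":
    # Sample values only; a real run would take these from the crawled page.
    print(render_jekyll_post("Hello: world", ["python", "crawler"], "<p>First post</p>"))
```

Keeping the body as raw HTML below the front matter matches the behaviour described here: Jekyll passes HTML through unchanged, so no conversion step is needed for the dump to be usable.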
## Usage
Pass the blog URL and, optionally, a destination folder. Posts should be saved under a path that matches their URL without the basename.
```
usage: ./blogspotCrawler.py [-h] [-o DESTINATION] url

Blogspot crawler

positional arguments:
  url                   Blog url

optional arguments:
  -h, --help            show this help message and exit
  -o DESTINATION, --output DESTINATION
                        Output folder
```
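The help text above is the kind of output `argparse` generates. As a rough sketch, a parser along these lines would expose the same interface; the function name `build_parser`, the default output folder, and the example URL are assumptions, not taken from the project's source.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented interface (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Blogspot crawler")
    parser.add_argument("url", help="Blog url")
    parser.add_argument(
        "-o", "--output",
        dest="destination",  # shows up as DESTINATION in the help text
        default=".",         # assumed default; the README does not state one
        help="Output folder",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is
# self-contained; the URL is a placeholder.
args = build_parser().parse_args(["-o", "dump", "https://example.blogspot.com"])
print(args.url, args.destination)
```

On the command line, the equivalent call would be `./blogspotCrawler.py -o dump https://example.blogspot.com`.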
## Ideas for the future
- [ ] Quietly process ReadTimeout exceptions on future callback
- [ ] Auto download images
- [ ] Jekyllify output
- [ ] Add WordPress support
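For the first idea in the list, one possible approach is a done-callback that inspects `Future.exception()` and logs timeouts at a low level instead of letting a traceback surface. This is only a sketch: `make_quiet_callback` and `slow_fetch` are hypothetical names, and the built-in `TimeoutError` stands in for `requests.exceptions.ReadTimeout` so that the example runs without network access.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("crawler-sketch")


def make_quiet_callback(quiet_exceptions, skipped):
    """Return a done-callback that records the given exception types quietly."""
    def callback(future):
        if future.cancelled():
            return
        exc = future.exception()  # returns None if the task succeeded
        if exc is None:
            return
        if isinstance(exc, quiet_exceptions):
            # Expected failure: note it at debug level and move on.
            log.debug("skipping after timeout: %s", exc)
            skipped.append(exc)
        else:
            log.error("download failed: %r", exc)
    return callback


def slow_fetch(url):
    # Stand-in for a requests.get(url, timeout=...) call that timed out.
    raise TimeoutError(f"read timed out: {url}")


skipped = []
with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(slow_fetch, "https://example.blogspot.com/")
    # In the crawler, requests.exceptions.ReadTimeout would be passed here.
    future.add_done_callback(make_quiet_callback((TimeoutError,), skipped))
```

Collecting the skipped exceptions in a list makes it possible to report or retry the affected pages after the pool has finished.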
-----------------
This program is licensed under the MIT License.
(C) 2021 [David Davó](https://ddavo.me/en)