https://github.com/daviddavo/blogspot-crawler

Crawler for blogspot and blogger with beautifulsoup

Host: GitHub
URL: https://github.com/daviddavo/blogspot-crawler
Owner: daviddavo
License: mit
Created: 2021-04-03T11:37:51.000Z (about 5 years ago)
Default Branch: main
Last Pushed: 2021-04-03T18:22:07.000Z (about 5 years ago)
Last Synced: 2025-03-17T06:21:56.378Z (over 1 year ago)
Topics: crawler, hacktoberfest, python
Language: Python
Homepage:
Size: 4.88 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Blogspot Crawler

A simple crawler that uses Beautiful Soup 4 and requests to fetch every post
from a Blogger/Blogspot website by following the "next page" link.
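
As a rough illustration of the pagination approach, here is a minimal sketch and not the project's actual code. The CSS selectors (`.post`, `.post-title`, `.post-body`, `a.blog-pager-older-link`) are assumptions based on common Blogger templates, and `fetch_html` and `crawl_posts` are hypothetical names.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def crawl_posts(start_url, fetch=fetch_html):
    """Yield (title, body_html) for each post, following "older posts" links."""
    url = start_url
    seen = set()  # guard against pagination loops
    while url and url not in seen:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        for post in soup.select(".post"):  # assumed post container class
            title = post.select_one(".post-title")
            body = post.select_one(".post-body")
            if body is None:
                continue
            yield (
                title.get_text(strip=True) if title else "",
                body.decode_contents(),  # inner HTML of the post body
            )
        # Assumed selector for the "next page" link in Blogger templates.
        older = soup.select_one("a.blog-pager-older-link")
        url = urljoin(url, older["href"]) if older and older.get("href") else None
```

Passing `fetch` as a parameter keeps the loop testable without network access; a real blog may need different selectors.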

It downloads only the post body, as HTML, and creates a Jekyll file
with the title and tags. As a result, the entire blog dump is very small.
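
One way such a Jekyll file could be written is sketched below. This is an assumption about the output layout, not the project's confirmed format: `write_jekyll_post`, the `slug` parameter, and the `.html` extension are all hypothetical.

```python
import json
from pathlib import Path


def write_jekyll_post(dest_dir, slug, title, tags, body_html):
    """Write one post as an HTML file with YAML front matter; return its path."""
    front_matter = "\n".join([
        "---",
        # A JSON string or list is also valid YAML flow syntax, so json.dumps
        # gives safe quoting for titles that contain colons or quotes.
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"tags: {json.dumps(tags, ensure_ascii=False)}",
        "---",
    ])
    path = Path(dest_dir) / f"{slug}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{front_matter}\n{body_html}\n", encoding="utf-8")
    return path
```

Because only the title, tags, and post body are stored, each file stays close to the size of the post text itself.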

In the future, it will download images and jekyllify the HTML output.

## Usage
Pass the blog URL and a destination folder. Posts should be saved under their URL without the basename.

```
usage: ./blogspotCrawler.py [-h] [-o DESTINATION] url

Blogspot crawler

positional arguments:
  url                   Blog url

optional arguments:
  -h, --help            show this help message and exit
  -o DESTINATION, --output DESTINATION
                        Output folder
```
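
For reference, a command-line interface with this shape can be declared with `argparse` roughly as follows. This is a sketch reconstructed from the help text above, not the script's actual source; `build_parser` is a hypothetical name and the default output folder is an assumption.

```python
import argparse


def build_parser():
    """Build a parser matching the documented options."""
    parser = argparse.ArgumentParser(
        prog="./blogspotCrawler.py", description="Blogspot crawler"
    )
    parser.add_argument("url", help="Blog url")
    parser.add_argument(
        "-o", "--output",
        dest="destination",
        metavar="DESTINATION",
        default=".",  # assumed default; the help text does not state one
        help="Output folder",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url, args.destination)
```

An invocation such as `./blogspotCrawler.py -o dump https://example.blogspot.com` would then yield `args.destination == "dump"`.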

## Ideas for the future
- [ ] Quietly process ReadTimeout exceptions on future callback
- [ ] Auto download images
- [ ] Jekyllify output
- [ ] Add WordPress support

-----------------
This program is licensed under the MIT License.