https://github.com/schyuler/chamorro-blog-content-scraper
A project to scrape and process Chamorro language blog content to support language preservation, analysis and revitalization efforts.
- Host: GitHub
- URL: https://github.com/schyuler/chamorro-blog-content-scraper
- Owner: schyuler
- Created: 2024-11-05T16:51:09.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-06T19:10:34.000Z (over 1 year ago)
- Last Synced: 2025-06-14T22:43:42.718Z (8 months ago)
- Topics: webscraping, webscraping-beautifulsoup, webscraping-projects, webscraping-python
- Language: HTML
- Homepage:
- Size: 3.27 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Web-Scraper-for-Blogger-Blog
This is a Python notebook for scraping the posts from a Blogger blog, extracting their content, and saving it to an HTML file that is ready to be converted to EPUB. The notebook uses the Beautiful Soup and Requests libraries to fetch and parse the HTML of each post. This project scrapes the Paleric blog.
## About the Paleric blog
The Paleric blog, which can be found at https://paleric.blogspot.com/, is written by Father Eric Forbes, a priest who has spent his life and ministry in the Mariana Islands. His time there as a priest has given him unique insight into the culture of our islands and helped him become fluent in the Chamorro language. He writes about Chamorro culture and language on his blog, including stories written in Chamorro with English translations. His experience and expertise, together with the blog's accessibility, have made it a crucial educational resource on these topics.
## About the Chamorro language and culture
Chamorro, Chamoru, or CHamoru is the name of the indigenous people and indigenous language of the Mariana Islands, which are located in the Western Pacific Ocean. These islands are among the last remaining colonies in the world, currently colonized by the United States, and are counted among the 17 Non-Self-Governing Territories identified by the United Nations. The Chamorro language is listed as endangered after decades of systematic suppression by the United States. As a result of this decline, most of our native speakers are elderly (usually over 60 years old, with the most fluent speakers in their 80s and above), and the younger generations cannot speak, read, or write the language. As the native speakers continue to pass away, our people risk losing our culture and language.
## Reasoning for this project
The current status of the Chamorro language means that learning materials are scarce, and access to those materials is often limited, either because they lack English translations or because access is restricted to a privileged few. This makes the Paleric blog one of the few Chamorro language and cultural education resources that is freely available, easily accessible, and friendly to language learners. Scraping the blog content and compiling it into a single document, which can then be converted into other formats (e.g., PDF or EPUB), preserves this content offline for learners and gives them greater ease and flexibility in using it to support their language learning.
## Benefits of this project
This project offers specific benefits to students of the Chamorro language and culture. With the scraped Paleric blog content, learners can:
1) Use the output as a corpus to verify how Chamorro words are properly used
2) Easily mark words and phrases for later review
3) Incorporate other interactive tools, such as a built-in Kindle dictionary
4) Add their own annotations directly to the text
This project can also provide a template for students/learners to easily access and format other text content on the internet, for additional analysis and research opportunities.
## Features
- Scrapes the URLs of all the blog posts
- Extracts the post title, post date, and post content
- Removes images from the posts
- Preserves special characters
- Formats the content as clean, readable HTML
- Output is ready for EPUB conversion using tools like Calibre
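
The URL-collection feature could be sketched roughly as follows. This is not the notebook's actual code: the function name `collect_post_links`, the sample links, and the assumption that post permalinks follow Blogger's usual `/YYYY/MM/slug.html` pattern are all illustrative.

```python
import re
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

# Assumption: post permalinks look like /YYYY/MM/slug.html
POST_PATH = re.compile(r"^/\d{4}/\d{2}/[^/]+\.html$")


def collect_post_links(page_html, base_url):
    """Return the unique post URLs found in one page of HTML, in page order."""
    soup = BeautifulSoup(page_html, "html.parser")
    base_host = urlparse(base_url).netloc
    seen = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])
        parts = urlparse(absolute)
        # Skip links to other hosts and links that are not post permalinks
        if parts.netloc != base_host or not POST_PATH.match(parts.path):
            continue
        # Drop query strings and fragments such as #comments
        clean = f"{parts.scheme}://{parts.netloc}{parts.path}"
        if clean not in seen:
            seen.append(clean)
    return seen


# Hypothetical sample markup; a real run would pass HTML fetched with requests
sample = """
<a href="https://paleric.blogspot.com/2024/01/example-post.html">Post</a>
<a href="/2024/01/example-post.html#comments">Comments</a>
<a href="/search/label/Stories">Label</a>
<a href="https://example.com/2024/01/other.html">Elsewhere</a>
"""
links = collect_post_links(sample, "https://paleric.blogspot.com/")
print(links)
```

Filtering on the path pattern keeps label, search, and archive links out of the result, and stripping fragments avoids counting a post twice when a page also links to its comments.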
## Requirements
- Python 3.11.7
- Libraries: `BeautifulSoup` (the `beautifulsoup4` package), `requests`
- Jupyter Notebook
## Usage
**Open the Jupyter Notebook:** Open the `.ipynb` file containing the code in Jupyter Notebook or Jupyter Lab.
**Run the Cells:** Execute each cell in sequence, or click Cell > Run All to run the entire notebook.
**Output:** The notebook will save the content of all blog posts (without images) to a file named `palericblog.html` in the same directory as the notebook. This can be readily converted to other e-book formats, such as EPUB.
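
As a rough illustration of the output step, the sketch below joins a list of posts into one UTF-8 HTML document and writes it to `palericblog.html`. The `build_document` helper, the dictionary keys, and the placeholder post are assumptions made for this example, not the notebook's own code.

```python
from html import escape
from pathlib import Path


def build_document(posts, title="Paleric blog"):
    """Combine posts into one HTML string.

    Each post is a dict with 'title' and 'date' (plain text)
    and 'body' (an HTML fragment).
    """
    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        f"<title>{escape(title)}</title></head><body>",
    ]
    for post in posts:
        parts.append(f"<h1>{escape(post['title'])}</h1>")
        parts.append(f"<p><em>{escape(post['date'])}</em></p>")
        parts.append(post["body"])  # already HTML, so it is not escaped
    parts.append("</body></html>")
    return "\n".join(parts)


# Placeholder post for illustration only
posts = [
    {
        "title": "Håfa Adai",
        "date": "Monday, January 1, 2024",
        "body": "<p>Si Yu'os ma'åse'.</p>",
    }
]
document = build_document(posts)
Path("palericblog.html").write_text(document, encoding="utf-8")
```

Declaring the charset in the document and writing the file as UTF-8 is one way to keep Chamorro characters such as `å` intact when the file is later opened or converted.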
## Notes
**HTML Structure:** This notebook assumes the following:
- The main blog text is contained within an element with the class `post-body entry-content`
- The blog title is contained within an element with the class `post-title entry-title`
- The blog date is contained within an element with the class `date-header`
Make sure to update the class names in the code if the target blog or website uses a different structure.
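
The structure assumptions above can be expressed as CSS class selectors in Beautiful Soup. The following is a minimal sketch rather than the notebook's code: `extract_post` is a hypothetical helper, and the tag names in the sample markup (`h2`, `h3`, `div`) are guesses, which is why the selectors match on class only.

```python
from bs4 import BeautifulSoup


def extract_post(page_html):
    """Return the title, date, and image-free body HTML of one post page."""
    soup = BeautifulSoup(page_html, "html.parser")

    def text_of(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node is not None else ""

    body = soup.select_one(".post-body.entry-content")
    if body is not None:
        # Remove every image from the post body
        for image in body.find_all("img"):
            image.decompose()

    return {
        "title": text_of(".post-title"),
        "date": text_of(".date-header"),
        "body": body.decode_contents().strip() if body is not None else "",
    }


# Hypothetical markup for illustration; tag names are assumptions
sample = """
<h2 class="date-header"><span>Monday, January 1, 2024</span></h2>
<h3 class="post-title entry-title">Håfa Adai</h3>
<div class="post-body entry-content"><p>Si Yu'os ma'åse'.</p><img src="photo.jpg"></div>
"""
post = extract_post(sample)
print(post["title"], "|", post["date"])
print(post["body"])
```

If a target blog uses different class names, only the three selector strings need to change.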
**EPUB Conversion:** The resulting file `palericblog.html` can be converted to EPUB using a converter such as Calibre.
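
For readers who prefer to script the conversion, one possible approach is to call Calibre's `ebook-convert` command-line tool from Python. This is only a sketch: it assumes Calibre is installed with `ebook-convert` on the `PATH`, and the output name `palericblog.epub` and both helper functions are illustrative.

```python
import shutil
import subprocess


def epub_command(html_path="palericblog.html", epub_path="palericblog.epub"):
    """Build the argument list for Calibre's ebook-convert tool."""
    return ["ebook-convert", html_path, epub_path]


def convert_to_epub(html_path="palericblog.html", epub_path="palericblog.epub"):
    """Run the conversion; return False if ebook-convert is not on PATH."""
    if shutil.which("ebook-convert") is None:
        return False
    # check=True raises CalledProcessError if the conversion fails
    subprocess.run(epub_command(html_path, epub_path), check=True)
    return True
```

Calling `convert_to_epub()` from the notebook's directory after the scrape finishes would then produce the EPUB alongside the HTML file, provided Calibre is available.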