Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sakan811/find-common-japanese-character-from-news
Showcase visualizations about common Japanese characters that appear in the news
- Host: GitHub
- URL: https://github.com/sakan811/find-common-japanese-character-from-news
- Owner: sakan811
- License: mit
- Created: 2024-05-25T18:42:12.000Z (8 months ago)
- Default Branch: master
- Last Pushed: 2024-07-30T14:34:32.000Z (6 months ago)
- Last Synced: 2024-07-30T18:14:04.964Z (6 months ago)
- Topics: beautifulsoup, beautifulsoup4, data-analysis, dataanalysis, japanese, japanese-language, language, news, powerbi, requests, sqlite, sqlite3, visualization, webscraper, webscraping
- Language: Python
- Homepage:
- Size: 23.5 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Common Japanese Morphemes in News
Visualizations and code for finding the most common Japanese morphemes in news articles.
Morphemes are the smallest units of meaning in a language.
Data was collected from NHK News (https://www3.nhk.or.jp).
Data collection period: 25 May 2024 - 4 July 2024
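To make "common morphemes" concrete, here is a minimal sketch of the counting step. It is not taken from this repo: the two sentences are segmented by hand for illustration (a real morphological analyzer may split them differently), and `count_morphemes` is a hypothetical helper name.

```python
from collections import Counter

# Hand-segmented sentences, for illustration only; a real morphological
# analyzer may split these differently.
segmented_sentences = [
    ["私", "は", "日本語", "を", "勉強", "し", "ます"],
    ["私", "は", "本", "を", "読み", "ます"],
]


def count_morphemes(sentences):
    """Count how often each morpheme appears across all sentences."""
    counts = Counter()
    for morphemes in sentences:
        counts.update(morphemes)
    return counts


morpheme_counts = count_morphemes(segmented_sentences)

# Print the four most frequent morphemes with their counts.
for morpheme, count in morpheme_counts.most_common(4):
    print(morpheme, count)
```

Particles such as は and を dominate even in this tiny sample, which is typical of frequency counts over running text.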
# Status
#### Common Japanese Morphemes in News: 🎉 **Project Completed** 🎉
[![CodeQL](https://github.com/sakan811/Find-Common-Japanese-Words-From-News/actions/workflows/codeql.yml/badge.svg)](https://github.com/sakan811/Find-Common-Japanese-Words-From-News/actions/workflows/codeql.yml)
[![Scraper Test](https://github.com/sakan811/Find-Common-Japanese-Words-From-News/actions/workflows/scraper-test.yml/badge.svg)](https://github.com/sakan811/Find-Common-Japanese-Words-From-News/actions/workflows/scraper-test.yml)
[![Daily News Scraper](https://github.com/sakan811/Find-Common-Japanese-Words-From-News/actions/workflows/daily-news-scraper.yml/badge.svg)](https://github.com/sakan811/Find-Common-Japanese-Words-From-News/actions/workflows/daily-news-scraper.yml)
# Latest Update
**Common Japanese Morphemes in News** Latest Update: 30 July 2024
# Visualizations
[Common Japanese Morphemes in News](#common-japanese-morphemes-in-news):
* Visualizations Latest Update: 13 October 2024
* [Tableau](https://public.tableau.com/views/jp-news/Top10Morphemes?:language=th-TH&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link)
* [Instagram](https://www.instagram.com/p/DBEHq0OPOA-/?utm_source=ig_web_copy_link&igsh=MzRlODBiNWFlZA==)
* [Facebook](https://www.facebook.com/share/p/H8ZACWLTbCZeY5ur/)
# Data
Located in the [data](data) folder.
## [jp_morpheme_data_from_news_as_of_2024-07-04.parquet](data%2Fjp_morpheme_data_from_news_as_of_2024-07-04.parquet)
Contains Japanese morpheme data collected from the NHK News website.
Total morphemes collected: 1,015,285
## [news_url_data_from_nhk_as_of_2024-07-04.parquet](data%2Fnews_url_data_from_nhk_as_of_2024-07-04.parquet)
Contains the URLs of the news articles that the morphemes were collected from.
Total URLs collected: 896
To view a source article, append the URL from this file to https://www3.nhk.or.jp.
For example: https://www3.nhk.or.jp/news/html/20240523/k10014458551000.html
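Rebuilding a full article link can be sketched with the standard library as follows. This assumes the file stores site-relative paths such as `/news/html/20240523/k10014458551000.html`; the stored format is not confirmed here, and `to_full_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

NHK_BASE_URL = "https://www3.nhk.or.jp"


def to_full_url(stored_path: str) -> str:
    """Join a stored, site-relative path onto the NHK base URL."""
    return urljoin(NHK_BASE_URL, stored_path)


# Assumed stored value; the path is taken from the example URL above.
full_url = to_full_url("/news/html/20240523/k10014458551000.html")
print(full_url)
```

`urljoin` is used rather than plain string concatenation so that a stored path with or without a leading slash produces the same result.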
# How to Web-Scrape Japanese News to Extract Japanese Morphemes
- Clone this repo: https://github.com/sakan811/Find-Common-Japanese-Character-From-News.git
- Go to [main.py](main.py)
- Adjust the SQLite database name as needed
```python
sqlite_db = 'japan_news_test.db' # adjust as needed
```
- Run the script:
```bash
python main.py
```
# [automated_news_scraper.py](automated_news_scraper.py)
Scrapes data from NHK News daily, automated with GitHub Actions.
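Once a scrape has populated the SQLite database configured in `main.py`, the most frequent morphemes can be pulled out with a `GROUP BY` query. The sketch below is an assumption, not this repo's code: the table name `morphemes`, its single `morpheme` column, and the `top_morphemes` helper are all hypothetical, and an in-memory database stands in for the real file.

```python
import sqlite3


def top_morphemes(conn, limit=10):
    """Return (morpheme, occurrences) rows, most frequent first."""
    query = """
        SELECT morpheme, COUNT(*) AS occurrences
        FROM morphemes
        GROUP BY morpheme
        ORDER BY occurrences DESC, morpheme ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# Hypothetical schema and sample rows; the real database layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE morphemes (morpheme TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO morphemes (morpheme) VALUES (?)",
    [("の",), ("の",), ("の",), ("に",), ("に",), ("は",)],
)

top = top_morphemes(conn, limit=2)
print(top)
conn.close()
```

To try this against a real scrape, replace `":memory:"` with the database name set in `main.py` and adjust the table and column names to match the actual schema.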