https://github.com/boseongkang/newstrend
Self-correcting 5-pillar financial AI with continuous improvement cycle. Production: boseongkang.github.io/newstrend
financial-ai finbert quantitative-finance self-correcting sentiment-analysis stock-price-prediction
- Host: GitHub
- URL: https://github.com/boseongkang/newstrend
- Owner: boseongkang
- Created: 2025-08-17T22:55:56.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2026-05-29T05:26:25.000Z (22 days ago)
- Last Synced: 2026-05-29T06:24:22.623Z (22 days ago)
- Topics: financial-ai, finbert, quantitative-finance, self-correcting, sentiment-analysis, stock-price-prediction
- Language: Python
- Homepage: https://boseongkang.github.io/newstrend/index.html
- Size: 345 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Audit: audit_report.md
# news trend
A Python-first, package-style starter to *ingest daily US news, deduplicate it, and analyze trends*.
It also includes **quickview and daily report generation**.
---
## Daily FinBERT routine (local, M3-MPS)
GitHub Actions handles raw-news collection and the trend-site build, but FinBERT
sentiment scoring runs locally: M3 MPS is ~52x faster than the free-tier CI CPU, and
the 60-day backfill, which would time out on CI, finishes in ~25 min locally.
```bash
# wrapper (auto-runs daily at 09:00 via launchd):
bin/finbert-daily # default 60-day window
WINDOW_DAYS=90 bin/finbert-daily # custom window
# direct invocation:
python scripts/sentiment_finbert_local.py --window-days 60 --commit
```
First run only: pass `--setup` to create the `/tmp/newstrend-cache` worktree.
On subsequent runs the script auto-detects which days are missing (typically
just today) and only scores those — usually a few seconds. The push lands on
the `data-cache` branch where `trend-site.yml` restores it on each CI run.
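
The missing-day detection described above can be sketched as follows. This is a minimal illustration, not the logic in `scripts/sentiment_finbert_local.py`: the function name `missing_days` and the idea of tracking scored days as a set of dates are assumptions made for the example.

```python
from datetime import date, timedelta


def missing_days(scored: set[date], today: date, window_days: int = 60) -> list[date]:
    """Return the days in the trailing window that have no scores yet, oldest first.

    Hypothetical helper: `scored` stands in for whatever the real script
    reads from its cache to learn which days are already done.
    """
    window = [today - timedelta(days=offset) for offset in range(window_days)]
    return sorted(day for day in window if day not in scored)


# Example: every day in a 5-day window is already scored except today.
today = date(2026, 5, 29)
already_scored = {today - timedelta(days=n) for n in range(1, 5)}
todo = missing_days(already_scored, today, window_days=5)
print(todo)  # [datetime.date(2026, 5, 29)]
```

With an empty cache the same call returns the full window, which corresponds to the first-run backfill case.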
The launchd plist at `~/Library/LaunchAgents/com.newstrend.finbert.plist`
fires `bin/finbert-daily` at 09:00 daily; logs go to
`~/Library/Logs/newstrend-finbert.log`. Manual kickstart:
`launchctl kickstart gui/$(id -u)/com.newstrend.finbert`.
---
## Quickstart
```bash
# Create virtual environment & install dependencies
python -m venv .venv && source .venv/bin/activate
pip install -e .
pip install pydantic
pip install timedelta
# Copy env template & set NEWSAPI_KEY (optional)
cp .env.example .env
# edit .env and add: NEWSAPI_KEY=your_api_key
```
## Ingest today's news (RSS + NewsAPI if key present)
`newscli ingest --country us --rss --newsapi`
## Deduplicate today's file into silver dataset
`newscli dedup --date today`
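
This README does not describe how `newscli dedup` decides that two articles are the same. As a rough sketch of one common approach, the example below fingerprints each article by its URL and falls back to a normalized title. The field names `url` and `title` and both function names are assumptions, not the package's actual API.

```python
import hashlib
import re


def article_key(article: dict) -> str:
    """Fingerprint an article by its URL, falling back to a normalized title."""
    url = (article.get("url") or "").strip().lower().rstrip("/")
    if url:
        basis = url
    else:
        # Collapse whitespace and case so trivially different titles match.
        basis = re.sub(r"\s+", " ", (article.get("title") or "").strip().lower())
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()


def dedup(articles):
    """Yield the first article seen for each fingerprint, preserving input order."""
    seen = set()
    for article in articles:
        key = article_key(article)
        if key not in seen:
            seen.add(key)
            yield article


raw = [
    {"url": "https://example.com/a", "title": "Fed holds rates"},
    {"url": "https://example.com/a/", "title": "Fed holds rates (updated)"},
    {"url": "", "title": "Markets  rally"},
    {"url": None, "title": "markets rally"},
]
silver = list(dedup(raw))
print(len(silver))  # 2
```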
### Example: inspect yesterday's raw NewsAPI ingest
`python src/news_trend/quickview.py --date YYYY-MM-DD --indir data --kind raw_newsapi --top 30 --min-len 3`
This will show:
- total articles
- top publishers
- top words & bigrams
- sample articles
### Example: generate report from deduplicated (silver) data
`python src/news_trend/report.py --date YYYY-MM-DD --indir data --kind silver_newsapi --outdir reports`
### Cron example (daily automation)
`5 8 * * * cd /path/to/news-trend-python-starter && . .venv/bin/activate && newscli ingest --country us --rss --newsapi && newscli dedup --date today && python src/news_trend/report.py --date $(date +\%F) --indir data --kind silver_newsapi --outdir reports >> logs.txt 2>&1`
## Structure
- src/news_trend/: Python package (ingest, dedup, quickview, report, utils)
- data/raw/YYYY-MM-DD.jsonl: raw ingested articles
- data/silver/YYYY-MM-DD.jsonl: cleaned & deduplicated articles
- reports/YYYY-MM-DD.md: generated daily reports
## 1. Ingest
`newscli ingest --country us --rss --newsapi`
## 2. Deduplicate
`newscli dedup --date today`
## 3. Quickview on raw data
`python src/news_trend/quickview.py --date today --indir data --kind raw_newsapi`
## 4. Generate report from deduplicated (silver) data
`python src/news_trend/report.py --date today --indir data --kind silver_newsapi --outdir reports`
## 08/19 update
- Time-sliced NewsAPI ingest
- Daily HTML report
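
One plausible reading of "time-sliced" is that the day is split into consecutive windows and each window is queried separately. The sketch below shows only the slicing step. The function name `time_slices` and the 6-hour default are assumptions; the slice size `newscli` actually uses is not stated here.

```python
from datetime import datetime, timedelta, timezone


def time_slices(day_start: datetime, hours: int = 6):
    """Split the 24 hours starting at day_start into consecutive [start, end) windows."""
    step = timedelta(hours=hours)
    end_of_day = day_start + timedelta(days=1)
    slices = []
    cursor = day_start
    while cursor < end_of_day:
        # Clamp the final window so it never runs past the end of the day.
        slices.append((cursor, min(cursor + step, end_of_day)))
        cursor += step
    return slices


day = datetime(2025, 8, 19, tzinfo=timezone.utc)
for start, end in time_slices(day):
    print(start.isoformat(), "->", end.isoformat())
```

Each `(start, end)` pair would then be passed as the date range of one API request, and the per-slice results concatenated into the day's file.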
## Commands (daily pipeline)
```bash
# 1) Ingest yesterday's articles (time-slicing happens inside newscli or your own ingest script)
newscli ingest --newsapi --date yesterday
# 2) Report
python src/news_trend/report.py --date yesterday --indir data --kind raw --outdir reports --top 30
# python src/news_trend/report.py --date yesterday --indir data --kind silver_newsapi --outdir reports --top 30
```
## 08/22 update
### Continuous live collection using GitHub Actions (NEWSAPI)
This repo includes a workflow that runs every 30 minutes, collects news data, and commits it to the repo as newline-delimited JSON.
Files are written under data/live_newsapi/ with names like YYYY-MM-DDTHH-MMZ.jsonl.
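
The naming scheme and file format above can be reproduced in a few lines. This is an illustrative sketch, not the workflow's actual script: `live_path` and `write_jsonl` are hypothetical helper names.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def live_path(now: datetime, outdir: str = "data/live_newsapi") -> Path:
    """Build a path like data/live_newsapi/YYYY-MM-DDTHH-MMZ.jsonl from a UTC timestamp."""
    # Expects a timezone-aware datetime; it is converted to UTC before formatting.
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%MZ")
    return Path(outdir) / f"{stamp}.jsonl"


def write_jsonl(path: Path, rows: list[dict]) -> int:
    """Write one JSON object per line and return the number of rows written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


path = live_path(datetime(2025, 8, 22, 21, 39, tzinfo=timezone.utc))
print(path.as_posix())  # data/live_newsapi/2025-08-22T21-39Z.jsonl
```

Using `-` instead of `:` in the time portion keeps the file names valid on every filesystem.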
### Verify
In the repository's **Actions** tab, open the `collect-live` workflow and click the most recent run.
The run log should contain a line like:
`[LIVE] NewsAPI -> data/live_newsapi/2025-08-22T21-39Z.jsonl (n rows)`
## 08/30 update
## Word Trends (cumulative)
Generate “top words” and 14-day trends from the deduplicated warehouse:
```bash
python scripts/viz_words.py \
--master data/warehouse/master.jsonl \
--outdir reports/words \
--top 30 \
--days 14 \
--min-len 3 \
--drop-content \
--extra-stop "chars,nbsp,amp,apos,mdash,ndash,inc,com,report,reports,shares"
```
## What the command does
- **Input**: `data/warehouse/master.jsonl` (all deduplicated articles).
- **Window**: keeps only articles from the most recent `--days` days (default: 14).
- **Text selection**: with `--drop-content`, only the title and description are used; the article body is ignored.
  - This reduces the boilerplate and noise found in article bodies and surfaces headline topics.
- **Normalization**: lowercasing, basic cleaning, tokenization.
- **Filtering**:
  - `--min-len`: drops tokens shorter than N characters (e.g., 3).
  - Stopwords: the built-in list plus `--extra-stop` (comma-separated, case-insensitive).
- **Outputs** (written to `--outdir`, e.g., `reports/words/`):
  - `top_words.png` – bar chart of the overall top `--top` words.
  - `top_words_trend.png` – line chart of daily counts for those words over the `--days` window.
  - `top_words.csv` – total counts.
  - `top_words_trend.csv` – daily counts per word.
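
The counting steps above (window, text selection, normalization, filtering, totals, and daily counts) can be sketched in plain Python. This is not the code in `scripts/viz_words.py`: the field names `published_at`, `title`, `description`, and `content`, the tiny `BUILTIN_STOP` set, and the function name `word_trends` are all assumptions, and chart and CSV writing is left out.

```python
import re
from collections import Counter, defaultdict
from datetime import date, timedelta

TOKEN_RE = re.compile(r"[a-z]+")            # assumed tokenizer
BUILTIN_STOP = {"the", "and", "for", "with"}  # tiny stand-in for the real stopword list


def word_trends(articles, today, days=14, top=30, min_len=3,
                extra_stop="", drop_content=True):
    """Return (total counts, per-word daily counts) for the top words in the window."""
    stop = BUILTIN_STOP | {w.strip().lower() for w in extra_stop.split(",") if w.strip()}
    cutoff = today - timedelta(days=days - 1)
    fields = ["title", "description"] if drop_content else ["title", "description", "content"]
    daily = defaultdict(Counter)
    for art in articles:
        day = date.fromisoformat(art["published_at"][:10])
        if not (cutoff <= day <= today):
            continue  # outside the trailing window
        text = " ".join(art.get(f) or "" for f in fields).lower()
        tokens = [t for t in TOKEN_RE.findall(text) if len(t) >= min_len and t not in stop]
        daily[day].update(tokens)
    totals = Counter()
    for counts in daily.values():
        totals.update(counts)
    top_words = [w for w, _ in totals.most_common(top)]
    trend = {w: {d.isoformat(): daily[d][w] for d in sorted(daily)} for w in top_words}
    return totals, trend


articles = [
    {"published_at": "2025-08-30T10:00:00Z", "title": "Nvidia shares jump",
     "description": "Nvidia report beats", "content": "boilerplate boilerplate"},
    {"published_at": "2025-08-29T09:00:00Z", "title": "Nvidia earnings",
     "description": "", "content": ""},
    {"published_at": "2025-08-01T09:00:00Z", "title": "Old nvidia story",
     "description": None, "content": None},
]
totals, trend = word_trends(articles, today=date(2025, 8, 30), extra_stop="shares,report")
print(totals["nvidia"], trend["nvidia"])  # 3 {'2025-08-29': 1, '2025-08-30': 2}
```

In this sketch `totals` corresponds to the data behind `top_words.csv` and `trend` to the data behind `top_words_trend.csv`.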