https://github.com/graphistry/dots
- Host: GitHub
- URL: https://github.com/graphistry/dots
- Owner: graphistry
- License: mit
- Created: 2024-02-23T06:32:11.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-12T07:48:41.000Z (about 2 years ago)
- Last Synced: 2025-02-24T16:50:28.697Z (over 1 year ago)
- Language: Python
- Size: 24.2 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Current Events Scraper & Featurizer
Using OpenSearch and Google News APIs, this tool pulls news stories and extracts features from the text. The features are then stored in a CSV file.
The tool can gather stories from multiple sources and languages. GNews maxes out at ~3000 stories per day; OpenSearch has no limit. For OpenSearch, it uses scroll and slice to pull a large number of stories.
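As an illustration of the scroll-and-slice pattern mentioned above, here is a minimal sketch rather than the code in `dots_feat.py`. It assumes a client exposing `search`, `scroll`, and `clear_scroll` the way `opensearch-py`'s `OpenSearch` client does. The function name `sliced_scroll`, the index name `news`, and the default page size and keep-alive values are hypothetical.

```python
from typing import Any, Dict, Iterator


def sliced_scroll(client: Any, index: str, query: Dict[str, Any],
                  slice_id: int, max_slices: int,
                  page_size: int = 500,
                  keep_alive: str = "2m") -> Iterator[Dict[str, Any]]:
    """Yield every hit from one slice of a scrolled search.

    `client` is assumed to behave like opensearchpy.OpenSearch.
    """
    body: Dict[str, Any] = {"query": query}
    # A slice splits one scroll into independent partitions; the server
    # requires max > 1, so the clause is omitted for a single slice.
    if max_slices > 1:
        body["slice"] = {"id": slice_id, "max": max_slices}

    resp = client.search(index=index, body=body,
                         scroll=keep_alive, size=page_size)
    scroll_id = resp.get("_scroll_id")
    try:
        # Keep requesting pages until an empty page signals the end.
        while resp["hits"]["hits"]:
            yield from resp["hits"]["hits"]
            resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context when done or on error.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)


# Hypothetical usage: consume slice 0 of 4 from an index named "news".
# for hit in sliced_scroll(client, "news", {"match_all": {}}, 0, 4):
#     print(hit["_source"])
```

Each slice can be consumed by a separate worker, which is what makes pulling a large number of stories practical.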
Clone current version & run [dots_feat.py](https://github.com/dcolinmorgan/dots/blob/main/dots/dots_feat.py)
--------------------------------------------------
Requirements:
- pytest
- pyarrow
- spacy
- python-dotenv
- bs4
- pandas
- scikit-learn
- transformers
- torch
- opensearch-py
- requests
- nltk
- numpy
- graphistry[umap-learn]
- umap-learn
- validators
- pytesseract
- selenium
- webdriver_manager
- undetected_chromedriver
- gliner
### The example below pulls 100 OS GNews stories and writes the features for each, in addition to location and date, to a file
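```bash
git clone https://github.com/graphistry/dots
cd dots  # the script path below is relative to the repository root
# One run per -e setting (0, 1, 2), each writing to its own CSV
python dots/dots_feat.py -n 100 -e 0 -d 0 -o dots_drba_feats.csv
python dots/dots_feat.py -n 100 -e 1 -d 0 -o dots_gpy_feats.csv
python dots/dots_feat.py -n 100 -e 2 -d 0 -o dots_glnr_feats.csv
```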
>"'Gaza Strip', '16-01-2024', ","['neighborhoods', 'rebels', 'widespread famine', 'egypt', 'disease']"
>"'Miseno, Campania, Italy', '16-01-2024', ","['disasters', 'mount vesuvius', 'ancient cataclysm', 'costruzione', 'beach']"
>"'Clarendon, Clarendon, Jamaica', '16-01-2024', ","['new bowen', 'fight', 'whatsapp', 'st catherine', 'jamaica']"
>"'Philadelphia, Pennsylvania, United States', '16-01-2024', ","['meteorologists', 'snow shovels', 'snowstorm', 'accuweather alerts', 'accuweather meteorologists']"
>"'New Bedford, Massachusetts, United States', '16-01-2024', ","['massachusetts law', 'saturday', 'ariel dorsey', 'traffic', 'united states']"
>"'Corofin, Clare, Ireland', '16-01-2024', ","['emergency services', 'breathing', 'rescue service', 'firefighters', 'afternoon']"
>"'United States', '16-01-2024', ","['preparedness', 'earthquake', 'quake', 'morning', 'disaster']"
>"'Syria', '16-01-2024', ","['neighboring countries', 'early recovery', 'cholera', 'symptom', 'mohamad katoub']"
>"'Iceland', '16-01-2024', ","['lava flows', 'evacuation', 'eruptions', 'jóhannesson', 'lúðvík pétursson']"
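
To read rows like the ones above back into Python, a sketch along these lines may help. It assumes each line of the output file matches the sample exactly (without the leading `>`): two quoted CSV fields, the first holding location and date, the second a Python-style list of features, and no header row. The helper name `parse_feature_rows` is hypothetical and not part of the repository.

```python
import ast
import csv
import io
from typing import Iterator, List, Tuple


def parse_feature_rows(text: str) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (location, date, features) from rows shaped like the sample."""
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 2:
            continue  # skip blank or differently shaped lines
        # The first field looks like "'Gaza Strip', '16-01-2024', ".
        # Wrapping it in parentheses makes it a Python tuple literal.
        location, date = ast.literal_eval(f"({row[0]})")
        # The second field is a list literal of feature strings.
        features = ast.literal_eval(row[1])
        yield location, date, features


# Two rows copied from the sample output above.
sample = '''"'Gaza Strip', '16-01-2024', ","['neighborhoods', 'rebels', 'widespread famine', 'egypt', 'disease']"
"'Iceland', '16-01-2024', ","['lava flows', 'evacuation', 'eruptions', 'jóhannesson', 'lúðvík pétursson']"
'''

rows = list(parse_feature_rows(sample))
for location, date, features in rows:
    print(location, date, features)
```

`ast.literal_eval` is used instead of `eval` so that only plain literals are accepted from the file.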
Here is an example, produced every day via `gh_actions`, of parsing GNews stories and extracting features:
[Feature Table](DOTS/output/lobstr3_dots_feats.csv) and [Full Table](DOTS/output/full_lobstr3_dots_feats.csv)