https://github.com/textcorpuslabs/njgovnews
Web scraping of the New Jersey news feeds
newsfeed python3 text-corpus
- Host: GitHub
- URL: https://github.com/textcorpuslabs/njgovnews
- Owner: TextCorpusLabs
- License: mit
- Created: 2022-02-24T17:28:01.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-03-10T22:43:19.000Z (over 3 years ago)
- Last Synced: 2025-01-27T07:27:29.305Z (10 months ago)
- Topics: newsfeed, python3, text-corpus
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# New Jersey Government News


Scrape news feeds from the New Jersey government
# Punch List
Known bugs to fix before v1.0
- [ ] GitHub Install
- [ ] Encoding
# Operation
## Install
You can install the package using the following steps:
1. Install with `pip` from an _admin_ prompt
```{ps1}
pip uninstall NJGovNews
pip install -v git+https://github.com/TextCorpusLabs/NJGovNews.git
```
## Run
You can run the package as follows:
```{ps1}
NJGovNews SITE -out FILE_OUT
```
The scraper currently supports the following `SITE`s:
1. The [Department of the Treasury](https://nj.gov/treasury).
For example: `NJGovNews treasury -out "c:/data/news/nj_treasury.csv"`
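For scripted or scheduled runs, the same command can be launched from Python. The sketch below is only an illustration and not part of the package: `build_command` and `run_scraper` are hypothetical helper names, and it assumes the `NJGovNews` executable is on the `PATH` after installation.

```python
import subprocess


def build_command(site: str, file_out: str) -> list[str]:
    """Build the argument list for `NJGovNews SITE -out FILE_OUT`."""
    return ["NJGovNews", site, "-out", file_out]


def run_scraper(site: str, file_out: str) -> None:
    """Run the scraper; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(site, file_out), check=True)


# Example call (not executed here, since it needs the package installed):
# run_scraper("treasury", "c:/data/news/nj_treasury.csv")
```

Passing the arguments as a list avoids shell quoting problems when the output path contains spaces.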
## Cache
This scraper uses `requests-cache` to improve performance.
If you want to _force_ a full reload of all the data, delete the file named `SITE.cache.sqlite`.
It will be in the same folder as the _.csv_ the scraper created.
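The forced reload described above amounts to deleting one file. Here is a minimal sketch, assuming the cache file name is the `SITE` argument followed by `.cache.sqlite` (for example `treasury.cache.sqlite`). `cache_path` and `clear_cache` are hypothetical helpers, not part of the package.

```python
from pathlib import Path


def cache_path(site: str, file_out: str) -> Path:
    """Return the assumed cache location: next to the CSV, named SITE.cache.sqlite."""
    return Path(file_out).parent / f"{site}.cache.sqlite"


def clear_cache(site: str, file_out: str) -> bool:
    """Delete the cache file if it exists; return True when a file was removed."""
    path = cache_path(site, file_out)
    if path.exists():
        path.unlink()
        return True
    return False
```

Calling `clear_cache("treasury", "c:/data/news/nj_treasury.csv")` before the next run would then make the scraper fetch everything again, provided the naming assumption holds.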
# Development
## Prerequisites
You can install the package _for development_ using the following steps:
**Note**: You can replace steps 1-3 with the [VSCode](https://code.visualstudio.com/Download) `Git: Clone` command.
1. Download the project from [GitHub](https://github.com/TextCorpusLabs/NJGovNews)
* Click the green "Code" button on the right, then select "Download ZIP".
2. Remove the zip protections by right-clicking the file, selecting "Properties", and checking "Security: Unblock".
3. Unzip the folder.
I recommend using the folder _c:/repos/TextCorpusLabs/NJGovNews_
4. Run `pip`'s editable install from an _admin_ prompt
```{ps1}
pip uninstall NJGovNews
pip install -v -e c:/repos/TextCorpusLabs/NJGovNews
```
5. Install the `nltk` add-ons using an _admin_ prompt
```{ps1}
python -c "import nltk;nltk.download('punkt')"
```