https://github.com/khuyentran1401/extract-text-from-article
- Host: GitHub
- URL: https://github.com/khuyentran1401/extract-text-from-article
- Owner: khuyentran1401
- Created: 2020-01-01T04:14:52.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2020-04-06T02:40:27.000Z (over 5 years ago)
- Last Synced: 2025-03-27T03:51:19.744Z (7 months ago)
- Topics: data-science, natural-language-processing, newspaper3k, nltk, python, text-preprocessing, web-scraping
- Language: Jupyter Notebook
- Size: 82 KB
- Stars: 6
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# About this project
This project extracts the text of an article with newspaper3k's `Article` class, then uses NLTK (Natural Language Toolkit) to preprocess the text and find the most common words in the article.

# Tools
* Newspaper3k: tool to scrape articles
* NLTK: tool to process text

# Steps
* Scrape articles with newspaper3k
```python
from newspaper import Article

url = 'https://mystudentvoices.com/it-took-me-2-years-to-get-1000-followers-life-lessons-ive-learned-throughout-the-journey-9bc44f2959f0'
article = Article(url)
article.download()  # fetch the page HTML
article.parse()     # extract text and metadata from the downloaded HTML
```
* Find the publish date
```python
# Populated once the article has been downloaded and parsed
article.publish_date
```
* Extract image
* Find the author
* Find the keywords
* Find the summary
* Preprocessing with NLTK
* Tokenize text
* Lowercase and remove stopwords
* Visualize the frequency of words with Matplotlib
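
The preprocessing steps (tokenize, lowercase, remove stop words, count) can be sketched as a single function. This is an illustration under assumptions, not necessarily what the notebook does: `most_common_words` is a hypothetical helper, and it uses NLTK's `RegexpTokenizer` rather than `word_tokenize` so that it runs without downloading any NLTK data. The stop-word collection is passed in by the caller.

```python
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer


def most_common_words(text, stop_words, n=10):
    """Return the n most frequent non-stop-words as (word, count) pairs."""
    # Tokenize on runs of word characters, which also drops punctuation.
    tokens = RegexpTokenizer(r"\w+").tokenize(text)
    # Lowercase so "The" and "the" are counted as the same word.
    words = [token.lower() for token in tokens]
    # Remove stop words before counting.
    words = [word for word in words if word not in stop_words]
    return FreqDist(words).most_common(n)


sample = "The cat and the hat. The cat sat."
print(most_common_words(sample, stop_words={"the", "and"}, n=2))
```

For a scraped article, pass `article.text` as `text` and `set(stopwords.words("english"))` from `nltk.corpus` as `stop_words`, after running `nltk.download("stopwords")` once. To plot the counts with Matplotlib, `FreqDist` also offers a `plot()` method, for example `FreqDist(words).plot(10)`.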
# Tutorial blog
Find the Medium article for this repository [here](https://medium.com/@khuyentran1476/find-common-words-in-article-with-python-module-newspaper-and-nltk-8c7d6c75733)