https://github.com/wenkesj/summarize_text
Experimenting with Medium Digests: Learning to Summarize
https://github.com/wenkesj/summarize_text
medium nlp nodejs pupeteer tensorflow textsum web-scraping
Last synced: about 2 months ago
JSON representation
Experimenting with Medium Digests: Learning to Summarize
- Host: GitHub
- URL: https://github.com/wenkesj/summarize_text
- Owner: wenkesj
- Created: 2017-11-10T21:03:52.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-12-05T15:55:04.000Z (over 8 years ago)
- Last Synced: 2025-01-06T12:40:57.128Z (over 1 year ago)
- Topics: medium, nlp, nodejs, pupeteer, tensorflow, textsum, web-scraping
- Language: Python
- Homepage:
- Size: 19.5 KB
- Stars: 0
- Watchers: 2
- Forks: 3
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Medium Text Summarization
This is an attempt to model Medium's digest by topic. It's a very simple experiment that uses
a combination of (high-dimensional) word embeddings and dynamic bi-directional encoding
on article features. This is a proof-of-concept for modeling "live" data that is "streamed" and is
available publicly. There are models for headless web `Scraper`s and dataset models for natural
language articles.
# Running
**Before** you start slinging, do this first (> Node 7, > Python 2 or 3):
```sh
npm i && python setup.py install
```
1. Scrape dataset `medium`. This will go for a while dependent on your internet connection.
There is a timeout of 30s where the page will be skipped. This tries to use 8 threads.
```sh
bin/scrape medium
```
This crawls the [topics page](https://medium.com/topics) and collects the available topics. Then
it visits each topic main page i.e. [culture](https://medium.com/topics/culture) and extracts all
the landing page articles (extracts the `href` according to the attribute `data-post-id`). Finally,
it visits each article page, finds the Medium API from the landing page, it looks like this:
```html
// <;
```
It uses `page.evaluate(...)` to perform a `regexp` on the script content, parses it as JSON and
then passes it back to node.js. It finally strips the meta data, it reduces the object as a model
for the python `textsum/dataset/article.py` model `textsum.Article` with the features: `title`,
`subtitle`, `text`, `tags`, `description`, `short_description`.
We now have raw data that we can use to do fun things.
2. Convert raw data to numpy records of examples.
```sh
bin/records --src=data/medium --dst=records/medium --pad
```
This takes the raw data from `src` and serializes it as `textsum.Article` objects for consumption.
As it is serializing, it tokenizes all the features (`title`, `subtitle`, ...) as mentioned in **2**.
It saves all these as `np.ndarray`s and stores them in `dst` by `topic`. Next, the examples
are piped to `*.npy` files. This comes in handy to be used with the
native `tf.data` API, it's like **hadoop** or **spark** but native compatibility with **tensorflow**.
Finally, all the record `tokens` we collected for each topic, is collected in a `set`, so we don't
store all tokens in memory to avoid repetition, this is done in a `map->reduce` fashion. The tokens
are gathered by `topic` on a individual thread as a `set` of `str`s and the `union` operation reduces
the total space for each `topic` `map` operation. The `map` stage returns all the individual vocabs
for each feature (as in **2**) and is reduced by the `union` operation again.
We now have a set of vocab files for each feature in the dataset.
3. Final step, **Sling** (run the experiment)
```sh
bin/experiment \
--model_dir=article_model \
--dataset_dir=records/medium \
--input_feature='text' \
--target_feature='title' \
--schedule='train'
```