{"id":19343020,"url":"https://github.com/wenkesj/summarize_text","last_synced_at":"2026-05-06T09:44:30.973Z","repository":{"id":74813045,"uuid":"110292447","full_name":"wenkesj/summarize_text","owner":"wenkesj","description":"Experimenting with Medium Digests: Learning to Summarize","archived":false,"fork":false,"pushed_at":"2017-12-05T15:55:04.000Z","size":20,"stargazers_count":0,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-06T12:40:57.128Z","etag":null,"topics":["medium","nlp","nodejs","pupeteer","tensorflow","textsum","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wenkesj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-11-10T21:03:52.000Z","updated_at":"2017-12-05T15:56:17.000Z","dependencies_parsed_at":"2023-05-31T12:16:22.638Z","dependency_job_id":null,"html_url":"https://github.com/wenkesj/summarize_text","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenkesj%2Fsummarize_text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenkesj%2Fsummarize_text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenkesj%2Fsummarize_text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenkesj%2Fsummarize_text/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/w
enkesj","download_url":"https://codeload.github.com/wenkesj/summarize_text/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240449891,"owners_count":19803125,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["medium","nlp","nodejs","pupeteer","tensorflow","textsum","web-scraping"],"created_at":"2024-11-10T03:37:05.013Z","updated_at":"2026-05-06T09:44:25.937Z","avatar_url":"https://github.com/wenkesj.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Medium Text Summarization\n\nThis is an attempt to model Medium's digest by topic. It's a very simple experiment that uses\na combination of (high-dimensional) word embeddings and dynamic bi-directional encoding\non article features. This is a proof-of-concept for modeling \"live\" data that is \"streamed\" and is\navailable publicly. There are models for headless web `Scraper`s and dataset models for natural\nlanguage articles.\n\n# Running\n\n**Before** you start slinging, do this first (\u003e Node 7, \u003e Python 2 or 3):\n\n```sh\nnpm i \u0026\u0026 python setup.py install\n```\n\n1. Scrape the `medium` dataset. This can take a while, depending on your internet connection.\n   A page that takes longer than 30s to load is skipped. The scraper tries to use 8 threads.\n\n   ```sh\n   bin/scrape medium\n   ```\n\n   This crawls the [topics page](https://medium.com/topics) and collects the available topics. Then\n   it visits each topic main page, e.g. 
[culture](https://medium.com/topics/culture) and extracts all\n   the landing page articles (it extracts each `href` according to the `data-post-id` attribute). Finally,\n   it visits each article page and finds the Medium API payload in the landing page, which looks like this:\n\n   ```html\n   \u003cscript\u003e\n   // \u003c![CDATA[\n   window[\"obvInit\"]({\"value\":{...}});\n   \u003c/script\u003e\n   ```\n\n   It uses `page.evaluate(...)` to run a regular expression on the script content, parses the match as JSON, and\n   then passes it back to Node.js. Finally, it strips the metadata and reduces the object to the fields of\n   the Python `textsum.Article` model (`textsum/dataset/article.py`), with the features: `title`,\n   `subtitle`, `text`, `tags`, `description`, `short_description`.\n\n   We now have raw data that we can use to do fun things.\n\n2. Convert the raw data to NumPy records of examples.\n\n   ```sh\n   bin/records --src=data/medium --dst=records/medium --pad\n   ```\n\n   This takes the raw data from `src` and serializes it as `textsum.Article` objects for consumption.\n   As it is serializing, it tokenizes all the features (`title`, `subtitle`, ...) listed in step **1**.\n   It saves these as `np.ndarray`s and stores them in `dst` by `topic`. Next, the examples\n   are written to `*.npy` files. This makes them easy to use with the\n   native `tf.data` API, which is like **Hadoop** or **Spark** but natively compatible with **TensorFlow**.\n   Finally, the `tokens` collected for each topic are gathered in a `set`, so that repeated tokens are not\n   stored in memory; this is done in a `map-\u003ereduce` fashion. The tokens\n   are gathered by `topic` on an individual thread as a `set` of `str`s, and the `union` operation reduces\n   the total space for each `topic` `map` operation. 
The `map` stage returns all the individual vocabs\n   for each feature (as listed in step **1**), and these are reduced by the `union` operation again.\n\n   We now have a set of vocab files for each feature in the dataset.\n\n3. Final step: **Sling** (run the experiment).\n\n   ```sh\n   bin/experiment \\\n     --model_dir=article_model \\\n     --dataset_dir=records/medium \\\n     --input_feature='text' \\\n     --target_feature='title' \\\n     --schedule='train'\n   ```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwenkesj%2Fsummarize_text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwenkesj%2Fsummarize_text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwenkesj%2Fsummarize_text/lists"}