{"id":19373393,"url":"https://github.com/xop/news-scraper","last_synced_at":"2025-07-17T09:35:02.631Z","repository":{"id":80031662,"uuid":"58066347","full_name":"XOP/news-scraper","owner":"XOP","description":"NewScraper","archived":false,"fork":false,"pushed_at":"2017-02-08T20:08:53.000Z","size":340,"stargazers_count":2,"open_issues_count":4,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-24T14:56:11.129Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/XOP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-05-04T16:15:59.000Z","updated_at":"2018-08-22T11:31:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"fdb2a816-328e-44e6-95b2-416e21f01d75","html_url":"https://github.com/XOP/news-scraper","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/XOP/news-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XOP%2Fnews-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XOP%2Fnews-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XOP%2Fnews-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XOP%2Fnews-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/XOP","download_url":"https://codeload.github.com/XOP/news-scraper/tar.gz/refs/heads/master","sbom_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XOP%2Fnews-scraper/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265589529,"owners_count":23793551,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T08:28:03.419Z","updated_at":"2025-07-17T09:35:02.598Z","avatar_url":"https://github.com/XOP.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NewScraper\n\n\u003e IMPORTANT!\n\nCurrent version is work in progress, documentation is not up-to-date.  \nIf you want to try stable working example, check out [latest release](https://github.com/XOP/news-scraper/releases/tag/0.10.0) and follow installation instructions. \n\n\u003e Why?\n\nI have a decent number of favourite sites that regularly publish new interesting articles  \nand I'm too lazy (or better say _love to automate_) to check all of them manually.\n\n\u003e How does it work?\n\nNewScraper goes to the specified urls, fetches the defined links, brings them back and displays on one page.  \nOptionally, it can deploy it to some hosting.  \nAll you need to do from here is to look through the titles and decide whether to check it out or ditch it.\n\n\u003e Any special skills required?\n\nBasic knowledge of JS, browser dev tools ([Chrome Dev Tools](https://developers.google.com/web/tools/chrome-devtools/), for instance) and CSS selectors are highly preferable.  
\n\nIf you are more of a user than a developer, this manual should cover almost everything you need.\n\n\u003e What does it look like?\n\nA daily digital digest, manually (for now) controlled.\n\n\u003e Anything else I should know?\n\nYes! NewScraper uses [Nightmare](http://www.nightmarejs.org/) for web crawling.\n\n\n\n## Support\n\n:warning: NewScraper is designed to run on several platforms, but so far there are some [issues](https://github.com/XOP/news-scraper/issues/1) with full Windows support.\n\n\n\n## Prerequisites\n\nNewScraper is designed for Node.js, so go ahead and [install](https://nodejs.org/) it.\n\nFor an easy deployment process it's better to have [surge](http://surge.sh/) installed globally.  \nIt is not necessary, though. For test scraping and saving results locally you can proceed without it. \n\n\n\n## Install\n\nNo big surprise here:\n\n```\n$ npm install\n```\n\n\n\n## Setup\n\nNewScraper uses source files as _parsing directives_.  
\nSource data can be provided in JSON or [YAML](http://docs.ansible.com/ansible/YAMLSyntax.html) files; choose whatever suits your needs.\n\n`*.yml` example:\n\n```\n'Smashing magazine':\n  url: 'http://www.smashingmagazine.com/'\n  elem: 'article.post'\n  link: 'h2 \u003e a'\n  author: 'h2 + ul li.a a'\n  time: 'h2 + ul li.rd'\n  image: 'figure \u003e a \u003e img'\n  limit: 6\n```\n\nProperties explained:\n\n`'Smashing magazine'`  \nname of the resource, **required**  \n\n`url`  \nsource url for NewScraper, **required**  \n\n`elem`  \nCSS selector of the news item container element, **required**  \n\n`link`  \nCSS selector of the link (\u003ca href=\"\"\u003e...\u003c/a\u003e) _inside_ of the `elem`  \nIf the `elem` itself _is_ a link, this is not required\n\n`author`  \nCSS selector of the author element _inside_ of the `elem`\n\n`time`  \nCSS selector of the time element _inside_ of the `elem`\n\n`image`  \nCSS selector of the image element _inside_ of the `elem`  \nThis can be an `img` tag or any other element - the scraper will search for the `data-src` attribute and the `background-image` CSS property to find proper image data\n\n`limit`  \nmaximum number of `elem`-s that will be scraped from the `url`\n\n\n### Adding directives\n\nThere are several ways to add (custom) directives.\n\n\n#### Git repository\n\nCreate a git repository similar to [this one](https://github.com/XOP/my-favourite-front-end-resources). There you can add files with the desired resources in YAML or JSON format - take [scraper.yml](https://github.com/XOP/my-favourite-front-end-resources/blob/master/scraper.yml) as an example. Then specify the properties of the repo in `config.js` and you are good to go.\n\n\n#### Custom files \n\nThe second option is to manually create directive files (YAML or JSON format) and put them into the `/source` directory. 
Then adjust `config.js` so the scraper knows which directives to use.\n\n\n#### Adding dialog\n\nThe dialog option is probably the simplest way to test something relatively quickly.\n\nAll you need to do is run\n\n```\n$ npm run add\n```\n\nand follow the prompts.  \nThe result will be stored in the `custom.json` file in the `/source` directory and used in the scraping procedure.\n\nBy default, `custom.json` is used as the source file, so there is no need to tweak `config.js`.\n\n\n\n## Up and running\n\nFirst you have to build the project.  \nThis has to be done **only once**, unless you are making changes in the `/src` directory:\n\n```\n$ npm run build\n```\n\nAfter that, starting is pretty straightforward:\n\n```\n$ npm start\n```\n\nand go to `http://localhost:9000`.\n\n\n### Generating index\n\nSometimes you need to generate or re-generate the index without going through the whole fetching process.\n\nPlease note that any existing `index.html` in the target folder will be overwritten.\n\n```\n$ npm run index\n```\n\n\n\n## Deployment\n\nDeployment uses [surge.sh](http://surge.sh/), so you have to be logged in.  \nTo do this, run the following:\n\n```\n$ surge login\n```\n\nThen you have to customize local settings. To do this, simply rename `user-example.json` to `user.json` and edit the existing config.\n\nAfter this procedure you will be able to deploy to your domain with one command:\n\n```\n$ npm run deploy\n```\n\n\n### Adjusting the deploy\n\nBy default, deploy covers the whole directory (specified in config as `output.path`), which includes a lot of generated JSON data.  \nTo avoid uploading that data, create a `.surgeignore` file with the following content:\n\n```\n*.json\n```\n\nand make sure to store it **inside the deployed folder** (specified in config as `output.path`) - \"data\" by default. \n\nYou can find a sample `.surgeignore` file in the project root.\n\n\n\n## Configuration\n\nPersistent configuration is stored in `config.js`.  
\nIt can be tweaked, though there is a better way to tweak settings.\n\nFor various everyday needs there is a `user.json` file at your service.  \nIt basically _overrides_ config settings.\n\nHere are all the config parameters:\n```\n{\n    // directives' parameters\n    source: {\n        // folder for all directives\n        path: '/source',\n        \n        // array of local directives' names\n        // empty array means no local directives being used\n        file: ['local.yml']\n    },\n    \n    // repository parameters\n    repo: {\n        name: 'my-favourite-front-end-resources',\n        path: 'https://github.com/XOP/my-favourite-front-end-resources',\n        \n        // array of directives' names in repository\n        // empty array means no repo directives being used\n        file: []\n    },\n    \n    // results parameters\n    output: {\n        // folder for the output data and rendered html\n        path: '/data',\n        \n        // properties for the rendered html file name\n        fileName: '',\n        fileDate: true,\n        fileExt: 'html',\n        \n        // keeps last scraping run data (applies to compare update strategy)\n        current: 'data.json'\n    },\n    \n    // here are the assets for the deployed site being kept - css etc.\n    assets: {\n        path: '/assets'\n    },\n    \n    // array of acceptable directives formats\n    sourceFormats: ['json', 'yml'],\n    \n    // maximum number of news parsed from each resource\n    limit: 3,\n    \n    // the most maximum number of news scraped from each resource\n    absLimit: 50,\n    \n    // determines if only local directives are being used\n    localOnly: false,\n    \n    // prevents verbose output to console and reports errors only\n    silent: false,\n    \n    // possible options: scratch | compare\n    // scratch - each following scraping round ignores previous results\n    // compare - each following scraping round brings only news since last run\n    
updateStrategy: 'scratch'\n}\n```\n\nThat said, here is a possible configuration of `user.json` (create one from the [corresponding sample file](user-example.json)):\n\n```\n{\n    \"surgeDomain\": \"xop-news-scraper.surge.sh\",\n\n    \"localOnly\": false,\n\n    \"repo\": {\n        \"file\": [\n            \"news-rus.yml\"\n        ]\n    },\n\n    \"source\": {\n        \"file\": []\n    }\n}\n```\n\n\n\n## Development mode\n\nIn this mode the repository update is skipped.\n\n```\n$ npm run build\n$ npm run dev\n```\n\n\n### Debug\n\n:construction::construction::construction:\n\n\n\n## Running tests\n\n```\n$ npm run build\n$ npm test\n```\n\n\n\n## Dependencies\n\n- [NewScraper Core](https://www.npmjs.com/package/news-scraper-core)\n\n\n\n## [MIT License](LICENSE)\n\n\n\n## Useful links\n\n- [Nightmare](http://www.nightmarejs.org/)\n- [cheerio](https://github.com/cheeriojs/cheerio)\n- [surge.sh](http://surge.sh/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxop%2Fnews-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxop%2Fnews-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxop%2Fnews-scraper/lists"}