{"id":19316582,"url":"https://github.com/tokenmill/crawling-framework-example","last_synced_at":"2026-05-08T16:40:13.985Z","repository":{"id":138584177,"uuid":"108383645","full_name":"tokenmill/crawling-framework-example","owner":"tokenmill","description":"Demonstration on how to use the Crawling Framework to setup a simple science news crawler and store results in ElasticSearch. Use this configuration to set up your own crawler.","archived":false,"fork":false,"pushed_at":"2019-09-04T10:48:45.000Z","size":183,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-02-24T04:42:07.481Z","etag":null,"topics":["crawler","crawling-framework","elasticsearch","storm-crawler"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tokenmill.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-10-26T08:31:49.000Z","updated_at":"2022-12-03T11:19:36.000Z","dependencies_parsed_at":"2023-04-15T12:01:12.039Z","dependency_job_id":null,"html_url":"https://github.com/tokenmill/crawling-framework-example","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tokenmill/crawling-framework-example","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fcrawling-framework-example","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fcrawling-framework-example/tags","releases_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/tokenmill%2Fcrawling-framework-example/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fcrawling-framework-example/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tokenmill","download_url":"https://codeload.github.com/tokenmill/crawling-framework-example/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fcrawling-framework-example/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32788744,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-08T08:22:46.396Z","status":"ssl_error","status_checked_at":"2026-05-08T08:22:45.650Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawling-framework","elasticsearch","storm-crawler"],"created_at":"2024-11-10T01:11:57.142Z","updated_at":"2026-05-08T16:40:13.958Z","avatar_url":"https://github.com/tokenmill.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ca href=\"http://www.tokenmill.lt\"\u003e\n      \u003cimg src=\".github/tokenmill-logo.svg\" width=\"125\" height=\"125\" align=\"right\" /\u003e\n\u003c/a\u003e\n\n# Crawling Framework Example\n\nThis repository contains a sample [Crawling 
Framework](https://github.com/tokenmill/crawling-framework) configuration. It shows how to use Crawling Framework to set up a simple science news crawler and write the results to [ElasticSearch](https://www.elastic.co/products/elasticsearch). You can use this configuration to set up your own crawler.\n\nYou do not need to write any code: just change the configuration files, create the ElasticSearch indices (scripts are provided), and use the Crawling Framework Management UI to define crawling rules.\n\n## Running Science News Crawler\n\n### Prerequisites\n\nAn ElasticSearch v6.1.x server must be running on localhost:9200 before you execute the installation commands.\n\n### Installing and running management UI\n\nFirst, check out Crawling Framework, which has scripts to create the ElasticSearch indices. Make sure the ElasticSearch server is running (the default configuration assumes that its host is localhost; if not, see the instructions below on how to change it). Run the commands below:\n\n```\ngit clone git@github.com:tokenmill/crawling-framework.git\ncd crawling-framework\nbin/create-es-indices.sh localhost cf-example\n```\n\nAfter that, clone the Example project and run the crawling management UI.\n\n```\ngit clone git@github.com:tokenmill/crawling-framework-example.git\ncd crawling-framework-example\nbin/run-management-ui.sh\n```\n\nOnce it has started, open http://localhost:8081/, which will show an empty crawling setup. To populate it with the pre-configured Science News setup:\n\n1. Click *Configuration* and choose the *Import/Export* option. This opens the Management UI section where configurations can be imported or exported (once you create your own configuration you might want to save it via export).\n1. The window opens on the *HTTP Sources* tab. Click *Browse* and choose (paths are relative to the example project root) *crawl-config/http-sources.csv*. Then click *Import HTTP Sources*.\n1. Open the *HTTP Source Tests* tab. 
Click *Browse* and choose *crawl-config/http-source-tests.csv*. Then click *Import HTTP Source Tests*.\n\nConfirm that three HTTP sources are configured: click *Configuration* and choose the *HTTP Sources* option. The table on that page has to contain entries for:\n\n1. https://www.sciencedaily.com\n1. https://www.sciencenews.org\n1. http://www.bbc.com/news/science_and_environment\n\nThe crawl setup is done, and we can start the crawler.\n\n### Running the crawl\n\nFrom the Example project root, run this command:\n```\nbin/run-crawler.sh\n```\n\nIt starts Storm Crawler with the configuration for the Science News crawl. Let it run for a few minutes; you should start seeing log messages containing URLs, something like this:\n```\n7-09-20T10:25:41 source: http://www.bbc.com/news/science_and_environment\n\n24411 [Thread-36-generator-executor[4 4]] INFO  l.t.c.c.s.UrlGeneratorSpout - Emitted url http://www.bbc.com/news/science-environment-40900679 with meta discovered: 2017-09-20T10:25:41 source: http://www.bbc.com/news/science_and_environment\n\n24411 [Thread-36-generator-executor[4 4]] INFO  l.t.c.c.s.UrlGeneratorSpout - Emitted url http://www.bbc.com/news/science-environment-40686984 with meta discovered: 2017-09-20T10:25:41 source: http://www.bbc.com/news/science_and_environment\n\n```\n\nThe crawler is now working and writes the documents it finds to ElasticSearch. You can leave the crawler running for as long as you like. To see the results, issue the following query via curl or Kibana.\n\n```\ncurl -XGET \"http://localhost:9200/cf-example-docs/_search\" -H 'Content-Type: application/json' -d'\n{\"query\": {\"match_all\": {}}}'\n```\n\n\n## Configuring your own crawler\n\nUse the Example project as the basis for your own crawler. 
Just as with the Example, you have to create new indices via the framework's scripts:\n\n\n```\ncd [CRAWLING FRAMEWORK LOCATION]\nbin/create-es-indices.sh localhost [APP NAME]\n```\n\nNext, change the Storm configuration file located at *crawler/conf/crawler-local.yaml*:\n\n```\n# Index to store http source configuration\nes.httpsource.index.name=[APP NAME]-http-sources\n\n# Index to store http source tests\nes.httpsourcetest.index.name=[APP NAME]-http-source-tests\n\n# Index to store named queries\nes.namedqueries.index.name=[APP NAME]-named-queries\n\n# Index to store crawled documents\nes.docs.index.name=[APP NAME]-docs\n```\n\nWith the configuration in order, you can run the crawler management UI: *bin/run-management-ui.sh*\n\nSet up your crawl sources; after that, Storm Crawler can be started to fetch the specified content: *bin/run-crawler.sh*\n\n## License\n\nCopyright \u0026copy; 2019 [TokenMill UAB](http://www.tokenmill.lt).\n\nDistributed under the Apache License, Version 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftokenmill%2Fcrawling-framework-example","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftokenmill%2Fcrawling-framework-example","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftokenmill%2Fcrawling-framework-example/lists"}