{"id":16577392,"url":"https://github.com/oliver006/elasticsearch-hn","last_synced_at":"2025-07-29T01:35:18.955Z","repository":{"id":23263556,"uuid":"26621932","full_name":"oliver006/elasticsearch-hn","owner":"oliver006","description":"Index \u0026 Search Hacker News using Elasticsearch and the HN API","archived":false,"fork":false,"pushed_at":"2018-04-08T15:01:58.000Z","size":7,"stargazers_count":96,"open_issues_count":0,"forks_count":15,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-03T20:51:11.515Z","etag":null,"topics":["elasticsearch","news","python","stories","tornado","tutorial"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oliver006.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-11-14T04:14:59.000Z","updated_at":"2025-03-07T15:45:06.000Z","dependencies_parsed_at":"2022-07-22T04:02:08.956Z","dependency_job_id":null,"html_url":"https://github.com/oliver006/elasticsearch-hn","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/oliver006/elasticsearch-hn","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-hn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-hn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-hn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-hn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oliver006","dow
nload_url":"https://codeload.github.com/oliver006/elasticsearch-hn/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-hn/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267616687,"owners_count":24116159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-28T02:00:09.689Z","response_time":68,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elasticsearch","news","python","stories","tornado","tutorial"],"created_at":"2024-10-11T22:10:44.292Z","updated_at":"2025-07-29T01:35:18.938Z","avatar_url":"https://github.com/oliver006.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Elasticsearch For Beginners: Index and Search Hacker News \n================\n\n\n#### Big picture plz? \n\nHacker News officially released their [API](http://blog.ycombinator.com/hacker-news-api) this October, giving access to a vast number of news articles, comments, polls, job postings, etc., all via JSON, which makes the data perfect to put into Elasticsearch.\n\n[Elasticsearch](http://elasticsearch.org) is currently the most popular open-source search engine, used for a wide variety of use cases. 
It natively works with JSON documents, so this sounds like a perfect fit.\n\nThis project runs on a [DigitalOcean 512MB droplet](https://m.do.co/c/c9b25dec9715) that hosts the Elasticsearch node and a simple Tornado app for the frontend. A cron job runs the update every 5 minutes.\n\n\n#### Prerequisites\n\nSet up Elasticsearch and make sure it's running at [http://localhost:9200](http://localhost:9200)\n\nSee [here](https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html) if you need more information on how to install Elasticsearch.\n\nI use Python and [Tornado](https://github.com/tornadoweb/tornado/) for the scripts to import and query the data.\n\n\n\n#### Aight, so what are we doing? \n\nWe'll start by loading the IDs of the top 100 HN stories, then retrieve detailed information about each item and index it in Elasticsearch.\n\n\nTop 100 Stories:\n\n`curl https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty`\n\nThe result looks something like this:\n\n```\n[ 8605204, 8604814, 8602936, 8604489, 8604533, 8604626, 8605207, 8605186, \n...\n8603147, 8602037 ]\n```\n\nWe can now loop through the IDs and retrieve more detailed information:\n\n`curl https://hacker-news.firebaseio.com/v0/item/8605204.json?print=pretty`\n\nyields this:\n\n```\n{\n  \"by\" : \"davecheney\",\n  \"id\" : 8605204,\n  \"kids\" : [ 8605567, 8605461, 8605280, 8605824, 8605404, 8605601, 8605246, 8605323, 8605712, 8605346, 8605743, 8605242, 8605321, 8605268 ],\n  \"score\" : 260,\n  \"text\" : \"\",\n  \"time\" : 1415926359,\n  \"title\" : \"Go is moving to GitHub\",\n  \"type\" : \"story\",\n  \"url\" : \"https://groups.google.com/forum/#!topic/golang-dev/sckirqOWepg\"\n}\n```\n\nAnd store the JSON document in Elasticsearch:\n\n`curl -XPUT http://localhost:9200/hn/story/***item['id']*** -d @doc.json`\n\nwhere `***item['id']***` is the ID of the document we just retrieved and `@doc.json` points to a file containing the body of the document we just downloaded.\n\n\n#### Got it, show me 
some real code!\n\nCheck out the full Python code here: [src/update.py](src/update.py)\n\nThis is the loop over the top 100 IDs:\n\n```\n    response = yield http_client.fetch('https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')\n    top100_ids = json.loads(response.body)\n    \n    for item_id in top100_ids:\n        yield download_and_index_item(item_id)\n\n    print(\"Done\")\n\n```\n\nand this (shortened) piece downloads the individual items:\n\n```\ndef download_and_index_item(item_id):\n    \n    url = \"https://hacker-news.firebaseio.com/v0/item/%s.json?print=pretty\" % item_id\n    response = yield http_client.fetch(url)\n    item = json.loads(response.body)\n\n    # all sorts of clean-up of \"item\"\n\n    es_url = \"http://localhost:9200/hn/%s/%s\" % (item['type'], item['id'])\n    request = HTTPRequest(es_url, method=\"PUT\", body=json.dumps(item), request_timeout=10)\n    response = yield http_client.fetch(request)\n    if response.code not in [200, 201]:\n        print(\"\\nfailed to add item %s\" % item['id'])\n    else:\n        sys.stdout.write('.')\n```\n\n\n#### Ok, but where's the data?\n\nOnce we have a batch of HN articles in ES, we can run queries:\n\n`curl \"http://localhost:9200/hn/story/_search?pretty\"`\n\ngives us all the stories (really just the first 10, as ES returns 10 results by default).\n\nAll stories for a given user:\n\n`curl \"http://localhost:9200/hn/story/_search?q=by:davecheney\u0026pretty\"`\n\nWe can also run aggregations to see who posted the most stories and what the most popular domains are:\n\n```\ncurl -XGET 'http://localhost:9200/hn/story/_search?search_type=count' -d '\n{ \"aggs\" : { \"domains\" : { \"terms\" : { \"field\" : \"domain\", \"size\": 11 } }, \"by\" : {  \"terms\" : { \"field\" : \"by\", \"size\": 5 } } } }'\n```\n\nreturning something like this:\n\n```\n{ \"aggregations\": {\n    \"by\": {\n      \"buckets\": [\n        { \"doc_count\": 5,\n          \"key\": \"luu\" },\n        { 
\"doc_count\": 3,\n          \"key\": \"benbreen\" },\n        { \"doc_count\": 3,\n          \"key\": \"dnetesn\" \"},\n        ...\n      ]\n    },\n    \"domains\": {\n      \"buckets\": [\n        { \"doc_count\": 6,\n          \"key\": \"github.com\" },\n        { \"doc_count\": 4,\n          \"key\": \"medium.com\" },\n        ...\n      ]\n    }\n  }\n}\n```\n\n\n\n#### What can we do better? \n\n##### Field Mappings\n\nElasticsearch is doing a pretty good job at figuring out what type a field is but sometimes it can use a little help.\nRun this query to see how ES maps each field of the `story` type:\n\n`curl -XGET 'http://localhost:9200/hn/_mapping/story'`\n\nLooks all pretty straight forward but one mapping sticks out:\n\n```\n    \"time\": {\n        \"type\": \"long\"\n    },\n```\n\nThe type `long` is ok but what we really want is the type `date` so we can take advantage of the built-in date operators and aggregations. \u003cbr\u003e\nLet's set up a index mapping for `time`:\n\n```\ncurl -XPUT \"http://localhost:9200/hn/\" -d '{\n    \"mappings\" : {\n        \"story\" : {\n            \"properties\" : {\n                \"time\" :   { \"type\" : \"date\" }\n            }\n        }\n    }\n}'\n```\nThat should do the trick so now we can run a query to see how many stories are being posted to the HN Top 100 per week:\n\n```\ncurl -XGET 'http://localhost:9200/hn/story/_search?search_type=count' -d '\n{\n    \"aggs\" : {\n        \"articles_over_time\" : {\n            \"date_histogram\" : {\n                \"field\" : \"time\",\n                \"interval\" : \"1w\"\n            }\n        }\n    }\n}\n'\n```\nResult:\n\n```\n{ \"aggregations\": {\n    \"articles_over_time\": {\n      \"buckets\": [\n        { \"doc_count\": 1609,\n          \"key\": 1413158400000,\n          \"key_as_string\": \"2014-10-13T00:00:00.000Z\"\n        },\n        { \"doc_count\": 1195,\n          \"key\": 1413763200000,\n          \"key_as_string\": 
\"2014-10-20T00:00:00.000Z\"\n        },\n        { \"doc_count\": 1236,\n          \"key\": 1414368000000,\n          \"key_as_string\": \"2014-10-27T00:00:00.000Z\"\n        },\n        { \"doc_count\": 1304,\n          \"key\": 1414972800000,\n          \"key_as_string\": \"2014-11-03T00:00:00.000Z\"\n        }\n  ] } },\n}\n```\n\n \n\n##### Other possible future improvements\n\n- use bulk API\n- more interesting queries\n- simple web interface to query ES\n\n\n#### feedback\n\nOpen pull requests, issues or email me at o@21zoo.com\n\n\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foliver006%2Felasticsearch-hn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foliver006%2Felasticsearch-hn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foliver006%2Felasticsearch-hn/lists"}