{"id":13572284,"url":"https://github.com/oliver006/elasticsearch-gmail","last_synced_at":"2025-05-14T07:08:15.950Z","repository":{"id":22806876,"uuid":"26153532","full_name":"oliver006/elasticsearch-gmail","owner":"oliver006","description":"Index your Gmail Inbox with Elasticsearch","archived":false,"fork":false,"pushed_at":"2024-11-23T02:27:57.000Z","size":54,"stargazers_count":2049,"open_issues_count":1,"forks_count":162,"subscribers_count":58,"default_branch":"master","last_synced_at":"2025-04-11T01:41:56.891Z","etag":null,"topics":["elasticsearch","filter","gmail","gmail-inbox","mbox-format","python","tornado","tutorial"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oliver006.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-11-04T04:56:28.000Z","updated_at":"2025-03-10T23:12:59.000Z","dependencies_parsed_at":"2024-01-13T02:56:07.730Z","dependency_job_id":"9adb5b1f-a22e-4c86-bd62-ffc51a3560a0","html_url":"https://github.com/oliver006/elasticsearch-gmail","commit_stats":{"total_commits":37,"total_committers":17,"mean_commits":2.176470588235294,"dds":0.6216216216216216,"last_synced_commit":"d7ea1523e2a7b390daad42a45e3b554a4e0258ea"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-gmail","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-gmail/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/oliver006%2Felasticsearch-gmail/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-gmail/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oliver006","download_url":"https://codeload.github.com/oliver006/elasticsearch-gmail/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254092657,"owners_count":22013290,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elasticsearch","filter","gmail","gmail-inbox","mbox-format","python","tornado","tutorial"],"created_at":"2024-08-01T14:01:18.981Z","updated_at":"2025-05-14T07:08:15.911Z","avatar_url":"https://github.com/oliver006.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"Elasticsearch For Beginners: Indexing your Gmail Inbox (and more: Supports any mbox and MH mailboxes)\n=======================\n\n[![Build Status](https://cloud.drone.io/api/badges/oliver006/elasticsearch-gmail/status.svg)](https://cloud.drone.io/oliver006/elasticsearch-gmail)\n\n\n#### What's this all about?\n\nI recently looked at my Gmail inbox and noticed that I have well over 50k emails, taking up about 12GB of space, but there is no good way to tell which emails take up space, who I send them to, who emails me, etc.\n\nThe goal of this tutorial is to load an entire Gmail inbox into Elasticsearch using bulk indexing and then start querying the cluster to get a better picture of what's going on.\n\n\n#### Prerequisites\n\nSet up 
[Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html) and make sure it's running at [http://localhost:9200](http://localhost:9200)\n\nA quick way to run Elasticsearch is using Docker (the CORS settings aren't really needed but come in handy if you want to use e.g. [dejavu](https://dejavu.appbase.io/) to explore the index):\n```\ndocker run --name es -d -p 9200:9200 -e http.port=9200 -e http.cors.enabled=true -e 'http.cors.allow-origin=*' -e http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization -e http.cors.allow-credentials=true -e \"discovery.type=single-node\" docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2\n```\n\nI use Python and [Tornado](https://github.com/tornadoweb/tornado/) for the scripts to import and query the data. I also use `beautifulsoup4` for stripping HTML/JS/CSS (if you want to use the body indexing flag).\n\nInstall the dependencies by running:\n\n`pip3 install -r requirements.txt`\n\n\n#### Aight, where do we start?\n\nFirst, go [here](https://www.google.com/settings/takeout/custom/gmail) and download your Gmail mailbox. Depending on the number of emails you have accumulated, this might take a while.\nThere's also a small `sample.mbox` file included in the repo for you to play around with while you're waiting for Google to prepare your download.\n\nThe downloaded archive is in the [mbox format](http://en.wikipedia.org/wiki/Mbox) and Python provides libraries to work with the mbox format, so that's easy.\n\nYou can run the code (assuming Elasticsearch is running at localhost:9200) with the sample mbox file like this:\n```\n$ python3 src/index_emails.py --infile=sample.mbox\n[I index_emails:173] Starting import from file sample.mbox\n[I index_emails:101] Upload: OK - upload took: 1033ms, total messages uploaded:      3\n[I index_emails:197] Import done - total count 16\n$\n```\n\nNote: All examples focus on Gmail inboxes. 
Substitute any `--infile=` parameters with `--indir=` pointing to an MH directory to make them work with MH mailboxes instead.\n\n#### The Source Code\n\nThe overall program will look something like this:\n\n```python\nmbox = mailbox.mbox('emails.mbox')  # or mailbox.MH('inbox/')\n\nfor msg in mbox:\n    item = convert_msg_to_json(msg)\n    upload_item_to_es(item)\n\nprint(\"Done!\")\n```\n\n#### Ok, tell me more about the details\n\nThe full Python code is here: [src/index_emails.py](src/index_emails.py)\n\n\n##### Turn mailbox into JSON\n\nFirst, we have to turn the messages into JSON so we can insert them into Elasticsearch. [Here](http://nbviewer.ipython.org/github/furukama/Mining-the-Social-Web-2nd-Edition/blob/master/ipynb/Chapter%206%20-%20Mining%20Mailboxes.ipynb) is some sample code that was very useful when it came to normalizing and cleaning up the data.\n\nA good first step:\n\n```python\ndef convert_msg_to_json(msg):\n    result = {'parts': []}\n    for (k, v) in msg.items():\n        result[k.lower()] = v.decode('utf-8', 'ignore')\n\n```\n\nAdditionally, you also want to parse and normalize the `From`, `To`, `Cc` and `Bcc` email addresses:\n\n```python\nfor k in ['to', 'cc', 'bcc']:\n    if not result.get(k):\n        continue\n    emails_split = result[k].replace('\\n', '').replace('\\t', '').replace('\\r', '').replace(' ', '').encode('utf8').decode('utf-8', 'ignore').split(',')\n    result[k] = [ normalize_email(e) for e in emails_split]\n\nif \"from\" in result:\n    result['from'] = normalize_email(result['from'])\n```\n\nElasticsearch expects timestamps to be in milliseconds, so let's convert the date accordingly:\n\n```python\nif \"date\" in result:\n    tt = email.utils.parsedate_tz(result['date'])\n    result['date_ts'] = int(calendar.timegm(tt) - tt[9]) * 1000\n```\n\nWe also need to split up and normalize the labels:\n\n```python\nlabels = []\nif \"x-gmail-labels\" in result:\n    labels = [l.strip().lower() for l in result[\"x-gmail-labels\"].split(',')]\n    
del result[\"x-gmail-labels\"]\nresult['labels'] = labels\n```\n\nEmail size is also interesting, so let's break that out:\n\n```python\nparts = json_msg.get(\"parts\", [])\njson_msg['content_size_total'] = 0\nfor part in parts:\n    json_msg['content_size_total'] += len(part.get('content', \"\"))\n\n```\n\n\n##### Index the data with Elasticsearch\n\nThe simplest approach is a PUT request per item:\n\n```python\ndef upload_item_to_es(item):\n    es_url = \"http://localhost:9200/gmail/email/%s\" % (item['message-id'])\n    request = HTTPRequest(es_url, method=\"PUT\", body=json.dumps(item), request_timeout=10)\n    response = yield http_client.fetch(request)\n    if response.code not in [200, 201]:\n        print(\"\\nfailed to add item %s\" % item['message-id'])\n\n```\n\nHowever, Elasticsearch provides a better method for importing large chunks of data: [bulk indexing](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html).\nInstead of making an HTTP request per document and indexing individually, we batch them in chunks of e.g. 
1000 documents and then index them.\u003cbr\u003e\nBulk messages are of the format:\n\n```\ncmd\\n\ndoc\\n\ncmd\\n\ndoc\\n\n...\n```\n\nwhere `cmd` is the control message for each `doc` we want to index.\nFor our example, `cmd` would look like this:\n\n```\ncmd = {'index': {'_index': 'gmail', '_type': 'email', '_id': item['message-id']}}\n```\n\nThe final code looks something like this:\n\n```python\nupload_data = list()\nfor msg in mbox:\n    item = convert_msg_to_json(msg)\n    upload_data.append(item)\n    if len(upload_data) == 100:\n        upload_batch(upload_data)\n        upload_data = list()\n\nif upload_data:\n    upload_batch(upload_data)\n\n```\nand\n\n```python\ndef upload_batch(upload_data):\n\n    upload_data_txt = \"\"\n    for item in upload_data:\n        cmd = {'index': {'_index': 'gmail', '_type': 'email', '_id': item['message-id']}}\n        upload_data_txt += json.dumps(cmd) + \"\\n\"\n        upload_data_txt += json.dumps(item) + \"\\n\"\n\n    request = HTTPRequest(\"http://localhost:9200/_bulk\", method=\"POST\", body=upload_data_txt, request_timeout=240)\n    response = yield http_client.fetch(request)\n    result = json.loads(response.body)\n    if result.get('errors'):\n        print(result['errors'])\n```\n\n\n\n#### Ok, show me some data!\n\nAfter indexing all your emails, we can start running queries.\n\n\n##### Filters\n\nIf you want to search for emails from the last 6 months, you can use the range filter and search for `gte` the current time (`now`) minus 6 months:\n\n```\ncurl -XGET 'http://localhost:9200/gmail/email/_search?pretty' -d '{\n\"filter\": { \"range\" : { \"date_ts\" : { \"gte\": \"now-6M\" } } } }\n'\n```\n\nor you can filter for all emails from 2013 by using `gte` and `lt`:\n\n```\ncurl -XGET 'http://localhost:9200/gmail/email/_search?pretty' -d '{\n\"filter\": { \"range\" : { \"date_ts\" : { \"gte\": \"2013-01-01T00:00:00.000Z\", \"lt\": \"2014-01-01T00:00:00.000Z\" } } } }\n'\n```\n\nYou can also quickly query for certain fields 
via the `q` parameter. This example shows you all your Amazon shipping info emails:\n\n```\ncurl \"localhost:9200/gmail/email/_search?pretty\u0026q=from:ship-confirm@amazon.com\"\n```\n\n##### Aggregation queries\n\nAggregation queries let us bucket data by a given key and count the number of messages per bucket.\nFor example, the number of messages grouped by recipient:\n\n```\ncurl -XGET 'http://localhost:9200/gmail/email/_search?pretty\u0026search_type=count' -d '{\n\"aggs\": { \"emails\": { \"terms\" : { \"field\" : \"to\",  \"size\": 10 }\n} } }\n'\n```\n\nResult:\n\n```\n\"aggregations\" : {\n\"emails\" : {\n  \"buckets\" : [ {\n       \"key\" : \"noreply@github.com\",\n       \"doc_count\" : 1920\n  }, { \"key\" : \"oliver@gmail.com\",\n       \"doc_count\" : 1326\n  }, { \"key\" : \"michael@gmail.com\",\n       \"doc_count\" : 263\n  }, { \"key\" : \"david@gmail.com\",\n       \"doc_count\" : 232\n  }\n  ...\n  ]\n}\n```\n\nThis one gives us the number of emails per label:\n\n```\ncurl -XGET 'http://localhost:9200/gmail/email/_search?pretty\u0026search_type=count' -d '{\n\"aggs\": { \"labels\": { \"terms\" : { \"field\" : \"labels\",  \"size\": 10 }\n} } }\n'\n```\n\nResult:\n\n```\n\"hits\" : {\n  \"total\" : 51794,\n},\n\"aggregations\" : {\n\"labels\" : {\n  \"buckets\" : [       {\n       \"key\" : \"important\",\n       \"doc_count\" : 15430\n  }, { \"key\" : \"github\",\n       \"doc_count\" : 4928\n  }, { \"key\" : \"sent\",\n       \"doc_count\" : 4285\n  }, { \"key\" : \"unread\",\n       \"doc_count\" : 510\n  },\n  ...\n   ]\n}\n```\n\nUsing a `date histogram`, you can also count how many emails you sent and received per year:\n\n```\ncurl -s \"localhost:9200/gmail/email/_search?pretty\u0026search_type=count\" -d '\n{ \"aggs\": {\n    \"years\": {\n      \"date_histogram\": {\n        \"field\": \"date_ts\", \"interval\": \"year\"\n}}}}\n'\n```\n\nResult:\n\n```\n\"aggregations\" : {\n\"years\" : {\n  \"buckets\" : [ {\n    \"key_as_string\" : 
\"2004-01-01T00:00:00.000Z\",\n    \"key\" : 1072915200000,\n    \"doc_count\" : 585\n  }, {\n...\n  }, {\n    \"key_as_string\" : \"2013-01-01T00:00:00.000Z\",\n    \"key\" : 1356998400000,\n    \"doc_count\" : 12832\n  }, {\n    \"key_as_string\" : \"2014-01-01T00:00:00.000Z\",\n    \"key\" : 1388534400000,\n    \"doc_count\" : 7283\n  } ]\n}\n```\n\nWrite aggregation queries to work out how much you spent on Amazon/Steam:\n\n```\nGET _search\n{\n  \"query\": {\n    \"match_all\": {}\n      },\n      \"size\": 0,\n      \"aggs\": {\n        \"group_by_company\": {\n          \"terms\": {\n            \"field\": \"order_details.merchant\"\n            },\n            \"aggs\": {\n              \"total_spent\": {\n                \"sum\": {\n                  \"field\": \"order_details.order_total\"\n                }\n                },\n                \"postage\": {\n                  \"sum\": {\n                    \"field\": \"order_details.postage\"\n                  }\n                }\n              }\n            }\n          }\n        }\n```\n\n\n#### Todo\n\n- more interesting queries\n- schema tweaks\n- multi-part message parsing\n- blurb about performance\n- ...\n\n\n\n#### Feedback\n\nOpen a pull request or an issue!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foliver006%2Felasticsearch-gmail","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foliver006%2Felasticsearch-gmail","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foliver006%2Felasticsearch-gmail/lists"}