{"id":13615299,"url":"https://github.com/Hironsan/bertsearch","last_synced_at":"2025-04-13T21:30:37.023Z","repository":{"id":35528701,"uuid":"210933055","full_name":"Hironsan/bertsearch","owner":"Hironsan","description":"Elasticsearch with BERT for advanced document search.","archived":false,"fork":false,"pushed_at":"2023-05-01T21:15:34.000Z","size":745,"stargazers_count":899,"open_issues_count":4,"forks_count":202,"subscribers_count":25,"default_branch":"master","last_synced_at":"2025-04-12T18:48:01.001Z","etag":null,"topics":["bert","elasticsearch","machine-learning","natural-language-processing","search-engine"],"latest_commit_sha":null,"homepage":"https://towardsdatascience.com/elasticsearch-meets-bert-building-search-engine-with-elasticsearch-and-bert-9e74bf5b4cf2","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Hironsan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null},"funding":{"github":"Hironsan"}},"created_at":"2019-09-25T20:19:02.000Z","updated_at":"2025-04-09T05:27:27.000Z","dependencies_parsed_at":"2024-01-17T09:03:09.144Z","dependency_job_id":null,"html_url":"https://github.com/Hironsan/bertsearch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hironsan%2Fbertsearch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hironsan%2Fbertsearch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hironsan%2Fbertsearch/releases","manifests_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/Hironsan%2Fbertsearch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Hironsan","download_url":"https://codeload.github.com/Hironsan/bertsearch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248785879,"owners_count":21161369,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","elasticsearch","machine-learning","natural-language-processing","search-engine"],"created_at":"2024-08-01T20:01:11.659Z","updated_at":"2025-04-13T21:30:36.997Z","avatar_url":"https://github.com/Hironsan.png","language":"Python","readme":"# Elasticsearch meets BERT\n\nBelow is a job search example:\n\n![An example of bertsearch](./docs/example.png)\n\n## System architecture\n\n![System architecture](./docs/architecture.png)\n\n## Requirements\n\n- Docker\n- Docker Compose \u003e= [1.22.0](https://docs.docker.com/compose/release-notes/#1220)\n\n## Getting Started\n\n### 1. 
Download a pretrained BERT model\n\n\u003cdetails\u003e\n \u003csummary\u003eList of released pretrained BERT models (click to expand...)\u003c/summary\u003e\n\n\n\u003ctable\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ca href=\"https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip\"\u003eBERT-Base, Uncased\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e12-layer, 768-hidden, 12-heads, 110M parameters\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ca href=\"https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip\"\u003eBERT-Large, Uncased\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e24-layer, 1024-hidden, 16-heads, 340M parameters\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ca href=\"https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip\"\u003eBERT-Base, Cased\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e12-layer, 768-hidden, 12-heads, 110M parameters\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ca href=\"https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip\"\u003eBERT-Large, Cased\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e24-layer, 1024-hidden, 16-heads, 340M parameters\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ca href=\"https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip\"\u003eBERT-Base, Multilingual Cased (New)\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ca href=\"https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip\"\u003eBERT-Base, Multilingual Cased (Old)\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ca 
href=\"https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip\"\u003eBERT-Base, Chinese\u003c/a\u003e\u003c/td\u003e\u003ctd\u003eChinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\n\u003c/details\u003e\n\n```bash\n$ wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip\n$ unzip cased_L-12_H-768_A-12.zip\n```\n\n### 2. Set environment variables\n\nSet the path to the pretrained BERT model and the Elasticsearch index name as environment variables:\n\n```bash\n$ export PATH_MODEL=./cased_L-12_H-768_A-12\n$ export INDEX_NAME=jobsearch\n```\n\n### 3. Run Docker containers\n\n\n```bash\n$ docker-compose up\n```\n\n**CAUTION**: If possible, allocate more than `8GB` of memory in Docker's memory configuration, because the BERT container is memory-intensive.\n\n### 4. Create index\n\nYou can use the create index API to add a new index to an Elasticsearch cluster. When creating an index, you can specify the following:\n\n* Settings for the index\n* Mappings for fields in the index\n* Index aliases\n\nFor example, to create a `jobsearch` index with `title`, `text` and `text_vector` fields, run the following command:\n\n```bash\n$ python example/create_index.py --index_file=example/index.json --index_name=jobsearch\n# index.json\n{\n  \"settings\": {\n    \"number_of_shards\": 2,\n    \"number_of_replicas\": 1\n  },\n  \"mappings\": {\n    \"dynamic\": \"true\",\n    \"_source\": {\n      \"enabled\": \"true\"\n    },\n    \"properties\": {\n      \"title\": {\n        \"type\": \"text\"\n      },\n      \"text\": {\n        \"type\": \"text\"\n      },\n      \"text_vector\": {\n        \"type\": \"dense_vector\",\n        \"dims\": 768\n      }\n    }\n  }\n}\n```\n\n**CAUTION**: The `dims` value of `text_vector` must match the dims of the pretrained BERT model (768 for the BERT-Base model downloaded above).\n\n### 5. 
Create documents\n\nOnce you have created an index, you’re ready to index some documents. The point here is to convert each document into a vector using BERT. The resulting vector is stored in the `text_vector` field. Let's convert your data into JSON documents:\n\n```bash\n$ python example/create_documents.py --data=example/example.csv --index_name=jobsearch\n# example/example.csv\n\"Title\",\"Description\"\n\"Saleswoman\",\"lorem ipsum\"\n\"Software Developer\",\"lorem ipsum\"\n\"Chief Financial Officer\",\"lorem ipsum\"\n\"General Manager\",\"lorem ipsum\"\n\"Network Administrator\",\"lorem ipsum\"\n```\n\nWhen the script finishes, you get JSON documents like the following:\n\n```python\n# documents.jsonl\n{\"_op_type\": \"index\", \"_index\": \"jobsearch\", \"text\": \"lorem ipsum\", \"title\": \"Saleswoman\", \"text_vector\": [...]}\n{\"_op_type\": \"index\", \"_index\": \"jobsearch\", \"text\": \"lorem ipsum\", \"title\": \"Software Developer\", \"text_vector\": [...]}\n{\"_op_type\": \"index\", \"_index\": \"jobsearch\", \"text\": \"lorem ipsum\", \"title\": \"Chief Financial Officer\", \"text_vector\": [...]}\n...\n```\n\n### 6. Index documents\n\nAfter converting your data into JSON, you can add the JSON documents to the specified index and make them searchable.\n\n```bash\n$ python example/index_documents.py\n```\n\n### 7. Open browser\n\nGo to \u003chttp://127.0.0.1:5000\u003e.\n","funding_links":["https://github.com/sponsors/Hironsan"],"categories":["Pretrained Language Model","Python","search-engine"],"sub_categories":["Repository"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHironsan%2Fbertsearch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FHironsan%2Fbertsearch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHironsan%2Fbertsearch/lists"}