https://github.com/Hironsan/bertsearch
Elasticsearch with BERT for advanced document search.
- Host: GitHub
- URL: https://github.com/Hironsan/bertsearch
- Owner: Hironsan
- License: mit
- Created: 2019-09-25T20:19:02.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2023-05-01T21:15:34.000Z (over 1 year ago)
- Last Synced: 2024-10-11T08:24:09.621Z (2 months ago)
- Topics: bert, elasticsearch, machine-learning, natural-language-processing, search-engine
- Language: Python
- Homepage: https://towardsdatascience.com/elasticsearch-meets-bert-building-search-engine-with-elasticsearch-and-bert-9e74bf5b4cf2
- Size: 728 KB
- Stars: 896
- Watchers: 26
- Forks: 202
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
- ATPapers - Hironsan / bertsearch - Elasticsearch with BERT for advanced document search (Pretrained Language Model / Repository)
README
# Elasticsearch meets BERT
Below is a job search example:
![An example of bertsearch](./docs/example.png)
## System architecture
![System architecture](./docs/architecture.png)
## Requirements
- Docker
- Docker Compose >= [1.22.0](https://docs.docker.com/compose/release-notes/#1220)

## Getting Started
### 1. Download a pretrained BERT model
Released pretrained BERT models:

- BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Base, Multilingual Cased (New): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Multilingual Cased (Old): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

```bash
$ wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
$ unzip cased_L-12_H-768_A-12.zip
```

### 2. Set environment variables
Set the path to the pretrained BERT model and the name of the Elasticsearch index as environment variables:
```bash
$ export PATH_MODEL=./cased_L-12_H-768_A-12
$ export INDEX_NAME=jobsearch
```

### 3. Run Docker containers
```bash
$ docker-compose up
```

**CAUTION**: If possible, allocate more than `8GB` of memory to Docker, because the BERT container is memory-intensive.
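
If the containers fail to start, the two variables from step 2 are worth checking first. The following is a minimal sketch of such a check; `check_env` is a hypothetical helper that is not part of the repository, and it only assumes that `PATH_MODEL` should point at the unzipped model directory and that `INDEX_NAME` should be non-empty.

```python
import os


def check_env(environ=os.environ):
    """Return a list of problems; an empty list means both variables look usable."""
    problems = []
    model_path = environ.get("PATH_MODEL", "")
    if not model_path:
        problems.append("PATH_MODEL is not set")
    elif not os.path.isdir(model_path):
        # The unzipped model is expected to be a directory, e.g. ./cased_L-12_H-768_A-12
        problems.append(f"PATH_MODEL is not a directory: {model_path}")
    if not environ.get("INDEX_NAME"):
        problems.append("INDEX_NAME is not set")
    return problems


# Report anything that looks wrong in the current shell environment.
for problem in check_env():
    print(problem)
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try without touching the shell.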
### 4. Create index
You can use the create index API to add a new index to an Elasticsearch cluster. When creating an index, you can specify the following:
* Settings for the index
* Mappings for fields in the index
* Index aliases

For example, to create a `jobsearch` index with `title`, `text`, and `text_vector` fields, run the following command:
```bash
$ python example/create_index.py --index_file=example/index.json --index_name=jobsearch

# index.json
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": "true",
    "_source": {
      "enabled": "true"
    },
    "properties": {
      "title": {
        "type": "text"
      },
      "text": {
        "type": "text"
      },
      "text_vector": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  }
}
```

**CAUTION**: The `dims` value of `text_vector` must match the hidden size of the pretrained BERT model (768 for the BERT-Base models, 1024 for BERT-Large).
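
The relationship between the mapping and the model size can be sketched in Python. This is an illustration only, not the contents of `example/create_index.py`: the names `HIDDEN_SIZE`, `build_index_body`, and `check_dims` are hypothetical, and the hidden sizes are taken from the model list in step 1.

```python
import json

# Hidden sizes from the released model list (BERT-Base: 768, BERT-Large: 1024).
HIDDEN_SIZE = {"base": 768, "large": 1024}


def build_index_body(dims, shards=2, replicas=1):
    """Return an index definition shaped like example/index.json."""
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {
            "dynamic": "true",
            "_source": {"enabled": "true"},
            "properties": {
                "title": {"type": "text"},
                "text": {"type": "text"},
                "text_vector": {"type": "dense_vector", "dims": dims},
            },
        },
    }


def check_dims(index_body, hidden_size):
    """Raise ValueError if the vector field does not match the model's hidden size."""
    dims = index_body["mappings"]["properties"]["text_vector"]["dims"]
    if dims != hidden_size:
        raise ValueError(
            f"text_vector dims={dims}, but the model outputs {hidden_size} values"
        )
    return dims


# Build a definition for a BERT-Base model and confirm the sizes agree.
body = build_index_body(HIDDEN_SIZE["base"])
check_dims(body, HIDDEN_SIZE["base"])
print(json.dumps(body["mappings"]["properties"]["text_vector"]))
```

Running a check like this before creating the index catches the mismatch early, for example when switching from a Base to a Large model without updating `index.json`.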
### 5. Create documents
Once you have created an index, you are ready to index documents. The key step is to convert each document into a vector using BERT; the resulting vector is stored in the `text_vector` field. Let's convert your data into JSON documents:
```bash
$ python example/create_documents.py --data=example/example.csv --index_name=jobsearch
# example/example.csv
"Title","Description"
"Saleswoman","lorem ipsum"
"Software Developer","lorem ipsum"
"Chief Financial Officer","lorem ipsum"
"General Manager","lorem ipsum"
"Network Administrator","lorem ipsum"
```

When the script finishes, you get JSON documents like the following:
```python
# documents.jsonl
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Saleswoman", "text_vector": [...]}
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Software Developer", "text_vector": [...]}
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Chief Financial Officer", "text_vector": [...]}
...
```

### 6. Index documents
After converting your data into JSON, you can add the documents to the specified index, which makes them searchable.
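
To make the data flow of the conversion and indexing steps concrete, here is a minimal sketch under stated assumptions. It is not the repository's implementation: `placeholder_embed` is a hypothetical stand-in that returns a zero vector instead of calling BERT, and `rows_to_actions` and `read_actions` are illustrative names. The CSV column names and the action fields follow the examples in step 5.

```python
import csv
import io
import json


def placeholder_embed(text, dims=768):
    """Hypothetical stand-in for the BERT encoder: a zero vector of length dims."""
    return [0.0] * dims


def rows_to_actions(csv_file, index_name, embed=placeholder_embed):
    """Yield one bulk action per CSV row with Title and Description columns."""
    for row in csv.DictReader(csv_file):
        yield {
            "_op_type": "index",
            "_index": index_name,
            "text": row["Description"],
            "title": row["Title"],
            "text_vector": embed(row["Description"]),
        }


def read_actions(jsonl_file):
    """Yield actions back from newline-delimited JSON, skipping blank lines."""
    for line in jsonl_file:
        if line.strip():
            yield json.loads(line)


# Convert a one-row CSV to JSON lines, then read the lines back as actions.
sample_csv = io.StringIO('"Title","Description"\n"Saleswoman","lorem ipsum"\n')
jsonl = "\n".join(json.dumps(a) for a in rows_to_actions(sample_csv, "jobsearch"))
actions = list(read_actions(io.StringIO(jsonl)))
# Dicts of this shape can be passed to elasticsearch.helpers.bulk(client, actions).
```

Keeping the embedding function as a parameter means the same conversion code works with any encoder that returns a list of floats of the expected length. The repository's own script performs the actual indexing: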
```bash
$ python example/index_documents.py
```

### 7. Open browser
Go to .