https://github.com/Hironsan/bertsearch

Elasticsearch with BERT for advanced document search.
bert elasticsearch machine-learning natural-language-processing search-engine

Host: GitHub
URL: https://github.com/Hironsan/bertsearch
Owner: Hironsan
License: mit
Created: 2019-09-25T20:19:02.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2023-05-01T21:15:34.000Z (about 2 years ago)
Last Synced: 2025-04-12T18:48:01.001Z (3 months ago)
Topics: bert, elasticsearch, machine-learning, natural-language-processing, search-engine
Language: Python
Homepage: https://towardsdatascience.com/elasticsearch-meets-bert-building-search-engine-with-elasticsearch-and-bert-9e74bf5b4cf2
Size: 728 KB
Stars: 899
Watchers: 25
Forks: 202
Open Issues: 4
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE

Awesome Lists containing this project

ATPapers - Hironsan / bertsearch - Elasticsearch with BERT for advanced document search (Pretrained Language Model / Repository)

README

# Elasticsearch meets BERT

Below is a job search example:

![An example of bertsearch](./docs/example.png)

## System architecture

![System architecture](./docs/architecture.png)

## Requirements

- Docker
- Docker Compose >= [1.22.0](https://docs.docker.com/compose/release-notes/#1220)

## Getting Started

### 1. Download a pretrained BERT model

List of released pretrained BERT models:

- BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Base, Multilingual Cased (New): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Multilingual Cased (Old): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

```bash
$ wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
$ unzip cased_L-12_H-768_A-12.zip
```

### 2. Set environment variables

Set the path to the pretrained BERT model and the Elasticsearch index name as environment variables:

```bash
$ export PATH_MODEL=./cased_L-12_H-768_A-12
$ export INDEX_NAME=jobsearch
```

### 3. Run Docker containers

```bash
$ docker-compose up
```

**CAUTION**: If possible, allocate more than `8GB` of memory to Docker, because the BERT container is memory-intensive.

### 4. Create index

You can use the create index API to add a new index to an Elasticsearch cluster. When creating an index, you can specify the following:

* Settings for the index
* Mappings for fields in the index
* Index aliases

For example, to create a `jobsearch` index with `title`, `text`, and `text_vector` fields, run the following command:

```bash
$ python example/create_index.py --index_file=example/index.json --index_name=jobsearch
# index.json
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": "true",
    "_source": {
      "enabled": "true"
    },
    "properties": {
      "title": {
        "type": "text"
      },
      "text": {
        "type": "text"
      },
      "text_vector": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  }
}
```

**CAUTION**: The `dims` value of `text_vector` must match the hidden size of the pretrained BERT model (768 for the BERT-Base models, 1024 for the BERT-Large models).
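
The dimension requirement can be checked mechanically before creating the index. The sketch below is an illustration rather than code from this repository: the helper names are hypothetical, and it assumes the model directory contains a `bert_config.json` with a `hidden_size` entry, as Google's released archives do.

```python
import json
from pathlib import Path


def vector_dims(index_body, field="text_vector"):
    """Read the `dims` value of a dense_vector field from an index definition."""
    return index_body["mappings"]["properties"][field]["dims"]


def hidden_size(model_dir):
    """Read `hidden_size` from the model's bert_config.json."""
    config = json.loads((Path(model_dir) / "bert_config.json").read_text())
    return config["hidden_size"]


def dims_match(index_body, model_dir, field="text_vector"):
    """Return True when the index's vector dims equal the model's hidden size."""
    return vector_dims(index_body, field) == hidden_size(model_dir)
```

For instance, loading `example/index.json` with `json.loads` and calling `dims_match(index_body, "./cased_L-12_H-768_A-12")` should return `True` for the index definition above. Switching to a BERT-Large model would require changing `dims` to 1024.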

### 5. Create documents

Once you have created an index, you're ready to index some documents. The key step is to convert each document into a vector using BERT; the resulting vector is stored in the `text_vector` field. Let's convert your data into JSON documents:

```bash
$ python example/create_documents.py --data=example/example.csv --index_name=jobsearch
# example/example.csv
"Title","Description"
"Saleswoman","lorem ipsum"
"Software Developer","lorem ipsum"
"Chief Financial Officer","lorem ipsum"
"General Manager","lorem ipsum"
"Network Administrator","lorem ipsum"
```

After the script finishes, you get JSON documents like the following:

```python
# documents.jsonl
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Saleswoman", "text_vector": [...]}
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Software Developer", "text_vector": [...]}
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Chief Financial Officer", "text_vector": [...]}
...
```
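
The conversion from CSV rows to the JSON lines shown above can be sketched as follows. This is not the repository's `create_documents.py`. The `embed` argument is a placeholder callable (text in, list of floats out) standing in for the BERT container, and embedding the `Description` column is an assumption of this sketch.

```python
import csv
import io
import json


def rows_to_actions(csv_file, index_name, embed):
    """Yield one bulk-index action per CSV row.

    `embed` is a hypothetical stand-in for the BERT encoder: it takes a
    string and returns a list of floats.
    """
    for row in csv.DictReader(csv_file):
        yield {
            "_op_type": "index",
            "_index": index_name,
            "text": row["Description"],
            "title": row["Title"],
            "text_vector": embed(row["Description"]),
        }


def write_jsonl(actions, out_file):
    """Write each action as one JSON object per line."""
    for action in actions:
        out_file.write(json.dumps(action) + "\n")


def fake_embed(text):
    """Dummy encoder for demonstration only; returns a zero vector."""
    return [0.0] * 768


if __name__ == "__main__":
    sample = io.StringIO('"Title","Description"\n"Saleswoman","lorem ipsum"\n')
    out = io.StringIO()
    write_jsonl(rows_to_actions(sample, "jobsearch", fake_embed), out)
    print(out.getvalue()[:80])
```

Passing the encoder in as a parameter keeps the file handling testable without a running BERT service.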

### 6. Index documents

After converting your data into JSON, you can add the JSON documents to the specified index, which makes them searchable.

```bash
$ python example/index_documents.py
```
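
For readers who want to see what this step might look like in code, here is a minimal sketch. It is not the repository's `index_documents.py`. It assumes the `elasticsearch` Python client is installed and that `hosts` is supplied by the caller; `load_actions` and `index_all` are hypothetical names.

```python
import json


def load_actions(path):
    """Lazily yield one action dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def index_all(path, hosts):
    """Send every action in the file to Elasticsearch in bulk.

    Requires the `elasticsearch` package. It is imported here so that
    `load_actions` stays usable without it.
    """
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch(hosts)
    # helpers.bulk returns (number of successful actions, list of errors).
    return helpers.bulk(client, load_actions(path))
```

Because `load_actions` is a generator, the file is streamed line by line instead of being loaded into memory at once, which matters when each document carries a 768-dimensional vector.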

### 7. Open browser

Go to .
