Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zhangyaoxing/inditex-poc
POC for Inditex
- Host: GitHub
- URL: https://github.com/zhangyaoxing/inditex-poc
- Owner: zhangyaoxing
- Created: 2024-04-12T13:51:51.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-04-17T16:05:48.000Z (9 months ago)
- Last Synced: 2024-04-23T00:20:06.735Z (9 months ago)
- Language: JavaScript
- Size: 180 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Inditex POC
## Test Dataset
- We are using the [CORD-19](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge) (COVID-19 Open Research Dataset Challenge) dataset for testing.
- The data used in the POC is from the `./document_parses/pdf_json` folder. Note that this is a huge dataset: about 20 GB compressed and more than 100 GB when unzipped. Make sure you have enough space to hold everything.
## Import Test Data
### Dependencies
```bash
sudo yum install -y jq cyrus-sasl cyrus-sasl-gssapi cyrus-sasl-plain
```

### Executables
The script uses `mongoimport` together with `jq` to import the JSON files. `mongoimport` is included in the [MongoDB Database Tools](https://www.mongodb.com/try/download/database-tools).

### Configuration
Replace the following configuration values with your own:
- `uri='mongodb+srv://:@vectorsearchpoc.wy1z6.mongodb.net/'`: The target MongoDB URI.
- `db=covid`: Database name where you want to store the dataset.
- `coll=paper`: Collection name where you want to store the dataset.
- `json_source="/document_parses/pdf_json/*.json"`: Where to find the JSON files.

### How to Import
```bash
./import.sh
```

### What's Imported
Each JSON file is large and contains a lot of data, most of which we don't need. Only the following fields are imported for this test:
- `paper_id`: `paper_id`
- `title`: `metadata.title`
- `text`: `body_text.text`
- `section`: `body_text.section`

Note that `body_text` is an array. We unwound the array so that each text snippet is inserted as a separate document.
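The unwind can be sketched in Node as follows. This is only an illustration of the shape of the transformation; the actual script does the equivalent with `jq` and `mongoimport`.

```javascript
// Sketch: one input paper becomes one output document per body_text entry,
// keeping only the fields listed above. Illustrative only; import.sh does
// the equivalent with jq and mongoimport.
function unwindPaper(paper) {
  return paper.body_text.map((snippet) => ({
    paper_id: paper.paper_id,
    title: paper.metadata.title,
    text: snippet.text,
    section: snippet.section,
  }));
}

// Tiny made-up paper in the CORD-19 shape:
const paper = {
  paper_id: "example-id",
  metadata: { title: "Example title" },
  body_text: [
    { text: "First snippet.", section: "Intro" },
    { text: "Second snippet.", section: "Methods" },
  ],
};
console.log(unwindPaper(paper)); // two documents, one per snippet
```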
This is an example of the final document structure:
```json
{
"_id": {
"$oid": "6619439e4728952eae60aeb8"
},
"paper_id": "0000028b5cc154f68b8a269f6578f21e31f62977",
"title": "\"Multi-faceted\" COVID-19: Russian experience",
"text": "According to current live statistics at the time of editing this letter, Russia has been the third country in the world to be affected by COVID-19 with both new cases and death rates rising. It remains in a position of advantage due to the later onset of the viral spread within the country since the worldwide disease outbreak.",
"section": "Editor"
}
```

## Performance Test
### The Logic
- The application first loads `numSamples` documents from the collection. The `embedding` fields of these documents are used in the tests that follow.
- When a request arrives:
  - The application randomly picks one embedding from the previously loaded embeddings.
  - It uses that embedding to run a vector search.
  - The search considers `numCandidates` candidate documents and returns the top `limit` results.
  - The full documents (without the embedding field, because it is large and usually useless to the client) are returned to the client.

### Dependencies
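The search step described under "The Logic" might be built like the sketch below. This assumes the service uses MongoDB Atlas `$vectorSearch`; the index name and embedding field are assumptions, not taken from the repository.

```javascript
// Sketch: build the aggregation pipeline for one request. "vector_index"
// and "embedding" are assumed names; check the actual source for the real ones.
function buildVectorSearchPipeline(queryVector, { numCandidates, limit }) {
  return [
    {
      $vectorSearch: {
        index: "vector_index", // assumed index name
        path: "embedding",     // assumed embedding field
        queryVector,
        numCandidates,         // candidates considered by the search
        limit,                 // results actually returned
      },
    },
    // Drop the large embedding field before sending documents to the client.
    { $project: { embedding: 0 } },
  ];
}

// Pick one of the pre-loaded embeddings at random, as described above:
function pickRandomEmbedding(embeddings) {
  return embeddings[Math.floor(Math.random() * embeddings.length)];
}
```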
- Node.js v18.
- Source code is in the `/performance-test` folder.
- Packages and tools:
```bash
cd performance-test
npm install
npm install -g pm2
```

### Run
To start:
```bash
pm2 start bin/www -i max --watch
```
To review logs:
```bash
pm2 log www
```
To stop:
```bash
pm2 stop www
```
To remove:
```bash
pm2 delete www
```

### Test
Apache `ab` is used for testing:
```bash
sudo yum install -y httpd-tools
ab -n 100 -c 1 http://localhost/vector-search
```