https://github.com/weaviate/arxiv-demo-dataset
This repository will contain a demo using Weaviate with data and metadata from the arXiv dataset.
- Host: GitHub
- URL: https://github.com/weaviate/arxiv-demo-dataset
- Owner: weaviate
- License: BSD-3-Clause
- Created: 2020-09-04T13:54:12.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2022-03-08T22:13:48.000Z (over 3 years ago)
- Last Synced: 2025-01-14T12:30:24.081Z (9 months ago)
- Topics: arxiv-dataset, weaviate
- Language: HTML
- Homepage:
- Size: 9.71 MB
- Stars: 12
- Watchers: 6
- Forks: 5
- Open Issues: 4
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
# arXiv-demo-dataset
This repository will contain a demo using Weaviate with data and metadata from the [arXiv dataset](https://www.kaggle.com/Cornell-University/arxiv).
The code is tested with Python version 3.8.5.
## Steps to set up:
1. Spin up a default Weaviate instance with docker-compose (see https://weaviate.io/developers/weaviate/current/getting-started/installation.html#docker-compose); a minimal compose sketch is shown after the table below.
2. Run `python start_project.py` with the following optional arguments (an example invocation follows the table). If a config file (`-cf CONFIG_FILE, --config_file CONFIG_FILE`) is given, all other parameters are ignored:
| short argument | long argument | default value | description |
| ------ | ------ | ------ | ------ |
| -cf | --config_file | | config file name |
| -i | --metadata_file | data/arxiv-metadata-oai-snapshot.json | location and name of the arXiv metadata JSON file |
| -s | --schema | project/schema.json | location and name of the schema |
| -w | --weaviate | http://localhost:8080 | Weaviate URL |
| -np | --n_papers | 1000000000 | maximum number of papers to import |
| -snp | --skip_n_papers | 0 | number of papers to skip before starting the import |
| -po | --papers_only | false | if set to true, imports only papers, skipping all other data objects and ignoring --skip_journals, --skip_authors and --skip_taxonomy |
| -sj | --skip_journals | false | whether you want to skip the import of all the journals |
| -sa | --skip_authors | false | whether you want to skip the import of all the authors |
| -st | --skip_taxonomy | false | whether you want to skip the import of all the arXiv taxonomy objects |
| -to | --timeout | 20 | maximum timeout in seconds for the Python client's batching operations |
| -ows | --overwrite_schema | false | overwrites the schema in Weaviate if one is already present and a schema file is given |
| -bs | --batch_size | 512 | maximum number of data objects to be sent in one batch |
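For step 1, a minimal `docker-compose.yml` sketch is shown below. This is an illustration only: the image tag and environment values are assumptions rather than this demo's pinned configuration, and the compose setup used by this demo additionally runs an Elasticsearch service (see the usage notes). Prefer the file generated by the installation page linked above.

```yaml
# Minimal sketch of a standalone Weaviate service. Values are
# illustrative assumptions; use the compose file from the Weaviate
# installation docs for the version you actually run.
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:latest  # pin a concrete version in practice
    ports:
      - "8080:8080"                          # matches the default --weaviate URL
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
```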
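For step 2, an illustrative invocation, assuming the metadata file sits at its default location; the flag values are examples, not recommendations:

```sh
# Import only the first 10,000 papers, with a longer batch timeout
# and a smaller batch size (all values are illustrative).
python start_project.py \
  -i data/arxiv-metadata-oai-snapshot.json \
  -w http://localhost:8080 \
  -np 10000 \
  -to 50 \
  -bs 256
```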
## Usage notes
- If you want to import the whole arXiv dataset (2.65 GB), make sure you have enough memory available in your environment and your Docker setup (I allocated 200 GB for the Docker image size).
- In addition, set the `--timeout` parameter to at least 50 to avoid batches failing because of longer read and write times.
- Moreover, make sure to allocate enough memory for Elasticsearch (ES) by setting `ES_JAVA_OPTS: -Xms4g -Xmx4g` in `docker-compose.yaml`.
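For example, in `docker-compose.yaml` (the service name below is an assumption; match whatever the generated compose file names its Elasticsearch service):

```yaml
services:
  elasticsearch:                      # service name is an assumption
    environment:
      ES_JAVA_OPTS: "-Xms4g -Xmx4g"   # 4 GB min/max JVM heap
```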
## Build Status
| Branch | Status |
| -------- |:-------------:|
| Master | [![Build Status](https://travis-ci.com/semi-technologies/arXiv-demo-dataset.svg?branch=master)](https://travis-ci.com/semi-technologies/arXiv-demo-dataset) |