https://github.com/collaborative-ai/coldata
Dataset Search Engine with Vector Database
beautifulsoup dataset milvus mongodb search-engine vector-database
Last synced: 7 months ago
- Host: GitHub
- URL: https://github.com/collaborative-ai/coldata
- Owner: Collaborative-AI
- License: mit
- Created: 2023-05-19T03:32:19.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-28T11:59:19.000Z (7 months ago)
- Last Synced: 2025-02-28T18:18:46.814Z (7 months ago)
- Topics: beautifulsoup, dataset, milvus, mongodb, search-engine, vector-database
- Language: Python
- Homepage:
- Size: 27 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# ColData
_**Col**laborative **Data**set Search Engine with Vector Database_

**coldata** is an open-source dataset search engine designed to help researchers, data scientists, developers, and the broader community **collaboratively** discover, share, and access relevant datasets across a variety of sources.
The engine crawls metadata from popular dataset hosting platforms with **BeautifulSoup**, stores it in **MongoDB**, and indexes it as vector embeddings in **Milvus** for semantic search and retrieval.

## Features
- **Multi-source Crawling**: We gather metadata from major dataset repositories.
- **Vector-based Search**: The metadata is converted into vector embeddings using a language model, enabling semantic search.
- **Interface**: With the help of **Gradio**, we offer a simple demo to interact with the engine and quickly locate datasets.
- **Scalable**: The underlying database, **MongoDB**, and vector engine, **Milvus**, let the system scale as the number of crawled datasets grows.
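
As a rough illustration of the vector-based search idea in the list above, the sketch below ranks metadata records by cosine similarity between a query vector and each record vector. It is not ColData's implementation: `embed` is a toy hashed bag-of-words stand-in for the language model, a plain in-memory list takes the place of Milvus, and the record fields (`title`, `description`) and all function names are assumptions made for this example.

```python
import hashlib
import math

DIM = 256  # embedding size chosen for this toy example only


def embed(text: str) -> list:
    """Toy stand-in for a language-model embedding: hashed bag of words."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # A stable hash maps the same token to the same slot on every run.
        slot = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[slot] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Normalize to unit length so a dot product equals cosine similarity.
    return [v / norm for v in vec] if norm else vec


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query: str, records: list, top_k: int = 3) -> list:
    """Return the top_k records most similar to the query (hypothetical schema)."""
    query_vec = embed(query)
    scored = [
        (cosine(query_vec, embed(rec["title"] + " " + rec["description"])), rec)
        for rec in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:top_k]]
```

A call such as `search("iris flower classification", records, top_k=1)` would then return the record whose text overlaps most with the query. A real deployment would swap `embed` for a learned embedding model and delegate the nearest-neighbor lookup to the vector database.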
## Datasets
We currently support crawling and indexing datasets from the following sources:
| Source | Number of Datasets | Crawling Completed |
|---------------------------------------|-----------------------|------------|
| **UCI** | 675 | ✅ |
| **Kaggle** | 40,000+ | ✅ |
| **Registry of Open Data on AWS** | 496 | ✅ |
| **Papers With Code** | 8,966 | ✅ |
| **Figshare** | 1,856,206 | |
| **Mendeley Data** | 1,307,514 | |
| **Hugging Face Datasets** | 88,179 | |
| **Zenodo** | 234,972 | |
| **IEEE Dataport** | 1,170 | |
| **Open Data Lab** | 6,432 | |
| **Roboflow Universe** | 200,000+ | |

## Installation
Clone the repository:
```bash
git clone https://github.com/collaborative-ai/coldata.git
cd coldata
```

Install dependencies:
```bash
pip install -r requirements.txt
```

## Configuration
The system can be customized via the `config.yml` file, where you can configure parameters for both MongoDB and Milvus.
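
To make the idea of a settings file concrete, here is a minimal sketch of loading such a file and overlaying it on defaults. Every key name below (`mongodb`, `milvus`, `host`, `port`, `database`, `collection`) is hypothetical and is not taken from ColData's actual `config.yml`; only the port numbers are the standard defaults for MongoDB (27017) and Milvus (19530). Reading the file assumes PyYAML is installed.

```python
import copy

# Hypothetical defaults: these key names are illustrative, not ColData's schema.
DEFAULTS = {
    "mongodb": {"host": "localhost", "port": 27017, "database": "coldata"},
    "milvus": {"host": "localhost", "port": 19530, "collection": "datasets"},
}


def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user settings on top of a copy of the defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: str = "config.yml") -> dict:
    """Read a YAML settings file and fill in any missing values from DEFAULTS."""
    # PyYAML is imported lazily so merge() stays usable without it installed.
    import yaml

    with open(path, "r", encoding="utf-8") as handle:
        # safe_load returns None for an empty file; merge() treats that as {}.
        return merge(DEFAULTS, yaml.safe_load(handle))
```

Merging onto defaults means a settings file only needs to list the values that differ from a local setup, for example a non-default Milvus port.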
## Quick Start
1. **Start MongoDB**:
Run `start_mongo.sh` to start a local MongoDB instance.
```bash
./start_mongo.sh
```
2. **Start Milvus**:
Run `manage_milvus.sh` to start the Milvus vector database.
```bash
./manage_milvus.sh
```
3. **Run Scheduler**:
Set up the scheduler to crawl datasets at a specified interval by running:
```bash
python scheduler.py
```
4. **Demo Interface**:
To quickly test the dataset search, you can use the Gradio-based demo:
```bash
python demo.py
```
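
The scheduler step above crawls sources at a fixed interval. The sketch below shows one simple way such a loop could look; it is not the contents of `scheduler.py`. The function name `run_scheduler`, its parameters, and the idea of crawlers as callables returning lists of metadata records are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


def run_scheduler(
    crawlers: Dict[str, Callable[[], List[dict]]],
    interval_seconds: float,
    max_rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Run every crawler once per round, waiting interval_seconds between rounds.

    Returns the number of records each crawler produced across all rounds.
    """
    totals = {name: 0 for name in crawlers}
    for round_index in range(max_rounds):
        for name, crawl in crawlers.items():
            try:
                totals[name] += len(crawl())
            except Exception as error:
                # One failing source should not stop the remaining crawlers.
                print(f"{name}: crawl failed ({error})")
        # Skip the wait after the final round.
        if round_index < max_rounds - 1:
            sleep(interval_seconds)
    return totals
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays, and `max_rounds` bounds the run; a long-lived service would more likely loop until interrupted.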
## How to Contribute

We welcome contributions to improve **coldata**. If you have ideas for new features, bug fixes, or dataset sources to include, please feel free to open an issue or submit a pull request.
## License
This project is licensed under the MIT License.