https://github.com/collaborative-ai/coldata

Dataset Search Engine with Vector Database
https://github.com/collaborative-ai/coldata

beautifulsoup dataset milvus mongodb search-engine vector-database

Last synced: 7 months ago
JSON representation

Dataset Search Engine with Vector Database

Host: GitHub
URL: https://github.com/collaborative-ai/coldata
Owner: Collaborative-AI
License: mit
Created: 2023-05-19T03:32:19.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2025-02-28T11:59:19.000Z (7 months ago)
Last Synced: 2025-02-28T18:18:46.814Z (7 months ago)
Topics: beautifulsoup, dataset, milvus, mongodb, search-engine, vector-database
Language: Python
Homepage:
Size: 27 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

Awesome Lists containing this project

README

# ColData
_**Col**laborative **Data**set Search Engine with Vector Database_

**coldata** is an open-source dataset search engine designed to help researchers, data scientists, developers, and the broader community **collaboratively** discover, share, and access relevant datasets across a variety of sources.
The engine crawls metadata from popular dataset hosting platforms with **BeautifulSoup**, stores it in **MongoDB**, and transforms it into a vector-based database using **MilvuDB** for enhanced search and retrieval.

## Features

- **Multi-source Crawling**: We gather metadata from major dataset repositories.

- **Vector-based Search**: The metadata is converted into vector embeddings using the language model, enabling powerful semantic search capabilities.

- **Interface**: With the help of **Gradio**, we offer a simple demo to interact with the engine and quickly locate datasets.

- **Scalable**: The underlying database, **MongoDB**, and vector engine, **MilvuDB**, ensure that the system scales as the number of crawled datasets grows.

## Datasets

We currently support crawling and indexing datasets from the following sources:

| Dataset Name | Number of Datasets | Completed |
|---------------------------------------|-----------------------|------------|
| **UCI** | 675 | ✅ |
| **Kaggle** | 40,000+ | ✅ |
| **Registry of Open Data on AWS** | 496 | ✅ |
| **Papers With Code** | 8,966 | ✅ |
| **Figshare** | 1,856,206 | |
| **Mendeley Data** | 1,307,514 | |
| **Hugging Face Datasets** | 88,179 | |
| **Zenodo** | 234,972 | |
| **IEEE Dataport** | 1,170 | |
| **Open Data Lab** | 6,432 | |
| **Roboflow Universe** | 200,000+ | |

## Installation

Clone the repository:

```bash
git clone https://github.com/yourusername/coldata.git
cd coldata
```

Install dependencies:

```bash
pip install -r requirements.txt
```

## Configuration

The system can be customized via the `config.yml` file, where you can configure hyperparameters for both MongoDB and MilvuDB.

## Quick Start

1. **Start MongoDB**:

Run `start_mongo.sh` to start a local MongoDB instance.

```bash
./start_mongo.sh
```

2. **Start Milvus DB**:

Run `manage_milvus.sh` to start the Milvus vector database.

```bash
./manage_milvus.sh
```

3. **Run Scheduler**:

Set up the scheduler to crawl datasets at a specified interval by running:

```bash
python scheduler.py
```

4. **Demo Interface**:

To quickly test the dataset search, you can use the Gradio-based demo:

```bash
python demo.py
```

## How to Contribute

We welcome contributions to improve **coldata**. If you have ideas for new features, bug fixes, or dataset sources to include, please feel free to open an issue or submit a pull request.

## License

This project is licensed under the MIT License.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/collaborative-ai/coldata

Awesome Lists containing this project

README