# Dataset Viewer MCP Server

An MCP server for interacting with the [Hugging Face Dataset Viewer API](https://huggingface.co/docs/dataset-viewer), providing capabilities to browse and analyze datasets hosted on the Hugging Face Hub.

## Features

### Resources

- Uses `dataset://` URI scheme for accessing Hugging Face datasets
- Supports dataset configurations and splits
- Provides paginated access to dataset contents
- Handles authentication for private datasets
- Supports searching and filtering dataset contents
- Provides dataset statistics and analysis
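
The exact layout of a `dataset://` URI is not spelled out in this README. As a minimal sketch, assuming a hypothetical form `dataset://<owner>/<name>?config=...&split=...`, a client could split such a URI into tool parameters like this. The function name `parse_dataset_uri` and the query parameter names are illustrative assumptions, not part of the server.

```python
from urllib.parse import parse_qs, urlsplit


def parse_dataset_uri(uri: str) -> dict[str, str]:
    """Split a dataset:// URI into a dataset id plus optional query parameters.

    Assumed (hypothetical) layout: dataset://<owner>/<name>?config=...&split=...
    """
    parts = urlsplit(uri)
    if parts.scheme != "dataset":
        raise ValueError(f"expected a dataset:// URI, got {uri!r}")
    # urlsplit puts the owner in netloc and the name in path; rejoin them.
    dataset_id = (parts.netloc + parts.path).strip("/")
    if not dataset_id:
        raise ValueError("URI does not name a dataset")
    # Keep only the first value of each query parameter.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"dataset": dataset_id, **params}
```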

### Tools

The server provides the following tools:

1. **validate**
   - Check if a dataset exists and is accessible
   - Parameters:
     - `dataset`: Dataset identifier (e.g. 'stanfordnlp/imdb')
     - `auth_token` (optional): For private datasets

2. **get_info**
   - Get detailed information about a dataset
   - Parameters:
     - `dataset`: Dataset identifier
     - `auth_token` (optional): For private datasets

3. **get_rows**
   - Get paginated contents of a dataset
   - Parameters:
     - `dataset`: Dataset identifier
     - `config`: Configuration name
     - `split`: Split name
     - `page` (optional): Page number (0-based)
     - `auth_token` (optional): For private datasets

4. **get_first_rows**
   - Get first rows from a dataset split
   - Parameters:
     - `dataset`: Dataset identifier
     - `config`: Configuration name
     - `split`: Split name
     - `auth_token` (optional): For private datasets

5. **get_statistics**
   - Get statistics about a dataset split
   - Parameters:
     - `dataset`: Dataset identifier
     - `config`: Configuration name
     - `split`: Split name
     - `auth_token` (optional): For private datasets

6. **search_dataset**
   - Search for text within a dataset
   - Parameters:
     - `dataset`: Dataset identifier
     - `config`: Configuration name
     - `split`: Split name
     - `query`: Text to search for
     - `auth_token` (optional): For private datasets

7. **filter**
   - Filter rows using SQL-like conditions
   - Parameters:
     - `dataset`: Dataset identifier
     - `config`: Configuration name
     - `split`: Split name
     - `where`: SQL WHERE clause (e.g. "score > 0.5")
     - `orderby` (optional): SQL ORDER BY clause
     - `page` (optional): Page number (0-based)
     - `auth_token` (optional): For private datasets

8. **get_parquet**
   - Download entire dataset in Parquet format
   - Parameters:
     - `dataset`: Dataset identifier
     - `auth_token` (optional): For private datasets

## Installation

### Prerequisites

- Python 3.12 or higher
- [uv](https://github.com/astral-sh/uv) - Fast Python package installer and resolver

### Setup

1. Clone the repository:
```bash
git clone https://github.com/privetin/dataset-viewer.git
cd dataset-viewer
```

2. Create a virtual environment and install:
```bash
# Create virtual environment
uv venv

# Activate virtual environment
# On Unix:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate

# Install in development (editable) mode
uv pip install -e .
```

## Configuration

### Environment Variables

- `HUGGINGFACE_TOKEN`: Your Hugging Face API token for accessing private datasets

### Claude Desktop Integration

Add the following to your Claude Desktop config file:

On Windows: `%APPDATA%\Claude\claude_desktop_config.json`

On macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`

```json
{
  "mcpServers": {
    "dataset-viewer": {
      "command": "uv",
      "args": [
        "run",
        "dataset-viewer"
      ]
    }
  }
}
```
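
If the config file already lists other MCP servers, pasting the snippet over it would drop them. The sketch below merges a single entry into an existing file instead. The helper name `register_server` is hypothetical; only the `mcpServers` layout comes from the snippet above.

```python
import json
from pathlib import Path


def register_server(config_path: Path, name: str, command: str, args: list[str]) -> dict:
    """Add or update one entry under "mcpServers" and write the file back."""
    # Start from the existing config so other servers and settings are preserved.
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    config.setdefault("mcpServers", {})[name] = {"command": command, "args": args}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

On macOS this could be called as `register_server(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json", "dataset-viewer", "uv", ["run", "dataset-viewer"])`. If Claude Desktop does not start `uv` from the cloned repository, the arguments may also need `--directory <path-to-clone>` ahead of `run`.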

## Usage Examples

1. Validate a dataset:
```json
{
  "dataset": "stanfordnlp/imdb"
}
```

2. Get dataset information:
```json
{
  "dataset": "stanfordnlp/imdb"
}
```

3. Search dataset contents:
```json
{
  "dataset": "stanfordnlp/imdb",
  "config": "plain_text",
  "split": "train",
  "query": "great movie"
}
```

4. Filter and sort rows:
```json
{
  "dataset": "stanfordnlp/imdb",
  "config": "plain_text",
  "split": "train",
  "where": "label = 1",
  "orderby": "text DESC",
  "page": 0
}
```

5. Get dataset statistics:
```json
{
  "dataset": "stanfordnlp/imdb",
  "config": "plain_text",
  "split": "train"
}
```
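
The JSON objects above are tool arguments. On the wire, an MCP client wraps them in a JSON-RPC 2.0 `tools/call` request. The sketch below builds that envelope by hand to show where the arguments end up; the helper name `tool_call` and the id counter are illustrative, and a real client library would normally handle this for you.

```python
import json
from itertools import count

# Simple source of increasing JSON-RPC request ids.
_request_ids = count(1)


def tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request for the given tool and arguments."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)
```

For example, `tool_call("search_dataset", {"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "train", "query": "great movie"})` produces the request for the third example.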

## License

MIT License - see [LICENSE](LICENSE) for details