Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hazyresearch/meerkat
Creative interactive views of any dataset.
- Host: GitHub
- URL: https://github.com/hazyresearch/meerkat
- Owner: HazyResearch
- License: apache-2.0
- Created: 2021-05-07T00:26:35.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-02-25T18:15:08.000Z (9 months ago)
- Last Synced: 2024-11-08T14:05:05.749Z (8 days ago)
- Topics: data-science, foundation-models, machine-learning, ml, pandas
- Language: Python
- Homepage:
- Size: 66.5 MB
- Stars: 829
- Watchers: 15
- Forks: 43
- Open Issues: 11
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
README
---
[![GitHub](https://img.shields.io/github/license/HazyResearch/meerkat)](https://img.shields.io/github/license/HazyResearch/meerkat)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)

Create interactive views of any dataset.

[**Website**](http://meerkat.wiki)
| [**Quickstart**](http://meerkat.wiki/docs/start/quickstart-df.html)
| [**Docs**](http://meerkat.wiki/docs/index.html)
| [**Contributing**](CONTRIBUTING.md)
| [**Discord**](https://discord.gg/pw8E4Q26Tq)
| [**Blogpost**](https://hazyresearch.stanford.edu/blog/2023-03-01-meerkat)

## ⚡️ Quickstart
```bash
pip install meerkat-ml
```

**Next Steps**.
Check out our [Getting Started page](http://meerkat.wiki/docs/start/quickstart-df.html) and our [documentation](http://meerkat.wiki/docs/index.html) to start building with Meerkat.

## Why Meerkat?
Meerkat is an open-source Python library that helps users visualize, explore, and annotate any dataset. It is especially useful when processing unstructured data types (_e.g._ free text, PDFs, images, video) with machine learning models.
### ✏️ Features and Design Principles
Here are four principles that inform Meerkat's design.
**(1) Low overhead.** With four lines of Python, start interacting with any dataset.
- Zero-copy integrations with your preferred data abstractions: Pandas, Arrow, HF Datasets, Ibis, SQL.
- Limited data movement. With Meerkat, you interact with your data where it already lives: no uploads to external databases and no reformatting.

```python
import meerkat as mk
df = mk.from_csv("paintings.csv")
df["image"] = mk.files("image_url")
df
```
**(2) Diverse data types.** Visualize and annotate almost any data type in Meerkat interfaces: text, images, audio, video, MRI scans, PDFs, HTML, JSON.
**(3) "Intelligent" user interfaces.** Meerkat makes it easy to embed **machine learning models** (e.g. LLMs) within user interfaces to enable intelligent functionality such as searching, grouping and autocomplete.
```python
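# Embed the column with CLIP and store the result alongside the data.
df["embedding"] = mk.embed(df["img"], engine="clip")

# Component that scores rows against a query using the embeddings.
match = mk.gui.Match(
    df,
    against="embedding",
    engine="clip",
)

# Sort so that the best-matching rows come first.
sorted_df = mk.sort(
    df,
    by=match.criterion.name,
    ascending=False,
)

# Show the query box above a gallery of the sorted rows.
gallery = mk.gui.Gallery(sorted_df)
mk.gui.html.div([match, gallery])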
```
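
The idea behind ranking rows by how well they match a query can be illustrated without Meerkat or CLIP. The sketch below is plain Python and scores toy vectors with cosine similarity, which is one common choice; Meerkat's actual scoring may differ. The helper names `cosine_similarity` and `rank_by_similarity` and the three-dimensional embeddings are hypothetical, chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(rows, query_embedding):
    """Return rows sorted from most to least similar to the query."""
    return sorted(
        rows,
        key=lambda row: cosine_similarity(row["embedding"], query_embedding),
        reverse=True,
    )


# Toy data: hypothetical embeddings, not real CLIP outputs.
rows = [
    {"title": "sunflowers", "embedding": [0.9, 0.1, 0.0]},
    {"title": "starry night", "embedding": [0.1, 0.9, 0.2]},
    {"title": "water lilies", "embedding": [0.7, 0.3, 0.1]},
]
query = [1.0, 0.0, 0.0]

ranked = rank_by_similarity(rows, query)
print([row["title"] for row in ranked])
```

Sorting in descending order of score mirrors the `ascending=False` argument in the Meerkat snippet: the closest matches appear first.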
**(4) Declarative (think: Seaborn), but also infinitely customizable and composable.**
Meerkat visualization components can be composed and customized to create new interfaces.

```python
plot = mk.gui.plotly.Scatter(df=plot_df, x="umap_1", y="umap_2")

@mk.gui.reactive
def filter(selected: list, df: mk.DataFrame):
    return df[df.primary_key.isin(selected)]

filtered_df = filter(plot.selected, plot_df)
table = mk.gui.Table(filtered_df, classes="h-full")

mk.gui.html.flex([plot, table], classes="h-[600px]")
```
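
The general pattern in the example above, re-running a filter whenever a selection changes, can be sketched in plain Python. This is not Meerkat's reactivity implementation: `Store`, `subscribe`, and `update_filtered` are hypothetical names for a tiny observer pattern, and the rows are toy data.

```python
class Store:
    """Holds a value and notifies subscribers whenever it changes."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)
        callback(self._value)  # run once with the current value

    def set(self, value):
        self._value = value
        for callback in self._subscribers:
            callback(value)


# Toy rows keyed by a primary key (hypothetical data).
rows = [
    {"id": 1, "label": "cat"},
    {"id": 2, "label": "dog"},
    {"id": 3, "label": "bird"},
]

filtered = []


def update_filtered(selected):
    """Recompute the filtered rows for the current selection."""
    filtered[:] = [row for row in rows if row["id"] in selected]


selected = Store([])
selected.subscribe(update_filtered)  # nothing is selected yet

selected.set([1, 3])  # simulates selecting two points in a plot
print([row["label"] for row in filtered])
```

Each call to `selected.set` recomputes `filtered`, much as a table that depends on a plot's selection would refresh when the selection changes.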
### ✨ Use cases where Meerkat shines
- _Exploratory analysis over unstructured data types._ [Demo](https://www.youtube.com/watch?v=a8FBT33QACQ)
- _Spot-checking the behavior of large language models (e.g. GPT-3)._ [Demo](https://www.youtube.com/watch?v=3ItA70qoe-o)
- _Identifying systematic errors made by machine learning models._ [Demo](https://youtu.be/4Kk_LZbNWNs)
- _Rapid labeling of validation data._

### 🤔 Use cases where Meerkat may not be the right fit
- _Are you only working with structured data (e.g. numerical and categorical variables)?_ Popular data visualization libraries (_e.g._ [Seaborn](https://seaborn.pydata.org/), [Matplotlib](https://matplotlib.org/)) are often sufficient. If you're looking for interactivity, [Plotly](https://plotly.com/) and [Streamlit](https://streamlit.io/) work well with structured data. Meerkat is differentiated in how it visualizes unstructured data types: long-form text, PDFs, HTML, images, video, audio...
- _Are you trying to make a straightforward demo of a machine learning model (single input/output, chatbot) and share with the world?_ [Gradio](https://gradio.app/) is likely a better fit! Though, if your demo involves visualizing lots of data, you may find Meerkat useful.
- _Are you trying to manually label tens of thousands of data points?_ If you are looking for a data labeling tool to use with a labeling team, there are great open source labeling solutions designed for this (_e.g._ [LabelStudio](https://labelstud.io/)). In contrast, Meerkat is a great fit for teams and individuals who lack access to a large labeling workforce, who are using pretrained models (_e.g._ GPT-3), and who need to label validation data or in-context examples.

## ✉️ About
Meerkat is being built by Machine Learning PhD students in the [Hazy Research](https://hazyresearch.stanford.edu) lab at Stanford. We're excited to build for a future where models will make it easier for teams to sift and reason through large volumes of unstructured data.
Please reach out to `kgoel [at] cs [dot] stanford [dot] edu, eyuboglu [at] stanford [dot] edu, and arjundd [at] stanford [dot] edu` if you would like to use Meerkat for a project or at your company, or if you have any questions.