# AI Filesystem

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QdXPchTDnzW6I_3HTZFpSeak_XoH81v5?usp=sharing)

Local semantic search over folders. Why didn't this exist?

```shell
pip install aifs
pip install "unstructured[all-docs]" # If you want to parse all doc types. Includes large packages!
```

```python
from aifs import search

search("How does AI Filesystem work?", path="/path/to/folder")
search("It's not unlike how Spotlight works.") # Path defaults to CWD
```

# How it works


![aifs](https://github.com/KillianLucas/aifs/assets/63927363/c61599a9-aad8-483d-b6a4-3671629cd5f4)

Running `aifs.search` will chunk and embed every supported file (`.txt`, `.py`, `.sh`, `.docx`, `.pptx`, `.jpg`, `.png`, `.eml`, `.html`, and `.pdf`) nested under `path`, then store the embeddings in an `_.aifs` file in `path`.
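The file-gathering step described above can be sketched as follows. This is a simplified illustration, not `aifs`'s actual implementation; the function name `supported_files` is hypothetical:

```python
from pathlib import Path

# File types aifs can parse, per the list above.
SUPPORTED = {".txt", ".py", ".sh", ".docx", ".pptx",
             ".jpg", ".png", ".eml", ".html", ".pdf"}

def supported_files(path):
    """Recursively collect files whose extension aifs knows how to parse."""
    return sorted(
        p for p in Path(path).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```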

By storing the index, you only have to chunk/embed once. This makes semantic search **very** fast after the first time you search a path.

If a file has changed or been added, `aifs.search` will update or add those chunks. We still need to handle file deletions (we welcome PRs).
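One way to detect changed or added files is to store a content hash per file alongside the index and compare on the next run. This is a sketch of that idea, not necessarily how `aifs` does it; `files_to_reindex` and the `index` mapping are hypothetical:

```python
import hashlib
from pathlib import Path

def file_hash(path):
    """Hash a file's bytes so we can tell whether it changed since indexing."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def files_to_reindex(paths, index):
    """Return files that are new or whose contents changed since `index`
    (a mapping of path -> hash) was built, updating the index as we go.
    Deleted files are not handled, mirroring the README's note."""
    stale = []
    for p in paths:
        h = file_hash(p)
        if index.get(str(p)) != h:
            stale.append(p)
            index[str(p)] = h
    return stale
```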

### In detail:

1. If a folder hasn't been indexed, we first use [`unstructured`](https://github.com/Unstructured-IO/unstructured/tree/main) to parse and chunk every file in the `path`.
2. Then we use [`chroma`](https://github.com/chroma-core/chroma) to embed the chunks locally and save them to a `_.aifs` file in `path`.
3. Finally, `chroma` is used again to semantically search the embeddings.

If an `_.aifs` file _is_ found in a directory, `aifs` uses that index instead of re-indexing the directory. If some files have been updated since the index was built, only those are re-indexed.
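The three steps above can be illustrated with a toy version of the search stage. Here a bag-of-words vector stands in for `chroma`'s neural embeddings, purely to show the pipeline's shape (chunks in, ranked chunks out); none of this code is from `aifs` itself:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words vector. aifs uses chroma's
    embedding model instead; this only illustrates the idea."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def toy_search(query, chunks):
    """Rank pre-chunked text by similarity to the query (steps 2-3 above)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
```

With real embeddings, semantically related chunks rank highly even without exact word overlap; that is the property `chroma` provides over this toy version.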

# Goals

- We should always have SOTA parsing and chunking. The logic for this should be swapped out as new methods arise.
  - Chunking should be semantic — as in, `python` and `markdown` files should have _different_ chunking algorithms based on the expected content of those filetypes. Who has this solution?
  - For parsing, I think Unstructured is the best of the best. Is this true?
- We should always have SOTA embedding. If a better local embedding model is found, we should automatically download and use it.
  - I think Chroma will always do this (is this true?) so we depend on Chroma.
- This project should stay **minimally scoped** — we want `aifs` to be the best local semantic search in the universe.

# Why?

We built this to let [`open-interpreter`](https://openinterpreter.com/) quickly semantically search files/folders.