An open API service indexing awesome lists of open source software.

https://github.com/pangeacyber/pangea-multipass

Pangea Multipass is the authorization checker for systems like Google Workspace, Jira, and more.
https://github.com/pangeacyber/pangea-multipass

authorization llms rag

Last synced: 8 months ago
JSON representation

Pangea Multipass is the authorization checker for systems like Google Workspace, Jira, and more.

Awesome Lists containing this project

README

          

# Pangea Multipass: Your Authorization Helper

Pangea Multipass is a Python library for checking user access to upstream data sources.

In practice, you can use it to check if a specific user has access to a file in a Google Drive, a ticket in Jira, or a page in Confluence. In concept, we've built this library to be extensible to eventually support Slack channels, GitHub repositories, Salesforce opportunities, and more.

We originally built this to support our customers' Retrieval-Augmented Generation (RAG) applications to mitigate data leaks. In a RAG architecture, the application inserts additional context at inference time. If you don't check the user's authorization to that context, you could inadvertently leak sensitive information.

While this is useful in AI/LLM apps, we've abstracted this to work independently so you can use it in any app.

Check out the `/examples` folder for AI-specific and generic examples.

## Features

- **Document Reading**: Supports document content extraction for use in processing and enrichment.
- **Metadata Enrichment**: Includes enrichers for hashing, constant value setting, and custom metadata.
- **Metadata Filtering**: Provides flexible operators to filter document metadata for customized queries.
- **Authorization Processing**: Manages authorized and unauthorized nodes with customizable node processors.
- **Extensible**: Built on abstract base classes, allowing easy extension and customization of functionality.

## Installation

To install `pangea-multipass`, you can use [Poetry](https://python-poetry.org/) for dependency management:

```bash
poetry add pangea-multipass
```

There are full runnable demos in the `pangea_multipass_lib\examples` directory but here are the key aspects.

Using a set of Google Drive credentials - following the steps in the llama_index_examples folder - you initialize the data source:

```python
gdrive_reader = GoogleDriveReader(
folder_id=gdrive_fid, token_path=admin_token_filepath, credentials_path=credentials_filepath
)
documents = gdrive_reader.load_data(folder_id=gdrive_fid)
```

This gives you a list of files. You can then use the processors to filter into the authorized and unauthorized resource lists:

```python
gdrive_processor = LlamaIndexGDriveProcessor(creds)
node_processor = NodePostprocessorMixer([gdrive_processor])

authorized_docs = node_processor.postprocess_nodes(documents)
unauthorized_docs = node_processor.get_unauthorized_nodes()
```

In general, the authorized list will be more important but you may notify an admin or log if a user is attempting to access a folder where they have limited access. It could be an attempt at data theft or their permissions are incomplete.

## Roadmap

At release, this library supports Google Workspace, Confluence, and Jira. For adding systems, our top priorities are:

- Box
- Dropbox
- Office 365

Others we plan to support or are looking for contributions are:

- Zoom
- Salesforce
- GitLab
- Zendesk
- Notion
- Sharepoint
- Asana
- Hubspot

Check out `EXTENDING.md` for the specific structure and requirements for extending Pangea Multipass for your data sources. Pull requests are welcome.