https://github.com/chanzuckerberg/alhazen

AI agents + toolkits for scientific knowledge
landscape-analysis large-language-models scientific-knowledge-engineering


# Home - Alhazen

Alhazen is an AI agent that helps human users understand what is already
known. Alhazen uses tools that help you build a local **digital
library** of papers, webpages, database records, etc. by querying online
sources and downloading material to your hard drive. Alhazen also uses a
range of tools to analyze the contents of its digital library and answer
meaningful questions for human users. It is primarily built with the
excellent [`LangChain`](https://langchain.com/) system, providing tools
that use web robots, generative AI, deep learning, ontologies and other
knowledge-based technologies. It can be run using locally running
dashboards, Jupyter notebooks, or command-line function calls.

The goals of this work are threefold:

1. To provide a pragmatic AI tool that helps read, summarize, and
synthesize available scientific knowledge.
2. To provide a *platform* for development of AI tools in the
community.
3. To actively develop working systems for high-value tasks within the
Chan Zuckerberg Initiative’s programs and partnerships.

The system uses available tools within the rapidly expanding ecosystem
of generative AI models. This includes state-of-the-art commercial APIs
(such as OpenAI, Gemini, Mistral, etc.) as well as open models that can
be run locally (such as Llama-2, Mixtral, Smaug, Olmo, etc.).

To use local models, we recommend running Alhazen on a large, high-end
machine such as an Apple MacBook with an M2 chip and 48+ GB of memory.

## Caution + Caveats

- This toolkit provides functionality to use agents to download
information from the web. Users and developers should take care to
abide by data licensing requirements and third-party websites’ terms
and conditions, and to ensure that they don’t otherwise infringe upon
third-party privacy or intellectual property rights.
- All data generated by Large Language Models (LLMs) should be reviewed
for accuracy.

## Installation

### Docker

The preferred method to run Alhazen is through
[Docker](https://www.docker.com/).

For Alhazen to function correctly, set the following environment
variables in the shell from which you call Docker:

**MANDATORY**

- LOCAL_FILE_PATH - the directory where the system will store full-text
files.

**OPTIONAL**

- OPENAI_API_KEY - if you are using OpenAI large language models.
- DATABRICKS_API_KEY - if you are using the Databricks AI Playground
endpoint as an LLM server.
- GROQ_API_KEY - if you are calling LLMs on groq.com
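
Before starting the containers, it can help to confirm which of these variables are visible in your shell. The following is a minimal sketch, not part of Alhazen: the `check_environment` helper is a hypothetical name, and it only tests whether each variable listed above is set to a non-empty value.

```python
import os

# Variable names are taken from the lists above.
REQUIRED = ["LOCAL_FILE_PATH"]
OPTIONAL = ["OPENAI_API_KEY", "DATABRICKS_API_KEY", "GROQ_API_KEY"]


def check_environment(env=None):
    """Return (missing_required, present_optional) for a mapping of variables.

    Hypothetical helper for illustration; defaults to the current process
    environment when no mapping is passed in.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    present = [name for name in OPTIONAL if env.get(name)]
    return missing, present
```

Calling `check_environment()` with no arguments inspects the current shell environment. A non-empty first element means a mandatory variable still needs to be exported.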

#### Quickstart

To run the system out of the box, run these commands:

``` bash
$ git clone https://github.com/chanzuckerberg/alhazen
$ cd alhazen
$ docker compose build
$ docker compose up
```

This should generate output that includes a link formatted like this
one: http://127.0.0.1:8888/lab?token=LONG-ALPHANUMERIC-STRING.

Open a browser to that location and you should get access to a
JupyterLab session that provides access to all notebooks in the repo.

Browse to `nbs/tutorials/CryoET_Tutorial.ipynb` to access a walkthrough
of an analysis over papers involving CryoET as a demonstration.

#### Run Huridocs for PDF extraction

To run the system with support from the
[Huridocs](https://github.com/huridocs/pdf_paragraphs_extraction) PDF
extraction system (needed for processing full text articles), you must
first run the docker container for that system:

``` bash
$ git clone https://github.com/huridocs/pdf_paragraphs_extraction
$ cd pdf_paragraphs_extraction
$ docker compose build
$ docker compose up
```

Then repeat the quickstart steps as before, but start Alhazen with the
Huridocs compose file:

``` bash
$ cd ..
$ git clone https://github.com/chanzuckerberg/alhazen
$ cd alhazen
$ docker compose build
$ docker compose -f docker-compose-huridocs.yml up
```

### Install dependencies

#### Postgresql

Alhazen requires
[postgresql@14](https://www.postgresql.org/download/macosx/) to run.
Homebrew provides an installer:

``` bash
$ brew install postgresql@14
```

which can be run as a service:

``` bash
$ brew services start postgresql@14
$ brew services list
```

If you install PostgreSQL via Homebrew, you will need to create a
`postgres` superuser to run the `psql` command:

    $ createuser -s postgres

Note that the [`Postgres.app`](https://postgresapp.com/) system also
provides a nice GUI interface for Postgres but installing the
[`pgvector`](https://github.com/pgvector/pgvector) package is a little
more involved.
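
Once the service is started, a quick way to confirm that something is listening on PostgreSQL's default port (5432) is a plain TCP connection test. This is a minimal sketch using only the standard library. The `port_is_open` helper is a hypothetical name, and the check does not verify that the listener is actually PostgreSQL or that `pgvector` is installed.

```python
import socket


def port_is_open(host="127.0.0.1", port=5432, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper: 5432 is PostgreSQL's default port, but this only
    shows that some process accepts connections there.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

If `port_is_open()` returns `False`, check `brew services list` to see whether the `postgresql@14` service is running.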

#### Ollama

The tool uses the [Ollama](https://ollama.ai/) library to execute large
language models locally on your machine. Note that to be able to run the
best-performing models on an Apple Mac with an M1 or M2 chip, you will
need at least 48 GB of memory.
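
To see which models a local Ollama server has already pulled, you can query its HTTP API. This is a minimal sketch that assumes Ollama is running on its default address (`http://localhost:11434`) and uses its `/api/tags` endpoint. The helper names are hypothetical, and this is not how Alhazen itself talks to Ollama.

```python
import json
import urllib.request


def model_names(payload):
    """Extract model names from an /api/tags-style response dictionary."""
    return [model["name"] for model in payload.get("models", [])]


def list_local_models(base_url="http://localhost:11434"):
    """Ask a running Ollama server which models are available locally.

    Assumes the default Ollama address; raises urllib.error.URLError if
    no server is reachable.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

An empty list means the server is up but no models have been pulled yet.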

#### Huridocs

We use a PDF document text extraction and classification system called
[Huridocs](https://huridocs.org/). In particular, our PDF processing
requires a docker image of their [PDF Paragraphs
Extraction](https://github.com/huridocs/pdf_paragraphs_extraction)
system. To run this, perform the following steps:

1. `git clone https://github.com/huridocs/pdf_paragraphs_extraction`
2. `cd pdf_paragraphs_extraction`
3. `docker-compose up`

### Install Alhazen source code

``` bash
git clone https://github.com/chanzuckerberg/alhazen
conda create -n alhazen python=3.11
conda activate alhazen
cd alhazen
pip install -e .
```

## How to use

We provide a number of low-level interfaces to work with Alhazen.

### Notebooks

We have developed numerous worked examples of corpora that can be
generated by running queries on public sources and then processing the
results with LLM-enabled workflows. See the
[`nbs/cookbook`](https://github.com/chanzuckerberg/alhazen/tree/main/nbs/cookbook)
subdirectory for examples.

### Marimo Dashboards

We provide dashboards using [Marimo notebooks](https://marimo.io/).
These provide runnable, ‘reactive notebooks’ (similar to the excellent
ObservableHQ system but implemented in Python). They provide lightweight
dashboards and data visualization.

For a dashboard that shows the contents of all active databases on the
current machine, run:

    marimo run nb/marimo/002_corpora_map.py

### Applications

We use simple Python modules to run applications, each exposed through
a modular command line interface (CLI). For example, to launch a simple
Gradio chatbot for interacting with Alhazen, use the following command
structure:

``` bash
python -m alhazen.apps.chat --loc --db_name
```

## Environment Variables

The following environment variables will need to be set:

- LOCAL_FILE_PATH = /path/to/local/directory/for/full/text/files/

To use other commercial services, you should also set the appropriate
environment variables to gain access. Examples include:

- OPENAI_API_KEY
- DB_API_KEY
- VERTEXAI_PROJECT_NAME
- NCBI_API_KEY

## Code Status and Capabilities

This project is still at a very early stage, but we are attempting to
provide access to its full range of capabilities as we develop them.

The system is built using the excellent [nbdev](https://nbdev.fast.ai/)
environment. Jupyter notebooks in the `nbs` directory are processed
based on directive comments contained within notebook cells (see
[guide](https://nbdev.fast.ai/explanations/directives.html)) to generate
the source code of the library, as well as accompanying documentation.
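
As an illustration of how directives work, the sketch below shows two notebook cells written out as plain Python. `#| default_exp` and `#| export` are standard nbdev directives. The module name `example_utils` and the function `word_count` are hypothetical and do not come from the Alhazen codebase.

```python
# --- Cell 1 ---
# Names the library module that exported cells are written to.
#| default_exp example_utils

# --- Cell 2 ---
# The export directive (placed at the top of a cell in a real notebook)
# marks the cell's code for inclusion in the generated module.
#| export
def word_count(text: str) -> int:
    """Count whitespace-separated words in a string."""
    return len(text.split())
```

Cells without an export directive stay in the notebook as documentation, examples, or tests and are not copied into the library source.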

Examples of using the library to address the research and landscaping
questions specified in the [use cases](docnb1_use_cases.html) can be
found in the `nb_scratchpad/cookbook` subdirectory of this GitHub repo.

### Contributing

We warmly welcome contributions from the community! Please see our
[contributing
guide](https://github.com/chanzuckerberg/alhazen/blob/main/CONTRIBUTING.md)
and don’t hesitate to open an issue or send a pull request to improve
Alhazen.

This project adheres to the Contributor Covenant [code of
conduct](https://github.com/chanzuckerberg/.github/blob/master/CODE_OF_CONDUCT.md).
By participating, you are expected to uphold this code. Please report
unacceptable behavior to [email protected].

## Where does the Name ‘Alhazen’ come from?

One thousand years ago, Ḥasan Ibn al-Haytham (965-1039 AD) studied
optics through experimentation and observation. He advocated that a
hypothesis must be supported by experiments based on confirmable
procedures or mathematical reasoning, which made him an early pioneer of
the scientific method *five centuries* before Renaissance scientists
started following the same paradigm
([Website](https://www.ibnalhaytham.com/),
[Wikipedia](https://en.wikipedia.org/wiki/Ibn_al-Haytham), [Tbakhi &
Amir 2007](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6074172/)).

We use the Latinized form of his name (‘Alhazen’) to honor his
contribution (which goes largely unrecognized within non-Islamic
communities).

Famously, he was quoted as saying:

> The duty of the man who investigates the writings of scientists, if
> learning the truth is his goal, is to make himself an enemy of all
> that he reads, and, applying his mind to the core and margins of its
> content, attack it from every side. He should also suspect himself as
> he performs his critical examination of it, so that he may avoid
> falling into either prejudice or leniency.

Here, we seek to develop an AI capable of applying scientific knowledge
engineering to support CZI’s mission. We seek to honor Ibn al-Haytham’s
critical view of published knowledge by creating an AI-powered system
for scientific discovery.

Note: when describing our agent, we use non-gendered pronouns
(they/them/it).