# voice-fn - Real-time Voice AI Pipeline Framework

`voice-fn` is a Clojure framework for building real-time voice AI applications using a data-driven, functional approach. Built on top of `clojure.core.async.flow`, it provides a composable pipeline architecture for processing audio, text, and AI interactions with built-in support for major AI providers.

This project's status is **_experimental_**. Expect breaking changes.

[![Watch the video](https://img.youtube.com/vi/HwoGMhIx5w0/0.jpg)](https://youtu.be/HwoGMhIx5w0?t=345)

## Core Features

- **Flow-Based Architecture:** Built on `core.async.flow` for robust concurrent processing
- **Data-First Design:** Define AI pipelines as data structures for easy configuration and modification
- **Streaming Architecture:** Efficient real-time audio and text processing
- **Extensible Processors:** Simple protocol-based system for adding new processing components
- **Flexible Frame System:** Type-safe message passing between pipeline components
- **Built-in Services:** Ready-to-use integrations with major AI providers

## Quick Start: Local example

First, create a `resources/secrets.edn`:

```edn
{:deepgram {:api-key ""}
 :elevenlabs {:api-key ""
              :voice-id ""}
 :groq {:api-key ""}
 :openai {:new-api-sk ""}}
```

Obtain the API keys from the respective providers and fill in the blank values.
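
The `secret` helper used in the example below reads values from this file by key path. A minimal sketch of how such a lookup could work (the real `voice-fn.secrets` implementation may differ):

```clojure
(ns my-app.secrets
  (:require
   [clojure.edn :as edn]
   [clojure.java.io :as io]))

(defn secret
  "Look up a secret from resources/secrets.edn by key path,
  e.g. (secret [:deepgram :api-key])."
  [path]
  (-> (io/resource "secrets.edn")
      slurp
      edn/read-string
      (get-in path)))
```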

Start a REPL and evaluate the snippets in the `(comment ...)` blocks to start the flows.
Allow microphone access when prompted.

```clojure
(ns voice-fn-examples.local
  (:require
   [clojure.core.async :as a]
   [clojure.core.async.flow :as flow]
   [taoensso.telemere :as t]
   [voice-fn.processors.deepgram :as asr]
   [voice-fn.processors.elevenlabs :as tts]
   [voice-fn.processors.llm-context-aggregator :as context]
   [voice-fn.processors.openai :as llm]
   [voice-fn.secrets :refer [secret]]
   [voice-fn.transport :as transport]
   [voice-fn.utils.core :as u]))

(defn make-local-flow
  "This example showcases a voice AI agent for the local computer. Audio is
  usually encoded as PCM at a 16kHz sample rate, mono (1 channel)."
  ([] (make-local-flow {}))
  ([{:keys [llm-context extra-procs extra-conns encoding debug?
            sample-rate language sample-size-bits channels chunk-duration-ms]
     :or {llm-context {:messages [{:role "system"
                                   :content "You are a helpful assistant"}]}
          encoding :pcm-signed
          sample-rate 16000
          sample-size-bits 16
          channels 1
          chunk-duration-ms 20
          language :en
          debug? false
          extra-procs {}
          extra-conns []}}]
   (flow/create-flow
    {:procs
     (u/deep-merge
      {;; Capture audio from the microphone and send raw-audio-input frames further down the pipeline
       :transport-in {:proc transport/microphone-transport-in
                      :args {:audio-in/sample-rate sample-rate
                             :audio-in/channels channels
                             :audio-in/sample-size-bits sample-size-bits}}

       ;; raw-audio-input -> transcription frames
       :transcriptor {:proc asr/deepgram-processor
                      :args {:transcription/api-key (secret [:deepgram :api-key])
                             :transcription/interim-results? true
                             :transcription/punctuate? false
                             :transcription/vad-events? true
                             :transcription/smart-format? true
                             :transcription/model :nova-2
                             :transcription/utterance-end-ms 1000
                             :transcription/language language
                             :transcription/encoding encoding
                             :transcription/sample-rate sample-rate}}

       ;; user transcription & llm message frames -> llm-context frames;
       ;; responsible for keeping the full conversation history
       :context-aggregator {:proc context/context-aggregator
                            :args {:llm/context llm-context
                                   :aggregator/debug? debug?}}

       ;; Takes llm-context frames and produces new llm-text-chunk & llm-tool-call-chunk frames
       :llm {:proc llm/openai-llm-process
             :args {:openai/api-key (secret [:openai :new-api-sk])
                    :llm/model "gpt-4o-mini"}}

       ;; llm-text-chunk & llm-tool-call-chunk -> llm-context-messages-append frames
       :assistant-context-assembler {:proc context/assistant-context-assembler
                                     :args {:debug? debug?}}

       ;; llm-text-chunk -> sentence speak frames (faster for text-to-speech)
       :llm-sentence-assembler {:proc context/llm-sentence-assembler}

       ;; speak frames -> audio-output-raw frames
       :tts {:proc tts/elevenlabs-tts-process
             :args {:elevenlabs/api-key (secret [:elevenlabs :api-key])
                    :elevenlabs/model-id "eleven_flash_v2_5"
                    :elevenlabs/voice-id (secret [:elevenlabs :voice-id])
                    :voice/stability 0.5
                    :voice/similarity-boost 0.8
                    :voice/use-speaker-boost? true
                    :flow/language language
                    :audio.out/encoding encoding
                    :audio.out/sample-rate sample-rate}}

       ;; audio-output-raw -> smaller audio-output-raw frames (used for sending audio in real time)
       :audio-splitter {:proc transport/audio-splitter
                        :args {:audio.out/sample-rate sample-rate
                               :audio.out/sample-size-bits sample-size-bits
                               :audio.out/channels channels
                               :audio.out/duration-ms chunk-duration-ms}}

       ;; speakers out
       :transport-out {:proc transport/realtime-speakers-out-processor
                       :args {:audio.out/sample-rate sample-rate
                              :audio.out/sample-size-bits sample-size-bits
                              :audio.out/channels channels
                              :audio.out/duration-ms chunk-duration-ms}}}
      extra-procs)
     :conns (concat
             [[[:transport-in :out] [:transcriptor :in]]

              [[:transcriptor :out] [:context-aggregator :in]]
              [[:context-aggregator :out] [:llm :in]]

              ;; Aggregate full context
              [[:llm :out] [:assistant-context-assembler :in]]
              [[:assistant-context-assembler :out] [:context-aggregator :in]]

              ;; Assemble sentence by sentence for fast speech
              [[:llm :out] [:llm-sentence-assembler :in]]
              [[:llm-sentence-assembler :out] [:tts :in]]

              [[:tts :out] [:audio-splitter :in]]
              [[:audio-splitter :out] [:transport-out :in]]]
             extra-conns)})))

(def local-ai (make-local-flow))

(comment

  ;; Start the local ai flow - it starts paused
  (let [{:keys [report-chan error-chan]} (flow/start local-ai)]
    (a/go-loop []
      (when-let [[msg c] (a/alts! [report-chan error-chan])]
        (when (map? msg)
          (t/log! {:level :debug :id (if (= c error-chan) :error :report)} msg))
        (recur))))

  ;; Resume the local ai -> you can now speak with the AI
  (flow/resume local-ai)

  ;; Stop the conversation
  (flow/stop local-ai)

  ,)
```

Which roughly translates to:

![Flow Diagram](./resources/flow.png)

See the [examples directory](./examples/src/voice_fn_examples/) for more usage examples.

## Supported Providers

### Text-to-Speech (TTS)

- **ElevenLabs**
- Models: `eleven_multilingual_v2`, `eleven_turbo_v2`, `eleven_flash_v2` and more.
- Features: Real-time streaming, multiple voices, multilingual support

### Speech-to-Text (STT)

- **Deepgram**
- Models: `nova-2`, `nova-2-general`, `nova-2-meeting` and more.
- Features: Real-time transcription, punctuation, smart formatting

### Text Based Large Language Models (LLM)

- **OpenAI**
- Models: `gpt-4o-mini` (fastest, cheapest), `gpt-4`, `gpt-3.5-turbo` and more
- Features: Function calling, streaming responses
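
Because `make-local-flow` deep-merges `extra-procs` over the defaults, provider settings such as the model can be overridden without redefining the pipeline. A sketch, assuming the Quick Start setup above:

```clojure
;; Swap the default gpt-4o-mini for gpt-4 by overriding just the :llm args
(def gpt4-ai
  (make-local-flow
   {:extra-procs {:llm {:args {:llm/model "gpt-4"}}}}))
```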

## Key Concepts

### Flows

The core building block of voice-fn pipelines:

- Composed of processes connected by channels
- Processes can be:
  - Input/output handlers
  - AI service integrations
  - Data transformers
- Managed by `core.async.flow` for lifecycle control
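
For illustration (a toy text pipeline, not the real voice-fn processors), a flow is declared entirely as data: processes under `:procs`, channel wiring under `:conns`:

```clojure
(require '[clojure.core.async.flow :as flow]
         '[clojure.string :as str])

(defn map-proc
  "Build a trivial process that applies f to every incoming message."
  [f]
  (flow/process
   {:describe (fn [] {:ins {:in "Input"} :outs {:out "Output"}})
    :transform (fn [state _in msg] [state {:out [(f msg)]}])}))

(def toy-flow
  (flow/create-flow
   {:procs {:upcase {:proc (map-proc str/upper-case)}
            :exclaim {:proc (map-proc #(str % "!"))}}
    :conns [[[:upcase :out] [:exclaim :in]]]}))

;; After (flow/start toy-flow) and (flow/resume toy-flow), messages can be
;; fed in with (flow/inject toy-flow [:upcase :in] ["hello"]).
```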

### Transport

The modality through which audio enters and leaves the voice AI pipeline. Example transport modalities:

- local (microphone + speakers)
- telephony (Twilio through WebSocket)
- WebRTC (browser support) - TODO
- async (through in & out core.async channels)

You will see processors named `:transport-in` & `:transport-out`. Since they are ordinary graph nodes, they can be tapped or swapped like any other process, as the sketch below shows.
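
A sketch that logs every frame leaving `:transport-in`, using the `extra-procs`/`extra-conns` options of `make-local-flow` from the Quick Start:

```clojure
(def logged-ai
  (make-local-flow
   {:extra-procs {:logger {:proc (flow/process
                                  {:describe (fn [] {:ins {:in "Frames to log"}})
                                   :transform (fn [state _in frame]
                                                (println "frame:" frame)
                                                [state nil])})}}
    :extra-conns [[[:transport-in :out] [:logger :in]]]}))
```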

### Frames

The basic unit of data flow, representing typed messages like:

- `:audio/input-raw` - Raw audio data
- `:transcription/result` - Transcribed text
- `:llm/text-chunk` - LLM response chunks
- `:system/start`, `:system/stop` - Control signals

Each frame has a type and optionally a schema for the data contained in it.

See [frame.clj](./src/voice_fn/frame.clj) for all possible frames.
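
As an illustration only (see `frame.clj` for the concrete representation), a typed frame can be pictured as a small map whose shape a Malli schema pins down; the keys below are hypothetical:

```clojure
(require '[malli.core :as m])

;; Hypothetical frame shape: a type tag plus schema-checked data
(def TranscriptionFrame
  [:map
   [:frame/type [:= :transcription/result]]
   [:frame/data :string]])

(m/validate TranscriptionFrame
            {:frame/type :transcription/result
             :frame/data "Hello there"})
;; => true
```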

### Processes

Components that transform frames:

- Define input/output requirements
- Can maintain state
- Use core.async for async processing
- Are created with the `flow/process` function

## Adding Custom Processes

```clojure
(defn custom-processor []
  (flow/process
   {:describe (fn [] {:ins {:in "Input channel"}
                      :outs {:out "Output channel"}})
    :init identity
    ;; `process-message` stands in for your own frame-handling logic
    :transform (fn [state _in msg]
                 [state {:out [(process-message msg)]}])}))
```

Read the core.async.flow docs for more information about flow processes.
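
A custom process plugs into a flow like any built-in one; for example, hanging it off the transcription stream via the Quick Start's `extra-procs`/`extra-conns` options:

```clojure
(def ai-with-custom-proc
  (make-local-flow
   {:extra-procs {:custom {:proc (custom-processor)}}
    :extra-conns [[[:transcriptor :out] [:custom :in]]]}))
```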

## Built With

- [core.async](https://github.com/clojure/core.async) - Concurrent processing
- [core.async.flow](https://github.com/clojure/core.async) - Flow control
- [Hato](https://github.com/gnarroway/hato) - WebSocket support
- [Malli](https://github.com/metosin/malli) - Schema validation

## Acknowledgements

Voice-fn takes heavy inspiration from [pipecat](https://github.com/pipecat-ai/pipecat). Differences:

- voice-fn uses a graph instead of a bidirectional queue for frame transport
- voice-fn has a data-centric implementation: the processors in voice-fn are pure functions in the `core.async.flow` transform syntax

## License

MIT