{"id":22160350,"url":"https://github.com/rozek/node-red-flow-openai-api","last_synced_at":"2025-07-26T09:31:31.514Z","repository":{"id":184172438,"uuid":"671428489","full_name":"rozek/node-red-flow-openai-api","owner":"rozek","description":"Node-RED Flows for OpenAI API compatible endpoints calling llama.cpp","archived":false,"fork":false,"pushed_at":"2023-08-08T11:55:51.000Z","size":851,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-30T00:05:10.335Z","etag":null,"topics":["llama","llamacpp","node-red","node-red-flow","openai","openai-api"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rozek.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-27T09:44:35.000Z","updated_at":"2024-05-30T10:44:51.000Z","dependencies_parsed_at":"2024-08-03T07:59:15.875Z","dependency_job_id":null,"html_url":"https://github.com/rozek/node-red-flow-openai-api","commit_stats":{"total_commits":38,"total_committers":1,"mean_commits":38.0,"dds":0.0,"last_synced_commit":"e8290ba021a15d6728743399e008284a22c8375b"},"previous_names":["rozek/node-red-flow-openai-api"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rozek%2Fnode-red-flow-openai-api","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rozek%2Fnode-red-flow-openai-api/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rozek%2Fnode-red-flow-openai-api/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rozek%2Fnode-red-flow-openai-api/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rozek","download_url":"https://codeload.github.com/rozek/node-red-flow-openai-api/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227668795,"owners_count":17801513,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llama","llamacpp","node-red","node-red-flow","openai","openai-api"],"created_at":"2024-12-02T04:07:31.484Z","updated_at":"2024-12-02T04:07:32.139Z","avatar_url":"https://github.com/rozek.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# node-red-flow-openai-api #\n\nNode-RED Flows for OpenAI API compatible endpoints calling llama.cpp\n\nThis repository contains a few flows which implement a relevant subset of the OpenAI API in order to serve as a drop-in replacement for OpenAI in [LangChain](https://github.com/hwchase17/langchainjs) and similar tools.\n\n\u003cimg src=\"./Screenshot.png\" width=\"600\" alt=\"OpenAI API Flow\"\u003e\n\nSo far, it has been tested both with low level tools (like `curl`) and [Flowise](https://github.com/rozek/Flowise), the no-code environment for LangChain - if you build the author's own version instead of the [original Flowise](https://github.com/FlowiseAI/Flowise), you automatically also get the nodes needed to access the Node-RED server.\n\nThe actual \"heavy lifting\" is done by [llama.cpp](https://github.com/rozek/llama.cpp) - please note, that you will need the author's version instead of the [original llama.cpp](https://github.com/ggerganov/llama.cpp) in order to get an additional tokenizer tool which is used to compute the statistics appended to OpenAI API responses.\n\n\u003e Nota bene: these flows do not contain the actual model. You will have to download your own copy from [HuggingFace](https://huggingface.co/TheBloke/Llama-2-13B-GGML/blob/main/llama-2-13b.ggmlv3.q4_0.bin).\n\nRight now, the [13B variant of the LLaMA 2 LLM](https://huggingface.co/TheBloke/Llama-2-13B-GGML) has been hard-coded into the llama.cpp invocation and the parameter defaults were chosen with a view to achieving the best possible results.\n\n\u003e Just a small note: if you like this work and plan to use it, consider \"starring\" this repository (you will find the \"Star\" button on the top right of this page), so that I know which of my repositories to take most care of.\n\n## Installation ##\n\nStart by creating a subfolder called `ai` within the installation folder of your Node-RED server. This subfolder will later store the llama.cpp executables and the actual model. Using such a subfolder helps keeping the folder structure of your server clean if you decide to play with other AI models as well.\n\n### Building the Executable ###\n\nUnder the hood, the flows from this repository run native executables from [llama.cpp](https://github.com/rozek/llama.cpp). Simply follow the instructions found in section [Usage](https://github.com/rozek/llama.cpp#usage) of the llama.cpp docs to build these executables for your platform.\n\nAfterwards, rename \n\n* `main` to `llama`,\n* `tokenization` to `llama-tokens` and\n* `embedding` to `llama-embeddings`\n\nand copy these files only into the subfolder `ai` you created before.\n\n### Preparing the Model ###\n\nJust download the model from [HuggingFace](https://huggingface.co/TheBloke/Llama-2-13B-GGML/blob/main/llama-2-13b.ggmlv3.q4_0.bin) - it already has the proper format.\n\n\u003e Nota bene: right now, the model has been hard-coded into the flows - but this may easily be changed in the function sources\n\nAfterwards, move the file `llama-2-13b.ggmlv3.q4_0.bin` into the same subfolder `ai` where you already placed the llama.cpp executables.\n\n### Importing the Nodes ###\n\nThese flows use the Node-RED extension [node-red-reusable-flows](https://github.com/rozek/node-red-reusable-flows) (\"Reusable Flows\" which allow multiply needed flows to be defined once and then invoked from multiple places) which should therefore be installed first.\n\nFinally, open the Flow Editor of your Node-RED server and import the contents of [OpenAI-API-flow.json](./OpenAI-API-flow.json). After deploying your changes, you are ready to use the implemented endpoints.\n\n## Configuration ##\n\nThe node \"define common settings\" allows you to configure a few parameters which can not be passed along with a request. These are\n\n* **`API-Key`**\u003cbr\u003elike the original, these flows allow you to protect your endpoints against abuse - which may become relevant as soon as you open the port of your Node-RED server in your LAN. By default, the key is `sk-xxxx`, but you may configure anything else here which is then compared to its counterpart in any incoming request. If both differ, the request is rejected \n* **`number-of-threads`**\u003cbr\u003eallows you to configure the maximum number of threads used by llama.cpp executables. Internally, it is limited by the number of CPU cores your hardware provides\n* **`context-length`**\u003cbr\u003eallows you to configure the maximum length of the inference context (including prompt). LLaMA 2 has been trained with a context length of 4096 which is therefore set as default\n* **`number-of-batches`**\u003cbr\u003esee the [llama.cpp docs](https://github.com/rozek/llama.cpp/blob/master/examples/main/README.md#batch-size) for a description of the \"batch size\"\n* **`prompt-template`**\u003cbr\u003eallows you to define templates which will be used to convert the messages from a chat completion request into a single prompt for llama.cpp - see below for a more detailed description\n* **`count-tokens`**\u003cbr\u003emay either be set to `true` or `false` depending on whether you want the actual token numbers (or just dummies) to be used when calculating the statistics of every request - `true` is closer to the behaviour of OpenAI, but `false` saves you a bit of time\n* **`stop-sequence`**\u003cbr\u003edefines a default \"stop sequence\" (or \"reverse prompt\" as llama.cpp calls it) which helps to avoid unnecessarily generated tokens in a chat completion request\n\nIn principle, the predefined defaults should work quite well - although you may want to change the default `API-Key`\n\n### Prompt Template ###\n\nAn incoming chat completion request contains a list of \"system\", \"user\" and \"assistant\" messages which first have to be conveted into a single prompt which may then be passed to llama.cpp. The `prompt-template` object provides a template for each message type:\n\n```json\n{\n  \"system\":\"{input}\\n\",\n  \"user\":\"### Instruction: {input}\\n\",\n  \"assistant\":\"### Response: {input}\\n\",\n  \"suffix\":\"### Response:\"\n}\n```\n\nWhen constructing the actual prompt, all given messages will be converted and concatenated one after the other using the appropriate template for the message's `role` and replacing the `{input}` placeholder by the message's `content`. Finally, the `suffix` tempalte is appended and the result passed to llama.cpp.\n\nSurprisingly, the shown templates work much better than those recommended on [HuggingFace](https://huggingface.co/blog/llama2#how-to-prompt-llama-2), but your mileage may vary, of course. If you prefer the suggested prompts from the linked blog post, you may simply import an [alternative node](./common-settings.json) and replace the former \"define common settings\" node by the imported one.\n\n## Usage ##\n\nThe flows in this repository implement a small, but relevant subset of the OpenAI API - just enough to support LangChain and similar tools to run inferences on local hardware rather than somewhere in the cloud.\n\nThe set of supported request properties is an intersection of those [specified bei OpenAI](https://platform.openai.com/docs/api-reference) and those [accepted by llama.cpp](https://github.com/rozek/llama.cpp/blob/master/examples/main/README.md)\n\n### /v1/embeddings ###\n\n`/v1/embeddings` is modelled after the OpenAI endpoint to [create an embedding vector](https://platform.openai.com/docs/api-reference/embeddings). The parameters `model` and `input` are respected, `user` is ignored.\n\n### /v1/completions ###\n\n`/v1/completions` is modelled after the OpenAI endpoint to [create a completion](https://platform.openai.com/docs/api-reference/completions). The parameters `model`, `prompt`, `max_tokens`, `temperature`, `top_p`, `stop`, `frequency_penalty` are respected, `suffix`, `n`, `stream`, `logprobs`, `echo`, `presence_penalty`, `best_of`, `logit_bias` and `user` are ignored.\n\n### /v1/chat/completion ###\n\n`/v1/chat/completions` is modelled after the OpenAI endpoint to [create a chat completion](https://platform.openai.com/docs/api-reference/chat). The parameters `model`, `messages`, `max_tokens`, `temperature`, `top_p`, `stop`, `frequency_penalty` are respected, `functions`, `function_call`, `n`, `stream`, `presence_penalty`, `logit_bias` and `user` are ignored.\n\n## Best Practices ##\n\nIn order to get the best of LLaMA 2, you want to observe the following recommendations:\n\n### Do not rely on OpenAI Defaults ###\n\nThe defaults applied within the flows are those specified in the OpenAI API.\n\nHowever, these defaults do not fit well to the LLaMA 2 models and should therefore be replaced by more adequate values, in particular for the following parameters:\n\n* `top_p`: do not use the default value of `1` but reduce it to s.th. like 0.5 or lower,\n* `frequency_penalty`: LLaMA 2 almost _requires_ the `frequency_penalty` to be set to values above 0 (the OpenAI default) for meaningful results. The `llama.cpp` default `1.1` performs quite well\n\n### Explicitly define Stop Sequences ###\n\nWhile llama.cpp works quite well using the LLaMA 2 LLM, it tends to spit out an endless stream of tokens (which takes ages to complete and does not lead to useful answers). Setting a maximum for the number of tokens to be generated is not always helpful as it may cut off a (potential useful) answer.\n\nInstead, **it is recommended to define stop sequences!**. Llama.cpp obeyes these sequences and, thus, avoids many useless repetitions.\n\nWhich stop sequence to define is highly dependent on your actual use case (and the prompt you send to llama.cpp) - just take the `prompt template` described above as an example\n\n## License ##\n\n[MIT License](LICENSE.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frozek%2Fnode-red-flow-openai-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frozek%2Fnode-red-flow-openai-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frozek%2Fnode-red-flow-openai-api/lists"}