{"id":26801164,"url":"https://github.com/tjake/Jlama","last_synced_at":"2025-03-29T21:02:01.778Z","repository":{"id":184951233,"uuid":"672730040","full_name":"tjake/Jlama","owner":"tjake","description":"Jlama is a modern LLM inference engine for Java","archived":false,"fork":false,"pushed_at":"2025-03-09T22:28:19.000Z","size":4278,"stargazers_count":979,"open_issues_count":22,"forks_count":108,"subscribers_count":28,"default_branch":"main","last_synced_at":"2025-03-23T18:01:36.328Z","etag":null,"topics":["ai","genai","gpt","huggingface","java","llama","llm","openai","simd","transformers"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tjake.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-31T03:15:38.000Z","updated_at":"2025-03-23T02:51:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"f02074cc-e1cb-4102-8ffe-8a6b1825e52f","html_url":"https://github.com/tjake/Jlama","commit_stats":{"total_commits":243,"total_committers":14,"mean_commits":"17.357142857142858","dds":"0.10699588477366251","last_synced_commit":"c12e3f246eff40fcdccae521efca0824ecef2836"},"previous_names":["tjake/llm-j"],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tjake%2FJlama","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tjake%2FJlama/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tjake%2FJlama/releases","manifests_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/tjake%2FJlama/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tjake","download_url":"https://codeload.github.com/tjake/Jlama/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246059326,"owners_count":20717084,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","genai","gpt","huggingface","java","llama","llm","openai","simd","transformers"],"created_at":"2025-03-29T21:00:53.217Z","updated_at":"2025-03-29T21:02:01.772Z","avatar_url":"https://github.com/tjake.png","language":"Java","funding_links":[],"categories":["General LLM","人工智能","\u003ca name=\"Java\"\u003e\u003c/a\u003eJava"],"sub_categories":[],"readme":"# 🦙 Jlama: A modern LLM inference engine for Java\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/jlama.jpg\" width=\"300\" height=\"300\" alt=\"Cute Jlama\"\u003e\n\u003c/p\u003e\n\n[![Maven Central Version](https://img.shields.io/maven-central/v/com.github.tjake/jlama-parent?style=flat-square)](https://central.sonatype.com/artifact/com.github.tjake/jlama-core/overview)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)\n[![Discord](https://img.shields.io/discord/1279855254812229642?style=flat-square\u0026label=Discord\u0026color=663399)](https://discord.gg/HsYXHrMu6J)\n\n\n## 🚀 Features\n\nModel Support:\n  * Gemma \u0026 Gemma 2 Models\n  * Llama \u0026 Llama2 \u0026 Llama3 Models\n  * Mistral \u0026 Mixtral Models\n  * Qwen2 Models\n  * IBM Granite Models\n  * GPT-2 Models\n  * BERT Models\n  
* BPE Tokenizers\n  * WordPiece Tokenizers\n\nImplements:\n  * Paged Attention\n  * Mixture of Experts\n  * Tool Calling\n  * Generate Embeddings\n  * Classifier Support\n  * Huggingface [SafeTensors](https://github.com/huggingface/safetensors) model and tokenizer format\n  * Support for F32, F16, BF16 types\n  * Support for Q8, Q4 model quantization\n  * Fast GEMM operations\n  * Distributed Inference!\n\nJlama requires Java 20 or later and utilizes the new [Vector API](https://openjdk.org/jeps/448) \nfor faster inference.\n\n## 🤔 What is it used for? \n\nAdd LLM Inference directly to your Java application.\n\n## 🔬 Quick Start\n\n### 🕵️‍♀️ How to use as a local client (with jbang!)\nJlama includes a command line tool that makes it easy to use.\n\nThe CLI can be run with [jbang](https://www.jbang.dev/download/).\n\n```shell\n#Install jbang (or https://www.jbang.dev/download/)\ncurl -Ls https://sh.jbang.dev | bash -s - app setup\n\n#Install Jlama CLI (will ask if you trust the source)\njbang app install --force jlama@tjake\n```\n\nNow that you have jlama installed you can download a model from huggingface and chat with it.\nNote I have pre-quantized models available at https://hf.co/tjake\n\n```shell\n# Run the openai chat api and UI on a model\njlama restapi tjake/Llama-3.2-1B-Instruct-JQ4 --auto-download\n```\n\nopen browser to http://localhost:8080/\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/demo.png\" alt=\"Demo chat\"\u003e\n\u003c/p\u003e\n\n\n```shell\nUsage:\n\njlama [COMMAND]\n\nDescription:\n\nJlama is a modern LLM inference engine for Java!\nQuantized models are maintained at https://hf.co/tjake\n\nChoose from the available commands:\n\nInference:\n  chat                 Interact with the specified model\n  restapi              Starts a openai compatible rest api for interacting with this model\n  complete             Completes a prompt using the specified model\n\nDistributed Inference:\n  cluster-coordinator  Starts a distributed rest api 
for a model using cluster workers\n  cluster-worker       Connects to a cluster coordinator to perform distributed inference\n\nOther:\n  download             Downloads a HuggingFace model - use owner/name format\n  list                 Lists local models\n  quantize             Quantize the specified model\n  rm                   Removes local model\n```\n\n\n### 👨‍💻 How to use in your Java project\nThe main purpose of Jlama is to provide a simple way to use large language models in Java.\n\nThe simplest way to embed Jlama in your app is with the [Langchain4j Integration](https://github.com/langchain4j/langchain4j-examples/tree/main/jlama-examples).  \n\nIf you would like to embed Jlama without langchain4j, add the following [maven](https://central.sonatype.com/artifact/com.github.tjake/jlama-core/) dependencies to your project:\n\n```xml\n\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.github.tjake\u003c/groupId\u003e\n  \u003cartifactId\u003ejlama-core\u003c/artifactId\u003e\n  \u003cversion\u003e${jlama.version}\u003c/version\u003e\n\u003c/dependency\u003e\n\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.github.tjake\u003c/groupId\u003e\n  \u003cartifactId\u003ejlama-native\u003c/artifactId\u003e\n  \u003c!-- supports linux-x86_64, macos-x86_64/aarch_64, windows-x86_64 \n       Use https://github.com/trustin/os-maven-plugin to detect os and arch --\u003e\n  \u003cclassifier\u003e${os.detected.name}-${os.detected.arch}\u003c/classifier\u003e\n  \u003cversion\u003e${jlama.version}\u003c/version\u003e\n\u003c/dependency\u003e\n\n```\n\njlama uses Java 21 preview features. 
You can enable the features globally with:\n\n```shell\nexport JDK_JAVA_OPTIONS=\"--add-modules jdk.incubator.vector --enable-preview\"\n```\nAlternatively, enable the preview features by configuring the Maven compiler and failsafe plugins.\n\n\n\nThen you can use the Model classes to run models:\n\n```java\n public void sample() throws IOException {\n    String model = \"tjake/Llama-3.2-1B-Instruct-JQ4\";\n    String workingDirectory = \"./models\";\n\n    String prompt = \"What is the best season to plant avocados?\";\n\n    // Downloads the model or just returns the local path if it's already downloaded\n    File localModelPath = new Downloader(workingDirectory, model).huggingFaceModel();\n    \n    // Loads the quantized model and specifies the use of quantized memory\n    AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);\n\n    PromptContext ctx;\n    // Checks if the model supports chat prompting and adds the prompt in the format this model expects\n    if (m.promptSupport().isPresent()) {\n        ctx = m.promptSupport()\n                .get()\n                .builder()\n                .addSystemMessage(\"You are a helpful chatbot who writes short responses.\")\n                .addUserMessage(prompt)\n                .build();\n    } else {\n        ctx = PromptContext.of(prompt);\n    }\n\n    System.out.println(\"Prompt: \" + ctx.getPrompt() + \"\\n\");\n    // Generates a response to the prompt and prints it\n    // The API allows for streaming or non-streaming responses\n    // The response is generated with a temperature of 0.0 and a max token length of 256\n    Generator.Response r = m.generate(UUID.randomUUID(), ctx, 0.0f, 256, (s, f) -\u003e {});\n    System.out.println(r.responseText);\n }\n```\n\nOr you can use a **Builder API**:\n\n```java\n public void sample() throws IOException {\n    String model = \"tjake/Llama-3.2-1B-Instruct-JQ4\";\n    String workingDirectory = \"./models\";\n\n    String prompt = \"What is the best season to 
plant avocados?\";\n\n    // Downloads the model or just returns the local path if it's already downloaded\n    File localModelPath = new Downloader(workingDirectory, model).huggingFaceModel();\n    \n    // Loads the quantized model and specifies the use of quantized memory\n    AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);\n\n    PromptContext ctx;\n    // Checks if the model supports chat prompting and adds the prompt in the format this model expects\n    if (m.promptSupport().isPresent()) {\n        ctx = m.promptSupport()\n                .get()\n                .builder()\n                .addSystemMessage(\"You are a helpful chatbot who writes short responses.\")\n                .addUserMessage(prompt)\n                .build();\n    } else {\n        ctx = PromptContext.of(prompt);\n    }\n\n    System.out.println(\"Prompt: \" + ctx.getPrompt() + \"\\n\");\n    // Generates a response to the prompt and prints it\n    // The API allows for streaming or non-streaming responses\n    // The response is generated with a temperature of 0.0 and a max token length of 256\n    Generator.Response r = m.generateBuilder()\n            .session(UUID.randomUUID()) //By default, UUID.randomUUID()\n            .promptContext(ctx) // Required or use prompt(String text)\n            .ntokens(256) //By default, 256\n            .temperature(0.0f) //By default, 0.0f\n            .onTokenWithTimings((s, aFloat) -\u003e {}) //By default, (s, aFloat) -\u003e {}, nothing\n            .generate();\n    \n    System.out.println(r.responseText);\n }\n```\n\nYou can skip the `promptSupport()` check by using `prompt()`:\n\n```java\n public void sample() throws IOException {\n    String model = \"tjake/Llama-3.2-1B-Instruct-JQ4\";\n    String workingDirectory = \"./models\";\n\n    String prompt = \"What is the best season to plant avocados?\";\n\n    // Downloads the model or just returns the local path if it's already downloaded\n    File localModelPath = new 
Downloader(workingDirectory, model).huggingFaceModel();\n    \n    // Loads the quantized model and specifies the use of quantized memory\n    AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);\n    \n    var systemPrompt = \"You are a helpful chatbot who writes short responses.\";\n\n    PromptContext ctx = m.prompt()\n                        .addUserMessage(prompt)\n                        .addSystemMessage(systemPrompt)\n                        .build(); // build() creates a PromptContext; if the model doesn't support prompting, a simple PromptContext is created\n\n    System.out.println(\"Prompt: \" + ctx.getPrompt() + \"\\n\");\n    // Generates a response to the prompt and prints it\n    // The API allows for streaming or non-streaming responses\n    // The response is generated with a temperature of 0.0 and a max token length of 256\n    Generator.Response r = m.generateBuilder()\n            .session(UUID.randomUUID()) //By default, UUID.randomUUID()\n            .promptContext(ctx) // Required or use prompt(String text)\n            .ntokens(256) //By default, 256\n            .temperature(0.0f) //By default, 0.0f\n            .onTokenWithTimings((s, aFloat) -\u003e {}) //By default, (s, aFloat) -\u003e {}, nothing\n            .generate();\n    \n    System.out.println(r.responseText);\n }\n```\n\n## ⭐ Give us a Star! \n\nIf you like this project or are using it to build your own, please give us a star. It's a free way to show your support.\n\n## 🗺️ Roadmap\n\n* Support more and more models\n* \u003cs\u003eAdd pure java tokenizers\u003c/s\u003e\n* \u003cs\u003eSupport Quantization (e.g. 
k-quantization)\u003c/s\u003e\n* Add LoRA support\n* GraalVM support\n* \u003cs\u003eAdd distributed inference\u003c/s\u003e\n\n## 🏷️ License and Citation\n\nThe code is available under the [Apache License 2.0](./LICENSE).\n\nIf you find this project helpful in your research, please cite it as:\n\n```\n@misc{jlama2024,\n    title = {Jlama: A modern Java inference engine for large language models},\n    url = {https://github.com/tjake/jlama},\n    author = {T Jake Luciani},\n    month = {January},\n    year = {2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftjake%2FJlama","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftjake%2FJlama","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftjake%2FJlama/lists"}