{"id":28255510,"url":"https://github.com/fareedkhan-dev/save-llm-api-cost","last_synced_at":"2025-07-28T05:33:18.084Z","repository":{"id":294069503,"uuid":"985896529","full_name":"FareedKhan-dev/save-llm-api-cost","owner":"FareedKhan-dev","description":"A straightforward method to reduce your LLM inference API costs and token usage.","archived":false,"fork":false,"pushed_at":"2025-05-18T18:43:25.000Z","size":34,"stargazers_count":11,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-13T12:49:37.600Z","etag":null,"topics":["ai","api","artificial-intelligence","gemini","large-language-models","llm","openai"],"latest_commit_sha":null,"homepage":"https://medium.com/@fareedkhandev/e3da975c0424","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FareedKhan-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-18T18:39:17.000Z","updated_at":"2025-05-29T11:35:38.000Z","dependencies_parsed_at":"2025-05-18T19:46:17.293Z","dependency_job_id":null,"html_url":"https://github.com/FareedKhan-dev/save-llm-api-cost","commit_stats":null,"previous_names":["fareedkhan-dev/save-llm-api-cost"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/FareedKhan-dev/save-llm-api-cost","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FareedKhan-dev%2Fsave-llm-api-cost","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FareedKhan-dev%2Fsave-llm-api-cost/tags","releases_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FareedKhan-dev%2Fsave-llm-api-cost/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FareedKhan-dev%2Fsave-llm-api-cost/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FareedKhan-dev","download_url":"https://codeload.github.com/FareedKhan-dev/save-llm-api-cost/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FareedKhan-dev%2Fsave-llm-api-cost/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267468390,"owners_count":24092334,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-28T02:00:09.689Z","response_time":68,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","api","artificial-intelligence","gemini","large-language-models","llm","openai"],"created_at":"2025-05-19T22:13:54.715Z","updated_at":"2025-07-28T05:33:18.055Z","avatar_url":"https://github.com/FareedKhan-dev.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- omit in toc --\u003e\n# Save LLM API Cost with Memory Efficiency\n\n[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-370/) [![Nebius AI](https://img.shields.io/badge/Nebius%20AI-API-brightgreen)](https://cloud.nebius.ai/services/llm-embedding) 
[![OpenAI](https://img.shields.io/badge/OpenAI-API-lightgrey)](https://openai.com/) [![Medium](https://img.shields.io/badge/Medium-Blog-black?logo=medium)](https://medium.com/@fareedkhandev/saving-40-tokens-in-llm-api-to-reduce-the-cost-using-memory-efficiency-algorithm-e3da975c0424)\n\n\nYou have likely used LLM APIs such as OpenAI, Claude, or Gemini, and you may have noticed that when creating a chatbot, whether through RAG or as a standalone system, the cost increases with every run.\n\nThis is primarily because memory is attached to the chatbot, meaning the model retains the history of conversations to make the chatbot more conversational, similar to how ChatGPT remembers your previous chats.\n\n![Memory Size Increases Issue](https://cdn-images-1.medium.com/max/1500/1*HhkDsmEgNnvdyWJZE_Ee8Q.png)\n\nThis increase leads to higher costs. In this blog, we are going to use a [**memory-efficient algorithm**](https://mem0.ai/research) to reduce the number of tokens stored in memory by up to 40%, significantly lowering the cost of running inference for your chatbot.\n\nTake a look at the comparison between our memory-efficient algorithm and the raw approach.\n\n![Comparative Analysis](https://cdn-images-1.medium.com/max/1000/1*ka6hpbCJ5RDcoB2imWFzyg.png)\n\nAs the number of chats increases, the difference in total token count continues to grow. 
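To build intuition for how fast the raw approach grows, here is a minimal back-of-the-envelope sketch. The per-turn token counts below are hypothetical averages chosen for illustration, not measurements from this blog's runs:

```python
# Hypothetical per-turn averages (illustrative assumptions only)
USER_TOKENS = 30    # tokens in each user message
REPLY_TOKENS = 60   # tokens in each assistant reply

def raw_prompt_tokens(n):
    # Turn n resends all n user messages plus the n - 1 earlier replies
    return n * USER_TOKENS + (n - 1) * REPLY_TOKENS

print([raw_prompt_tokens(n) for n in (1, 5, 10)])  # [30, 390, 840]
```

Because every turn resends the entire history, the per-turn prompt grows linearly, and the total tokens billed across a whole conversation grow roughly quadratically with its length.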
With our algorithmic approach, the total token count grows much more slowly.\n\nThe spikes in the graph represent instances where the LLM is generating responses, while the other points indicate knowledge improvement updates.\n\nThis gap becomes significantly larger as the number of chats rises, which you will clearly see in the comparative analysis section.\n\n\n\u003c!-- omit in toc --\u003e\n# Table of Contents\n- [Setting up the stage](#setting-up-the-stage)\n- [What is the Problem?](#what-is-the-problem)\n- [Memory Efficiency as a Solution](#memory-efficiency-as-a-solution)\n- [Our Conversational Scenario](#our-conversational-scenario)\n- [Implementing Raw LLM Approach](#implementing-raw-llm-approach)\n- [Embedding and Response Generation](#embedding-and-response-generation)\n- [Memory Storing Feature](#memory-storing-feature)\n- [Classifying User Input](#classifying-user-input)\n- [Fact Extraction](#fact-extraction)\n- [Memory Update (ADD, UPDATE, NOOP)](#memory-update-add-update-noop)\n- [Retrieval for Answering Queries](#retrieval-for-answering-queries)\n- [Running Mem0 Algorithm](#running-mem0-algorithm)\n- [Comparative Analysis](#comparative-analysis)\n- [What’s Next](#whats-next)\n\n# Setting up the stage\n\nBefore we start, we need to import some basic libraries that will be used throughout this blog; you are most probably already aware of them. 
Let’s do that.\n\n```python\n# Import required standard and third-party libraries for memory, LLM, and data handling\nimport os      # OS interactions (not used directly here)\nimport json    # For JSON parsing/generation (LLM outputs)\nimport time    # For delays between API calls\nimport uuid    # For unique memory item IDs\nfrom datetime import datetime  # For timestamps\n\nimport numpy as np            # For embedding vectors\nimport pandas as pd           # For DataFrame/tabular analysis\nfrom openai import OpenAI     # For OpenAI-compatible LLM/embedding API\nfrom sklearn.metrics.pairwise import cosine_similarity  # For embedding similarity\n```\n\nNow that we have imported the libraries, let’s move on to the next steps, where we will understand what the memorization approach is and what the actual problem is.\n\n# What is the Problem?\n\nI am using the NebiusAI API, which is compatible with the OpenAI Python client and provides access to open-source LLMs. You can use any LLM API provider you prefer; just make sure that the response includes the total token count for analysis purposes.\n\nFirst, let’s initialize the client for our LLM provider along with a simple system message for the LLM.\n\n```python\n# Initialize OpenAI client with Nebius API\nclient = OpenAI(\n    base_url=\"YOUR_BASE_URL\", # Replace with your actual base URL\n    api_key=\"YOUR_LLM_API_KEY\"  # Replace with your actual key securely\n)\n\n# Chat history buffer (list of messages)\nchat_history = [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}\n]\n```\n\nLet’s create a function that incorporates a chat history feature into your LLM chatbot. 
After each query, it should print two things: the LLM response and the token usage information, so we can identify any issues.\n\n```python\ndef chat_with_history(user_input, max_length=100):\n    # Add user input to history\n    chat_history.append({\"role\": \"user\", \"content\": user_input})\n    \n    # Send request to model\n    response = client.chat.completions.create(\n        model=\"meta-llama/Llama-3.2-1B-Instruct\",\n        temperature=0.7,\n        messages=chat_history\n    )\n\n    # Get assistant message\n    assistant_message = response.choices[0].message.content\n    truncated_response = (assistant_message[:max_length] + '...') if len(assistant_message) \u003e max_length else assistant_message\n\n    # Add assistant response to chat history\n    chat_history.append({\"role\": \"assistant\", \"content\": assistant_message})\n\n    # Print required info\n    print(\"=== AI Response (truncated) ===\")\n    print(truncated_response)\n    print(\"\\n=== Token Usage ===\")\n    print(f\"Prompt tokens: {response.usage.prompt_tokens}\")\n    print(f\"Completion tokens: {response.usage.completion_tokens}\")\n    print(\"===============================\")\n\n    # Return this turn's data so we can analyse token usage later\n    return {\n        \"response\": assistant_message,\n        \"prompt_tokens\": response.usage.prompt_tokens,\n        \"completion_tokens\": response.usage.completion_tokens\n    }\n```\n\nWe are using the **LLaMA 3.2 1B LLM**. Let’s look at a simple example of how a user typically interacts with an LLM.\n\n```python\n# Asking two consecutive questions to our LLM\nchat_with_history(\"Hello, who won the last FIFA World Cup?\")\nchat_with_history(\"And who was the top scorer?\")\n\n\n## OUTPUT ##\n=== AI Response (truncated) ===\nThe last FIFA World Cup was held in 2022, and the winning ...\n\n=== Token Usage ===\nPrompt tokens: 50\nCompletion tokens: 64\n===============================\n\n\n=== AI Response (truncated) ===\nThe top scorer at the 2022 FIFA World Cup was Ciro Immobile ...\n\n=== Token Usage ===\nPrompt tokens: 130\nCompletion tokens: 29\n===============================\n```\n\nI simply call the function with consecutive messages. You can see that the second message ***“Who was the top scorer?”*** depends on the previous one. 
Otherwise, the LLM wouldn’t know which sport and which year the user is referring to.\n\nThe most important detail here is the history tokens. In the second message, you can see that the total prompt tokens are 130 (the 50 prompt tokens from the first turn + its 64 answer tokens + 16 tokens for the new user query).\n\nYou might have already guessed that …\n\n\u003e the deeper the conversation gets, the more tokens are passed to the LLM\n\nTake a look at this graph, where we ask our LLaMA 3.2 1B LLM to create **a children’s story**, an example where chat history plays an important role (because we will definitely be updating the story according to our needs).\n\n![Tokens Usage after each response](https://cdn-images-1.medium.com/max/1000/1*nn6W_z_t0rDxLxtZb8clWA.png)\n\nYou can clearly see that when we use our LLM function to create the children’s story, the prompt tokens increase rapidly with each message, up to the 10th message.\n\nThis happens because, as we observed earlier, all previous messages are included in each new prompt. So, by the time we reach the 10th message, it includes all the previous 9 messages in the context.\n\n# Memory Efficiency as a Solution\n\nNow that we understand the problem, we need to look at the solution. 
I call it **memory efficiency**, a technique designed to reduce the token load.\n\nLet’s first visualize how it works, and then we will understand it thoroughly.\n\n![Memory Efficiency FlowChart](https://cdn-images-1.medium.com/max/1000/1*Vx0Sy1Ci0Y7Ay__qPJwDhw.png)\n\nLet’s understand how the memory system works:\n\n*   User sends a message to the AI (LLM)\n*   System receives the message\n*   System checks: Is it a **statement** or a **question**?\n\n**If it’s a statement:**\n\n*   Extract key facts from the message\n*   For each fact, find similar information in the memory store\n*   Decide what to do with each fact: **ADD** if it’s new, **UPDATE** if it changes or improves something already stored, **NOOP** if it’s already known and doesn’t change anything\n*   Store the updated or new information in memory\n\n**If it’s a question:**\n\n*   Search memory store for relevant information based on the query\n*   Use the query and found information to generate a helpful response\n*   Send the answer back to the user\n\nThe key idea is that not everything we send to an LLM is a question; most of the time, it’s a statement that needs to be stored. Only the relevant statements should be retrieved and used as a source when forming a response.\n\nThis approach not only reduces the memory size for the LLM but also makes the interaction more efficient.\n\n# Our Conversational Scenario\n\nI will be using a marketing campaign strategy scenario with our LLM, as it is much closer to a real-world example, where you create a (RAG or standalone) chatbot that answers queries related to marketing campaigns.\n\nFirst, let’s define a normal user conversation scenario:\n\n```python\nconversation_script = [\n    {\"role\": \"user\", \"content\": \"Hi, let's start planning the 'New Marketing Campaign'. 
My primary goal is to increase brand awareness by 20%.\"},\n    {\"role\": \"user\", \"content\": \"For this campaign, the target audience is young adults aged 18-25.\"},\n    {\"role\": \"user\", \"content\": \"I want to allocate a budget of $5000 for social media ads for the New Marketing Campaign.\"},\n    {\"role\": \"user\", \"content\": \"What's the main goal for the New Marketing Campaign?\"},\n    {\"role\": \"user\", \"content\": \"Who are we targeting for this campaign?\"},\n    {\"role\": \"user\", \"content\": \"Let's also consider influencers. Add a task: 'Research potential influencers for the 18-25 demographic' for the New Marketing Campaign.\"},\n    {\"role\": \"user\", \"content\": \"Actually, let's increase the social media ad budget for the New Marketing Campaign to $7500.\"},\n    {\"role\": \"user\", \"content\": \"What's the current budget for social media ads for the New Marketing Campaign?\"},\n    {\"role\": \"user\", \"content\": \"What tasks do I have pending for this campaign?\"},\n    {\"role\": \"user\", \"content\": \"Also, for the New Marketing Campaign, I prefer visual content for this demographic, like short videos and infographics.\"}\n]\n```\n\nThese are typical conversations we might have with our LLM, where we alternate between asking questions and making statements. A *statement* means we’re providing information or context to the LLM, helping it improve its responses or tailor them to our needs.\n\nFor example, in the first chat, we ask the LLM to start planning a marketing campaign. It responds with a campaign outline. 
Then, in the second message, we describe the target audience so that the LLM’s responses reflect our goals, and so on.\n\n# Implementing Raw LLM Approach\n\nSo, now that we have our conversation scenario ready, we’ll simulate 10 such interactions using the **raw approach** conversation strategy and observe how it performs.\n\n```python\n# Run the scripted conversation through the LLM, storing each turn's input and response\nraw_conversation = []\nfor turn in conversation_script:\n    user_input = turn[\"content\"]  # Extract user message from the script\n    response = chat_with_history(user_input)  # Get LLM response using chat history\n    raw_conversation.append(response)  # Store the result for later analysis\n```\n\nThis runs the conversation between our messages and the LLM’s responses. Let’s look at the total number of tokens in the last (i.e., 10th) message.\n\n```python\nprint(f\"Prompt tokens used in the last turn: {raw_conversation[-1]['prompt_tokens']}\")\nprint(f\"Completion tokens used in the last turn: {raw_conversation[-1]['completion_tokens']}\")\n\n\n## OUTPUT ##\nPrompt tokens used in the last turn: 4446\nCompletion tokens used in the last turn: 451\n```\n\nThe total prompt tokens in our last message exceeded **4,000**, which is quite high for just 10 chat messages.\n\nLet’s now visualize this growing history to better understand the impact.\n\n![Tokens Usage after each response](https://cdn-images-1.medium.com/max/1000/1*n84WV70Rp34dvlT8wOLFlA.png)\n\nLater, we will use this prompt token count to measure the percentage change. For now, we have documented the result of the **raw LLM approach** as our baseline.\n\n# Embedding and Response Generation\n\nBefore we start coding the memory-efficient approach, we need to implement some helper functions to avoid code duplication.\n\nThe first function we will create is an **embedding generation function**. 
This will help us find the relationship between the user query and the stored memory state.\n\n```python\n# Embedding Model Name\nEMBEDDING_MODEL = \"BAAI/bge-multilingual-gemma2\"\n\n# Function to generate embedding\ndef get_embedding(text_to_embed):\n    \"\"\"\n    Returns the embedding vector for the given text.\n\n    Parameters:\n    - text_to_embed: The input text to embed.\n\n    Returns:\n    - A NumPy array containing the embedding.\n    \"\"\"\n    response = client.embeddings.create(model=EMBEDDING_MODEL, input=text_to_embed)\n    return np.array(response.data[0].embedding)\n```\n\nSecond, we need to create an **LLM generation function** that will handle generating responses from the model.\n\n```python\n# Chat model used throughout the memory-efficient approach\nLLM_MODEL = \"meta-llama/Llama-3.2-1B-Instruct\"\n\ndef get_llm_chat_completion(messages):\n    \"\"\"\n    Sends a chat request to the language model and returns the response content and token usage.\n\n    Parameters:\n    - messages: List of messages in chat format (role/content).\n\n    Returns:\n    - content: The generated response text.\n    - prompt_tokens: Number of tokens used in the input.\n    - completion_tokens: Number of tokens used in the output.\n    \"\"\"\n    # Send request to the chat model\n    response = client.chat.completions.create(\n        model=LLM_MODEL,\n        messages=messages\n    )\n\n    # Extract the generated text\n    content = response.choices[0].message.content\n\n    # Get token usage info (if available)\n    prompt_tokens = response.usage.prompt_tokens\n    completion_tokens = response.usage.completion_tokens\n\n    return content, prompt_tokens, completion_tokens\n```\n\nNow that we have coded the two helper functions for our memory-efficient approach, let’s start building the main logic for it.\n\n# Memory Storing Feature\n\nNormally, you would use a **vector database** to implement this feature properly, since it would store 
embedding vectors and handle similarity searches efficiently.\n\nHowever, since this is a beginner-friendly guide, we’ll keep things simple and apply **OOP principles** instead.\n\nHere’s the idea:\n\n*   **Memory** refers to user-provided statements that enhance the LLM’s understanding, rather than trigger immediate responses.\n*   These statements may inform future queries, where the LLM can retrieve relevant context.\n*   To manage this, each memory entry needs a **unique identity**.\n\nLet’s now define a class structure to create and manage this identity for each memory.\n\n```python\nclass MemoryItem:\n    def __init__(self, text_content, source_turn_indices_list, verbose_embedding=False):\n        \"\"\"\n        Initializes a MemoryItem instance.\n\n        Parameters:\n        - text_content: The text to store in memory.\n        - source_turn_indices_list: List of conversation turn indices this item is based on.\n        - verbose_embedding: Whether to print debug info during embedding.\n        \"\"\"\n        self.id = str(uuid.uuid4())  # Unique ID for the memory item\n        self.text = text_content  # The text content of the memory\n        self.embedding = get_embedding(text_content) # Removed verbose=verbose_embedding as get_embedding doesn't take it\n        self.creation_timestamp = datetime.now()  # Time when this item was created\n        self.last_accessed_timestamp = self.creation_timestamp  # Time when it was last accessed\n        self.access_count = 0  # How many times this memory was accessed\n        self.source_turn_indices = list(source_turn_indices_list)  # Reference to conversation turns\n\n    def __repr__(self):\n        \"\"\"\n        String representation showing a brief summary of the memory item.\n        \"\"\"\n        return (f\"MemoryItem(id={self.id}, text='{self.text[:60]}...', \"\n                f\"created={self.creation_timestamp.strftime('%H:%M:%S')}, accessed={self.access_count})\")\n\n    def mark_accessed(self):\n        \"\"\"\n        Updates the access time and count when the memory is accessed.\n        \"\"\"\n        self.last_accessed_timestamp = datetime.now()\n        self.access_count += 1\n```\n\nThis class only gives each memory a unique identity; we haven’t implemented the **memory storing** feature yet.\n\nSo, let’s go ahead and code the part that **stores memory entries**, allowing us to keep track of user-provided statements for future reference.\n\n```python\nclass MemoryStore:\n    def __init__(self, verbose=False): # Added verbose to __init__\n        self.memories = {}  # Dictionary to store memory items by their ID\n        self.verbose = verbose # Store verbose flag\n\n    def add_memory_item(self, item):\n        self.memories[item.id] = item\n\n    def get_memory_item_by_id(self, memory_id):\n        item = self.memories.get(memory_id)\n        if item:\n            item.mark_accessed()\n        return item\n\n    def update_existing_memory_item(self, memory_id, new_text, turn_indices):\n        item = self.memories.get(memory_id)\n        if not item:\n            if self.verbose:\n                print(f\"[MemoryStore] UPDATE FAILED: ID {memory_id} not found.\")\n            return False\n\n        item.text = new_text\n        item.embedding = get_embedding(new_text) # Removed verbose=self.verbose\n        item.creation_timestamp = datetime.now()\n        item.source_turn_indices = list(set(item.source_turn_indices + turn_indices))\n        item.mark_accessed()\n\n        return True\n\n    def find_semantically_similar_memories(self, query_embedding, top_k=3, threshold=0.5):\n        # Filter out memory items with invalid embeddings\n        candidates = [\n            (mid, mem.embedding) for mid, mem in self.memories.items()\n            if mem.embedding is not None and mem.embedding.size \u003e 0 and np.any(mem.embedding)\n        ]\n        \n        if not candidates: # Handle case with no valid embeddings\n            return []\n\n        # 
Stack embeddings into matrix for similarity comparison\n        ids, embeddings = zip(*candidates)\n        embeddings = np.vstack(embeddings)\n        query_embedding = query_embedding.reshape(1, -1)\n\n        # Compute cosine similarity and sort results\n        similarities = cosine_similarity(query_embedding, embeddings)[0]\n        sorted_indices = np.argsort(similarities)[::-1]\n\n        # Return top-k results above threshold\n        return [\n            (self.memories[ids[i]], similarities[i])\n            for i in sorted_indices[:top_k]\n            if similarities[i] \u003e= threshold\n        ]\n```\n\nThis might seem a bit complex at first, but it’s actually quite easy to understand. Our **memory store** class includes four key functionalities:\n\n1.  **Add** a memory item if it doesn’t already exist\n2.  **Update** an existing memory item\n3.  **Retrieve** memory items relevant to the user’s query\n4.  **Find** semantically similar memories using embeddings\n\nOptionally, you can also add a **delete** feature to remove unused memories and reduce memory size, but that’s entirely up to you.\n\nNow that we have defined the core helper functions for our **memory-efficient approach**, you might be wondering: how can we tell whether the user input is a statement or a question? That is what we will identify in the next section.\n\n# Classifying User Input\n\nBefore processing, we need to know if the user is making a statement or asking a question. An LLM can help with this classification.\n\n```python\ndef classify_input(user_input):\n    \"\"\"\n    Classifies the user input as either a 'query' or a 'statement'.\n\n    Parameters:\n    - user_input: The text input from the user.\n\n    Returns:\n    - A string: either 'query' or 'statement'.\n    \"\"\"\n    # Instruction for the LLM to act as a classifier\n    system_prompt = (\n        \"You are a classifier. \"\n        \"A 'query' is a question or request for information. 
\"\n        \"A 'statement' is a declaration, instruction, or information that is not a question. \"\n        \"Respond with only one word: either 'query' or 'statement'.\"\n    )\n\n    # Send the classification request to the model\n    response = client.chat.completions.create(\n        model=\"meta-llama/Llama-3.2-1B-Instruct\",\n        messages=[\n            {\"role\": \"system\", \"content\": system_prompt},\n            {\"role\": \"user\", \"content\": f\"Classify this: {user_input}\"}\n        ]\n    )\n\n    # Return the classification result (in lowercase)\n    return response.choices[0].message.content.strip().lower()\n```\n\nThis LLM function will help us determine whether the user query is a **question** or a **statement**. This distinction is crucial, as it dictates what memory action should be taken, whether to store, retrieve, or ignore memory.\n\nIn a production setting, this check would run **every time** the user submits a query. But since we already have a predefined conversation scenario, we’ll run this classification on our existing conversation data and observe how it performs.\n\n```python\n# We'll add this classification to our conversation_script items\nfor turn in conversation_script:\n    classification = classify_input(turn[\"content\"])\n    turn[\"type\"] = classification\n```\n\nIt will classify all of the user’s queries.\n\nLet’s print one of them and check its **classification type** to see whether it’s identified as a **question** or a **statement**.\n\n```python\n# Printing first item\nconversation_script[0]\n\n\n## OUTPUT ##\n{\n  'role': 'user',\n  'content': \"Hi, let's start planning the 'New Marketing  ... 
\",\n  'type': 'statement'\n}\n```\n\nIt classifies our first user query as a **statement**, which is correct because it provides information rather than asking a question; it’s meant to inform the LLM.\n\n# Fact Extraction\n\nNow that we know our LLM can differentiate between a query and a statement, the next step is:\n\n*   If the user query is a **question**, the LLM will generate a response.\n*   If it’s a **statement**, it should extract facts and features from it to be stored in the memory we previously coded.\n\nWe need a prompt template that will help us extract facts and features from a statement.\n\nLet’s define that prompt first.\n*(Note: The original HTML showed `fact_prompt` being defined with f-string placeholders `{recent_turns_window_text}` and `{current_user_statement_text}`. These would need to be populated dynamically when the prompt is used, or the function using it should construct this prompt.)*\n\n```python\n# This is how the prompt would be constructed within a function\n# def construct_fact_prompt(current_user_statement_text, recent_turns_window_text):\n#     return f\"\"\"\n# Extract concise, declarative facts from the 'New User Statement' based on the 'Recent Conversation Context'.\n# Focus on user's goals, plans, preferences, decisions, and key entities.\n# Ignore questions, acknowledgements, fluff, or inference.\n# Context:\n# ---BEGIN---\n# {recent_turns_window_text or \"(No prior context)\"}\n# ---END---\n# Statement: \"{current_user_statement_text}\"\n# Output ONLY a JSON list of strings (facts). 
Return [] if none.\n# \"\"\"\n\n# For demonstration, here's a static version of what the prompt aims for:\nfact_prompt_template_text = \"\"\"\nExtract concise, declarative facts from the 'New User Statement' based on the 'Recent Conversation Context'.\nFocus on user's goals, plans, preferences, decisions, and key entities.\nIgnore questions, acknowledgements, fluff, or inference.\nContext:\n---BEGIN---\n{recent_turns_window_text_placeholder_or_empty}\n---END---\nStatement: \"{current_user_statement_text_placeholder}\"\nOutput ONLY a JSON list of strings (facts). Return [] if none.\n\"\"\"\n```\n\nThis prompt template will extract facts and features from the statement. You can customize it to fit the specific domain your chatbot focuses on, but for now, this is the best template I have found after multiple attempts.\n\nLet’s go ahead and code the function that performs the feature extraction.\n\n```python\ndef mem0_extract_salient_facts_from_turn(current_user_statement_text, recent_turns_window_text, current_turn_index_in_script):\n    \"\"\"\n    Extracts salient facts from the current user statement and recent conversation turns.\n\n    Parameters:\n    - current_user_statement_text: Text of the current user statement.\n    - recent_turns_window_text: Text from recent conversation turns for context.\n    - current_turn_index_in_script: Index of the current turn in the overall script.\n\n    Returns:\n    - facts: A list of extracted facts (parsed from JSON).\n    \"\"\"\n    # Construct the fact_prompt dynamically\n    fact_prompt = f\"\"\"\nExtract concise, declarative facts from the 'New User Statement' based on the 'Recent Conversation Context'.\nFocus on user's goals, plans, preferences, decisions, and key entities.\nIgnore questions, acknowledgements, fluff, or inference.\nContext:\n---BEGIN---\n{recent_turns_window_text or \"(No prior context)\"}\n---END---\nStatement: \"{current_user_statement_text}\"\nOutput ONLY a JSON list of strings (facts). 
Return [] if none.\n\"\"\"\n\n    messages = [\n        {\"role\": \"system\", \"content\": \"Expert extraction AI. Output ONLY valid JSON list of facts.\"},\n        {\"role\": \"user\", \"content\": fact_prompt}\n    ]\n\n    # Call the LLM chat completion function\n    response_text, prompt_tokens, completion_tokens = get_llm_chat_completion(\n        messages\n    )\n    \n    # (Assuming global token counters for demo purposes, initialize them if not present)\n    # global total_prompt_tokens_mem0_extract, total_completion_tokens_mem0_extract\n    # total_prompt_tokens_mem0_extract += prompt_tokens\n    # total_completion_tokens_mem0_extract += completion_tokens\n\n\n    # Extract JSON array from the response text safely\n    facts = []\n    try:\n        start = response_text.find('[')\n        end = response_text.rfind(']')\n        if start != -1 and end != -1 and end + 1 \u003e start:\n            json_candidate = response_text[start:end + 1]\n            facts = json.loads(json_candidate)\n        else:\n            print(f\"Warning: Could not find JSON list in response: {response_text}\")\n    except json.JSONDecodeError:\n        print(f\"Warning: JSONDecodeError for response: {response_text}\")\n        # Fallback or error handling\n\n    return facts\n```\n\nThis function will extract facts from the statement in the form of a JSON array. 
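Since small models often wrap their JSON in extra chatter, the function above slices from the first `[` to the last `]` before calling `json.loads`. Here is that trick in isolation, as a standalone sketch with a made-up reply string:

```python
import json

def extract_json_list(response_text):
    # Keep only the first '[' .. last ']' span of a chatty LLM reply
    start = response_text.find('[')
    end = response_text.rfind(']')
    if start != -1 and start < end:
        try:
            return json.loads(response_text[start:end + 1])
        except json.JSONDecodeError:
            return []
    return []

noisy = 'Sure! Facts: ["Budget is $5000", "Target audience is 18-25"] Hope that helps.'
print(extract_json_list(noisy))  # ['Budget is $5000', 'Target audience is 18-25']
```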
We’ll parse this using some basic code.\n\nOnce extracted, these facts will be stored in our memory, where they can be either **added** as new entries or **updated** if they already exist.\n\n# Memory Update (ADD, UPDATE, NOOP)\n\nThere is currently no link between the **extracted facts** and the **memory database** we coded earlier.\n\nTo establish that connection, we need to create a function that determines the appropriate relationship and action.\n\nBut before we do that, we need a **prompt template** that evaluates:\n\n*   The **incoming statement**\n*   The **available memory entries**\n\nThis prompt will guide the model in deciding what action to take, such as updating an existing memory or storing a new one.\n\nLet’s define that prompt template first.\n*(Similar to `fact_prompt`, this prompt is dynamic)*\n```python\n# This is how the prompt would be constructed within a function\n# def construct_update_prompt(candidate_fact_text, similar_segment):\n#     return f\"\"\"\n# You manage a memory store. Decide how to handle this new fact:\n# \"{candidate_fact_text}\"\n# {similar_segment}\n# Choose ONE:\n# - ADD: New info.\n# - UPDATE: Improves one existing memory (give memory ID and new text).\n# - NOOP: Redundant.\n# Respond with JSON:\n# {{\"operation\": \"ADD\"}} or\n# {{\"operation\": \"UPDATE\", \"target_memory_id\": \"ID\", \"updated_memory_text\": \"Text\"}} or\n# {{\"operation\": \"NOOP\"}}\n# Decision:\n# \"\"\"\n\n# For demonstration:\nupdate_prompt_template_text = \"\"\"\nYou manage a memory store. 
Decide how to handle this new fact:\n\"{candidate_fact_text_placeholder}\"\n{similar_segment_placeholder}\nChoose ONE:\n- ADD: New info.\n- UPDATE: Improves one existing memory (give memory ID and new text).\n- NOOP: Redundant.\nRespond with JSON:\n{{\"operation\": \"ADD\"}} or\n{{\"operation\": \"UPDATE\", \"target_memory_id\": \"ID\", \"updated_memory_text\": \"Text\"}} or\n{{\"operation\": \"NOOP\"}}\nDecision:\n\"\"\"\n```\n\nThis prompt template will define the **action** that needs to be taken based on the user’s statement, whether it’s to add new memory, update existing memory, or take no action.\n\nLet’s now code the function that performs the appropriate action based on the response.\n\n```python\nS_SIMILAR_MEMORIES_FOR_UPDATE_DECISION = 3  # Number of similar memories to consider\n# (Initialize token counters if they are global for demo purposes)\n# total_prompt_tokens_mem0_update = 0\n# total_completion_tokens_mem0_update = 0\n\ndef mem0_decide_memory_operation_with_llm(candidate_fact_text, similar_existing_memories_list):\n    \"\"\"\n    Uses the LLM to decide whether a candidate fact should:\n    - be added as a new memory,\n    - update an existing memory,\n    - or be ignored (no-op).\n\n    Parameters:\n    - candidate_fact_text: The new fact to evaluate.\n    - similar_existing_memories_list: List of (MemoryItem, similarity_score) tuples.\n\n    Returns:\n    - A dictionary with one of the following formats:\n      {\"operation\": \"ADD\"}\n      {\"operation\": \"UPDATE\", \"target_memory_id\": \"ID\", \"updated_memory_text\": \"Text\"}\n      {\"operation\": \"NOOP\"}\n    \"\"\"\n\n    # Format the similar memories section for the prompt\n    similar_segment = \"No similar memories.\"\n    if similar_existing_memories_list:\n        formatted = [\n            f\"{i+1}. 
ID: {mem.id}, Sim: {sim_score:.4f}, Text: '{mem.text}'\"\n            for i, (mem, sim_score) in enumerate(similar_existing_memories_list)\n        ]\n        similar_segment = \"Similar Memories:\\n\" + \"\\n\".join(formatted)\n\n    # Construct the prompt dynamically\n    prompt = f\"\"\"\nYou manage a memory store. Decide how to handle this new fact:\n\"{candidate_fact_text}\"\n{similar_segment}\nChoose ONE:\n- ADD: New info.\n- UPDATE: Improves one existing memory (give memory ID and new text).\n- NOOP: Redundant.\nRespond with JSON:\n{{\"operation\": \"ADD\"}} or\n{{\"operation\": \"UPDATE\", \"target_memory_id\": \"ID\", \"updated_memory_text\": \"Text\"}} or\n{{\"operation\": \"NOOP\"}}\nDecision:\n\"\"\"\n\n    # Build the message for the LLM\n    messages = [\n        {\"role\": \"system\", \"content\": \"Decide memory action. Output ONLY valid JSON.\"},\n        {\"role\": \"user\", \"content\": prompt}\n    ]\n\n    # Get the LLM's response\n    response_text, p_tokens, c_tokens = get_llm_chat_completion(\n        messages\n    )\n\n    # Track token usage (assuming global counters for demo)\n    # global total_prompt_tokens_mem0_update, total_completion_tokens_mem0_update\n    # total_prompt_tokens_mem0_update += p_tokens\n    # total_completion_tokens_mem0_update += c_tokens\n\n    # Extract and parse the JSON from the response\n    decision = {}\n    try:\n        start = response_text.find('{')\n        end = response_text.rfind('}')\n        if start != -1 and end != -1 and end + 1 \u003e start:\n            json_candidate = response_text[start:end + 1]\n            decision = json.loads(json_candidate)\n        else:\n            print(f\"Warning: Could not find JSON object in response: {response_text}\")\n            decision = {\"operation\": \"NOOP\"} # Fallback\n    except json.JSONDecodeError:\n        print(f\"Warning: JSONDecodeError for response: {response_text}\")\n        decision = {\"operation\": \"NOOP\"} # Fallback\n\n    return 
decision\n```\n\nThis will work similarly to our previous function that extracted facts from the LLM response in a list format.\n\nThe key difference is that this one will extract the information as a **JSON object**, defining the relationship between the current statement and the existing memory. Aside from that, the structure and logic remain largely the same.\n\n# Retrieval for Answering Queries\n\nIf you’re familiar with **RAG**, this step will feel very similar.\n\nWhen the user asks a **query**, we **retrieve relevant memories** and pass them along with the query to help the LLM generate a more accurate and context-aware response.\n\nThis logic can easily be wrapped into a single function, so let’s implement that.\n\n```python\n# Default number of top similar memories to retrieve\nK_MEMORIES_TO_RETRIEVE_FOR_QUERY = 3\n\ndef mem0_retrieve_and_format_memories_for_llm_query(user_query_text, memory_store_instance, turn_log_entry, top_k_results=K_MEMORIES_TO_RETRIEVE_FOR_QUERY):\n    \"\"\"\n    Retrieves and formats top-k relevant memories from the store for a given user query.\n\n    Parameters:\n    - user_query_text: The current user query.\n    - memory_store_instance: Instance of MemoryStore to search in.\n    - turn_log_entry: Dict to store memory retrieval info for logging.\n    - top_k_results: Number of top similar memories to retrieve.\n\n    Returns:\n    - A formatted string of relevant memories or a fallback message.\n    \"\"\"\n    turn_log_entry['retrieved_memories_for_query'] = []\n\n    # Get embedding for the user query\n    query_embedding = get_embedding(user_query_text)\n\n    # Find top-k similar memories based on the query embedding\n    retrieved = memory_store_instance.find_semantically_similar_memories(query_embedding, top_k=top_k_results)\n\n    # Build formatted memory list\n    if not retrieved:\n        return \"Relevant memories:\\n(No relevant memories found)\"\n\n    output = \"Relevant memories:\\n\"\n    for i, (mem, score) in enumerate(retrieved):\n        memory_store_instance.get_memory_item_by_id(mem.id)  # Mark as accessed\n        output += f\"{i+1}. {mem.text} (Similarity: {score:.3f})\\n\"\n\n        # Log retrieval details\n        turn_log_entry['retrieved_memories_for_query'].append({\n            'id': mem.id,\n            'text': mem.text,\n            'similarity': score\n        })\n\n    return output.strip()\n```\n\nThis function simply **retrieves relevant memories** based on the user’s query, **when the input is not a statement**.\n\nIn this case, the user is seeking an **answer**, not trying to add to the LLM’s knowledge. So, instead of storing anything, we search for related context to assist in generating a more accurate and informed response.\n\n# Running Mem0 Algorithm\n\nSo, we have coded everything needed for our memory-efficient algorithm.\n\nNext, let’s build the **main loop** of the algorithm, which will process each message in our conversation scenario. Based on the type of each message (question or statement), it will take the appropriate action, whether generating a response or updating memory.\n\n*(Note: The following code block uses variables like `script`, `memory_store_instance`, `M_RECENT_RAW_TURNS_FOR_EXTRACTION_CONTEXT`, `raw_conversation_log_for_extraction_context`, `VERBOSE_MEM0_RUN`, `current_short_term_llm_chat_history`, `total_prompt_tokens_mem0_conversation`, `total_completion_tokens_mem0_conversation`, `SHORT_TERM_CHAT_HISTORY_WINDOW`. These would need to be initialized appropriately before this loop runs. 
The `mem0_process_user_statement_for_memory` function is also implied and would combine `mem0_extract_salient_facts_from_turn` and `mem0_decide_memory_operation_with_llm` along with MemoryStore operations.)*\n\n```python\n# ---- Helper function implied by the main loop ----\ndef mem0_process_user_statement_for_memory(\n    current_user_statement_text, \n    recent_turns_window_text, \n    memory_store_instance, \n    current_turn_index, \n    turn_log_entry, \n    verbose=False\n):\n    # 1. Extract facts\n    facts = mem0_extract_salient_facts_from_turn(\n        current_user_statement_text, recent_turns_window_text, current_turn_index\n    )\n    if verbose:\n        print(f\"[Extractor LLM] Parsed {len(facts)} fact(s).\")\n    turn_log_entry['extracted_facts'] = facts\n\n    if not facts:\n        return\n\n    if verbose:\n        print(f\"[MemoryOrchestrator] Extracted {len(facts)} fact(s).\")\n    \n    turn_log_entry['memory_operations'] = []\n\n    for fact_text in facts:\n        fact_embedding = get_embedding(fact_text)\n        similar_memories = memory_store_instance.find_semantically_similar_memories(\n            fact_embedding, top_k=S_SIMILAR_MEMORIES_FOR_UPDATE_DECISION\n        )\n        \n        decision = mem0_decide_memory_operation_with_llm(fact_text, similar_memories)\n        turn_log_entry['memory_operations'].append({'fact': fact_text, 'decision': decision})\n\n        if verbose:\n            print(f\"[MemoryOrchestrator] Fact {fact_text[:30]}...: LLM Decision -\u003e {decision['operation']}\")\n\n        if decision['operation'] == 'ADD':\n            new_mem_item = MemoryItem(fact_text, [current_turn_index])\n            memory_store_instance.add_memory_item(new_mem_item)\n        elif decision['operation'] == 'UPDATE':\n            target_id = decision.get('target_memory_id')\n            updated_text = decision.get('updated_memory_text')\n            if target_id and updated_text:\n                
memory_store_instance.update_existing_memory_item(target_id, updated_text, [current_turn_index])\n            else:\n                if verbose: print(\"UPDATE decision missing target_id or updated_text.\")\n        # NOOP means do nothing\n\n# ---- Initialization for the main loop (Example values) ----\nscript = conversation_script # Already defined with 'type' by classify_input\nmemory_store_instance = MemoryStore(verbose=True)\nM_RECENT_RAW_TURNS_FOR_EXTRACTION_CONTEXT = 4 # Example: last 2 user turns + 2 assistant turns\nraw_conversation_log_for_extraction_context = []\nVERBOSE_MEM0_RUN = True\ncurrent_short_term_llm_chat_history = [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}]\nSHORT_TERM_CHAT_HISTORY_WINDOW = 3 # Example: Keep last 3 user/assistant pairs + system prompt\ntotal_prompt_tokens_mem0_conversation = 0\ntotal_completion_tokens_mem0_conversation = 0\n# Initialize other token counters if used globally, e.g., total_prompt_tokens_mem0_extract, etc.\nmem0_turn_logs = [] # To store log entries\n\n# ---- Main Loop ----\n# Iterate over each turn in the script\nfor turn_index, turn_data in enumerate(script):\n    user_msg = turn_data['content']  # User input text\n    turn_type = turn_data['type']    # Type: 'statement', or 'query'\n\n    if VERBOSE_MEM0_RUN:\n        print(f\"\\n--- Mem0 Turn {turn_index + 1}/{len(script)} ({turn_type}) ---\")\n        print(f\"User: {user_msg[:70]}...\")\n\n    # Log for this turn\n    turn_log_entry = {\n        \"turn\": turn_index + 1,\n        \"type\": turn_type,\n        \"user_content\": user_msg\n    }\n\n    assistant_response = \"(Ack/Internal Processing)\"  # Default placeholder\n\n    # Handle statements and updates\n    if turn_type == 'statement': # Simplified, original had 'statement_update'\n        # Collect recent user-assistant raw conversation log for context\n        recent = 
\"\\n\".join(raw_conversation_log_for_extraction_context[-M_RECENT_RAW_TURNS_FOR_EXTRACTION_CONTEXT:])\n\n        # Process the user message for fact extraction and memory update\n        mem0_process_user_statement_for_memory(\n            user_msg, recent, memory_store_instance, turn_index, turn_log_entry, verbose=VERBOSE_MEM0_RUN\n        )\n\n        # Assistant's response depends on statement type\n        assistant_response = \"Okay, noted.\" # if turn_type == 'statement' else \"Okay, updated.\"\n\n        # Log zero tokens since this path does not invoke LLM for conversational response\n        turn_log_entry.update({\n            'assistant_response_conversational': assistant_response,\n            'prompt_tokens_conversational_turn': 0,\n            'completion_tokens_conversational_turn': 0\n        })\n\n    # Handle queries (e.g., questions)\n    elif turn_type == 'query':\n        # Retrieve relevant memories based on semantic similarity to the query\n        retrieved = mem0_retrieve_and_format_memories_for_llm_query(user_msg, memory_store_instance, turn_log_entry)\n\n        # Create prompt for LLM with memory context and query\n        messages_for_llm = list(current_short_term_llm_chat_history) + [{\n            \"role\": \"user\",\n            \"content\": f\"User Query: '{user_msg}'\\n\\nRelevant Info from Memory:\\n{retrieved}\"\n        }]\n        \n        # Get LLM response\n        # Assuming get_llm_chat_completion can take max_tokens and verbose\n        assistant_response, p_tokens, c_tokens = get_llm_chat_completion(messages_for_llm) # Removed max_tokens, verbose for simplicity unless defined in get_llm_chat_completion\n\n        # Track token usage for this query\n        total_prompt_tokens_mem0_conversation += p_tokens\n        total_completion_tokens_mem0_conversation += c_tokens\n\n        # Update turn log with LLM output and token info\n        turn_log_entry.update({\n            'assistant_response_conversational': 
assistant_response,\n            'prompt_tokens_conversational_turn': p_tokens,\n            'completion_tokens_conversational_turn': c_tokens\n        })\n    \n    if VERBOSE_MEM0_RUN:\n        print(f\"Assistant: {assistant_response[:70]}...\")\n\n\n    # Update raw log used for memory extraction context\n    raw_conversation_log_for_extraction_context += [\n        f\"T{turn_index+1} U: {user_msg}\",\n        f\"T{turn_index+1} A: {assistant_response}\"\n    ]\n\n    # Update short-term history for future LLM context\n    current_short_term_llm_chat_history += [\n        {\"role\": \"user\", \"content\": user_msg},\n        {\"role\": \"assistant\", \"content\": assistant_response}\n    ]\n\n    # Keep only the most recent chat history within the sliding window\n    if len(current_short_term_llm_chat_history) \u003e (1 + SHORT_TERM_CHAT_HISTORY_WINDOW * 2): # 1 for system prompt\n        current_short_term_llm_chat_history = [current_short_term_llm_chat_history[0]] + \\\n            current_short_term_llm_chat_history[-(SHORT_TERM_CHAT_HISTORY_WINDOW * 2):]\n\n    # Log the memory store size after processing this turn\n    turn_log_entry['mem_store_size_after_turn'] = len(memory_store_instance.memories)\n    mem0_turn_logs.append(turn_log_entry)\n```\n\nWhen we start the above loop, the **memory efficiency mechanism** kicks in.\n\nIn a production setup, this would run live: each time a user sends a message, the system would:\n\n1.  Determine whether it’s a question or a statement,\n2.  Retrieve or update memory accordingly, and\n3.  Generate a response if needed.\n\nNow, let’s take a look at what it starts logging in the terminal.\n\n```\n## OUTPUT OF OUR MEM0 ALGORITHM ##\n\n--- Mem0 Turn 1/10 (statement) ---\nUser: Hi, lets start planning the New Marketing Campaign. 
My primary goal is to inc...\n[Extractor LLM] Parsed 2 fact(s).\n[MemoryOrchestrator] Extracted 2 fact(s).\n[MemoryOrchestrator] Fact The user wants to start planni...: LLM Decision -\u003e ADD\n[MemoryOrchestrator] Fact The users primary goal for th...: LLM Decision -\u003e ADD\nAssistant (Ack): Okay, noted.\n\n--- Mem0 Turn 2/10 (statement) ---\nUser: For this campaign, the target audience is young adults aged 18-25....\n[Extractor LLM] Parsed 1 fact(s).\n[MemoryOrchestrator] Extracted 1 fact(s).\n[MemoryOrchestrator] Fact The target audience for the N...: LLM Decision -\u003e ADD\nAssistant (Ack): Okay, noted.\n\n--- Mem0 Turn 3/10 (statement) ---\nUser: I want to allocate a budget of $5000 for social media ads for the New Marketing ...\n[Extractor LLM] Parsed 1 fact(s).\n[MemoryOrchestrator] Extracted 1 fact(s).\n[MemoryOrchestrator] Fact The user wants to allocate a b...: LLM Decision -\u003e ADD\nAssistant (Ack): Okay, noted.\n\n--- Mem0 Turn 4/10 (query) ---\nUser: Whats the main goal for the New Marketing Campaign?...\nAssistant: The main goal for the New Marketing Campaign is to **increase brand awareness by...\n\n\n....\n```\n\nSo, from the output of the first four chat turns:\n\n*   The **first three inputs** were identified as **statements**, so the LLM didn’t generate a conversational response; it simply stored the extracted facts in memory.\n*   The **fourth input** was a **question**, so the LLM used the relevant memory and generated a proper response.\n\nThis confirms that the **memory is being correctly updated** with meaningful context.\n\nHowever, to really understand the impact of this approach, we need to perform a **comparative analysis**.\n\nThis analysis will show the actual **difference in token usage** between the raw LLM approach and the memory-efficient one.\n\nLet’s move on to that comparison.\n\n# Comparative Analysis\n\nWe have already measured token usage for the **raw LLM approach**, and we just ran the **memory-efficient approach** on our 10-message conversation.\n\nNow, let’s create a **DataFrame** to organize the token usage data from both approaches. This will help us clearly see the difference, and then we can **plot a graph** to visualize the comparison.\n\nWe need to do some basic data cleaning to transform our output into a proper DataFrame.\n*(Note: The following code assumes that variables like `final_raw_prompt_tokens`, `final_mem0_overall_prompt_tokens`, etc., have been calculated and stored from the previous runs. You would need to sum up tokens from your logs for this.)*\n```python\n\n# Placeholder values for DataFrame creation if above sums are not implemented for this snippet\nfinal_raw_prompt_tokens, final_raw_completion_tokens, final_raw_total_tokens = 4446, 451, 4897 # From earlier\nfinal_mem0_overall_prompt_tokens, final_mem0_overall_completion_tokens, final_mem0_overall_total_tokens = 2800, 300, 3100 # Hypothetical\nfinal_mem0_conv_prompt_tokens, final_mem0_conv_completion_tokens = 1000, 200 # Hypothetical\nfinal_mem0_extr_prompt_tokens, final_mem0_extr_completion_tokens = 1200, 50 # Hypothetical\nfinal_mem0_upd_prompt_tokens, final_mem0_upd_completion_tokens = 600, 50 # Hypothetical\n\n# Prepare comparison data between Raw and Mem0 approaches\ndata = {\n    'Metric': [\n        'Prompt Tokens',            # Total prompt tokens used\n        'Completion Tokens',        # Total completion tokens used\n        'Total Tokens',             # Sum of prompt + completion\n        '',                         # Empty row for spacing\n\n        # Breakdown of token usage for Mem0 approach\n        'Mem0: Conv Prompt',        # Prompt tokens for conversation turns\n        'Mem0: Conv Completion',    # Completion tokens for conversation turns\n        'Mem0: Extract Prompt',     # Prompt tokens for memory extraction\n        'Mem0: Extract Completion', # Completion tokens for memory extraction\n        'Mem0: Update Prompt',      # Prompt tokens for memory updates\n        'Mem0: Update Completion'   
# Completion tokens for memory updates\n    ],\n    'Raw Approach': [\n        final_raw_prompt_tokens,       # Raw prompt tokens\n        final_raw_completion_tokens,   # Raw completion tokens\n        final_raw_total_tokens,        # Raw total tokens\n        '',                            # Empty value for spacer row\n        '-', '-', '-', '-', '-', '-'   # No breakdown available for raw approach\n    ],\n    'Mem0 Approach': [\n        final_mem0_overall_prompt_tokens,        # Total prompt tokens across Mem0\n        final_mem0_overall_completion_tokens,    # Total completion tokens across Mem0\n        final_mem0_overall_total_tokens,         # Total tokens in Mem0\n        '',                                       # Spacer\n        final_mem0_conv_prompt_tokens,           # Prompt tokens for conversation\n        final_mem0_conv_completion_tokens,       # Completion tokens for conversation\n        final_mem0_extr_prompt_tokens,           # Prompt tokens for extraction\n        final_mem0_extr_completion_tokens,       # Completion tokens for extraction\n        final_mem0_upd_prompt_tokens,            # Prompt tokens for updates\n        final_mem0_upd_completion_tokens         # Completion tokens for updates\n    ]\n}\n\n# Create a DataFrame for display and analysis\ncomparison_df = pd.DataFrame(data)\nprint(comparison_df.to_string()) # Print DataFrame to console\n```\n\nHere is what our **comparison DataFrame** looks like *(these figures come from an actual run, so they differ from the placeholder values above)*.\n\n\n| #   | Metric                           | Raw Approach | Mem0 Approach | Percentage Difference (%) |\n|-----|----------------------------------|--------------|----------------|----------------------------|\n| 0   | Prompt Tokens                    | 7616         | 5037           | 33.86                      |\n| 1   | Completion Tokens                | 1372         | 410            | 70.12                      |\n| 2   | Total Tokens                     | 8988         | 5447           | 39.40                      |\n| 3   |                                  |              |                | NaN                        |\n| 4   | Mem0: Conversational Prompt      | -            | 788            | NaN                        |\n| 5   | Mem0: Conversational Completion  | -            | 98             | NaN                        |\n| 6   | Mem0: Extraction Prompt          | -            | 1453           | NaN                        |\n| 7   | Mem0: Extraction Completion      | -            | 168            | NaN                        |\n| 8   | Mem0: Update Logic Prompt        | -            | 2796           | NaN                        |\n| 9   | Mem0: Update Logic Completion    | -            | 144            | NaN                        |\n\n\nWe were able to reduce total token usage by almost **40%** within just 10 chat turns (prompt tokens dropped by about 34%, and completion tokens by about 70%). Let’s visualize this now.\n\n![Comparative Analysis Graph](https://cdn-images-1.medium.com/max/1000/1*ka6hpbCJ5RDcoB2imWFzyg.png)\n\nThe red line (our memory-efficient algorithm) only jumps on the turns where the user actually asks for a response rather than making a statement, which happens on turns 4, 5, 8, and 9. For the remaining turns, the total token count (prompt + completion) doesn’t grow nearly as rapidly as it does in the raw LLM approach.\n\nThe yellow region shows the difference, and this gap only widens over a greater number of chats. Here is the same comparison for 100 chat conversations. 
\n\n![For 100 Chat Conversation Graph](https://cdn-images-1.medium.com/max/1000/1*zPJZ_ixIaujjrFEwMqvBOA.png)\n\nAfter 100 conversations the difference is huge, a saving of more than **60%**. The impact is clear and significant: it directly reduces API costs, which is the most important factor here.\n\n# What’s Next\n\nSo, it’s clear that the memory-efficient approach is significantly better and easily adaptable.\n\nYou can use my notebook and enhance it further by developing more powerful prompts and adding features to improve speed and efficiency.\n\nAdditionally, you can incorporate evaluation techniques to assess the quality of responses from both approaches.\n\nBased on these evaluations, the algorithm can be refined to boost both efficiency and accuracy.\n\n\u003e I’d love to hear your thoughts on how this approach can be improved.
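As a closing sanity check, the Percentage Difference column from the comparison table can be recomputed from any pair of token totals with a tiny helper (`pct_saving` is illustrative, not from the notebook; the inputs are the 10-turn totals reported above):

```python
def pct_saving(raw_tokens, mem0_tokens):
    """Percentage of tokens saved by the Mem0 approach relative to the raw approach."""
    return round((raw_tokens - mem0_tokens) / raw_tokens * 100, 2)

# Totals from the 10-turn comparison table
print(pct_saving(7616, 5037))  # prompt tokens     -> 33.86
print(pct_saving(1372, 410))   # completion tokens -> 70.12
print(pct_saving(8988, 5447))  # total tokens      -> 39.4
```

The same helper can be pointed at your own run's totals to track how the saving grows with conversation length.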