
GPTCache: Supercharge Generative AI

February 17, 2025
5 min read

Are you waiting for the future to happen or ready to make it happen?

Don’t miss your chance to join Gen AI Launch Pad 2025 and shape what’s next.

Introduction

With the increasing use of Generative AI models like GPT-4, developers and businesses face challenges related to latency, cost, and efficiency. GPTCache is a powerful caching library designed to optimize the performance of Large Language Model (LLM) applications by storing and reusing previous responses. This not only reduces redundant API calls but also enhances user experience with faster response times.

In this blog, we’ll explore the capabilities of GPTCache, break down the code required to integrate it into AI applications, and discuss best practices for maximizing efficiency. Whether you're working on chatbots, Retrieval-Augmented Generation (RAG) systems, or other AI-driven applications, this guide will help you unlock the full potential of GPTCache.

Setting Up GPTCache

Before integrating GPTCache into your AI workflow, you need to install the required dependencies. The following command installs GPTCache along with other necessary packages:

pip install gptcache onnxruntime openai==0.28 tiktoken

To use the OpenAI API, you need to set up an API key in your environment:

import os
from google.colab import userdata  # Colab-only helper for stored secrets

# Read the key from Colab's secret store and expose it as an environment variable
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

Explanation

  • gptcache is the main library used for caching AI responses.
  • onnxruntime runs the ONNX embedding model that GPTCache uses for similarity matching.
  • openai (pinned to 0.28) is the official OpenAI client; GPTCache’s adapter targets this legacy ChatCompletion interface.
  • tiktoken is OpenAI’s tokenizer library, used to count tokens in prompts and responses.
  • The OpenAI API key is retrieved from Google Colab’s userdata module and set as an environment variable.

Real-World Use Case: If your application repeatedly receives the same or similar queries, caching responses prevents unnecessary API calls, reducing costs and improving user experience.
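If you’re not working in Colab, a minimal alternative (our addition, not part of the original setup) is to read the key from the environment and fall back to an interactive prompt:

import os
from getpass import getpass

# Use the exported OPENAI_API_KEY if present; otherwise prompt for it
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass('OpenAI API key: ')
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY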

OpenAI API Without GPTCache

Let’s first observe the standard OpenAI API call without caching:

import time
import openai

openai.api_key = OPENAI_API_KEY

def response_text(openai_resp):
    # Extract the assistant's reply from the ChatCompletion response
    return openai_resp['choices'][0]['message']['content']

question = "what's chatgpt"

start_time = time.time()
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

print(f'Question: {question}')
print("Time consuming: {:.2f}s".format(time.time() - start_time))
print(f'Answer: {response_text(response)}\n')

Expected Output

Question: what's chatgpt
Time consuming: 0.87s
Answer: ChatGPT is a chatbot developed by OpenAI...

Analysis: Every time the same question is asked, an API call is made, leading to additional cost and increased latency.
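To make the cost concrete, a small sketch (our addition, reusing the question and the response_text helper from above): calling the API twice with the same question pays the full round-trip latency both times.

# Without caching, every repeat of the same question pays full API latency
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    print("Round trip: {:.2f}s".format(time.time() - start_time))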

Implementing GPTCache

To speed up responses, let’s initialize GPTCache:

from gptcache import cache
from gptcache.adapter import openai  # drop-in replacement for the openai module

cache.init()            # default setup: exact-match caching
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment
print("Cache loading...")

Explanation

  • cache.init() initializes the caching system with its default (exact-match) configuration.
  • cache.set_openai_key() passes the OpenAI API key from the environment to GPTCache.
  • Importing openai from gptcache.adapter swaps in a drop-in replacement that checks the cache before calling the real API.

Benefit: Once caching is enabled, repeated queries will return instantly without making API requests.

Query Timing with GPTCache

question = "what's github"
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": question}],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')

Expected Output

Question: what's github
Time consuming: 0.84s
Answer: GitHub is a web-based platform...

Question: what's github
Time consuming: 0.76s
Answer: GitHub is a web-based platform...

Observation: The second call is faster because GPTCache serves the stored answer instead of querying the API again.

Implementing Semantic Search in GPTCache

To enhance caching capabilities, we use similarity-based search with ONNX and FAISS:

from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# The ONNX model turns questions into embeddings; FAISS indexes them for search
onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))

# Re-initialize the cache to match on semantic similarity, not exact strings
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

Explanation

  • The Onnx embedding model converts each question into a vector, so similar questions land close together.
  • FAISS accelerates vector search, making similarity-based cache lookups efficient.
  • get_data_manager wires together a scalar store (sqlite) for cached responses and a vector index (faiss) for their embeddings.

Use Case: If users ask slightly different variations of the same question (e.g., "What is GitHub?", "Tell me about GitHub"), GPTCache retrieves a previously stored response instead of generating a new one.
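A quick sketch of that behavior (our addition; it assumes the semantic cache above is initialized and both calls run in the same session): the second phrasing differs, but should still be answered from the cache.

# Two phrasings of the same question; the semantic cache should catch both
for question in ["what is github", "can you explain what GitHub is"]:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": question}],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')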

Exact Match Caching

For applications that require strict matching, GPTCache supports exact match evaluation:

from gptcache.similarity_evaluation.exact_match import ExactMatchEvaluation

# Re-initialize so cached answers are reused only for identical queries
cache.init(similarity_evaluation=ExactMatchEvaluation())
cache.set_openai_key()

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[{'role': 'user', 'content': 'what is chatgpt'}],
)
print(response_text(response))

Benefit

  • Ensures that responses are served from the cache only when a query exactly matches a previous one (see the sketch below).
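To contrast with semantic matching, a small sketch (our addition, assuming the exact-match cache above is active): the identical query should hit the cache, while the paraphrase should miss it and trigger a fresh API call.

# Identical query: cache hit. Paraphrase: cache miss under exact matching.
for question in ["what is chatgpt", "tell me about chatgpt"]:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': question}],
    )
    print(f'{question}: {time.time() - start_time:.2f}s')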

Conclusion

GPTCache is a game-changer for optimizing LLM applications, offering significant reductions in API costs and response times. By combining exact-match caching with semantic search over ONNX embeddings and a FAISS index, developers can enhance the efficiency of AI applications in production.

Next Steps

  • Experiment with different caching strategies based on your use case.
  • Integrate GPTCache into chatbot applications for improved performance.
  • Explore hybrid caching techniques combining exact match and similarity search (a starting-point sketch follows below).
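As a starting point for that last item, here is a hypothetical HybridEvaluation sketch, not an official GPTCache feature. It assumes GPTCache’s SimilarityEvaluation base class and reuses the SearchDistanceEvaluation, onnx, and data_manager objects from earlier: an exact string match short-circuits to the top score, and anything else falls back to the distance-based score.

from gptcache.similarity_evaluation import SimilarityEvaluation
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

class HybridEvaluation(SimilarityEvaluation):
    """Hypothetical evaluator: exact string match wins outright,
    otherwise fall back to the FAISS search-distance score."""

    def __init__(self):
        self._distance = SearchDistanceEvaluation()

    def evaluation(self, src_dict, cache_dict, **kwargs):
        # Identical questions always score the maximum
        if src_dict.get("question") == cache_dict.get("question"):
            return self.range()[1]
        # Otherwise defer to the distance-based evaluation
        return self._distance.evaluation(src_dict, cache_dict, **kwargs)

    def range(self):
        return self._distance.range()

# Plug it into the semantic cache set up earlier
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=HybridEvaluation(),
)
cache.set_openai_key()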

Resources

  • GPTCache GitHub Repository
  • OpenAI API Documentation
  • ONNX Runtime
  • FAISS
  • GPT Cache Notebook

---------------------------

Stay Updated: Follow Build Fast with AI’s pages for all the latest AI updates and resources.

Experts predict 2025 will be the defining year for Gen AI Implementation. Want to be ahead of the curve?

Join Build Fast with AI’s Gen AI Launch Pad 2025 - your accelerated path to mastering AI tools and building revolutionary applications.

---------------------------

Resources and Community

Join our community of 12,000+ AI enthusiasts and learn to build powerful AI applications! Whether you're a beginner or an experienced developer, this tutorial will help you understand and implement caching for LLM applications in your projects.

  • Website: www.buildfastwithai.com
  • LinkedIn: linkedin.com/company/build-fast-with-ai/
  • Instagram: instagram.com/buildfastwithai/
  • Twitter: x.com/satvikps
  • Telegram: t.me/BuildFastWithAI