Updated: 2023-12-20
Explore a unified approach for measuring token consumption in various LLMs. Enhance your LLM API management with our detailed guide on token measurement across diverse models.
In the ever-changing world of large language models (LLMs), LLM DevOps teams often face the complexity of managing different models from several providers. One aspect of this complexity stems from the lack of a unified token measurement method for LLMs, as outlined in articles like How To Understand, Manage Token-Based Pricing of Generative AI Large Language Models. Each model, whether from the same or a different provider, often follows its own token count metric, or, in some cases, a character count metric, leading to challenges in consistent evaluation and integration.
This article aims to simplify the process by introducing a method to use an LLM Gateway to deploy a unified token measurement approach across different models. Rather than advocating for a single standard tokenizer model, we focus on a versatile technique that allows developers to uniformly measure token usage irrespective of the underlying LLM. This approach is designed to simplify the assessment and management of LLMs, making it easier for developers to compare, optimize, and integrate these models into their projects.
As we dive into this topic, we will examine the current challenges in measuring token consumption, highlight the importance of a unified measurement method, and propose a practical solution that can be adapted to various LLMs, thereby enhancing efficiency and clarity in LLM DevOps.
Imagine a scenario where we, as application developers, are exploring different LLMs for integration into our application. We use a variety of LLM APIs for both unique and overlapping functions. A key aspect we need to understand is token consumption across models, which affects our decisions and our ability to optimize future usage costs and efficiency.
The primary challenge, as noted in the introduction, is the lack of a standardized method for measuring token consumption consistently across models. Each LLM, tied to its own token counting mechanism, presents its metrics in its own way. For instance, using an LLM Gateway like Gecholog.ai, we can easily capture and report prompt, completion, or total tokens per API call. However, challenges arise when some models do not report token usage at all. For example, the default setting for the Llama2 model we are using doesn't even include completion tokens in the response payload. Again, we are looking for a unified token measurement method for large language models.
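To illustrate the gap, here is a simplified Python sketch of reading the provider-reported usage where it exists and coming up empty where it doesn't. The response shapes and field names below are illustrative only, not exact provider payloads:

```python
# Illustrative response shapes only, not exact provider payloads:
# an OpenAI-style response carries a "usage" block, while some model
# deployments return no token counts at all.
openai_style_response = {
    "choices": [{"message": {"content": "..."}}],
    "usage": {"prompt_tokens": 57, "completion_tokens": 128, "total_tokens": 185},
}

llama2_style_response = {
    "text": "...",  # completion text only, no usage block to read
}

def reported_completion_tokens(response: dict) -> int | None:
    """Return the provider-reported completion token count, or None if absent."""
    return response.get("usage", {}).get("completion_tokens")

print(reported_completion_tokens(openai_style_response))   # 128
print(reported_completion_tokens(llama2_style_response))   # None -> we must count ourselves
```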
How, then, can we consistently measure token consumption across these diverse models? We must accept that any unified way of reporting token consumption introduces a systematic error, because the models tokenize differently. Nevertheless, it is worth establishing an internal measurement method that applies uniformly across all the models we use.
Before diving into our solution, it's crucial to understand why we focus on tokens rather than just characters. Tokenization is a fundamental process in LLMs. It involves breaking down text into manageable units – tokens – which could be words, subwords, characters, or bytes. This process is essential for the model to interpret and process text efficiently. Since LLMs are inherently designed and trained based on this tokenization concept, measuring in tokens rather than characters offers a more precise reflection of the computational effort and resources each model requires. In essence, it's about measuring the workload in the language model's own terms.
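To make this concrete, here is a minimal Python sketch, using the Hugging Face GPT2 tokenizer as one example, of how a short sentence splits into subword tokens and why the token count differs from the character count:

```python
# Minimal illustration: the same sentence measured in characters
# versus GPT2 subword tokens (Hugging Face transformers).
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "Unified token measurement simplifies LLM DevOps."
tokens = tokenizer.tokenize(text)

print(len(text))    # character count
print(len(tokens))  # GPT2 token count
print(tokens)       # the subword pieces the model actually processes
```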
To address the challenge of inconsistent token measurement across different LLMs, we use a method that is a) based on a standardized LLM architecture and b) does not tamper with the real-time request-response flow to the LLM API. To fulfill the first criterion, we use an LLM Gateway to route our LLM API traffic flows (as proposed in the architecture by Andreessen Horowitz). For the second, we make sure our unified token measurement is asynchronous, meaning it is performed only after the response from the LLM API has been sent back to the application.
Logging API Calls: We utilize an LLM Gateway, specifically Gecholog.ai, for its capability to log each API call in a simple, cloud-platform-agnostic manner. This platform not only simplifies the process of log generation but also facilitates comprehensive analytics. (Refer to our previous article LLM API Traffic Management: Mastering Integration with LLM DevOps and LLM Gateway.)
Post-Processing with Custom Processor: To augment the Gecholog.ai logs asynchronously, we use the "custom processor" functionality of the gateway, which allows us to connect our own services to augment, modify, or change the traffic flows. In this case we want a non-intrusive method, so we deploy a post-processor designed to count tokens from the response (completion) data. The advantage of post-processing is that the real-time performance of LLM requests remains unaffected.
Utilizing Hugging Face Transformers: Our processor leverages the Hugging Face Python transformers library, known for its wide range of tokenizers. For demonstration purposes, we've chosen the GPT2Tokenizer, though this method is flexible enough to incorporate other tokenizers, such as OpenAI's tiktoken or the BERT tokenizer, depending on our preferences (see the counting sketch below).
You can find an example Quick Start guide to the custom processor we used on our docs site.
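As a companion to that guide, the sketch below shows the core counting step in isolation. The function and field names are illustrative assumptions, not the gateway's actual custom-processor interface, which is covered in the Quick Start guide:

```python
# A minimal sketch of the token-counting step such a post-processor could
# run asynchronously on logged completion text. Function and field names
# are illustrative assumptions, not the gateway's actual interface.
from transformers import GPT2TokenizerFast

# Load the tokenizer once at startup; reloading per request would be wasteful.
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_completion_tokens(completion_text: str) -> int:
    """Count GPT2 tokens in a completion, regardless of which LLM produced it."""
    return len(gpt2_tokenizer.encode(completion_text))

def tag_log_record(log_record: dict) -> dict:
    """Augment a logged API call with a unified token count (post-processing only)."""
    completion_text = log_record.get("completion", "")
    log_record["gpt2_completion_tokens"] = count_completion_tokens(completion_text)
    return log_record
```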
In our LLM DevOps approach, we've established a uniform method to assess token consumption, specifically focusing on completion tokens, across a range of models from various providers. We use Gecholog.ai, and a custom post-processor tags each request with its GPT2 token count.
Our initial tests show that GPT2's token calculations have a slight deviation, about 5%, compared to GPT4 and GPT3.5. We also now have a token measurement for Llama2 completion tokens that we didn’t have before.
Consider the case where we're satisfied with the quality of the LLM API responses from Llama2, which notably consume more tokens. With our unified measurement approach, we can now estimate the token consumption we would expect if we switched these calls to GPT3.5 or GPT4 instead.
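As a back-of-the-envelope example, with placeholder per-token prices (not actual list prices), the unified count turns that comparison into a simple calculation:

```python
# Back-of-the-envelope comparison using the unified GPT2 token count.
# The prices below are placeholders for illustration, not actual pricing.
unified_completion_tokens = 1_200_000  # e.g. last month's Llama2 completions, GPT2-counted

placeholder_usd_per_1k_completion_tokens = {
    "gpt-3.5": 0.002,  # hypothetical
    "gpt-4": 0.06,     # hypothetical
}

for model, price in placeholder_usd_per_1k_completion_tokens.items():
    estimate = unified_completion_tokens / 1000 * price
    print(f"{model}: ~${estimate:.2f} for the same completion volume")
```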
Embracing this unified token measurement method allows us to not only compare token consumption but also to expand our analytics. We can integrate standard performance metrics like average request duration or error rates, providing a holistic view of LLM performance.
Furthermore, this method opens avenues for more detailed analyses, such as comparing token distribution patterns across different models. Such comparative studies can yield valuable insights into the behavior and efficiency of various LLMs.
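As a sketch of what such an analysis could look like, assuming the gateway logs have been exported with per-call fields for model, duration, status, and the unified token tag (the file name and column names here are hypothetical):

```python
# Sketch of cross-model analytics over exported gateway logs.
# File and column names are hypothetical examples, not a fixed schema.
import pandas as pd

logs = pd.read_json("gateway_logs.jsonl", lines=True)

summary = logs.groupby("model").agg(
    calls=("gpt2_completion_tokens", "count"),
    mean_tokens=("gpt2_completion_tokens", "mean"),
    p95_tokens=("gpt2_completion_tokens", lambda s: s.quantile(0.95)),
    mean_duration_ms=("duration_ms", "mean"),
    error_rate=("status_code", lambda s: (s >= 400).mean()),
)

print(summary)
```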
To conclude, the unified method we have proposed for measuring token consumption offers a way to establish control in a quickly developing and still fragmented generative AI landscape. One clear benefit of deploying and controlling your own measurement is that you can change the method at will, which reduces lock-in and simplifies evaluating or transitioning between models. This unified approach does not solve the challenge of the complex pricing of different LLM APIs, but it provides one building block for establishing future-proof LLM DevOps methods.
By deploying a unified token measurement, we enable a more consistent and comparative analysis of model performance, irrespective of the underlying differences in their tokenization methodologies. This unified approach simplifies the task for developers and analysts, ensuring that their focus can remain on optimizing model performance and integrating these advanced technologies into their projects more effectively.
While this unified method streamlines certain aspects of LLM management, it's important to recognize its limitations. The approach involves a level of approximation and may not capture the nuances of each model's unique tokenization process. Nevertheless, the benefits of having a common metric far outweigh these limitations, especially in a field where model diversity and complexity are ever-increasing.
Looking forward, perhaps a more standardized way or best practice will emerge, and if that is the case, the approach proposed in this article can easily adapt to such best practice. In essence, our journey into unifying token measurement is just the beginning. As the field of LLMs continues to evolve, so too will our methods and tools for measuring and understanding them.
Interested in unifying your token consumption tracking? Sign up for our free trial to enhance your LLM API traffic management. Boost your application’s efficiency, security, and scalability today.