-
The non-streaming chat/completions API from OpenAI returns token usage in the response:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "\n\nHello there, how may I assist you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}
```

This is great when you have different users and want to set limits depending on their usage. Is there a recommended way to do the same with the Vercel AI SDK?
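For reference, reading that field with the official openai Node SDK (non-streaming) looks roughly like this; the model and messages are placeholders:

```ts
import OpenAI from 'openai'

const client = new OpenAI()

const completion = await client.chat.completions.create({
  model: 'gpt-3.5-turbo',
  messages: [{ role: 'user', content: 'Hello!' }],
})

// usage is only present on non-streaming responses
console.log(completion.usage)
// => { prompt_tokens: 9, completion_tokens: 12, total_tokens: 21 }
```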
-
One potential option might be to track prompt tokens by using a tokenizer library before starting the stream or in the onStart callback. For the completion tokens, the streaming APIs return one token at a time, so you can track this in the onToken callback. From a quick search, dqbd/tiktoken seems to support Vercel's Edge Runtime.
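A rough sketch of that approach in a route handler, using `@dqbd/tiktoken` for the prompt and the older `OpenAIStream` callbacks from the `ai` package for the completion (model name and persistence step are placeholders, not a definitive implementation):

```ts
import OpenAI from 'openai'
import { OpenAIStream, StreamingTextResponse } from 'ai'
import { encoding_for_model } from '@dqbd/tiktoken'

const openai = new OpenAI()

export async function POST(req: Request) {
  const { messages } = await req.json()

  // Prompt tokens can be counted up front with a tokenizer.
  const enc = encoding_for_model('gpt-3.5-turbo')
  const promptTokens = messages.reduce(
    (sum: number, m: { content: string }) => sum + enc.encode(m.content).length,
    0,
  )
  enc.free() // wasm encoders must be freed explicitly

  let completionTokens = 0

  const response = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages,
    stream: true,
  })

  const stream = OpenAIStream(response, {
    onToken: async () => {
      completionTokens += 1 // one callback per streamed token
    },
    onCompletion: async () => {
      // persist promptTokens + completionTokens per user here
      console.log({ promptTokens, completionTokens })
    },
  })

  return new StreamingTextResponse(stream)
}
```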
-
Great feedback. We want to add this to the playground too.
-
I use https://github.com/dqbd/tiktoken for our production application. I've noticed that it gets very slow as the number of tokens you are counting goes up, so if you try and count in the
-
We built this for this exact use case with David's tiktokenizer package and his help: https://tiktokenizer.vercel.app/
-
Apparently, OpenAI already has a feature for this, but it's disabled (see https://community.openai.com/t/usage-info-in-api-responses/18862/3).
Maybe someone can convince them to enable it with a flag or something.
-
We've done a bit of research here and every tokenizer is too large (generally due to wasm) for us to include by default with the SDK. Our recommendation going forward will be to use your tokenizer of choice paired with the
-
Makes sense. To be honest, my hope was that Vercel could convince OpenAI to add the usage info to streaming responses.
-
Maybe not the right place to ask, but how can we access the stop_reason when streaming with the new v4 SDKs?
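If "v4" refers to the openai Node SDK, one way is to read `finish_reason` off the streamed chunks; the final chunk carries it. A sketch (not specific to the Vercel AI SDK, model and prompt are placeholders):

```ts
import OpenAI from 'openai'

const client = new OpenAI()

const stream = await client.chat.completions.create({
  model: 'gpt-3.5-turbo',
  messages: [{ role: 'user', content: 'Hello!' }],
  stream: true,
})

let finishReason: string | null = null
for await (const chunk of stream) {
  // earlier chunks have finish_reason: null; the last one has e.g. "stop" or "length"
  finishReason = chunk.choices[0]?.finish_reason ?? finishReason
}

console.log(finishReason)
```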
-
How does everyone approach this?
-
OpenAI should just provide this in the stream response. Following this thread to see how people are doing it in the meantime.
-
@pomber what solution did you settle on?
-
I have been using https://www.npmjs.com/package/@dqbd/tiktoken for months but have run into issues since the 16k context window was released for GPT-3.5. If the prompt context is long enough, tiktoken takes a noticeable time to run, and it takes longer as the prompt grows. I have some ideas to optimize this.
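One possible mitigation (not necessarily what the poster has in mind): cache per-message token counts so only new messages get re-encoded on each request. A sketch using `@dqbd/tiktoken`:

```ts
import { encoding_for_model } from '@dqbd/tiktoken'

const enc = encoding_for_model('gpt-3.5-turbo')
const cache = new Map<string, number>() // message content -> token count

// Chat history is mostly unchanged between requests, so re-encoding
// only unseen messages keeps tokenization cost roughly constant.
function countTokensCached(messages: { content: string }[]) {
  let total = 0
  for (const m of messages) {
    let n = cache.get(m.content)
    if (n === undefined) {
      n = enc.encode(m.content).length
      cache.set(m.content, n)
    }
    total += n
  }
  return total
}
```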
-
Found this quite useful if you want clean token counting: https://github.com/Cainier/gpt-tokens
-
streamText returns a `usage` promise which will give you the token usage once the stream has finished:

```ts
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

const result = await streamText({
  model: openai('gpt-4o'),
  messages,
  temperature: 0,
})

const stream = result.toAIStream({
  async onFinal(completion: string) {
    // resolve the usage tokens on completion, then save them per user, etc.
    const tokenCount = await result.usage
  },
})
```

`result.usage` resolves to an object with the token counts.
-
`streamText` has an `onFinish` callback (starting with `v3.1.15`) that sends `usage` (among other things).
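A minimal sketch of that callback, assuming the `@ai-sdk/openai` provider and a placeholder prompt:

```ts
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

const result = await streamText({
  model: openai('gpt-4o'),
  messages: [{ role: 'user', content: 'Hello!' }],
  // called once streaming completes, with usage and finishReason among other fields
  onFinish({ usage, finishReason }) {
    // usage: { promptTokens, completionTokens, totalTokens }
    console.log(usage, finishReason)
  },
})

// consume the stream so onFinish actually fires
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}
```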