Bongiysite

Generative AI & LLMs Crash Course: Understand the Inner World of AI

1.2. What are Tokens?


Generative AI & LLMs Crash Course is a foundation-to-advanced-level course in which you will learn how Generative AI and Large Language Models (LLMs) actually work, and how the underlying technology drives the entire AI system.

What are Tokens?

In the context of Large Language Models (LLMs), such as the GPT-4 models behind ChatGPT, a token is a small unit of text that the model uses to process and generate language. A token can represent a whole word, part of a word, a punctuation mark, or even a space, depending on the language and the tokenization method used.

Tool: https://platform.openai.com/tokenizer

How LLMs Use Tokens

  • Tokenization: When a user inputs text, the LLM breaks this text into tokens before processing. This is known as tokenization. For instance, the sentence:

    I heard a dog bark loudly at a cat
    

    could be tokenized as: ["I", "heard", "a", "dog", "bark", "loudly", "at", "a", "cat"], with each distinct word assigned its own token ID. The text can then be represented as a sequence of numbers, e.g. [1, 2, 3, 4, 5, 6, 7, 3, 8], where the repeated word "a" maps to the same ID (3) both times.

    Example 1

    Sentence:

    "I love AI."

    Tokenization (GPT-style, subword-based):

    ["I", " love", " AI", "."]
    

    ➡️ 4 tokens

    Example 2

    Sentence:

    "ChatGPT is awesome!"

    Tokens:

    ["Chat", "G", "PT", " is", " awesome", "!"]
    

    ➡️ 6 tokens

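    (Notice how "ChatGPT" is split into three tokens. The exact split depends on the tokenizer, so a different GPT tokenizer may break the word up differently.)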

    Example 3

    Sentence:

    "Learning artificial intelligence is fun."

    Tokens:

    ["Learning", " artificial", " intelligence", " is", " fun", "."]
    

    ➡️ 6 tokens
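The word-to-ID mapping described at the start of this section can be sketched in a few lines of Python. This is a toy word-level tokenizer, not how GPT models actually tokenize (they use subword vocabularies learned from data). The function name `build_vocab_and_encode` and the choice to number words from 1 in first-seen order are assumptions made purely for illustration.

```python
def build_vocab_and_encode(sentence):
    """Split on whitespace and give each distinct word an ID, starting at 1."""
    vocab = {}       # word -> token ID, in first-seen order (toy scheme)
    token_ids = []
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1
        token_ids.append(vocab[word])
    return vocab, token_ids


vocab, token_ids = build_vocab_and_encode("I heard a dog bark loudly at a cat")
print(token_ids)  # [1, 2, 3, 4, 5, 6, 7, 3, 8]
```

The word "a" appears twice in the sentence but only once in the vocabulary, so both occurrences share the ID 3.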

Types of Tokens:

  • Word tokens: Each word is treated separately (“Hello”, “world”).

  • Subword tokens: Words are broken into meaningful parts (“unbreakable” → “un”, “break”, “able”).

  • Character tokens: Individual characters (used in some models).

  • Punctuation tokens: Marks like “!”, “,”.

  • Special tokens: Placeholders for beginnings, endings, or special features.
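To make the difference between these granularities concrete, here is a minimal sketch that splits text three ways. The subword splitter uses greedy longest-match against a tiny hand-written vocabulary (`TOY_VOCAB`, hypothetical). Real tokenizers learn their vocabularies from large text collections and will often split words differently.

```python
TOY_VOCAB = {"un", "break", "able"}  # hypothetical, hand-picked for this example


def word_tokens(text):
    """Word-level: split on whitespace."""
    return text.split()


def char_tokens(text):
    """Character-level: one token per character."""
    return list(text)


def subword_tokens(word, vocab=TOY_VOCAB):
    """Greedy longest-match: take the longest vocab entry that starts the remainder."""
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocab entry matches here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(word_tokens("Hello world"))     # ['Hello', 'world']
print(subword_tokens("unbreakable"))  # ['un', 'break', 'able']
print(char_tokens("AI"))              # ['A', 'I']
```

The single-character fallback is one simple way to guarantee that any input can be tokenized, even when a word is missing from the vocabulary.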

Example in Practice:

  • Sentence: "Hello, world!"

  • Tokens: ["Hello", ",", " world", "!"] (with GPT and similar models, the space before a word is usually folded into that word's token, as in " world", while punctuation marks become tokens of their own).

Another example: Wayne Gretzky’s “You miss 100% of the shots you don’t take” is split into 11 tokens. In English, a token is roughly four characters or three-quarters of a word, but the rule varies by language.
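The rule of thumb above translates directly into a rough estimator. This is a sketch for English text only. `estimate_tokens` is a hypothetical helper, and for exact counts you would run the actual tokenizer of the model you are using.

```python
def estimate_tokens(text):
    """Return two rough token estimates for English text.

    by_chars: about 4 characters per token
    by_words: a token is about 3/4 of a word, so tokens are about words / 0.75
    """
    by_chars = round(len(text) / 4)
    by_words = round(len(text.split()) / 0.75)
    return by_chars, by_words


quote = "You miss 100% of the shots you don't take"
by_chars, by_words = estimate_tokens(quote)
print(by_chars, by_words)  # 10 12
```

The two estimates bracket the 11 tokens quoted above, which is about as much accuracy as a rule of thumb can promise.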

Why Tokens Matter for You

  1. Understand Limits

    • Every AI model (like ChatGPT) has a limit on how much text it can handle at once, measured in tokens (not just words).

    • Example: GPT-4 Turbo can handle ~128,000 tokens (roughly 96,000 English words at three-quarters of a word per token). If you paste more than that, it won’t fit.

  2. Cost Awareness

    • If you use AI tools that charge by tokens, your bill depends on the number of tokens (input + output).

    • Example: A short message = ~10 tokens; a long article = thousands of tokens.

  3. Better Prompts

    • Knowing tokens helps you write concise prompts. Long, repetitive instructions = more tokens (costly + slower).

    • Clear, short prompts = fewer tokens, faster response.

  4. Copy-Paste Planning

    • If you want to paste a whole PDF or long article into ChatGPT, token limits decide how much text can fit.

    • Sometimes you’ll need to split the text into chunks.
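The four points above come down to simple arithmetic on token counts. The sketch below checks a token count against a limit, estimates a bill, and splits long text into chunks. All names and numbers are assumptions for illustration: `CONTEXT_LIMIT` reuses the ~128,000 figure from the example above, the prices are placeholders rather than any provider's real rates, and token counts use the rough 4-characters-per-token rule instead of a real tokenizer.

```python
CONTEXT_LIMIT = 128_000       # tokens; assumed, taken from the example above
PRICE_PER_1K_INPUT = 0.01     # placeholder price, not a real rate
PRICE_PER_1K_OUTPUT = 0.03    # placeholder price, not a real rate
CHARS_PER_TOKEN = 4           # rough rule of thumb for English


def rough_token_count(text):
    """Approximate token count from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def fits_in_limit(token_count, limit=CONTEXT_LIMIT):
    return token_count <= limit


def estimate_cost(input_tokens, output_tokens):
    """Bill = input tokens + output tokens, each priced per 1,000 tokens."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT


def split_into_chunks(text, max_tokens):
    """Split on word boundaries so each chunk fits the rough token budget.

    A single word longer than the budget becomes its own oversized chunk.
    """
    max_chars = max_tokens * CHARS_PER_TOKEN
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_into_chunks("aaaa bbbb cccc dddd", max_tokens=3)
print(chunks)  # ['aaaa bbbb', 'cccc dddd']
print(fits_in_limit(rough_token_count("token " * 1000)))  # True
```

Splitting on word boundaries keeps each chunk readable. For real documents you would more likely split on paragraphs or sections so that each chunk still makes sense on its own.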


Instructor

Pijush Saha

Pijush Saha is a digital marketing consultant, coach, and former Google employee. He has worked in digital marketing for 12 years, predominantly in performance marketing, including SEO, media buying, and web analytics.