If you've been on the internet at least once over the last 4 years, you're probably familiar with ChatGPT. As of February 2026, ChatGPT boasts 900 million weekly active users, making it by far the most popular AI chatbot in the world. The model's scale goes beyond just the breadth of its audience: the latest iteration of the LLM likely contains well over one trillion parameters. But how many of those 900 million users know what a parameter means, or how we even ended up with a parameter count that's on the same order as the number of cells in the human body?

Understanding the building blocks of ChatGPT, from the parameters (essentially adjustable weights) that make up the model to the way in which OpenAI, the company that made ChatGPT, has grown (and shrunk) over the last decade, will make you appreciate just how much goes into a single query of the model.


The Blocks of ChatGPT

All content written, coded, and arranged by Nikhil Chinchalkar
No Generative AI was used in the making of this project
Apr. 25 2026

GPT-1: 117m parameters

OpenAI, the company that produces what's now known as ChatGPT, made their first iteration of the model in June 2018, under the name GPT-1. It featured 117 million parameters. If one pixel represented 5,000 parameters, those 117 million parameters would take up this much space:

117 million parameters

Before actually going into the details of what these parameters do, we'll need to better define the model itself. G, P, and T aren't arbitrarily chosen letters. They stand for Generative Pre-trained Transformer, a type of model architecture that underlies nearly every modern chatbot. The first term in the name is fairly straightforward: the model is generative, meaning it repeatedly generates sets of characters, called tokens. Pre-trained just means the model is trained on some initial data before it gets deployed. For GPT-1 that data was BookCorpus, a 4.5 gigabyte dataset consisting of around 7,000 books, or about 985 million words, double the amount the average human will say in their lifetime.

To consider a physical representation of that dataset, let's imagine those books were stacked on top of each other. Note that this is purely for illustration: the books were never printed. Instead, they were uploaded by "indie" authors to a website (Smashwords), and scraped by researchers without the consent of the authors. But if those books had indeed been printed, let's conservatively assume that each physical copy had a thickness of exactly 1 inch. If we stacked all 7,000 books flat, one on top of another, we'd get a stack 7,000 inches, or about 583.3 feet, tall: a tower taller than the Washington Monument.
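The arithmetic behind that tower is short enough to check directly. This is a minimal sketch using the figures from the text (about 7,000 books, 1 inch each); the monument height of roughly 555 feet is an approximate figure added here only for comparison.

```python
# Figures from the text: about 7,000 books, each assumed to be 1 inch thick.
NUM_BOOKS = 7_000
BOOK_THICKNESS_IN = 1.0
INCHES_PER_FOOT = 12

# Approximate height of the Washington Monument, used only for comparison.
WASHINGTON_MONUMENT_FT = 555


def stack_height_feet(num_books, thickness_in):
    """Height of a stack of identical books, converted from inches to feet."""
    return num_books * thickness_in / INCHES_PER_FOOT


height_ft = stack_height_feet(NUM_BOOKS, BOOK_THICKNESS_IN)
print(f"Stack height: {height_ft:.1f} ft")
print(f"Taller than the monument: {height_ft > WASHINGTON_MONUMENT_FT}")
```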

A comparison in size between the Washington Monument and the BookCorpus dataset. Image by Nikhil Chinchalkar

GPT-2: 1.5b parameters

The biggest difference between GPT-1 and its successor GPT-2 was the increase in model parameters and the size of the training dataset. Both scaled by a factor of roughly 10.

1.5 billion parameters

The uncreative name change was a signal that much of the model architecture remained the same, that model architecture being the Transformer. Despite OpenAI being on the frontier of chatbot development in the late 2010s, it was researchers at their now-rival company Google who developed the Transformer model structure, in a 2017 paper titled "Attention Is All You Need." While the title is an apt reference to the power of attention (a concept we will explain later), it really isn't all you need. There are other components of the model that play key roles.

Fundamentally, a transformer consists of four main model layers: (1) a tokenizer, (2) an embedding layer, (3) a transformer layer, and (4) an un-embedding layer. While just four layers may seem unreasonably simple, within each of those main layers are several other complex building-block components that form the larger, main layer. Each of those building blocks is itself represented by millions (or sometimes billions) of model parameters, which are essentially numerical weights that the model can adjust to obtain the best result. For a GPT model, that result is the best answer to a particular question from the user.

GPT-3: 175b parameters

After GPT-2 came GPT-3, with 175 billion parameters, illustrating just how complex these layers really are.

175 billion parameters

The first of these layers is the aforementioned tokenizer. All this layer does is convert a string of text to a sequence of integers. That is, it breaks words down into small components (of around 4 characters) called tokens, and maps each token 1-to-1 to an integer. For GPT-1, a total of 40,478 unique tokens were included in the model vocabulary. As an example of this mapping, the token "cornell" is mapped to the integer 30,622, and the token "uni" is mapped to 1,822. Note that the integer values themselves are arbitrary: 30,622 has no significance other than being the token identifier for "cornell." Because the mapping is 1-to-1, the integers can be turned back into readable text to form the model's output.
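To make the text-to-integer round trip concrete, here is a minimal sketch of a toy tokenizer. It is not how GPT's tokenizer is actually implemented (real GPT models use byte-pair encoding); it uses a simple greedy longest-match rule instead. Only the "cornell" and "uni" identifiers come from the text above; the other vocabulary entries and the function names are made up for illustration.

```python
# Hypothetical toy vocabulary. Only "cornell" -> 30622 and "uni" -> 1822 come
# from the article; the remaining entries are invented for this example.
TOY_VOCAB = {
    "cornell": 30622,
    "uni": 1822,
    "versity": 7001,
    " ": 5,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def encode(text):
    """Greedy longest-match tokenization (a simplification of real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until one matches.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def decode(ids):
    """Invert the 1-to-1 mapping to recover readable text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode("cornell university")
print(ids)
print(decode(ids))
```

Because every token has exactly one integer and vice versa, decoding the encoded sequence returns the original string.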

You can actually see how later GPT models break down text into tokens with OpenAI's tokenizer, though they might be different than the ones described here, since the number of unique tokens has significantly increased from GPT-1 to GPT-4:

The number of unique tokens in each iteration of the GPT model. Image by Nikhil Chinchalkar

Tokens are also how OpenAI computes model costs: as of April 2026, the standard GPT-5.4 API charges $2.50 for every 1,000,000 tokens fed into the model, and $15.00 for every 1,000,000 tokens of output that the model produces.
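As a rough illustration of what those prices mean for a single request, the sketch below applies the two per-million rates quoted above. The token counts in the example (1,200 in, 400 out) are hypothetical, and actual billing may include details not covered here.

```python
# Per-million-token prices quoted in the text (April 2026, standard GPT-5.4 API).
INPUT_PRICE_PER_MILLION = 2.50
OUTPUT_PRICE_PER_MILLION = 15.00


def request_cost(input_tokens, output_tokens):
    """Dollar cost of one request, given its input and output token counts."""
    total = (input_tokens * INPUT_PRICE_PER_MILLION
             + output_tokens * OUTPUT_PRICE_PER_MILLION)
    return total / 1_000_000


# Hypothetical request: a 1,200-token prompt and a 400-token answer.
cost = request_cost(input_tokens=1_200, output_tokens=400)
print(f"${cost:.4f}")
```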

The sole purpose of the model ties back to tokens, too. Fundamentally, ChatGPT's goal is to predict the next token again and again until the output terminates, forming a complete response to a user query. All of the layers that follow this tokenization are developed with that goal in mind.

Once tokens are created, the model then creates an embedding vector for each one. This embedding vector, which is just a list of numbers, represents some information about the token in a higher dimensional space. Truthfully, the amount of information humans can glean from these embedding vectors is relatively little compared to what the vectors represent for the models, but consider the following example for some basic intuition. If we consider the embedding vectors for King, Queen, Man, and Woman plotted in a two-dimensional space, it might look something like this:

A sample embedding space for the vectors King, Queen, Man, and Woman. Image by Nikhil Chinchalkar, based on "Metaconcepts: Isolating Context in Word Embeddings."

The important thing to note here is how the numerical vectors represent semantic meaning about each word. For instance, the subtraction of Man from the King vector produces a vector that represents the quality of being royalty. If you added that same vector to the vector for Woman, it would produce the vector for Queen. While this example is only in 2 dimensions, the same principle applies to higher dimensional embedding spaces. For GPT-3 that meant 12,288 dimensions for each token embedding.
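The King, Queen, Man, Woman arithmetic can be tried with toy numbers. In this sketch the 2-dimensional vectors are invented so that the arithmetic works out exactly; real embeddings are learned, have thousands of dimensions, and only approximately satisfy relationships like this one.

```python
# Hypothetical 2-D embeddings: axis 0 loosely means "royalty", axis 1 "gender".
# The numbers are invented for illustration, not taken from any real model.
EMBEDDINGS = {
    "man":   [1.0, 1.0],
    "woman": [1.0, 3.0],
    "king":  [3.0, 1.0],
    "queen": [3.0, 3.0],
}


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def nearest_word(vector):
    """Return the vocabulary word whose embedding is closest to `vector`."""
    def squared_distance(word):
        return sum((x - y) ** 2 for x, y in zip(EMBEDDINGS[word], vector))
    return min(EMBEDDINGS, key=squared_distance)


royalty = subtract(EMBEDDINGS["king"], EMBEDDINGS["man"])  # King - Man
result = add(EMBEDDINGS["woman"], royalty)                 # Woman + royalty
print(royalty, result, nearest_word(result))
```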

While the embedding vector might capture some numerical representation of semantic meaning, it does not include any information about the location of a token in an input. For that, transformers use positional embeddings, which are vectors with the same dimensionality as the token embeddings, only instead of representing semantics, they represent the positions of particular tokens. To combine the two vectors, the token embedding and the positional embedding, they are simply added together. Note that a clean way of representing all of these embedding vectors is by placing them in a matrix, which is essentially a list of vectors. The formation of this embedding matrix completes the embedding layer of the model.
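The embedding layer described above amounts to two table lookups and an element-wise addition. The sketch below assumes 4-dimensional vectors and made-up table values; in a real model both tables are learned during training and are far larger.

```python
# Hypothetical lookup tables with invented values; real models learn these.
TOKEN_EMBEDDINGS = {
    30622: [0.50, 0.10, 0.00, 0.20],  # "cornell"
    1822:  [0.10, 0.20, 0.30, 0.40],  # "uni"
}
POSITION_EMBEDDINGS = [
    [0.01, 0.00, 0.00, 0.00],  # position 0
    [0.00, 0.01, 0.00, 0.00],  # position 1
    [0.00, 0.00, 0.01, 0.00],  # position 2
]


def embed(token_ids):
    """Build the embedding matrix: one row (vector) per input token."""
    matrix = []
    for position, token_id in enumerate(token_ids):
        token_vec = TOKEN_EMBEDDINGS[token_id]
        position_vec = POSITION_EMBEDDINGS[position]
        # Token meaning and token position are combined by simple addition.
        matrix.append([t + p for t, p in zip(token_vec, position_vec)])
    return matrix


embedding_matrix = embed([30622, 1822])
for row in embedding_matrix:
    print(row)
```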

With the vector representations of tokens in the embedding matrix, the model then proceeds to the transformer layer, the most complicated part of the model architecture. To understand the need for this layer, consider the now classic example of the difficulty for machines to understand text. In the sentence "the animal didn't cross the street because it was too tired," what does the word "it" refer to? For humans, it obviously refers to the animal, since a street cannot ever be tired. But that comprehension is considerably more difficult for an LLM. Hence the need for an attention layer (within the transformer layer), which allows token embeddings to communicate with one another to understand context.

Within an attention layer, or head, there are three weight matrices: a query matrix, a key matrix, and a value matrix. Each of those matrices is used in tandem with the token embedding matrix to communicate across tokens. Specifically, each token embedding vector is multiplied by the query, key, and value matrices to produce query, key, and value vectors for that specific token. Though it's never this clean, you might think of each query vector as representing some question that the token is asking about the surrounding text: "I am token 'uni', in position 8, and I am looking for adjective tokens that come before me." A key vector might then encode "I am an adjective token at position 6" as an "answer" to that query. The value vectors are a bit separate: they might represent the actual content or meaning of a token. Each query vector (one for every token) gets multiplied by every key vector (a dot product), resulting in a sort of "communication" matrix with high values representing lots of communication, or attention.
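The matrix products in that paragraph can be sketched with plain Python lists. Everything numeric here is hypothetical: three tokens, 2-dimensional embeddings, and hand-picked weight matrices chosen only so the results are easy to verify by hand.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# Hypothetical embedding matrix: 3 tokens, 2 dimensions each.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Hypothetical weight matrices; in a real model these are learned parameters.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]

Q = matmul(X, W_Q)  # one query vector per token
K = matmul(X, W_K)  # one key vector per token
V = matmul(X, W_V)  # one value vector per token

# Entry [i][j] is the dot product of token i's query with token j's key.
scores = matmul(Q, transpose(K))
for row in scores:
    print(row)
```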

A way of visualizing this attention head up to this point might look like this:

Two possible attention distributions between sets of keys (K) and queries (Q). Image by Nikhil Chinchalkar, based on "Transformer: A Novel Neural Network Architecture for Language Understanding."

Each diagram represents the attention from the word "it" to other words in the sentence (in this example, for interpretability, tokens are considered interchangeable with words). Each line represents the value of the normalized query-key dot product between each key in the sentence and the query for "it." The normalization in this case just scales down the matrix values and converts them into probabilities with a "softmax" function. Higher values are represented with darker shades, indicating that in the example on the left, the word "it" has a stronger connection to the word "animal," while on the right, the word "it" has a stronger connection to the word "street," correctly reflecting what "it" actually represents in both sentences.
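The normalization step can be sketched as follows: divide each score by the square root of the key dimensionality, then apply softmax so every row sums to 1. The raw scores for "it" below are invented, and the key dimensionality of 4 is an assumption made for the example.

```python
import math


def softmax(row):
    """Convert a list of scores into probabilities that sum to 1."""
    largest = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - largest) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, d_k):
    """Scale each score by sqrt(d_k), then normalize each row with softmax."""
    scale = math.sqrt(d_k)
    return [softmax([s / scale for s in row]) for row in scores]


# Hypothetical raw query-key scores for the query "it" against three keys.
KEYS = ["animal", "street", "tired"]
raw_scores = [[4.0, 1.0, 2.0]]

weights = attention_weights(raw_scores, d_k=4)
for key, weight in zip(KEYS, weights[0]):
    print(f"{key}: {weight:.3f}")
```

With these made-up scores, "animal" ends up with the largest share of attention, which corresponds to the darkest line in the left-hand diagram.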

Note that in the actual GPT model, the matrix is masked in a way that makes it impossible for queries from tokens earlier in the text to be "answered" by keys that occur later in the text. That mask effectively zeroes out a section (a triangle) of the "communication" matrix; in practice, the masked scores are typically set to negative infinity before the softmax, so that they come out as exactly zero afterward.
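A minimal sketch of that masking step is shown below. It assumes the common approach of replacing the masked scores with negative infinity before the softmax, which produces zeros in the upper triangle of the resulting matrix; the scores themselves are hypothetical.

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    largest = max(row)
    exps = [math.exp(value - largest) for value in row]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


def apply_causal_mask(scores):
    """Hide every key that comes after the query's own position."""
    size = len(scores)
    return [[scores[q][k] if k <= q else NEG_INF for k in range(size)]
            for q in range(size)]


# Hypothetical 3x3 score matrix (rows are queries, columns are keys).
scores = [[0.0, 1.0, 1.0],
          [1.0, 0.0, 1.0],
          [1.0, 1.0, 2.0]]

weights = [softmax(row) for row in apply_causal_mask(scores)]
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row becomes all weight on position 0, while the last token can attend to everything before it.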

The calculation does not end there though---the value vectors are yet to be used. This "communication" matrix is then multiplied by the value vectors, which represent token content, to produce the final attention matrix. As an intuition for this step, consider the above visualization. While the word "it" is associated (probabilistically) with "animal" or "street," the model needs to quantify that association, which it does by multiplying the weight of the connection (the numbers in the "communication" matrix) by the content of the associated word represented by the value vector.

Hence, values in this final matrix represent changes to be made to the original token embeddings, based on information the model learns from the surrounding context. Those changes are enacted in the next stage of the model, by simply adding the final matrix values to the original token embedding. For the animal and street example, that might mean adding a value to the embedding for "it" to now make it more closely resemble something that refers to an "animal" or a "street" in a higher dimensional space.
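The last two steps, weighting the value vectors and adding the result back onto the original embeddings, can be sketched like this. The attention weights, value vectors, and embeddings are all hypothetical numbers chosen so the output is easy to check by hand.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Hypothetical attention weights (each row sums to 1, future tokens masked).
weights = [[1.00, 0.00, 0.00],
           [0.50, 0.50, 0.00],
           [0.25, 0.25, 0.50]]

# Hypothetical value vectors and original token embeddings (3 tokens, 2 dims).
V = [[2.0, 0.0],
     [0.0, 2.0],
     [2.0, 2.0]]
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Each row of `update` is a weighted average of the value vectors.
update = matmul(weights, V)

# The update is added to the original embeddings rather than replacing them.
updated_embeddings = [[x + u for x, u in zip(x_row, u_row)]
                      for x_row, u_row in zip(X, update)]
for row in updated_embeddings:
    print(row)
```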

This whole process of attention, a pivotal concept within the transformer model, can be summarized in the following formula:

The exact attention formula found in "Attention Is All You Need."
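For reference, the scaled dot-product attention formula from that paper can be written in LaTeX as follows, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimensionality of the key vectors:

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
\]
```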

While the authors of the "Attention Is All You Need" paper designed the Transformer as a way to translate text between different languages, they had enough foresight to see the potential of the model in "question answering"---essentially what ChatGPT is today.

Having the context of what AI has become, this scaled dot-product attention formula is undoubtedly the most important math equation in the last decade.


To close the book on model architecture, in GPT models, there are actually several attention heads used simultaneously (called multi-head attention). Each attention head might learn different features about the text: one might understand information about nouns, another one adjectives, and another one verbs. In reality, though, the information each attention head represents is never as clear-cut. Those attention heads have their attention outputs all concatenated together and transformed with a linear layer back to the dimensionality of the token embeddings. You can think of a linear layer as a matrix of weight values that multiplies the inputted matrix. All together, it looks like this:

A diagram from "Attention Is All You Need" that illustrates the composition of attention and multi-head attention.
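The diagram can also be read as a few lines of code. This is a minimal sketch of multi-head attention with two heads, assuming 4-dimensional embeddings split into two 2-dimensional heads; the weight matrices are hand-picked rather than learned, and the causal mask is left out to keep the example short.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    largest = max(row)
    exps = [math.exp(value - largest) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head (no mask)."""
    d_k = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, heads, w_out):
    """`heads` is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # Concatenate each token's outputs from every head side by side.
    merged = [[value for head in head_outputs for value in head[i]]
              for i in range(len(x))]
    # A final linear layer mixes the information from the different heads.
    return matmul(merged, w_out)


# Hypothetical weights: head 1 looks at dims 0-1, head 2 at dims 2-3.
FIRST_HALF = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
SECOND_HALF = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

x = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 0.5],
     [1.0, 1.0, 0.5, 0.5]]
heads = [(FIRST_HALF, FIRST_HALF, FIRST_HALF),
         (SECOND_HALF, SECOND_HALF, SECOND_HALF)]

output = multi_head_attention(x, heads, IDENTITY)
for row in output:
    print([round(value, 3) for value in row])
```

The output has the same shape as the input (one 4-dimensional vector per token), which is what allows it to be added back onto the embeddings.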

The output of this multi-head attention block is added to the original embeddings (and normalized), as a way to update them with additional context. Following that addition, the data is then transformed via another set of steps, known as a feed-forward layer. It consists of a linear layer that increases the dimensionality of the data (specifically to 4 times the embedding size), some activation function that transforms the linear layer's output in a non-linear way, and yet another linear layer that brings the dimensionality of the data back down. The result is again added back to the inputs of the feed-forward layer, and normalized. It is these feed-forward layers that make up the majority of the parameters within a transformer model, which now looks like this:

The transformer block, from "Attention Is All You Need," and its corresponding Feed Forward Network.

The combination of all of that (the multi-head attention, addition and normalization, feed-forward, and addition and normalization again) makes up just a single component: a transformer block within the model (also called a decoder layer).
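The feed-forward half of a transformer block, including the add-and-normalize step, can be sketched for a single token vector. Several choices here are assumptions for illustration: ReLU stands in for "some activation function," the layer normalization has no learned scale or shift, the embedding size is 4, and the weights are random rather than trained.

```python
import math
import random


def linear(vector, weights, bias):
    """Multiply a vector by a (d_in x d_out) weight matrix and add a bias."""
    return [sum(v * w for v, w in zip(vector, column)) + b
            for column, b in zip(zip(*weights), bias)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def layer_norm(vector, eps=1e-5):
    """Rescale a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def feed_forward_block(x, w_up, b_up, w_down, b_down):
    hidden = relu(linear(x, w_up, b_up))              # d -> 4d, non-linearity
    projected = linear(hidden, w_down, b_down)        # 4d -> d
    residual = [a + b for a, b in zip(x, projected)]  # add back the input
    return layer_norm(residual)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


D_MODEL = 4  # hypothetical embedding size
rng = random.Random(0)
w_up = random_matrix(D_MODEL, 4 * D_MODEL, rng)
w_down = random_matrix(4 * D_MODEL, D_MODEL, rng)
b_up = [0.0] * (4 * D_MODEL)
b_down = [0.0] * D_MODEL

token_vector = [0.2, -0.1, 0.4, 0.3]
output = feed_forward_block(token_vector, w_up, b_up, w_down, b_down)

# Two weight matrices of size d x 4d, plus the two bias vectors.
parameter_count = 2 * D_MODEL * 4 * D_MODEL + 4 * D_MODEL + D_MODEL
print(len(output), parameter_count)
```

Because the two weight matrices each hold 4 times the square of the embedding size, their parameter count grows quickly as the embedding size increases.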

Each transformer block feeds into another, before the result is converted back to text via the un-embedding layer, which applies a linear transformation to the output of the final transformer block and takes the softmax of the result to produce a probability distribution over all tokens in the model's vocabulary. For GPT-1, this entire process could be summarized in the following diagram:

The full GPT-1 model architecture. Marxav, Wikipedia
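The final step of that diagram, turning the last block's output into a probability for every token in the vocabulary, can be sketched as a matrix multiplication followed by softmax. The 4-token vocabulary, the 3-dimensional hidden vector, and the weights below are all made up for illustration.

```python
import math


def softmax(row):
    largest = max(row)
    exps = [math.exp(value - largest) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 4-token vocabulary and (d_model x vocab_size) weight matrix.
VOCAB = ["animal", "street", "tired", "."]
W_UNEMBED = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [0.5, 0.0, 1.0, 0.0],
]

# Hypothetical output of the final transformer block for the last position.
final_hidden = [0.2, -0.4, 1.5]


def unembed(hidden, weights):
    """Linear transformation to one score per vocabulary token, then softmax."""
    logits = [sum(h * w for h, w in zip(hidden, column))
              for column in zip(*weights)]
    return softmax(logits)


probabilities = unembed(final_hidden, W_UNEMBED)
for token, probability in zip(VOCAB, probabilities):
    print(f"{token!r}: {probability:.3f}")
```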

Note that dropout is a feature of neural networks that involves temporarily zeroing out a randomized set of values at various stages while training the model. The overarching goal of performing dropout is to prevent overfitting, which means the model predicts tokens that are too closely aligned to its training set (as opposed to generalizing well to new prompts). Dropout isn't unique to the Transformer, and while it's important to implement if you wanted to recreate ChatGPT from scratch, it's a minor detail for understanding the overall architecture of GPT.
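One common way to implement dropout, sketched below, is to zero out a random subset of a layer's outputs during training and scale up the survivors so the expected total stays the same ("inverted" dropout). The drop probability of 0.5 and the input values are hypothetical, and this is a general illustration rather than the exact procedure used for any GPT model.

```python
import random


def dropout(activations, drop_probability, rng, training=True):
    """Inverted dropout: zero random entries and rescale the survivors."""
    if not training or drop_probability == 0.0:
        # Dropout is only active during training; inference uses every value.
        return list(activations)
    keep_probability = 1.0 - drop_probability
    return [
        value / keep_probability if rng.random() < keep_probability else 0.0
        for value in activations
    ]


rng = random.Random(0)  # seeded so the example is repeatable
activations = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75]

during_training = dropout(activations, drop_probability=0.5, rng=rng)
at_inference = dropout(activations, drop_probability=0.5, rng=rng,
                       training=False)
print(during_training)
print(at_inference)
```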

Using the final token distribution, the model selects a token at random (weighted by the probability of its selection) as its output. For training, the model would compare the probability distribution of its predicted output to the actual output itself, then change all of its parameters to have a better chance of predicting the actual output (via a method called backpropagation).

Sample training for the sentence "That's one small step for man, one giant leap for mankind," if the tokens were words. Image by Nikhil Chinchalkar
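Both halves of that step, picking an output token and scoring a prediction during training, can be sketched briefly. The probabilities and the 4-word vocabulary are hypothetical, and cross-entropy is used here as the measure of how far the prediction was from the actual token, which is the usual choice for language models but is not named in the text above.

```python
import math
import random


def sample_token(probabilities, rng):
    """Pick a token index at random, weighted by its predicted probability."""
    indices = range(len(probabilities))
    return rng.choices(indices, weights=probabilities, k=1)[0]


def cross_entropy(probabilities, target_index):
    """Loss is small when the model gave the actual token a high probability."""
    return -math.log(probabilities[target_index])


# Hypothetical prediction for the word after "one giant".
VOCAB = ["step", "leap", "man", "mankind"]
predicted = [0.1, 0.6, 0.2, 0.1]

rng = random.Random(0)  # seeded so the example is repeatable
sampled_word = VOCAB[sample_token(predicted, rng)]

loss_if_leap = cross_entropy(predicted, VOCAB.index("leap"))
loss_if_step = cross_entropy(predicted, VOCAB.index("step"))
print(sampled_word, round(loss_if_leap, 3), round(loss_if_step, 3))
```

Training nudges the parameters in whichever direction lowers this loss, so the actual next token becomes more likely the next time a similar context appears.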

This process is repeated billions of times over the training data, where the model essentially tries to predict the value of the first token given zero tokens, then adjusts, then the value of the second token given the first token, then adjusts, then the value of the third token given the first two tokens, and so on, until the model tries to predict the value of the final token, given the maximum number of previous ones. Performing training in these steps greatly increases the number of predictions the model is able to learn from, making the LLM better.
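The way one sentence turns into many training predictions can be sketched as follows. For readability the tokens are whole words, and the function name and the context size of 3 are hypothetical.

```python
def next_token_examples(tokens, context_size):
    """Turn one token sequence into (context, target) training pairs."""
    examples = []
    for i in range(len(tokens)):
        # Use at most `context_size` tokens immediately before position i.
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples


tokens = ["That's", "one", "small", "step", "for", "man"]
examples = next_token_examples(tokens, context_size=3)
for context, target in examples:
    print(context, "->", target)
```

A six-token sentence yields six separate predictions to learn from, starting with an empty context and ending with the maximum context allowed.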

Upon deployment, the model is essentially doing the same thing, except it predicts based on the maximum context possible, not limiting itself as it did when training. Here, context is defined as a window, and is measured in tokens. It represents the amount of context a GPT model is able to use for calculations, or equivalently, the maximum length of input the model can reasonably ingest. GPT-1 started with a context size of a now measly 512 tokens, while the latest GPT-5 can take in over 1 million tokens:

The context window for each iteration of the GPT model. Image by Nikhil Chinchalkar

After predicting a single token, the model simply shifts the context window one token over, and repeats the prediction process, now trying to predict the next token. In the setting of a chatbot, that context window starts with a system prompt that gives the AI an appropriate setting from which to proceed. The system prompt is also where the model can be given a "personality" or told to avoid mentioning certain topics or words. For instance, if the user asked "What is the answer to the ultimate question of life, the universe, and everything?" that would be brought to the model prepended with an additional prompt that explains the situation:

System prompt: You are ChatGPT, a Large Language Model built by OpenAI that answers user questions in a helpful and concise way.

User prompt: What is the answer to the ultimate question of life, the universe, and everything?

From there, the model would try its best to act like a "helpful and concise" AI, and predict what that AI would respond with as its next token, possibly resulting in "42."
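The predict, append, shift loop described above can be sketched without a real model. Here `toy_next_token` is a hypothetical stand-in that simply counts upward; in ChatGPT, that function would be the entire transformer, and the starting token list would be the tokenized system prompt followed by the user prompt.

```python
def generate(prompt_ids, next_token_fn, context_size, max_new_tokens, stop_id):
    """Repeatedly predict one token and append it to the running sequence."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Only the most recent `context_size` tokens fit in the window.
        window = ids[-context_size:]
        next_id = next_token_fn(window)
        ids.append(next_id)
        if next_id == stop_id:  # a special token signals the end of the output
            break
    return ids


def toy_next_token(window):
    """Hypothetical stand-in for the model: count upward, capped at 5."""
    return min(window[-1] + 1, 5)


output_ids = generate(
    prompt_ids=[1, 2],
    next_token_fn=toy_next_token,
    context_size=3,
    max_new_tokens=10,
    stop_id=5,
)
print(output_ids)
```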


For the model to even get to this point, where it emits coherent and accurate results, its 175 billion parameters needed to be trained. For the case of GPT-3, that meant feeding 570 GB of text input into the transformer, from CommonCrawl, WebText, English Wikipedia, and two books corpora (Books1 and Books2). Such a substantial increase in size was not without controversy, as much of the dataset was scraped from webpages that did not consent to have their data used by OpenAI.

The story of Books1 and Books2 is particularly interesting. Despite OpenAI creating these datasets for training (as opposed to downloading pre-existing sources), the company left out details about the creation of the data in the paper announcing GPT-3. It was not until a class-action lawsuit that OpenAI was forced to reveal the sources of Books1 and Books2, which were pirated books from LibGen. Noteworthy was that OpenAI ended up deleting both datasets, citing non-use, in mid-2022. The lawsuit alleges foul play regarding the sudden deletion of material, months before ChatGPT would be unveiled to the public.

Specifically, plaintiffs argue that the presence of Slack channels named "excise-libgen" and "project-clear" suggests that OpenAI's apparent "non-use" was actually a cover for its need to avoid legal troubles when presenting the model. The lawsuit is still ongoing, now mired in the details of whether or not OpenAI has the right to keep the communications regarding the deletion of the Books datasets private.

Following the release of GPT-3, OpenAI continued to fine-tune the model, a process that involves Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). Supervised fine-tuning of GPT-3 meant writing prompt-answer pairs for the model to learn from. The important thing to note about this step is that the fine-tuning was supervised, meaning the generation of labels, or answers, was done by humans. Specifically, for several of the GPT models, this generation was done by workers in the Global South, who worked longer hours for lower wages. An artifact of this training was that certain words that are used more commonly in certain areas, like Nigeria, bled into the model's preferences. One of the more infamous of those words is "delve," though there exist several others.

RLHF follows a similar structure, only in this case, the model is the one generating outputs. That is, the model generates two outputs to the same prompt, and a human chooses their preferred response. You might recognize this as something ChatGPT will do somewhat randomly to your own prompts. Once it is known which response "wins out," the model parameters are slightly adjusted accordingly.

After those adjustments to GPT-3, OpenAI released GPT-3.5---ChatGPT---to the public, on November 30, 2022.

Twitter/X

Five days after that tweet, ChatGPT would have 1 million users.

Within about two months of launch, it would surpass 100 million monthly users, making it, at the time, the fastest-growing consumer internet application on record. A few weeks after that, a prompt directly from the model would grace the cover of Time magazine.

The February 27/March 6, 2023 issue of Time Magazine.

Not much more needs to be stated about the success of ChatGPT---if you've read this far, you probably understand it's the most pivotal invention of the 21st century.

GPT-4: 1.8t parameters

As Generative AI began to take over the tech world, and other companies began publishing their own models, OpenAI decided that its release of GPT-4 in March 2023 would withhold the model specs that had been shared for previous versions. That meant the model architecture, training data, and parameter counts were no longer publicly available. In their own words:

"Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."

Information leaks and estimates place the model at around 1.8 trillion parameters, 10 times the scale of GPT-3, which featured 175 billion.

1.8 trillion parameters

An increase in scale also contributes to longer training times, with estimates that it took just under 100 days of continuous model training for GPT-4 to be finalized. The footprint of such a long and arduous training process reveals itself in the water consumption of the data centers that supply the computational power for training. Generally speaking, data centers use water to cool the hardware that performs the model's calculations. That water needs to be fresh, meaning it could otherwise be consumed by a human. While some data centers cool their hardware with closed-loop systems that prevent excess water consumption, most do not, enabling large data centers to consume water on the same level as a town the size of Ithaca, NY.

Truthfully, the amount of water that a chatbot uses depends entirely on the question being asked: whether or not you count the training process, or other uses of the data center that contribute significantly to the overall water consumption. Depending on the agenda of the person you ask, they can spin AI water usage to paint whatever picture they want.

Regardless of how the calculations are done, it's clear that data centers, which are being constructed more frequently as a result of an increase in AI demand, do indeed affect the people that live near them. Notably, a datacenter in Memphis, TN, built by xAI, a competitor of OpenAI, has severely deteriorated the lives of those who live in the surrounding area, on account of immense water usage and toxic air pollutants.

As the model grew, parts of OpenAI shrank. Specifically, their AI safety team, which manages the risk associated with model development, lost half their staff in mid-2024, including leaders Ilya Sutskever and Jan Leike. At its core, the team's departure was due to a distrust of Sam Altman, the CEO of the company.

Just a few months earlier, at the request of OpenAI's board, Sutskever had compiled a 70-page document of Slack messages and H.R. documents detailing the extent of Altman's lies, especially in regards to internal safety protocols. Citing a lack of candor in his communications, the board fired Altman in November 2023.

Cutting off Altman didn't go quite as well as the board and Sutskever had planned, however: just two days later he accepted a job at Microsoft, and 95% of OpenAI's employees at the time threatened to do the same if Altman didn't come back.

With the board's hand forced, Altman was reinstated as CEO a few days after he was fired, allowing him to gain further control over the company. That control manifested itself in OpenAI catalyzing an arms race with other companies to develop the most powerful AI, with the safety of a purported "super-intelligent" system de-prioritized.

Despite the turbulence within the company, OpenAI would release their latest model, GPT-5, in August 2025.

GPT-5: ??? parameters

As with its predecessor GPT-4, OpenAI did not release official parameter counts for GPT-5, and online estimates vary from around 1 trillion to nearly 80 trillion parameters.

??? parameters

As of this article, GPT-5 is the set of models that powers the most recent versions of ChatGPT, with hundreds of millions (soon, perhaps, billions) of users. Those users are largely unaware of the steps it took to get here: the transformer model that was initially developed by researchers at Google, the terabytes of copyrighted data that each LLM is trained on, the effects of data-center pollutants on nearby areas, and the labor of those in Nigeria, helping to train the first iterations of the model for less than $2 an hour.

What was once a model architecture meant to translate text between different languages has now taken over the world, and the company that pioneered it has strayed slightly from their goal of "acting in the best interests of humanity throughout [their] development."

As AI continues to improve, keep in mind that every token that ChatGPT generates is shaped not only by the architecture of the model, but also by the humans who helped build it, for better or for worse.


Code, sources, and notes on inspiration for this project can be found in my GitHub repository.