Basics
To understand what a prompt is and how it influences the response of a large language model, one has to go back to the basics of machine learning. An LLM is nothing more than a very big machine learning model with an input and an output process attached. The input process transforms words and their positions in a sentence into a mathematical construct, and the output process transforms a mathematical construct back into words. All of this is transparent to the end user.
Basics of Talking
We have all had that moment of searching our mind for the next word that fits what we are saying. Auto-regressive language models do essentially the same thing.
The model (we call these models auto-regressive) does just one thing: it predicts the next word (actually the next token) that fits a sequence of words. Take the very simple example "The earth is". There are many possible words that can come after "is": [old, round, flat, blue, big, ...]. In order to select the appropriate next word we need a bit more context. If the sentence were "Many believe the earth is", then "flat" may be more probable than "round", whereas in the sentence "It is a known fact that the earth is" the word "round" would be more probable. Considering the dictionary [old, round, flat, blue, big, ...], we can assign a probability to each word every time. Large language models do this for all tokens in their vocabulary, and they do it efficiently by having learnt the underlying mathematical function from reading a lot of text.
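To make this concrete, here is a tiny sketch with made-up probabilities for the toy dictionary above; a real model computes such a distribution over its entire vocabulary and learns the numbers from data rather than having them hard-coded.

```python
# Made-up next-word probabilities for the toy dictionary, illustrating
# how the surrounding context shifts the distribution.
next_word_probs = {
    "The earth is": {
        "old": 0.15, "round": 0.35, "flat": 0.10, "blue": 0.25, "big": 0.15,
    },
    "Many believe the earth is": {
        "old": 0.05, "round": 0.20, "flat": 0.60, "blue": 0.05, "big": 0.10,
    },
    "It is a known fact that the earth is": {
        "old": 0.03, "round": 0.80, "flat": 0.02, "blue": 0.10, "big": 0.05,
    },
}

for context, probs in next_word_probs.items():
    best = max(probs, key=probs.get)
    print(f"{context!r} -> most probable next word: {best!r} ({probs[best]:.0%})")
```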
Large language models
Large language models are based on the transformer architecture, a very big neural network composed of a repeated stack of layers made deep and wide, with multi-head attention added. Despite the fancy naming, it is, for those who know, mostly a sequence of multiplications of multi-dimensional tensors. Non-linearities do exist, such as normalization and sigmoid-like activation functions, but that does not change the essence: it is a sequence of transformations applied to the input.
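As a rough illustration of the "sequence of multiplications" claim, here is a toy single-head self-attention computation; the shapes and weights are random placeholders, not a real trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head self-attention: almost entirely matrix multiplications.
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)

x   = rng.normal(size=(seq_len, d_model))   # token vectors for a short sequence
W_q = rng.normal(size=(d_model, d_model))   # "learned" projection matrices (random here)
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores  = Q @ K.T / np.sqrt(d_model)        # how much each token attends to the others
attn    = softmax(scores)                   # the only non-linearity in this sketch
output  = attn @ V                          # weighted mix of the value vectors

print(output.shape)  # (4, 8): one transformed vector per input token
```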
All machine learning models and algorithms need their input transformed in some magical way into vectors (i.e. just mathematical jargon for a list of numbers with some operations, such as addition, defined on them). Vectors are often mistaken for arrows, just because arrows can be represented as vectors. There are many ways to transform anything into a sequence of numbers. For example, a bucket of a certain type can be described by the diameter of its top ring, the diameter of its bottom, and its height. One can imagine encoding other characteristics of the bucket, such as color, material, and coating, as numbers too. As long as the same position in the list is always used to represent the same characteristic, everything is going to work out nicely.
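A sketch of the bucket example, with made-up attributes and numeric codes:

```python
# A made-up encoding of a bucket as a vector: each position always holds the
# same characteristic (top diameter cm, bottom diameter cm, height cm,
# color code, material code, coating code).
COLOR    = {"red": 0, "blue": 1, "green": 2}
MATERIAL = {"plastic": 0, "metal": 1}
COATING  = {"none": 0, "galvanized": 1}

def encode_bucket(top_cm, bottom_cm, height_cm, color, material, coating):
    return [top_cm, bottom_cm, height_cm,
            COLOR[color], MATERIAL[material], COATING[coating]]

print(encode_bucket(30.0, 22.0, 35.0, "blue", "metal", "galvanized"))
# [30.0, 22.0, 35.0, 1, 1, 1]
```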
Word transformation
We need a way to encode a sentence differently when it occurs in one position rather than another. The magic that turns words and sentences into vectors is called a tokenizer. Coupled with positional encoding, it captures the information needed to discriminate between tokens at different positions of a sentence. A tokenizer breaks words apart into smaller pieces. We humans learn such pieces as children; they are called phonemes. Computers, having much larger memory capacity, can store a much larger set of tokens. A token is not necessarily an entire word, although it can be. Tokenizers are a bit of a secret, and many LLMs do not publish theirs, because getting a great and efficient tokenizer is half of the work needed to have a successful LLM in production.
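As a toy illustration, a tokenizer can be thought of as a greedy longest-match against a fixed vocabulary. The vocabulary below is made up; real tokenizers, such as byte-pair encoders, learn theirs from data.

```python
# Toy tokenizer: greedy longest-match against a small, made-up vocabulary.
VOCAB = ["un", "believ", "able", "token", "izer", "s", " ", "a"]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # take the longest vocabulary entry that matches at position i
        match = max((t for t in VOCAB if text.startswith(t, i)), key=len, default=None)
        if match is None:
            match = text[i]          # fall back to a single character
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("unbelievable tokenizers"))
# ['un', 'believ', 'able', ' ', 'token', 'izer', 's']
```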
Once the input is transformed into a vector, the rest is standard machine learning and will not be detailed further here. However, it is important to keep in mind that position matters: where instructions and other elements appear in a prompt makes a difference. Truly, it can make or break a complex prompt, for example when deciding where to place a persona.
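One well-known way to inject position, used in the original transformer paper, is to add sinusoidal position vectors to the token vectors, so the same token looks like a different vector at different positions. A small sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Even dimensions get sin, odd dimensions get cos, at different frequencies.
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i   = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
# Adding pe[3] to a token's embedding makes "the same word at position 3"
# a different vector from "the same word at position 0".
print(pe.shape)  # (6, 8)
```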
Word Production
As we said, a prompt, i.e. the input to the LLM, is turned into a vector. This vector is the starting point from which the LLM produces output. We will focus mostly on GPT-style models for now, as they seem to be winning the AI race.
In its most basic form, a GPT model produces an output vector containing, for each token in the vocabulary, the probability of it being the next token given the input. Each time, the chosen token is appended to the input and another token is produced. So how do we select the next token based on these probabilities? There are a number of sampling techniques, such as top-p (nucleus) sampling, and the most common parameters exposed in production LLMs control this final sampling step alone.
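A minimal sketch of this loop, assuming a hypothetical next_token_probabilities function that stands in for the model and returns a token-to-probability mapping:

```python
import random

def generate(prompt_tokens, next_token_probabilities, max_new_tokens=20, eos_token=None):
    # Auto-regressive loop: ask the model for a distribution over the vocabulary,
    # pick a token, append it, and repeat with the extended sequence.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probabilities(tokens)          # {token: probability}
        candidates, weights = zip(*probs.items())
        next_token = random.choices(candidates, weights=weights, k=1)[0]
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens
```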
The most influential parameters for most tasks are top-p and temperature. Modifying top-p controls how much creativity is permitted, whereas randomness in the output is controlled via the temperature parameter. A straightforward observation is that lower values produce safer output; this is not entirely correct, but it is good enough as a generic rule of thumb.
These parameters reflect the two-step approach to selecting the next word in the sequence. First a probability distribution over all tokens in the vocabulary is computed, then the probabilities are sharpened or flattened according to the temperature. Finally, top-p is used to select the top candidates, and the next word is sampled randomly from among them.
Temperature
The AI model generates a probability distribution over the vocabulary for each token in the sequence. The temperature parameter is used to adjust this distribution before sampling the next token. Here's how it works:
Higher temperature (e.g., 0.8): the probability distribution becomes "flatter", meaning that the differences between the probabilities of the various tokens are reduced. This leads to more randomness in the selection, and the model is more likely to choose less probable tokens. As a result, the generated text tends to be more creative and diverse, but potentially less coherent.

Lower temperature (e.g., 0.2): the probability distribution becomes "sharper", meaning that the differences between the probabilities of the various tokens are increased. This makes the model more likely to choose the most probable tokens. The generated text is more focused, conservative, and coherent, but it may lack creativity.
In summary, the temperature parameter adjusts the balance between exploration (creativity and diversity) and exploitation (consistency and coherence) in the text generation mechanism, by modifying the probability distribution used for sampling the next token in the sequence.
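A common way to realize this (a minimal sketch with made-up logits, i.e. the raw scores the model produces before they become probabilities) is to divide the logits by the temperature before applying the softmax:

```python
import numpy as np

def apply_temperature(logits, temperature):
    # Divide the raw scores by the temperature before the softmax:
    # T > 1 flattens the distribution, T < 1 sharpens it.
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.2, -1.0])   # made-up scores for a 4-token vocabulary
for t in (0.2, 0.8, 1.5):
    print(t, np.round(apply_temperature(logits, t), 3))
# Low temperature concentrates almost all mass on the top token;
# high temperature spreads it out over the alternatives.
```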
Top-p
A more detailed explanation of top-p (nucleus) sampling follows. First, the probabilities for all tokens in the large language model's vocabulary are computed. The tokens are then sorted in descending order of probability, so the most probable token comes first, followed by the second most probable token, and so on. The cumulative probability is computed by adding one token at a time in that sorted order. When the cumulative probability passes the value of top-p, the tokens included so far form the nucleus from which we finally sample. For example, if p = 0.9, the nucleus includes the most probable tokens whose combined probability just reaches 90%. The next token in the generated sequence is then randomly sampled from the tokens in the nucleus, with the probability of each token still proportional to its original value. Top-p sampling ensures that only tokens within the nucleus have a chance of being selected.
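A minimal sketch of the procedure, with a toy vocabulary and made-up probabilities:

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    # Sort tokens by probability (descending), keep the smallest set whose
    # cumulative probability reaches p, renormalize, and sample from it.
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1      # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)

vocab = ["round", "flat", "blue", "old", "big"]       # toy vocabulary, made-up probabilities
probs = [0.55, 0.25, 0.12, 0.05, 0.03]
print(vocab[top_p_sample(probs, p=0.9)])              # samples only from round, flat, blue
```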