It’s Not All Just “AI”

Apps like ChatGPT and Midjourney have brought “artificial intelligence” into the limelight… Or is it “machine learning”? And when we encounter other terms like “neural networks”, “deep learning”, “stable diffusion”, and “large language models”, it can all get a little perplexing.

We often have some idea of how these technologies can be used, but just as often we lack any kind of understanding of how everything fits together underneath the surface. In this and a follow-up article, we’re going to shoot from the hip for a quick, intuitive understanding of how these concepts and tools relate to one another.

AI vs Machine Learning

There is some confusion between “AI” and “ML”, with many people using the terms interchangeably. This confusion is understandable, especially given the dominance of ML approaches. However, the terms have distinct meanings, with one subsuming the other.

Artificial intelligence is the most generic top-level term for the collection of approaches to simulating/automating intelligent behavior. As such, AI encompasses all traditional approaches for reaching said goal. Some popular and/or historical approaches are expert systems, search algorithms, fuzzy logic, and machine learning.

Machine learning, as a sub-discipline within AI, specifically involves fitting mathematical functions to input/output data to predict outputs from novel inputs. It’s a data-driven approach, made possible by the abundance of data. Said differently, machine learning involves the development of algorithms that can automatically learn patterns and relationships from large amounts of data.
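To make “fitting mathematical functions to input/output data” concrete, here is a minimal sketch (with made-up numbers) that fits a straight line to a few input/output pairs and then predicts the output for an input it has never seen:

```python
import numpy as np

# A tiny, made-up dataset of input/output pairs
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x

# "Learning" here is just fitting a degree-1 polynomial (a line) to the data
slope, intercept = np.polyfit(x, y, deg=1)

# Predict the output for a novel input the model never saw
new_x = 6.0
prediction = slope * new_x + intercept
print(f"predicted output for {new_x}: {prediction:.2f}")
```

Real machine learning systems fit far more elaborate functions to far more data, but the basic move is the same: learn the parameters from examples, then reuse them on new inputs.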

Most of the widely popular applications of AI today are achieved through the application of the subset of techniques collected together under the heading of “machine learning” – for this reason, it is not surprising that AI and ML are used interchangeably outside academic contexts.  

Machine learning plays such a big role in AI these days because its techniques leverage several related advancements of recent decades. Importantly, their matrix math is easy to program and can be accelerated with GPUs and compute clusters. Training also depends crucially on, and benefits from, the abundance of data that has become available in recent years.

In essence, AI is the broader concept, while machine learning is a subset of AI that specifically focuses on training algorithms to learn from data and improve their performance over time.

Neural Networks

When it comes to the discipline of machine learning, there exist many different mathematical approaches to “learn from data” or “fit mathematical functions to data”. One approach is to construct multiple decision trees and combine their predictions. In contrast, the “neural network” approach leverages research and insights into the structure and organization of the human brain as well as mathematical techniques like linear algebra. 

How does this actually work?

No article of this kind should be without a densely connected neural network diagram:

Beautiful. Credit: Alexander Lenail

But…How do we read this? What is it actually saying? Is it saying anything? Does it translate into code? Where’s the matrix? 

Let’s use this diagram to help connect our own dots in this complicated conceptual space.

An Algorithm

At the very highest level, you can read it as describing a flow of computation from left to right. It’s an algorithm. 

One level below that, we can say that it describes a function that takes in an array of 4 numbers on the left (the 4 empty circles), expands those into 12 numbers, reduces those to 10 numbers and finally outputs an array of 2 numbers on the right. We can see these are labeled “input layer”, “hidden layer” and “output layer” in the diagram.

If we look even closer, the first elements of the diagram we see on the left are the circles. Circles represent nodes that accept a value, but they also apply an “activation function” that transforms the value provided to them. Two common activation functions are ReLU, which keeps the value if it’s greater than 0 and returns 0 otherwise, and sigmoid, which squashes any number into a value between 0 and 1.
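For the curious, both of those activation functions are one-liners in code:

```python
import numpy as np

def relu(x):
    # Keep the value if it's greater than 0, otherwise return 0
    return np.maximum(0, x)

def sigmoid(x):
    # Squash any number into a value between 0 and 1
    return 1 / (1 + np.exp(-x))

print(relu(-2.5), relu(3.0))        # 0.0 3.0
print(sigmoid(-2.5), sigmoid(3.0))  # ~0.076 ~0.953
```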

The next thing we see, starting on the left and moving right, are the lines that connect circles on the first layer and the circles on the second (labeled “hidden”) layer. These can be thought of as weighted connections between inputs that multiply the input value by the connection’s weight as the value “passes through it”. 

Lots of things could be varied about this diagram. As well as the more obvious choices of adjusting the weights of the connections, varying the number of nodes on any of the layers, and adding hidden layers, it should be noted that not every node on a layer needs to be connected to every node on the neighboring layers. Also, nodes need not be connected only to nodes one layer away – you can even connect the output layer back to the input layer.
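To make this concrete (and to answer “where’s the matrix?”), here is a minimal sketch of a single left-to-right pass through a network shaped like the one in the diagram, using made-up random weights rather than learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random (untrained) weight matrices for each layer: 4 -> 12 -> 10 -> 2
w1 = rng.standard_normal((4, 12))
w2 = rng.standard_normal((12, 10))
w3 = rng.standard_normal((10, 2))

def relu(x):
    return np.maximum(0, x)

def forward(inputs):
    # Each layer multiplies by a weight matrix, then applies an activation
    h1 = relu(inputs @ w1)   # 4 values expand to 12
    h2 = relu(h1 @ w2)       # 12 values reduce to 10
    return h2 @ w3           # 10 values reduce to the 2 outputs

print(forward(np.array([0.5, -1.0, 2.0, 0.0])))  # an array of 2 numbers
```

Real networks typically also add a learned “bias” value at each node, which we’ve left out here for brevity.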

While all that may make sense in a vacuum, how is any of it actually useful? How could this magnificent abstraction be in any way put into practice? 

This question is particularly relevant given that machine learning algorithms operate entirely on numbers.

Tokens

Consider that a first step to, for instance, processing a paragraph of text in a ChatGPT-like LLM scenario is to convert the paragraph of text into “tokens”. You can see how ChatGPT does this using the OpenAI Tokenizer where you can look at the tokenized text and the Token IDs.

For instance, the sentence

Magritte made the statement “this is a question.”?

is broken into word and sub-word fragments, which are then converted into the following array of integer token IDs:

[34015, 99380, 1903, 279, 5224, 1054, 576, 374, 264, 3488, 2029, 30]

The idea behind these numbers is just that they are consistent so that the same fragments of sentences will result in the same sequence of numbers. This way, as the tokens are found again and again amongst the billions of training sentences, patterns can settle out of their recurring proximities to one another. These patterns are what the mathematical functions are fitting to.
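If you’d like to reproduce this outside the web tool, OpenAI’s open-source tiktoken library exposes the same encodings. A minimal sketch (the exact IDs you get depend on which encoding the target model uses):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by the GPT-3.5 / GPT-4 family of models
enc = tiktoken.get_encoding("cl100k_base")

text = 'Magritte made the statement “this is a question.”?'
token_ids = enc.encode(text)

print(token_ids)              # a list of integer token IDs
print(enc.decode(token_ids))  # decoding returns the original text
```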

These tokens are passed to the input layer (of a much more elaborate artificial neural network than what we have pictured above). In fact, tokens using this same encoding are expected to come out of the output layer of an LLM to be converted back into human-readable text. We’ll go into greater detail about how this works when we discuss LLMs specifically.

Context Windows

This already helps us to understand a common feature of tools like ChatGPT called the “context window”, which determines how large our prompt can be. We can read that GPT-4 Turbo has a context window of 128K, so it can be given 128,000 tokens (or pieces) of text at once. This context window is the size of the input array, in tokens. So, using GPT-4 Turbo, you can pass 128,000 tokens (about 100,000 words) to the input layer.
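This also makes it easy to check whether a prompt will fit before sending it: count its tokens and compare against the window. A rough sketch using tiktoken again (note that in practice the model’s reply and any chat formatting also consume tokens from the same window):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
context_window = 128_000  # GPT-4 Turbo's advertised window, in tokens

prompt = "Summarize the following design document ..."  # whatever you plan to send
n_tokens = len(enc.encode(prompt))

print(f"{n_tokens:,} tokens used, {context_window - n_tokens:,} remaining")
```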

In summary, the diagram above describes what happens to input values. The lines connecting nodes represent “weights” that are multiplied by the values flowing in from the left. The values flow through the computation, adjusted by weights and activation functions, and are expanded, combined, and reduced as they pass through the layers.

Deep Learning

Deep learning is simply an approach to machine learning that uses artificial neural networks with 2 or more hidden layers. The above diagram is technically a “deep neural network”. What’s been found is that by adding many hidden layers and playing around with those activation functions and the connections between layers, a surprising degree of order can be automatically extracted from input data. It’s more of a craft than a science at this point.

It should be acknowledged that nobody really understands why a dense interconnection of weights and activation functions, organized just so, is able – when set up correctly, but differently here and there – to extract and functionally represent complex, implicit relationships between inputs and outputs.

There’s the actual magic.

If you’ve come this far and find yourself wanting more, consider Understanding Deep Learning (free PDF).

Training & Models

Now that we’ve waded into neural networks and stuck our big toe into the deep learning current, let’s consider training and the output of training. 

First, it’s necessary to understand that training occurs at a higher level than the generic computational flow described above. That flow was simply a computation performed on an array of inputs, and, like any mathematical function, putting the same array of 4 values in will always produce the same array of 2 values out.

The computational flow we described in the previous section is the ‘model’ and, as we alluded to earlier, it can be reduced to a series of matrix multiplications with the weights and other parameters included.

On the other hand, training a model occurs at a different level and involves using one or more strategies to adjust the weights between the inputs. One way to accomplish this is to run the model over and over on training data that consists of inputs paired with known outputs, adjusting the weights to bring the actual output closer to the expected output. Each of these runs through the full training dataset is known as an “epoch”, and a model is finalized by running many epochs. This approach is called supervised learning. There’s also unsupervised learning, reinforcement learning, and more.

When it comes to supervised learning, you want to set aside some training data as validation data. This data isn’t used for training but helps test the model on unseen examples, allowing you to determine when training is sufficient. This is an important step to avoid overfitting your model to your training data. If a model is overfit to its training data it will likely not handle new data well.
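Here is a minimal sketch of that loop using PyTorch and made-up data. Each epoch runs the model over the training inputs, measures how far the outputs are from the known answers, and nudges the weights; the validation set is only ever used to check progress on unseen examples:

```python
import torch
from torch import nn

# Made-up dataset: 4 input values -> 2 output values, split into train/validation
x = torch.randn(1000, 4)
y = torch.randn(1000, 2)
x_train, y_train = x[:800], y[:800]
x_val, y_val = x[800:], y[800:]

# A small network shaped like the diagram: 4 -> 12 -> 10 -> 2
model = nn.Sequential(nn.Linear(4, 12), nn.ReLU(),
                      nn.Linear(12, 10), nn.ReLU(),
                      nn.Linear(10, 2))

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(20):                          # each full pass over the training data is an epoch
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)      # how far off are we?
    loss.backward()                              # work out how each weight should change
    optimizer.step()                             # adjust the weights

    with torch.no_grad():                        # check progress on data the model never trains on
        val_loss = loss_fn(model(x_val), y_val)
    print(f"epoch {epoch}: train {loss.item():.3f}, validation {val_loss.item():.3f}")
```

If the training loss keeps falling while the validation loss starts climbing, that’s the classic sign of overfitting.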

It should be noted that the output of the “training” process is a modified version of the model itself. The adjustments were to the weights and perhaps to other aspects of the model, such as the activation functions or other parameters. The accumulated refinements of the malleable variables of the original model are the whole tamale – it’s what’s produced by the ocean-boiling GPU clusters.

These models can’t be further flattened or simplified – they are already the flattened simplification of the complicated implicit structure we magically extracted from our training data. 

What gets delivered, as a whole, is the full sequence of transformations: values multiplied by weights and passed through activation functions, layer by layer. This is the model everyone keeps talking about (and downloading).
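Concretely, what gets shipped is the set of learned parameters plus a description of the architecture they plug into. A sketch in PyTorch, assuming a small model like the one trained above:

```python
import torch
from torch import nn

# The same architecture as the training sketch above; assume it has just been trained
model = nn.Sequential(nn.Linear(4, 12), nn.ReLU(),
                      nn.Linear(12, 10), nn.ReLU(),
                      nn.Linear(10, 2))

# Saving the learned weights -- this file is essentially "the model" being shipped
torch.save(model.state_dict(), "model.pt")

# Later, or on another machine: rebuild the architecture and load the weights back in
restored = nn.Sequential(nn.Linear(4, 12), nn.ReLU(),
                         nn.Linear(12, 10), nn.ReLU(),
                         nn.Linear(10, 2))
restored.load_state_dict(torch.load("model.pt"))
restored.eval()  # ready to run predictions
```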

Fine Tuning & LoRAs

These facts about the model imply that if you can get a pre-trained model you can train it further, since to train it is simply to adjust some of the values contained within the model. This is called fine-tuning. There are many approaches to fine-tuning and the details will depend on the kind of model you are adjusting and what you are trying to achieve. 
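As a minimal sketch of one common fine-tuning recipe, continuing the toy PyTorch model from above: load the pre-trained weights, freeze most of the network, and keep training only the final layer on new, task-specific data (made up here).

```python
import torch
from torch import nn

# Rebuild the architecture and load the pre-trained weights from the previous section
model = nn.Sequential(nn.Linear(4, 12), nn.ReLU(),
                      nn.Linear(12, 10), nn.ReLU(),
                      nn.Linear(10, 2))
model.load_state_dict(torch.load("model.pt"))

# Freeze everything except the final layer, so only its weights get adjusted
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

# Continue training on new, task-specific data (random here, real in practice)
new_x, new_y = torch.randn(100, 4), torch.randn(100, 2)
optimizer = torch.optim.SGD(model[-1].parameters(), lr=0.001)
loss_fn = nn.MSELoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(new_x), new_y)
    loss.backward()
    optimizer.step()
```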

A fine-tuned model will be roughly the same size as the original. As model architectures grow larger, this becomes a challenge: training requires holding the entire model in memory at once, which demands increasingly powerful hardware. Training modern models often happens on huge clusters of computers that are not available to the average business or hobbyist.

Low-Rank Adaptation, or LoRA, is a method that addresses these issues by storing a much smaller set of weights that can be expanded and merged into specific layers of the pre-trained model, such as the cross-attention layers. While you can use a fine-tuned model the same way you use the original pre-trained model, with LoRAs you have to apply the adapter weights after loading the model itself. LoRAs can be used with both pre-trained and fine-tuned models, although a given LoRA may target or work better with one or the other.
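The core idea is easy to sketch: instead of storing a full update to a large weight matrix W, a LoRA stores two much smaller matrices, A and B, whose product is added to W when the adapter is applied. A toy illustration with made-up sizes (not any particular library’s API):

```python
import numpy as np

d, r = 1024, 8                 # full dimension vs. the low "rank" of the adapter
W = np.random.randn(d, d)      # a pre-trained weight matrix (~1 million numbers)

# The LoRA adapter: two thin matrices, ~16K numbers instead of ~1M
B = np.zeros((d, r))           # initialized to zero so the adapter starts as a no-op
A = np.random.randn(r, d) * 0.01

# ...training adjusts only A and B, leaving W untouched...

# Applying the LoRA: expand the small matrices and merge them into the original weight
scale = 1.0                    # libraries typically expose this as alpha / r
W_adapted = W + scale * (B @ A)
```

Because only A and B need to be trained and shared, a LoRA can be a few megabytes even when the underlying model is many gigabytes.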

The cross-attention layers that are cleverly adjusted during LoRA training for Stable Diffusion are part of the transformer neural network architecture.

We’ll get into the role cross-attention plays when we discuss transformers in a future article that digs into large language models (LLMs) and Stable Diffusion image generators. With this article, we’ve laid the foundations to gain an intuitive understanding of the elements at play in the more complicated applications making waves in society today.

Jerome Meyers

Lead Engineer on the Cloud Hosted Services Team at Magnopus
