Kartik Talamadupula, Author at Symbl.ai
https://symbl.ai/developers/blog/author/kartik-talamadupula/

Can Conversational Feature Transfer in LLMs Help Detect Deception?
https://symbl.ai/developers/blog/can-conversational-feature-transfer-in-llms-help-detect-deception/
Wed, 17 Jul 2024

Introduction

Large Language Models (LLMs) have revolutionized natural language processing, demonstrating impressive capabilities in sentiment analysis and emotion detection. However, the way most LLMs learn and interpret language differs significantly from human language acquisition. This discrepancy raises an important question: do LLMs trained with multimodal features and across different forms of data, such as conversation and text, effectively utilize those features when processing data from a single modality? To answer this, we set up experiments to compare general LLMs against LLMs that are specialized in multimodal data. Specifically, we compare the general Llama-2-70B model against a version of that model that is fine-tuned on human conversation data (Llama-2-70B-conversation).

Human communication and conversation are inherently multimodal, involving both verbal and non-verbal cues. We learn to interpret conversational communication first, including intonation and modulation, and then transfer those skills to written communication. Additionally, the conversation modality encodes distinct differences from other text data, such as turn-taking, context dependency, dialog and speech acts, and real-time interaction. The question of whether the skills needed to excel at different modalities (text vs. conversation) are transferable across those modalities is what we aim to explore in our new research paper, currently under submission at the Association for Computational Linguistics’ rolling review cycle.

To test our hypothesis, we pick one of the most challenging use cases in conversation understanding and NLP in general: deceptive communication. This includes sarcasm, irony, and condescension, and serves as an illustrative test case for multimodal feature transfer. These forms of covert deception are challenging to detect in text representations of media, as they often rely on multi-turn modulation and prosody changes that are absent from plain text data.

At Symbl.ai, our animating purpose has been to investigate the nuances and complexities that make human conversation distinct from the mere processing of text data on the web. This line of work extends those investigations by examining whether there are inherent features in conversational data that can be utilized by LLMs to better detect and understand one of the most complex human conversational behaviors – deceptive communication.

Motivation

The motivation for this research stems from the observation that LLMs, until recently, primarily learned language through vast amounts of text-only data on the web. While this approach has yielded impressive results, it fails to capture the inherently multimodal nature of human communication.

The ability to detect deceptive communication is a complex task for both humans and machines, especially in the text-only modality. We focus on this specific aspect of communication to evaluate the multimodal transfer of skills in LLMs. By comparing the performance of multimodal models (conversation+text) with unimodal models, we aim to gain insights into how LLMs interpret and utilize multimodal features.

Results

Our experiments involved comparing the performance of two types of models: text-only models, and text models trained with a special emphasis on human-to-human conversations. These models are exemplified in our current work by the Llama-2-70B model – a very popular openly available LLM – and a fine-tuned version of that LLM that specializes in conversational data, respectively. We also varied the prompting approach, using both basic prompts and prompts designed to emphasize the model’s conversational features.

Table 1: Average percentage difference between Llama-2-70B-chat and Llama-2-70B-conversation
Table 2: Average percentage difference between the basic prompt and the conversational-features-emphasized prompt.

The results, presented in Tables 1 and 2, offer valuable insights into multimodal feature transfer in LLMs. Table 1 highlights the advantage of using conversation+text models over unimodal text models for identifying deceptive communication such as snark, irony, and condescension. The Llama-2-70B-conversation model achieves higher accuracy and precision in identifying such deceptive communication, with impressive improvements in accuracy and F1-score. This supports our central hypothesis that adding the additional features that come from the conversation modality improves the performance of the language model on challenging use cases and data.

Table 2 reveals the impact of changes in prompting techniques. Emphasizing conversational features in prompts yields mixed results, with a slight improvement in accuracy and precision but a decline in recall. This suggests that while the model may better identify deceptive communication correctly when it is guided to pay special attention to features from the conversation modality via the input prompt, this sharpened focus may also cause it to miss more instances of such communication from an overall set.

Conclusion

Our findings suggest that the phenomenon of multimodal feature transfer occurs in LLMs, as conversation-tuned models outperform unimodal models in deceptive communication detection – a traditionally challenging use case for language models. Additionally, prompts emphasizing speech and conversation features can enhance performance in certain cases. 

These results have important implications for future research and applications, indicating that models are capable of transferring what they learned on multimodal data to single-modality data, improving LLM performance on specific tasks that may require multimodal training. We are currently further investigating the effect of other modalities associated with human conversation data on the feature transfer phenomenon in LLMs, and on the overall accuracy of tasks that are challenging to today’s large language models.

A Guide to Building an LLM from Scratch
https://symbl.ai/developers/blog/a-guide-to-building-an-llm-from-scratch/
Fri, 31 May 2024

Up until recently, building a large language model (LLM) from scratch was a difficult and involved process, reserved only for larger organizations able to afford the considerable computational resources and highly skilled engineers required.

Today, with an ever-growing collection of knowledge and resources, developing a custom LLM is increasingly feasible. Organizations of all sizes can harness the power of a bespoke language model to develop highly-specialized generative AI applications that will boost their productivity, enhance their efficiency and sharpen their competitive edge.

In this guide, we detail how to build your own LLM from the ground up – from architecture definition and data curation to effective training and evaluation techniques. 

Determine the Use Case For Your LLM 

The first – and arguably most important – step in building an LLM from scratch is defining what it will be used for: what its purpose will be. 

This is crucial for several reasons, with the first being how it influences the size of the model. In general, the more complicated the use case, the more capable the required model – and the larger it needs to be, i.e., the more parameters it must have. 

In turn, the more parameters your model has, the more training data you will need. The LLM’s intended use case also determines the type of training data you will need to curate. Once you have a better idea of how big your LLM needs to be, you will have more insight into the computational resources, i.e., memory, storage space, etc., it will require.

In an ideal scenario, clearly defining your intended use case will also clarify why you need to build your own LLM from scratch – as opposed to fine-tuning an existing base model.

Key reasons for creating your own LLM can include: 

  • Domain-Specificity: training your LLM with industry-specific data that aligns with your organization’s distinct operations and workflow. 
  • Greater Data Security: incorporating sensitive or proprietary information without fear of how it will be stored and used by an open-source or proprietary model. 
  • Ownership and Control: by retaining full control over your model and its data, you can improve your own LLM over time – as your knowledge grows and your needs evolve.

Create Your Model Architecture

Having defined the use case for your LLM, the next stage is defining the architecture of its neural network. This is the heart, or engine, of your model and will determine its capabilities and how well it performs at its intended task. 

The transformer architecture is the best choice for building LLMs because of its ability to capture underlying patterns and relationships from data, handle long-range dependencies in text, and process input of variable lengths. Additionally, its self-attention mechanism allows it to process different parts of the input in parallel, enabling it to utilize hardware, i.e., graphics processing units (GPUs), more efficiently than architectures that preceded it, e.g., recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). Consequently, the transformer has emerged as the current state-of-the-art neural network architecture and has been incorporated into leading LLMs since its introduction in 2017.

Previously, an organization would have had to develop the components of a transformer on its own, which requires both considerable time and specialized knowledge. Fortunately, today, there are frameworks specifically designed for neural network development that provide these components out of the box – with PyTorch and TensorFlow being two of the most prominent.

PyTorch is a deep learning framework developed by Meta and is renowned for its simplicity and flexibility, which makes it ideal for prototyping. TensorFlow, created by Google, is a more comprehensive framework with an expansive ecosystem of libraries and tools that enable the production of scalable, production-ready machine learning models. 

Creating The Transformer’s Components

Embedding Layer 

This is where input enters the model and is converted into a series of vector representations that can be more efficiently understood and processed.

This occurs over several steps:

  • A tokenizer breaks down the input into tokens. In some cases, each token is a word, but the currently favored approach is to divide the input into sub-word tokens of approximately four characters, or roughly ¾ of a word.
  • Each token is assigned an integer ID and saved in a dictionary to dynamically build a vocabulary. 
  • Each integer is converted into a multi-dimensional vector, called an embedding, with each characteristic or feature of the token represented by one of the vector’s dimensions.  

A transformer has two embedding layers: one within the encoder for creating input embeddings and the other inside the decoder for creating output embeddings. 
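
As a minimal PyTorch sketch of this step (the vocabulary size, embedding dimension, and token IDs below are illustrative assumptions rather than output from a real tokenizer):

```python
import torch
import torch.nn as nn

vocab_size = 32000   # assumed vocabulary size
d_model = 512        # embedding dimension used in the original transformer

embedding = nn.Embedding(vocab_size, d_model)

# A toy batch of token IDs, standing in for real tokenizer output
token_ids = torch.tensor([[15, 2874, 901, 7]])
input_embeddings = embedding(token_ids)   # shape: (1, 4, 512)
```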

Positional Encoder

Instead of utilizing recurrence or maintaining an internal state to track the position of tokens within a sequence, the transformer generates positional encodings and adds them to each embedding. This is a key strength of the transformer architecture, as it can process tokens in parallel instead of sequentially while keeping better track of long-range dependencies.

Like embeddings, a transformer creates positional encoding for both input and output tokens in the encoder and decoder, respectively.  
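
A minimal sketch of the sinusoidal encoding scheme used in the original transformer, assuming PyTorch:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Build the (seq_len, d_model) matrix of sinusoidal position encodings."""
    position = torch.arange(seq_len).unsqueeze(1)                                  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# The encodings are added to (not concatenated with) the token embeddings:
# embeddings = input_embeddings + sinusoidal_positional_encoding(4, 512)
```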

Self-Attention Mechanism 

This is the most crucial component of the transformer – and what distinguishes it from other network architectures – as it is responsible for comparing each embedding against others to determine their similarity and semantic relevance. The self-attention layer generates a weighted representation of the input that captures the underlying relationships between tokens, which is used to calculate the most probable output.

At each self-attention layer, the input is projected across several smaller dimensional spaces known as heads – and is hence referred to as multi-head attention. Each head independently focuses on a different aspect of the input sequence in parallel, enabling the LLM to develop a richer understanding of the data in less time. The original self-attention mechanism contains eight heads, but you may decide on a different number based on your objectives. However, the more attention heads there are, the greater the required computational resources, which constrains the choice according to the available hardware.

Multiple attention heads enhance a model’s performance as well as its reliability: if one of the heads fails to capture important information from the input, the other heads can compensate, resulting in a more robust training process.

Both the encoder and decoder contain self-attention components: the encoder has one multi-head attention layer while the decoder has two. 
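
A minimal sketch of multi-head self-attention using PyTorch’s built-in module, with the original paper’s dimensions as illustrative values:

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8   # dimensions from the original transformer

attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, 10, d_model)                 # (batch, sequence length, embedding dim)
attn_output, attn_weights = attention(x, x, x)  # self-attention: query, key, and value are all x
```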

Feed-Forward Network

This layer captures the higher-level features, i.e., more complex and detailed characteristics, of the input sequence, so the transformer can recognize the data’s more intricate underlying relationships. It comprises three sub-layers (a minimal sketch follows this list):

  • First Linear Layer: this takes the input and projects it onto a higher-dimensional space (e.g., 512 to 2048 in the original transformer) to store more detailed representations.
  • Non-Linear Activation Function: this introduces non-linearity into the model, which helps in learning more realistic and nuanced relationships. A commonly used activation function is the Rectified Linear Unit (ReLU). 
  • Second Linear Layer: transforms the higher-dimensional representation back to the original dimensionality, compressing the additional information from the higher-dimensional space back to a lower-dimensional space while retaining the most relevant aspects. 
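
A minimal PyTorch sketch of such a feed-forward block, using the original transformer’s dimensions as illustrative values:

```python
import torch.nn as nn

d_model, d_ff = 512, 2048   # dimensions from the original transformer

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),   # first linear layer: project up to the higher-dimensional space
    nn.ReLU(),                  # non-linear activation
    nn.Linear(d_ff, d_model),   # second linear layer: project back down to d_model
)
```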

Normalization Layers

This layer ensures the input embeddings fall within a reasonable range and helps mitigate vanishing or exploding gradients, stabilizing the language model and allowing for a smoother training process.   

In particular, the transformer architecture utilizes layer normalization, which normalizes the output for each token at every layer – as opposed to batch normalization, for example, which normalizes across all the samples in a batch. Layer normalization is ideal for transformers because it maintains the relationships between the aspects of each token and does not interfere with the self-attention mechanism.

Residual Connections

Also called skip connections, they feed the output of one layer directly into the input of another, so data flows through the transformer more efficiently. By preventing information loss, they enable faster and more effective training.

During forward propagation, i.e., as training data is fed into the model, residual connections provide an additional pathway that ensures the original data is preserved and can bypass transformations at that layer. Conversely, during backward propagation, i.e., when the model adjusts its parameters according to its loss function, residual connections help gradients flow more easily through the network, helping to mitigate vanishing gradients, where gradients become increasingly smaller as they pass through more layers.
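
The two ideas are commonly combined into an “add and normalize” step applied after each sublayer. A minimal PyTorch sketch, assuming the post-norm ordering of the original transformer (some modern variants normalize before the sublayer instead):

```python
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Apply a sublayer, add the original input back (residual), then layer-normalize."""
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))
```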

Assembling the Encoder and Decoder

Once you have created the transformer’s individual components, you can assemble them to create an encoder and decoder. 

Encoder 

The role of the encoder is to take the input sequence and convert it into a weighted embedding that the decoder can use to generate output. 

The encoder is constructed as follows (a minimal sketch of a single encoder layer follows the list):

  • Embedding layer
  • Positional encoder
    • Residual connection that feeds into normalization layer 
  • Self-attention mechanism
  • Normalization layer
    • Residual connection that feeds into normalization layer 
  • Feed-Forward network 
  • Normalization layer 
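
Putting the components above together, a minimal PyTorch sketch of a single encoder layer (post-norm ordering, illustrative dimensions):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: self-attention and feed-forward, each followed by add & norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)              # residual connection into normalization
        x = self.norm2(x + self.feed_forward(x))  # residual connection into normalization
        return x

layer = EncoderLayer()
out = layer(torch.randn(1, 10, 512))   # (batch, sequence length, d_model)
```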

Decoder

The decoder takes the weighted embedding produced by the encoder and uses it to generate output, i.e., the tokens with the highest probability based on the input sequence. 

The decoder has a similar architecture to the encoder, with a couple of key differences: 

  • It has two self-attention layers, while the encoder has one.
  • It employs two types of self-attention
    • Masked Multi-Head Attention: uses a causal masking mechanism to prevent comparisons against future tokens (a minimal mask sketch follows this list).
    • Encoder-Decoder Multi-Head Attention: each output token calculates attention scores against all input tokens, better establishing the relationship between the input and output for greater accuracy. Because the encoder’s outputs correspond to the already-available input sequence, this cross-attention mechanism does not require causal masking.
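
A minimal sketch of such a causal mask, assuming PyTorch’s convention in which `True` marks positions that may not be attended to:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where position i can only attend to positions 0..i."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Pass as attn_mask to nn.MultiheadAttention; True entries are masked out.
print(causal_mask(4))
```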

This results in the following decoder structure: 

  • Embedding layer
  • Positional encoder
    • Residual connection that feeds into normalization layer 
  • Masked self-attention mechanism
  • Normalization layer
    • Residual connection that feeds into normalization layer 
  • Encoder-Decoder self-attention mechanism
  • Normalization layer
    • Residual connection that feeds into normalization layer 
  • Feed-Forward network 
  • Normalization layer 

Combine the Encoder and Decoder to Complete the Transformer

Having defined the components and assembled the encoder and decoder, you can combine them to produce a complete transformer.

However, transformers do not contain a single encoder and decoder – but rather a stack of each in equal numbers, e.g., six of each in the original transformer. Stacking encoders and decoders in this manner increases the transformer’s capabilities, as each layer captures different characteristics and underlying patterns from the input to enhance the LLM’s performance.
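
For illustration, PyTorch ships a reference implementation that stacks encoder and decoder layers for you; a minimal sketch with the original paper’s dimensions:

```python
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=2048, batch_first=True,
)

src = torch.randn(1, 20, 512)   # embedded + position-encoded input sequence
tgt = torch.randn(1, 15, 512)   # embedded + position-encoded output sequence so far
out = model(src, tgt)           # (1, 15, 512)
```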

Data Curation

Once you have built your LLM, the next step is compiling and curating the data that will be used to train it. 

This is an especially vital part of the process of building an LLM from scratch because the quality of data determines the quality of the model. While other aspects, such as the model architecture, training time, and training techniques can be adjusted to improve performance, bad data cannot be overcome. 

Consequences of low-quality training data include:  

  • Inaccuracy: a model trained on incorrect data will produce inaccurate answers 
  • Bias: any inherent bias in the data will be learned by the model 
  • Unpredictability: the model may produce incoherent or nonsensical answers, and it will be difficult to determine why
  • Poor resource utilization: ultimately, poor-quality data prolongs the training process and incurs higher computational, personnel, and energy costs.

As well as requiring high-quality data, for your model to properly learn linguistic and semantic relationships to carry out natural language processing tasks, you also need vast amounts of data. As stated earlier, a general rule of thumb is that the more performant and capable you want your LLM to be, the more parameters it requires  – and the more data you must curate. 

To illustrate this, here are a few existing LLMs and the amount of data, in tokens, used to train them:

Model          # of parameters   # of tokens
GPT-3          175 billion       0.5 trillion
Llama 2        70 billion        2 trillion
Falcon 180B    180 billion       3.5 trillion

For better context, 100,000 tokens equate to roughly 75,000 words – or an entire novel. So GPT-3, for instance, was trained on the equivalent of 5 million novels’ worth of data. 

Characteristics of a High-Quality Dataset

Let us look at the main characteristics to consider when curating training data for your LLM.

  • Filtered for inaccuracies 
  • Minimal biases and harmful speech 
  • Cleaned: the data has been filtered for:
    • Misspellings
    • Cross-domain homographs
    • Spelling variations
    • Contractions
    • Punctuation
    • Boilerplate text 
    • Markup, e.g., HTML 
    • Non-textual components, e.g., emojis
  • Deduplication: removing repeated information, as it could increase bias in the model
  • Privacy redaction: removing confidential or sensitive data
  • Diverse: containing data from a wide range of formats and subjects, e.g., academic writing, prose, website text, coding samples, mathematics, etc.

Another crucial component of creating an effective training dataset is retaining a portion of your curated data for evaluating the model. If you use the same data with which you trained your LLM to evaluate it, you run the risk of overfitting the model – where it becomes familiar with a particular set of data and fails to generalize to new data. 

Where Can You Source Data For Training an LLM?

There are several places to source training data for your language model. Depending on the amount of data you need, it is likely that you will draw from each of the sources outlined below.

  • Existing Public Datasets: data that has previously been used to train LLMs and has been made available for public use (a minimal loading sketch using the Hugging Face datasets library follows this list). Prominent examples include:
    • The Common Crawl: a dataset containing terabytes of raw web data extracted from billions of pages. It also has widely used variations or subsets, including RefinedWeb and C4 (Colossal Clean Crawled Corpus).
    • The Pile: a popular text corpus that contains data from 22 data sources across 5 categories:
      • Academic Writing: e.g., arXiv
      • Online or Scraped Resources: e.g., Wikipedia
      • Prose: e.g., Project Gutenberg
      • Dialog: e.g., YouTube subtitles
      • Miscellaneous: e.g., GitHub
    • StarCoder: close to 800GB of coding samples in a variety of programming languages. 
    • Hugging Face: an online resource hub and community that features over 100,000 public datasets.  
  • Private Datasets: a personally curated dataset that you create in-house or purchase from an organization that specializes in dataset curation.  
  • Directly From the Internet: naturally, scraping data directly from websites en masse is an option – but this is ill-advised because it won’t be cleaned, is likely to contain inaccuracies and biases, and could feature confidential data. Additionally, there are likely to be data ownership issues with such an approach.
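
As a minimal sketch of pulling one of these public datasets, assuming the Hugging Face `datasets` library is installed and the named dataset and configuration are available on the Hub:

```python
from datasets import load_dataset

# Load a small public text corpus for inspection; swap in the dataset you actually need.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(len(dataset))
print(dataset[0]["text"][:200])   # preview a sample before cleaning and deduplication
```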

Training Your Custom LLM

Training an LLM involves passing vast amounts of textual data through its neural network to iteratively update its parameters, i.e., weights and biases. This is composed of two steps: forward and backward propagation.

During forward propagation, training data is fed into the LLM, which learns the language patterns and semantics required to predict output accurately during inference. The output of each layer of the neural network serves as the input to another layer, until the final output layer, which generates a predicted output based on the input sequence and its learned parameters.

Meanwhile, backward propagation updates the LLM’s parameters based on its prediction errors. The model’s gradients, i.e., the extent to which parameters should be adjusted to increase accuracy, are propagated backwards through the network. The parameters of each layer are then adjusted in a way that minimizes the loss function: the function that quantifies the difference between the target output and the actual output, providing a quantitative measure of performance.

This process iterates over multiple batches of training data, and several epochs, i.e., complete passes through the dataset, until the model’s parameters converge to values that minimize the loss and maximize accuracy.
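
The loop below is a minimal sketch of this process for next-token prediction; `model`, `dataloader`, and `vocab_size` are assumed to already exist (e.g., the transformer assembled earlier plus a tokenized dataset), and the hyperparameter values are illustrative:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
num_epochs = 3   # illustrative; real pre-training runs are tuned empirically

for epoch in range(num_epochs):                 # one epoch = a complete pass over the data
    for input_ids, target_ids in dataloader:    # batches of token IDs
        logits = model(input_ids)               # forward propagation
        loss = loss_fn(logits.view(-1, vocab_size), target_ids.view(-1))
        optimizer.zero_grad()
        loss.backward()                         # backward propagation of gradients
        optimizer.step()                        # adjust parameters to reduce the loss
```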

How Long Does It Take to Train an LLM From Scratch?

The training process for every model will be different – so there is no set amount of time taken to train an LLM. The amount of training time will depend on a few key factors:

  • The complexity of the desired use case
  • The amount, complexity, and quality of available training data
  • Available computational resources

Training an LLM for a relatively simple task on a small dataset may only take a few hours, while training for more complex tasks with a large dataset could take months.

Additionally, two challenges you will need to mitigate while training your LLM are underfitting and overfitting. Underfitting can occur when your model is not trained for long enough, and the LLM has not had sufficient time to capture the relationships in the training data. Conversely, training an LLM for too long can result in overfitting – where it learns the patterns in the training data too well, and doesn’t generalize to new data.  In light of this, the best time to stop training the LLM is when it consistently produces the expected outcome – and makes accurate predictions on previously unseen data.

LLM Training Techniques 

Parallelization

Parallelization is the process of distributing training tasks across multiple GPUs, so they are carried out simultaneously. This both expedites training times in contrast to using a single processor and makes efficient use of the parallel processing abilities of GPUs. 

There are several different parallelization techniques which can be combined for optimal results: 

  • Data Parallelization: the most common approach, which sees the training data divided into shards and distributed over several GPUs (see the sketch after this list). 
  • Tensor Parallelization: divides the matrix multiplications performed by the transformer into smaller calculations that are performed simultaneously on multiple GPUs.
  • Pipeline Parallelization: distributes the transformer layers over multiple GPUs to be processed in parallel.
  • Model Parallelization: distributes the model across several GPUs and uses the same data for each – so each GPU handles one part of the model instead of a portion of the data. 
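
A minimal sketch of the simplest form of data parallelism in PyTorch; `DistributedDataParallel` is generally preferred for serious training, and a small linear layer stands in for a full transformer here:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)            # stand-in for the transformer
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)     # replicates the model and splits each batch across GPUs
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```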

Gradient Checkpointing  

Gradient checkpointing is a technique used to reduce the memory requirements of training LLMs. It is a valuable training technique because it makes it more feasible to train LLMs on devices with restricted memory capacity. Subsequently, by mitigating out-of-memory errors, gradient checkpointing helps make the training process more stable and reliable.

Typically, during forward propagation, the model’s neural network produces a series of intermediate activations: the outputs of each layer, which are normally all kept in memory because they are needed later to compute gradients. With gradient checkpointing, though all intermediate activations are calculated, only a subset of them are stored in memory at defined checkpoints.

During backward propagation, the intermediate activations that were not stored are recalculated. However, because recomputation can resume from the nearest stored checkpoint, only the activations between checkpoints need to be recalculated rather than the entire forward pass. Although gradient checkpointing reduces memory requirements, the tradeoff is that it increases processing overhead: the fewer activations that are stored, the more recomputation is required.
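
A minimal sketch using PyTorch’s built-in checkpointing utility; `block` and `x` stand in for a transformer layer and its input, and the `use_reentrant` flag assumes a reasonably recent PyTorch version:

```python
from torch.utils.checkpoint import checkpoint

def run_block_with_checkpointing(block, x):
    # Activations inside `block` are not stored during the forward pass;
    # they are recomputed from `x` when gradients are needed.
    return checkpoint(block, x, use_reentrant=False)
```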

LLM Hyperparameters

Hyperparameters are configurations that you can use to influence how your LLM is trained. In contrast to parameters, hyperparameters are set before training begins and aren’t changed by the training data. Tuning hyperparameters is an essential part of the training process because it provides a controllable and measurable method of altering your LLM’s behavior to better align with your expectations and defined use case.

Notable hyperparameters include:

  • Batch Size: a batch is a collection of instances from the training data, which are fed into the model at a particular timestep. Larger batches require more memory but also accelerate the training process as you get through more data at each interval. Conversely, smaller batches use less memory but prolong training. Generally, it is best to go with the largest data batch your hardware will allow while remaining stable, but finding this optimal batch size requires experimentation. 
  • Learning Rate: how much the LLM adjusts its parameters in response to its loss function during training. A higher learning rate expedites training but could cause instability and overfitting. A lower learning rate, in contrast, is more stable and improves generalization – but lengthens the training process. 
  • Temperature: adjusts the range of possible output to determine how “creative” the LLM is; note that it is applied when the model generates output rather than during training. Represented by a value between 0.0 (minimum) and 2.0 (maximum), a lower temperature will generate more predictable output, while a higher value increases the randomness and creativity of responses.
      

Fine-Tuning Your LLM 

After training your LLM from scratch with larger, general-purpose datasets, you will have a base, or pre-trained, language model. To prepare your LLM for your chosen use case, you likely have to fine-tune it.  Fine-tuning is the process of further training a base LLM with a smaller, task or domain-specific dataset to enhance its performance on a particular use case.

Fine-tuning methods broadly fall into two categories: full fine-tuning and transfer learning:

  • Full Fine-Tuning: where all of the base model’s parameters are updated, creating a new version with altered weighting. This is the most comprehensive way to train an LLM for a specific task or domain – but requires more time and resources.
  • Transfer Learning: this involves leveraging the significant language knowledge acquired by the model during pre-training and adapting it for a specific domain or use case. Transfer learning requires many or all of the base LLM’s neural network layers to be “frozen” to limit which parameters can be tuned. The remaining unfrozen layers – often newly added ones – are then fine-tuned with the smaller fine-tuning dataset, requiring less time and fewer computational resources than full fine-tuning (a minimal freezing sketch follows this list).
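
A minimal sketch of freezing layers for transfer learning; `base_model` is assumed to be a pre-trained PyTorch model, and the `decoder.layers` attribute is a hypothetical layout used only for illustration:

```python
import torch

# Freeze every parameter, then unfreeze only the layers we want to adapt.
for param in base_model.parameters():
    param.requires_grad = False

for param in base_model.decoder.layers[-2:].parameters():   # hypothetical attribute layout
    param.requires_grad = True                               # unfreeze the last two decoder layers

# Only the unfrozen parameters are passed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in base_model.parameters() if p.requires_grad), lr=1e-5
)
```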

Evaluating Your Bespoke LLM

After training and fine-tuning your LLM, it is time to test whether it performs as expected for its intended use case. This will allow you to determine whether your LLM is ready for deployment or requires further training. 

For this, you will need previously unseen evaluation datasets that reflect the kind of information the LLM will be exposed to in a real-world scenario. As mentioned above, this dataset needs to differ from the one used to train the LLM to prevent it from overfitting to particular data points instead of genuinely capturing its underlying patterns. 

LLM Benchmarks 

An objective way to evaluate your bespoke LLM is through the use of benchmarks: standardized tests developed by various members of the AI research and development community. LLM benchmarks provide a standardized way to test the performance of your LLM – and compare it against existing language models. Also, each benchmark includes its own dataset, satisfying the requirement of using different datasets than during training to help avoid overfitting.  

Some of the most widely used benchmarks for evaluating LLM performance include: 

  • ARC: a question-answer (QA) benchmark designed to evaluate knowledge and reasoning skills. 
  • HellaSwag: uses sentence completion exercises to test commonsense reasoning and natural language inference (NLI) capabilities. 
  • MMLU: a comprehensive benchmark comprised of 15,908 questions across 57 tasks that measure natural language understanding (NLU), i.e., how well an LLM understands language and, subsequently, can solve problems.  
  • TruthfulQA: measuring a model’s ability to generate truthful answers, i.e., its propensity to “hallucinate”. 
  • GSM8K: measures multi-step mathematical abilities through a collection of 8,500 grade-school-level math word problems. 
  • HumanEval: measures an LLM’s ability to generate functionally correct code. 
  • MT Bench: evaluates a language model’s ability to effectively engage in multi-turn dialogues – like those engaged in by chatbots. 

Conclusion

In summary, the process of building an LLM from scratch can roughly be broken down into five stages:

  • Determining the use case for your LLM: the purpose of your custom language model 
  • Creating your model architecture: developing the individual components and combining them to create a transformer
  • Data curation: sourcing the data necessary to train your model
  • Training: pre-training and fine-tuning your model 
  • Evaluation: testing your model to see if it works as intended; evaluating its overall performance with benchmarks 

Understanding what’s involved in developing a bespoke LLM grants you a more realistic perspective of the work and resources required – and of whether it is a viable option.

However, though the barriers to entry for developing a language model from scratch have been significantly lowered, it is still a considerable undertaking. So, it is crucial to determine if building an LLM is absolutely essential – or if you can reap the same benefits with an existing solution. 

A Guide to Building an End-to-End Speech Recognition Model
https://symbl.ai/developers/blog/a-guide-to-building-an-end-to-end-speech-recognition-model/
Thu, 02 May 2024

With speech being such a natural and fundamental form of communication, speech recognition is among the most exciting, and important, applications of AI. 

A large reason for this is that speech-activated applications feel natural and intuitive to users, offering a gentle learning curve, which allows a level of comfort that facilitates fast adoption. Consequently, voice-activated technology offers some of the first examples of widely-used AI applications, as the general public has been familiar with automated customer service systems for decades. Even more recently, digital assistants, such as Apple’s Siri and Google’s Assistant, have had a large impact – and have become a daily fixture for tens of millions of people globally.

With this in mind, this guide explores the process of building your own end-to-end speech recognition model. We take a look at how they work, their common applications, their various architectures, and how to train and evaluate them. 

What is an End-to-End Speech Recognition Model?

An end-to-end speech recognition model is a deep learning model that takes an aural speech signal as its input and outputs a textual transcript.

However, in contrast to prior “traditional”  speech recognition systems that are composed of multiple models that process audio and text independently, i.e., feature extraction, acoustic and language modeling, etc., end-to-end speech models directly map the audio input to its corresponding text output. This eliminates the need for explicit intermediate representations, reducing the complexity of developing the model while improving its performance and accuracy.

How do End-to-End Speech Recognition Models work?

Here is an overview of the end-to-end speech recognition process, broken down into several key stages.

  1. Acoustic Signal Processing: an acoustic signal input, i.e. the analogue waveform of the audio, is captured by a microphone and converted to digital data.
  2. Feature Extraction: relevant features are extracted from the data, including its pitch, intensity, and spectral characteristics, i.e., the audio signal’s unique frequency properties and patterns. Spectral characteristics can be represented by a Mel spectrogram, a visual representation of the signal’s frequency content over time, or by Mel-Frequency Cepstral Coefficients (MFCCs), which compactly represent the signal’s spectral envelope and capture the audio’s higher-level characteristics. (A feature-extraction sketch follows this list.)
  3. Acoustic Model Encoding: a statistical model, such as a neural network, that maps the signal’s extracted features to phonetic or sub-word representations to capture the acoustic-textual relationship. 
  4. Language Model Encoding: a statistical model that predicts the probability of word, or sub-word token, sequences occurring. The language model complements the acoustic model by incorporating linguistic, grammatical, and syntactic context to help distinguish between words that sound similar but mean different things. 
  5. Decoding: the acoustic and language models work in tandem to generate output that represents the most probable text transcription of the audio signal. A decoding algorithm, such as beam search or greedy decoding, traverses possible word sequences and identifies those that are most likely to represent the audio input. 
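
As a minimal sketch of the feature-extraction step above, assuming the `torchaudio` library and a local recording at the illustrative path `example.wav`:

```python
import torchaudio
import torchaudio.transforms as T

waveform, sample_rate = torchaudio.load("example.wav")   # digitized audio signal

mel_spectrogram = T.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)
mfcc_features = T.MFCC(sample_rate=sample_rate, n_mfcc=13)(waveform)
```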

Determining the Use Case for Your Speech Recognition System

Before you begin building your end-to-end speech recognition model, it is essential to clearly define its use case.

This is a fundamental part of the process because it will factor into your choice of model architecture and, more importantly, the model size, i.e., the number of parameters, to aim for. Generally, the more capable you want your speech model to be, e.g., the larger its vocabulary, its adaptability, whether it is multi-lingual, etc., the more trained parameters it requires.  

These aspects then trickle down to the training process as they determine the amount of data you will need – and the more capable your model, the larger the required training dataset. Subsequently, the more training data you have, the longer it will take to train the model and the more computational resources, i.e., memory, storage, electricity, etc., that are required. 

Applications of End-to-End Speech Recognition Models

Popular uses for speech recognition systems include: 

  • Digital Personal Assistants: likely the first thing that comes to mind for many when it comes to voice-activated technology are virtual assistants like Siri and Alexa, which use speech recognition models to process and respond to spoken queries and commands. 
  • Home Automation: voice-activated digital assistants act as the central hub in “smart” homes that use speech recognition in home automation tasks. This includes controlling security devices, lighting, air conditioning and other appliances.
  • Customer Service: speech models are used in automated customer service lines to handle simple queries and tasks, or to streamline the process of directing the customer to the appropriate human agent.
  • Transcription Services: speech models can turn verbal expressions into written text, making them ideal for tasks such as transcribing interviews or meeting minutes, as well as for creating content.
  • Translation Tasks: if a speech model is multilingual, it can receive spoken input in one language and transcribe it into the written text of another language – or even multiple languages
  • Accessibility Features: speech recognition models can enhance the accessibility of a vast range of digital solutions, making them far more functional for people who are physically or mentally impaired. 

Defining the End-to-End Speech Recognition Model’s Architecture

After identifying the use case for your end-to-end speech recognition model, the next stage is choosing a neural network architecture to match your intended use case.

Fortunately, instead of having to build the required neural networks from scratch, there are engines, frameworks, and toolkits that streamline the process of building end-to-end speech recognition models. Let us look at two of the most widely used frameworks – DeepSpeech and ESPnet.  

DeepSpeech is a prominent speech recognition engine that uses the TensorFlow deep learning framework as its foundation. Developed by Mozilla, it is based on the DeepSpeech model proposed in the influential paper published by Chinese technology giant Baidu – credited with popularizing the idea of end-to-end speech recognition models. It is compatible with various languages, including Python, JavaScript, and C, and is lightweight enough to enable deployment on resource-constrained devices.

DeepSpeech employs a combination of convolutional neural networks (CNNs), to identify patterns and extract high-level features from the audio, and recurrent neural networks (RNNs), to capture the audio’s temporal dependencies and predict textual output. It offers both pre-trained models and the components and infrastructure to develop your own end-to-end speech recognition systems, including comprehensive libraries for data preprocessing, training, and evaluation.

ESPnet is an open-source framework developed by researchers at Johns Hopkins University for building end-to-end speech processing models. Built on top of the PyTorch and Chainer neural network libraries, its modular and extensible design enables the development of models capable of automatic speech recognition (ASR), text-to-speech (TTS), and translation tasks. Much like DeepSpeech, you can start from a selection of pre-trained models or customize one of several RNN or transformer-based architectures.

Building a Data Pipeline

After selecting the architecture for your end-to-end speech recognition model, it’s time to build a data pipeline for training your model. This is a crucial part of the process because the quality of your training data determines the quality, i.e., the capabilities and accuracy, of your speech recognition model.  To underscore the importance of this step, it is not uncommon to see data curation listed before selecting a neural network architecture in a list of steps for creating a deep learning model. 

Building a data pipeline is composed of four steps:

  • Collecting data
  • Preprocessing the training data
  • Data augmentation
  • Dividing the data into training and evaluation sets 

Let us consider each step in greater detail.   

Collecting Data

The first stage of creating your data pipeline is compiling a training dataset. Speech recognition datasets contain two types of data: spoken audio data that serves as input and text transcripts that represent the target output labels. 

Training an end-to-end speech recognition model requires a lot of data -ranging from tens to thousands of hours of audio data. Generally, the more sophisticated you need your model to be, the more data you will need to amass. Additionally, if you’re training your model for a domain-specific purpose – for use in the engineering field, for instance – you’ll also need specific, typically smaller datasets with the appropriate content for fine-tuning the model after its initial pre-training. 

Fortunately, there are existing datasets compiled by the AI research community that are publicly available; here are some of the most commonly used datasets for training speech recognition systems: 

  • LibriSpeech: around 1,000 hours of English-language recordings from various audiobooks read by diverse speakers. The Multilingual LibriSpeech (MLS) dataset is also available, which contains 44,500 hours of English audio and 6,000 hours of audio in French, Spanish, Dutch, and several other languages.
  • Mozilla Common Voice: a popular open-source dataset that features over 20,000 hours of recordings of sentences read in dozens of languages, offering a diverse selection of speakers, accents, and recording conditions.
  • Switchboard-1: composed of close to 2,500 telephone conversations between pairs of speakers on various topics. 

Preprocessing the Training Data

Next, we need to preprocess our compiled dataset to make it easier for the model to process and to make training more efficient. This involves addressing variations within data instances to ensure they have the consistent dimensions the model expects.

Typical preprocessing tasks include the following (a resampling sketch follows the list):

  • Filtering the audio signal to reduce background noise
  • Establishing uniform attributes for the data, including:
    • Duration: truncating longer sequences, e.g.,  based on silence or pauses, or padding shorter ones, e.g., with zero-value samples
    • Sampling rate
    • Bit depth 
    • Channels
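
A minimal sketch of two of these steps – resampling to a uniform rate and collapsing to a single channel – assuming `torchaudio` and an illustrative target rate of 16 kHz:

```python
import torchaudio
import torchaudio.transforms as T

waveform, orig_rate = torchaudio.load("clip.wav")
target_rate = 16000   # assumed target; use whatever your model expects

if orig_rate != target_rate:
    waveform = T.Resample(orig_freq=orig_rate, new_freq=target_rate)(waveform)

waveform = waveform.mean(dim=0, keepdim=True)   # average channels down to mono
```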

Data Augmentation

Data augmentation is a technique used to artificially increase the size and diversity of your dataset. This strategy is especially helpful when data is scarce or your model is overfitting. 

Data augmentation techniques for speech recognition models include:

  • Changing the pitch
  • Altering the speed
  • Adding background noise
  • Adding reverb
  • Lengthening or compressing the audio signal 
  • Resampling, i.e., changing the sampling rate
  • Time shifting, i.e., moving the audio signal by a small percentage to introduce variations

Dividing the Data into Training And Evaluation Sets 

Using the same data to both train and evaluate your speech model can result in overfitting. This occurs when the model is already familiar with data instances and has learned them – instead of being able to generalize to new data. For this reason, you should retain a portion of the training data for the later evaluation stage. 

Training Your Speech Recognition Model

Training an end-to-end speech recognition model involves passing training data through its neural network to learn its parameters, i.e., the weights and biases. The objective of this process is for the model to learn the characteristics of the audio input data well enough to predict its corresponding text output labels. This process is composed of two stages: forward propagation (also called the forward pass) and backward propagation.

During forward propagation, the audio input enters the speech recognition model, which learns to extract relevant features and patterns from the audio signals. As the input progresses through the neural network’s hidden layers, the model captures increasing numbers of higher-level representations, so it gains a richer and deeper understanding of the speech signal from which to transcribe it most accurately. The forward pass continues until the network’s final layer outputs a text transcription based on the input audio data and the model’s parameters.

Backward propagation is where the model’s parameters are updated based on its prediction errors relative to the correct output labels. The gradients of the model’s parameters, i.e., the nature of the adjustments required to maximize the speech model’s accuracy, are computed and propagated backwards through the network. These gradient adjustments are made to minimize the loss function – a function that measures the difference between the predicted text output and the actual “ground truth” transcription from the training data. 

A commonly used loss function is connectionist temporal classification (CTC), which aligns the audio input with the textual output sequence. A key strength of CTC is that it aligns sequences automatically, without you needing to explicitly or manually stipulate alignment as part of the labeled training data. Backward propagation continues iteratively until the speech model’s parameters converge to a point where the loss function is minimized and, consequently, the model is optimized for accurate speech recognition.
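
A minimal sketch of computing a CTC loss in PyTorch over toy shapes (50 time steps, a batch of two, and a 30-symbol vocabulary with index 0 reserved for the CTC blank token):

```python
import torch
import torch.nn as nn

log_probs = torch.randn(50, 2, 30).log_softmax(dim=2)      # (time, batch, classes)
targets = torch.randint(1, 30, (2, 20), dtype=torch.long)  # target label sequences
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```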

Fine-Tuning Your Model

After training your model with general-purpose audio data, it will (pending successful evaluation) be capable of general speech recognition tasks.  Making your end-to-end speech recognition model effective at your specific use case, however, may require you to fine-tune it – with further training on domain or task-specific data. Much like the initial training process, known as pre-training, fine-tuning is an iterative process in which the speech model’s parameters are updated to minimize the loss function according to the newly introduced fine-tuning dataset. 

There are two types of fine-tuning: full fine-tuning and transfer learning:

  • Full Fine-Tuning: the most comprehensive fine-tuning method whereby all of the model’s parameters are updated. Though likely to yield the best results, it requires more data, memory, and time.
  • Transfer Learning: this method leverages the existing capabilities of the speech model developed during pre-training and transfers them to the desired use case. Transfer learning requires many or all of the base model’s neural network layers to be “frozen” to limit which parameters can be altered. The unfrozen layers are fine-tuned with new data, which requires less data, time, and computational resources than full fine-tuning. Parameter-efficient fine-tuning (PEFT) is a commonly used method of transfer learning.  

Evaluating Your Speech Recognition Model 

After training and fine-tuning your speech model, it is time to evaluate whether it can carry out its intended use case and to what level. For evaluation, you’ll need different data from that used to train the model – to avoid overfitting. This is often called a holdout dataset because it was “held out” from the training data to test the model later in the process.

One of the most effective ways to measure the performance of your end-to-end speech recognition model is with evaluation metrics. Commonly used evaluation metrics for speech models include:  

  • Word Error Rate (WER): compares the model’s output transcription with the original ground truth transcription on a per-word basis.
    WER = ((S + D + I) / N) × 100, where:
    • S = number of substitutions
    • D = number of deletions
    • I = number of insertions
    • N = number of words in the reference transcription

The resulting figure is the WER expressed as a percentage; the lower the WER, the better the model’s performance. (A minimal WER computation sketch follows this list of metrics.)

  • Token Error Rate (TER): similar to WER but measures errors at a sub-word level, allowing for a more precise measure of accuracy.  
  • Character Error Rate (CER): measures errors at a character level, offering higher precision than WER and TER. 
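
As a minimal sketch, WER can be computed with a standard word-level edit distance; no external libraries are assumed:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER as a percentage: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                     # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                     # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, substitution)
    return d[len(ref)][len(hyp)] / len(ref) * 100

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # one deletion: ~16.7
```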

Conclusion

To briefly recap, the process of building an end-to-end speech recognition system is as follows:

  • Determining the model’s use case 
  • Defining the speech model’s architecture
  • Building a data pipeline
  • Training the speech recognition model
  • Evaluating the model 

While speech recognition models have seen considerable advancements in recent years, the inherent challenges in processing audio, including persistent background interference, different accents, distinct vocal tonality and cadence, etc., have seen them progress at a slower rate than other areas of deep learning, such as language models. Consequently, the use of end-to-end speech models is not as democratized as other AI applications, as only companies with the necessary vast resources can afford to build models performant enough for consistent real-world performance. 

Fortunately, because speech recognition is a key component of human-to-machine communication, and gives rise to so many use cases, it’s an area that AI vendors and researchers are sure to throw their resources behind.

Emotional Intelligence in LLMs: Evaluating the Nebula LLM on EQ-Bench and the Judgemark Task
https://symbl.ai/developers/blog/emotional-intelligence-in-llms-evaluating-the-nebula-llm-on-eq-bench-and-the-judgemark-task/
Mon, 29 Apr 2024

Introduction

Large Language Models (LLMs) have become increasingly significant in the field of artificial intelligence, with their ability to effortlessly process and generate human-like language at scale. However, the traditional benchmarks used to evaluate LLMs often fall short, focusing primarily on syntax and semantics. A key part of any human interaction is the expression and understanding of emotions: these are cues that give more depth to our communication, and can be used to interpret and assess text with more depth. As we aim to generate more human-like interactions using LLMs, it becomes crucial to assess the emotional reasoning capabilities of these models.

Are Human Conversations Special?

In prior work, we examined the question of whether human conversation data is inherently special, and poses more challenges for LLMs than other types of data. In our study, we found that conversation data makes up a minuscule fraction of the training data used to train today’s LLMs. Consequently, these trained models are unable to generalize their attention patterns to understand the long-term contextual relationships exhibited in conversations. This plays a key role in how LLMs understand and interpret the context of conversations. However, another key issue that most models are not evaluated against when it comes to understanding and generating natural conversations is whether they are able to adequately process the emotional cues in conversations.

The Importance of Emotional Intelligence in Conversations

Understanding and processing human language involves more than just syntax and semantics. Emotional intelligence and reasoning – the ability to interpret and respond to emotional cues – is essential for natural communication. By incorporating emotional intelligence into LLMs, we can enhance their output to be more contextually appropriate and human-like; and produce artifacts that humans are more likely to engage with and respond to.

Beyond Standard Benchmarks: EQ-Bench

A major shortcoming of current LLM benchmarking techniques is their narrow focus on basic language tasks, and the fact that they primarily evaluate only the output generated by models. However, to create more advanced and human-like AI systems, we need to go beyond these standard evaluations. This is where the EQ-Bench benchmark comes into play.

EQ-Bench is an innovative benchmark specifically designed to evaluate aspects of emotional intelligence in Large Language Models (LLMs). The benchmark assesses the ability of LLMs to understand complex emotions and social interactions by asking them to predict the intensity of emotional states of characters in a dialogue. This interpretative approach assesses a model’s ability to predict the magnitude of different emotions, without the need for human judges, thus eliminating length bias. EQ-Bench tasks focus on emotional understanding (EU) – which is defined as the ability to comprehend and interpret complex emotions and their meanings in social contexts. The emotions measured by the benchmark include surprise, confusion, anger, forgiveness, offense, empathy, confidence, and dismissiveness.

Results from the EQ-Bench benchmark show a strong correlation with MMLU (r = 0.97), which is an established benchmark for the evaluation of LLMs.

EQ-Bench Judgemark Task

The Judgemark task, a part of EQ-Bench, goes one step further in generalizing the evaluation of LLMs. Rather than focus on judging the output of the model for a specific set of test cases, it instead inverts the paradigm and focuses on measuring the ability of a model to act as a judge of creative writing. This inversion eliminates the biases that result from judging models on limited instances of their own output, and instead measures a more complete notion of whether a model is able to understand the emotional nuances in pieces of creative writing at a level that it can be deemed a high quality judge of all such output.

In this challenging test, the model is presented with pre-generated creative writing outputs from 19 test models and is tasked with assigning scores, just as a human judge would. The specific metrics evaluated in this task include correlation with EQ-Bench scores (EQB-Corr), correlation with LMSys Arena ELO (Arena-Corr), standard deviation of scores (a proxy for discriminative power), and bias statistics such as self-bias and family bias.
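As a rough illustration of how these summary statistics could be computed from a judge's scores, here is a short sketch using NumPy and SciPy; the score arrays are invented placeholders rather than real leaderboard data, and the official EQ-Bench tooling is the authoritative implementation.

```python
# Sketch of Judgemark-style summary statistics. The values below are made up;
# the real task uses scores from 19 test models and the official benchmark code.
import numpy as np
from scipy.stats import pearsonr

judge_scores = np.array([72.1, 65.4, 58.9, 80.2, 45.7, 69.3])    # hypothetical scores the judge model assigned
eqbench_scores = np.array([70.5, 66.0, 55.2, 82.1, 48.3, 68.8])  # hypothetical EQ-Bench scores of the judged models
arena_elo = np.array([1190, 1150, 1085, 1250, 1020, 1170])       # hypothetical LMSys Arena ELO ratings

eqb_corr, _ = pearsonr(judge_scores, eqbench_scores)   # EQB-Corr
arena_corr, _ = pearsonr(judge_scores, arena_elo)      # Arena-Corr
spread = judge_scores.std(ddof=1)                      # proxy for discriminative power

print(f"EQB-Corr: {eqb_corr:.2f}, Arena-Corr: {arena_corr:.2f}, std dev: {spread:.2f}")
```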

Relating EQ-Bench and Conversational Language

The design of the Judgemark task renders it an effective metric for judging a language model’s ability to understand and generate natural sounding conversational language. This correlation stems from the shared narrative and expressive qualities between creative writing and conversation. Both forms aim to engage and captivate their audience, often employing diverse narrative structures and storytelling techniques to achieve this. They also utilize figurative language – including metaphors, similes, and idiomatic expressions – to convey complex ideas and emotions in a nuanced and compelling manner. Additionally, both creative writing and conversational language involve character development through the sharing of context, the description of settings, and the use of dialogue: all of these contribute to a natural and immersive experience for the interlocutor. 

Models that demonstrate a strong understanding of creative writing text, as indicated by high Judgemark scores, also tend to exhibit an improved capacity for generating engaging and contextually appropriate responses in conversational settings. This is because of the high overlap between the nuances of creative writing – such as tone, style, and narrative arc – and the complexities of natural language that is used in conversations. Judgemark thus serves as a valuable indicator of a model’s potential for generating compelling and emotionally appropriate responses in conversational contexts.

Nebula LLM’s Breakthrough Performance

Model                          Provider      Judgemark Score
nebula-chat-large              Symbl.ai      76.63
claude-3-opus-20240229         Anthropic     75.23
gpt-4-turbo-2024-04-09         OpenAI        70.43
gemini-1.5-pro-preview-0409    Google        66.58
mistral-medium                 MistralAI     58.84
Meta-Llama-3-70B-Instruct      Meta          54.32
Mixtral-8x22B-Instruct-v0.1    MistralAI     51.45
dbrx-instruct                  Databricks    27.17
gpt-3.5-turbo-0125             OpenAI        16.06

The EQ-Bench Judgemark leaderboard, showing performance on the Judgemark task, with a snapshot of the score values as of April 26, 2024.

Among the various LLMs evaluated on the Judgemark task, the Nebula LLM stands out with a score of 76.63, surpassing all other leading models. When compared to Claude 3 Opus (75.23), GPT-4 (70.43), Gemini 1.5 Pro (66.58), Llama 3 70B (54.32), and other models, Nebula demonstrates a stronger ability to assess creative writing and provide nuanced analysis. 

This state-of-the-art performance is made possible by the Nebula model’s training regimen, which includes a significantly higher proportion of human conversation data than other, larger models. This breakthrough highlights the potential for more emotionally intelligent applications, such as chatbots and copilots, built on the Nebula LLM’s advanced understanding and modeling of human emotions.

Implications and Future Directions

The impressive showing of the Nebula LLM on the Judgemark task has significant implications for the future of artificial intelligence and natural language processing. With improved emotional reasoning capabilities, LLMs can enhance various sectors that involve close interaction with humans, including customer service, sales, healthcare, and education. As evaluation methods like EQ-Bench and Judgemark continue to be refined, we move closer to creating AI systems that truly understand and respond to human emotions in as natural a way as another human might. By focusing on emotional intelligence and reasoning, we can create more human-like AI systems that better understand the nuances of natural communication.

FAQs

What is EQ-Bench?
EQ-Bench is a benchmark designed to evaluate the emotional intelligence of Large Language Models (LLMs). It focuses on assessing the ability of LLMs to understand complex emotions and social interactions by predicting the intensity of emotional states of characters in a dialogue.

How does EQ-Bench differ from traditional LLM benchmarks?
Traditional LLM benchmarks often focus on basic language tasks and evaluate only the output generated by models. EQ-Bench, on the other hand, specifically targets emotional understanding (EU), which is the ability to comprehend and interpret complex emotions in social contexts.

What is emotional intelligence?
Emotional intelligence (EI or EQ) is defined as “The ability to monitor one’s own and others’ feelings, to discriminate among them, and to use this information to guide one’s thinking and action.” It involves perceiving, using, understanding, and managing emotions effectively.

Why is emotional intelligence important for LLMs?
Emotional and social understanding are crucial for LLMs as they primarily interact with humans through natural language conversations. By incorporating EI, LLMs can enhance their ability to comprehend and respond to the complexities and nuances of emotional interactions, making their output more contextually appropriate and human-like.

How does EQ-Bench assess the emotional intelligence of LLMs?
EQ-Bench asks LLMs to predict the intensity of emotional states of characters in a dialogue. This interpretative approach focuses on emotional understanding and measures the ability of LLMs to interpret and predict the magnitude of different emotions without the need for human judges.

What emotions does EQ-Bench measure?
The emotions measured by EQ-Bench include surprise, confusion, anger, forgiveness, offense, empathy, confidence, and dismissiveness. These emotions are selected to include a range of obvious and nuanced emotions, requiring LLMs to demonstrate a deep understanding of emotional nuances.

What is the Judgemark task?
The Judgemark task is a part of the EQ-Bench benchmark. It evaluates the ability of a model to act as a judge of creative writing by assigning scores to pre-generated outputs from test models. This approach eliminates the biases that arise from judging models based on limited instances of their own output.

How is the Judgemark task related to conversational language?
The Judgemark task is an effective metric for assessing a language model’s ability to understand and generate natural-sounding conversational language. This is because creative writing and conversation share similar narrative and expressive qualities, such as the use of diverse narrative structures, storytelling techniques, and figurative language.

How does the Judgemark task eliminate biases?
The Judgemark task eliminates biases by inverting the traditional paradigm. Instead of judging the output of the LLM, it evaluates the LLM’s ability to act as a judge of creative writing outputs from other models. This approach assesses the LLM’s understanding of emotional nuances in a more comprehensive and unbiased manner.

What are the specific metrics evaluated in the Judgemark task?
The Judgemark task evaluates several metrics, including correlation with EQ-Bench scores (EQB-Corr), correlation with LMSys Arena ELO (Arena-Corr), standard deviation of scores (discriminative power), and bias statistics such as self-bias and family bias. These metrics provide a holistic evaluation of the LLM’s ability to understand and judge creative writing outputs.

What are the implications of Nebula LLM’s performance on the Judgemark task?
The Nebula LLM’s high score on the Judgemark task highlights its advanced understanding and modeling of human emotions. This indicates that Nebula has the potential to enhance various sectors that involve close interaction with humans, such as customer service, sales, healthcare, and education.

The post Emotional Intelligence in LLMs: Evaluating the Nebula LLM on EQ-Bench and the Judgemark Task appeared first on Symbl.ai.

A Guide to Transformer Architecture https://symbl.ai/developers/blog/a-guide-to-transformer-architecture/ Mon, 22 Apr 2024 19:59:38 +0000 https://symbl.ai/?p=32499 Transformers were introduced in the now seminal paper Attention is All You Need (2017) by Vaswani et al, a research team from Google and the University of Toronto. Though initially developed for machine translation tasks, transformers have since provided the foundation for a variety of large language models (LLMs) and other machine learning models and […]

The post A Guide to Transformer Architecture appeared first on Symbl.ai.

Transformers were introduced in the now seminal paper Attention is All You Need (2017) by Vaswani et al, a research team from Google and the University of Toronto. Though initially developed for machine translation tasks, transformers have since provided the foundation for a variety of large language models (LLMs) and other machine learning models and have been applied to a wide range of real-world use cases.

In this guide, we take an in-depth look at the transformer architecture, including its core components, what distinguishes it from its predecessors, and how it works. 

What Is the Transformer Architecture?

A transformer is a type of neural network architecture capable of learning context and relationships from sequential data such as text. This makes it applicable to a wide range of natural language processing (NLP) tasks such as:

  • Machine translation
  • Classification tasks
  • Question answering (QA)
  • Text generation
  • Text summarization
  • Sentiment analysis
  • Conversational agents, i.e., chatbots
  • Semantic search 

Although there are many variations of the transformer architecture, one of the most common ways to classify transformers is as encoder-decoder, encoder-only, and decoder-only. 

  • Encoder-Decoder Transformers: also called sequence-to-sequence transformers, they encode an input sequence and decode it into an output sequence. The best examples of encoder-decoder transformers are the original transformer model and the Text-to-Text Transfer Transformer (T5) model. 
  • Encoder-Only Transformers: these models only encode input and do not undertake decoding. The best examples of encoder-only transformers are Bidirectional Encoder Representations from Transformers (BERT), also developed by Google, as well as its many variations like RoBERTa. 
  • Decoder-Only Transformers: in contrast to encoder-only models, decoder-only transformers specialize in decoding input into output. Prolific examples of decoder-only transformers are the Generative Pre-trained Transformer (GPT) family of models by OpenAI, e.g., ChatGPT.
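For readers who want to experiment with the three families, the sketch below loads one publicly available checkpoint of each type using the Hugging Face transformers library (assumed to be installed); the checkpoint names are illustrative examples, not an endorsement of specific models.

```python
# Loading one public checkpoint from each transformer family with the Hugging Face
# transformers library (assumed installed). Checkpoint names are illustrative.
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

encoder_only = AutoModel.from_pretrained("bert-base-uncased")         # encoder-only (BERT)
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")           # decoder-only (GPT family)
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # encoder-decoder (T5)

# Example: use the encoder-decoder model for translation.
tok = AutoTokenizer.from_pretrained("t5-small")
inputs = tok("translate English to German: The house is wonderful.", return_tensors="pt")
outputs = encoder_decoder.generate(**inputs, max_new_tokens=20)
print(tok.decode(outputs[0], skip_special_tokens=True))
```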

In this guide, we are going to focus on the encoder-decoder transformer. 

Shortcomings of Previous Neural Network Architectures

To better appreciate the capabilities of the transformer architecture, let us now take a look at two of its prominent predecessors, recurrent neural networks (RNNs) and long short-term memory (LSTM). 

  • Recurrent Neural Networks (RNNs) 

    An RNN processes input sequences a token at a time in cyclic iterations. The network’s input layer receives the first token of the input, which is then passed to hidden internal layers that process and output it for the next iterative step. This output, along with the next token from the sequence, is fed back into the neural network – so the output at every step is dependent on previous outputs as well as the current input. This process is repeated for every token in the input prompt. 

    Additionally, the RNN maintains a hidden state – in the form of a vector that stores the context and dependencies between the tokens it has learned so far – effectively acting as the network’s memory.
  • Long Short-Term Memory (LSTM)

    An LSTM is a type of RNN that improves upon the conventional memory mechanism through cell states. These allow an LSTM to selectively recall or forget particular aspects of previous input according to their importance.

    A cell contains three gates that each store a value between 0 and 1, signifying the extent to which information should be “let through” the gate and passed on to the next cell:
     
    • Forget gate: indicates what current state information can be forgotten
    • Input gate: what new information should be added to the state  
    • Output gate: what information stored in the current state should be output
  • Each cell takes a token, the previous cell state, and the previous cell’s output to generate a new cell state and an output.

Both RNNs and LSTMs have two main drawbacks:

  • They process input sequentially: each step in the process depends on the previous ones, resulting in longer training and inference times. Plus, this does not make efficient use of GPUs – which are designed for parallel computation.
  • Inability to handle long-term dependencies: this is where the network becomes less effective at keeping track of data points that are far apart in an input sequence; generally, the longer the input sequence, the higher the chance that contextual information is lost.

    Although LSTMs are designed to mitigate this problem, they only do so up to a point – they still often fail to retain context over longer input sequences. The probability of retaining the context of a token positioned far away from the current token decreases exponentially with the distance between them – due to the vanishing gradient problem, i.e., gradients becoming increasingly smaller during backward propagation.

In contrast, the transformer’s self-attention mechanism allows it to process input sequences simultaneously in parallel, resulting in faster training and inference. Consequently, the transformer architecture makes efficient use of a GPU’s processing abilities and is more scalable than its predecessors – because you can add more GPUs to increase computational power. 

Secondly, the positional encoding mechanism within the transformer tracks the position of each token – eliminating the need for recurrence or hidden state vectors. This makes it easier for the network to handle longer-range dependencies, which allows for larger context windows. 

Components of the Transformer Architecture

Embedding Layer

This is where input enters the transformer, which breaks it down into tokens, i.e., around four characters or 0.75 words per token on average, and turns them into numerical representations called embeddings that the model is better able to understand and process.

Positional Encoder

This adds information to each token’s embedding to indicate its position within the sequence – without recurrence or maintaining an internal state. This is typically achieved by using an alternating set of sine and cosine functions to generate a unique positional signal for each token. Sine and cosine functions are well-suited to this purpose because they repeat their patterns over a regular interval, which is ideal for capturing sequential relationships, while being perpendicular to each other – preventing overlap.  

Self-Attention Mechanism 

The transformer’s self-attention component systematically compares token embeddings against each other to determine their similarity and relevance. This results in a weighted representation of the input which captures the appropriate patterns and relationships between the tokens, which the transformer can use to calculate the most probable output.

Both the encoder and decoder feature self-attention mechanisms, with the encoder containing a single self-attention layer and the decoder containing two such layers. 

Encoder 

The encoder’s purpose is to take the input sequence and convert, or encode, it into a weighted embedding that the decoder can use to generate output. 

As opposed to a single encoder, transformers contain several encoders in a stack – with the original transformer featuring a stack of six encoders, for example. This increases the transformer’s efficacy, as each encoding layer captures different aspects of the input to enhance its understanding and, subsequently, the model’s predictive capabilities. 

Decoder

The decoder takes the weighted embedding output by the encoder, generates the most probable output tokens, and decodes them into readable output. 

Like the encoder, the transformer architecture contains a stack of decoders – mirroring the number of encoders, e.g., six in the original design.  

How Does the Transformer Architecture Work?

Let us now turn our attention to how the transformer architecture works in more detail. 

In short, an encoder-decoder transformer architecture works by the encoder taking a given input sequence, converting each of its tokens into a numerical representation, i.e., an embedding, and producing a weighted representation of the whole sequence. This is then passed to the decoder, which uses it to generate output as a series of embeddings before they are ultimately converted into text.

This process encompasses the following steps:

  1. Generation of Input Embeddings

    The input prompt is fed into the encoder which tokenizes it and converts it into a series of embeddings.  
  2. Addition of Positional Encodings

    The transformer generates positional encodings and adds them to the input embeddings for each token to provide information about their position within the sequence. 
  3. Multi-Head Self-Attention

    The next stage is the self-attention layer in which the encoder develops an understanding of the input sequence and assigns each token an attention score, i.e., how much importance a token should receive. This process is referred to as multi-head attention because the attention mechanism features multiple heads that enable the encoder to process different parts of the input sequence in parallel – increasing the model’s capabilities and speed of training and inference.

    This part of the process is divided into several sub-stages: 

i. Calculation of Queries, Keys, and Values

First, each embedding is projected into three separate vectors: a query, a key, and a value. 

  • Query: akin to a question that each token asks itself, i.e., what the current token is looking for in other tokens to gain a better understanding of its own context.
  • Key: this provides information about a token that helps other tokens in the sequence better understand it – and how relevant the current token is to them.
  • Value: the actual content or meaning of the token that other tokens in the sequence will use to update their own embeddings.

This enables the transformer to compare each token against every other token to determine its context and relative importance within the input sequence. The queries, keys, and values are calculated through linear transformations using parameters learned during the model’s training. Rather than operating sequentially, each head in the encoder performs its attention computations on the queries, keys, and values in parallel.

ii. Creation of Score Matrix 

The encoder creates a score matrix of each token by taking the dot product of its query and the key of every other token in the sequence, i.e., multiplying them together. The score matrix determines the level of emphasis each token should place on other tokens: the higher the score, the greater the emphasis.

Additionally, the score matrices are scaled down – by dividing them by the square root of the dimension of the query and key embeddings. This results in more stable gradients, as the dot products have the potential to be high in some cases. 

iii. Application of Softmax Function

A softmax function is applied to each matrix to produce a set of attention scores that add up to 1. This distributes the attention among all the tokens and makes it easier to compare their relative importance. For instance,  if a token has a softmax score of 0.6 and another has a score of 0.2, the first token is three times more significant than the second. 

iv. Multiplying Softmax Attention Scores with Value Embeddings

The softmax-adjusted attention scores are multiplied by the token’s value to create an output embedding – and fed into a final linear transformation layer for further refinement. 

Finally, the outputs of all attention heads are concatenated to produce a combined embedding that represents the entire input sequence.
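Putting sub-stages i–iv together, the following NumPy sketch computes scaled dot-product attention for a single head; a full multi-head implementation would run several of these in parallel and concatenate their outputs, and real models use learned projection matrices rather than the random Q, K, and V used here for illustration.

```python
# Scaled dot-product attention for a single head, following the steps above:
# scores = Q.K^T / sqrt(d_k), softmax over the scores, then a weighted sum of V.
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # score matrix, scaled for stable gradients
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax: each row sums to 1
    return weights @ V                                     # attention-weighted values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))    # 5 tokens, one 16-dimensional head
print(scaled_dot_product_attention(Q, K, V).shape)        # (5, 16)
```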

  4. Normalization and Residual Connections

    After the self-attention layer, the input passes through a normalization layer to ensure the embeddings fall within a reasonable range. This helps to stabilize the model and expedite the training process by preventing very small or very large gradients, i.e., vanishing or exploding gradients.

    Vanishing gradients are problematic as they often result in small updates to the model’s parameters during training – which prolongs the process. This is especially the case in neural networks with many layers, as gradients tend to diminish during backwards propagation, i.e., the model correcting its parameters through its loss function. So, the more layers between the output and input nodes, the greater the potential for vanishing gradients.

    Conversely, exploding gradients cause overly drastic changes to the model during training and prevent it from converging on the optimal output. Additionally, if gradients grow too large, they can result in overflow errors as the model attempts to save them to memory – halting training entirely. Both vanishing and exploding gradients can cause underfitting or overfitting, where the model exhibits poor performance on the training data and/or evaluation datasets.

    Additionally, the encoder and decoder feature residual connections that feed the output of one layer into the input of another, so data can flow more efficiently through a neural network – particularly those with many layers. These connections enable the neural network to learn to predict the difference, or residual,  between the input and corresponding output – instead of the output itself. Like normalization, this helps to mitigate the vanishing gradient problem and enables faster and more effective training.

    As well as after the multi-headed attention layer, this process also takes place after the feed-forward layer, before the input is passed to the decoder. 
  5. Feed-Forward Network

    After passing through the self-attention mechanism and being normalized, the input reaches the feed-forward network. The purpose of this step is for the model to capture the input sequence’s higher-level features so it can learn more intricate relationships from the data. 

    It is composed of three layers:
    • Linear Transformation: each token is multiplied by a weight matrix and added to a bias vector (both learned through training) allowing it to better fit the data and learn its more complex underlying relationships.
    • Activation Function: this introduces non-linearity into the network, further enabling it to model complex patterns that mirror real-world relationships – which are not simply linear. The most commonly used activation function within transformer architectures is the Rectified Linear Unit (ReLU), which works by directly outputting the input when it is a positive value while outputting zero if it is negative – creating a non-linear relationship between input and output. 
    • Linear Transformation: similar to the first layer transformation, but with its own set of weights and biases.
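The residual-plus-normalization pattern (step 4) and the position-wise feed-forward block (step 5) can be combined into a few lines of PyTorch. This is a simplified sketch: the dimensions are the 512/2048 defaults from the original paper, and a real encoder layer wraps the attention sublayer in the same way and adds dropout.

```python
# Simplified PyTorch sketch of the residual + layer-normalization pattern around
# the position-wise feed-forward block of an encoder layer.
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # first linear transformation
            nn.ReLU(),                  # non-linear activation
            nn.Linear(d_ff, d_model),   # second linear transformation
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the block learns the difference between input and output.
        return self.norm(x + self.ffn(x))

x = torch.randn(2, 10, 512)            # batch of 2 sequences, 10 tokens each
print(FeedForwardBlock()(x).shape)     # torch.Size([2, 10, 512])
```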
  6. Decoder 

    Following its conversion into a weighted numerical representation, the input is passed to the transformer’s decoder, which uses it to generate the appropriate output sequence.

    Much of the decoder’s workflow mirrors that of the encoder – with a few key differences, as outlined below:
      
    • Create Output Embeddings: the decoder takes the output sequence generated so far (during training, the target sequence shifted one position to the right), tokenizes it, and converts it into embeddings; the final encoder layer’s output is used later, in the cross-attention layer.
    • Output Positional Encoding: positional encodings are added to the output embeddings to incorporate data about their position within the sequence.
    • Self-Attention: in contrast to the encoder, the decoder features two self-attention layers:
      • Masked Multi-Head Attention: similar to the self-attention mechanism within the encoder, with the addition of causal masking – which prevents each token from attending to tokens that come after it (a short masking sketch follows this list).  
      • Encoder-Decoder Multi-Head Attention: also known as cross attention, each token in the output sequence calculates attention scores against all tokens in the input sequence. This allows the decoder to establish relationships between the input and output tokens. More specifically, the output of the preceding masked self-attention layer provides the queries, while the encoder’s output provides the keys and values. 
    • Normalization and Residual Connections: these appear in the decoder three times: after each attention layer and after the feed-forward network.  
    • Feed-Forward Network: the output passes through a feed-forward layer to introduce non-linearity to the output.
    • Output Projection: the refined output from the previous layers is projected into a vector whose size equals the number of output possibilities, i.e., the vocabulary of the output language. 
    • Output Probability Calculations: the projected output is fed into a softmax function to convert the attention scores into probabilities. The token with the highest probability for each position in the sequence is selected as output. 
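To see what the causal masking mentioned above does in practice, here is a small NumPy sketch: positions above the diagonal of the score matrix (i.e., future tokens) are set to negative infinity before the softmax, so they receive zero attention weight. The scores themselves are random placeholders.

```python
# Causal masking: each decoder position may only attend to itself and earlier positions.
import numpy as np

seq_len = 4
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))    # raw Q.K^T scores (placeholder)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)          # True above the diagonal (future tokens)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                        # softmax row by row
print(weights.round(2))   # the upper triangle is all zeros: no attention to future tokens
```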

What are the Limitations of the Transformer Architecture?

Despite its many advantages, the transformer architecture isn’t perfect and still has its shortcomings.  

  • Limited Context Length: while transformers effectively mitigate the long-term dependency issues exhibited by RNNs and LSTMs, they fail to do so completely. When the context length, i.e., the maximum size of the input, grows past a certain point, transformers still struggle with recalling relevant information in the middle of the sequence.
  • Large Resource Requirements: the transformer’s computational complexity makes it resource-intensive, requiring large amounts of memory and storage. In contrast to RNNs, for which computational demands scale linearly with the length of an input sequence, the nature of the self-attention mechanism means that memory requirements scale quadratically with increasing sequence lengths. Additionally, the larger the neural network, the harder it is to deploy on resource-constrained devices – making its use increasingly less feasible. 
  • Longer Training Times: the complexity of the transformer architecture results in longer training times than its predecessors. Transformers also require large, labeled datasets to ensure effective training.
  • Lack of Transparency: transformer models are often described as “black boxes” because it is difficult to interpret their internal reasoning and explain how they arrived at certain predictions. 

Conclusion

The development of the transformer architecture represented a landmark moment in the field of machine learning and laid the groundwork for many subsequent innovations in the field of AI.  However, as powerful as transformers have proven to be, it is likely that we are only scratching the surface of their potential. 

Addressing the shortcomings of the transformer architecture is one of the key focuses of AI researchers. Considerable – and encouraging – efforts have been made to make them less computationally demanding, e.g., quantization,  and to effectively extend their context length without sacrificing accuracy.  

More notably, significant research is being devoted to improving the self-attention mechanism itself – with a solution that scales sub-quadratically with sequence length. This will be a significant breakthrough that, in mitigating the limitations of the transformer architecture, will open the door to a vast range of possibilities in AI.

The post A Guide to Transformer Architecture appeared first on Symbl.ai.

A Guide to Overfitting and Underfitting https://symbl.ai/developers/blog/a-guide-to-overfitting-and-underfitting/ Wed, 03 Apr 2024 23:25:40 +0000 https://symbl.ai/?p=32449 The predictive capabilities of AI models have progressed rapidly in the last few years and continue to push new boundaries. However, there are two issues that still persistently plague AI developers and researchers: overfitting and underfitting. Moreover, as the field progresses and models are tasked with processing increasingly complex, high-dimensional datasets (as with multi-modal models […]

The post A Guide to Overfitting and Underfitting appeared first on Symbl.ai.

The predictive capabilities of AI models have progressed rapidly in the last few years and continue to push new boundaries. However, there are two issues that still persistently plague AI developers and researchers: overfitting and underfitting. Moreover, as the field progresses and models are tasked with processing increasingly complex, high-dimensional datasets (as with multi-modal models for instance), the more likely overfitting and underfitting become. It is thus crucial to understand these challenges and develop strategies to mitigate them. 

In this guide, we explore the concepts of overfitting and underfitting: the reasons they occur, how to diagnose instances of both, and techniques to mitigate them to enhance a model’s performance.

Explaining Bias and Variance

To best explain the concepts of overfitting and underfitting, we first need to understand two concepts that are central to them – bias and variance. 

Bias is a statistical concept that, in machine learning, denotes the systematic error a model makes because of overly simple or incorrect assumptions – the consistent difference between its predicted output and the correct output label. The amount of bias present in a model depends on the number of incorrect assumptions made during training – the more assumptions made, the higher the potential bias. Bias can occur for several reasons, such as an overly simplistic model or using a dataset that is too small and/or lacks sufficient variety. 

Variance, meanwhile, measures how sensitive an AI model is to changes in its training data. The greater the difference between a model’s performance on its training dataset and subsequent (test) datasets, the higher the variance. Variance can be caused by several factors, including a model that is too complex and non-optimal feature selection.

What Is Overfitting and How Does It Occur?

Overfitting describes a situation in which a model consistently makes accurate predictions on its training dataset, but fails to perform as well on testing data.

For a real-world analogy of overfitting, picture a student preparing for an exam with a comprehensive set of practice questions and answers. The student studies the practice questions so thoroughly that they come to memorize the answers. However, when they eventually take the exam and are faced with a set of unfamiliar questions, they perform poorly – because they studied and memorized the answers, but not the methodology or reasoning behind how the answers were generated. 

In the same way,  overfitting occurs when a model understands how to produce the correct outputs based on the training data – even going as far as to capture its noise and outliers – but doesn’t learn the relationship between the data points that produced those outputs. 

Overfitting is caused by the combination of a model having low bias and high variance – which means an overfitted model makes few assumptions and is very sensitive to changes in its training data. 

How Can You Tell if a Model Is Overfitted?

To diagnose whether a model is overfitted, you must pay attention to two key metrics: training error rate and testing error rate. The model’s training error rate measures how often it makes inaccurate predictions on its training dataset while the testing (or generalization) error rate measures its performance on a separate testing or validation dataset.

Subsequently, you can tell if overfitting is occurring if the model consistently produces a low training error rate and a higher testing error rate, i.e., that it performed better on its training dataset than the testing dataset. Another indicator of overfitting is instability: if the model shows a disproportionately large difference in performance in response to small changes in the training data, i.e., high variance. 
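The sketch below illustrates this diagnosis on synthetic data using scikit-learn (assumed available): as the polynomial degree grows, the training error keeps shrinking while the testing error stops improving or worsens, signalling overfitting. The data and degrees are arbitrary choices for illustration.

```python
# Diagnosing overfitting by comparing training and testing error on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)   # noisy underlying relationship
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
# A large gap between train and test MSE (typical at high degree) signals overfitting.
```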

What Are the Implications of Overfitting?

To explore the effects of an overfitted model, let us look at an example featuring the width of a bird’s wingspan compared to its weight. 

In most instances, the size of the wingspan will positively correlate with its weight, i.e., the larger the bird, the wider its wingspan. However, this won’t always be the case due to outlying data points for which a smaller species of bird has a wider wingspan. If the model overfits to the training data, including the outliers, it won’t accurately capture the relationship within the dataset, and it will produce inaccurate predictions with subsequent evaluation sets. 

The implication of this is that, despite showing early promise, the model is too inconsistent to apply to real-life use cases. Although it may perform well on new data that is similar to the data it was trained on, the model is too unpredictable to be relied upon for a wider range (or distribution) of unseen data.

What Is Underfitting and How Does It Occur?

In contrast to overfitting, underfitting refers to a situation in which a model shows poor predictive abilities for both its training data and testing data.

Returning to our student analogy, underfitting is akin to a student who may have studied the practice questions and answers – but failed to comprehend them very well. Consequently, when they take their exam and are faced with new questions, they perform poorly – and would likely have performed poorly even if the questions were all already part of the practice set –  because they are not familiar with the subject at all. 

Similarly, an underfitted model – whether due to its simplicity, a lack of training time, or an insufficient dataset – is unable to identify and learn the relationship between the input data and the target output labels, resulting in inconsistent or poor predictive capabilities. 

Underfitting is caused by high bias and high variance, which means the model makes too many assumptions and has a high sensitivity to changes in the training dataset. 

How Can You Tell if a Model Is Underfitted?

When a model is underfitted, it will exhibit both a high training error rate and a high testing error, reflecting the twin facts that it hasn’t learned the relationships between the training data, and cannot generalize on previously unseen data in the validation set.

Since an underfitted model is characterized by poor predictive ability on its training data, it is typically easier to diagnose than an overfitted model. 

What Are the Implications of Underfitting?

Returning to our example dataset that measures a bird’s wingspan with respect to its weight: an underfitted model will fail to fully capture the relationship, and will make inconsistent predictions as a result. 

Unfortunately, the implications of this are that the model is even less suitable for real-world use than the already inconsistent overfitted model. Furthermore, as underfitting is caused by both high bias and high variance, it will likely take more work to mitigate than an overfitted model, where variance is the main prohibitor of performance. 

Overfitting and Underfitting: How to Achieve Balance Between the Two

To maximize a model’s predictive abilities across a wide range of data and ensure its reliable performance in real-world use cases, one must strike the balance between a model being overfitted and underfitted. To achieve this, we must  aim for the model to exhibit low bias and low variance. 

This is possible by systematically employing various techniques designed to reduce bias and/or variance, respectively, until the model’s training and testing error rates are comparable. We explore some of those techniques in greater detail below.

Mitigating Overfitting

To address an overfitted model, we need to lower variance. Here are some of the main ways to do so: 

  • Increase the size of the training dataset
    If a dataset is too small, it will lack the required diversity to sufficiently represent its underlying data distribution. This could result in overfitting – as the model tries to map a probabilistic relationship on to the limited number of available data points. In contrast, by providing the model with a larger dataset, it has access to a greater variety of data with which to learn the relationships between data points – and will pay less attention to details specific to just the training dataset.
  • Improve the quality of the training dataset
    In addition to  its quantity, the quality of the training data plays a considerable role in how well a model can identify underlying relationships in the data. Ways of increasing the quality of training data include:
    • Removing inaccurate data: a model’s output can only be as accurate as its input,  so it is crucial to remove incorrect data from the training corpus. This prevents the possibility of the model learning from errors within the training data and making inaccurate predictions based on them.
    • Ensuring the training data is representative of the broader dataset: the training data must be diverse and unbiased so it accurately represents the data involved in the use case to which the model will be applied. If not, the model may struggle with generalizing to previously unseen data points.
    • Handling outliers: instances that vary considerably from the majority of the data can cause a model to overfit if it attempts to perfectly map them to a probabilistic relationship. This is because the model is capturing the noise in the data instead of genuine patterns.

      On the other hand, however, removing all outliers might prevent the model from learning aspects of the underlying relationship or being able to generalize successfully. With this in mind, one must decide how representative the outliers are of instances in the overall dataset, and whether to remove them. Similarly, it is prudent to remove varying amounts of outliers (perhaps based on their deviation from the mean) and observe if the model performs better on the test dataset.
  • Remove irrelevant features
    If the training data contains an excessive amount of features, i.e., a measurable characteristic from which to predict output, it can cause the model to make inaccurate predictions due to unnecessary complexity.
  • Fewer training epochs
    If a model is trained for too long, it could start to learn the training data itself instead of the underlying distribution – resulting in overfitting. Instead, one must identify the optimal number of training epochs by monitoring the training and validation error rates and stopping when they are comparable, i.e., when the model finds the balance between its accuracy with the training data and test data.
  • Regularization
    This refers to a series of techniques that reduce a model’s complexity by reducing or nullifying the influence of particular parameters, or weights. Since overfitting often makes a model exaggerate the importance of certain weights, regularization lessens their significance on the output, resulting in better results with testing datasets. 

    Common methods of regularization include: 
    • L1 regularization: also called lasso regularization, this involves adding the sum of the absolute values of the model’s weights to the loss function, i.e., the quantified error between the predicted and actual output.
    • L2 regularization: also referred to as ridge regression, this adds the sum of the squared values of the weights to the loss function (squaring accounts for negative values).

      Both L1 and L2 include the use of an alpha coefficient, a non-negative value (often tuned between 0 and 1) that determines the strength of regularization. The smaller the alpha, the weaker the penalty on large weights and the greater the chance of continuing to overfit; the larger the alpha, the stronger the penalty and the greater the risk of underestimating the importance of weights and causing underfitting.

      The difference between L1 and L2 regularization is that L1 reduces the value of certain weights to zero, creating a more sparse model that lends itself well to feature selection. Alternatively, L2 regularization reduces the value of weights more uniformly, which is better for models and networks with codependent features. However, the two methods can be combined, in a process called elastic net regularization, to benefit from their respective benefits. 
    • Dropout regularization: during each training pass, every unit (neuron) has a probability, p, of being temporarily dropped, so its activation does not factor into the output for that pass. This prevents the network from relying too heavily on any single unit.
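For reference, the sketch below shows what these regularizers look like in code, using scikit-learn’s linear models for L1/L2/elastic net and PyTorch for dropout; the alpha, l1_ratio, layer sizes, and dropout probability are illustrative values, not recommendations.

```python
# L1 (lasso), L2 (ridge), and elastic net regularization via scikit-learn, plus dropout
# via PyTorch. All hyperparameter values here are illustrative.
import torch.nn as nn
from sklearn.linear_model import ElasticNet, Lasso, Ridge

l1_model = Lasso(alpha=0.1)                        # L1: drives some weights exactly to zero
l2_model = Ridge(alpha=0.1)                        # L2: shrinks all weights more uniformly
combined = ElasticNet(alpha=0.1, l1_ratio=0.5)     # elastic net: blends the two penalties

# Dropout: each unit is dropped with probability p during training, so the network
# cannot over-rely on any single unit.
layer_with_dropout = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.3))
```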

Mitigating Underfitting

To address an underfitted model, variance needs to be lowered; therefore, some of the mitigation methods overlap with those for overfitting. However, the model’s bias also needs to be lowered. Here are a few ways of accomplishing both objectives.  

  • Increase the size of the training dataset
    If the training dataset is too small for the model to learn its patterns, this will result in underfitting. Consequently, a larger training set gives the model more opportunities to learn the relationships between data points and make more accurate predictions.

    Additionally, enlarging the dataset increases the diversity of data – helping to decrease inherent bias. As a result, the model itself will feature less bias, which is a key factor in mitigating underfitting.
  • Improve the quality of the training dataset
    Just as noisy or uncleaned data can cause overfitting, it can also cause a model to underfit. Removing inaccurate instances from the dataset is the most fundamental way to enhance its quality and potentially mitigate underfitting.

    Additionally, one must decide how to proceed with outlying data with respect to how accurately they reflect deviations from the mean in the broader context of the use case. On the one hand, removing the noise can prevent underfitting; while conversely, it may result in the model failing to adequately learn the true extent of the relationships within the data. 
  • Increase the model’s complexity
    If a model isn’t sophisticated enough, it may not be capable of identifying patterns within the dataset, leading to underfitting. By increasing the number of layers within its architecture, and/or the number of nodes in each layer, the model will have more parameters with which to better frame its training data, resulting in greater accuracy. 
  • Improve Feature Selection
    Similarly, a model may underfit because the input features of the training data aren’t descriptive or expressive enough for the model to consistently predict the correct output label. With this in mind, adding more features and/or making certain features more prominent will help increase the model’s complexity and yield more consistently correct predictions. 
  • More Training Epochs
    In the same way that an insufficient dataset can prevent a model from determining underlying patterns, training a model for an insufficient amount of time can have the same effect. Therefore, simply increasing the number of epochs could help mitigate underfitting and enhance a model’s performance. 
  • Decrease regularization
    While regularization can mitigate overfitting, if applied too rigorously, i.e., with too high an alpha coefficient, the data can become too uniform, preventing the model from identifying underlying patterns.  Thus, by decreasing the level of regularization, one can increase the model’s complexity and prevent underfitting.

Conclusion

Overfitting and underfitting are two of the most consistent challenges in developing AI models: fortunately, there is increasing awareness about effective strategies to solve these issues.  

First and foremost, with the size and quality of a model’s training data being so crucial to a model’s predictive capabilities, AI researchers and vendors are making a stronger effort to ensure their data is sourced and cleaned well. With progress in the application of AI to various tasks, it has become easier to obtain higher-quality datasets – this already goes a considerable way towards mitigating overfitting and underfitting.

Additionally, as model monitoring tools become more sophisticated, it becomes easier to diagnose instances of overfitting or underfitting and pinpoint the most effective method of addressing its cause. This reduces the amount of trial and error involved in finding the optimal balance between bias and variance, and expedites the process of developing accurate and more performant AI models. 

The post A Guide to Overfitting and Underfitting appeared first on Symbl.ai.

 A Guide to AI Agents https://symbl.ai/developers/blog/a-guide-to-ai-agents/ Mon, 25 Mar 2024 22:33:02 +0000 https://symbl.ai/?p=32352 While the concept of artificial intelligence has been around for decades – and its use has been widespread in various basic forms for years – the release of ChatGPT in November 2022 was the first time the world caught a glimpse of the promise of large language models (LLMs).  AI agents – particularly the new […]

The post  A Guide to AI Agents appeared first on Symbl.ai.

While the concept of artificial intelligence has been around for decades – and its use has been widespread in various basic forms for years – the release of ChatGPT in November 2022 was the first time the world caught a glimpse of the promise of large language models (LLMs). 

AI agents – particularly the new crop of such agents, based on very large machine learning (ML) models – take us a step further towards that reality. With these agents being one of the areas of most interest to AI vendors and researchers, their promise and potential to transform how we view AI and technology as a whole is now being realized. 

With that in mind, this guide explores the concept of AI agents, how they work, their benefits and potential applications, and possible future trends for this exciting new application of the latest AI technology. 

What Is an AI Agent?

Sometimes also referred to as autonomous AI agents, an AI agent is an application or system capable of executing a given task without ongoing direct human intervention. When assigned an objective, an AI agent will perceive its environment, assess the tools at its disposal, and formulate a plan to achieve its given goal. 

As examples, in a professional setting, you could instruct an AI agent to find a list of suppliers, email them for quotes, and sort replies according to the best price. In a personal setting, you could instruct an AI agent to create a shopping list based on a recipe, purchase the ingredients online, and have them delivered to you.  

How Does an AI Agent Work?

In essence, an AI agent works through the process of assigning, creating, or inferring an objective, which it will then break down into a series of tasks and attempt to complete. This process can be divided into three stages: task definition and planning, decision-making, and feedback and adaptation. 

  • Task Definition and Planning
    • Define and assign an objective: giving an agent a predefined goal you want it to accomplish.
    • Assign Resources: select the tools and sources of information the agent will be permitted to use to achieve its goal.  
    • Environmental assessment: the agent uses sensors and any other available data sources to collect information about its environment; this helps it understand its given task in greater context, as well as potential obstacles.
    • Plan generation: taking the tools at its disposal into account, the agent devises strategies for achieving its goal, which typically involves breaking down its tasks into subtasks, and potentially also breaking the goal down into subgoals.   
  • Decision-Making
    • Data analysis: analyzing available data – such as environmental sensor readings, past experiences, and the model that powers it, to predict the outcomes of prospective actions it could take.
    • Action execution: the agent sequentially selects and executes the action it has determined will maximize the likelihood of success. 
  • Feedback and Adaptation
    • Performance Monitoring: the agent monitors the outcome of its actions and evaluates if they brought it closer to accomplishing the given objective. 
    • Feedback loop: The agent uses the feedback to adjust its strategy and potential actions to take. If permitted, i.e., if programmed to do so, it can ask for human intervention if it is stuck on a task. 
    • Adaptation and learning: The agent continuously learns from its experiences. It monitors the results of its actions and updates its knowledge base and decision-making processes based on the new information.
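As a toy illustration of this loop, the sketch below wires together planning, action, and a feedback memory in plain Python; every name in it (ToyAgent, plan, act, the email_api tool) is a hypothetical stand-in, and a real agent would delegate planning to an LLM and actions to concrete tools or APIs.

```python
# A toy sketch of the perceive -> plan -> act -> adapt loop described above.
class ToyAgent:
    def __init__(self, objective: str, tools: list[str]):
        self.objective = objective
        self.tools = tools
        self.memory: list[str] = []          # feedback the agent accumulates over time

    def plan(self) -> list[str]:
        # In a real agent, this step would prompt an LLM to break the objective into subtasks.
        return [f"step {i + 1} towards: {self.objective}" for i in range(3)]

    def act(self, task: str) -> str:
        # Placeholder for calling a tool (web search, email API, database query, ...).
        return f"completed '{task}' using {self.tools[0]}"

    def run(self) -> None:
        for task in self.plan():
            outcome = self.act(task)
            self.memory.append(outcome)      # feedback loop: store results to guide later steps

agent = ToyAgent("collect supplier quotes", tools=["email_api"])
agent.run()
print(agent.memory)
```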

The Components of an AI Agent

  • AI Model: recent advances such as an LLM, VLM (vision-language model), or, more recently, LMM (large multi-modal model) can be used as the core of an AI agent, acting as its decision-making mechanism. The model processes data collected by sensors, makes decisions based on that data, and takes actions to achieve the agent’s goals.
  • Sensors: components responsible for collecting data from the agent’s environment so it can “perceive” it accurately and act accordingly. Sensors are an agent’s input devices, enabling it to learn about the world. In software agents, a sensor would be digital interfaces to websites or databases, while in a robotic agent, these include cameras and microphones. 
  • Actuators: these are an agent’s means of output – enabling it to take actions based on its given objective and data collected from its sensors. In software agents, these are components that can control other applications or devices, while for a robotic agent, these could be appendages, i.e., limbs, display screens, or speakers. 

Multi-Agent Systems

A multi-agent system (MAS) is a framework that allows multiple AI agents to collaborate to solve problems and achieve predetermined goals. They offer several benefits over systems consisting of a single AI agent, with the major one being that they are capable of taking on more complex tasks. 

A key reason for this is that agents can learn from the behavior of other agents as well as their environment. Better still, an MAS is scalable – additional agents can be instantiated if the existing system isn’t up to the task, or can’t cope with growing demand.   

Secondly, multi-agent systems are more fault-tolerant to individual agent failures than single-agent systems. This offers higher availability, which is desirable if the agent-powered system represents a critical function. 

Finally, a novel way in which an MAS can achieve its given goal is to have agents cooperate to reach a desired objective. This could involve agents sharing information about the measures they have taken so far, so other agents avoid wasting effort. 

What Are the Benefits of AI Agents?

One of the reasons AI agents are increasing in use is the large range of advantages they offer; let’s look into each in more detail. 

  • Efficient Automation: AI agents can take on repetitive tasks such as FAQs, common requests, batch jobs, etc., without the need for human involvement. This frees up employees to work on more rewarding and value-adding activities. 
  • Improved Decision-Making: agents can analyze vast amounts of data faster and more accurately than a person can, allowing for better data-driven decision-making. 
  • Reduced Human Error: subsequently, the ability to process large amounts of data with greater accuracy means work carried out by agents does not suffer from the mistakes made by humans. 
  • Increased Availability: AI agents are scalable and deployable around the clock. This ensures that services or support are available any time a user requires it. 
  • Increased Safety: autonomous agents can be deployed in dangerous environments, replacing the need for humans and eliminating the risk of injury or catastrophic loss as a result. 
  • Cost Savings: as a result of the automation offered by agents, the workforce is freed from the burden of mundane work, boosting their productivity and optimizing labor costs. 
  • Scalability: As your user base or data grows, more agents can easily be deployed at scale to meet demands.

Types of AI Agents

Simple Reflex Agents

Also known as rule-based agents, these are the most basic type of AI agent and follow a collection of rules that specify an action to perform for a particular predefined condition or “trigger”. Simple reflex agents make decisions based on the current data from their sensors without memory or the capacity to learn. 
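A simple reflex agent can be expressed as little more than a lookup table from conditions to actions, as in the hypothetical thermostat-style sketch below; the rules and action names are invented for illustration.

```python
# A minimal rule-based (simple reflex) agent: a fixed condition -> action table,
# no memory, no learning.
RULES = {
    "temperature_high": "turn_on_cooling",
    "temperature_low": "turn_on_heating",
    "temperature_ok": "do_nothing",
}

def simple_reflex_agent(percept: str) -> str:
    # The agent reacts only to the current percept; past inputs play no role.
    return RULES.get(percept, "do_nothing")

print(simple_reflex_agent("temperature_high"))   # turn_on_cooling
```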

Model-Based Reflex Agents

A model-based reflex agent maintains an internal state that represents aspects of the world, so that it can use its past experiences to make decisions. While such agents still rely on predefined condition-action rules, like simple reflex agents, they can use learned information to make decisions – making them more versatile and capable of operating within more unpredictable environments. 

Goal-Based Agents

Goal-based agents maintain a goal/objective description, and devise strategies to achieve it, evaluating their current state based on their objective and selecting actions that will facilitate the completion of the goal conditions. Since goal-based agents are adaptable, they are well-suited to more intricate or dynamic environments. 

Utility-Based Agents

This type of agent is designed to choose the set of actions that maximizes a defined utility or reward. Subsequently, utility-based agents decide which actions to take in accordance with their objective and the environment, but also based on a reward function that quantifies the desirability of different outcomes or “states”. 

This makes utility-based agents suitable for use cases where multiple goals are in competition and the relative importance of each goal must be considered. An example of this would be an AI-powered investment portfolio application that must account for factors like return, risk, and liquidity when executing trades.
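The sketch below shows this idea in miniature: each candidate action is scored by a weighted utility over return, risk, and liquidity, and the agent picks the action with the highest utility. The actions, values, and weights are all invented for illustration.

```python
# A toy utility-based choice over competing investment actions.
ACTIONS = {
    "buy_stock_A": {"return": 0.9, "risk": 0.7, "liquidity": 0.8},
    "buy_bond_B":  {"return": 0.4, "risk": 0.1, "liquidity": 0.6},
    "hold_cash":   {"return": 0.1, "risk": 0.0, "liquidity": 1.0},
}
WEIGHTS = {"return": 1.0, "risk": -0.8, "liquidity": 0.3}   # risk lowers utility

def utility(action_values: dict) -> float:
    return sum(WEIGHTS[k] * v for k, v in action_values.items())

best_action = max(ACTIONS, key=lambda a: utility(ACTIONS[a]))
print(best_action, round(utility(ACTIONS[best_action]), 2))
```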

Multi-Modal Agents

Emerging alongside large multi-modal models (LMMs), multi-modal agents are capable of autonomously carrying out tasks that require a variety of multiple modalities, i.e., text, audio, images, etc. Multi-modal agents have the means to process multiple forms of input, enabling them to perceive their given environment more accurately than other types of agents. Consequently, they can be applied to a wider number of use cases because they are able to utilize more tools and resources. 

Applications of AI Agents

Let us turn our attention to some of the use cases to which AI agents are currently being applied.  

  • Virtual Assistants: services like Siri, Alexa, and Cortana are actually AI agents that are capable of understanding natural language commands in real-time and performing a wide variety of tasks like researching the answers to questions, ordering items, and controlling internet-connected smart devices.
  • Customer Service Chatbots: used by organizations to automate customer service interactions. In addition to answering common questions, agents reroute queries and escalate more involved issues to human personnel.
  • Recommendation Engines: AI agents can learn user preferences from past behavior to offer personalized recommendations for products, services, or content. Streaming services like Netflix and eCommerce platforms like Amazon are prominent examples of this. 
  • E-Learning Agents: these deliver educational content based on a student’s particular competence and rate of progress. For instance, they will provide additional resources for areas the student finds challenging while skipping over familiar material.
  • Data Analysis and Forecasting: agents can process and interpret vast datasets to identify patterns and anomalies in an effort to predict future events. This is ideally suited for the financial industry that uses such predictive analysis to identify stock market trends, fluctuations, and potential risks – to optimize investment strategies. 
  • Real-time Cybersecurity Monitoring and Alerting: AI agents are capable of monitoring IT infrastructure and spotting potential security breaches around the clock – and with far more accuracy than any single human can.  
  • Diagnosis and Treatment Agents: an agent’s ability to process massive amounts of data and detect patterns is also useful in the medical field. AI agents can help identify medical conditions from a patient’s data – particularly when it comes to analyzing patterns in images as well as in large amounts of suitably anonymized data (across patients). 
  • Robotics: self-driving vehicles and autonomous drones are examples of robotic agents that use sensors and actuators to interact with physical environments. Meanwhile, in sectors such as large-scale manufacturing, industrial robots are used as agents to perform work such as assembly and welding. 
  • Infrastructure and System Monitoring: AI agents help in overseeing industrial infrastructure and systems, detecting anomalies in equipment that could signal an impending failure. 
  • Gaming: Non-player characters (NPCs) in video games are a basic form of agent that simply responds to a user’s actions. However, more advanced agents can add realism to games: a trend that will become more prevalent as we move into Virtual and Augmented Reality environments.

Future Trends in AI Agents

To finish, let us briefly explore a few of the likely advancements and trajectories of AI agents as we head into the future.  

Increased Integration and Prominence Across a Variety of Industries

First and foremost, the use of AI agents will become increasingly widespread as they become more accessible and better understood by organizations.  Additionally, as research into the applications of autonomous agents continues and their capabilities increase, they will play a larger role in an array of sectors. This includes: 

  • Healthcare: agents will assist medical professionals in diagnosis, treatment, and even surgery.
  • Transportation: self-driving vehicles and autonomous drones will become more common.
  • Manufacturing: AI agents will manage entire facilities – including the operation and maintenance of machinery. 
  • Customer service: AI agents will handle most customer service inquiries, making cheap and efficient 24/7 support the status quo. 

Less Use of Individual Applications

Presently, with AI agents in their infancy, we still have to consider which tools and resources to give an agent access to for it to be able to complete its given objective. In the future, however, this may not be the case – as agents will be advanced enough to incorporate a wide variety of common applications in a standardized fashion – perhaps in an “agent app store”. As a result, we may be able to entrust an agent with an objective, composed of a series of tasks, without having to explicitly specify the agents or tools required for each task. In light of this, it is entirely possible that people will stop thinking in terms of apps – and will instead think of agents and assistants. 

More Advanced AI Agents

Inevitably, just like past AI techniques and systems, autonomous agents will become far more powerful and sophisticated – distinguished by their ability to learn and adapt in real time. In particular, there are two additional evolutionary leaps that autonomous agents will most likely make:

  • Theory of Mind: this will lead to agents possessing a level of cognitive understanding, with the ability to interpret emotions and intentions – in both people and other agents. This will create more authentic human-agent interactions, in which an agent can tell if a user is frustrated, happy, angry, confused, etc., and match their conversational style – or tailor recommendations to their mood.
  • Self-Aware Agents: the apex of autonomy, in which an agent understands its environment and is self-aware – enabling it to reflect on its capabilities, assess challenges, and consider what it might need to complete the task at hand. 

Agent Swarms

Similar to the concept of containers in application development, an agent swarm enables the rapid automated deployment of multiple agents. In such a swarm, AI agents will be capable of activating additional agents – of the types best suited to the given objective – assigning them tasks, coordinating their activity, and monitoring their progress. 

Conclusion

It is no exaggeration to state that AI agents are poised to be a revolutionary technology that will expedite the further adoption of AI technology. A key reason for this is that, as a concept, they are easy to understand – abstracting many of the complexities of AI and ML models from the user. Moreover, the multitude of benefits they offer, such as increased productivity and cost savings, is something stakeholders in every industry can understand. 

However, autonomous AI agents can also be a sobering and even frightening prospect. By streamlining so many tasks, jobs previously held by humans may become obsolete – necessitating labor reshuffling or employee reskilling. More importantly, the ethical and security concerns that become increasingly apparent as AI agents grow in sophistication will have to be addressed by AI vendors and researchers, key stakeholders from every industry, and even governments. 

The post A Guide to AI Agents appeared first on Symbl.ai.

]]>
Are Human Conversations Special? A Language Model Perspective https://symbl.ai/developers/blog/are-human-conversations-special-a-language-model-perspective/ Mon, 11 Mar 2024 15:26:20 +0000 https://symbl.ai/?p=32293 In the rapidly evolving field of artificial intelligence (AI), large language models (LLMs) have exhibited remarkable capabilities across a broad spectrum of tasks in a very short time.  These models’ capabilities have taken the world by storm, resulting in large-scale investment into the training and deployment of proprietary and open-source models. However, one nagging question […]

The post Are Human Conversations Special? A Language Model Perspective appeared first on Symbl.ai.

]]>
In the rapidly evolving field of artificial intelligence (AI), large language models (LLMs) have exhibited remarkable capabilities across a broad spectrum of tasks in a very short time.  These models’ capabilities have taken the world by storm, resulting in large-scale investment into the training and deployment of proprietary and open-source models. However, one nagging question remains: Do these models perform equally well across all types of data and domains? This question becomes particularly relevant when considering the domain of human-human interactions.

Read Paper

Human-Human Interactions (Conversations)

Human conversations are at the heart of human language and communication. They are distinguished by their complexity and depth. Any progress on language models should reflect concordant progress in handling human conversations. 

Unlike other data types, human conversations are distinguished by several key characteristics: 

  1. Interactivity: The reciprocal and dynamic nature of human conversations makes them highly interactive, with participants actively responding to and building upon each other’s contributions.
  2. Contextuality: Most conversations are deeply embedded in specific shared contexts, which may include physical surroundings, relationships, cultural backgrounds, shared history etc.
  3. Adaptability: Participants in a conversation typically adjust and adapt their speech and content to immediate and real-time feedback from the other participant(s).
  4. Emotional & Psychological States: At their most detailed, conversations convey more than just factual information: they also act as a transfer medium for emotional and psychological states, through tone, pace, volume, emotion, choice of words, etc.

These attributes highlight the rich, multifaceted nature of human conversations, presenting a unique challenge for LLMs.

Domain Representation in LLM Datasets

Today’s LLMs are trained on vast amounts of data – well north of 1 trillion tokens – across different domains and data distributions. However, a model is only as good as the data that it has seen: to understand how well the model understands human conversations, investigating the composition of the model’s training data is crucial. Despite the vast amounts of data fed into LLMs, a closer inspection reveals a disproportionate under-representation of human conversation data. Utilizing the Common Crawl dataset, a foundational source for many LLM training corpora representing data on the internet, our analysis uncovers that human conversations constitute a mere 0.0085% of LLM training data.

This shows that human conversation data on the web is underrepresented, and that this is a resource-poor domain. This lopsided representation leads to several artifacts that hamper a model’s ability to effectively and accurately attend to data and contexts from such underrepresented (and important) domains.

Quantitative Analysis of Data Domains

In order to analyze how a language model treats human-human conversations as compared to data from other domains, we employ a set of three metrics aimed at dissecting language models’ attention – the mechanism LLMs use to model contextual associations. These metrics jointly seek to provide an empirical overview of the work that a model undertakes when navigating different data domains. Our analysis spans four domains – human-human conversations, web data, code, and mathematics. 

Attention Distance Difference

This metric highlights the difference in the distance between the attention spans across tokens between a pair of domains: that is, how a model needs to adapt its attention span when switching from data in one domain to the other. The full details of how this metric is calculated can be found in the paper.
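The full details are in the paper; as a rough illustration of the general idea only (not the paper’s exact formulation), an attention-weighted mean token distance can be computed from a layer’s attention matrices and then compared across domains by subtraction. The attention weights below are synthetic stand-ins:

```python
import numpy as np

def mean_attention_distance(attn):
    """attn: array of shape (heads, seq_len, seq_len), each row a distribution over keys.
    Returns the attention-weighted average distance |i - j| between a query token i
    and the key tokens j it attends to, averaged over tokens and heads."""
    _, seq_len, _ = attn.shape
    idx = np.arange(seq_len)
    dist = np.abs(idx[:, None] - idx[None, :])        # |i - j| distance matrix
    return float((attn * dist).sum(axis=-1).mean())   # expectation per token, then mean

# Hypothetical usage: compare two domains at the same layer with random stand-in attention.
rng = np.random.default_rng(0)
attn_conversation = rng.dirichlet(np.ones(128), size=(32, 128))
attn_web = rng.dirichlet(np.ones(128), size=(32, 128))
difference = mean_attention_distance(attn_conversation) - mean_attention_distance(attn_web)
print(f"attention distance difference: {difference:.3f}")
```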

Differences in attention distances when comparing general web data against 3 data domains: human conversations (left), code (middle), and math (right). Higher distance indicates longer contextual dependencies.

Human-human conversations demand significantly longer attention distances compared to web content, as well as code and math data, indicating that more robust modeling of longer-term contextual relationships by models is necessary. This can be seen when comparing to the middle attention chart (featuring code data), where higher attention distances are observed in the initial half of the layers, while there is a reduction as we head into the deeper layers – suggesting that the model’s attention is able to become more localized and focus on closer contextual relationships.

Average attention distance difference across domains.

Attention Dispersion 

We can also calculate the dispersion of the model’s attention across domain data, which indicates whether the model can learn to focus narrowly on just a few telltale keywords and markers, or whether it must spread its attention thin across the entire context. For this calculation, we use the mean attention entropy as a proxy for the model’s attention dispersion, considering the same four domains – web, conversations, code, math – across layers and heads, as below:

This heatmap shows that there is much higher entropy between layers 22 – 36 for the conversation data, which indicates that the model has to attend strongly to more tokens in this domain than in the others. This indicates a higher complexity for the human conversation domain, leading to higher attention dispersion on the part of the model as it tries to wrap its attention around the complexity of the context.
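As a rough sketch of how such an entropy proxy can be computed from attention weights (again, an illustration of the idea rather than the paper’s exact formulation), higher row entropy corresponds to attention spread across many tokens:

```python
import numpy as np

def mean_attention_entropy(attn, eps=1e-12):
    """attn: (heads, seq_len, seq_len) attention weights, each row a distribution.
    Returns the mean Shannon entropy of the rows: higher means attention is
    dispersed over many tokens, lower means it is pooled on a few."""
    entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # entropy per (head, query token)
    return float(entropy.mean())

# Synthetic comparison: diffuse attention vs. attention pooled on a few tokens.
rng = np.random.default_rng(0)
diffuse = rng.dirichlet(np.ones(64), size=(16, 64))        # spread-out rows
peaked = rng.dirichlet(np.full(64, 0.05), size=(16, 64))   # concentrated rows
print(mean_attention_entropy(diffuse) > mean_attention_entropy(peaked))  # True
```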

Qualitative Analysis of Data Domains 

In addition to the quantitative results above, we also strive to provide a more qualitative interpretation of our study, to show how human-human conversation data is different from other domains that are better represented in the training corpora of today’s language models.

t-SNE

To visualize and understand the representation in the model of various data domains, we use the t-SNE (t-distributed Stochastic Neighbor Embedding) visualization, an unsupervised non-linear dimensionality reduction technique that can be used to represent high-dimensional data by assigning each datapoint to a point in a lower dimension space. We plot these for the first (left), middle (middle), and last (right) layers across the four data domains. 

In the first layer, all domains are relatively close to each other, albeit with clear boundaries emerging amongst the clusters. This closer proximity suggests a base level of generic processing before the model adapts to the specific demands of each domain. By the middle layer, human conversations and math data have separated, while web and code still show some overlap. This divergence illustrates the model’s evolving understanding and representation in distinguishing data across domains. In the last layer, there is slightly better separation between web and code domains, while there is still some minimal overlap. This analysis goes to show that models are representing data and contexts from different domains differently.
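For readers who want to reproduce a similar view, here is a minimal sketch using scikit-learn’s TSNE. The hidden-state inputs are synthetic placeholders standing in for per-example representations extracted from a model’s layers:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Placeholder hidden states: 100 examples per domain, 512-dimensional vectors.
domains = ["conversation", "web", "code", "math"]
hidden_states = np.vstack([rng.normal(loc=i, size=(100, 512)) for i in range(len(domains))])
labels = np.repeat(domains, 100)

# Reduce to 2D for plotting; perplexity 30 is a common default-range choice.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(hidden_states)
print(coords.shape)  # (400, 2) -> ready to scatter-plot, colored by `labels`
```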

Example Showing Attention Dispersion

Finally, to demonstrate how the model’s attention differs when attending to content from different domains, we visualize 4 different contexts/data points from each of the domains at the same layer and head. 

Some interesting attention patterns are exhibited in the above. For human conversations (top left), as indicated by all the quantitative metrics, the model must attend to tokens across the entire context, and accommodate longer-term dependencies. For code (top right), by comparison, the model very clearly knows that only specific tokens are important tokens, and that attention must be pooled there: a reflection of the structured and rule-based nature of code data. Web (bottom left) and math (bottom right) data is somewhere in the middle, with the attention not as dense as human conversations, but also more distributed than the code domain.

An LLM for Human Conversations

This analysis distinguishes and separates human conversation data from other data domains, and the unique demands that conversation data places on a model (via a study of attention). The results highlight that human conversations exhibit a unique set of characteristics, necessitating more robust modeling of long-term contextual relationships by any model that hopes to produce high-quality results. The higher attention dispersion (entropy) observed in the human conversation domain suggests a strong need for large language models to adopt a much broader focus to capture the necessary context and nuances required to succeed in such domains.

At Symbl, we have built an LLM specialized for human conversations – Nebula. Nebula is trained with a significant amount of human-human conversation data, and optimized to work best on human conversations. Learn more about using Nebula here.

Read the paper that this article is based on here.

The post Are Human Conversations Special? A Language Model Perspective appeared first on Symbl.ai.

]]>
A Guide to LLM Inference Performance Monitoring https://symbl.ai/developers/blog/a-guide-to-llm-inference-performance-monitoring/ Tue, 05 Mar 2024 05:15:19 +0000 https://symbl.ai/?p=32283 With a growing number of large language models (LLMs) on offer, choosing a model that best suits your needs is crucial to the success of your generative AI strategy. The wrong choice can consume considerable time and resources and even possibly lead to a premature conclusion that AI can’t, in fact, enhance your organisation’s efficiency […]

The post A Guide to LLM Inference Performance Monitoring appeared first on Symbl.ai.

]]>
With a growing number of large language models (LLMs) on offer, choosing a model that best suits your needs is crucial to the success of your generative AI strategy. The wrong choice can consume considerable time and resources and even possibly lead to a premature conclusion that AI can’t, in fact, enhance your organisation’s efficiency and productivity.

Although there are several ways to determine an LLM’s capabilities, such as benchmarking, as detailed in our previous guide, one of the methods most applicable to real-world use is measuring a model’s inference performance, i.e., how quickly it generates responses. 

With this in mind, this guide explores LLM inference performance monitoring, including how inference works, the metrics used to measure an LLM’s speed, and how some of the most popular models on the market perform. 

What is LLM Inference Performance Monitoring and Why is it Important?

LLM inference is the process of entering a prompt and generating a response from an LLM. It involves a language model drawing conclusions or making predictions to generate an appropriate output based on the patterns and relationships to which it was exposed during training. 

Subsequently, LLM inference performance monitoring is the process of measuring the speed and response times of a model. Measuring LLM inference is essential as it allows you to assess an LLM’s efficiency, reliability, and consistency – all of which are crucial aspects in determining its ability to perform in real-world scenarios and provide the intended value within an acceptable timeframe. Conversely, insufficient means to correctly evaluate LLMs leaves organisations and individuals with blind spots and an inability to properly distinguish one model from another. This is likely to lead to wasted time and resources down the line, as a language model proves to be ill-suited for its intended use case.

How LLM Inference Works

To better understand the metrics used to measure a model’s latency, let’s first briefly examine how an LLM performs inference, which involves two stages: a prefill phase and a decoding phase. 

Firstly, in the prefill phase, the LLM must process the text from a user’s input prompt by converting it into a series of prompt, or input, tokens. A token is a unit of text that represents a word or a portion of a word. For English, a token is roughly 0.75 words – or about four characters – on average. The exact mechanism that an LLM uses to divide text into tokens, i.e., its tokenizer, varies between models. Once generated, each token is turned into a vector embedding, a numerical representation that the model can understand and make inferences from. These embeddings are then processed by the LLM in order to generate an appropriate output for the user. 

From here, during the decoding phase, the LLM generates a series of vector embeddings that represent its response to the given input prompt. These are then converted into completion, or output, tokens, which are generated one at a time until the model reaches a stopping criterion, such as the token limit or one of a list of stop words, at which point it generates a special end token to signal the end of token generation. As LLMs generate one token per forward propagation, i.e., pass or iteration, the number of propagations that a model requires to complete a response equals the number of completion tokens.
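To make the two phases concrete, here is a toy, framework-free sketch of the decode loop: one forward pass per completion token, ending at a token limit or a stop token. The `forward` function is a dummy stand-in, not a real model:

```python
def forward(prompt_tokens, generated_tokens):
    """Stand-in for a real model's forward pass: returns the next token id."""
    return (len(prompt_tokens) + len(generated_tokens)) % 50_000  # dummy logic

def generate(prompt_tokens, max_output_tokens=32, stop_token=0):
    generated = []
    # In a real model, the prefill phase processes the whole prompt once; here we
    # only mimic the decode phase: one forward propagation per completion token.
    for _ in range(max_output_tokens):
        next_token = forward(prompt_tokens, generated)
        if next_token == stop_token:      # stopping criterion: end/stop token
            break
        generated.append(next_token)
    return generated

print(len(generate([101, 2023, 2003, 1037, 7953], max_output_tokens=8)))  # 8
```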

What Are the Most Important LLM Inference Performance Metrics?

To evaluate the inference capabilities of a large language model, the metrics that we’re most interested in are latency and throughput.

Latency

Latency is a measure of how long it takes for an LLM to generate a response to a user’s prompt. It provides a way to evaluate a language model’s speed and is mainly responsible for forming a user’s impression of how fast or efficient a generative AI application is. Consequently, low latency is important for use cases that involve real-time interactions, such as chatbots and AI copilots, but less so for offline processes. There are several ways to measure a model’s latency, including: 

  • Time To First Token (TTFT)
  • Time Per Output Token (TPOT)
  • Total generation time

TTFT is the length of time it takes for the user to start receiving a response from a model after entering their prompt. It’s determined by the time it takes to process the user’s input and generate the first completion token. Factors that influence TTFT include:

  • Network speed: a system’s general bandwidth and, similarly, how congested the network is at the time of inference.  
  • Input sequence length: the longer the prompt, the more processing required by the model before it can output the first token.   
  • Model size: conventionally, the larger the model, i.e., the more parameters it has, the more computations it performs to generate a response, which prolongs the TTFT. 

TPOT, alternatively, is the average time it takes to generate a completion token for each user querying the model at a given time. This can also occasionally be referred to as inter-token latency (ITL). 

Total generation time refers to the end-to-end latency of an LLM: from when a prompt is originally entered by the user to when they receive the completed output from the model; often, when people refer to latency, they’re actually referring to total generation time. It can be calculated as follows:

  • Total generation time = TTFT + (TPOT × number of generated tokens)
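A back-of-the-envelope sketch of this formula (all timing figures below are made-up examples):

```python
def total_generation_time(ttft_s, tpot_s, num_output_tokens):
    """End-to-end latency = time to first token + per-token time * number of output tokens."""
    return ttft_s + tpot_s * num_output_tokens

# Example: 0.4 s to first token, 20 ms per output token, a 250-token response.
print(f"{total_generation_time(0.4, 0.02, 250):.2f} s")  # 5.40 s
```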

An LLM’s total generation time varies according to a number of key factors:

  • Output Length: this is the most important factor because models generate output a token at a time. This is also why the LLM’s TPOT should also be measured. 
  • Prefill Time: the time it takes for the model to complete the prefill stage, i.e., how long it takes the model to process all the input tokens from the user’s prompt before it can generate the first completion token. 
  • Queuing Time: there may be times when an LLM can’t keep up with user requests because of its hardware constraints – namely a lack of GPU memory. This means some input requests will be placed in a queue before they’re processed. This is the reason behind TTFT being such a commonly recorded metric, as it offers insight into how well the model’s server can handle varying numbers of user requests and, subsequently, how it might perform in a real-world setting.

Something else to be considered when measuring latency is the concept of a cold start. When an LLM is invoked after previously being inactive, i.e., scaled to zero, it causes a “cold” start as the model’s server must create an instance to process the request. This has a considerable effect on latency measurements – particularly TTFT and total generation time, so it’s crucial to note whether the published inference monitoring results for a model specify whether they include a cold start time or not.

Throughput

An LLM’s throughput provides a measure of how many requests it can process or how much output it can produce in a given time span. Throughput is typically measured in two ways: requests per second or tokens per second.

  • Requests per second: this metric is dependent on the model’s total generation time and how many requests are being made at the same time, i.e., how well the model handles concurrency. However, total generation time varies based on how long the model’s input and output are.
  • Tokens per second: because requests per second are influenced by total generation time, which itself depends on the length of the model’s output and, to a lesser extent, its input, tokens per second is a more commonly used metric for measuring throughput. Much like TTFT, the tokens per second metric is integral to the perceived speed of an LLM.

Additionally, tokens per second could refer to:

  • Total tokens per second: both input and output tokens
  • Output tokens per second: only generated completion tokens

Typically, total tokens per second is considered the more definitive measure of model throughput, while output tokens per second is applicable to measuring the performance of LLMs for use in real-time applications. 
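As a hedged sketch of how these two throughput views relate, given per-request logs (the figures below are made up), they can be computed as follows:

```python
# Hypothetical request log: (input_tokens, output_tokens, generation_time_seconds)
requests = [
    (512, 180, 4.2),
    (256, 300, 6.1),
    (1024, 90, 3.0),
]

wall_clock_seconds = 10.0  # assume these requests all completed within a 10 s window

requests_per_second = len(requests) / wall_clock_seconds
output_tokens_per_second = sum(out for _, out, _ in requests) / wall_clock_seconds
total_tokens_per_second = sum(inp + out for inp, out, _ in requests) / wall_clock_seconds

print(f"{requests_per_second:.2f} req/s, "
      f"{output_tokens_per_second:.1f} output tok/s, "
      f"{total_tokens_per_second:.1f} total tok/s")
```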

Request Batching

One of the most effective and increasingly employed methods for increasing an LLM’s throughput is batching. Instead of loading the model’s parameters for each user prompt, batching involves collecting as many inputs as possible to process at once – so parameters have to be loaded less frequently. However, while this makes the most efficient use of a GPU and improves throughput, it does so at the expense of latency – as users who made the initial requests in a batch must wait until the whole batch is processed to receive a response (a toy simulation of this trade-off follows the list below). What’s more, the larger the batch size, the greater the latency penalty, although there are limits on how large a batch can grow before it causes memory overflow. 

Types of batching techniques include:  

  • Static batching: also called naïve batching, this is the default batching method with which multiple prompts are gathered and responses are only generated when all the requests in the batch are complete.
  • Continuous batching: also known as in-flight batching; as opposed to waiting for all the prompts within a batch to be completed, this form of batching groups requests at the iteration level. As a result, once a request has been completed, a new one can replace it, making it more compute-efficient.  
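The toy simulation below (purely illustrative timings, ignoring queuing and slot-refill details) shows why static batching trades latency for throughput: every request in a static batch effectively waits for the slowest one, whereas continuous batching returns each request closer to its own generation time:

```python
# Illustrative only: per-request generation times within one batch (seconds).
batch_generation_times = [1.2, 0.8, 3.5, 2.0]

# Static (naive) batching: responses are returned only when the whole batch finishes,
# so every request experiences roughly the latency of the slowest request.
static_latency_per_request = [max(batch_generation_times)] * len(batch_generation_times)

# Continuous (in-flight) batching: a finished request is returned immediately and its
# slot is refilled, so each request's latency is closer to its own generation time.
continuous_latency_per_request = list(batch_generation_times)

print("static    :", static_latency_per_request)
print("continuous:", continuous_latency_per_request)
```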

What Are the Challenges of LLM Inference Performance Monitoring?

As beneficial as it is to gain insight into a model’s latency and throughput, obtaining this data isn’t always straightforward. Some of the challenges associated with measuring LLM inference include: 

  • Lack of testing consistency: there can be differences in the way inference tests are conducted, such as the type of GPU (and quantity used), the number and nature of prompts, whether the LLM is inferred locally or through an API, etc. These can all affect a model’s inference metrics and can make it tricky to make like-for-like comparisons as the tests were conducted under different conditions. 
  • Different token lengths per model: inference performance tests typically present results in terms of token-based metrics, e.g., tokens per second – but token lengths vary per LLM. This means metrics aren’t always comparable across model types. 
  • Lack of data: quite simply, inference metrics may not be available for particular models as they weren’t published by their vendors – and no one has sufficiently tested them yet.

How Do Popular LLMs Perform on These Metrics?

Now that we’ve covered how LLMs perform inference and how it’s measured, let’s turn our attention to how some of the most popular models score on various inference metrics. 

To start, let’s look at tests performed by AI research hub Artificial Analysis, which publishes ongoing performance and benchmark tests for a collection of widely used LLMs. Although the site publishes a wide variety of inference metrics, we’re homing in on three:

  • Throughput (tokens per second)
  • Latency (total response time (TRT)): in this case, the number of seconds it takes to output 100 tokens
  • Latency (time to first chunk (TTFC)): the site opts to use TTFC as opposed to TTFT because some API hosts send out chunks of tokens rather than individual tokens. 

Another important note is that for the TRT and TTFC metrics, with the exception of the Gemini Pro, Claude 2.0, and Mistral Medium, the figures below are the mean across multiple API hosts. In the case of the three OpenAI GPT models, this is the average of two API hosts, OpenAI and Azure. In contrast, for the Mixtral 8x7B and Llama 2 Chat, the average is derived from eight and nine API hosting providers, respectively.  

| Model | Throughput (tokens per second) | Latency (TRT, seconds) | Latency (TTFC, seconds) |
|---|---|---|---|
| Mixtral 8x7B | 95 | 2.66 | 0.6 |
| GPT-3.5 Turbo | 92 | 1.85 | 0.65 |
| Gemini Pro | 86 | 3.6 | 2.6 |
| Llama 2 Chat (70B) | 82 | 3.16 | 0.88 |
| Claude 2.0 | 27 | 4.8 | 0.9 |
| GPT-4 | 22 | 7.35 | 1.9 |
| GPT-4 Turbo | 20 | 7.05 | 1.05 |
| Mistral Medium | 19 | 6.2 | 0.3 |

In addition to the summary provided above, the site features other inference measurements, including latency and throughput over time and the costs of inference.  

The site GPT for Work features a latency tracker that continually monitors the performance of the APIs for several models from OpenAI and Azure OpenAI (GPT-4 and GPT-3.5) and Anthropic (Claude 1.0 and 2.0). They publish the average latency of each model over a 48-hour period, based on:

  • Generating a maximum of 512 tokens
  • A temperature of 0.7 
  • 10-minute intervals
  • Data from three locations

Lastly, let’s look at the results of a more comprehensive study conducted by the machine learning operations organisation Predera. Now, while this study only centres on two model families, Mistral Instruct and Llama 2 (with both 7B and 70B Llama models tested further along in the experiment), it provides a larger array of inference metrics, including: 

  • Throughput (tokens per second)
  • Throughput (requests per second)
  • Average latency (seconds)
  • Average latency per token (seconds)
  • Average latency per output token (seconds)
  • Total time (seconds)

Additionally, the experiment sees inference performed through the parallel utilization of a varying number of GPUs – which in this case is the NVIDIA L4 Tensor Core GPU. This offers an indication of each LLM’s scalability. Lastly, these results are based on feeding each model 1000 prompts.   

1 x L4 GPU

| Model | Throughput (tokens per second) | Throughput (requests per second) | Average latency (seconds) | Average latency per token (seconds) | Average latency per output token (seconds) | Total time (seconds) |
|---|---|---|---|---|---|---|
| Llama2-7B | 558.54 | 1.17 | 449.88 | 1.71 | 10.87 | 897.23 |
| Mistral-7B-instruct | 915.48 | 1.89 | 277.19 | 0.97 | 7.12 | 552.44 |

2 x L4 GPUs

| Model | Throughput (tokens per second) | Throughput (requests per second) | Average latency (seconds) | Average latency per token (seconds) | Average latency per output token (seconds) | Total time (seconds) |
|---|---|---|---|---|---|---|
| Llama2-7B | 1265.17 | 2.65 | 179.85 | 0.63 | 3.81 | 397.65 |
| Mistral-7B-instruct | 1625.08 | 3.35 | 153.09 | 0.50 | 2.65 | 339.51 |

4 x L4 GPUs

| Model | Throughput (tokens per second) | Throughput (requests per second) | Average latency (seconds) | Average latency per token (seconds) | Average latency per output token (seconds) | Total time (seconds) |
|---|---|---|---|---|---|---|
| Llama2-7B | 1489.99 | 3.12 | 147.36 | 0.48 | 2.57 | 324.71 |
| Mistral-7B-instruct | 1742.70 | 3.59 | 136.49 | 0.44 | 2.68 | 285.03 |

8 x L4 GPUs

| Model | Throughput (tokens per second) | Throughput (requests per second) | Average latency (seconds) | Average latency per token (seconds) | Average latency per output token (seconds) | Total time (seconds) |
|---|---|---|---|---|---|---|
| Llama2-7B | 1401.18 | 2.93 | 153.09 | 0.50 | 2.65 | 339.51 |
| Mistral-7B-instruct | 1570.70 | 3.24 | 149.67 | 0.48 | 2.90 | 316.74 |
| Llama2-70B | n/a | 1.00 | 475.59 | 1.62 | 9.21 | 996.86 |

The first thing you’ll notice from the results above is that, as you’d reasonably expect, each model’s inference metrics improve across the board as more GPUs are utilized. That is, until they reach 8 GPUs – at which point each model’s performance is worse than with 4 GPUs. This points to the fact that the models are only scalable up to a point – and dividing inference between additional GPUs offers little benefit while requiring additional time to distribute the workload. 

You’ll also notice that the Llama2-70B only features in the experiment when there are 8 GPUs. This is because a model requires enough memory to store its stated number of parameters multiplied by the size of the data type in which those parameters are stored. In the case of Llama-2-70B, which stores parameters as 16-bit floating point numbers, this equates to 70 billion x 2 bytes = 140 GB. As an L4 GPU has 24 GB of memory, the fewest number of units that could accommodate the 70B model is 6 – though, in keeping with the theme of doubling the number of GPUs used each time, it was run on 8 units. 
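The same arithmetic in a few lines (weights only; overheads such as the KV cache and activations are ignored here):

```python
import math

params_billion = 70          # Llama-2-70B parameter count
bytes_per_param = 2          # 16-bit (fp16/bf16) weights
gpu_memory_gb = 24           # NVIDIA L4 memory capacity

weights_gb = params_billion * bytes_per_param       # ~140 GB of weights
min_gpus = math.ceil(weights_gb / gpu_memory_gb)    # smallest number of L4s that can hold them

print(weights_gb, "GB ->", min_gpus, "GPUs")        # 140 GB -> 6 GPUs
```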

Conclusion

Inference performance monitoring provides a good indication of an LLM’s speed and is an effective method for comparing models against each other. However, when looking to select the most appropriate model for your organisation’s long-term objectives, it’s prudent to use inference metrics as a determining factor rather than the sole determinant of your choice of LLM.

As detailed in this guide, the latency and throughput figures published for different models can be influenced by several things, such as the type and number of GPUs used and the nature of the prompt used during tests. Moreover, even the type of recorded metrics can differ – all of which makes it difficult to get the most comprehensive understanding of a model’s capabilities. 

Plus, as alluded to at the start of this guide, there are benchmarking tests, such as HumanEval, which tests a model’s coding abilities, and MMLU, which assesses a model’s natural language understanding, that provide insight into how an LLM performs at specific tasks. Researching how a language model performs at various benchmarking tests in addition to its inference speed is a robust strategy for identifying the best LLM for your particular needs. 

The post A Guide to LLM Inference Performance Monitoring appeared first on Symbl.ai.

]]>
A Guide to LLM Hyperparameters https://symbl.ai/developers/blog/a-guide-to-llm-hyperparameters/ Tue, 05 Mar 2024 05:07:31 +0000 https://symbl.ai/?p=32278 When selecting the best large language models for your organisation’s needs, there are many factors to consider. Undoubtedly, with there being a strong correlation between a model’s parameter count, looking at the size of an LLM is a wise strategy. Similarly, you might look at its performance at common benchmark or inference performance tests – […]

The post A Guide to LLM Hyperparameters appeared first on Symbl.ai.

]]>
When selecting the best large language models for your organisation’s needs, there are many factors to consider. Undoubtedly, with there being a strong correlation between a model’s parameter count and its capabilities, looking at the size of an LLM is a wise strategy. Similarly, you might look at its performance at common benchmark or inference performance tests – which give you a quantitative measure of performance as well as indicate how well the LLMs that pique your interest compare against each other.

However, after selecting an LLM that seems to best suit your requirements, there are other ways to further mould a language model to fit your particular needs – hyperparameters. In fact, your choice of hyperparameters and how you choose to configure them could be the difference between an LLM failing to meet your expectations and exceeding them. 

With this in mind, let’s take a look at the concept of LLM hyperparameters, why they’re important, and how particular hyperparameters affect a language model output. 

What are LLM hyperparameters and why are they important?

Hyperparameters are configurations that you can use to influence or govern the process of training an LLM. Unlike the model parameters, or weights, hyperparameters aren’t altered by the training data as it’s passed through; instead, they’re external to the model and set before training begins. Subsequently, even though they govern the LLM’s training process, they won’t become a part of the resulting base model and you can’t determine which hyperparameters were used to train a model after the fact. 

An LLM’s hyperparameters are important because they offer a controllable way to tweak a model’s behaviour to produce the outcome desired for a particular use case. Instead of going through the considerable effort and expense of developing a bespoke model, the process of hyperparameter tuning offers the chance to reconfigure a base model so it performs more in line with your expectations.

Exploring Different LLM Hyperparameters 

Let’s move on to looking at some of the most commonly used LLM hyperparameters and the effect they have on a language model’s output. 

Model Size

The first hyperparameter to consider is the size of the LLM you want to use. Generally speaking, larger models are more performant and are more capable of handling complex tasks, as they have more layers within their neural networks. As a result, they have more weights that can be learned from training data to better determine the linguistic and logical relationships between tokens.  

However, a larger LLM costs more, requires larger datasets to train and more computational resources to run, and typically runs at a slower rate than smaller models. Additionally, the larger a model becomes, the more prone it becomes to overfitting, where a model becomes too familiar with its training data and fails to consistently generalise with previously unseen data.

Conversely, a small base LLM can perform as well as its larger equivalents on simple tasks while requiring fewer resources to both train and run. This is especially the case if the model has been quantized, i.e., a compression technique to reduce the size of its weights, and/or fine-tuned, i.e., further trained with additional data. Additionally, the smaller an LLM, the easier it will be to deploy and the more feasible it becomes on less powerful hardware, i.e., devices without several high-powered GPUs. 

Ultimately, the optimal size of an LLM is dependent on the nature of the use case you’re looking to apply it to. The more complex the task – and the more computational resources and training data you have at your disposal – the larger your model can be. 

Number of Epochs

An epoch refers to a complete iteration of an LLM processing an entire dataset. As a hyperparameter, the set number of epochs influences output by helping determine a model’s capabilities. 

A greater number of epochs can help a model increase its understanding of a language and its semantic relationships. However, too many epochs can result in overfitting – where the model is too specific to the training data and struggles with generalisation. Alternatively, too few epochs can cause underfitting, where the LLM hasn’t learned enough from its training data to correctly configure its weights and biases.

Learning Rate 

Learning rate is a fundamental LLM hyperparameter that controls how quickly the model is updated in response to the calculated loss function, i.e., how often it predicted an incorrect output label, during training. On one hand, a higher learning rate expedites the training process but may result in instability and overfitting. On the other hand, a lower learning rate increases stability and improves generalisation during inference – but lengthens training time. 

Additionally, it’s often beneficial to reduce an LLM’s learning rate as its training progresses through the use of a learning rate schedule. Three of the most common learning rate schedules are time-based decay, step decay, and exponential decay, each sketched in code after the list below.

  • Time-based decay: reduces the learning rate according to a preset time value. 
  • Step decay: also known as linear decay, decreases the learning rate by a decay factor every few epochs.
  • Exponential decay: reduces the learning rate proportional to itself every epoch. 
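A minimal sketch of all three schedules (initial learning rate, decay rates, and drop interval are arbitrary example values):

```python
import math

def time_based_decay(lr0, decay_rate, epoch):
    """Learning rate shrinks with elapsed training time (epoch index)."""
    return lr0 / (1.0 + decay_rate * epoch)

def step_decay(lr0, drop_factor, epochs_per_drop, epoch):
    """Learning rate is cut by a fixed factor every few epochs."""
    return lr0 * (drop_factor ** (epoch // epochs_per_drop))

def exponential_decay(lr0, decay_rate, epoch):
    """Learning rate decays proportionally to its current value each epoch."""
    return lr0 * math.exp(-decay_rate * epoch)

for epoch in (0, 5, 10):
    print(epoch,
          round(time_based_decay(1e-4, 0.1, epoch), 6),
          round(step_decay(1e-4, 0.5, 5, epoch), 6),
          round(exponential_decay(1e-4, 0.1, epoch), 6))
```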

Batch Size

An LLM’s batch size parameter determines how much data the model processes each epoch. Creating a batch size requires dividing the dataset into portions, so larger batch sizes accelerate training compared to smaller batches. However, small batches require less memory and compute power and can help an LLM process each data point of a corpus more thoroughly. With the computational demands in mind, batch size is often restricted to your hardware capabilities.

Max Output Tokens

Also often referred to as max sequence length, this is the maximum number of tokens that an LLM can generate as its output. While the number of tokens a model can ultimately output is determined by its architecture, this can be further configured as a hyperparameter to influence an LLM’s response. 

Typically, the higher you set the max output tokens, the more coherent and contextually relevant the model’s response will be. The more output tokens an LLM is allowed to use in formulating a response, the better able it is to express its ideas and comprehensively address the ideas given to it in the input prompt. Naturally, however, this comes with a price – as the longer the output, the more inference is performed by the model – increasing computational and memory demands. 

Subsequently, in contrast, setting a lower max token limit requires less processing power and memory, but in potentially not providing the model with sufficient room to craft the optimal response, you leave the door open for incoherence and errors. That said, there are scenarios in which setting a lower maximum sequence length would prove beneficial, such as: 

  • When trying to boost other aspects of an LLM’s performance, such as throughput or latency, and want to expedite the process by lowering inference time.
  • Similarly, to better control inference costs, you might cap the length of a model’s response. 
  • To constrain the amount of generated text so it conforms to a particular format, i.e., for a specific GenAI application.

Decoding Type

Within the transformer architecture that comprises most modern LLMs, there are two stages to inference: encoding and decoding. Encoding is where the user’s input prompt is converted into vector embeddings, i.e., words are turned into numerical representations, that can be processed by the model to generate the best response. 

Decoding, on the other hand, is where the selected output is first converted from vector embeddings into tokens before being presented to the user as a response. There are two main types of decoding: greedy and sampling. With greedy decoding, the model simply chooses the token with the highest probability at each step during inference. 

Sampling decoding, in contrast, sees the model choose a subset of potential tokens and select a token at random to add to the output text. This creates more variability – or randomness, to how tokens are selected, which is a desirable trait in creative applications of language models. Understandably, however, opting for sampling decoding increases the risk of incorrect or nonsensical responses.

Top-k and Top-p Sampling

When you opt for sampling rather than greedy decoding, you’ll have an additional two hyperparameters with which to influence a model’s output: top-k and top-p sampling values.  

The top-k sampling value is an integer that ranges from 1 to 100 (with a default value of 50) and specifies that the model should sample only from the k tokens with the highest probabilities. To better illustrate how top-k sampling works, let’s use a brief example.

Let’s say you have the sentence “I went to meet a friend…”.

Now, out of the vast number of ways to end this sentence, let’s look at the five examples provided below – each beginning with a different token:

  1. at the library 
  2. for a brief work lunch
  3. to discuss our shared homework assignment
  4. in the centre of the city 
  5. on the other side of town 

From there, let’s assign each of the initial tokens for each sentence a probability, as follows: `

TokenProbability 
At0.30
For0.25
To0.22
In0.15
On0.12

Now, if we set the top-k sampling value to 2, it will only add at and for to the sampling subset from which it selects an output token. Setting it to 5, by contrast, would mean all options could be considered. So, in short, the higher the top-k sampling value, the greater the potential variety in output.

Alternatively, the top-p sampling value is a decimal number in the range of 0.0 to 1.0 that configures a model to sample from the highest-probability tokens until the sum of those probabilities reaches the set value.

Returning to the above table, if the top-p sampling value is set to 0.7, once again, at and for will be the only tokens included in the subset, as their combined probabilities are 0.55 (0.30 + 0.25). As at, for, and to have a cumulative probability of 0.77 (0.30 + 0.25 + 0.22), this breaches the set threshold of 0.7 and to is excluded from the subset as a result. As with top-k sampling, the higher the value, the more varied the output. 

Lastly, in the event both sampling values are set, top-k takes precedence – with all probabilities outside the set threshold set to 0. 
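A minimal sketch of both filters over the example distribution above. Note that the top-p cut-off here follows this article’s description (the token that breaches the threshold is excluded); some implementations instead include the boundary token, and real samplers then renormalize and sample from the kept subset:

```python
probs = {"at": 0.30, "for": 0.25, "to": 0.22, "in": 0.15, "on": 0.12}

def top_k_filter(token_probs, k):
    """Keep only the k highest-probability tokens."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def top_p_filter(token_probs, p):
    """Keep the highest-probability tokens whose cumulative probability stays within p."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        if cumulative + prob > p:
            break
        kept[token] = prob
        cumulative += prob
    return kept

print(top_k_filter(probs, 2))    # {'at': 0.3, 'for': 0.25}
print(top_p_filter(probs, 0.7))  # {'at': 0.3, 'for': 0.25}
```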

Temperature

Temperature performs a similar function to the above-described top-k and top-p sampling values, providing a way to vary the range of possible output tokens and influence the model’s “creativity”. It is represented by a decimal number between 0.0 (which is effectively the same as greedy decoding, whereby the token with the highest probability is added to the output) and 2.0 (maximum creativity). 

The temperature hyperparameter influences output by changing the shape of the token probability distribution. For low temperatures, the difference between probabilities is amplified, so tokens with higher probabilities become even more likely to be output compared to less-likely tokens. Consequently, you should set a lower temperature value when you want your model to generate more predictable or dependable responses.

In contrast, high temperatures cause token probabilities to converge closer to one another, so less likely or unusual tokens receive an increased chance of being output. In light of this, you should set a higher temperature value when you want to increase the randomness and creativity of responses.
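A short sketch of how temperature reshapes the probability distribution by scaling the raw logits before the softmax (the logits below are made-up):

```python
import numpy as np

def temperature_softmax(logits, temperature):
    """Divide logits by the temperature before softmax: low T sharpens the
    distribution toward the top token, high T flattens it toward uniform."""
    scaled = np.array(logits) / max(temperature, 1e-6)  # guard against T = 0
    exps = np.exp(scaled - scaled.max())                # numerically stable softmax
    return exps / exps.sum()

logits = [2.0, 1.0, 0.5, 0.1]
print(np.round(temperature_softmax(logits, 0.2), 3))  # near one-hot: effectively greedy
print(np.round(temperature_softmax(logits, 2.0), 3))  # probabilities pulled closer together
```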

Stop Sequences

Aside from the max output tokens hyperparameter, the other way to influence the length of an LLM’s response is by specifying a stop sequence, i.e., a string composed of one or more characters, which automatically stops a model’s output. A common example of a stop sequence is a period (full stop).

Alternatively, you can specify the end of a sequence by setting a stop token limit – which is an integer value rather than a string. For instance, if the stop token limit is set to 1, the generated output will stop at a sentence. If it’s set to 2, on the other hand, the response will be constrained to a paragraph. 

A reason you might set a stop sequence or stop token limit is that, similar to the max output tokens parameter, you have greater control over inference, which may be a concern if budget is a consideration. 

Frequency and Presence Penalties

A frequency, or repetition, penalty, which is a decimal between -2.0 and 2.0, is an LLM hyperparameter that indicates to a model that it should refrain from using the same tokens too often. It works by lowering the probabilities of tokens that were recently added to a response, so they’re less likely to be repeated, producing a more diverse output.

The presence penalty works in a similar way but is only applied to tokens that have been used at least once – while the frequency penalty is applied proportionally to how often a specific token has been used. In other words, the frequency penalty affects output by preventing repetition, while the presence penalty encourages a wider assortment of tokens. 
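To show how several of the generation-time settings discussed above come together in practice, here is a sketch assuming the OpenAI Python client (v1.x); parameter names and accepted ranges vary between providers, and the model name and prompt are only examples:

```python
# Sketch only: assumes the OpenAI Python SDK v1.x and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize our Q3 sales call in two sentences."}],
    max_tokens=150,          # cap on output tokens
    temperature=0.7,         # moderate creativity
    top_p=0.9,               # top-p (nucleus) sampling threshold
    frequency_penalty=0.5,   # discourage repeating the same tokens
    presence_penalty=0.3,    # encourage introducing new tokens
    stop=["\n\n"],           # stop sequence: end at the first blank line
)
print(response.choices[0].message.content)
```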

What is LLM Hyperparameter Tuning?

LLM hyperparameter tuning is the process of adjusting different hyperparameters during the training process with the goal of finding the combination that generates the optimal output. However, this inevitably can involve considerable trial and error: meticulously tracking the application of each hyperparameter and recording the corresponding results on the output. Consequently, performing this manually is time-consuming. In response to this, methods of automated LLM hyperparameter tuning have emerged to streamline this process considerably. 

The three most common methods of automated hyperparameter tuning are random search, grid search, and Bayesian optimisation; a toy comparison of the first two is sketched in code after the list below.

  • Random Search: as suggested, this type of hyperparameter tuning method randomly selects and evaluates combinations of hyperparameters from a range of values. This makes it a simple yet efficient method capable of traversing a large parameter space. However, its simplicity comes at a cost: it may not find the optimal combination of hyperparameters, and a large number of trials can still be computationally expensive.
  • Grid Search: in contrast to random search, this method exhaustively searches each possible combination of hyperparameters from a range of values. While, like random search, it’s resource intensive, it offers a more systematic approach that guarantees finding the optimal combination of hyperparameters within the searched ranges. 
  • Bayesian Optimisation: differs from the above two methods in that it employs a probabilistic model to predict the performance of different hyperparameters and chooses the best ones in response. This makes it an efficient tuning method that can better handle large parameter spaces and is less resource-intensive than grid search. The downside, however, is that it’s more complex to set up and less reliable at identifying the optimal set of hyperparameters than grid search.
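The toy comparison below contrasts random and grid search over a small hyperparameter space; the `score` function is a stand-in for a real training-and-evaluation run, and the search space values are arbitrary examples:

```python
import itertools
import random

search_space = {
    "learning_rate": [1e-5, 5e-5, 1e-4],
    "batch_size": [8, 16, 32],
    "epochs": [1, 2, 3],
}

def score(config):
    """Stand-in for training a model with `config` and measuring validation quality."""
    return -abs(config["learning_rate"] - 5e-5) * 1e4 - abs(config["batch_size"] - 16) / 16

def grid_search(space):
    keys = list(space)
    combos = [dict(zip(keys, values)) for values in itertools.product(*space.values())]
    return max(combos, key=score)    # exhaustive: evaluates all 27 combinations

def random_search(space, trials=8, seed=0):
    rng = random.Random(seed)
    combos = [{k: rng.choice(v) for k, v in space.items()} for _ in range(trials)]
    return max(combos, key=score)    # evaluates only a sampled subset of the space

print("grid  :", grid_search(search_space))
print("random:", random_search(search_space))
```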

Another advantage offered by automated hyperparameter tuning is that it makes the development of multiple language models, each with a unique combination of hyperparameters, more feasible.  By training them on the same dataset, you’re then in a position to compare their output and determine which is best for your desired use case. Similarly, each model tuned on a different set of hyperparameters and value ranges could prove better suited to different use cases.

Conclusion

Though often falling under the broader category of fine-tuning, hyperparameter fine-tuning is an important discipline that should be considered separately – and as an important part of an AI strategy. By configuring the different hyperparameters detailed in this guide, and observing how your chosen LLM modifies its output in response, you can improve the performance of base models to better suit your desired real-world scenarios.

 

The post A Guide to LLM Hyperparameters appeared first on Symbl.ai.

]]>