Reading with Intent: Equipping LLMs to Understand Sarcasm in Multimodal RAG Systems
https://symbl.ai/developers/blog/reading-with-intent-equipping-llms-to-understand-sarcasm-in-multimodal-rag-systems/
Tue, 27 Aug 2024 15:55:46 +0000

Retrieval Augmented Generation (RAG) has emerged as a powerful approach for enhancing the knowledge and capabilities of Large Language Models (LLMs). By integrating external information sources like Wikipedia or even the open internet, RAG systems empower LLMs to tackle a wider range of tasks with increased accuracy. However, as we increasingly rely on these systems, a critical challenge arises: the inherent ambiguity of human language. 

While LLMs excel at processing factual information, they often struggle to grasp the nuances of emotionally inflected text, particularly sarcasm. This can lead to misinterpretations and inaccurate responses, hindering the reliability of multimodal RAG systems in real-world scenarios. 

In this article, we describe the main findings of our recent research, where we explore this challenge in depth and propose a novel solution: Reading with Intent.

The Pitfalls of Literal Interpretation 

Human communication transcends mere words on a page. Tone of voice, facial expressions, and subtle cues all contribute to the intended meaning. When LLMs – trained primarily on factual data – encounter sarcasm, they often fail to recognize the underlying incongruity between the literal meaning and the intended message. Imagine an LLM interpreting a sarcastic comment like “Oh, that’s just great” as a genuine expression of positivity!

Poisoning the Well: Creating a Sarcasm-Aware Dataset


We used two complementary techniques: Sarcasm Poisoning, to assess a language model's ability to detect and interpret sarcastic tones, and Fact-Distortion, to challenge LLMs' ability to handle misleading information when sarcasm is present, simulating more complex real-world scenarios.

To study this phenomenon, we first needed a dataset that reflects the realities of online communication, where sarcasm is prevalent. Such datasets are hard to curate manually.  We thus generated our own dataset by taking the Natural Questions dataset, a benchmark for open-domain question answering, and strategically injecting different types of sarcastic passages into its retrieval corpus.  

Our methodology involved:

  1. Sarcasm Poisoning: Rewriting factually correct passages with a sarcastic tone using a large language model (Llama3-70B-Instruct).
  2. Fact-Distortion: Creating intentionally misleading passages by distorting factual information, followed by rewriting in a sarcastic tone.

This two-pronged approach allowed us to investigate how sarcasm affects both comprehension and accuracy, regardless of the underlying information’s veracity.
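
To make these two steps concrete, the sketch below shows how such rewrite instructions might be phrased before being sent to an instruction-tuned model such as Llama3-70B-Instruct. The prompt wording and the helper function are illustrative assumptions, not the exact prompts used in the paper.

# Illustrative only: the exact prompts and model-calling code used in the paper may differ.
def build_poisoning_prompts(passage: str) -> dict:
    """Build hypothetical rewrite instructions for the two poisoning strategies."""
    sarcasm_prompt = (
        "Rewrite the following passage so that it keeps all of its facts "
        "but adopts a heavily sarcastic tone:\n\n" + passage
    )
    fact_distortion_prompt = (
        "Rewrite the following passage so that its key facts are subtly "
        "distorted, then phrase the result in a sarcastic tone:\n\n" + passage
    )
    return {"sarcasm": sarcasm_prompt, "fact_distortion": fact_distortion_prompt}

prompts = build_poisoning_prompts("The Eiffel Tower was completed in 1889.")
# Each prompt would then be passed to the rewriting model to produce the poisoned passage.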

Reading with Intent: A Prompt-Based Approach


Our proposed solution, Reading with Intent, centers around equipping all varieties of LLMs with the ability to recognize and interpret the emotional intent behind the text. We achieve this through a two-fold strategy:

  1. Intent-Aware Prompting: We explicitly instruct the LLM to pay attention to the connotation of the text, encouraging it to move beyond a purely literal interpretation.
  2. Intent Tags: We further guide the LLM by incorporating binary tags that indicate whether a passage is sarcastic or not. These tags, generated by a separate classifier model trained on a sarcasm dataset, provide valuable metadata that helps contextualize the text.

With Intent-Aware Prompting, the LLM receives explicit instructions to consider emotional undertones, akin to teaching it to ‘read between the lines.’ Intent Tags, on the other hand, function as markers that flag potentially sarcastic passages, giving the model a heads-up that not everything should be taken at face value.
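
As an illustration of how these two signals could be combined at inference time, the sketch below prepends a binary intent tag to each retrieved passage and wraps everything in an intent-aware instruction. The tag format, the instruction wording, and the classify_sarcasm placeholder are assumptions for illustration only, not the exact prompt or classifier interface from the paper.

from typing import List

def classify_sarcasm(passage: str) -> bool:
    """Placeholder for a separately trained sarcasm classifier (not implemented here)."""
    raise NotImplementedError

def build_reading_with_intent_prompt(question: str, passages: List[str]) -> str:
    """Assemble an intent-aware prompt with binary intent tags (illustrative format)."""
    tagged = []
    for passage in passages:
        tag = "[SARCASTIC]" if classify_sarcasm(passage) else "[SINCERE]"
        tagged.append(f"{tag} {passage}")
    instruction = (
        "Pay attention to the connotation of each passage, not just its literal meaning. "
        "Passages marked [SARCASTIC] may intend the opposite of what they literally say."
    )
    return instruction + "\n\n" + "\n\n".join(tagged) + f"\n\nQuestion: {question}"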

Promising Results and Future Directions

Our experiments demonstrate that Reading with Intent significantly improves the performance of LLMs in answering questions over sarcasm-laden text. The results were consistent across various LLM families, highlighting the generalizability of our approach. We tested it on the Llama-2, Mistral/Mixtral, Phi-3, and Qwen-2 families of LLMs, across models ranging from 0.5B to 72B (and 8x22B) parameters in size.

While this research marks an important step towards sarcasm- and deception-aware LLMs, several avenues for future exploration remain:

  • Enhancing Sarcasm Detection: Developing more robust and nuanced sarcasm detection models that can handle subtle and context-dependent instances of sarcasm.
  • Beyond Binary Tags: Exploring the use of multi-class intent tags that capture a wider range of emotions beyond just sarcasm.
  • Instruction-Tuning: Fine-tuning LLMs explicitly on sarcasm-infused data to further enhance their ability to understand and respond to emotionally charged language.

These advancements can drastically improve understanding and user interactions in customer service, virtual assistance, contact centers, and any scenario where understanding human intent is critical.

By addressing these challenges, we can build more robust and reliable multimodal RAG systems that are better equipped to navigate the full complexity of human communication.  

Want to read more?  Check out our full research paper [link to research paper], where you can explore our methodology, experimental setup, and detailed analysis of the results. 

Want to experiment yourself? We have released our sarcasm dataset as well as the code for creating it, and our Reading with Intent prompting method! You can find the repository on Github here: https://github.com/symblai/reading-with-intent, and on Huggingface 🤗 here: https://huggingface.co/datasets/Symblai/reading-with-intent

Building Performant Models with The Mixture of Experts (MoE) Architecture: A Brief Introduction
https://symbl.ai/developers/blog/building-performant-models-with-the-mixture-of-experts-moe-architecture-a-brief-introduction/
Wed, 24 Jul 2024 17:00:00 +0000

Understanding The Mixture of Experts (MoE) Architecture

Mixture of experts (MoE) is an innovative machine learning architecture designed to optimize model efficiency and performance. The MoE framework utilizes specialized sub-networks called experts that each focus on a specific subset of data. A mechanism known as a gating network directs input to the most appropriate expert for addressing the given query. 

This results in only a fraction of the model’s neural network being activated at any given time, which reduces computational costs, optimizes resource usage, and enhances model performance.

While the MoE architecture has gained popularity in recent years, the concept is not a new one, having first been introduced in the paper Adaptive Mixtures of Local Experts (Robert A. Jacobs et al., 1991). This pioneering work proposed dividing an AI system into smaller, separate sub-systems, with each specializing in different training cases. This approach was shown to not only improve computational efficiency but also decrease training times, achieving target accuracy with fewer training epochs than conventional models.

How Mixture of Experts (MoE) Models Work

MoE models comprise multiple experts within a larger neural network – with each expert itself being a smaller neural network with its own parameters, i.e., weights and biases, allowing them to specialize in particular tasks. The MoE model’s gating network is responsible for choosing the best-suited expert(s) for each input, based on a probability distribution – such as a softmax function. 

This structure enforces sparsity, or conditional computation: only the relevant experts, and therefore only a portion of the model's overall network, are activated for a given input. This contrasts with conventional dense architectures, in which every layer and neuron is used to process each input. As a result, MoEs can maintain a high capacity without proportional increases in computational demands.
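
To make the routing mechanics concrete, here is a minimal PyTorch sketch of an MoE layer with a softmax gating network and top-k expert selection. It is an illustrative simplification rather than a production implementation; real systems add load balancing, expert capacity limits, and distributed routing.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Each expert is a small feed-forward sub-network with its own parameters
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        # The gating network scores every expert for each input token
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        gate_probs = F.softmax(self.gate(x), dim=-1)              # probability per expert
        top_vals, top_idx = gate_probs.topk(self.top_k, dim=-1)   # keep only the top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                         # tokens routed to expert e
                if mask.any():
                    out[mask] += top_vals[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

Because only top_k experts run for each token, compute grows with top_k rather than with the total number of experts, which is where the efficiency described above comes from.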

The Benefits and Challenges of MoE Models

The MoE architecture offers several benefits over traditional neural networks, which include: 

  • Increased Efficiency: By only activating a fraction of the model for each input, MoE models can be efficient and reduce overall computational demands.
  • Scalability: MoE models can successfully scale to large sizes, as adding more experts allows for more capacity without having to increase the computational load for each inference.
  • Specialization: with experts specializing in different areas or domains, MoE models can handle an assortment of tasks or datasets more effectively than conventional models.

Despite these advantages, however, implementing the MoE architecture still presents a few challenges:

  • Increased Complexity: MoE models introduce additional complexity in terms of architecture, dynamic routing, optimal expert utilization, and training procedures. 
  • Training Considerations: The training process for MoE models can be more complex than for standard neural networks, since both the experts and the gating network must be trained. Consequently, there are a number of aspects to keep in mind: 
    • Load Distribution: If some experts are disproportionately selected early on during training, they will be trained more quickly and continue to be chosen more often, as they offer more reliable predictions than those with less training. Techniques like noisy top-k gating mitigate this by evenly distributing the training load across experts.
    • Regularization: Adding regularization terms, i.e., a load-balancing loss, which penalizes an overreliance on any one expert, and an expert diversity loss, which rewards the equal utilization of experts, facilitates balanced training and improves model generalization. A minimal sketch of a load-balancing term is shown after this list.
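
As an illustration of the regularization idea, the sketch below shows one common form of load-balancing loss, similar in spirit to the auxiliary loss used in sparse MoE models such as the Switch Transformer: it compares how evenly the gate's probability mass and its routing decisions are spread across experts and is minimized when both are uniform. Exact formulations vary between implementations, so treat this as an example rather than a canonical definition.

import torch

def load_balancing_loss(gate_probs: torch.Tensor, top1_idx: torch.Tensor) -> torch.Tensor:
    """Penalize routing that over-uses a few experts.

    gate_probs: (num_tokens, num_experts) softmax outputs of the gating network.
    top1_idx:   (num_tokens,) index of the expert each token was routed to.
    """
    num_experts = gate_probs.shape[-1]
    # Fraction of tokens actually dispatched to each expert
    dispatch_frac = torch.bincount(top1_idx, minlength=num_experts).float() / top1_idx.numel()
    # Average gate probability assigned to each expert
    prob_frac = gate_probs.mean(dim=0)
    # Uniform routing and uniform probabilities minimize this product sum
    return num_experts * torch.sum(dispatch_frac * prob_frac)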

Applications of MoE Models

Now that we’ve covered how the Mixture of Experts models work and why they’re advantageous, let us briefly take a look at some of the applications of MoE. 

  • Natural Language Processing (NLP): MoE models can significantly increase the efficacy of NLP models, with experts specializing in different aspects of language processing. For instance, an expert could focus on particular tasks (sentiment analysis, translation), domains (coding, law), or even specific languages.
  • Computer Vision: sparse MoE layers in vision transformers, such as V-MoE,  achieve state-of-the-art performance with reduced computational resources. Additionally, like NLP tasks, experts can be trained to specialize in different image styles, images taken under certain conditions (e.g., low light), or to recognize particular objects. 
  • Speech Recognition: the MoE architecture can be used to solve some of the inherent challenges of speech recognition models. Some experts can be dedicated to handling specific accents or dialects, others to parsing noisy audio, etc. 

Conclusion

The Mixture of Experts (MoE) architecture offers an approach to building more efficient, capable, and scalable machine learning models. By leveraging specialized experts and gating mechanisms, MoE models strike a balance between the greater capacity of larger models and the greater efficiency of smaller models, achieving better performance at reduced computational cost. As research into MoE continues and its complexity is reduced, it will pave the way for more innovative machine learning solutions and the further advancement of the AI field.

How to Implement WebSocket and SIP-based Integration with Symbl.ai
https://symbl.ai/developers/blog/how-to-implement-websocket-and-sip-based-integration-with-symbl-ai/
Tue, 23 Jul 2024 17:00:00 +0000

In today’s increasingly competitive landscape, applications that provide real-time data exchange and communication are crucial for enhancing user experiences, carving out market share, and, ultimately, driving business success. WebSockets and SIP (Session Initiation Protocol) are fundamental technologies for facilitating smooth, reliable online interactions.

In this guide, we explore the concepts of WebSockets and SIP and the role they play in developing performant modern applications. We also detail how to use these protocols to integrate your application with Symbl.ai’s conversational intelligence capabilities to draw maximum insights from your messages, calls, video conferences, and other interactions.  

What is WebSocket?

WebSocket is a widely used protocol for facilitating the exchange of data between a client and a server. It is well suited for any application that requires real-time, two-way communication between a web browser and a server, such as messaging applications, collaborative editing tools, stock tickers, displaying live sports results, and even online gaming.

How do WebSockets Work?

WebSockets sit on top of the Transmission Control Protocol/Internet Protocol (TCP/IP) stack and use it to establish a persistent connection between a client and server. To achieve this, WebSockets first use the Hypertext Transfer Protocol (HTTP, as used to serve websites to browsers) to establish a connection, i.e., a “handshake”. Once the connection is established, WebSocket replaces HTTP as the application-layer protocol to create a persistent two-way, or “full-duplex”, connection, and the server will automatically send new data to the client as soon as it is available.

This is in contrast to how HTTP transmits data, whereby the client continually has to request data from the server and only receives it if new data is available, i.e., HTTP long-polling. By maintaining a persistent connection, WebSockets eliminate the technical overhead of continually having to establish connections and send HTTP request/response headers, significantly reducing latency and opening the door for the development of a wider range of applications that rely on real-time communication. 
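
For a generic illustration of this flow (separate from the Symbl.ai SDK covered later in this guide), the snippet below uses the third-party websockets Python library: a single connect() call performs the HTTP handshake and protocol upgrade, after which the client can send messages and receive server pushes over the same connection without issuing further requests. The URL is a placeholder.

import asyncio
import websockets  # third-party library: pip install websockets

async def listen():
    # The HTTP handshake and protocol upgrade happen inside connect()
    async with websockets.connect("wss://example.com/socket") as ws:
        await ws.send("hello")        # client -> server
        async for message in ws:      # server pushes arrive without further requests
            print("received:", message)

asyncio.run(listen())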

What Are the Benefits of WebSockets?

  • Speed: as a low-latency protocol, WebSockets are ideal for applications that need to exchange data instantaneously. 
  • Simplicity: as WebSockets sits atop TCP/IP and uses HTTP to establish an initial connection, it does not require the installation of any additional hardware or software. 
  • Constant Ongoing Updates: WebSockets enable the server to transmit new data to the client without the need for requests, i.e. GET operations, allowing for continuous updates. 

What is SIP?

As useful as WebSockets are for general-purpose, bi-directional communication, they lack the mechanisms for real-time media transmission; this is where the Session Initiation Protocol (SIP) comes into play. SIP is a signaling protocol that’s used to establish interactive communication sessions, such as phone calls or video meetings. As an essential component of  Voice over Internet Protocol (VoIP), SIP can be used in a variety of multimedia applications, including IP telephony and video conferencing applications. 

How Does SIP Work?

SIP functions much like a call manager: establishing the connection between endpoints, handling the call, and closing the connection once it is finished. This starts with one of the endpoints initiating the call by sending an invite message to the other endpoint(s), which includes their credentials and the nature of the call, e.g., voice, video, etc. The other endpoints receive the invite message and respond with an OK message, comprising their information so the connection can be established. Upon receiving the OK message, the initiating endpoint sends an acknowledgement (ACK) message and the call can begin. 

These messages can be sent via TCP, as with WebSockets, as well as UDP (User Datagram Protocol) or TLS (Transport Layer Security). Once the connection is established, SIP hands over the transmission of media to another protocol such as the Real-time Transport Protocol (RTP) or the Real-time Transport Control Protocol (RTCP); hence the name Session Initiation Protocol, as its role is solely to establish communication between endpoints. 

What Are the Benefits of SIP?

  • Interoperability: SIP is agnostic to the type of media being transmitted, with the ability to handle voice, video, and multimedia calls. 
  • Adaptability: SIP is compatible with a large variety of devices and components. Additionally, it works with legacy systems such as the Public Switched Telephone Network (PSTN) and is designed to accommodate emerging technologies. 
  • Scalability: SIP can be used in both small and large-scale communication networks, with the ability to establish and terminate connections as necessary to utilize resources efficiently. 

How to Integrate Your Application with Symbl.ai via WebSocket

Now that we have explored WebSockets and how they work, let us move on to how to integrate your application with Symbl.ai's conversational intelligence capabilities via WebSocket, which is accomplished through Symbl.ai's Streaming API.

In this example, the code samples are in Python, using functions from the Symbl.ai Python SDK; however, Symbl.ai also provides SDKs in JavaScript and Go. 

Prepare Environment

Before you begin, you will need to install the Symbl.ai Python SDK, as shown below: 

# For Python Version < 3
pip install symbl

# For Python Version 3 and above
pip3 install symbl


Additionally, to connect to Symbl.ai’s APIs, you will need access credentials, i.e., an app id and app secret, which you can obtain by signing into the developer platform.  

Create WebSocket Connection

The first step is establishing a connection to Symbl.ai‘s servers. This creates a connection object, which accepts the following parameters:

Parameter | Description
credentials | Your app id and app secret from Symbl.ai’s developer platform.
speaker | A Speaker object containing a name and userId field.
insight_types | The insights to be returned over the WebSocket connection, i.e., Questions and Action Items.
config | Optional configuration for the conversation. For more details, see the config parameter in the Streaming API documentation.

The code snippet below is used to start a connection: 

import symbl

connection_object = symbl.Streaming.start_connection(
    credentials={"app_id": "<app_id>", "app_secret": "<app_secret>"},
    insight_types=["question", "action_item"],
    speaker={"name": "John", "userId": "john@example.com"},
)


Receive Insights via Email

You can opt to receive insights from the interactions within your application via email. This will provide you with a link to view the conversation transcripts, as well as details such as the topics discussed, generated follow-ups, action items, etc., through Symbl.ai’s Summary UI.

To receive the insights via email, add the code below to the instantiation of the connection object:

actions = [
        {
          "invokeOn": "stop",
          "name": "sendSummaryEmail",
          "parameters": {
            "emails": [
              emailId #The email address associated with the user’s account in your application 
            ],
          },
        },
      ]


Which results in a connection object like that shown below:

connection_object = symbl.Streaming.start_connection(
    credentials={"app_id": "<app_id>", "app_secret": "<app_secret>"},
    insight_types=["question", "action_item"],
    speaker={"name": "John", "userId": "john@example.com"},
    actions=[
        {
            "invokeOn": "stop",
            "name": "sendSummaryEmail",
            "parameters": {
                "emails": [
                    emailId  # The email address associated with the user’s account in your application
                ],
            },
        },
    ],
)


Subscribe to Events

Once the WebSocket connection is established, you can get live updates on conversation events such as the generation of a transcript, action items or questions, etc. Subscribing to events is how the client receives new information over the WebSocket without having to issue explicit requests.

The subscribe method of the connection object listens for events from an interaction and allows you to subscribe to them in real time. It takes a dictionary parameter, where each key is an event and its value is a callback function that should be executed on the occurrence of that event.

The table below summarizes the different events you can subscribe to: 

Event | Description
message_response | Generates an event whenever a transcription is available.
message | Generates an event for live transcriptions. This includes the isFinal property, which is False initially, signifying that the transcription is not finalized.
insight_response | Generates an event whenever an action_item or question is identified in the transcription.
topic_response | Generates an event whenever a topic is identified in the transcription.

An example of how to set up events is shown below, with the events stored in a dictionary before being passed to the subscribe method:

events = {
    "message_response": lambda response: print(
        "Final Messages -> ",
        [message["payload"]["content"] for message in response["messages"]],
    ),
    "message": lambda response: print(
        "live transcription: {}".format(
            response["message"]["punctuated"]["transcript"]
        )
    )
    if "punctuated" in response["message"]
    else print(response),
    "insight_response": lambda response: [
        print(
            "Insights Item of type {} detected -> {}".format(
                insight["type"], insight["payload"]["content"]
            )
        )
        for insight in response["insights"]
    ],
    "topic_response": lambda response: [
        print(
            "Topic detected -> {} with root words, {}".format(
                topic["phrases"], topic["rootWords"]
            )
        )
        for topic in response["topics"]
    ],
}


connection_object.subscribe(events)


Send Audio From a Mic

This allows you to send audio data over the WebSocket directly from your mic. It is recommended that first-time users use this function when sending audio to Symbl.ai, to ensure that audio from their application works as expected.  

connection_object.send_audio_from_mic()


Send Audio Data

You can send custom binary audio data from some other library using the following code. 

connection_object.send_audio(data)


Stop the Connection

Lastly, you need to close the WebSocket, with the code below:

connection_object.stop()


How to Integrate your Application with Symbl.ai via SIP

In this section, we will take you through the process of integrating your application with SIP through Symbl.ai’s Telephony API. As with our WebSocket implementation above, the code snippets are in Python, but the Symbl.ai SDK is also available in JavaScript and Go. 

Prepare Environment

Before you begin, you will need to install the Symbl.ai Python SDK, as shown below: 

# For Python Version < 3
pip install symbl

# For Python Version 3 and above
pip3 install symbl

Additionally, to connect to Symbl.ai’s APIs, you’ll need access credentials, i.e., an app id and app secret, which you can obtain by signing into the developer platform.  

Create SIP Connection

After setting up your environment accordingly, the initial step requires you to establish a SIP connection. You will need to include a valid SIP URI to dial out to. 

The code snippet below allows you to start a Telephony connection with Symbl.ai via SIP:

connection_object = symbl.Telephony.start_sip(uri="sip:8002@sip.example.com") 


Receive Insights via Email

As with a WebSocket integration, you can choose to receive insights from the interactions from the call via email. This will provide you with a link to view the conversation transcripts, as well as details such as the topics discussed, generated follow-ups, action items, etc., through Symbl.ai’s Summary UI.

To receive the insights via email, add the code below to the instantiation of the connection object:

actions = [
        {
          "invokeOn": "stop",
          "name": "sendSummaryEmail",
          "parameters": {
            "emails": [
              emailId #The email address associated with the user’s account in your application 
            ],
          },
        },
      ]


Which results in a connection object like that shown below:

connection_object = symbl.Telephony.start_sip(
    uri="sip:8002@sip.example.com",
    actions=[
        {
            "invokeOn": "stop",
            "name": "sendSummaryEmail",
            "parameters": {
                "emails": [
                    emailId  # The email address associated with the user’s account in your application
                ],
            },
        },
    ],
)


Subscribe to Events

Once the SIP connection is established, you can get live updates on conversation events such as the generation of a transcript, action items, questions, etc.

The subscribe method of the connection object listens to the events of a live call and lets you subscribe to them in real time. It takes a dictionary parameter, where each key is an event and its value is a callback function that should be executed on the occurrence of that event.

The table below summarizes the different events you can subscribe to: 

Event | Description
message_response | Generates an event whenever a transcription is available.
insight_response | Generates an event whenever an action_item or question is identified in the message.
tracker_response | Generates an event whenever a tracker is identified in the transcription.
transcript_response | Also generates transcription values; however, these will include an isFinal property that will be False initially, meaning the transcription is not finalized.
topic_response | Generates an event whenever a topic is identified in any transcription.

An example of how to set up events is shown below, with the events stored in a dictionary before being passed to the subscribe method:

events = {
    'transcript_response': lambda response: print('printing the first response ' + str(response)), 
    'insight_response': lambda response: print('printing the first response ' + str(response))
    }

connection_object.subscribe(events)


Stop the Connection

Finally, to end an active call, use the code below:

connection_object.stop()


Querying the Conversation Object

Whether implementing a WebSocket or SIP connection, you can use the conversation parameter associated with the Connection object to query Symbl.ai’s Conversation API to access specific elements of the recorded interaction. 

The table below highlights a selection of the functions provided by the Conversation API and their purpose. 

Function | Description
connection_object.conversation.get_conversation_id() | Returns a unique conversation_id for the conversation being processed. This can then be passed to the other functions described below.
connection_object.conversation.get_messages(conversation_id) | Returns a list of messages from a conversation. You can use this to produce a transcript for a video conference, meeting, or telephone call.
connection_object.conversation.get_topics(conversation_id) | Returns the most relevant topics of discussion from the conversation, generated based on the overall scope of the discussion.
connection_object.conversation.get_action_items(conversation_id) | Returns action items generated from the conversation.
connection_object.conversation.get_follow_ups(conversation_id) | Returns follow-up items generated from the conversation, e.g., sending an email, making subsequent calls, booking appointments, setting up a meeting, etc.
connection_object.conversation.get_members(conversation_id) | Returns a list of all the members in a conversation.
connection_object.conversation.get_questions(conversation_id) | Returns explicit questions or requests for information that come up during the conversation.
connection_object.conversation.get_conversation(conversation_id) | Returns the conversation metadata, such as meeting name, member name and email, start and end time of the meeting, meeting type, and meeting id.
connection_object.conversation.get_entities(conversation_id) | Extracts entities from the conversation, such as locations, people, dates, organizations, datetime, daterange, and custom entities.
connection_object.conversation.get_trackers(conversation_id) | Returns the occurrence of certain keywords or phrases from the conversation.
connection_object.conversation.get_analytics(conversation_id) | Returns the speaker ratio, talk time, silence, pace, and overlap from the conversation.
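
For example, once a conversation has been processed, a few of these calls can be combined as in the short sketch below. This is a minimal illustration based on the function names listed above; see the Conversation API documentation for the exact response formats.

# Retrieve the ID of the processed conversation
conversation_id = connection_object.conversation.get_conversation_id()

# Use it to pull specific elements of the interaction
messages = connection_object.conversation.get_messages(conversation_id)
action_items = connection_object.conversation.get_action_items(conversation_id)
topics = connection_object.conversation.get_topics(conversation_id)

print(messages, action_items, topics)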

Conclusion 

To recap:

  • WebSocket is a widely used protocol for facilitating the exchange of data between a client and a server
  • SIP is a signaling protocol that is used to establish interactive communication sessions, such as phone calls or video meetings
  • The benefits of WebSockets include:
    • Speed
    • Simplicity
    • Constant ongoing updates
  • The benefits of SIP include:
    • Interoperability
    • Adaptability
    • Scalability
  • Integrating your application via WebSocket is done through Symbl.ai’s Streaming API and includes:
    • Preparing the environment
    • Creating a WebSocket connection
    • Subscribing to events
    • Sending audio from a mic, or prepared binary audio data
    • Stopping the connection
  • Integrating your application via SIP is done through Symbl.ai’s Telephony API and includes:
    • Preparing the environment
    • Creating a SIP connection
    • Subscribing to events
    • Stopping the connection
  • You can use the conversation parameter associated with the Connection object to query Symbl.ai’s Conversation API to access specific elements of the recorded interaction. 

To discover more about Symbl.ai’s powerful APIs and how you can tailor them to best fit the needs of your application, visit the Symbl.ai documentation.  Additionally, sign up for the development platform to gain access to the innovative large language model (LLM) that powers Symbl.ai’s conversational intelligence solutions, Nebula, to better understand how you can extract more value from the interactions that take place throughout your organization.  

How to Build LLM Applications With LangChain and Nebula
https://symbl.ai/developers/blog/how-to-build-llm-applications-with-langchain-and-nebula/
Mon, 22 Jul 2024 17:00:00 +0000

With millions of monthly downloads and a thriving community of over 100,000 developers, LangChain has rapidly emerged as one of the most popular tools for building large language model (LLM) applications. 

In this guide, we explore LangChain’s vast capabilities and take you through how to build a question-and-answer (QA) application – using Symbl.ai’s proprietary LLM Nebula as its underlying language model. 

What is LangChain?

LangChain is an LLM chaining framework available in Python and JavaScript that streamlines the development of end-to-end LLM applications. It provides a comprehensive library of building blocks that are designed to seamlessly connect – or “chain” – together to create LLM-powered solutions that can be applied to a large variety of use cases.  

Why Use LangChain to Build LLM Applications?

Some of the benefits of LangChain include:

  • Expansive Library of Components: LangChain features a rich selection of components that enable the development of a diverse range of LLM applications.  
  • Modular Design: LangChain is designed in a way that makes it easy to swap out the components within an application, such as its underlying LLM or an external data source, which makes it ideal for rapid prototyping. 
  • Enables the Development of Context-Aware Applications: one of the aspects at which LangChain excels is facilitating the development of context-aware LLM applications. Through the use of prompt templating, document retrieval, and vector stores, LangChain allows you to add context to the input passed to an LLM to produce higher-quality output. This includes the use of proprietary data, domain-specific information that an LLM hasn’t been trained on, and up-to-date information.
  • Large Collection of Integrations: LangChain includes over 600 (and growing) built-in integrations with a wide variety of tools and platforms, making it easier to incorporate an LLM application into your existing infrastructure and workflows.
  • Large Community: as one of the most popular LLM frameworks, LangChain boasts a large and active user base. This has resulted in a wealth of resources, such as tutorials and coding notebooks, that make it easier to get started with LangChain, as well as forums and groups to assist with troubleshooting.

    Just as importantly, LangChain’s developer community consistently contributes to the ecosystem, submitting new classes, features, and functionality. For instance, though officially available as a Python or JavaScript framework, the LangChain community has submitted a C# implementation. 

LangChain Components

With a better understanding of the advantages it offers, let us move on to looking at the main components within the LangChain framework. 

  • Chains: the core concept of LangChain, a chain allows you to connect different components together to perform different tasks. As well as a collection of ready-made chains tailored for specific purposes, you can create your own chains that form the foundation of your LLM applications.    
  • Document Loaders: classes that allow you to load text from external documents to add context to input prompts. Document loaders streamline the development of retrieval augmented generation (RAG) applications, in which the application adds context from an external data source to an input prompt before passing it to the LLM – allowing it to generate more informed and relevant output.

    LangChain features a range of out-of-the-box loaders for specific document types, such as PDFs, CSVs, and SQL, as well as for widely used platforms like Wikipedia, Reddit, Discord, and Google Drive. 
  • Text Splitters: divide large documents, e.g., a book or extensive research paper, into chunks so they can fit into the input prompt. Text splitters overcome the present limitations of context length in LLMs and enable the use of data from large documents in your applications. A minimal loader-and-splitter sketch follows this list. 
  • Retrievers: collect data from a document or vector store according to a given text query. LangChain contains a selection of retrievers that correspond to different document loaders and types of queries.     
  • Embedding Models: these convert text into vector embeddings, i.e., numerical representations that an LLM can process efficiently.  Embeddings capture different features of text from documents that allow an LLM to compare their semantic meaning with the user’s input query. 
  • Vector Stores:  used to store documents for efficient retrieval after they have been converted into embeddings. The most common type of vector store is a vector database, such as Pinecone, Weaviate, or ChromaDB. 
  • Indexes: separate data structures associated with vector stores and documents that pre-sort, i.e., index, embeddings for faster retrieval.   
  • Memory: modules that allow your LLM applications to draw on past queries and responses to add additional context to input prompts. Memory is especially useful in chatbot applications, as it allows the bot to access previous parts of its conversation(s) with the user to craft more accurate and relevant responses. 
  • Prompt Templates: allow you to precisely format the input prompt that is passed to an LLM.  They are particularly useful for scenarios in which you want to reuse the same prompt outline but with minor adjustments. Prompt templates allow you to construct a prompt from dynamic input, i.e., from input provided by the user, retrieved from a document, or derived from an LLM’s prior generated output.  
  • Output Parsers: allow you to structure an LLM’s output in a format that’s most useful or presentable to the user. Depending on their design, LLMs can generate output in various formats, such as JSON or XML, so an output parser allows you to traverse the output, extract the relevant information, and create a more structured representation. 
  • Agents: applications that can autonomously carry out a given task using the tools it is assigned (e.g., document loaders, retrievers, etc.) and use an LLM as its reasoning – or decision-making –  engine. LangChain’s strength in loading data from external data sources enables you to provide agents with more detailed, contextual task instructions for more accurate results. 
  • Models: wrappers that allow you to integrate a range of LLMs into your application. LangChain features two types of models: LLMs, which take a string as input and return a string, and chatbots, which take a sequence of messages as input and return messages as output.  
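
To make a couple of these components concrete, here is a minimal sketch that loads a PDF and splits it into chunks. The file path is a placeholder, the loader additionally requires the pypdf package, and the import paths correspond to the classic langchain package used elsewhere in this guide (newer releases move these classes into separate packages).

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load an external document (placeholder path; requires pypdf)
docs = PyPDFLoader("my_report.pdf").load()

# Split it into overlapping chunks that fit comfortably into a prompt
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

print(f"{len(docs)} pages split into {len(chunks)} chunks")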

How to Build a QA Bot: A Step-By-Step Implementation 

We are now going to explore the capabilities of LangChain by building a simple QA application with Nebula LLM. 

Our application will use a prompt template to send initial input to the LLM. The model’s response will then feed into a second prompt template, which will also be passed to the LLM. However, instead of making separate calls to the LLM to achieve this, we will simply construct a chain that will execute all the actions with a single call. 

Additionally, to access Nebula’s API, you will need an API key, which you can obtain by signing up to the Symbl.ai platform and using your App ID and App Secret.


Setting Up Your Environment

First, you need to set up your development environment by installing LangChain. There are three options depending on what you intend to use the framework for: 

  1. pip install langchain: the bare minimum requirements 
  2. pip install langchain[llms]: to include all the modules required to integrate common LLMs
  3. pip install langchain[all]: to include all the modules required for all integrations. 

While we do not require the dependencies required for all integrations, we do want those related to LLMs, so we are going with option 2 as below: 

The following code will install the requisite libraries: 

# For Python Version < 3

pip install langchain[llms]

# For Python Version 3 and above

pip3 install langchain[llms]


Loading the LLM

With your development environment correctly configured, the next step is loading our LLM of choice – which, in this case, is Nebula LLM. 

To use Nebula LLM, we are first going to leverage LangChain’s extensibility and create a custom LLM wrapper: extending LangChain’s LLM class to create our own NebulaLLM class. Our custom wrapper includes a _call method, which sends an initial system prompt (to establish context for the LLM) and the user’s input prompt – and returns Nebula’s response. This will enable us to call Nebula LLM in LangChain in the same way as an OpenAI model or an LLM hosted on HuggingFace.

import requests
import json
from typing import Optional
from langchain.llms.base import LLM

class NebulaLLM(LLM):
    # Declare the fields LangChain expects instead of overriding __init__
    api_key: str
    url: str = "https://api-nebula.symbl.ai/v1/model/chat"

    # Implement the class’ call method
    def _call(self, prompt: str, stop: Optional[list] = None) -> str:

        # Construct the message to be sent to Nebula
        payload = json.dumps({
            "max_new_tokens": 1024,
            "system_prompt": "You are a question and answering assistant. You are professional and always respond politely.",
            "messages": [
                {
                    "role": "human",
                    "text": prompt
                }
            ]
        })

        # Headers for the JSON payload
        headers = {
            'ApiKey': self.api_key,
            'Content-Type': 'application/json'
        }

        # POST request sent to Nebula, containing the model URL, headers,
        # and message, then assigned to the response variable
        response = requests.request("POST", self.url, headers=headers, data=payload)

        # Return the text of the latest message in Nebula's JSON response
        return response.json()['messages'][-1]['text']

    # Property methods expected by LangChain

    @property
    def _identifying_params(self) -> dict:
        return {"api_key": self.api_key}

    @property
    def _llm_type(self) -> str:
        return "nebula_llm"


The two properties at the end of the code snippet are getter methods required by LangChain to manage the attributes of the class. In this case, they provide access to the Nebula instance’s API key and type.  

Additionally, in the _call method, we passed an optional list, which is intended to contain a series of stop sequences that Nebula should adhere to when generating the response. However, in this case, the list is empty and is included to ensure compatibility with LangChain’s interface. 

Creating Prompt Templates

Next, we are going to create the prompt templates that will be passed to the LLM and specify the format of its input. 

The first prompt template takes a location as an input and will be passed to Nebula LLM. It will then generate a response containing the most famous dish from said location that will be used as part of the second prompt passed to Nebula LLM. The second prompt then takes the dish returned from the initial prompt and generates a recipe. 

from langchain import PromptTemplate

# Creating the first prompt template
location_prompt = PromptTemplate(
    input_variables=["location"],
    template="What is the most famous dish from {location}? Only return the name of the dish",
)

# Creating the second prompt template
dish_prompt = PromptTemplate(
    input_variables=["dish"],
    template="Provide a short and simple recipe for how to prepare {dish} at home",
)


Creating the Chains 

Finally, we are going to create a chain that takes a series of prompts and runs them in sequence with a single function call. For our example, we will use LLMChain and a SimpleSequentialChain that combines both chains and runs them in order.  

As well as the two chains, we have also passed the sequential chain the argument verbose=True, which will cause the chain to show its process and how it arrived at its output.

from langchain.chains import LLMChain, SimpleSequentialChain

# Instantiate our custom Nebula wrapper with your API key
llm = NebulaLLM(api_key="<api_key>")

# Create the first chain
chain_one = LLMChain(llm=llm, prompt=location_prompt)

# Create the second chain
chain_two = LLMChain(llm=llm, prompt=dish_prompt)

# Run both chains with SimpleSequentialChain
overall_chain = SimpleSequentialChain(chains=[chain_one, chain_two], verbose=True)

final_answer = overall_chain.run("Thailand")


Note that when calling SimpleSequentialChain, the order in which you pass the chains to the class is important: since chain_one produces the input for chain_two, it must come first. 

Potential Use Cases for a QA bot 

Here are a few ways that a question-and-answer LLM application could add value to your organization.

  • Knowledge Base: through fine-tuning or RAG, you can supply a QA bot with task- or domain-specific knowledge to create a knowledge base.
  • FAQ System: similarly, you can customize a QA bot to answer questions that are frequently asked by your customers. As well as addressing a customer’s query, the QA bot can direct them to the appropriate department for further assistance, if required. By delegating your FAQs to a bot, human agents have more availability for issues that require their expertise – and more customers can be served in less time.
  • Recommendation Systems: alternatively, by asking pertinent questions as well as answering them, a QA bot can act as a recommendation system, guiding customers to the most suitable product or service from your range. This allows customers to find what they are looking for in less time, boosts conversion rates, and, through effective upselling, can increase the average revenue per customer (ARPC). 
  • Onboarding and Training Assistant: QA bots can be used to streamline your company’s onboarding process – making it more interactive and efficient. A well-designed question-and-answer LLM application can replace the need for tedious forms: taking answers to questions as input and asking the employee additional questions if they didn’t supply sufficient information. Similarly, it can be used to handle FAQs regarding the most crucial aspects of your company‘s policies and procedures.  

Additionally, a QA bot can help with your staff’s ongoing professional development needs, allowing an employee to learn at their own pace. Through the quality of answers given by the user, the application can determine their rate of progress and supply training resources that match: providing additional material if the user appears to be struggling while glossing over concepts with which they are familiar. 

Conclusion 

In summary:

  • LangChain is an LLM chaining framework available that enables the efficient development of end-to-end LLM applications
  • Reasons to use LangChain to develop LLM applications include: 
    • An expansive library of components
    • Modular design 
    • Enables the development of context-aware applications
    • Large collection of integrations
    • Large community
  • The core LangChain components include: 
    • Chains
    • Document loaders
    • Text splitters
    • Retrievers
    • Embedding models
    • Vector Stores
    • Indexes
    • Memory
    • Prompt templates
    • Output parsers
    • Agents
    • Models
  • The steps for creating a QA bot with LangChain and Nebula include:
    • Setting up your environment
    • Loading the LLM
    • Creating prompt templates
    • Creating the chains 
  • Potential Use Cases for a QA bot include: 
    • Knowledge base
    • FAQ system
    • Recommendation systems
    • Onboarding and training assistant

LangChain is a powerful and adaptable framework that provides everything you need to develop performant and robust LLM applications.  We encourage you to develop your comfort with its ecosystem by going through the LangChain documentation, familiarizing yourself with the different components on offer, and better understanding which could be most applicable to your intended use case.

Additionally, to discover how Nebula LLM can automate a variety of customer service tasks and transform your company’s unstructured interaction data into valuable insights, trends, and analytics, visit the Nebula Playground and gain exclusive access to our innovative proprietary LLM. 


How to Fine-Tune Llama 3 for Customer Service
https://symbl.ai/developers/blog/how-to-fine-tune-llama-3-for-customer-service/
Fri, 19 Jul 2024 17:00:57 +0000

Despite the immense advantages, until recently the cost and time involved meant that developing a bespoke large language model (LLM) was reserved for companies with the deepest resources. However, even with all the tools and frameworks that streamline building an LLM from scratch and place tailored language models within reach of most organizations, there is no escaping the fact that it remains a time- and resource-intensive endeavor.

Fortunately, there’s an alternative to creating your own LLM: fine-tuning an existing base, or foundation, model. By fine-tuning a base LLM, you can leverage the considerable work undertaken by a skilled AI development or research team and reap the benefits of a personalized LLM – all while avoiding the required time and expense. 

With this in mind, this guide takes you through the process of fine-tuning an LLM, step-by-step. We will demonstrate how to download a base model (Llama 3), acquire and prepare a fine-tuning dataset, and configure your training options.

What is Fine-Tuning?

Fine-tuning is the process of taking a pre-trained base LLM and further training it on a specialized dataset for a specific task or knowledge domain. The pre-training stage involves feeding the LLM vast amounts (typically terabytes) of unstructured data from various internet sources, i.e., “big web data”. In contrast, fine-tuning an LLM requires a smaller, better-curated, and labeled domain- or task-specific dataset.

After its initial training, an LLM will have learned values for an enormous number of parameters (billions or even trillions; the larger the model, the greater the number of parameters) that it uses to predict the best output for a given input sequence. However, as the LLM is exposed to previously unseen fine-tuning data, many of its output predictions will be incorrect. The model must then calculate the loss, i.e., the difference between its predictions and the correct output, and adjust its parameters in relation to the fine-tuning data. After the fine-tuning data has passed through the LLM several times, i.e., several epochs, the result is a new neural network configuration with updated parameters that correspond to its new task or domain.  
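
Conceptually, each fine-tuning step looks something like the heavily simplified PyTorch loop below: the model makes predictions for a batch of labeled examples, the loss measures how far those predictions are from the correct output, and the optimizer adjusts the parameters accordingly. This is a schematic sketch of the idea only; the walkthrough later in this article uses HuggingFace’s Trainer rather than a hand-written loop.

import torch

def fine_tune(model, dataloader, epochs: int = 3, lr: float = 2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):        # several passes (epochs) over the fine-tuning data
        for batch in dataloader:       # batch: dict with input_ids, attention_mask, labels
            outputs = model(**batch)   # forward pass produces predictions and a loss vs. the labels
            loss = outputs.loss        # difference between predictions and the correct output
            loss.backward()            # compute gradients
            optimizer.step()           # adjust the parameters toward the new task or domain
            optimizer.zero_grad()
    return model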

Why Do You Need to Fine-Tune a Base LLM?

After a model’s pre-training,  it has a detailed, general understanding of language – but lacks specialized knowledge. Fine-tuning an LLM exposes it to new, specialized data that prepares it for a particular use case or use in a specific field. This presents several benefits, which include: 

  • Task or Domain-Specificity: fine-tuning an LLM on the distinct language patterns, terminology, and contextual nuances of a particular task or domain makes it more applicable to a specified purpose. This increases the potential value that an organization can extract from AI applications powered by the model.  
  • Customization: similarly, fine-tuning an LLM to adopt and understand your company’s brand voice and terminology enables your AI solutions to offer a more consistent and authentic user experience. 
  • Reduced costs: fine-tuning allows you to create a bespoke language model without having to train one from the ground up. This represents a huge saving in computation costs, personnel expenses, energy output (i.e., your carbon footprint), and time. 

Why Fine-Tune an LLM for Customer Service Use Cases?

Now that we’ve looked at the general advantages of fine-tuning, the question is, why fine-tune an LLM for customer service in particular? Here are some examples of the potential capabilities of an LLM that’s been optimized as a customer service agent. 

  • Authentic chatbots: an LLM fine-tuned for customer service tasks can be used as a bespoke chatbot tailored to the specific needs of your customers. This includes speaking in your company’s defined brand voice, using the same distinct phrases and questions as your human agents. Similarly, it could understand the specific nature of your customers’ queries, including the associated terminology. 
  • Sentiment analysis:  a customer service LLM can detect the sentiment of a conversation, i.e., how a customer really feels and to what extent. This can be performed in real-time, to aid human agents in communicating more effectively, or afterwards, in a training capacity, to improve their skills going forward. 
  • Content generation: an LLM can generate a variety of content related to customer interactions, which can help organizations improve their customer service levels. This includes:
    • Call summaries: summarizing a conversation, or multiple conversations, so human agents can quickly determine the nature of an interaction – or best prepare themselves for subsequent conversations.
    • Key insights: similarly, condensing conversations down into a few salient points for quick analysis. 
    • Follow-up questions: formulating follow-up questions to guide a conversation towards a successful resolution. 

Applying an LLM to customer service tasks in this way offers an organization several benefits, which include: 

  • Saving Time: by automating common parts of your organizational workflow, a customer service LLM saves time for your customers and staff. Customers won’t have to wait for assistance from a human agent as often, allowing them to successfully resolve their query in less time. Similarly, human agents won’t have to spend time addressing simple queries and can dedicate themselves to problems that warrant their skills. 
  • Productivity: with simpler tasks handled by AI agents, your staff can undertake more complicated, and value-adding, activities, which increase the overall productivity of your team.
  • Customer satisfaction: addressing a customer’s queries quickly and effectively – without potentially waiting for someone to get back to them – boosts customer satisfaction rates. The happier your customers are, the greater their loyalty, strengthening your brand connection. This will boost customer retention rates and lower your marketing spend in the long term – because it costs more to attract new customers than to keep existing ones. 

Fine-tuning Llama 3 For Customer Service: A Step-By-Step Implementation 

Now it is time to take you through how to fine-tune an LLM for customer service – step by step. 

For our example, we’re going to use HuggingFace’s Transformers library as it offers several features that streamline the process of fine-tuning an LLM. Firstly, HuggingFace provides easy access to a huge variety of pre-trained models (over 650,000) that can be loaded with just a few lines of code. It also contains the powerful Trainer class that is optimized for training and fine-tuning transformer-based models. HuggingFace’s Trainer supports a vast range of training configurations for customizing the fine-tuning process without requiring you to write your own training loop.

We’re going to use the recently released Llama 3 as our base model because, as with all models in the Llama family, it was designed with fine-tuning in mind – and is more adaptable than other open-source language models.  

Install Libraries

The first step is configuring your development environment by installing the appropriate Python libraries. To fine-tune our Llama 3 model for customer service, we will need to install: 

  • Transformers: the main library that provides the functionality for fine-tuning.
  • Pytorch: HuggingFace’s libraries integrate with the PyTorch, TensorFlow, and Flax machine learning libraries, so you must also install one of them to use the Transformers library. In our case, we will be using PyTorch. 
  • Datasets: a HuggingFace library that grants us access to its over 140,000 datasets.
  • Evaluate: another HuggingFace library that provides metrics to assess the model’s performance during fine-tuning.  

The following code will install the requisite libraries: 

# If the pip command on your system points to Python 3

pip install torch transformers datasets evaluate 

# If you need to call Python 3's pip explicitly

pip3 install torch transformers datasets evaluate

Download Base Model

With your environment configured, the next step is downloading the Llama 3 base model. We will be using the Instruct variation, as opposed to the pure base model, as it has been optimized for dialogue and, consequently, is better suited for our customer service use case. 

Downloading Llama 3 with the Transformers library is simple and is accomplished with the following code: 

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

Prepare Fine-Tuning Data

Next, you need to prepare the data that you will use to fine-tune the base LLM, which requires you to do three things:

  • Acquire data
  • Tokenize the dataset
  • Divide the data into training and evaluation subsets

Acquiring Data

For our fine-tuning example, we are going to use a dataset hosted on HuggingFace, telecom-conversation-corpus, which contains over 200,000 customer service interactions. Alternatively, if you had your own fine-tuning dataset, you would only need to replace the name of the dataset with the directory path where it is saved. 

from datasets import load_dataset

#If you have your own dataset, replace name of HuggingFace dataset with file path

ft_dataset = load_dataset("talkmap/telecom-conversation-corpus")

Create a Tokenizer 

After loading our dataset, we need to tokenize the data it contains, i.e., convert it into sub-word tokens that are easier for the LLM to process. We will use the tokenizer associated with the Llama 3 model, as this ensures the text is split in the same way as during pre-training – and uses the same corresponding tokens-to-index, i.e., vocabulary. 

We are then going to create a simple tokenizer function that takes the text from the dataset, tokenizes it, and applies padding and truncation where necessary. Padding or truncation is often required because input sequences aren’t always the same length; this is problematic because tensors, high-dimensional arrays that text sequences are converted to in order to be processed by the LLM, need to be the same shape. Padding ensures uniformity by adding a padding token to shorter sequences while, conversely, truncating a longer sequence will make it shorter so it conforms to the tensor’s shape.

Finally, to apply the tokenize_function to the entire dataset, we will use the map method from the Datasets library. Additionally, by passing batched=True as an argument, we enable the dataset to be tokenized in batches.  

from transformers import AutoTokenizer

# Load tokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Define tokenizer function

def tokenize_function(examples):

    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = ft_dataset.map(tokenize_function, batched=True)

Divide the Dataset

The last step in preparing our fine-tuning data is dividing it into training and testing subsets. This will ensure that the model sees different data points during its fine-tuning and evaluation, which will help to avoid overfitting, i.e., where the model can’t generalize to unseen data.  

While many HuggingFace datasets are already divided into training and testing subsets, ours is not. Fortunately, we can use the train_test_split() function to divide the dataset for us. By specifying a test_size of 0.1, we reserve 10% of the examples for the testing subset, leaving the remaining 90% for training. 

tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True)

Lastly, because our chosen dataset has over 200,000 interactions, we are going to reduce the size of our training and testing subsets for the sake of expediency, by selecting only the first 1,000 examples from each.  

training_dataset = tokenized_dataset["train"].select(range(1000))

testing_dataset = tokenized_dataset["test"].select(range(1000))

Set Hyperparameters 

We are now going to set the hyperparameter configurations for fine-tuning our model by creating a TrainingArguments object that contains our training options. 

It is important to note that the TrainingArguments class accepts a large number of parameters – well over a hundred. The only required parameter is output_dir, which specifies where to save your model and its checkpoints, i.e., snapshots of the model’s state after specified intervals. Aside from setting the output_dir, you can choose not to specify any other parameters – and simply use the default set of hyperparameters.
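For instance, a minimal configuration that relies entirely on the defaults might look like the sketch below (the output directory name is just an example):

from transformers import TrainingArguments

# Minimal setup: only output_dir is specified; every other hyperparameter falls back to the library defaults
default_args = TrainingArguments(output_dir="llama_3_ft")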

To provide an example of some other training options, however, we will pass the following parameters to our  TrainingArguments object.

  • Learning rate: controls how much the model’s weights are updated in response to the loss function, i.e., the measure of how far the model’s predictions are from the expected outputs. A higher learning rate expedites fine-tuning but can cause instability and overfitting, while a lower learning rate increases stability and reduces overfitting but increases training time.  
  • Weight decay: a regularization technique that penalizes large weights, nudging them towards zero at each update to help prevent overfitting. 
  • Batch size: determines how many examples the model processes at each training step: larger batch sizes can accelerate training but require more memory and compute, while smaller batches require less.   
  • Number of training epochs: how many times the entire dataset is passed through the model. 
  • Evaluation strategy: how often the model is evaluated during fine-tuning, with the options being:
    • no: No evaluation is performed
    • steps: evaluation is performed after a set number of training steps (as determined by another hyperparameter, eval_steps).
    • epoch: evaluation is performed at the end of each epoch.
  • Save strategy: how often the model is saved during fine-tuning, with the options being:
    • no: the model isn’t saved. 
    • steps: the model is saved after a set number of training steps (as determined by another hyperparameter, save_steps).
    • epoch: the model is saved at the end of each epoch. 
  • Load best model at end: if the best model found during fine-tuning is loaded at completion. 

With our chosen hyperparameters, our TrainingArguments will be configured as shown below:  

from transformers import TrainingArguments

#Define hyperparameter configuration
training_args = TrainingArguments(
	output_dir="llama_3_ft",
	learning_rate=2e-5,
	per_device_train_batch_size=16,
	per_device_eval_batch_size=16,
	num_train_epochs=2,
	weight_decay=0.01,
	evaluation_strategy="epoch",
	save_strategy="epoch",
	load_best_model_at_end=True
)

Establish Evaluation Metrics

HuggingFace’s Evaluate library offers a selection of tools that allow you to assess your model’s performance during and after fine-tuning. There are three types of evaluation tools:

  • Metric: used to evaluate a model’s performance. Examples include accuracy, precision, and perplexity
  • Comparison: used to compare two models. Examples include exact match and the McNemar test
  • Measurement: allows you to investigate the properties of a dataset. Examples include text duplicates and toxicity (see the short loading sketch after this list)
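As a brief illustration, the sketch below loads one example of each type; it assumes these module names are available in your installed version of the Evaluate library:

import evaluate

accuracy = evaluate.load("accuracy")                                       # metric
mcnemar = evaluate.load("mcnemar", module_type="comparison")               # comparison
duplicates = evaluate.load("text_duplicates", module_type="measurement")   # measurement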

We are going to choose accuracy as our performance metric, which will tell us how often the model predicts the correct outputs from the fine-tuning dataset. We will write our evaluation strategy as a simple function, compute_metrics, that we can pass to our trainer object. 

import numpy as np
import evaluate

metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    return metric.compute(predictions=predictions, references=labels)

Fine-Tune the Base Model 

With the elements of our trainer object configured, all that’s left is putting it all together and calling the train() function to fine-tune our Llama 3 base model. 

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_dataset,
    eval_dataset=testing_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
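Once fine-tuning completes, you will likely want to persist the resulting weights so they can be reloaded for inference later. A minimal sketch, assuming you want to save to a local directory (the directory name is just an example):

# Save the fine-tuned model and its tokenizer for later use
trainer.save_model("llama_3_ft_customer_service")
tokenizer.save_pretrained("llama_3_ft_customer_service")

# The saved model can later be reloaded with AutoModelForCausalLM.from_pretrained("llama_3_ft_customer_service")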

Common pitfalls when fine-tuning an LLM

Despite its immense benefits and the availability of tools like the HuggingFace libraries that streamline the process,  LLM fine-tuning can still present several challenges. Here are some of the pitfalls you are likely to encounter when fine-tuning a language model. 

  • Catastrophic Forgetting: due to its parameters being altered by the fine-tuning data,  the LLM may “forget” its prior knowledge and capabilities acquired during pre-training.  
  • Overfitting: where an LLM consistently makes accurate predictions on its training dataset but fails to perform as well on testing data. This often occurs because the model learns the training data too closely – including its noise and quirks – and so can’t generalize to new data points. Contributing factors include the dataset being too small, being of poor quality (inaccuracies, biases, etc.), or the model being trained for too long. 
  • Underfitting: where an LLM displays poor predictive abilities during both training and testing. This could be due to a model being too simple (too few layers), a lack of data, poor quality data, or a lack of training time. 
  • Difficulty Sourcing Data: the success of fine-tuning a model depends on the amount and quality of data at your disposal. Depending on the proposed use case and the specificity of the knowledge domain, it can be difficult to source sufficient amounts of fine-tuning data. 
  • Time-Intensive: with the time it takes to gather the requisite datasets, as well as to implement the fine-tuning process, evaluate the model, etc., fine-tuning an LLM can require substantial amounts of time.
  • Increasing Costs: although far less expensive than training an LLM from scratch, when considering the costs involved in sourcing data in addition to computational and staff costs, fine-tuning can still be a costly process. 

How Nebula LLM Has Been Fine-Tuned 

Nebula LLM is Symbl.ai’s proprietary large language model specialized for human interactions. Fine-tuned on well-curated datasets containing over 100,000 business interactions across sales, customer success, and customer service, and on 50 conversational tasks such as chain-of-thought reasoning, Q&A, conversation scoring, intent detection, and others, Nebula is ideal for customer service use cases:

  • Real-time Agent Assistance: Extract key insights and trends to help human agents on live calls and enhance customer support. For instance: generating conversational summaries, generating responses to address objections, handling moments of customer frustration by suggesting script changes, and more. 
  • Call scoring: With Nebula LLM you can score conversations based on performance criteria such as communication and engagement, question handling, forward motion, and others. This can be used to assess a human agent’s performance and enable targeted coaching. 
  • Automated customer support: Nebula LLM can be used to power chatbots to perform common customer support tasks, such as Q&A.

Conclusion

In summary:

  • Fine-tuning is the process of taking a pre-trained base LLM and further training it on a specialized dataset for a specific task or knowledge domain. 
  • Fine-tuning an LLM exposes it to new, specialized data that prepares it for a particular use case or use in a specific field. The benefits of fine-tuning include:
    • Task or domain-specificity
    • Customization
    • Reduced costs
  • Use cases for LLMs fine-tuned for customer service include: 
    • Authentic chatbots
    • Sentiment analysis
    • Content generation
  • Applying an LLM to customer service tasks can offer an organization several benefits, such as: 
    • Saving time
    • Productivity
    • Customer satisfaction
  • The steps for fine-tuning a base LLM include:
    • Installing libraries
    • Downloading a base model
    • Preparing fine-tuning data: acquiring data, tokenizing the dataset, and dividing it into training and evaluation subsets
    • Setting hyperparameters 
    • Establishing evaluation metrics
    • Fine-tuning the base model 

Common pitfalls when fine-tuning an LLM include:

  • Catastrophic forgetting
  • Overfitting
  • Underfitting
  • Difficulty sourcing data
  • Time requirements 
  • Increasing costs

Fine-tuning is an intricate process but can transform the potential of AI applications when applied correctly. We encourage you to develop your understanding and skills with further experimentation. This could include setting different hyperparameters, using different datasets, and attempting to fine-tune a variety of base models. You can learn more by referring to the resources we have provided below. 

Alternatively, if you’d prefer to sidestep the process of fine-tuning an LLM altogether, Nebula LLM is specialized to support your organization’s customer service use cases. To learn more about the model, sign up for access to Nebula Playground.

Additional Resources 

The post How to Fine-Tune Llama 3 for Customer Service appeared first on Symbl.ai.

]]>
How to Build a Multi-Agent Bot with Autogen and Nebula https://symbl.ai/developers/blog/how-to-build-a-multi-agent-bot-with-autogen-and-nebula/ Thu, 18 Jul 2024 17:00:50 +0000 https://symbl.ai/?p=32937 While large language models (LLMs) have firmly pushed AI into the public consciousness in recent years, “AI agents” are poised to dramatically increase the adoption of AI applications. In fact, while the market for AI agents currently sits at just under $5 billion annually, it is projected to reach a staggering $110 billion by 2032.  […]

The post How to Build a Multi-Agent Bot with Autogen and Nebula appeared first on Symbl.ai.

]]>
While large language models (LLMs) have firmly pushed AI into the public consciousness in recent years, “AI agents” are poised to dramatically increase the adoption of AI applications. In fact, while the market for AI agents currently sits at just under $5 billion annually, it is projected to reach a staggering $110 billion by 2032.

In this guide, we explore the vast capabilities of AI agents by taking you through how to build a multi-agent chatbot, step-by-step.

What Are AI Agents and How Do They Work?

An AI agent, or autonomous AI agent, is an application or system capable of executing a given task without direct human intervention. When given a task, an AI agent will assess its environment, evaluate its assigned tools, and develop a plan to complete its given goal. 

Typical components of an AI agent include: 

  • AI Model: most commonly an LLM, which is the “brain” of an AI agent, helping to understand tasks, make decisions, create content, etc. The AI model processes the data collected by sensors, makes decisions based on it, and performs actions in pursuit of its assigned task.
  • Sensors: an agent’s input mechanisms that enable it to “perceive” its environment, whether digital or physical, and best determine how to complete its task. Sensors collect data from the environment and pass it to the AI model for processing. 
  • Actuators: an agent’s means of output; for a software agent, these include applications or devices, such as monitors or printers. 

How an AI agent operates can be broken down into three stages: 

  • Task Definition and Planning: giving an agent a task and the tools to accomplish it. With these, the agent can devise a plan to achieve its given goal, which typically involves dividing it into sub-tasks.   
  • Decision-Making: analyzing the available data from the environment, as well as past experiences, if applicable, and undertaking actions that maximize the chances of completing the task. 
  • Feedback and Adaptation: monitoring the outcome of actions and evaluating whether they brought it closer to accomplishing the task. The agent can use acquired feedback to adjust its plan and, if instructed, can ask for human intervention if it gets stuck. 

Multi-Agent Systems

When you connect two or more AI agents, you create a multi-agent system in which agents can collaborate to complete more complicated tasks than a single agent is capable of. Agents within a multi-agent system can be assigned different roles in accordance with their proposed function. For example, one agent can be designated as the planner, which devises the best way to execute the given task, while others are given the role of a coder, analyst, etc. 

Frameworks for Creating Multi-Agent Chatbots

AutoGen

AutoGen is a Python-based open-source framework that specializes in the development of applications with AI agents. It enables you to connect multiple components, such as LLMs and data sources, together through agent interactions to streamline the creation of complex systems. 

AutoGen supports a diverse range of conversation patterns for creating increasingly intricate systems. This includes complex dynamic conversational capabilities that alter agent topology depending on the conversational flow and the agents’ success at executing their tasks – especially useful when agent interactions can’t be fully anticipated.
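To illustrate, below is a minimal sketch of one such pattern – a group chat in which a manager agent dynamically selects the next speaker. It assumes pyautogen is installed and that llm_config is a configuration dictionary pointing at your chosen model (like the one we define later in this guide):

import autogen

# Assumes llm_config is a configuration dictionary pointing at your chosen model
planner = autogen.AssistantAgent(name="planner", llm_config=llm_config)
coder = autogen.AssistantAgent(name="coder", llm_config=llm_config)
user = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# The GroupChatManager decides which agent speaks next at each turn
group_chat = autogen.GroupChat(agents=[user, planner, coder], messages=[], max_round=6)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

user.initiate_chat(manager, message="Plan and then write a function that reverses a string.")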

Because it abstracts most of its programming logic as agent interactions, AutoGen is intuitive and has an easier learning curve than other frameworks, making it a good choice for non-technical users as well as developers. However, AutoGen is also highly extensible, allowing for customized agent development if one of its ready-made components cannot execute your required task. 

crewAI

Like AutoGen, crewAI is a Python-based framework that specializes in AI agents and has gained popularity due to its simplicity. It allows you to create multi-agent systems by assembling crews of agents, which are assigned roles, equipped with tools, and instructed on how to collaborate to achieve a task through backstories.

It enables the development of production-level deployments through its crewAI+ platform, which allows you to convert crews to APIs, incorporate webhooks, and gain insights through performance metrics. 

LangChain 

LangChain is a comprehensive open-source framework that enables the development of a wide range of AI applications. Its extensive library includes classes for creating autonomous agents, which can be equipped with a diverse selection of tools to create end-to-end systems.  

However, while it provides the most functionality, LangChain doesn’t specialize in AI agents in the same way as AutoGen and crewAI, so it’s not straightforward to create multi-agent systems as with the other two frameworks. That said, LangChain is the best choice for creating more intricate AI applications – so it is common for developers and researchers to combine features from LangChain when using AutoGen or crewAI as their primary framework. 

Building a Multi-Agent Chatbot: Step-By-Step Implementation

Now that you’ve gained a better understanding of AI agents and multi-agent systems, let us turn our attention to how to build a multi-agent chatbot – with step-by-step instructions.

Choose a Framework

The first step in building a multi-agent chatbot is choosing the right framework. For our multi-agent chatbot, we will use AutoGen because it is specifically designed for building multi-agent systems and offers an intuitive framework that makes an excellent starting point for getting to grips with AI agents.  

Choose an LLM

For an AI agent to complete a specified task, you need to provide it with tools, i.e., applications and other resources, with which to carry it out. To build a chatbot, we’re going to provide our agents with LLMs for their natural language processing (NLP) capabilities.  

We are going to use two LLMs to build our multi-agent chatbot, which allows us to harness the capabilities of each to create a more robust and performant system that produces a more diverse range of outputs. 

For our LLMs, we will use Gemma, a lightweight LLM developed by Google, and Nebula, Symbl.ai’s proprietary large language model that is specialized to understand human conversations.  

Install Packages

Having chosen a framework and decided which LLMs you’re going to use, you need to prepare your environment by installing and/or importing the appropriate libraries. To build the multi-agent chatbot, we need to install and import the autogen library. Additionally, to send requests to Nebula, we need the requests library, for sending POST requests, and Python’s built-in json module, for processing JSON objects.

pip install pyautogen requests

import autogen 
import requests
import json

Agent Configurations  

Gemma

With your environment set up, the next step is creating our agent configurations: providing the instructions for how to connect with the LLMs.

Let us start with Gemma – for which we will use LM Studio: a powerful desktop application that grants access to all the open-source LLMs hosted on HuggingFace. As well as being able to download Gemma, we can run the model on LM Studio’s built-in server. 

Type Gemma into the search bar at the top of the interface, which will bring up a list of models as shown below. 


As you will see, the model comes in various sizes, e.g., 2B, 7B, and 9B, with most offering further options in regard to parameter precision, i.e., quantized models. In this example, we are opting for the 2B model, as it will work on a wider range of devices. However, feel free to use a larger model if you have the requisite GPU resources. 

Once the model has finished downloading, click Local Server from the menu on the left, which will take you to the interface displayed below.


Click on Select a model to load at the top of the interface, where you will see Gemma; select it and the server will start automatically.  

The default server port is 1234, which results in a base URL of http://localhost:1234/v1 to connect to the model. Also, because LM Studio exposes an API that is identical to that of OpenAI, conveniently, you can set the API type to open_ai. Finally, the API key should be set to lm-studio.

This results in a config_list like the one below: 

config_list = [
   {
       "api_type" : "open_ai",
       "api_base" : "http://localhost:1234/v1",
       "api_key" : "lm-studio"
   }
]

Nebula LLM

To access Nebula LLM, we will connect to it via its API and send it an input prompt as a JSON object via a POST request.  For our chatbot, we will pass our initial prompt to Gemma, which will generate output that will then be passed to Nebula LLM for further processing. For this example, Nebula LLM will be used to generate a summary of the content, in this case, a biography, which will then be returned as the final output. 

To achieve this, we will take the code used to connect to Nebula LLM (as provided by the Symbl.ai API documentation) and wrap it in a function. 

Additionally, to access Nebula’s API, you’ll need an API key, which you can obtain by signing up to the Symbl.ai platform and using your App ID and App Secret.

#set up for calling Nebula LLM as a function

def send_to_nebula(input_prompt):
    url = "https://api-nebula.symbl.ai/v1/model/chat"

    payload = json.dumps({
      "max_new_tokens": 1024,
      "top_p": 0.95,
      "top_k": 1,
      "system_prompt": "You are a text summarizer: you take pieces of text and return clear, concise summaries that are easy to comprehend.",
      "messages": [
        {
          "role": "human",
          "Text": f" take the following text and create a comprehensive summary:\n {input_prompt}",
        }
      ]
    })
    headers = {
      'ApiKey': '<your_api_key>',
      'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)
    
    return response.json()['messages'][-1]['text']
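As a quick sanity check, you could call this helper directly before wiring it into AutoGen; the text below is just an illustrative placeholder, and a valid API key must be set in the headers above:

# Illustrative standalone call to the helper defined above
sample_text = "Alan Turing was a British mathematician and computer scientist who formalized the concepts of algorithm and computation."
print(send_to_nebula(sample_text))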

Additionally, we need to make the agents aware of the LLMs by including them in llm_config. In the configuration below, we have made it aware of Nebula LLM by defining send_to_nebula in the functions list, and of Gemma by including the config_list with its connection details. 

""" configuration (llm_config) for AssistantAgent, which includes setting it up to use Gemma as its LLM, and recognise the above send_to_nebula function """

llm_config = {
   "functions": [
       {
           "name": "send_to_nebula",
           "description": "takes output from AssistantAgent and sends to Nebula LLM for summarization",
           "parameters": {
               "type": "object",
               "properties": {
                   "input_prompt": {
                       "type": "string",
                       "description": "input prompt to be passed to Nebula LLM",
                   }
               },
               "required": ["input_prompt"],
           },
       },
          ],
   "config_list": config_list,
}

Create and Initialize Agents

Next, you need to initialize the agents that will comprise your multi-agent chatbot; we will use two types of agents: 

  • UserProxyAgent: takes user input and passes it to other agents to execute 
  • AssistantAgent: takes instructions from the UserProxyAgent and executes the specified task. 

First, we will initialize the AssistantAgent by connecting it to Gemma and the send_to_nebula function through the llm_config created above. Additionally, you must give the AssistantAgent a name, so you can refer to it when implementing the conversation logic, i.e., starting a conversation between agents: in this instance, we’ve named it chatbot. 

# Initialize the AssistantAgent

chatbot = autogen.AssistantAgent(
   name="chatbot",
   llm_config=llm_config,
)

Next, you need to do the same thing for the UserProxyAgent, which in addition to its name (in this case, userproxy), requires a few other key parameters. 

The most important of these is human_input_mode, which determines the level of human intervention the UserProxyAgent will seek. There are three options: 

  • ALWAYS: the agent will ask for human input after receiving a message from other agents
  • TERMINATE: the agent will only ask for human input when it receives a termination message
  • NEVER: the agent will never ask for human input and will continue the conversation until the task is complete or it reaches the defined max_consecutive_auto_reply (which is set to 10, for this example). 

While ALWAYS is the appropriate setting if you want the user to have a continuous conversation with your chatbot, for our purposes, to offer a simple example, we have set it to NEVER. 

Lastly, there is code_execution_config, which specifies how the UserProxyAgent will execute code that it receives in the course of completing its defined task. work_dir is the directory in which it will save any created files (in this instance, a folder called “coding”). use_docker, meanwhile, specifies whether it will execute the code with Docker, i.e., in a container – which is set to False.     

# Initialize the UserProxyAgent

user_proxy = autogen.UserProxyAgent(
   name="user_proxy",
   human_input_mode="NEVER",
   max_consecutive_auto_reply=10,
   code_execution_config={
       "work_dir": "coding",
       "use_docker": False,
   },
)

Additionally, although the AssistantAgent is aware of the send_to_nebula function, it’s the UserProxyAgent that will execute it, so it also needs to be made aware of it. To do this, we must register the function, as shown below:

# Register the send_to_nebula function with the UserProxyAgent

user_proxy.register_function(
   function_map={

       "send_to_nebula": send_to_nebula,
   }
)

Implement Conversation Logic 

Finally, with both agents initialized, you can start the conversation between them. The UserProxyAgent initiates the conversation, so it must be passed the name of the AssistantAgent, i.e., chatbot, with which to converse. Additionally, crucially, it must also be given a prompt containing instructions for the task you want it to perform, which is stated in the message parameter. 

# Initiate conversation between agents, i.e., launch the chatbot 

user_proxy.initiate_chat(
   chatbot,
   message="Write a biography for Alan Turing and then send to Nebula for summarization",
)

And with that, you’ve successfully built a multi-agent chatbot!

Use Cases for Multi-Agent Chatbots

Let’s explore some of the applications of multi-agent chatbots. 

  • Q&A: chatbots can leverage an LLM’s acquired knowledge and NLP capabilities to answer users’ questions. They’re an excellent way to address users’ frequently asked questions (FAQs), for instance, freeing up human agents to deal with more complex queries or perform other value-adding duties.  Additionally, if equipped with an LLM that has been fine-tuned for a specific domain, they can act as a knowledge base for a particular subject, such as law, finance, etc.
  • Customer Support: in addition to answering FAQs, multi-agent chatbots can enhance other aspects of customer service. They can be used to onboard new customers, ensuring they understand your full range of products and services and which best fit their needs. This helps to increase customer satisfaction and loyalty, which in turn helps boost customer retention.
  • Customer Service: as well as customer support, chatbots can be used to improve customer service, offering assistance at every stage of the buying journey. AI agents can guide your customers through your marketing funnel, directing them to the appropriate information, or products, depending on their readiness to make a purchase. When necessary, they can be directed to the appropriate human agent to address any outstanding queries that stand in the way of making a purchasing decision. 
  • Semantic Search: multi-agent chatbots can be equipped with LLMs and vector databases to offer semantic search capabilities. This can be applied to a wide range of use cases to help users find what they’re looking for with greater accuracy and speed, whether that is products, services, or data. 
  • Sentiment Analysis: multi-agent chatbots can analyze the sentiment of a conversation in real time, helping to read between the lines to determine how a customer really feels and, subsequently, how you can best help them. As well as its customer service benefits, this can be used to provide insight to human agents to improve their service or sales skills, thereby increasing their effectiveness and productivity.
  • Content Generation: AI agents are extremely useful for creating content, whether using an LLM to create content from scratch or synthesizing multiple pieces of content to create something new. Better yet, AI agents with multimodal capabilities can produce content comprised of various types of media, including text, images, audio, and video. 
  • Education Assistant: chatbots can be used to enhance education and training applications by delivering resources that match the user’s competence, rate of progress, and particular needs. This could include providing additional material or changing learning strategies for areas the user finds difficult while skipping topics they’re familiar with.  

Conclusion

In summary:

  • An AI agent is an application capable of executing a given task without direct human intervention.
  • Connecting two or more AI agents creates a multi-agent system that allows agents to collaborate on more complicated tasks
  • The steps for building a multi-agent chatbot include:
    • Choosing a framework
    • Choosing an LLM
    • Installing the appropriate packages
    • Configuring the agents 
    • Creating and initializing the agents
    • Implementing the conversation logic, i.e., establishing communication between agents
  • Use cases for multi-agent chatbots include: 
    • Q&A
    • customer support
    • customer service
    • semantic search
    • sentiment analysis
    • content generation
    • education assistant

To further your understanding of AI agents, we encourage you to experiment by changing parameters, trying different configurations, using different LLMs, and adding more agents, i.e., creating different conversational hierarchies. You can learn more by referring to the resources we have provided below. 

Additional Resources 

The post How to Build a Multi-Agent Bot with Autogen and Nebula appeared first on Symbl.ai.

]]>
Can Conversational Feature Transfer in LLMs Help Detect Deception? https://symbl.ai/developers/blog/can-conversational-feature-transfer-in-llms-help-detect-deception/ Wed, 17 Jul 2024 06:01:44 +0000 https://symbl.ai/?p=32922 Introduction Large Language Models (LLMs) have revolutionized natural language processing, demonstrating impressive capabilities in sentiment analysis and emotion detection. However, the way most LLMs learn and interpret language differs significantly from human language acquisition. This discrepancy raises an important question: do LLMs trained with multimodal features and across different forms of data, such as conversation […]

The post Can Conversational Feature Transfer in LLMs Help Detect Deception? appeared first on Symbl.ai.

]]>
Introduction

Large Language Models (LLMs) have revolutionized natural language processing, demonstrating impressive capabilities in sentiment analysis and emotion detection. However, the way most LLMs learn and interpret language differs significantly from human language acquisition. This discrepancy raises an important question: do LLMs trained with multimodal features and across different forms of data, such as conversation and text, effectively utilize those features when processing data from a single modality? To answer this, we set up experiments to compare general LLMs against LLMs that are specialized on multimodal data. Specifically, we compare the Llama-2-70B general model against a version of that model which is fine tuned on human conversation data (Llama-2-70B-conversation).

Human communication and conversation are inherently multimodal, involving both verbal and non-verbal cues. We learn to interpret conversational communication first, including intonation and modulation, and then transfer those skills to written communication. Additionally, the conversation modality encodes distinct differences from other text data, such as turn-taking, context dependency, dialog and speech acts, real-time interaction, etc. The question of whether the skills needed to excel at different modalities (text vs. conversation) are transferable across those modalities is what we aim to explore in our new research paper, currently under submission at the Association for Computational Linguistics’ rolling review cycle.

To test our hypothesis, we pick one of the most challenging use cases in conversation understanding and NLP in general: deceptive communication. This includes sarcasm, irony, and condescension, and serves as an illustrating test case for multimodal feature transfer. These forms of covert deception are challenging to detect in text representations of media, as they often rely on multi-turn modulation and prosody changes that are absent in just plain text data.

At Symbl.ai, our animating purpose has been to investigate the nuances and complexities that make human conversation distinct from the mere processing of text data on the web. This line of work extends those investigations by examining whether there are inherent features in conversational data that can be utilized by LLMs to better detect and understand one of the most complex human conversational behaviors – deceptive communication.

Motivation

The motivation for this research stems from the observation that LLMs, until recently, primarily learned language through vast amounts of text-only data on the web. While this approach has yielded impressive results, it fails to capture the inherently multimodal nature of human communication.

The ability to detect deceptive communication is a complex task for both humans and machines, especially in the text-only modality. We focus on this specific aspect of communication to evaluate the multimodal transfer of skills in LLMs. By comparing the performance of multimodal models (conversation+text) with unimodal models, we aim to gain insights into how LLMs interpret and utilize multimodal features.

Results

Our experiments involved comparing the performance of two types of models: text-only models, and text models trained with a special emphasis on human-to-human conversations. These models are exemplified respectively in our current work by the Llama-2-70B model – a very popular openly available LLM; and a fine tuned version of that LLM which specializes in conversational data. We also varied the prompting approach, using both basic prompts and prompts designed to emphasize the model’s conversational features.

Table 1: Average percentage difference between Llama-2-70B-chat and Llama-2-70B-conversation
Table 2: Average percentage difference between the basic prompt and the conversational-features-emphasized prompt.

The results, presented in Tables 1 and 2, offer valuable insights into multimodal feature transfer in LLMs. Table 1 highlights the advantage of using conversation+text models over unimodal text models for identifying deceptive communication such as snark, irony, and condescension. The Llama-2-70B-conversation model achieves higher accuracy and precision in identifying such deceptive communication, with impressive improvements in accuracy and F1-score. This supports our central hypothesis that adding the additional features that come from the conversation modality improves the performance of the language model on challenging use cases and data.

Table 2 reveals the impact of changes in prompting techniques. Emphasizing conversational features in prompts yields mixed results, with a slight improvement in accuracy and precision but a decline in recall. This suggests that while the model may better identify deceptive communication correctly when it is guided to pay special attention to features from the conversation modality via the input prompt, this sharpened focus may also cause it to miss more instances of such communication from an overall set.

Conclusion

Our findings suggest that the phenomenon of multimodal feature transfer occurs in LLMs, as conversation-tuned models outperform unimodal models in deceptive communication detection – a traditionally challenging use case for language models. Additionally, prompts emphasizing speech and conversation features can enhance performance in certain cases. 

These results have important implications for future research and applications, indicating that models are capable of transferring what they learned on multimodal data to single-modality data, improving LLM performance on specific tasks that may require multimodal training. We are currently further investigating the effect of other modalities associated with human conversation data on the feature transfer phenomenon in LLMs, and on the overall accuracy of tasks that are challenging to today’s large language models.

The post Can Conversational Feature Transfer in LLMs Help Detect Deception? appeared first on Symbl.ai.

]]>
Build unified compliance for human and AI agents with Call Score API. https://symbl.ai/developers/blog/build-unified-compliance-for-human-and-ai-agents-with-call-score-api/ Mon, 15 Jul 2024 19:36:49 +0000 https://symbl.ai/?p=32917 Measure performance of your human & AI agents with custom criteria aligned to your business  SEATTLE- Symbl.ai — July 15, 2024 Symbl announced new customized scoring features to Call Score API, its GenAI-powered evaluation tool that scores today’s hybrid workforce of human and AI agents using text and voice signals.  In an era when AI […]

The post Build unified compliance for human and AI agents with Call Score API. appeared first on Symbl.ai.

]]>
Measure performance of your human & AI agents with custom criteria aligned to your business 

SEATTLE- Symbl.ai — July 15, 2024

Symbl announced new customized scoring features to Call Score API, its GenAI-powered evaluation tool that scores today’s hybrid workforce of human and AI agents using text and voice signals. 

In an era when AI agents are augmenting and replacing human agents, ensuring quality is key to achieving better customer experiences (CX). In recent news, there have been multiple incidents of voice agents hallucinating while interacting with customers, thereby negatively impacting customer trust. With Call Score API, businesses can keep their hybrid workforce in check. They can also transform their AI-assisted frontline workforce in customer support, sales and other functions by assessing conversations on specific processes, personas and other key criteria. 

Unified API for customized criteria across human & AI agents

With ‘Custom Criteria’ & ‘Scorecards’, customers can define evaluation criteria, build different scoring logics for their human & AI agents and directly integrate call scores into their CRM, BI tools or custom applications – all with a single API! This significantly reduces the engineering effort and accelerates time to market.

Whether it’s assessing nuanced traits like ‘empathy’ and ‘confidence’ or prioritizing adherence to internal processes, ‘Custom Criteria’ offers unparalleled flexibility to conduct tailored assessments. Additionally, ‘Scorecards’ allow users to combine multiple criteria (both custom defined & Symbl managed) to build comprehensive and consistent evaluations, gaining holistic insights into call performance for coaching.

Ensuring high quality customer interactions with contextual evaluations

Call Score helps business leaders make data-driven decisions by providing them with in-depth visibility into the performance of their customer-facing teams, analyzing “what they are saying” and “how they are saying it”. This helps them ensure high-quality interactions and an overall improved customer experience. For example, in contact center settings, customer support teams can automatically evaluate 100% of calls and prioritize attributes like problem solving and confidence to boost CSAT. 

Conquer.io, a sales engagement platform provider that uses Call Score API said: “Symbl.ai’s GenAI-powered Call Score API is a game-changer for how our customers at Conquer track and improve their sales reps’ performance. While automating numerical scoring, Symbl.ai’s Call Score API also offers detailed feedback and reasons behind the scores.

With the latest addition of custom criteria and scorecards, our customers can now set their own standards for scoring sales calls. This allows them to tailor feedback to their specific needs, providing detailed, personalized coaching that helps their reps improve quickly.”

Ensure compliance & safety of AI agents to prevent CX damage

The reliance on AI voice agents for customer interactions has inherent risks around model hallucination, leading to incorrect or undesired responses in some instances. With Call Score, these risks can be mitigated by having a human in the loop define and measure appropriate scoring criteria and specific traits to regulate the AI agent. For example, custom criteria on ‘humanness’ and applicable AI safety regulations can be defined and measured.   

Diverse Use Cases Across Industries

Call Score API is designed to cater to numerous use cases across different sectors such as:

  • Recruiter Efficiency and Candidate Experience: Measure biases, empathy, and job description alignment to enhance candidate experience.
  • Patient Engagement in Telehealth: Evaluate empathy, clarity, and communication quality during telehealth consultations.
  • Product Pitch Assessment: Ensure alignment to core value propositions in customer-facing teams’ product pitches.
  • Sales Process Alignment: Check adherence to sales methodologies like BANT and MEDDIC.
  • Cold Call Analysis: Assess key criteria needed to convert cold calls within the first 30 seconds of the conversation.

Why Customization of Call Score is a game-changer

The ability to tailor evaluation metrics to specific business needs and combine them into comprehensive scorecards for consistent evaluation makes Call Score a powerful tool for businesses. It ensures every customer interaction is not only assessed with precision but also enriched with actionable and contextual insights, driving continuous improvement. All of this is possible by seamlessly integrating Call Score into your existing environment without any rip and replace needed. 

To learn more about building your own scoring criteria, see our documentation here. If you’d like to see a personalized demo of the Call Score API, or to connect with us to build your own, contact our team.

Developers can get started by creating a free Symbl platform account, define Custom Criteria & Scorecards, and use the Call Score API to generate scores.

About Us

Symbl.ai offers state-of-the-art understanding and generative models with an end-to-end platform for building real-time voice intelligence in your applications.

For more information, visit symbl.ai

The post Build unified compliance for human and AI agents with Call Score API. appeared first on Symbl.ai.

]]>
Introducing ‘Custom Criteria’ & ‘Scorecard’ for Call Score API  https://symbl.ai/developers/blog/introducing-custom-criteria-scorecard-for-call-score-api/ Fri, 12 Jul 2024 15:26:15 +0000 https://symbl.ai/?p=32888 We are excited to announce new enhancements to our Call Score API, a GenAI powered, low-code API that evaluates participant performance and overall call quality at scale. We’ve added two (2) new features – ‘Custom Criteria’ and ‘Scorecards’. With these latest enhancements, the Call Score API now provides unparalleled customization and programmability, allowing you to […]

The post Introducing ‘Custom Criteria’ & ‘Scorecard’ for Call Score API  appeared first on Symbl.ai.

]]>
We are excited to announce new enhancements to our Call Score API, a GenAI-powered, low-code API that evaluates participant performance and overall call quality at scale. We’ve added two new features – ‘Custom Criteria’ and ‘Scorecards’. With these latest enhancements, the Call Score API now provides unparalleled customization and programmability, allowing you to precisely tailor assessments to meet your organization’s specific needs. 

Customization for precise evaluation

‘Custom Criteria’ and ‘Scorecard’, offer flexibility and control, enabling you to craft assessment frameworks that reflect the unique challenges and requirements of your industry. Whether you’re evaluating ‘empathy’ in customer service, negotiation skills in sales, or effectiveness of support provided, the Call Score API adapts to your specific needs. Create detailed, actionable assessments tailored to the unique demands of your sector—from healthcare to retail, and beyond.

Core Features

  1. Custom Criteria: Custom Criteria are user-defined elements created to meet your organization’s unique needs in evaluating conversations. This feature allows you to develop detailed and precise assessment criteria such as empathy, technical clarity, or objection handling, ensuring evaluations are relevant to your business goals.

    Key Elements
     
    • Checklists and Priority
      Develop comprehensive checklists with questions to ensure thorough assessment of each conversation. Assign priority levels to each question to ensure that critical areas are weighted appropriately. This provides a more granular evaluation framework focusing on the most impactful aspects of the conversation.
    • Descriptions and Tags
      Add descriptions to each criterion which helps understand what the criterion pertains to. Tags help categorize and organize criteria, making management easier and more efficient.
       
  2. Scorecard: A scorecard in the Call Score API serves as an abstraction layer that combines various criteria for evaluating conversations. Mix and match custom criteria defined by you and managed criteria defined by Symbl.ai to create a comprehensive view of performance.

    Key Elements
     
    • Criteria IDs
      Use unique identifiers of each criteria to ensure accurate application during evaluations. This helps maintain consistency and precision across all assessments.

Applications Across Various Verticals

The applications of the Call Score API extend beyond sales and customer service, adapting to diverse industries such as finance, education, hospitality, legal, real estate, and media and entertainment. Examples are listed below: 

  • Sales: Enhance negotiation and persuasion skills to improve conversion rates and deal closure efficiency.
  • Customer Service: Develop specific criteria for empathy and problem-solving to increase customer satisfaction and loyalty.
  • Recruitment: Standardize communication assessments during interviews to ensure objective hiring decisions based on clear, measurable competencies.
  • Healthcare: Monitor and improve interactions with patients, focusing on clarity and empathy to boost patient satisfaction and compliance with treatment plans.
  • Retail: Optimize customer interactions by evaluating and training staff on effective communication strategies, enhancing the overall shopping experience.

Key Benefits of ‘Custom Criteria’ & ‘Scorecard’ 

The ability to tailor evaluation metrics to specific business needs and combine them into comprehensive scorecards for consistent evaluation makes Call Score a powerful tool for businesses. Key benefits of customization include: 

  • Refining Evaluation Standards: Customize criteria to emphasize focus areas like customer satisfaction or compliance adherence.
  • Adapting to Changes: Quickly update criteria to reflect new business priorities or regulatory changes.
  • Improving Training Programs: Focus scorecards on specific training outcomes for continuous improvement.

Using Call Score through your preferred integration channel

Previously, Call Score supported only asynchronous conversations such as meeting recordings that were processed with Async API to generate transcripts. With the latest launch, we have added Streaming API support for Call Score to generate call scores for real-time conversations right upon completion of the call without any additional processing needed. Additionally, we have also added Webhook support for both asynchronous and real-time conversations to publish real-time call-score status updates to customers. This removes the onus from the user to continuously check the status of their Call Score job and automatically publishes the call-score status to notify them. 

  1. Streaming API support for Call Score: Support for Streaming API provides a way for users to automatically obtain call scores and other key insights as soon as a streaming connection has been disconnected, without any further processing needed.

    Key Elements
     
    • Actions to trigger Call Score & Insights UI processing
      There are new actions to trigger the Call Score and Insights UI processing after the streaming session has ended. Below is sample code showing how to implement this:
actions: [
  {
    name: 'generateCallScore',
    parameters: {
      "conversationType": "string",
      "salesStage": "string",
      "scorecardId": "string",
      "prospectName": "string",
      "callScoreWebhookUrl": "string"
    }
  },
  {
    name: 'generateInsightsUI',
    parameters: {
      "prospectName": "string"
    }
  }
]
  2. Webhook for Processing Status: Users can optionally define a webhook URL that automatically notifies them whenever the status of the Call Score job changes. This means that the call-score status endpoint does not need to be continuously polled for users to determine if the call-score process is complete. 

    Key Elements
     
    • callScoreWebhookUrl
      The webhook URL for your application. When the status of the processing job is updated, the Call Score Status API sends an HTTP request to the URL that you specify. This is applicable for both Async API and Streaming API 

The addition of Streaming API support for Call Score and Webhooks for Call Score improves overall customer experience and streamlines the processing of conversations so as to obtain call scores and other key insights for immediate post-call feedback and debriefs. 

Getting Started

You can try out Call Score API and the different features listed above by creating a free platform account and following the steps listed below:  

  • Authenticate: Generate your authentication token (AUTH_TOKEN) as per our Authentication Guide.
  • Create Custom Criteria: Use the Management API to define criteria tailored to your needs.
POST https://api.symbl.ai/v1/manage/callscore/criteria

{
  "name": "Custom criteria name",
  "tags": ["tag-1", "tag-2", "tag-3"],
  "checklist": [
    {
      "question": "Was the technical issue clearly identified and resolved?",
      "priority": "high"
    },
    {
      "question": "Did the representative confirm customer understanding?",
      "priority": "medium"
    },
    {
      "question": "Was the representative patient and understanding?",
      "priority": "low"
    }
  ]
}
  • You will receive a criteria id in the response. To learn more about custom criteria, refer here.
  • Create Scorecard: Integrate custom criteria into your scorecard using the Management API. Use the criteria ids created and pass them to the criteriaList parameter to create a scorecard.
POST https://api.symbl.ai/v1/manage/callscore/scorecards

{
 "name": "Scorecard Name",
 "tags": ["tag-1", "tag-2", "tag-3"],
 "criteriaList": ["684654556453497", "5476135688435435", "675462462156455"]
}
  • You will receive a scorecard id in the response. To learn more about scorecard, refer here.
  • Process an Async conversation: Process a conversation via Async API, and pass scorecardId along with “callscore” in the features.

    Here is a sample Async Audio URL API request:
POST /v1/process/audio/url

{
    "url": "https://storage.googleapis.com/abc/abc.wav", // Update the URL with a conversation you like to score
    "languageCode": "en-US",
    "enableSpeakerDiarization": true,
    "diarizationSpeakerCount": 2,
    "features": {
        "featureList": [
            "callScore",
            "insights"
        ],
        "callScore": {
            "scorecardId": "5463513598865445" // Add your scorecard ID
        }
    },
    "conversationType": "sales",
    "metadata": {
        "salesStage": "qualification",
        "prospectName": "Audio File LLC"
    }
}
  • You will receive a conversationID and jobID as the response. To learn more about processing a conversation, refer here.
  • Adding Call Score Webhook URL (OPTIONAL): Users can create a new webhook to get updates on Call Score processing status by defining a new ‘callScoreWebhookUrl’ parameter.

    Here is a sample Async Audio URL API request to create a new Webhook:
{
    "url": "https://abc.bucket-beta.s3.amazonaws.com/ACe48733795c5f601528e0073beccdbd4a/RE0363adf72ebcc4a7439fe1e88ba22442",
    "name": "AE Test 1 AE 2 call",
    "callScoreWebhookUrl": "https://api.beta.abc.ai/symblai/jobWebhook",
    "languageCode": "en-US",
    "features": {
        "featureList": [
            "callScore",
            "insights"
        ]
    },
    "conversationType": "sales"
}
  • Get call score: Using the conversation id received, make an API call to GET Call Score API to receive call score as the response. To learn more about call score, refer here.
GET https://api.symbl.ai/v1/conversations/{{conversationId}}/callscore
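For reference, a minimal Python sketch of this request might look like the following; it assumes a standard Bearer token header using the AUTH_TOKEN generated in the first step and the conversationId returned earlier:

import requests

# Hypothetical retrieval of a call score; AUTH_TOKEN and conversation_id come from the earlier steps
AUTH_TOKEN = "<your_auth_token>"
conversation_id = "<your_conversation_id>"

response = requests.get(
    f"https://api.symbl.ai/v1/conversations/{conversation_id}/callscore",
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
)
print(response.json())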

To learn more about Call Score, please read Symbl.ai Call Score’s technical documentation. 

Follow our API reference and try out the APIs on our platform.

The post Introducing ‘Custom Criteria’ & ‘Scorecard’ for Call Score API  appeared first on Symbl.ai.

]]>
A Guide to Building an LLM from Scratch https://symbl.ai/developers/blog/a-guide-to-building-an-llm-from-scratch/ Fri, 31 May 2024 19:21:43 +0000 https://symbl.ai/?p=32797 Up until recently, building a large language model (LLM) from scratch was a difficult and involved process – only reserved for larger organizations able to afford the considerable computational resources and highly skilled engineers that are required.  Today, with an ever-growing collection of knowledge and resources, developing a custom LLM is increasingly feasible. Organizations of […]

The post A Guide to Building an LLM from Scratch appeared first on Symbl.ai.

]]>
Up until recently, building a large language model (LLM) from scratch was a difficult and involved process – only reserved for larger organizations able to afford the considerable computational resources and highly skilled engineers that are required. 

Today, with an ever-growing collection of knowledge and resources, developing a custom LLM is increasingly feasible. Organizations of all sizes can harness the power of a bespoke language model to develop highly-specialized generative AI applications that will boost their productivity, enhance their efficiency and sharpen their competitive edge.

In this guide, we detail how to build your own LLM from the ground up – from architecture definition and data curation to effective training and evaluation techniques. 

Determine the Use Case For Your LLM 

The first – and arguably most important – step in building an LLM from scratch is defining what it will be used for: what its purpose will be. 

This is crucial for several reasons, with the first being how it influences the size of the model. In general, the more complicated the use case, the more capable the required model – and the larger it needs to be, i.e., the more parameters it must have. 

In turn, the more parameters a model has, the more training data you will need. The LLM’s intended use case also determines the type of training data you will need to curate. Once you have a better idea of how big your LLM needs to be, you will also have a clearer picture of the computational resources, i.e., memory, storage space, etc., it will require.

In an ideal scenario, clearly defining your intended use case will determine why you need to build your own LLM from scratch – as opposed to fine-tuning an existing base model.  

Key reasons for creating your own LLM can include: 

  • Domain-Specificity: training your LLM with industry-specific data that aligns with your organization’s distinct operations and workflow. 
  • Greater Data Security: incorporating sensitive or proprietary information without fear of how it will be stored and used by an open-source or proprietary model. 
  • Ownership and Control: by retaining full ownership of the model and control over confidential data, you can continue to improve your LLM over time as your knowledge grows and your needs evolve.

Create Your Model Architecture

Having defined the use case for your LLM, the next stage is defining the architecture of its neural network. This is the heart, or engine, of your model and will determine its capabilities and how well it performs at its intended task. 

The transformer architecture is the best choice for building LLMs because of its ability to capture underlying patterns and relationships in data, handle long-range dependencies in text, and process input of variable length. Additionally, its self-attention mechanism processes different parts of the input in parallel, allowing it to utilize hardware, i.e., graphics processing units (GPUs), more efficiently than the architectures that preceded it, e.g., recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. Consequently, the transformer has emerged as the state-of-the-art neural network architecture and has been incorporated into leading LLMs since its introduction in 2017. 

Previously, an organization would have had to develop the components of a transformer on its own, which required both considerable time and specialized knowledge. Fortunately, today, there are frameworks specifically designed for neural network development that provide these components out of the box, with PyTorch and TensorFlow being two of the most prominent.

PyTorch is a deep learning framework developed by Meta and is renowned for its simplicity and flexibility, which makes it ideal for prototyping. TensorFlow, created by Google, is a more comprehensive framework with an expansive ecosystem of libraries and tools that enable the production of scalable, production-ready machine learning models. 

Creating The Transformer’s Components

Embedding Layer 

This is where input enters the model and is converted into a series of vector representations that can be more efficiently understood and processed.

This occurs over several steps:

  • A tokenizer breaks down the input into tokens. In some cases, each token is a whole word, but the currently favored approach is to divide the input into sub-word tokens of roughly four characters, or about three-quarters of a word. 
  • Each token is assigned an integer ID and saved in a dictionary to dynamically build a vocabulary. 
  • Each integer is converted into a multi-dimensional vector, called an embedding, with each characteristic or feature of the token represented by one of the vector’s dimensions.  

A transformer has two embedding layers: one within the encoder for creating input embeddings and the other inside the decoder for creating output embeddings. 
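As a minimal PyTorch sketch of the embedding step (the vocabulary size, embedding dimension, and token IDs below are illustrative assumptions):

import torch
import torch.nn as nn

vocab_size = 32_000   # assumed vocabulary size produced by the tokenizer
d_model = 512         # embedding dimension used in the original transformer

embedding = nn.Embedding(vocab_size, d_model)

# Token IDs produced by a tokenizer (hypothetical values for illustration)
token_ids = torch.tensor([[15, 942, 7, 2031]])   # shape: (batch, sequence length)
token_embeddings = embedding(token_ids)          # shape: (1, 4, 512)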

Positional Encoder

Instead of utilizing recurrence or maintaining an internal state to track the position of tokens within a sequence, the transformer generates positional encodings and adds them to each embedding. This is a key strength of the transformer architecture, as it can process tokens in parallel instead of sequentially while still keeping track of long-range dependencies. 

Like embeddings, a transformer creates positional encoding for both input and output tokens in the encoder and decoder, respectively.  
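A minimal sketch of the sinusoidal positional encoding described in the original transformer paper, assuming PyTorch; the resulting matrix is simply added to the token embeddings:

import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # Even dimensions use sine, odd dimensions use cosine, at wavelengths
    # that increase geometrically with the dimension index.
    positions = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div_term)
    pe[:, 1::2] = torch.cos(positions * div_term)
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=512)
# token_embeddings = token_embeddings + pe   # broadcasts over the batch dimension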

Self-Attention Mechanism 

This is the most crucial component of the transformer – and what distinguishes it from other network architectures – as it is responsible for comparing each embedding against others to determine their similarity and semantic relevance. The self-attention layer generates a weighted representation of the input that captures the underlying relationships between tokens, which is used to calculate the most probable output.

At each self-attention layer, the input is projected across several smaller dimensional spaces known as heads, which is why this is referred to as multi-head attention. Each head independently focuses on a different aspect of the input sequence in parallel, enabling the LLM to develop a richer understanding of the data in less time. The original transformer used eight attention heads, but you may decide on a different number based on your objectives. However, the more attention heads you use, the greater the required computational resources, so the choice is constrained by the available hardware. 

Multiple attention heads enhance a model’s performance as well as its reliability: if one of the heads fails to capture important information from the input, the other heads can compensate, resulting in a more robust training process.

Both the encoder and decoder contain self-attention components: the encoder has one multi-head attention layer while the decoder has two. 
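As a minimal sketch, PyTorch's built-in nn.MultiheadAttention implements this mechanism; the eight heads and 512-dimensional embeddings below match the original transformer and are illustrative choices:

import torch
import torch.nn as nn

d_model, num_heads = 512, 8
self_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, 4, d_model)        # (batch, sequence length, embedding dimension)
# In self-attention, the queries, keys, and values all come from the same input.
attn_output, attn_weights = self_attention(x, x, x)
print(attn_output.shape)              # torch.Size([1, 4, 512])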

Feed-Forward Network

This layer captures the higher-level features, i.e., the more complex and detailed characteristics, of the input sequence, so the transformer can recognize the data’s more intricate underlying relationships. It is composed of three sub-layers (a minimal sketch follows the list):

  • First Linear Layer: this takes the input and projects it onto a higher-dimensional space (e.g., 512 to 2048 in the original transformer) to store more detailed representations.
  • Non-Linear Activation Function: this introduces non-linearity into the model, which helps in learning more realistic and nuanced relationships. A commonly used activation function is the Rectified Linear Unit (ReLU). 
  • Second Linear Layer: transforms the higher-dimensional representation back to the original dimensionality, compressing the additional information from the higher-dimensional space back to a lower-dimensional space while retaining the most relevant aspects. 
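A minimal PyTorch sketch of this feed-forward block, using the dimensions from the original transformer:

import torch.nn as nn

d_model, d_ff = 512, 2048

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),   # first linear layer: project up to the higher-dimensional space
    nn.ReLU(),                  # non-linear activation
    nn.Linear(d_ff, d_model),   # second linear layer: project back down to the model dimension
)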

Normalization Layers

This layer ensures the input embeddings fall within a reasonable range and helps mitigate vanishing or exploding gradients, stabilizing the language model and allowing for a smoother training process.   

In particular, the transformer architecture utilizes layer normalization, which normalizes the output for each token at every layer, as opposed to batch normalization, for example, which normalizes across all the examples in a batch. Layer normalization is ideal for transformers because it maintains the relationships between the features of each token and does not interfere with the self-attention mechanism. 

Residual Connections

Also called skip connections, they feed the output of one layer directly into the input of another, so data flows through the transformer more efficiently. By preventing information loss, they enable faster and more effective training.

During forward propagation, i.e., as training data is fed into the model, residual connections provide an additional pathway that ensures the original data is preserved and can bypass transformations at that layer. Conversely, during backward propagation, i.e., when the model adjusts its parameters according to its loss function, residual connections help gradients flow more easily through the network, helping to mitigate vanishing gradients, where gradients become increasingly smaller as they pass through more layers.
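In code, a residual connection and its accompanying normalization layer are often wrapped around each sublayer; here is a minimal PyTorch sketch of the post-norm arrangement used in the original transformer:

import torch.nn as nn

class ResidualNorm(nn.Module):
    """Wraps a sublayer (attention or feed-forward) with a skip connection
    followed by layer normalization."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        # The input x bypasses the sublayer and is added back to its output.
        return self.norm(x + sublayer(x))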

Assembling the Encoder and Decoder

Once you have created the transformer’s individual components, you can assemble them to create an encoder and decoder. 

Encoder 

The role of the encoder is to take the input sequence and convert it into a weighted embedding that the decoder can use to generate output. 

The encoder is constructed as follows (a minimal sketch in code follows the list):

  • Embedding layer
  • Positional encoder
    • Residual connection that feeds into normalization layer 
  • Self-attention mechanism
  • Normalization layer
    • Residual connection that feeds into normalization layer 
  • Feed-Forward network 
  • Normalization layer 
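Putting the pieces together, here is a minimal PyTorch sketch of a single encoder layer; the sizes mirror the original transformer and are illustrative:

import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sublayer with residual connection and normalization
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward sublayer with residual connection and normalization
        return self.norm2(x + self.feed_forward(x))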

Decoder

The decoder takes the weighted embedding produced by the encoder and uses it to generate output, i.e., the tokens with the highest probability based on the input sequence. 

The decoder has a similar architecture to the encoder, with a couple of key differences: 

  • It has two self-attention layers, while the encoder has one.
  • It employs two types of self-attention
    • Masked Multi-Head Attention: uses a causal masking mechanism to prevent comparisons against future tokens.  
    • Encoder-Decoder Multi-Head Attention: each output token calculates attention scores against all input tokens, better establishing the relationship between the input and output for greater accuracy. Because this cross-attention layer attends only over the fully available input sequence, it does not need a causal mask; masking of future output tokens is handled by the preceding masked self-attention layer. 

This results in the following decoder structure: 

  • Embedding layer
  • Positional encoder
    • Residual connection that feeds into normalization layer 
  • Masked self-attention mechanism
  • Normalization layer
    • Residual connection that feeds into normalization layer 
  • Encoder-Decoder self-attention mechanism
  • Normalization layer
    • Residual connection that feeds into normalization layer 
  • Feed-Forward network 
  • Normalization layer 

Combine the Encoder and Decoder to Complete the Transformer

Having defined the components and assembled the encoder and decoder, you can combine them to produce a complete transformer.

However, transformers do not contain a single encoder and decoder, but rather equal-sized stacks of each, e.g., six of each in the original transformer. Stacking encoders and decoders in this manner increases the transformer’s capabilities, as each layer captures different characteristics and underlying patterns from the input, enhancing the LLM’s performance. 
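As a minimal sketch, PyTorch ships a complete encoder-decoder stack in nn.Transformer; the sizes below match the original transformer. Note that it expects embeddings as input and does not include the embedding layers or the final output projection:

import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    batch_first=True,
)

src = torch.randn(1, 10, 512)   # encoder input embeddings: (batch, source length, d_model)
tgt = torch.randn(1, 7, 512)    # decoder input embeddings: (batch, target length, d_model)
out = model(src, tgt)           # (1, 7, 512)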

Data Curation

Once you have built your LLM, the next step is compiling and curating the data that will be used to train it. 

This is an especially vital part of the process of building an LLM from scratch because the quality of data determines the quality of the model. While other aspects, such as the model architecture, training time, and training techniques can be adjusted to improve performance, bad data cannot be overcome. 

Consequences of low-quality training data include:  

  • Inaccuracy: a model trained on incorrect data will produce inaccurate answers 
  • Bias: any inherent bias in the data will be learned by the model 
  • Unpredictability: the model may produce incoherent or nonsensical answers, and it can be difficult to determine why
  • Poor resource utilization: ultimately, poor-quality data prolongs the training process and incurs higher computational, personnel, and energy costs. 

As well as requiring high-quality data, for your model to properly learn linguistic and semantic relationships to carry out natural language processing tasks, you also need vast amounts of data. As stated earlier, a general rule of thumb is that the more performant and capable you want your LLM to be, the more parameters it requires  – and the more data you must curate. 

To illustrate this, here are a few existing LLMs and the amount of data, in tokens, used to train them:

Model          # of parameters    # of tokens
GPT-3          175 billion        0.5 trillion
Llama 2        70 billion         2 trillion
Falcon 180B    180 billion        3.5 trillion

For better context, 100,000 tokens equate to roughly 75,000 words – or an entire novel. So GPT-3, for instance, was trained on the equivalent of 5 million novels’ worth of data. 

Characteristics of a High-Quality Dataset

Let us look at the main characteristics to consider when curating training data for your LLM; a minimal cleaning and deduplication sketch follows the list.

  • Filtered for inaccuracies 
  • Minimal biases and harmful speech 
  • Cleaned: the data has been filtered to remove:
    • Misspellings
    • Cross-domain homographs
    • Spelling variations
    • Contractions
    • Punctuation
    • Boilerplate text 
    • Markup, e.g., HTML 
    • Non-textual components, e.g., emojis
  • Deduplication: removing repeated information, as it could increase bias in the model
  • Privacy redaction: removing confidential or sensitive data
  • Diverse: containing data from a wide range of formats and subjects, e.g., academic writing, prose, website text, coding samples, mathematics, etc.
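Here is a deliberately simplified Python sketch of cleaning and exact-match deduplication; production pipelines use far more sophisticated filtering, such as fuzzy deduplication and quality classifiers:

import re

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)            # strip markup such as HTML tags
    text = re.sub(r"[^\w\s.,;:!?'\"-]", " ", text)  # drop emojis and other non-textual symbols
    return re.sub(r"\s+", " ", text).strip()        # normalize whitespace

def deduplicate(documents):
    seen, unique_docs = set(), []
    for doc in documents:
        cleaned = clean_text(doc)
        if cleaned and cleaned not in seen:         # exact-match deduplication
            seen.add(cleaned)
            unique_docs.append(cleaned)
    return unique_docs

corpus = ["<p>Hello,   world!</p>", "Hello, world!", "A second document 🚀"]
print(deduplicate(corpus))   # ['Hello, world!', 'A second document']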

Another crucial component of creating an effective training dataset is retaining a portion of your curated data for evaluating the model. If you use the same data with which you trained your LLM to evaluate it, you run the risk of overfitting the model – where it becomes familiar with a particular set of data and fails to generalize to new data. 

Where Can You Source Data For Training an LLM?

There are several places to source training data for your language model. Depending on the amount of data you need, it is likely that you will draw from each of the sources outlined below.

  • Existing Public Datasets: data that has previously been used to train LLMs and has been made available for public use. Prominent examples include:
    • The Common Crawl: a dataset containing terabytes of raw web data extracted from billions of pages. It also has widely-used variations or subsets, including RefinedWeb and C4 (Colossal Cleaned Crawled Corpus). 
    • The Pile: a popular text corpus that contains data from 22 data sources across 5 categories:
      • Academic Writing: e.g., arXiv
      • Online or Scraped Resources: e.g., Wikipedia
      • Prose: e.g., Project Gutenberg
      • Dialog: e.g., YouTube subtitles
      • Miscellaneous: e.g., GitHub
    • StarCoder: close to 800GB of coding samples in a variety of programming languages. 
    • Hugging Face: an online resource hub and community that features over 100,000 public datasets.  
  • Private Datasets: a personally curated dataset that you create in-house or purchase from an organization that specializes in dataset curation.  
  • Directly From the Internet: naturally, scraping data directly from websites en masse is an option – but this is ill-advised because it won’t be cleaned, is likely to contain inaccuracies and biases, and could feature confidential data. Additionally, there are likely to be data ownership issues with such an approach.

Training Your Custom LLM

The training process for an LLM requires vast amounts of textual data to be passed through its neural network so that it can learn its parameters, i.e., weights and biases. Each training step is composed of two phases: forward and backward propagation. 

During forward propagation, training data is fed into the LLM, which learns the language patterns and semantics required to predict output accurately during inference. The output of each layer of the neural network serves as the input to another layer, until the final output layer, which generates a predicted output based on the input sequence and its learned parameters.

Meanwhile, backward propagation updates the LLM’s parameters based on its prediction errors. The model’s gradients, i.e., the extent to which parameters should be adjusted to increase accuracy, are propagated backwards through the network. The parameters of each layer are then adjusted in a way that minimizes the loss function: the function that quantifies the difference between the target output and the actual output, providing a measurable indication of performance. 

This process iterates over multiple batches of training data and several epochs, i.e., complete passes through the dataset, until the model’s parameters converge, that is, until further training no longer meaningfully improves accuracy. 
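A minimal, self-contained PyTorch sketch of this loop, with a toy model and random data standing in for a real corpus:

import torch
import torch.nn as nn

# Toy model and data for illustration only
model = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(), nn.Linear(64 * 8, 1000))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for epoch in range(2):                           # each epoch is a complete pass over the dataset
    for _ in range(10):                          # iterate over batches (random data here)
        inputs = torch.randint(0, 1000, (4, 8))  # token IDs: (batch, sequence length)
        targets = torch.randint(0, 1000, (4,))   # next-token targets
        logits = model(inputs)                   # forward propagation
        loss = loss_fn(logits, targets)          # quantify the prediction error
        optimizer.zero_grad()
        loss.backward()                          # backward propagation of gradients
        optimizer.step()                         # adjust parameters to reduce the loss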

How Long Does It Take to Train an LLM From Scratch?

The training process for every model will be different – so there is no set amount of time taken to train an LLM. The amount of training time will depend on a few key factors:

  • The complexity of the desired use case
  • The amount, complexity, and quality of available training data
  • Available computational resources

Training an LLM for a relatively simple task on a small dataset may only take a few hours, while training for more complex tasks with a large dataset could take months.

Additionally, two challenges you will need to mitigate while training your LLM are underfitting and overfitting. Underfitting can occur when your model is not trained for long enough, and the LLM has not had sufficient time to capture the relationships in the training data. Conversely, training an LLM for too long can result in overfitting – where it learns the patterns in the training data too well, and doesn’t generalize to new data.  In light of this, the best time to stop training the LLM is when it consistently produces the expected outcome – and makes accurate predictions on previously unseen data.

LLM Training Techniques 

Parallelization

Parallelization is the process of distributing training tasks across multiple GPUs, so they are carried out simultaneously. This both expedites training times in contrast to using a single processor and makes efficient use of the parallel processing abilities of GPUs. 

There are several different parallelization techniques that can be combined for optimal results (a minimal data-parallel sketch follows the list): 

  • Data Parallelization: the most common approach, which sees the training data divided into shards and distributed over several GPUs. 
  • Tensor Parallelization: divides the matrix multiplications performed by the transformer into smaller calculations that are performed simultaneously on multiple GPUs.
  • Pipeline Parallelization: distributes the transformer layers over multiple GPUs to be processed in parallel.
  • Model Parallelization: distributes the model across several GPUs and uses the same data for each – so each GPU handles one part of the model instead of a portion of the data. 
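As a minimal sketch of data parallelism in PyTorch (nn.DataParallel is the simplest option; DistributedDataParallel or dedicated training frameworks are preferred for serious multi-GPU and multi-node training):

import torch
import torch.nn as nn

model = nn.Linear(512, 512)   # stand-in for a transformer

# Data parallelism: replicate the model on every visible GPU and split each
# input batch into shards that are processed simultaneously.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)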

Gradient Checkpointing  

Gradient checkpointing is a technique used to reduce the memory requirements of training LLMs, making it more feasible to train them on devices with restricted memory capacity. By mitigating out-of-memory errors, gradient checkpointing also helps make the training process more stable and reliable.

Typically, during forward propagation, the model’s neural network produces a series of intermediate activations: output values derived from the training data that the network later uses to refine its loss function. With gradient checkpointing, though all intermediate activations are calculated, only a subset of them are stored in memory at defined checkpoints.

During backward propagation, the intermediate activations that were not stored are recalculated as needed, starting from the nearest stored checkpoint, rather than recomputing the entire forward pass from the beginning. Although gradient checkpointing reduces memory requirements, the tradeoff is increased processing overhead: the fewer activations that are stored, the more recomputation is required during the backward pass. 
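A minimal PyTorch sketch using torch.utils.checkpoint, which recomputes each block's activations during the backward pass instead of storing them:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList([nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(4)])

def forward_with_checkpointing(x):
    for block in blocks:
        # Activations inside each block are not kept in memory; they are
        # recomputed from the block's input during backward propagation.
        x = checkpoint(block, x)
    return x

x = torch.randn(8, 512, requires_grad=True)
forward_with_checkpointing(x).sum().backward()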

LLM Hyperparameters

Hyperparameters are configurations that you can use to influence how your LLM is trained. In contrast to parameters, hyperparameters are set before training begins and aren’t changed by the training data. Tuning hyperparameters is an essential part of the training process because it provides a controllable and measurable method of altering your LLM’s behavior to better align with your expectations and defined use case.

Notable hyperparameters include:

  • Batch Size: a batch is a collection of instances from the training data, which are fed into the model at a particular timestep. Larger batches require more memory but also accelerate the training process as you get through more data at each interval. Conversely, smaller batches use less memory but prolong training. Generally, it is best to go with the largest data batch your hardware will allow while remaining stable, but finding this optimal batch size requires experimentation. 
  • Learning Rate: how large an adjustment the LLM makes to its parameters in response to its loss function, i.e., the measure of its prediction error, during training. A higher learning rate expedites training but can cause instability and overfitting. A lower learning rate, in contrast, is more stable and improves generalization, but lengthens the training process. 
  • Temperature: applied when sampling output, it adjusts the range of possible responses to determine how “creative” the LLM is. Typically represented by a value between 0.0 (minimum) and 2.0 (maximum), a lower temperature generates more predictable output, while a higher value increases the randomness and creativity of responses (see the sampling sketch below).
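A minimal sketch of temperature applied during sampling (the logits are hypothetical scores over a three-token vocabulary):

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    # Lower temperatures sharpen the distribution (more predictable output);
    # higher temperatures flatten it (more varied, "creative" output).
    probs = torch.softmax(logits / max(temperature, 1e-5), dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.0, 0.2])
print(sample_next_token(logits, temperature=0.7))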
      

Fine-Tuning Your LLM 

After training your LLM from scratch with larger, general-purpose datasets, you will have a base, or pre-trained, language model. To prepare your LLM for your chosen use case, you likely have to fine-tune it.  Fine-tuning is the process of further training a base LLM with a smaller, task or domain-specific dataset to enhance its performance on a particular use case.

Fine-tuning methods broadly fall into two categories: full fine-tuning and transfer learning:

  • Full Fine-Tuning: where all of the base model’s parameters are updated, creating a new version with altered weighting. This is the most comprehensive way to train an LLM for a specific task or domain – but requires more time and resources.
  • Transfer Learning: this involves leveraging the significant language knowledge acquired by the model during pre-training and adapting it to a specific domain or use case. Many or all of the base LLM’s neural network layers are “frozen” to limit which parameters can be tuned; the remaining (or, often, newly added) unfrozen layers are then fine-tuned on the smaller, task-specific dataset, requiring less time and computational resources than full fine-tuning (see the sketch after this list).
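A minimal PyTorch sketch of the freezing step in transfer learning; the model below is a hypothetical stand-in for a pre-trained base LLM:

import torch.nn as nn

# Hypothetical base model: embedding layer, encoder stack, and output head
base_model = nn.Sequential(
    nn.Embedding(32_000, 512),
    nn.TransformerEncoder(nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=6),
    nn.Linear(512, 32_000),
)

# Freeze every pre-trained parameter...
for param in base_model.parameters():
    param.requires_grad = False

# ...then unfreeze only the output head, so fine-tuning updates a small fraction of the weights.
for param in base_model[-1].parameters():
    param.requires_grad = True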

Evaluating Your Bespoke LLM

After training and fine-tuning your LLM, it is time to test whether it performs as expected for its intended use case. This will allow you to determine whether your LLM is ready for deployment or requires further training. 

For this, you will need previously unseen evaluation datasets that reflect the kind of information the LLM will be exposed to in a real-world scenario. As mentioned above, this dataset needs to differ from the one used to train the LLM to prevent it from overfitting to particular data points instead of genuinely capturing its underlying patterns. 
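As a minimal illustration of evaluation on held-out data, here is a simple exact-match accuracy sketch; real benchmarks use more forgiving normalization and task-specific metrics:

def exact_match_accuracy(model_answers, reference_answers):
    # Fraction of questions where the model's answer matches the reference
    # exactly after simple normalization.
    matches = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(model_answers, reference_answers)
    )
    return matches / len(reference_answers)

# Hypothetical held-out examples, never seen during training or fine-tuning
predictions = ["paris", "1969", "Jupiter"]
references = ["Paris", "1969", "Saturn"]
print(exact_match_accuracy(predictions, references))   # roughly 0.67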

LLM Benchmarks 

An objective way to evaluate your bespoke LLM is through the use of benchmarks: standardized tests developed by various members of the AI research and development community. LLM benchmarks provide a standardized way to test the performance of your LLM – and compare it against existing language models. Also, each benchmark includes its own dataset, satisfying the requirement of using different datasets than during training to help avoid overfitting.  

Some of the most widely used benchmarks for evaluating LLM performance include: 

  • ARC: a question-answer (QA) benchmark designed to evaluate knowledge and reasoning skills. 
  • HellaSwag: uses sentence completion exercises to test commonsense reasoning and natural language inference (NLI) capabilities. 
  • MMLU: a comprehensive benchmark comprised of 15,908 questions across 57 tasks that measure natural language understanding (NLU), i.e., how well an LLM understands language and, subsequently, can solve problems.  
  • TruthfulQA: measuring a model’s ability to generate truthful answers, i.e., its propensity to “hallucinate”. 
  • GSM8K: measures multi-step mathematical abilities through a collection of 8,500 grade-school-level math word problems. 
  • HumanEval: measures an LLM’s ability to generate functionally correct code. 
  • MT Bench: evaluates a language model’s ability to effectively engage in multi-turn dialogues – like those engaged in by chatbots. 

Conclusion

In summary, the process of building an LLM from scratch can roughly be broken down into five stages:

  • Determining the use case for your LLM: the purpose of your custom language model 
  • Creating your model architecture: developing the individual components and combining them to create a transformer
  • Data curation: sourcing the data necessary to train your model
  • Training: pre-training and fine-tuning your model 
  • Evaluation: testing your model to see if it works as intended; evaluating its overall performance with benchmarks 

Understanding what’s involved in developing a bespoke LLM grants you a more realistic perspective of the work and resources required, and of whether it is a viable option for your organization.

However, though the barriers to entry for developing a language model from scratch have been significantly lowered, it is still a considerable undertaking. So, it is crucial to determine if building an LLM is absolutely essential – or if you can reap the same benefits with an existing solution. 

The post A Guide to Building an LLM from Scratch appeared first on Symbl.ai.

]]>