Reading with Intent: Equipping LLMs to Understand Sarcasm in Multimodal RAG Systems
https://symbl.ai/developers/blog/reading-with-intent-equipping-llms-to-understand-sarcasm-in-multimodal-rag-systems/ (27 Aug 2024)

Retrieval Augmented Generation (RAG) has emerged as a powerful approach for enhancing the knowledge and capabilities of Large Language Models (LLMs). By integrating external information sources like Wikipedia or even the open internet, RAG systems empower LLMs to tackle a wider range of tasks with increased accuracy. However, as we increasingly rely on these systems, a critical challenge arises: the inherent ambiguity of human language. 

While LLMs excel at processing factual information, they often struggle to grasp the nuances of emotionally inflected text, particularly sarcasm. This can lead to misinterpretations and inaccurate responses, hindering the reliability of multimodal RAG systems in real-world scenarios. 

In this article, we describe the main findings of our recent research, where we explore this challenge in depth and propose a novel solution: Reading with Intent.

The Pitfalls of Literal Interpretation 

Human communication transcends mere words on a page. Tone of voice, facial expressions, and subtle cues all contribute to the intended meaning. When LLMs – trained primarily on factual data – encounter sarcasm, they often fail to recognize the underlying incongruity between the literal meaning and the intended message. Imagine an LLM interpreting a sarcastic comment like “Oh, that’s just great” as a genuine expression of positivity!

Poisoning the Well: Creating a Sarcasm-Aware Dataset


To study this phenomenon, we first needed a dataset that reflects the realities of online communication, where sarcasm is prevalent. Such datasets are hard to curate manually, so we generated our own by taking the Natural Questions dataset, a benchmark for open-domain question answering, and strategically injecting different types of sarcastic passages into its retrieval corpus.

Our methodology involved:

  1. Sarcasm Poisoning: rewriting factually correct passages in a sarcastic tone using a large language model (Llama3-70B-Instruct), to assess a model’s ability to detect and interpret sarcastic tones.
  2. Fact-Distortion: distorting factual information to create intentionally misleading passages and then rewriting them in a sarcastic tone, to challenge the LLM’s ability to handle misleading information when sarcasm is present, simulating more complex real-world scenarios.

This two-pronged approach allowed us to investigate how sarcasm affects both comprehension and accuracy, regardless of the underlying information’s veracity.
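
For a rough sense of what the Sarcasm Poisoning step looks like in practice, here is a minimal sketch using a small open model from Hugging Face’s transformers library as a stand-in for Llama3-70B-Instruct. The model choice and the prompt wording are illustrative assumptions, not the exact setup from the paper.

from transformers import pipeline

# Illustrative stand-in model; the paper used Llama3-70B-Instruct for the rewrites.
generator = pipeline("text-generation", model="gpt2")

passage = (
    "The Eiffel Tower was completed in 1889 and stands roughly 330 metres tall."
)

# Hypothetical rewrite prompt for the Sarcasm Poisoning step:
# keep every fact intact, change only the tone.
prompt = (
    "Rewrite the following passage in a heavily sarcastic tone while keeping "
    "every fact unchanged:\n\n" + passage + "\n\nSarcastic rewrite:"
)

poisoned = generator(prompt, max_new_tokens=120)[0]["generated_text"]
print(poisoned)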

Reading with Intent: A Prompt-Based Approach


Our proposed solution, Reading with Intent, centers on equipping LLMs of all varieties with the ability to recognize and interpret the emotional intent behind text. We achieve this through a two-fold strategy:

  1. Intent-Aware Prompting: We explicitly instruct the LLM to pay attention to the connotation of the text, encouraging it to move beyond a purely literal interpretation.
  2. Intent Tags: We further guide the LLM by incorporating binary tags that indicate whether a passage is sarcastic or not. These tags, generated by a separate classifier model trained on a sarcasm dataset, provide valuable metadata that helps contextualize the text.

With Intent-Aware Prompting, the LLM receives explicit instructions to consider emotional undertones, akin to teaching it to ‘read between the lines.’ Intent Tags, on the other hand, function as markers that flag potentially sarcastic passages, giving the model a heads-up that not everything should be taken at face value.
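
As a minimal sketch of how these two pieces might fit together at inference time, the snippet below builds an intent-aware prompt over retrieved passages that already carry binary sarcasm tags. The tag values and the instruction wording are illustrative; in the real pipeline the tags come from the trained sarcasm classifier.

# Retrieved passages paired with binary sarcasm tags from a classifier
# (hard-coded here for illustration).
passages = [
    ("Oh sure, the 'record-breaking' launch that moved a whole twelve units.", True),
    ("The product launched in March 2021 and sold 12 units in its first week.", False),
]

tagged = "\n\n".join(
    "[intent: " + ("sarcastic" if sarcastic else "neutral") + "]\n" + text
    for text, sarcastic in passages
)

question = "How many units did the product sell in its first week?"

# Intent-aware instruction (illustrative wording, not the exact paper prompt).
prompt = (
    "Read the passages below and pay attention to their connotation, not just "
    "their literal wording. Sarcastic passages may imply the opposite of what "
    "they state.\n\n" + tagged + "\n\nQuestion: " + question + "\nAnswer:"
)

print(prompt)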

Promising Results and Future Directions

Our experiments demonstrate that Reading with Intent significantly improves the performance of LLMs in answering questions over sarcasm-laden text. The results were consistent across various LLM families, highlighting the generalizability of our approach. We tested it on the Llama-2, Mistral/Mixtral, Phi-3, and Qwen-2 families, across models ranging from 0.5B to 72B parameters (as well as the 8x22B Mixtral).

While this research marks an important step towards sarcasm- and deception-aware LLMs, several avenues for future exploration remain:

  • Enhancing Sarcasm Detection: Developing more robust and nuanced sarcasm detection models that can handle subtle and context-dependent instances of sarcasm.
  • Beyond Binary Tags: Exploring multi-class intent tags that capture a wider range of emotions beyond sarcasm.
  • Instruction-Tuning: Fine-tuning LLMs on sarcasm-infused data to further enhance their ability to understand and respond to emotionally charged language.

These advancements can drastically improve understanding and user interactions in customer service, virtual assistance, contact centers, and any scenario where understanding human intent is critical.

By addressing these challenges, we can build more robust and reliable multimodal RAG systems that are better equipped to navigate the full complexity of human communication.  

Want to read more?  Check out our full research paper [link to research paper], where you can explore our methodology, experimental setup, and detailed analysis of the results. 

Want to experiment yourself? We have released our sarcasm dataset, the code for creating it, and our Reading with Intent prompting method. You can find the repository on GitHub here: https://github.com/symblai/reading-with-intent, and the dataset on Hugging Face 🤗 here: https://huggingface.co/datasets/Symblai/reading-with-intent

Symbl.ai LLM – Nebula Private Beta Invitation
https://symbl.ai/developers/blog/llm-nebula-private-beta/ (14 Jul 2023)


Symbl.ai is excited to announce a Private Beta launch of Nebula, our LLM for natural human conversations. Nebula is intended for businesses and developers who are interested in building generative AI powered experiences and workflows that involve human conversations including sales calls, meetings, customer calls, interviews, emails, chat sessions and other scenarios.

In mid-July, Symbl.ai will start making the Nebula LLM available to developer communities through our Private Beta program. During the program, developers can get a hands-on preview of Nebula’s performance across a variety of use case scenarios (visit our technical documentation for a full list of use case scenario descriptions).

Use Case Examples:

  • Prompt: “What could be the customer’s pain points based on the conversation?”
  • Prompt: “What sales opportunities can be identified from this conversation?”
  • Prompt: “What best practices can be derived from this conversation for future customer interactions?”

The Symbl.ai Nebula model takes two inputs in its prompt: an instruction, which is a command or question for the model, and a conversation transcript. Nebula is a large language model (LLM) trained to understand nuances in human conversations and perform instructed tasks in the context of the conversation. Human conversations typically involve multiple participants and complex interactions between them across long, distant dialogues.

Key Highlights:

  • Nebula takes into consideration input instructions, questions, conversation transcripts, and context to generate output that reflects the task specified in the instruction. This enables the model to generate responses in the context of the provided conversation.
  • Nebula can create human-like text, based on the inputs provided, which makes Nebula effective in handling a wide variety of use case scenarios.
  • Nebula provides developers the ability to adjust the generation parameters of the model that control characteristics such as diversity, randomness, and repetition to change the model’s behavior to satisfy a specific use case.
  • Developers can integrate Nebula through a simple REST API:

import requests
import json

url = 'https://api-nebula.symbl.ai/v1/model/generate'

headers = {
    'Content-Type': 'application/json',
    'ApiKey': 'YOUR_API_KEY',
}

data = {
    "prompt": {
        # Your question or instruction
        "instruction": "What are the customer pain points based on this conversation?",
        # Your conversation transcript
        "conversation": {
            "text": "Representative: Hello, How are you?nCustomer: Hi, good. I am trying to get access to my account but ..."
        }
    }
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.text)

Model Playground

We’ve created a Model Playground that allows developers to test the Nebula LLM, without writing any code, against various conversations and tasks. Model Playground is a great way to start exploring Nebula’s capabilities.

  • Get started with readily available conversation transcripts and prompt suggestions.
  • Use your own transcript by pasting it in or uploading a text file.
  • Fine-tune generation parameters to find the right combination for your use case.

To request access to Model Playground, please visit our sign-up page.

Using Nebula in your applications

Nebula can analyze conversation transcripts and generate responses based on the conversation and the instruction or question in the prompt. Developers can instruct Nebula to produce summaries, suggest follow-up questions, draft emails, flag issues to review, qualify sales leads, identify and recommend resolutions to customer issues, or even surface business opportunities. Nebula can also answer specific questions about the conversation as part of the instruction. Check out the docs to learn more.


For example, Nebula powers generative AI capabilities in Symbl.ai’s Sales Intelligence solution that provides context-aware sales coaching experiences to business organizations. In this example, Nebula performs various tasks via Model API to analyze a conversation with a prospect, identifying themes, sentiments, next steps, questions and answers, and objections, to help generate follow-up responses and identify potential sales opportunities.

Call for Developers

At Symbl.ai, we believe that there’s tremendous value in human conversations, especially in business, and Nebula can help harness that value across various conversation types. We are excited to see what you build with Nebula. We have been working with developers and businesses who have been pushing boundaries by leveraging advanced generative AI for conversation intelligence, and we’re eager to offer the power of Nebula to this inspired community. In the same spirit, we encourage you to apply for our Startup program if you are working at a startup, or sign up to become a model tester to accelerate your access to the Nebula model. We’d love to see our current and future developers push the boundaries of conversation understanding with Nebula use cases in the following areas:

– Sales

– Customer Support

– Meeting Productivity

– Recruitment

– Training and Education

– Workflow Automation

– Data Analytics and Sciences

– Healthcare Technology

– Finance and Insurance Technology

To register for the Nebula LLM Private Beta Preview, please visit our sign up page.

For more information on use case scenarios, see our technical API documentation page.

We’re very excited to work with more AI developers and businesses to bring your vision to life.

Tips for Improving speech to text accuracy in real-time transcription
https://symbl.ai/developers/blog/tips-for-improving-speech-to-text-accuracy-in-real-time-transcription/ (4 Sep 2021)

Speech to text accuracy in real-time transcription leaves a lot to be desired. But integrating your application with the right APIs and setting yourself up for success – capturing clear audio, using lossless audio formats, and using techniques like speaker diarization (or, better yet, separate channels) – can significantly improve it.

Real-time transcription (and transcription in general) can be filled with moments where you see something appear in the closed captions like, “I’m not sure what we can do about the muffin. It’s been sitting in the sink all day.”

You have no idea who was trying to say what, but you know it had nothing to do with muffins or sinks.

Sure, these moments can be humorous, but they’re not helpful. They are an unfortunate by-product of transcription, especially real-time transcription.

Even the most accurate transcription setup is going to capture some nonsense simply because human-to-human (H2H) conversations can be hard to follow. We have accents, we talk over each other, we have voices that sound the same, we use words that either the transcriptionist doesn’t recognize or, if you’re using AI-assisted transcription, that the AI doesn’t know.

How to improve real-time speech to text accuracy

Unlike batch transcription, which is done after the fact and at a much slower pace, real-time transcription requires a keen ear, attention to detail, and the ability to focus on sometimes complex conversations without distraction (which is why real-time transcription can be expensive).

The good news is that transcription with enhanced AI can help. Not only is it more accurate, but it’s getting increasingly cheaper, although it’s still not perfect.

AI can help you improve the accuracy of real-time transcription when combined with factors like clear audio, sentence boundary detection, punctuation detection and speaker diarization.

Ensure clear audio

This is the best thing that you can do to improve the speech to text accuracy of your real-time transcription – and it’s also one of the things that you have the least amount of control over.

Clear audio is best because there’s less noise happening on the feed. This means there isn’t a lot of background noise, each speaker is coming in loud and clear, and the audio is being captured in a format that is usable.

Ideally, each person would have his/her own microphone (preferably wired) and the conversation would be happening in a space that minimizes background noise (e.g.  a meeting room).

The challenge here is that, unless the audio is being captured in a studio setting that you can completely control, it can be hard to get a perfectly clear recording for everyone. Poor connectivity also impacts the accuracy of the transcription, but it’s tricky to guarantee a stable, high-bandwidth network connection for each participant.

Use speaker diarization

Speaker diarization is the process of sectioning an audio stream that contains multiple speakers into segments that are associated with each individual speaker. In other words, it blocks off chunks of audio based on who’s speaking at the time. This makes it easier for the AI to identify who each person is, when they’re talking, and what they’re saying.

Figure 1 – Speaker diarization creates segments for each speaker identified in a conversation (Source)

Speaker diarization doesn’t necessarily improve word error rate, but you do get better context for the conversation (which improves accuracy) and you’re able to better establish who said what during the conversation.

Speaker diarization typically follows a process like this:

  • Speech detection: This phase uses a Voice Activity Detection tool to identify which sections of the audio are speech and which sections aren’t. This allows the system to trim out things like silence and any sounds that aren’t obviously speech.
  • Speech Segmentation: Speech segments are extracted in small chunks, typically broken up by words, syllables, or phonemes (the smallest unit of sound within a word). These segments are pulled out one speaker at a time.
  • Embedding extraction: A neural network-based embedding of the segments is created as a vector representation of the data.
  • Clustering: The segments are then clustered together and labeled based on who’s talking. This helps identify who’s talking at any given time and also identifies the number of participants in the conversation.
  • Transcription: The final stage where the spoken conversation is converted into text.

Figure 2: A simplified version of the diarization process
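
To make the clustering stage a little more concrete, here is a minimal sketch using scikit-learn, with random vectors standing in for real speaker embeddings. The embedding dimensionality and the distance threshold are arbitrary, illustrative choices.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Random vectors standing in for per-segment speaker embeddings produced by a
# neural embedding model (five segments from one speaker, four from another).
rng = np.random.default_rng(0)
segment_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 128)),
    rng.normal(loc=1.0, scale=0.1, size=(4, 128)),
])

# With no prior knowledge of how many people are talking, cluster by a
# distance threshold instead of a fixed number of clusters.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0)
labels = clustering.fit_predict(segment_embeddings)

print("Speaker label per segment:", labels)
print("Estimated number of speakers:", len(set(labels)))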

Use separate channels

Another factor that may be out of your control but ultimately leads to better transcriptions is to capture each speaker in their own audio channel. What this does is provide a separate, clear audio channel for each person. It is, in a sense, a hardwired version of speaker diarization.

Use lossless audio formats

When capturing audio, even for real-time transcription, you need to make sure the audio stream is high quality. When audio is compressed into different formats you lose quality, meaning some sounds get muddled or lost completely. A lossless format, like FLAC or WAV, doesn’t compress the audio and, as a result, preserves the integrity of your audio.
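
As a quick sanity check on the capture side, the standard-library sketch below inspects a WAV file’s properties. The file name is hypothetical, and the 16 kHz figure is a common rule of thumb for speech rather than a hard requirement.

import wave

# Hypothetical capture file; replace with your own recording.
with wave.open("meeting_recording.wav", "rb") as recording:
    print("Channels:", recording.getnchannels())
    print("Sample width (bytes):", recording.getsampwidth())  # 2 bytes = 16-bit PCM
    print("Sample rate (Hz):", recording.getframerate())
    if recording.getframerate() < 16000:
        print("Consider capturing at 16 kHz or higher for speech recognition.")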

Custom vocabulary

Most industries use words that are specific to their industry. These words can be challenging during real-time transcription because they’re either words that don’t exist outside of the industry (things like acronyms) or they’re commonly used words that have industry-specific meaning.

With AI-assisted transcription, you can train the AI to recognize and understand those custom words to help the system provide more accurate transcriptions.
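
Conceptually, supplying custom vocabulary is usually just a matter of passing the domain terms alongside your recognition configuration. The fragment below is purely illustrative – the field names are assumptions, not the exact Symbl API schema, so check the documentation for the real parameter names.

# Illustrative configuration fragment; the field names are hypothetical.
streaming_config = {
    "speechRecognition": {
        "languageCode": "en-US",
        # Domain-specific terms the recognizer should favor over
        # similar-sounding common words.
        "customVocabulary": ["diarization", "WebRTC", "churn rate", "SSO"],
    }
}

print(streaming_config)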

Get better transcriptions using conversation intelligence

It might sound like there’s a lot of work involved in improving the accuracy of real-time transcription, but a lot of these features (not the ones that require control over the recording location) can be easily added via Symbl.ai APIs.

In fact, Symbl.ai has built a context-aware conversation intelligence platform that enables accurate real-time transcription. Our streaming API lets you capture conversations in real time, and you can easily add speaker diarization, as well as custom vocabulary, to ensure you get the most accurate transcriptions possible. On top of that, our advanced contextual AI understands the various dimensions of the conversation and uses them to further improve recognition of the text and of who said what – so you get the most accurate transcriptions and relevant insights possible.

Want to know more? Check out our documentation to learn how you can get started with our easy-to-use APIs.

Additional reading

What is Speaker Diarization?

What’s That, Human? The Challenges of Capturing Human to Human Conversations

 

Building a Conversation Intelligence System
https://symbl.ai/developers/blog/building-a-conversation-intelligence-system/ (5 May 2021)

The conversation intelligence system for your product can range from a simple rule-based engine with speech recognition that is applied to a specific domain or use case, to a complex, continued contextual understanding system that can work across different (open) domains and sources of data.

Having covered what conversation intelligence is and why the need for such a platform is skyrocketing for every business and product, here we share how to build a conversation intelligence system effectively.

Conversation intelligence technology landscape

Here’s an overview of the entire conversation AI/intelligence landscape:

  • Text analysis for reviews, change-request emails, and social conversations 

Analysis for things like text classification, entity and topic extraction, sentiment, and your own custom keywords or intents. Conversation intelligence for text is great for mining short-form conversations, where sentence structure is more precise and organized, because it can identify and structure facts, relationships, and assertions that would otherwise remain buried in the mass of natural conversation.

  • Human to machine conversations

These are chatbot or voicebot frameworks with custom intents and named entity recognition for domain or use-case specific intents (e.g. Rasa, DialogFlow). It’s great for building bots that solve specific problems by conversing with humans to do things like book a flight, check the weather, open a support ticket, find restaurants, etc.

  • Human to human conversations

These are ideal for making sense of human to human conversations, like sales conversations, brainstorming meetings, Slack chats, emails, etc. For example, during customer care calls, conversation intelligence can detect whether it’s a new or returning customer, if their voice is showing negative emotions, and suggest the agent’s next actions.

The conversation intelligence workflow

The conversation intelligence workflow has three stages, which we walk through below.

Open domain or closed domain system

When thinking about how to scale your system for other types of conversations, you’ll first need to consider what approach to build it with. Open domain and closed domain are two very different approaches to building a conversation understanding system (CUS). You can use either for H2H conversations, depending on the scope and complexity of the conversation and the data sources that exist in your product.

  • Open domain systems are ideal for free-flowing, broad, and versatile conversations (such as engineering discussions, product meetings, and consulting sessions) that occur horizontally and whose outcomes vary depending on the type of conversation.
  • Closed domain systems are better for known, scoped conversations with limited outcomes, where there is less unknown and a pattern can easily be recognized from a fixed data set.

A closed domain conversation intelligence system is built to understand only specific kinds of conversations – for instance, sales calls or customer support interactions. This is typically achieved using traditional supervised learning techniques, which makes such systems practically useless when applied to other domains or types of conversations. And because training data gathering, data clean-up, model research and development, benchmarking, deployment, and continuous maintenance of models all become overhead, it is almost impractical to scale to a product that can understand H2H conversations without being biased towards a specific domain or type of conversation.

An open domain system, by its nature, avoids these challenges. It is not built to understand only specific kinds of conversations, but rather to understand language and conversations at a fundamental level. Both approaches have pros and cons, so you should make a conscious decision early on, with the future roadmap for your business in mind.

Symbl.ai is an open-domain system, which allows it to scale across any conversation intelligence use case. This means that whether you are building an application specifically for the sales domain, or a general-purpose meeting or collaboration application, you can leverage the benefits of Symbl’s AI capabilities equally without having to build and train machine learning models.

Getting started

Based on whether you decide to build a closed domain or an open domain system, there are three stages to building a conversation intelligence system:

Figure: The three stages to building and implementing a conversation intelligence system.

1. Start with speech recognition – To begin, you must determine a number of key factors. First, you need to know whether your application will be used for real-time or asynchronous conversations. You can determine this from the types of communication you will be handling; for example, a phone chat or team meeting requires real-time speech recognition, whereas a video recording or information from an email or document is an asynchronous form of communication.

You also need to identify all the languages and dialects your application might encounter. You can find this out by considering the historical data you have available and by asking your client or the business whom the application will be recording and how, considering factors like the geographical sources of the conversations. Consider the channels it will be used for, too, and their associated audio quality, and determine whether (and how much) the recordings have been compressed to prioritize storage efficiency over audio quality.

The most effective change you can make is to ensure your recordings capture high-quality audio; the ideal format is a lossless one like LINEAR16 (PCM) or Free Lossless Audio Codec (FLAC) with a minimum 16 kHz sample rate, which is optimal for human speech. In addition, you should assess the level of background noise and the resulting speech-to-noise ratio.

Image source: CallMiner

2. Build the required machine learning framework – This can be approached in several ways depending on the scope and required outcome of the conversation intelligence system you are building. The first step is to decide on your strategy. Conversation intelligence generally uses intent-based systems, which are classification models built with supervised learning. Supervised learning algorithms are trained using labeled data, so you will need to gather training data in huge amounts to get started, which can be expensive.

The simplest way to build any model is to write rules. However, rules don’t scale, and they never generalize. As your system becomes more complex, designing and developing rules becomes an exponentially harder challenge, as does translating those rules into code. Instead, you tell your AI system specifically what to look for, and the model is trained until it can detect the underlying patterns and relationships, which enables it to produce good results when presented with new data.

To make your system as sophisticated as possible, you need to incorporate deep understanding techniques. Deep learning alone is where the system is trained, or just programmed; add deep understanding and the system is educated. Supervised learning is good at classification and regression problems, such as determining sales volumes for a future date; the aim in supervised learning is to make sense of data toward specific measurements. A more advanced approach is to build an open-domain system, which has an inherent ability to learn on its own and automatically generalize the meaning of concepts and their cause-effect relationships, sidestepping most of these problems.

3. Continuously train and maintain the machine learning model – Once you’ve built your model, you need to train it to make it as accurate and robust as possible. This requires a vast set of training data to begin with and a feedback loop to help the model learn from previous iterations. You’ll also need to continuously mitigate biases in the training data: because the training data is created by human beings, it carries inherent human biases, and as more people start using the system those biases need to be normalized. This is a long and intensive process in supervised learning-based models, and identifying and eliminating sources of bias is critical to model accuracy. In addition, be prepared to tackle challenges like data cleaning and hyperparameter tuning.

If you are able to successfully build an open-domain system, or choose a vendor that uses one, your applications will scale and reach your users much faster than with traditional machine learning techniques.
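
To ground the supervised, intent-based approach described above, here is a deliberately tiny sketch using scikit-learn: a handful of hand-labeled utterances (a real system needs orders of magnitude more data) and a simple linear classifier standing in for a deep model.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled training data: utterance -> intent.
utterances = [
    "I'd like to cancel my subscription",
    "Please cancel my account today",
    "Can you walk me through the pricing tiers?",
    "How much does the premium plan cost?",
    "The app keeps crashing when I upload a file",
    "I'm getting an error every time I log in",
]
intents = ["cancellation", "cancellation", "pricing", "pricing", "bug_report", "bug_report"]

# Bag-of-words features plus a linear classifier: the classic supervised recipe.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(utterances, intents)

print(model.predict(["It keeps crashing whenever I log in"]))  # likely: bug_report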

Build a smarter model

Any learning system benefits from looking at more data. The way it uses that data defines how good it will be at learning, how well it generalizes that learning, and how fast it can learn. One of the exciting features of this type of AI is the ability to understand human conversations in absolute real time: recording and processing the conversation as it happens. You’ll need an API interface to do this, and you need to bring a layer of structure onto that by integrating speech; speech recognition is the analog layer upon which you build the intelligence.

You can target the model and data analysis at specific problems. For example, if you want to build an application that listens to a conversation and extracts the important information from it, you need to:

  1. Build the data set
  2. Design your machine learning
  3. Write the code

You’ll also need a way to ingest your conversation data, either textual or speech-based. Inevitably, you’ll have to devise solutions that can handle both real-time and recorded conversations, and in either case you’ll need a speech recognition system that can do all of this with a high level of accuracy.

Remember, this is always going to be a long process, but the more data you can gather and fine-tune on, the more sophisticated your AI will be. You can experiment with Symbl.ai‘s standard version API to build intelligence features for conversations without needing to gather more data or worry about deploying and maintaining complex machine learning and deep learning models.

The next post in our series looks into the different applications for your conversation intelligence system.

The Developer’s Introduction to Real-Time Passive Conversation Intelligence
https://symbl.ai/developers/blog/the-developers-introduction-to-real-time-passive-conversation-intelligence/ (29 Apr 2021)

Real-time passive conversation intelligence analyzes human conversations as they happen and extracts meaningful insights – like topics, action items, intents, and sentiments – without having to use a wake word or give commands. Businesses can use this natively in their workflows or products to improve their sales, productivity, and customer experiences.

What’s real-time passive conversation intelligence?

Let’s start from the top: Conversation Intelligence (CI) uses AI and machine learning to transcribe and analyze human conversations – either with other humans or with bots – to pull valuable, actionable insights from them. These insights can be anything from the most important topics discussed in a company meeting to which phrases convert the most customers on sales calls. Next: a conversation intelligence system can be either active or passive:

  • Active CI: Has to interact with a human and receive specific instructions before it can take action. For example, you have to say “Hey Google” for the AI to wake up and hear you yell for the next song on your playlist. Think of active CI as an obedient (but somewhat simple) droid that only does what you tell it.
  • Passive CI: Doesn’t need to interact with humans or be instructed. It runs quietly in the background, absorbing what’s being said with human-like understanding, and then surfaces relevant information at exactly the right time. Think of passive CI as the smart assistant who’s constantly taking notes and chimes in when they have something important or useful to add. 

Lastly: a passive CI that works in real time means it ingests and analyzes incoming audio/video/text from a conversation and instantly pulls up insights. This is especially useful for customer support calls and e-learning, where the CI can pull up important notes about a customer or show definitions for confusing terms during a lesson.  So, “real-time passive conversation intelligence” boils down to a software’s ability to extract meaningful data from human speech and text, while the conversation is happening. Voila. 

Where can you use it?

Most developers aren’t aware of real-time passive conversation intelligence, let alone what it can do for them. To be fair, building something that silently listens to your conversations isn’t an easy sell. But, in the right context and especially where human to human conversations are involved, passive CI can be an incredibly powerful tool for just-in-time insights, improved productivity, and effective customer interactions.  Here are some popular use cases to get your inspiration going:

Customer support 

There’s no better source for what customers really want than the customers themselves. CI can instantly analyze customer care calls and chats, then unlock valuable insights that agents can use to deliver more efficient and relevant customer experiences. Here are a few things CI can do:

  • Use real-time contextual understanding to transcribe the conversation, highlight important topics, and even detect changes in the customer’s emotional state as they happen. 
  • Surface relevant information that the agent can use to steer the call – like if the customer has had this issue before, or their preferred language is French. 
  • Automate real-time actions, like redirecting a customer to the right agent.
  • Instantly update the CRM with useful post-call summaries about the customer to better prepare the next agent they interact with. 

Sales calls

Businesses always want to understand their customers better so they can drive more conversions. CI makes this possible by mining meaningful data and insights from every sales call, and can then do useful things like:

  • Identify what phrases work best in a sales call – and which phrases to avoid. 
  • Suggest actions in real-time to help the agent guide the customer towards a sale.
  • Highlight important topics, questions, and phrases to improve sales scripts and coach new agents on best practices. 
  • Analyze sentiment, context, and word placement to measure buyer’s intent so the agent can capitalize on key moments in the conversation.
  • Automate tasks and follow-ups, like scheduling a call with a customer when their trial period ends.

Meetings

Humans aren’t all that great at multitasking, so adding a note-taking AI to the meeting lets them focus on being present in the conversation – increasing participation and engagement. With CI software, you can:

  • Identify important topics, questions, action items, and decisions that can be analyzed for internal insights or surfaced in real-time. 
  • Suggest contextual insights at exactly the right moment, like instantly pulling up the answer to someone’s question on-screen.
  • Create highly-accurate transcripts and add closed-captions during calls.
  • Automate actions and follow-ups, like sending post-meeting summaries or scheduling the next meeting on everyone’s calendars.
  • Compile useful analytics like positive and negative sentiments, silence, and talk ratios (so you know who could probably spend more time on mute).

Example of real-time passive conversation intelligence in a meeting 

For a better idea of CI’s potential, let’s take the case of a team on a video conference to discuss their latest project. As they join the call, their CI platform connects right along with them, then sits quietly in the background, ready to transcribe the conversation and record important topics, questions, and action items. The CI logs the meeting date and time and identifies each participant on the call. It can also pull up the action items from their previous call for the team to review. If participants join the meeting late, the CI can show them an on-screen summary of what’s been discussed so far so they can catch up without interrupting the conversation.

The CI is also capable of identifying questions in the conversation and responding in ways that add value. For example, it can quickly search the company’s knowledge base, retrieve the most relevant documentation for a question, and send it in a direct message to the person who asked. When designed for contextual understanding and with access to chat and email conversations, a CI can pick up on vague references, like the phrase “this project,” and know exactly what the speaker is referring to. The CI can also catch little follow-ups that typically fall through the cracks and automatically assign a task to the participants involved.

After the meeting, the CI automatically sends a post-meeting summary to all participants so they can revisit the main takeaways and their to-dos. This is just a glimpse of how a highly accurate CI allows teams to put down their pens and fully focus on the conversation for better, more productive meetings.

Implementing real-time passive conversation intelligence

At this point, you have a good grasp of why passive CI is a valuable addition to any app that deals with human to human conversations. If you’re seriously thinking of implementing real-time passive CI, here are a few capabilities to think about:

  • Speech recognition and contextual understanding for accurate transcriptions and closed-captioning.
  • Streaming conversations into your application for real-time insights.
  • Asynchronously updating information from the AI to your products to surface information, either for internal analysis or to act upon in the moment.
  • Capturing sentiment in real time and measuring how it changes throughout the conversation.
  • Defining what follow-ups or recommended actions to automate and integrate into your existing workflow.

As you can imagine, implementing any of these takes a tremendous amount of time, resources, and caffeine. You’d need to fiddle with things like speaker diarization and communication protocols, and battle all the common challenges of capturing human to human conversations – not to mention the hassle of constantly recalibrating your CI system so it can understand conversations in different domains. This is where APIs, like Symbl, help developers solve these problems faster.

Symbl provides all the contextual AI capabilities and scalable infrastructure to make real-time passive CI easy to implement. With flexible APIs, SDKs, and out-of-the-box integrations, developers can quickly bring a human-level understanding of voice and text conversations across different domains – without upfront training data, wake words, or custom classifiers.

To see Symbl in action, check out this sample app of Symbl for Zoom that lets you invite a CI into your Zoom meetings for real-time transcription. For more goodies, browse the Symbl demo library on GitHub to sample integrating voice intelligence into existing applications. With a solid understanding of real-time passive CI and the help of done-for-you APIs to make it a reality, you’ll find that transcribing conversations and unlocking useful insights are just the tip of the iceberg.
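
As one concrete illustration of the capabilities listed above – capturing sentiment in real time and measuring how it changes – the sketch below runs an off-the-shelf open-source sentiment model over transcript segments as they arrive. This is a generic stand-in, not the Symbl API, and the segments are made up.

from transformers import pipeline

# Off-the-shelf sentiment model; a real system would feed this from a live
# transcription stream rather than a hard-coded list.
sentiment = pipeline("sentiment-analysis")

incoming_segments = [
    "Thanks for taking my call, I hope you can help.",
    "I've been charged twice this month and nobody has replied to my emails.",
    "Okay, a refund and a follow-up email would resolve this for me.",
]

for i, segment in enumerate(incoming_segments, start=1):
    result = sentiment(segment)[0]
    print(f"Segment {i}: {result['label']} ({result['score']:.2f}) - {segment}")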


State-of-the-Art Conversation Intelligence: Deep Learning and Deep Understanding
https://symbl.ai/developers/blog/state-of-the-art-conversation-intelligence-deep-learning-and-deep-understanding/ (21 Apr 2021)

State-of-the-art deep learning is needed for natural conversations with a conversation intelligence system and to be as close to the sophistication of a human brain as possible. But it’s not enough alone. The system also needs deep understanding to model, generalize, and then run analytics using all of its knowledge, just as a human would.

Deep learning for natural conversations

The human brain is one of the most efficient computing machines we know of — it’s an extremely sophisticated neural network capable of abstract thinking.

Artificial neural networks are among the most cutting-edge algorithms in use today. They model the structure of the human brain on a computer, with neurons and synapses organized into layers.

Artificial neural networks and their present uses.

Deep learning differs from machine learning

Machine learning is a set of statistical analytics tools which help you model the patterns in data. It then learns based upon the rules that you formulate, and sometimes a human might intervene to correct its errors.

Deep learning uses large amounts of data, takes longer to train, and is computation-heavy. But this means it can model patterns in a more sophisticated manner than traditional statistical machine learning techniques. The increased scale of data provides a greater capacity to learn patterns; for example, finding stock market trends is a less complicated task for a computer than recognizing speech or faces, which involves more than a simple mathematical formula. This makes deep learning a good approach for more complex modelling that has to recognize sophisticated patterns.

But deep learning in conversation intelligence doesn’t match the sophistication of a human brain. This is because deep learning is just that — learning. It doesn’t go to that next level of understanding. At this deep learning stage, the system learns and exploits statistical patterns in data (and is really good at doing that with lots of training) rather than learning/understanding its meaning in a flexible and generalized way as humans do.

Let’s consider an example …

When you say, “I can recognize faces”, you don’t need to first learn what a face is because you already know. An understanding of what a face is would require you to have a more conceptual understanding. Taking it to another level, you don’t need to consider why people have faces. This is more abstract and requires pattern recognition and for your brain to model it, which in turn leads to a deep understanding.

Machines can’t do this intuitively. They need a mechanism and high-level techniques which arise from deep learning. Even these things won’t lead to a human level of understanding, rather an ability to generalize knowledge to make more general decisions without relying purely on patterns. This is an example of operating in an open domain where the model can conceptualize data from one domain and use it intelligently in another.

Taking conversation intelligence to the next level with deep understanding

Techniques used in deep understanding intelligence.

 

Conversation intelligence systems solve problems much as the human brain does: they learn by recognizing patterns. But with deep learning alone you can’t ask the AI to truly understand what it has recognized. For example, it will learn that some apples are red but not all of them are. There’s no real understanding of what it means to be a red or a green apple, or even an apple at all – it just recognizes the words.

Imagine that you have a deep learning system that can detect human faces but it is not able to deal with the wider concepts – this is where it falls short. It has no capacity to reason the patterns it detects. To do this, a thinking mechanism is needed that has a knowledge of the world and is capable of remembering things.

  • Deep learning is sophisticated pattern recognition.
  • Abstract knowledge modelling provides the necessary knowledge required to have a “thinking” element.
  • Just modelling the knowledge is not enough; you also need a statistical system to generalize that knowledge. For example, the system can understand how a car and a truck work but not the nuances of the difference between the two. An ability to generalize means the system can apply knowledge – such as the fact that both contain an internal combustion engine – to something else, like a motorbike.
  • Inferencing systems should be able to take all of the information and draw conclusions from it.

So, how do you build this in a more sophisticated way so your system has deep understanding, not just learning?

Well, it’s a combination of modeling matched with deep learning. Deep learning alone is where the system is trained, or just programmed, but add deep understanding into that and the system is educated. It needs to understand what an apple is, and that there are more colors, evolution factors, etc.

This is the knowledge acquired and then you need to build upon that baseline knowledge using conversation data. You can model and generalize and then run analytics from all of your knowledge just as a human would. After this, you would broaden your inferences based on what you learn from your knowledge base.

In following this process, the conversation intelligence will learn and understand how to find the logical steps needed and link them together to reach a conclusion. As you can see, it’s a complex system with lots of moving pieces working together.

Deep learning to deep understanding and how it all works together

Deep learning is just one element of a very complicated ensemble of techniques, ranging from statistical methods to deductive reasoning. Deep learning-like techniques can be applied to huge amounts of data to solve the low-level tasks that are simple enough to be modeled as pattern recognition problems.

The brain is highly complex, involving multiple components with different architectures interacting synergistically. Current artificial deep neural networks are roughly based on models of a few different regions of the human brain’s cortex (although the brain does a lot more than recognize patterns in structured schemes of data).

In the same way, you need a generalized AI mechanism to model those patterns in a connected and multi-dimensional way, with enough control over it – enough to modify your hypotheses about the world, enrich them by adding more abstract relationships, and discard any that are no longer relevant.

You’ll need a mathematical approach for generalization and modeling to build a multi-dimensional knowledge graph that is sophisticated enough to capture the generalized relationships between the concepts and causes in the world that the CI system is exposed to. It has to generalize concepts at the most fundamental levels like space, time, and objects to achieve deep understanding.

But your knowledge is only useful if it can be applied and you have the ability to reason. A conversation intelligence system should be able to deduce and come to conclusions based on the knowledge it has.
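
To make the knowledge-plus-inference idea slightly more tangible, here is a toy sketch: a tiny concept graph and a trivial rule that propagates properties along “is a” links, so knowledge attached to motor vehicles generalizes to a motorbike the system was never told about directly. The graph contents and relation names are invented for illustration.

import networkx as nx

# A tiny, hand-built concept graph (illustrative only).
kg = nx.DiGraph()
kg.add_edge("car", "motor_vehicle", relation="is_a")
kg.add_edge("truck", "motor_vehicle", relation="is_a")
kg.add_edge("motorbike", "motor_vehicle", relation="is_a")
kg.add_edge("motor_vehicle", "internal_combustion_engine", relation="has_part")

def inferred_parts(concept):
    # Collect parts attached to the concept or to anything it "is a".
    parts, frontier = set(), [concept]
    while frontier:
        node = frontier.pop()
        for _, target, data in kg.out_edges(node, data=True):
            if data["relation"] == "has_part":
                parts.add(target)
            elif data["relation"] == "is_a":
                frontier.append(target)
    return parts

print(inferred_parts("motorbike"))  # {'internal_combustion_engine'}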

Symbl.ai harnesses both deep learning and deep understanding. Symbl’s API can learn and pull insights in real time, without you having to code it yourself, and this brings you the most sophisticated Conversation Intelligence API available. Learn more here.


Applications For Your Conversation Intelligence System
https://symbl.ai/developers/blog/applications-for-your-conversation-intelligence-system/ (21 Apr 2021)

The API landscape of conversation intelligence (CI) covers text/document analysis, human to machine (H2M), and human to human (H2H) conversations. CI can be used to transcribe, surface insights, and automate actions for applications like online collaboration, sales/CRM intelligence, and E-learning. You can easily build and scale your CI application faster using an open domain API like Symbl, which will accept data, calibrate itself, and understand any conversation in any domain.

In the first two posts in this series, we explored what conversation intelligence is and how you start building your conversation AI System. To deploy the correct strategy, it’s helpful to know the different use cases. So, let’s look into those now.

Conversations take place in many different forms and for different purposes. They’re also quite complex. Humans communicate in a way that is information-rich, unstructured, and contextual — which makes using an API that’s already built to handle such complexity a good idea when building a conversation intelligence (CI) system.

You can build a CI system with an API for a variety of use cases, such as customer support, sales, collaboration apps, and for workflow automation. Your system can be used for single or multiple conversations to identify real-time growth opportunities, create indexed knowledge from conversations, and drive productivity. CI systems applied in the following use cases can be a real game-changer.

Collaboration: meetings and unified communications as a service (UCaaS)

If you or your client wants to capitalize on all of the meetings that you have, what can you, as a developer, do to make that possible?

You can add real-time recommendations of action items and next steps as part of the existing workflow. This will meaningfully improve meeting productivity for your client by surfacing the things that matter while the meeting is in progress. Beyond real-time action items, you can automate meeting summaries delivered to your preferred channel, like email, chat, Slack, calendar, etc.

You can also use real-time contextual recommendations to enable participants to drive efficiencies in their note-taking, so they can save time and focus more on the meeting itself. Action items are surfaced contextually and in real-time and can be automated to trigger existing workflows.

When you set up post-meeting summaries, your client can get more involved in the conversation as it happens and then re-visit information and action items after the meeting.

All of this can be easily achieved with an open domain system, which can understand any topic and return relevant responses. You can take the system even further by calibrating it. Calibrating an open domain system is much leaner and simpler than training a domain-specific system from scratch.

 

Use Case: Collaboration Application & UCaaS

Benefits:

  • Humans can be subjective when taking notes. Bias is removed as objective CI contextually surfaces what matters.
  • Increase participation and engagement by adding a highly accurate note-taking AI service to the meeting.
  • Access and search through complete meeting transcripts, meeting notes, action items, summaries, insights, contextual topics, questions, signals, etc.
  • Understand patterns and trends in your organization’s meeting culture –  sentiment, talk ratios, most productive conversations, etc.

Customer care and customer care as a service (CCaaS)

Customer care performance can be measured with three proxy metrics:

  • Customer satisfaction
  • Time spent on call
  • Number of calls serviced

What if you could introduce real-time passive CI into each call and improve all three metrics for your client at once? You can! Adding real-time contextual understanding into your application can provide suggested actions that a customer care agent can act upon during a call, allowing the agent to:

  1. Focus on the human connection with the customer
  2. Come to a swifter resolution using task automation
  3. Serve more customers in the same amount of time while providing a better customer experience

You can also automate post-call data collection. This enables analysis of support conversations over time, agents, shifts, and groups, which leads to a better understanding of pain-points, topics of customer support conversation, etc.

Benefits:

  • Better customer experience thanks to more engaged support conversations.
  • Reduced average call handling time thanks to automated real-time actions.
  • Better data for coaching and benchmarking support staff.
  • High-level understanding of topics and summaries of support conversation.
  • Emotional analysis of conversation data.

Sales enablement and customer relationship management (CRM) intelligence

You can use CI to empower sales agents to focus on the organically growing conversation, rather than the sales process. You can capture conversation data for benchmarking performance, improve net sales, and identify and replicate the best-performing sales scripts.

In this use case, the call is routed and streams the audio to the system’s back-end services. Topic detection accelerates the sales cycle. In real-time, the system provides knowledge-based articles about topics arising in the call, related actions that are required, and can trigger processes. You can also provide real-time transcription, summary pages, send emails with action points, gain performance insights, and other analytics that can be customized to your client’s needs. Follow-ups can be created via outbound work tool integrations, and you can automate the post-call entry with useful summaries.

Benefits: Sales Agent

  • Real-time suggested actions.
  • Real-time analysis and insights from the conversation.
  • Auto-scheduling tasks and follow-ups through outbound work tool integrations.
  • Full, searchable transcript of sales conversations.
  • Automate the post-call entry into your CRM.

Benefits: Sales Enablement / VP of Sales

  • A high-level performance view of the sales function.
  • Customizable dashboard to view calls in different filters.
  • Understand what works best in a sales call: topics, questions, competitor mentions, etc.
  • Replicate the best performing scripts to train and coach your whole team to success.

Social media conversations

Customers interact a lot with brands on social media and other digital channels. These interactions include feedback, reviews, complaints, and a lot of other information. This is valuable data if used properly to derive insights for the business.

You can use a CI API along with social listening tools to extract and categorize all of the different conversations happening on social media channels into actionable insights. For example, you can extract data from product reviews, conversation threads, and social media comments. You can also identify questions and requests from social interactions and forums to build a knowledge base and direct customer conversations to the right resources.

With the right integrations to your CRM tools and knowledge base, insights from social conversations can lead to a better understanding of customer sentiment. This provides the brand the ability to deliver better and more efficient customer service through its social channels.

Benefits for brands

  • Extract topics from reviews based on different levels of ratings and identify what leads to good/bad ratings.
  • Evaluate influencers/affiliates to work with the brand and ensure the right messaging throughout the campaign.
  • Understand customer voice from comments and live interactions on Facebook, YouTube, and other channels.
  • Identify and document questions and requests from specific customers on product forums, comments, and replies to social media posts.
  • Guide customers to relevant knowledge base articles or support streams based on their complaints and queries on social media.
  • Enrich customer data on CRM based on insights identified from customer-specific social interactions.

E-learning

CI has a lot to offer e-learning. As well as transcribing lectures or session recordings, you can use CI to intelligently follow presentations and make learning more intuitive by helping users choose, consume, and understand content. You can contextualize what a learner wants to accomplish, using content and context to reach data-driven inferences and create a proactive, personalized experience.

Benefits:

  • Learn intuitively by helping users understand content more deeply and creating a more personalized learning experience.
  • Efficiently add follow-up tasks and action items as they are surfaced in real-time so key details are not missed.
  • Pose and answer questions in context and real-time to maintain the information exchange flow and promote depth of learning.
  • Provide the learner with accurate transcribed notes so they can listen more critically during the lesson and refer back afterward.
  • Access information at any time afterward, with the ability to search and navigate videos and transcripts by context.

Want to start building conversation intelligence experiences into your product?

Symbl’s API can be used for all these use cases and more.

Here’s where you can learn how to do video processing with Symbl (a minimal sketch of the flow follows the list). It covers:

  • How to retrieve your credentials and authenticate to Symbl.
  • How to use Symbl’s Async Video API.
  • How to upload and process a file.
  • How to set the relevant parameters for a video URL.
  • How to use the Job API to poll for job status.
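
As a rough sketch of that flow, the snippet below submits a video URL and then polls the Job API until processing completes. The endpoint paths, request fields, and response fields follow the documented Async flow but should be treated as illustrative; check the current Symbl documentation before relying on them. It assumes Node 18+ (for the built-in fetch) and an access token from the authentication step.

// Illustrative sketch: submit a video URL to the Async Video API and poll the Job API.
const SYMBL_TOKEN = process.env.SYMBL_ACCESS_TOKEN; // obtained during authentication

async function processVideo(videoUrl) {
  const submit = await fetch('https://api.symbl.ai/v1/process/video/url', {
    method: 'POST',
    headers: { Authorization: `Bearer ${SYMBL_TOKEN}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: videoUrl, name: 'My recorded session' }),
  });
  const { jobId, conversationId } = await submit.json();

  // Poll the Job API until the job is no longer in progress.
  let status = 'in_progress';
  while (status === 'in_progress') {
    await new Promise((r) => setTimeout(r, 5000));
    const job = await fetch(`https://api.symbl.ai/v1/job/${jobId}`, {
      headers: { Authorization: `Bearer ${SYMBL_TOKEN}` },
    });
    ({ status } = await job.json());
  }
  console.log(`Job ${jobId} finished with status "${status}"; conversationId: ${conversationId}`);
}

processVideo('https://example.com/recording.mp4');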

And this Video Summary User Interface provides users with:

  • The ability to interact with the Symbl elements (transcripts section, insights, filters) from audio and video.
  • A screen where users can select key elements like topics, transcripts, and insights.
  • An interface showing the timestamp where each key element occurred, with the ability to begin playback from there.

Getting started is easy. Explore the Symbl API to see how you can build on top of the open-source models and frameworks to get your CI-powered app to market faster.


Why and How to Perform an Automatic Speech Recognition (ASR) Evaluation
https://symbl.ai/developers/blog/why-and-how-to-perform-an-automatic-speech-recognition-asr-evaluation/ (22 Jul 2020)

Why use an ASR evaluation?

An ASR evaluation can help developers troubleshoot speech recognition issues and improve performance. It can also help you identify commonly misrecognized words, resulting in a better customer experience.

Why we created this evaluation

Evaluating the results from various speech recognition vendors is a challenging task. To help make this process a little easier, we put together a utility, open-sourced on GitHub (linked below), that enables you to evaluate speech recognition vendors and results faster and goes beyond the Word Error Rate (WER) you might see elsewhere. The utility automatically performs text pre-processing and normalization, removing further manual effort from the evaluation process.

Metrics from an ASR evaluation

This utility can evaluate the results generated by any Speech-to-Text (STT) or Automatic Speech Recognition (ASR) system.

You will be able to calculate these metrics:

  • Word Error Rate (WER), the most common metric for measuring the performance of a speech recognition or machine translation system (a minimal word-level WER sketch follows this list)
  • Levenshtein Distance calculated at the word level
  • Number of Word-level insertions, deletions, and mismatches between the original and generated file
  • Number of Phrase level insertions, deletions, and mismatches between the original and generated file
  • Text Comparison to visualize the differences (color highlights)
  • Overall statistics for the original and generated files (bytes, characters, words, newlines, etc.)
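
For intuition, here is a minimal, self-contained sketch of word-level WER: the Levenshtein distance computed over word tokens, divided by the number of reference words. It illustrates the metric itself, not the internals of the asr-eval utility.

// Word-level WER: edit distance over word tokens divided by the reference word count.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().toLowerCase().split(/\s+/);
  const hyp = hypothesis.trim().toLowerCase().split(/\s+/);
  // dp[i][j] = edit distance between the first i reference words and first j hypothesis words
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub);
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}

// Example: one substitution over four reference words gives WER = 0.25
console.log(wordErrorRate('the cat sat down', 'the cat sat town')); // 0.25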

Installation:

$ npm install -g speech-recognition-evaluation

What to expect

The simplest way to run your first evaluation is to pass the original and generated options to the asr-eval command. The original file is a human-generated plain-text file containing the reference transcript. The generated file is also plain text but contains the transcript produced by the STT/ASR system.

$ asr-eval --original ./original-file.txt --generated ./generated-file.txt

To perform your evaluation, visit the Speech Recognition Evaluation Library on GitHub.

Next steps

ASR evaluations can be confusing and time-consuming; we hope this utility makes the process easier and more convenient, and that it proves useful as you explore the benefits of conversational intelligence in your own products. If you haven’t already taken advantage, we have free trial credits so you can try Symbl’s platform today.

Learn more about our conversational intelligence solutions by visiting our developer documentation.

Best Practices for Audio Integrations with Symbl
https://symbl.ai/developers/blog/best-practices-for-audio-integrations-with-symbl/ (1 Jul 2020)

Choosing the right integration approach and providing speech data to a Voice or Audio API is straightforward if you take the time to understand the complexities and plan accordingly. These guidelines help decision-makers understand key considerations so you can move forward with a strategy that works for the long term.

Choosing the right API

Here’s a quick decision flow to help you choose the right API for your business: depending on your audio source and latency needs, you’ll typically land on the Telephony API, the Real-time WebSocket API, or the Async Audio API, each covered below.

General Best Practices

After choosing the best API for your needs, the next considerations for accuracy and efficiency are:

Sampling Rate

  • Capture the audio at the source with a sampling rate of 16,000 Hz or above when integrating over SIP.
  • Lower sampling rates may lead to reduced accuracy
  • If you cannot capture audio at the source with 16,000 Hz or higher, don’t re-sample the original audio to bump up the sample rate because this can reduce accuracy
  • Retain the original sample rate even if it is lower than 16,000 Hz. For example, in telephony the native rate is commonly 8,000 Hz.

Audio Chunk (Buffer) Size

  • For live audio streaming use cases (Telephony with SIP and the Real-time WebSocket API), use a single audio chunk (buffer) size close to 100 milliseconds for a balanced latency-vs-efficiency tradeoff
  • A larger chunk size is better for accuracy but adds latency
  • For example, with LINEAR16 audio at a 16,000 Hz sample rate and 2 bytes (16 bits) per sample (see the helper sketch after this list):
  • A buffer (chunk) size of 4096 bytes (4 KB) corresponds to (4096 / (2 * 16000)) * 1000 = 128 ms
  • A buffer size of 8192 bytes (8 KB) corresponds to (8192 / (2 * 16000)) * 1000 = 256 ms
  • A buffer size of 2048 bytes (2 KB) corresponds to (2048 / (2 * 16000)) * 1000 = 64 ms
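
Here is the same arithmetic as a small helper function, if you want to experiment with other buffer sizes:

// Chunk duration in milliseconds for PCM audio, given buffer size, sample rate, and bytes per sample.
function chunkDurationMs(bufferBytes, sampleRateHz = 16000, bytesPerSample = 2) {
  return (bufferBytes / (bytesPerSample * sampleRateHz)) * 1000;
}

console.log(chunkDurationMs(4096)); // 128 ms
console.log(chunkDurationMs(8192)); // 256 ms
console.log(chunkDurationMs(2048)); // 64 ms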

Background Noise

  • It’s best to provide audio that is as clean as possible
  • Excessive background noise and echoes can reduce accuracy
  • When possible, position the user close to the microphone
  • If you are considering noise cancellation techniques, be aware they may result in information loss and reduced accuracy. If unsure, avoid noise cancellation.
  • Don’t use Automatic Gain Control (AGC)
  • Avoid audio clipping

Multiple People in a Single Channel

  • Ensure audio volume for each person is the same. Differing audio levels for speakers can be misinterpreted as background noise and ignored.
  • Where possible, avoid multiple speakers talking at the same time
  • Push Speaker Events to indicate the start and stop times for each person in the meeting or call.

For optimal results, consider using the Real-time WebSocket API with speaker-separated audio.

Calibration

  • Symbl provides the optional calibration phase that helps fine-tune the overall system to fit your preferences. Contact us to learn more.

Telephony API – Best Practices

SIP over PSTN

  • Avoid PSTN where possible and adopt SIP.
  • In some use cases it can be very easy to integrate over PSTN by making a simple REST call with phoneNumber and a DTMF code. This is great for early experiments or PoCs, but PSTN should be avoided for production-grade deployments where possible.
  • Apart from PSTN being an expensive option when it comes to scalability, PSTN audio is processed with narrowband mu-law encoding, which is lossy and reduces the overall accuracy of speech-to-text.

Audio Codecs

  • In general, we recommend using Opus over SIP.
  • Enable Forward Error Correction (FEC) with Opus in your system to optimize for accuracy, especially if your application operates over a poor network connection. You can also consider AMR-WB (AMR wide-band) over Opus if that’s a feasible option. Note that AMR-WB is patent-protected and requires a license for use in commercial applications.
  • Alternatively, Speex can be used as a third option, but it reduces accuracy by a small margin.
  • Considering the real-time nature of how audio is received over SIP, the lossless codecs FLAC and LINEAR16 are not recommended for real-world applications over SIP. Consider them only if network reliability and robustness between Symbl’s endpoint and your application are very high.
  • Note that you cannot change the audio encoding for PSTN connections; only mu-law is available.

Transmission Protocol

  • Choose between TCP- and UDP-based transmission based on the latency vs. reliability needs of your application.
  • If you’re using TCP for RTP packet transmission, latency may be impacted by a small margin compared to UDP-based transmission. However, if reliability is more important for your application, prefer TCP over UDP, especially if traffic is flowing directly from poor networks or mobile devices.

Secure SIP

  • Use secure channels with SIPS and SRTP.
  • Symbl supports dialing in to both insecure and secure SIP trunks. We recommend using secure mode in production; SIPS and SRTP are used when secure mode is enabled.

Real-time WebSocket API – Best Practices

Separate Channel per Person

  • Consider capturing audio for each person on a separate channel and streaming it over separate WebSocket connections for optimal results (see the sketch after this list).
  • This avoids any issues caused by multiple speakers talking at the same time or different volume levels for each speaker in single-channel audio.
  • If it’s not possible to capture and send audio separated by the speaker, send the mixed audio in a single WebSocket connection.
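
Here is a minimal sketch of one WebSocket connection per speaker, using the ws package (npm install ws). The endpoint URL and the start-request fields are placeholders rather than the exact Symbl streaming contract; consult the Streaming API documentation for the real connection URL and message format.

// Illustrative sketch: one WebSocket connection per speaker.
const WebSocket = require('ws');

function streamSpeaker({ speakerName, connectionUrl, accessToken, audioSource }) {
  const ws = new WebSocket(`${connectionUrl}?access_token=${accessToken}`);

  ws.on('open', () => {
    // Hypothetical start message identifying the speaker for this channel.
    ws.send(JSON.stringify({ type: 'start_request', speaker: { name: speakerName } }));
    // Stream raw audio chunks (e.g., ~100 ms LINEAR16 buffers) as binary frames.
    audioSource.on('data', (chunk) => ws.send(chunk));
    audioSource.on('end', () => ws.close());
  });

  ws.on('message', (msg) => console.log(`[${speakerName}]`, msg.toString()));
  return ws;
}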

Audio Codecs

  • If network bandwidth is not an issue in your application or use case, use lossless codecs (FLAC or LINEAR16) to capture and transmit audio and ensure higher accuracy.
  • If network bandwidth is a concern in your case, consider using Opus, AMR_WB, or Speex. See the Audio Codecs section in Telephony Best Practices for more details.

Async Audio API – Best Practices

Audio Codecs

  • Use lossless codecs such as LINEAR16, usually in .wav file containers.
  • The use of lossy codecs like mp3, mp4, m4a, mu-law, etc. is discouraged as it may reduce accuracy.
  • If your audio is in an encoding not supported by the API, transcode it to LINEAR16. You can consider using this open-source utility (https://github.com/symblai/symbl-media) to transcode audio from the command line or in your Node.js code (a minimal ffmpeg-based sketch follows this list).
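
As an alternative illustration, here is a minimal Node.js sketch that shells out to ffmpeg (assumed to be installed) to produce LINEAR16 audio in a .wav container; it is not part of the symbl-media utility.

// Transcode any input file to 16 kHz, mono, 16-bit PCM (LINEAR16) in a .wav container.
const { execFile } = require('child_process');

function transcodeToLinear16(inputPath, outputPath, done) {
  execFile(
    'ffmpeg',
    ['-i', inputPath, '-ar', '16000', '-ac', '1', '-c:a', 'pcm_s16le', outputPath],
    (err) => done(err, outputPath)
  );
}

// Example usage with placeholder file names:
transcodeToLinear16('./call-recording.mp3', './call-recording.wav', (err, out) => {
  if (err) throw err;
  console.log(`Wrote ${out}`);
});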

You can learn about best practices for each API in our Documentation. Sign up for Symbl to get 100 minutes of free trial credits so you can put these best practices to work.
