Fine-tuning vs RAG: An opinion and comparative analysis
https://symbl.ai/developers/blog/fine-tuning-vs-rag-an-opinion-and-comparative-analysis/
Thu, 09 Nov 2023

Introduction

In recent times, I’ve had the enriching opportunity to immerse myself in the vibrant discourse around AI/ML at various conferences. As a product manager, my interactions often veer towards the pragmatic aspects of leveraging AI. I’ve noticed a persistent whirlpool of questions surrounding the application of Retrieval-Augmented Generation (RAG) and fine-tuning to enhance the functionality of Large Language Models (LLMs). The curiosity isn’t merely technical; it extends just as much to the financial side.

This blog aims to unfold a comparative narrative on the technical aspects and costs associated with fine-tuning and RAG across various models.

Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. However, LLMs are not perfect and may have some limitations, such as:

  • LLMs may not have enough knowledge or domain expertise for specific tasks or datasets.
  • LLMs may generate inaccurate, inconsistent, or harmful outputs that do not match the user’s expectations or needs.

To overcome these challenges, two common techniques are used to enhance the performance and capabilities of LLMs: fine-tuning and retrieval-augmented generation (RAG).

Fine-tuning is the process of re-training a pre-trained LLM on a specific task or dataset to adapt it for a particular application. For example, if you want to build a chatbot that can answer questions about movies, you can fine-tune an LLM like GPT-4 with a dataset of movie reviews and trivia. This way, the LLM can learn the relevant vocabulary, facts, and style for the movie domain.
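
As a rough illustration of what preparing such a dataset might look like, here is a minimal sketch in TypeScript. It assumes the chat-style JSONL format that OpenAI documents for GPT-3.5 Turbo fine-tuning; other providers expect different schemas, and the movie trivia example is invented purely for illustration.

// A minimal sketch of preparing fine-tuning data (illustrative only).
// Assumes the chat-style JSONL format documented for GPT-3.5 Turbo
// fine-tuning; check your provider's documentation for the exact schema.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type TrainingExample = { messages: ChatMessage[] };

const examples: TrainingExample[] = [
  {
    messages: [
      { role: "system", content: "You are a helpful movie trivia assistant." },
      { role: "user", content: "Who directed The Godfather?" },
      { role: "assistant", content: "The Godfather (1972) was directed by Francis Ford Coppola." },
    ],
  },
  // ...hundreds or thousands more domain-specific examples
];

// Fine-tuning APIs typically expect one JSON object per line (JSONL).
const jsonl = examples.map((e) => JSON.stringify(e)).join("\n");
console.log(jsonl);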

RAG is a framework that integrates information retrieval (or searching) into LLM text generation. It uses the user’s input prompt to retrieve external “context” information from a data store, which is then combined with the original prompt to build a richer prompt containing context the LLM would not otherwise have had. For example, if you want to build a chatbot that can answer questions about a specific topic, you can use RAG to query your domain-specific knowledge base and use the retrieved articles as additional input for the LLM. This way, the LLM can access the most current, reliable, and pertinent facts for any query.
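
Conceptually, the retrieval and prompt-assembly step can be sketched in a few lines of TypeScript. The `embed`, `vectorStore`, and `llm` names below are hypothetical stand-ins for whatever embedding model, vector database, and LLM client you use; this is a sketch of the pattern, not a specific library's API.

// Hypothetical stand-ins for your embedding model, vector database, and
// LLM client (not a real library's API).
declare function embed(text: string): Promise<number[]>;
declare const vectorStore: {
  query(args: { vector: number[]; topK: number }): Promise<{ text: string }[]>;
};
declare const llm: { complete(prompt: string): Promise<string> };

async function answerWithRag(question: string): Promise<string> {
  // 1. Embed the user's question into a vector.
  const queryVector = await embed(question);

  // 2. Retrieve the most relevant documents from the knowledge base.
  const docs = await vectorStore.query({ vector: queryVector, topK: 3 });

  // 3. Build a richer prompt that includes the retrieved context.
  const context = docs.map((d) => d.text).join("\n---\n");
  const prompt =
    "Answer the question using only the context below.\n\n" +
    `Context:\n${context}\n\nQuestion: ${question}`;

  // 4. Ask the LLM to generate a grounded answer.
  return llm.complete(prompt);
}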

So, when should you use fine-tuning versus RAG for your LLM application? Here are some factors to consider:

| Considerations | Fine-tuning | RAG (Retrieval Augmented Generation) |
| --- | --- | --- |
| Cost | High: Requires substantial computational resources and potentially specialized hardware like high-end GPUs or TPUs. | Moderate: Lower than fine-tuning as it requires less labeled data and computing resources. The main cost is associated with the setup of embedding and retrieval systems. |
| Complexity | High: Demands a deep understanding of deep learning, NLP, and expertise in data preprocessing, model configuration, and evaluation. | Moderate: Requires coding and architectural skills, but less complex compared to fine-tuning. |
| Accuracy | High: Enhances domain-specific understanding, leading to higher accuracy in predictions or generated outputs. | Variable: Excels in up-to-date responses and minimizing hallucinations, but accuracy may vary based on the domain and task. |
| Domain Specificity | High: Can impart domain-specific terminology and nuances to the LLM. | Moderate: May not capture domain-specific patterns, vocabulary, and nuances as effectively as a fine-tuned model. |
| Up-to-date Responses | Low: Becomes a fixed snapshot of its training dataset and requires regular retraining for evolving data. | High: Can ensure updated responses by retrieving information from external, up-to-date documents. |
| Transparency | Low: Functions more like a “black box”, obscuring its reasoning. | Moderate to High: Identifies the documents it retrieves, enhancing user trust and comprehension. |
| Avoidance of Hallucinations | Moderate: Can reduce hallucinations by focusing on domain-specific data, but unfamiliar queries may still cause erroneous outputs. | High: Reduces hallucinations by anchoring responses in retrieved documents, effectively fact-checking the LLM’s responses. |

Is Fine-Tuning LLMs or Implementing RAG Expensive?

Both fine-tuning and RAG involve costs and trade-offs that need to be considered before implementing them for your LLM application. Let’s walk through an example.

Simulating an Example

To illustrate how fine-tuning and RAG can be used for an LLM application, let’s simulate an example of building a chatbot that can answer questions about cloud computing. Here are some sample numbers based on what the market currently offers for fine-tuning pricing, vector database pricing, and compute power, along with sample timelines. These numbers are for illustrative purposes only and may not reflect the actual costs and timelines for your specific application.

Below is the simulated example, based on the previous discussion, for both fine-tuning and RAG with GPT-3.5 Turbo, Claude 2, LLAMA 2, and (for RAG) GPT-4, each processing 10 million tokens. The computations are based on a scenario where compute operations run for 15 days, 24 hours each day.

Fine-tuning:

GPT-3.5 Turbo:

  • LLM/Embedding Cost: $0.0080 per 1K tokens (for training) × 10,000 (1K-token units, i.e., 10 million tokens) = $80 + $0.0120 per 1K tokens (for input usage) × 10,000 = $120 + $0.0160 per 1K tokens (for output usage) × 10,000 = $160; Total = $360
  • Compute Power Cost: $0.5 per hour × 15 days × 24 hours = $180
  • Total Cost: LLM/Embedding Cost + Compute Power Cost = $360 + $180 = $540

Claude 2 (Fine-tuning):

  • LLM/Embedding Cost: $1.63 per million tokens × 10 million tokens = $16.30 + $5.51 per million tokens × 10 million tokens = $55.10; Total = $71.40
  • Compute Power Cost: $0.5 per hour × 15 days × 24 hours = $180
  • Total Cost: LLM/Embedding Cost + Compute Power Cost = $71.40 + $180 = $251.40

RAG:

GPT-3.5 Turbo:

  • LLM Usage Cost: $280 (input and output usage at the same per-1K-token rates as in the fine-tuning example above: $120 + $160)
  • Embedding Model Cost: $0.0001 per 1K tokens × 10,000 = $1
  • Vector Database Cost: $70 (Standard Plan for Pinecone)
  • Compute Power Cost: $0.6 per hour (GPU + CPU) × 24 hours × 15 days = $216
  • Total Monthly Operating Cost: $280 + $1 + $70 + $216 = $567

GPT-4 (RAG):

  • LLM/Embedding Cost: $0.03 per 1K tokens (for input usage) × 10,000 = $300 + $0.06 per 1K tokens (for output usage) × 10,000 = $600; Total = $900
  • Vector Database Cost (Pinecone): $70 (Standard Plan)
  • Compute Power Cost: ($0.5 per hour for GPU + $0.1 per hour for CPU) × 15 days × 24 hours = $216
  • Total Cost: LLM/Embedding Cost + Vector Database Cost + Compute Power Cost = $900 + $70 + $216 = $1,186

Claude 2:

  • LLM/Embedding Cost: $11.02 per million tokens (for prompt) × 10 million tokens = $110.20 + $32.68 per million tokens (for completion) × 10 million tokens = $326.80; Total = $437
  • Vector Database Cost (Pinecone): $70 (Standard Plan)
  • Compute Power Cost: ($0.5 per hour for GPU + $0.1 per hour for CPU) × 15 days × 24 hours = $216
  • Total Cost: LLM/Embedding Cost + Vector Database Cost + Compute Power Cost = $437 + $70 + $216 = $723

LLAMA 2:

  • LLM/Embedding Cost: Free
  • Vector Database Cost (Pinecone): $70 (Standard Plan)
  • Compute Power Cost: ($0.5 per hour for GPU + $0.1 per hour for CPU) × 15 days × 24 hours = $216
  • Total Cost: LLM/Embedding Cost + Vector Database Cost + Compute Power Cost = $0 + $70 + $216 = $286
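
To make the arithmetic above easier to audit or adapt, here is a small sketch that reproduces it in code. The rates are the illustrative figures used in this post, not current list prices, so substitute your own numbers.

// Reproduces the illustrative cost arithmetic above (not current list prices).
interface CostInputs {
  millionTokens: number;     // total tokens processed, in millions
  trainPer1k?: number;       // training cost per 1K tokens (fine-tuning only)
  inputPer1k: number;        // input/prompt cost per 1K tokens
  outputPer1k: number;       // output/completion cost per 1K tokens
  computePerHour: number;    // combined GPU + CPU cost per hour
  hours: number;             // total compute hours
  vectorDbMonthly?: number;  // vector database plan cost (RAG only)
}

function totalCost(c: CostInputs): number {
  const kTokens = c.millionTokens * 1000; // 10M tokens = 10,000 x 1K tokens
  const llmCost =
    (c.trainPer1k ?? 0) * kTokens + c.inputPer1k * kTokens + c.outputPer1k * kTokens;
  return llmCost + c.computePerHour * c.hours + (c.vectorDbMonthly ?? 0);
}

// Fine-tuning GPT-3.5 Turbo, as in the example above: $360 + $180 = $540
console.log(totalCost({
  millionTokens: 10, trainPer1k: 0.008, inputPer1k: 0.012, outputPer1k: 0.016,
  computePerHour: 0.5, hours: 15 * 24,
}));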

Now, let’s present the above calculations in a tabular format for easier comparison:

| Component | Fine-tuning: GPT-3.5 Turbo | Fine-tuning: Claude 2 | Fine-tuning: LLAMA 2 | RAG: GPT-3.5 Turbo | RAG: Claude 2 | RAG: LLAMA 2 | RAG: GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLM/Embedding Cost | $360 | $71.40 | Free | $280 | $437 | $0 | $900 |
| Vector Database Cost | N/A | N/A | N/A | $70 | $70 | $70 | $70 |
| Compute Power Cost | $180 | $180 | $180 | $216 | $216 | $216 | $216 |
| Total Cost | $540 | $251.40 | $180 | $566 | $723 | $286 | $1,186 |
| Total Time | 15 days | 15 days | 15 days | Monthly | Monthly | Monthly | Monthly |


The comparison highlights varying cost structures across models for fine-tuning and RAG. In this simulation, fine-tuning incurs higher costs, especially with GPT-3.5 Turbo, while RAG presents a cost-effective approach, notably with LLAMA 2; GPT-4 carries the highest RAG cost because of its advanced capabilities and higher per-token pricing. Compute power is a significant, recurring cost in both setups. The choice between models and setups hinges on budget and on the desired balance between customization and broad topical coverage: fine-tuning requires more upfront data preparation and training time, but it may produce more accurate, customized responses for the cloud computing domain, while RAG requires less setup and may produce more diverse responses that draw on a wider and more current range of material.

Enhancing Churn Prediction with Symbl.ai’s Conversation Insights
https://symbl.ai/developers/blog/enhancing-churn-prediction-with-symbl-ais-conversation-insights/
Wed, 30 Aug 2023
In today’s competitive business landscape, understanding and retaining customers is paramount, and churn prediction has always been a critical endeavor for companies. Traditional methods have long played a role in churn prediction, but they have limitations, such as inconsistent customer responses and a lack of real-time data, so exploring innovative alternatives is key. In this blog, we’ll explore how businesses can leverage Symbl.ai’s conversation insights to achieve early churn prediction, surpassing the limitations of traditional methods.

Understanding the Challenge

Traditional churn prediction methods primarily rely on numerical metrics like Customer Satisfaction (CSAT), Net Promoter Score (NPS), and issue resolution data. While these metrics offer valuable insights, they often miss the “why” behind customer decisions, for example the subtleties of customer sentiments hidden within conversations. Knowing the “why” enables companies to make informed improvements to their products, services, or customer experiences, addressing specific pain points and tailoring retention strategies more effectively. This knowledge not only helps in retaining existing customers but also informs future business decisions to prevent churn and foster long-term customer loyalty.

Businesses have relied on post-call analysis to gauge customer satisfaction, predict churn, and devise retention strategies. However, this approach is reactive by nature. Here are the problems with this approach: 

Time delay due to survey-based data collection: CSAT and NPS scores are often collected quarterly or annually, well after an interaction has occurred, which means that by the time a low score is received, the customer might already be on the brink of churning or may have churned. Predicting churn in such cases becomes more reactive than proactive.

Inconsistent Responses: Beyond customers skipping surveys altogether, completed surveys are often partial and may not provide accurate or honest CSAT and NPS scores. Factors like mood, time constraints, and the phrasing of the survey questions can influence responses, leading to skewed scores that don’t accurately reflect true sentiment.

The Symbl.ai Advantage: Unveiling the Power of Conversation Insights

Elevate your customer retention strategy with Symbl.ai’s conversation insights. Symbl.ai goes beyond mere transcription, extracting vital emotions, intentions, topics, and sentiments from customer conversations. Imagine seamlessly integrating this invaluable data, including call scores and trackers, with domain-specific metrics. The result? A comprehensive and dynamic understanding of customer interactions that empowers businesses to proactively predict churn and, more precisely, tailor actions to foster lasting customer relationships. This proactive approach allows businesses to address concerns before they escalate.

Let’s explore how Symbl.ai’s conversation insights transform churn prediction with an example:

A customer contacts a subscription-based company’s customer support regarding a delayed order delivery. Traditionally, the representative would address the issue, inquire about satisfaction, and conclude the call. The NPS or CSAT scores for these calls are analyzed periodically, and this post-call analysis might identify a lower CSAT/NPS score indicating that the customer is likely to churn. Conversation insights, by contrast, are collected for every interaction and surface more nuanced signals, along with the context behind them, such as intent and the unsaid emotions hidden between the lines, which helps businesses tailor customer interactions. Conversation insights include keywords pre-set by businesses, such as competitor mentions, and sentiment analysis, which together provide early signs of dissatisfaction even before the customer submits a low CSAT/NPS score. This dynamic analysis enables churn signals to be generated while the conversation is still ongoing, leading to personalized retention strategies.
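
As a rough sketch of how such signals could be combined, consider the example below. The field names and thresholds are hypothetical and are not Symbl.ai’s actual API response shape; the point is simply that tracker hits (such as competitor mentions) and sentiment can be scored together while the call is still in progress.

// Hypothetical sketch of combining conversation insights into a churn signal.
// Field names and thresholds are invented for illustration, not Symbl.ai's API.
interface ConversationInsights {
  averageSentiment: number;   // e.g. -1 (very negative) to +1 (very positive)
  competitorMentions: number; // hits on a pre-set "competitor" tracker
  unresolvedIssues: number;   // open complaints detected during the call
}

function churnRisk(insights: ConversationInsights): "low" | "medium" | "high" {
  let score = 0;
  if (insights.averageSentiment < -0.3) score += 2;
  if (insights.competitorMentions > 0) score += 2;
  if (insights.unresolvedIssues > 0) score += 1;
  if (score >= 4) return "high";
  if (score >= 2) return "medium";
  return "low";
}

// A representative could be alerted mid-call when the risk turns "high".
console.log(churnRisk({ averageSentiment: -0.5, competitorMentions: 1, unresolvedIssues: 1 }));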

With the churn signals from conversation insights, representatives can address the delayed delivery issue more promptly, offer solutions that specifically counter the competitor’s appeal, and ensure that the customer’s concerns are met before they escalate. This proactive approach not only resolves the immediate issue but also prevents potential churn, fostering customer loyalty and satisfaction. 

By empowering businesses to intervene proactively, address concerns, and build enduring relationships, this technology heralds a new era in customer retention. With Symbl.ai, businesses can not only predict churn but also shape a future where customer satisfaction takes center stage, driving sustained growth and success.

Introducing Symbl.ai Web SDK
https://symbl.ai/developers/blog/web-sdk/
Thu, 14 Apr 2022
We are excited to announce the Symbl.ai Web SDK beta! The Web SDK is an open-source kit, backed by Symbl.ai, that makes it easier to support WebSocket connections and live streams. It supports both JavaScript and TypeScript, and it streamlines development for teams building applications with streaming audio using Symbl.ai’s Streaming API or Subscribe API. We received feedback that developers love using these APIs to deliver real-time conversation intelligence in browser-based applications, but that it was a struggle to create and manage WebSocket connections, handle devices with multiple audio stream formats, and ensure stability for web applications. This release of the Web SDK addresses all of these concerns, and the SDK will be supported and maintained by the Symbl.ai team.

Let Symbl.ai take care of WebSocket connections

WebSocket is a communications protocol in which a bi-directional handshake is established between the client and server. Handling this handshake process and network-related issues is a tiresome task, and the effort spent doing so distracts developers from building core application functionality. Symbl.ai’s Web SDK takes care of the WebSocket connections within the Symbl.ai platform, in line with the RFC 6455 standardized approach. Beyond the standard threshold limits, simple configurations are provided so developers can set their own timeout thresholds. Managing multiple WebSocket connections is also easy with the Web SDK: just create multiple instances of the Symbl() object, as sketched below.
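
For instance, two conversations can be processed side by side by creating a Symbl instance and connection for each, using the same calls as the Hello World example further below; the credentials here are placeholders.

// A minimal sketch of managing two WebSocket connections side by side,
// using the same SDK calls as the Hello World example below.
import { Symbl } from "@symblai/symbl-web-sdk";

async function openConnection(appId: string, appSecret: string) {
  const symbl = new Symbl();
  await symbl.init({ appId, appSecret });
  return await symbl.createConnection();
}

async function run() {
  // Each conversation gets its own Symbl instance and connection.
  const salesCall = await openConnection("<APP_ID>", "<APP_SECRET>");
  const supportCall = await openConnection("<APP_ID>", "<APP_SECRET>");

  await Promise.all([
    salesCall.startProcessing({ config: { encoding: "OPUS" } }),
    supportCall.startProcessing({ config: { encoding: "OPUS" } }),
  ]);
}

run().catch(console.error);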

Handle audio from all your devices

In web applications that capture and process live audio from end users who are speaking, it is common for end users to switch between devices such as from a laptop microphone to a headset. These device changes modify the baseline processing parameters for the audio data and must be reflected so Symbl.ai can adapt to the correct configurations. Symbl.ai’s Web SDK handles all of these device issues. Web SDK supports a wide range of audio codecs (OPUS & Linear16) and sample rates (8kHz, 16kHz, 24kHz, 44.1kHz, 48kHz). Web SDK also has an automatic device handling mechanism that takes care of updating codec and sample rates appropriately whenever input audio devices are switched. Developers can also now input custom audio streams with only a few lines of code.
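
As an example of the configuration side, a sketch like the one below could pin the codec and sample rate when processing starts. The `sampleRateHertz` field name is an assumption borrowed from Symbl.ai’s Streaming API configuration rather than something confirmed in this post, so check the Web SDK documentation for the exact shape; `connection` comes from symbl.createConnection() as in the example below.

// Sketch: explicitly setting the codec and sample rate when processing starts.
// `sampleRateHertz` is assumed here based on the Streaming API configuration;
// consult the Web SDK docs for the exact config shape.
await connection.startProcessing({
  config: {
    encoding: "OPUS",       // Linear16 is also supported
    sampleRateHertz: 48000, // e.g. 8000, 16000, 24000, 44100, or 48000
  },
});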

Hello World application for Web SDK: Try it out

Let’s look at how this works in the example below. A prerequisite to using the Web SDK is having your API credentials, the App ID and App Secret, handy. You can get them from the Symbl.ai Platform. Alternatively, you can use your access token for authentication; see the Authentication page to learn more.

Symbl.ai SDK Step-by-Step

Import the SDK from @symblai/symbl-web-sdk and initialize it with your App ID and App Secret. Create a WebSocket connection with createConnection(). This method establishes a bi-directional handshake between your client and Symbl.ai’s server, creating the WebSocket connection. You can then configure your request with a wide range of input parameters; to learn more, take a look here.
import { Symbl } from "@symblai/symbl-web-sdk";
try {
  // Initialize the SDK with your App ID and App Secret.
  const symbl = new Symbl();
  await symbl.init({
    appId: '',
    appSecret: '',
  });

  // Establish the WebSocket connection to Symbl.ai.
  const connection = await symbl.createConnection();

  // Start streaming and processing audio with the desired configuration.
  await connection.startProcessing({
    config: {
      encoding: "OPUS"
    }
  });

  // Process audio for 10 seconds, then shut down cleanly.
  await symbl.wait(10000);
  await connection.stopProcessing();
  await connection.disconnect();
} catch (e) {
  // Handle errors here.
}

Time to get started

To learn more about how the Web SDK helps you build applications, check out our full documentation as well as our GitHub repo. Also, if you have not already, you can sign up for a Symbl.ai account here.
