top of page

How to create Virtual AI Sales Promoter?

When you read the title, I am sure you are thinking that this will guide you on how to actually create an AI chatbot that will replace the job of sales promoters. Well hold on your horse, not so fast. Firstly it is not as easy as 'just connect to API' procedure, there will be lots of customization and contraption that you need to pull such a system off, and not to mention the amount of data size and training epochs, computational resources you need to invest in, not to forget the manpower costs too.

In theory, a Large Language Model (LLM) can be used to power a virtual sales promoter that can autonomously call potential customers to promote a product or service. However, there are several aspects to consider before making this a reality.

So let's go through theoretically how are we going to achieve that:

First, you need to have the followings:

1. Speech-to-Text (STT) synthesis: Instead of TTS, or Text-to-Speech, we need a Speech-to-Text system that can directly recognize speech inputs and translates that into text for the LLM to inference.

2. Integrating LLMs: here you can either use API from OpenAi or you clone open-source ones like Llama-3, which has open-sourced its base codes under permissive BSD-3-Clause license, where developers can use, modify and distribute the model for their own projects. Meta Inc, the creator and owner of Llama-3 however, do not share their training and fine-tuning data as it is proprietary. You can also use other open-sourced models from Huggingface like DistilBERT, BERT, AlBERT, Electra, BigBird all from Google or RoBERTa another earlier models by Meta Inc, or Microsoft's DeBERTa, etc.

Mingling with the codes of these open-sourced models requires adequate software development skills and knowledge, which are largely built from either Tensorflow or Pytorch libraries using Python programming language, although some might want to explore binding Python with C++ for boosting its computational speeds, especially when the model's computational demand surges due to more incoming users.

3. Integrating Text-to-Speech (TTS) : After the LLM inference and generates responses, it will be in text format, so it has to be sent to another receiving endpoint of a TTS synthesis system to generate realistic voices based on the generated text from the LLM. The TTS itself is a form of AI that is designed and trained to synthesize text to audio/speech. Its inner crucial component involves a 'speech decoder' module in the model's python codes. There are many types of speech decoders:

  • Statistical Speech Decoders like Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM) that can generate realistic audio/speech through predicting acoustic features from the text generated by the text model.

  • Recurrent Neural Network (RNN) or Convolutional Neural Network (CNN) that predates transformer model, also can be trained and used for audio/speech generations. Each different types of Speech Decoders has its own pros and cons, but we're not covering in this article today, may in other article soon. How Speech Decoders work? Speech decoder module involves: (1) Linguistic Feature Extraction (2) Phoneme to Acoustic Feature Mapping (3) Acoustic Feature Generation (4) Final Waveform generation, that is the part that give the generated voice a realistic sounding voice. Each of these steps involves in building up speech decoder that requires lots of sample voice, speeches where the model learns to different speech patterns.

4.Computer Vision : To allow the AI model to interact with its customers, it has to have visual detection and a whole system of image detection via camera, image processing and feature extraction where the system scrutinize the object's edges, shapes, textures, colors and contrasts from the background. Here a new type of neural network model are required like Convolution Neural Network (CNN), Recurrent Neural Networks (RNN), Long-Short Term Memory (LSTM), even now we have Transformer architecture adapted for visual tasks but each type of neural network architecture has its own set of pros and cons to consider.

Challenges and limitations:

1.Ethical Issue: Creating an AI robot or even the disembodied type that appears on screen that can interact realistically with customers, raises concerns for certain sectors, particularly on the issue of potential job displacement. These needs to be handled separately between the creators, funders, and corporations promoting these AI systems and the related parties raising these concerns.

2.Emotion& Empathy: While an LLM can generate human-like speeches, it may struggle to understand and respond to emotional cues, such as tone of voice, sarcasm, or frustration. This could lead to ineffective or even counterproductive conversations.

3.Contextuality: LLM might not fully understand context of conversation, as it lacks causal-effect reasonings of a human that can lead to miscommunications.

4.Personalization & customization: A virtual sales promoter would need to be able to personalize its approach for each potential customer, which could be challenging without human intuition and creativity. Just imagine how much effort needs to be put in to train, fine tune and ensuring the model works correctly.

Potential solutions:

1.Hybrid approach: Combine the strengths of these AI models with human sales representatives. The virtual sales promoter could handle initial enquiries, and then transfer promising leads to human representatives for further discussion and closure.

2.Supervised learning: Use human feedback and supervision to fine-tune the AI model and improve its conversational skills and emotional intelligence.

3.Specialized training data: Create training data that focuses on sales conversations, customer interactions, and emotional intelligence to improve the AI model's performance.

In conclusion, while such a novel contraption between Speech-to-text-to-Speech (STS) and LLM can be used to power a virtual sales promoter, it's essential to address the challenges and limitations mentioned above. A hybrid approach, supervised learning, and specialized training data could help overcome these hurdles and create a more effective and efficient sales promotion system.


Direct Speech-To-Speech (DSTS): These are relatively new concept and is still under active research. Unlike the usual Text-to-Speech (TTS) or Speech-to-Text (STT) synthesis, where these synthesizers takes in audio/speech and converts it into text for LLMs to inference and then convert it back to speech like how Hanson Robotics' popular robot - 'Sophia'? a DSTS model directly process as well as inference raw visual and audio/speech inputs instead of going through text modality like most LLMs.

Theoretically, it would involve computer vision, speech recognition, processing, and synthesis, both image and audio feature extraction algorithms, really powerful, highly customized attention mechanisms, the customize encoder-decoder stacks and finally the speech synthesis stack before it can be used to power a robot like Sophia, without the need to resort to text modality. This will make the AI Robot respond much faster and human like.

9 views0 comments


bottom of page