Building a conversational AI for health & life insurance, Part 2
This post was written by Anthony Attwood, Senior Software Engineer.
This is the second of a two-part series on the assistant we’ve developed at Boundlss. The first part covered what we’re trying to achieve with our conversational engine and why it’s important to our business. This second part covers some of the technical details of how it all works.
Our assistant combines an automated response system with real human coaches, and lets us gradually improve the quality of the automated part over time. Pretty neat, huh? So, how is it done?
Under the hood
When a user sends a message, that message gets sent to our messaging subsystem for processing. The first step is to see if it's part of a structured conversation. If it is, then we respond with the next step in the script and the available options and mark that message as 'read' so our human coaches don't get notified. Job done.
If it's not part of a structured conversation, then the incoming message gets sent deeper into the messaging subsystem. At its heart, it's a fan-out system, where each branch is a different chat agent that tries to respond with a 'best' answer. We use Azure Service Bus topics to get incoming messages through the system. Each worker stage is implemented as an Azure Function within an Azure Function App.
User sends a message, and the app server posts the incoming message to an incoming message topic.
The message pre-processor picks up an incoming message from the incoming message topic.
The pre-processor checks whether the incoming message is a structured conversation response. If it is, the rest of the process is short-circuited: pick the next message in the script, send it along with the available answers, then mark the message as handled and finish. The pre-processor also spellchecks the incoming message and analyses its sentiment. This improves the chances of success for the downstream tasks by not letting poor spelling (easy to do on a phone keyboard!) interfere with otherwise ‘good’ matches, and it lets us skip the downstream agents entirely if the user seems angry or upset.
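As a sketch, the pre-processor's routing decision might look like the following. The spellchecker and sentiment model are stand-in callables here, and all names and the sentiment scale are assumptions for illustration, not our actual implementation:

```python
from dataclasses import dataclass

# Assumed scale: -1.0 (very negative) .. +1.0 (very positive).
NEGATIVE_SENTIMENT_CUTOFF = -0.5

@dataclass
class PreprocessResult:
    text: str   # possibly spell-corrected message text
    route: str  # "structured", "human", or "agents"

def preprocess(text, is_structured_reply, spellcheck, sentiment):
    if is_structured_reply:
        # Short-circuit: the scripted conversation handler replies directly.
        return PreprocessResult(text, "structured")
    corrected = spellcheck(text)
    if sentiment(corrected) < NEGATIVE_SENTIMENT_CUTOFF:
        # The user seems upset: skip the agents, go straight to a human coach.
        return PreprocessResult(corrected, "human")
    # Otherwise, fan the (corrected) message out to the chat agents.
    return PreprocessResult(corrected, "agents")
```

For example, `preprocess("how do i clame points", False, spellcheck, sentiment)` would fix the typo and route the corrected text to the agents, while a clearly negative message is routed straight to a human.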
Fan out. Write a copy of the incoming message to the processing topic, and enqueue another copy on the output topic, scheduled for delivery in 3 seconds' time. You'll see why in a moment.
Each chat agent picks up the incoming message from the processing topic, does its best to produce a response and a confidence rating, and writes them to the output topic.
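Each agent can be thought of as implementing a tiny common interface: take the message text, return a candidate reply plus a confidence, or nothing. A toy illustration (this class and its keyword matching are invented for this post, not one of our real agents):

```python
from typing import Optional, Tuple

class KeywordAgent:
    """Toy agent: answers from a fixed keyword table.

    Real agents wrap external services or ML models, but each exposes the
    same shape of result: (reply, confidence in [0, 1]), or None.
    """

    def __init__(self, answers):
        self.answers = answers  # {keyword: canned reply}

    def respond(self, text: str) -> Optional[Tuple[str, float]]:
        words = text.lower().split()
        for keyword, reply in self.answers.items():
            if keyword in words:
                # Fixed confidence for a direct keyword hit.
                return reply, 0.9
        return None  # nothing to offer; the aggregator just won't hear from us
```

Keeping the interface this narrow is what makes it cheap to bolt on a new third-party service later: it only needs a thin adapter that maps its native score onto the shared confidence scale.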
Fan in. Finally, the aggregator is notified of each potential response and rating from the output topic. It stores each one temporarily and waits for that one last copy of the 'original' message (the one scheduled for delivery in 3 seconds' time). Once it sees the 'original' message, it retrieves all the potential responses it has seen from the agents and picks the one with the highest confidence, provided that confidence is above a pre-defined threshold (usually 80%).
If an agent doesn't post its response before the 'original' message is delivered to the aggregator, too bad - the aggregator proceeds with the potential responses it's already seen.
The aggregator looks at the best potential response and, if there is one, sends it to the user and marks the user's message as 'read'. If no agent produced a response with a high enough confidence, the aggregator instead assigns the message to a human coach. The answers that didn't meet the confidence threshold are still shown to the coach, who can select one instead of writing a unique answer, saving time when a potential answer was actually good but simply scored too low.
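The fan-in selection rule itself is simple once the candidates are in hand. A minimal in-memory sketch, assuming candidates are (agent, reply, confidence) tuples and using the 80% threshold from above (in production the candidates arrive on a Service Bus topic and the 'original' message triggers the decision; here we model only the selection):

```python
CONFIDENCE_THRESHOLD = 0.8  # the pre-defined threshold, usually 80%

def aggregate(candidates, threshold=CONFIDENCE_THRESHOLD):
    """Pick a reply from the candidates seen so far.

    candidates: list of (agent_name, reply, confidence) tuples.
    Returns ("send", reply) when some candidate clears the threshold,
    otherwise ("human", candidates) so a coach can still see the
    low-scoring suggestions instead of starting from scratch.
    """
    if candidates:
        best = max(candidates, key=lambda c: c[2])
        if best[2] >= threshold:
            return ("send", best[1])
    return ("human", candidates)
```

Note that agents which miss the 3-second window simply never appear in `candidates`; the decision is made with whatever has arrived.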
To close the feedback loop, we also (manually) rate a subset of the human and automated responses. This gives us a few extra benefits: it acts as quality assurance and feedback for the human coaches, helping them improve their own work; it gives us QA and feedback on the performance of each agent, so we can tweak its responses and confidence levels if necessary; and it ensures that when we (re)train our custom agents on our own message history, the agents only ‘learn’ from what we consider to be good responses.
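The 'only learn from good responses' part can be sketched as a filter over the rated history. The field names and the 1-5 rating scale below are assumptions for illustration, not our actual schema:

```python
MIN_TRAINING_RATING = 4  # assumed 1-5 manual rating scale

def training_pairs(history, min_rating=MIN_TRAINING_RATING):
    """Select (question, response) pairs for retraining.

    history: iterable of dicts with 'question', 'response', 'rating' keys.
    Exchanges that were never rated (rating is None) are excluded, so the
    agents never learn from unreviewed answers.
    """
    return [
        (h["question"], h["response"])
        for h in history
        if h.get("rating") is not None and h["rating"] >= min_rating
    ]
```

The same rated subset doubles as an evaluation set: run each agent over the questions and compare its confidence against whether the rated answer matched.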
As you can see, fanning out to multiple chat agents allows us to improve each agent separately, and even add new ones if we decide we need to. Then fanning in allows us to pick the best response, and fall back to the human coach if none of the agents are very confident in their answer.
So why bother with multiple agents?
- The first reason is simple. No one agent can do everything at the moment. Maybe in the future, we’ll have single-agent chatbots that are useful and realistic, but we’re not there yet.
- We can test and incorporate new third-party systems, and new and updated ML models fairly easily.
- Lastly, each model, service, and agent is good at recognising certain types of questions and statements and generating certain types of responses. We don’t need to try to force a particular service to do something that it’s not particularly suited to just because it’s the only one available.
- Currently, our agents include Microsoft QnA Maker, Google Dialogflow, a custom-developed text-matching model, and another that mines our message history for highly rated responses. Our custom matching model uses two main techniques: full-text search over the message history, and semantic matching using a combination of ML algorithms. We might go into this in more detail in a future blog post.
- By wrapping other services, we can make the most of the various publicly available state-of-the-art services. We can take time to evaluate what each is best at, figure out the kinds of responses and conversations we can use them for, and craft our content and training data to make the most of each service. As we improve each agent and the overall ensemble, we expect to see the automated system handling a larger and larger proportion of messages from our users, freeing up the human coaches to have increasingly personalised and context-aware conversations with users.
There are still some areas where the automated agents struggle:
- Multi-language support – a large proportion of our users interact with us in Cantonese, but state-of-the-art chatbots and supporting services are geared heavily towards English. Our agents are currently better with English than Cantonese.
- Chat context – answers that need user context, such as what answers the user has given previously.
- Unstructured dialogs with a lot of back-and-forth with the user.