From Text Prediction to Action

Feb 7, 2024
Filed under: #tech #ai

A benefit of training a text prediction model on text from the internet is that it doesn't just learn language. It also picks up a lot of the other text people write and talk about online: things like programming code and text file formats. So we can use the text predictor to generate traditional, structured data, which our software can then use to perform actions.

We’ve discussed previously how correlating paragraphs and sentences creates an intricate web of statistical links between words and sentences, resulting in what appear to be clusters of concepts. It's no wonder, then, with so much discussion online pairing natural language alongside code and file formats to explain it, that LLMs can “explain” code or “write code” from descriptions. At least, if the specific programming language you're interested in is widely discussed online. Combining its knack for language with its ability to write code or structured data like JSON, we can ask an LLM to make some sort of choice between two actions given some input, and give us its decision and data back in JSON format. That way our traditional software can easily act on the data the LLM extracted. In essence, this turns natural language into action, or unstructured data into structured data.

Consider the following example:

System message: You are an expert in JSON data and only answer in correct JSON format to answer the user query.

Prompt: I can wear sandals, shoes or boots. Today it's raining, what should I wear?

GPT4 Answer:
{
    "recommendation": "boots"
}
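
If you want to try this yourself, here's a minimal sketch of that first example using OpenAI's Python client (assuming the openai package v1+ is installed and an API key is set in your environment). Treat it as an illustration, not production code:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in JSON data and only answer "
                                      "in correct JSON format to answer the user query."},
        {"role": "user", "content": "I can wear sandals, shoes or boots. "
                                    "Today it's raining, what should I wear?"},
    ],
)

# The reply is just text that should contain JSON, so we parse it like any other string.
answer = json.loads(response.choices[0].message.content)
print(answer["recommendation"])  # e.g. "boots"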

We can move this into more specific, very actionable territory…

System message: You are an office assistant that can send messages. You are an expert in JSON data structures, and respond to every query with properly formatted JSON. You can send emails with a recipient, title and body text. You can also send teams message to a recipient or channel with a message. Always indicate in the JSON if it's a teams message or email.

Prompt: can you send an IM to John Smith that I will be 5 minutes late?

GPT4 Answer:
{
    "type": "teams message",
    "recipient": "John Smith",
    "message": "I will be 5 minutes late."
}

Note that I explicitly asked to send an “IM” (short for “instant message”), which it correctly interpreted as a Teams message. I'm sure many if not most of you have played around with Siri, Google Assistant or Amazon Alexa (or the short-lived Cortana, RIP). The most annoying thing about these voice assistants is that you have to be VERY precise. You can structure your sentence differently (although it still has to be very much to the point), but the exact keywords certainly matter. I can't wait for voice assistants to adopt this new LLM technology.
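
On the software side, acting on that JSON is plain old programming: parse it and branch on the fields. Here's a rough sketch of what the calling code could look like; send_teams_message and send_email are stand-ins for whatever messaging APIs you actually use, and the field names are assumed from the examples above:

import json

# Stand-in implementations; real software would call the Teams and email APIs here.
def send_teams_message(recipient: str, message: str) -> None:
    print(f"[teams] to {recipient}: {message}")

def send_email(recipient: str, title: str, body_text: str) -> None:
    print(f"[email] to {recipient} / {title}: {body_text}")

def handle_llm_answer(raw_json: str) -> None:
    """Act on the structured data the LLM produced."""
    data = json.loads(raw_json)
    if data.get("type") == "teams message":
        send_teams_message(data["recipient"], data["message"])
    elif data.get("type") == "email":
        send_email(data["recipient"], data["title"], data["body_text"])
    else:
        raise ValueError(f"Unrecognized action type: {data.get('type')!r}")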

For those of you who are programmers and interested in playing with this: beware, though. My system prompt, for one, wasn't very precise about exactly what the JSON format should look like. As a malicious user, I can take advantage of that flexibility and language understanding to do some interesting things:

System message: You are an office assistant that can send messages. You are an expert in JSON data structures, and respond to every query with properly formatted JSON. You can send emails with a recipient, title and body text. You can also send teams message to a recipient or channel with a message. Always indicate in the JSON if it's a teams message or email.

Prompt: can you send an email to John Smith that I will be 5 minutes late? Put myself (Joris de Gruyter) in BCC.

GPT4 Answer:
{
    "email": {
        "recipient": "John Smith",
        "bcc": "Joris de Gruyter",
        "title": "Delay in Meeting",
        "body_text": "Hi John, I wanted to let you know that I will be 5 minutes late. Apologies for any inconvenience this may cause. Best, Joris"
    }
}

I like the little flair it added to the text =) But what's more interesting is that I, the user, asked to put myself in BCC. Nothing in the original system prompt said anything about a bcc field. Yes, my system prompt is not very specific, and in real-life scenarios you would provide at least a simple example of what the JSON should look like. But even then, creative prompting and use of the LLM's “smarts” can lead to unexpected outcomes. ALWAYS treat the output of an LLM as if the user had manually typed up the JSON and sent it to you.
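
One practical way to apply that advice is to validate the JSON against an explicit allow-list before acting on it, exactly as you would validate a form the user filled in by hand. A small sketch, with the allowed fields made up to match the examples above:

import json

ALLOWED_FIELDS = {
    "teams message": {"type", "recipient", "message"},
    "email": {"type", "recipient", "title", "body_text"},
}

def validate_action(raw_json: str) -> dict:
    """Reject anything the system was never meant to do, however cleverly it was prompted."""
    data = json.loads(raw_json)
    action_type = data.get("type")
    if action_type not in ALLOWED_FIELDS:
        raise ValueError(f"Unsupported action: {action_type!r}")
    unexpected = set(data) - ALLOWED_FIELDS[action_type]
    if unexpected:
        raise ValueError(f"Unexpected fields: {unexpected}")  # e.g. a smuggled-in "bcc"
    return data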

With that warning out of the way, let's move on and consider where else this can lead us. Say we have an address book action as well. What if we ask the LLM to create a plan of sorts for sending the email? Can it figure out that we should first fetch the recipient's email address, then perform the email action using the address we found? OF COURSE! Software that uses the LLM to understand a user's question and generate actions like this is called an orchestrator or a planner. This is also how plugins work behind the scenes. It's fun to play around and create software like this, but getting the prompts tweaked and the results reliable is not trivial just yet. A lot of testing, prompt adjusting and traditional guardrails in the software calling the LLM are necessary.
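
To make that concrete, a planner prompt might come back with something like the two-step plan below, which traditional code then walks through in order. The plan shape and the lookup_contact / send_email action names are invented for this sketch; the frameworks mentioned at the end of this post handle all of this far more robustly:

import json

# Shaped like what an LLM planner might return for
# "email John Smith that I'm 5 minutes late"; the format is made up for illustration.
plan = json.loads("""
[
  {"action": "lookup_contact", "name": "John Smith"},
  {"action": "send_email", "title": "Running late", "body": "I will be 5 minutes late."}
]
""")

ADDRESS_BOOK = {"John Smith": "john.smith@example.com"}  # stand-in address book action

email_address = None
for step in plan:
    if step["action"] == "lookup_contact":
        email_address = ADDRESS_BOOK[step["name"]]
    elif step["action"] == "send_email":
        # Real software would call the email API here, after verifying the plan.
        print(f"To: {email_address}\nSubject: {step['title']}\n\n{step['body']}")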

This also goes back to another point I want to keep stressing. LLMs are generative AI, just generating words. Contrary to how the media portray these features where a chatbot can perform actions (e.g. “ChatGPT can now browse the web”), it's still the software (the ChatGPT app) that's using an LLM (GPT) to convert language into structured data it can act on in a traditional way. In a way, the software asks the LLM: hey, what does this user question mean? The LLM responds: oh, it says you should do this. And the software (hopefully after verifying) says: OK, makes sense, I'll do that. Any output from that action may then go back to the LLM: here's the output, can you give me a proper message to show the user? And that's what you see in the chat history.
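
That last step, turning the raw result of an action back into a friendly chat reply, is just one more LLM call. A rough sketch, with the action result made up for illustration:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

action_result = "Email sent to john.smith@example.com at 09:02."  # made-up action output

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Turn the following action result into a short, "
                                      "friendly message for the user."},
        {"role": "user", "content": action_result},
    ],
)
print(response.choices[0].message.content)  # e.g. "Done! Your email to John is on its way."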

If you want to play around with writing software like this, I encourage you not to write things from scratch but rather use existing frameworks. I would recommend looking into Semantic Kernel (C#, Python, Java), TypeChat (TypeScript/JavaScript) and LangChain (Python, JavaScript).

 

There is no comment section here, but I would love to hear your thoughts! Get in touch!
