Routing Customer Feedback with Few-Shot Retrieval
A stream of customer feedback arrives as free text, written by people who are annoyed, grateful, or just terse, and somebody has to decide which team should act on each message. “รถสกปรก” (rot sokaprok, “the car is dirty”) belongs to the wash-and-cleaning desk. “ใช้เวลานาน” (chai wela nan, “it took too long”) belongs to whoever owns turnaround time. A single message often belongs to several teams at once. Done by hand, this triage is slow, inconsistent between reviewers, and the first thing to fall behind when volume spikes. The job is to read each message and route it automatically to every team it concerns.
The textbook name for this is multi-class, multi-label text classification, and the textbook answer is to train a dedicated classifier. That is a defensible answer. It is also, when the real constraint is to ship something useful soon, the wrong opening move. This post is about the reframe that let the system go live quickly and then get better in production: treat classification as constrained generation grounded by retrieval, and turn every human correction into a new example the system can use immediately.
Why not just train a classifier
A dedicated multi-label classifier is a real project, not a checkbox. It needs a labeled corpus large enough to learn from, a language-appropriate representation, a training loop, a held-out evaluation you trust, a serving path, and a story for what happens when the label taxonomy changes, which it always does. Worse, the taxonomy here is not five buckets. It is dozens of fine-grained categories, each with a human-readable label, and they are exactly the kind of categories an operations team likes to add to and rename as they learn what they want to track.
Every one of those taxonomy edits is a retraining trigger for a trained model. Add a category, and the classifier has never seen it; it cannot emit a class that was not in its training labels. So the model that was supposed to save effort quietly becomes a thing you re-fit on a schedule, and the gap between “an operator noticed a new kind of complaint” and “the model can route it” is one full training cycle wide. When the requirement is speed, paying that cost up front, before anyone has confirmed the categories are even right, is premature.
The reframe is small and it changes everything: a classifier picks from a fixed set of learned classes, but a language model can simply name the right categories, given the list of categories to choose from. Classification becomes text generation. And text generation does not need a training run to learn a new category; it needs the new category to appear in its prompt.
This shortcut pays off in any language, but it pays off most in a low-resource one. Where a major language hands you pretrained classifiers and embeddings to fine-tune, a low-resource language may offer neither, and building a language-appropriate representation from scratch is exactly the multi-month detour the generation approach skips. The saving in development time is largest precisely where conventional NLP is hardest.
Making the output trustworthy: structured generation
Asking a model to “name the category” is only useful if the answer is machine-readable every single time. Free-form prose that mentions a category is not an API; a downstream router needs a guaranteed shape. So the output is constrained to a typed schema rather than parsed out of prose. With instructor wrapping the OpenAI client, the model is handed a Pydantic model as its response contract and the library enforces it:
    import instructor
    from openai import OpenAI
    from pydantic import BaseModel

    class Routing(BaseModel):
        categories: list[str]  # every category that applies to the message

    # LLM_MODEL and prompt are defined elsewhere in the service.
    client = instructor.patch(OpenAI())
    result = client.chat.completions.create(
        model=LLM_MODEL,
        response_model=Routing,  # the contract, validated on the way out
        messages=[{"role": "user", "content": prompt}],
    )
The multi-label requirement falls out naturally. The usual trained-model design for multi-label output is a fixed set of binary heads, one per class. Here, a message that touches several concerns simply comes back as a list with several entries, and a single-concern message as a list with one. No fixed width, no per-class threshold tuning; the model names every category that applies and the schema guarantees the result parses.
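The schema guarantees that the response parses as a list of strings; by itself it does not guarantee that each string is a category the router knows. A minimal sketch of one way to close that gap, assuming the taxonomy is available as a set of label strings. The `TAXONOMY` contents and the `split_known` helper are hypothetical and not part of the original service:

```python
# Hypothetical taxonomy for illustration; the real one has dozens of labels.
TAXONOMY = {"cleanliness", "turnaround_time", "staff_explanation", "staff_attentiveness"}


def split_known(predicted: list[str], taxonomy: set[str]) -> tuple[list[str], list[str]]:
    """Separate predicted labels into known categories and unknown strings.

    Order is preserved and duplicates are dropped, so the router sees each
    category at most once.
    """
    known: list[str] = []
    unknown: list[str] = []
    seen: set[str] = set()
    for label in predicted:
        label = label.strip()
        if label in seen:
            continue
        seen.add(label)
        (known if label in taxonomy else unknown).append(label)
    return known, unknown


known, unknown = split_known(
    ["cleanliness", " turnaround_time", "parking", "cleanliness"], TAXONOMY
)
```

Labels that land in `unknown` could be logged or sent to a reviewer rather than routed, which also surfaces candidate categories the taxonomy does not yet have.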
The part that does the real work: retrieval as few-shot
Here is the catch. Hand a model a long list of fine-grained categories and a raw message and ask it to choose, cold, and it does roughly the right thing, which is not good enough for routing. The categories are close together. The difference between “the staff did not explain things” and “the staff was not attentive” is a judgment call that depends on how this particular operation has drawn the line in the past. Zero-shot, the model does not know where that line is. It is guessing at a house style it has never seen.
The fix is few-shot prompting: show the model worked examples so it can imitate the decision instead of inventing it. But static, hand-picked examples do not scale to dozens of categories, and they go stale. So the examples are retrieved per message instead of fixed. Every message that has already been labeled is embedded once and kept in ChromaDB next to its category; that example store is the only standing component the approach needs. Then, for each incoming message, the flow is the same three steps:
- Embed the incoming message with the same model used to build the store.
- Ask ChromaDB for its nearest neighbors by cosine similarity: the past messages that most resemble it in meaning, not in spelling.
- Paste those neighbors, with the categories a human already assigned them, into the prompt as the few-shot examples.
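    def get_examples(text: str, n: int = 5) -> str:
        """Format the n nearest labeled messages as few-shot rows."""
        # Embed the query with the same embed() used when the store was built.
        neighbors = collection.query(query_embeddings=[embed(text)], n_results=n)
        rows = "message, category\n"
        for doc, meta in zip(neighbors["documents"][0], neighbors["metadatas"][0]):
            rows += f"{doc}, {meta['category']}\n"
        return rows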
The model is no longer reasoning from a label list in the abstract. It is looking at the n most similar past messages (five by default, adjustable per request), seeing how a human routed each, and continuing the pattern. The full taxonomy still goes into the prompt so every category stays reachable, but the retrieved neighbors are what make the call precise, because they encode the house style the bare list cannot.
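How the taxonomy, the retrieved rows, and the new message come together in one prompt is not shown in this post. The following is a minimal sketch of one plausible assembly; the `build_prompt` name, the instruction wording, and the category names are assumptions for illustration only:

```python
def build_prompt(message: str, taxonomy: list[str], examples: str) -> str:
    """Assemble a routing prompt: full taxonomy, retrieved examples, new message."""
    category_lines = "\n".join(f"- {name}" for name in taxonomy)
    return (
        "Route the customer message to every category that applies.\n"
        "Use only categories from this list:\n"
        f"{category_lines}\n\n"
        "Similar past messages and the categories a reviewer assigned:\n"
        f"{examples}\n"
        "New message:\n"
        f"{message}\n"
    )


# `examples` has the same "message, category" shape that get_examples returns.
examples = "message, category\nรถสกปรก, cleanliness\n"
# "รถไม่สะอาด" means "the car is not clean".
prompt = build_prompt("รถไม่สะอาด", ["cleanliness", "turnaround_time"], examples)
```

Keeping the full list ahead of the examples means a category with no retrieved neighbor is still a legal answer, while the examples carry the house style.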
Does the retrieval actually help?
The claim that retrieved neighbors are what make the call precise is an empirical one, so it is worth measuring rather than asserting. The test is an A/B comparison on a held-out split of labeled feedback: the same messages routed two ways, once by the model working cold from the taxonomy alone (zero-shot), and once with the retrieved few-shot examples added (the RAG version), scored against the human labels. Single-occurrence categories are pushed into the training side so nothing unseen lands in the test set, and the comparison is run on two separate held-out test sets to check the result is not a fluke of one.
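A split with that property can be sketched as follows. This is an illustrative variant, not the original evaluation code: it reserves one example of every category for the training side before sampling the test set, which covers the single-occurrence case and also prevents a two-example category from landing entirely in the test set. The function name and the `(text, category)` row shape are assumptions:

```python
import random


def split_with_seen_categories(rows, test_fraction=0.2, seed=0):
    """Split (text, category) rows so every test category also appears in train."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)

    train, pool, seen = [], [], set()
    for text, category in shuffled:
        if category not in seen:
            seen.add(category)
            train.append((text, category))  # first sighting always goes to train
        else:
            pool.append((text, category))   # later sightings may become test rows

    n_test = min(len(pool), round(len(rows) * test_fraction))
    test = pool[:n_test]
    train.extend(pool[n_test:])
    return train, test


rows = (
    [(f"dirty {i}", "cleanliness") for i in range(5)]
    + [(f"slow {i}", "turnaround_time") for i in range(3)]
    + [("only one of these", "staff_explanation")]
)
train, test = split_with_seen_categories(rows)
```

Running it with two different `seed` values is one simple way to obtain two separate held-out test sets from the same labeled pool.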
Because a message can carry several categories, ordinary accuracy is the wrong yardstick; the evaluation uses multi-label metrics instead:
- Exact Match Ratio is the strict one: the message counts as correct only if every category matches, no misses and no extras. It is the closest thing to “did this message get routed perfectly.”
- Macro F1 averages the precision/recall balance per category, so a rare category weighs as much as a common one. It rewards getting the long tail right, not just the handful of frequent categories.
- Hamming loss is the fraction of individual category slots the model got wrong; unlike the other two, lower is better.
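The three metrics are short enough to write out directly. The sketch below uses plain Python sets rather than any particular library; the label names and toy predictions are made up, and scoring a category with no true or predicted instances as 0.0 is an assumed convention:

```python
def exact_match_ratio(y_true: list[set[str]], y_pred: list[set[str]]) -> float:
    """Share of messages whose predicted label set equals the true set exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true: list[set[str]], y_pred: list[set[str]], labels: list[str]) -> float:
    """Share of (message, category) slots that are wrong: misses plus extras."""
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * len(labels))


def macro_f1(y_true: list[set[str]], y_pred: list[set[str]], labels: list[str]) -> float:
    """Unweighted mean of per-category F1, so rare categories count equally."""
    scores = []
    for label in labels:
        tp = sum(label in t and label in p for t, p in zip(y_true, y_pred))
        fp = sum(label not in t and label in p for t, p in zip(y_true, y_pred))
        fn = sum(label in t and label not in p for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # assumed: empty category scores 0
    return sum(scores) / len(scores)


labels = ["clean", "time", "staff"]
y_true = [{"clean"}, {"time", "staff"}, {"staff"}, {"clean", "time"}]
y_pred = [{"clean"}, {"time"}, {"staff", "clean"}, {"clean", "time"}]

em = exact_match_ratio(y_true, y_pred)    # 2 of 4 messages match exactly
hl = hamming_loss(y_true, y_pred, labels) # 2 wrong slots out of 12
mf = macro_f1(y_true, y_pred, labels)     # mean of 0.8, 1.0 and 2/3
```

Because Hamming loss divides by every message-category slot, a taxonomy with dozens of categories makes most slots trivially correct, which is why its values look small next to the other two metrics.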
| Test set | Condition | Exact Match | Macro F1 | Hamming Loss |
|---|---|---|---|---|
| A | Zero-shot LLM | 0.20 | 0.23 | 0.086 |
| A | + RAG few-shot | 0.61 | 0.57 | 0.035 |
| B | Zero-shot LLM | 0.22 | 0.17 | 0.038 |
| B | + RAG few-shot | 0.60 | 0.37 | 0.017 |
The pattern is the same on both test sets and it is not subtle. Retrieval roughly tripled the exact-match rate, lifting it from “wrong four times out of five” to “right three times out of five,” and it cut the Hamming loss by half or more. Macro F1 tells the more interesting story: the jump is larger on set A (0.23 to 0.57) than on set B (0.17 to 0.37), because set B has the longer, finer-grained taxonomy, and the long-tail categories are exactly where a handful of retrieved examples help most and where there is the most headroom left.
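The "roughly tripled" and "half or more" readings can be checked directly from the table. A small sketch of that arithmetic, using only the numbers reported above (the dictionary layout and the `summarize` helper are just for illustration):

```python
# (exact match, macro F1, Hamming loss) copied from the results table.
results = {
    "A": {"zero_shot": (0.20, 0.23, 0.086), "rag": (0.61, 0.57, 0.035)},
    "B": {"zero_shot": (0.22, 0.17, 0.038), "rag": (0.60, 0.37, 0.017)},
}


def summarize(zero_shot, rag):
    """Relative change from the zero-shot run to the retrieval-augmented run."""
    em0, f0, h0 = zero_shot
    em1, f1, h1 = rag
    return {
        "exact_match_ratio": em1 / em0,    # how many times higher exact match is
        "macro_f1_gain": f1 - f0,          # absolute gain in macro F1
        "hamming_reduction": 1 - h1 / h0,  # fraction of Hamming loss removed
    }


summary = {name: summarize(r["zero_shot"], r["rag"]) for name, r in results.items()}
```

Exact match comes out about 3.05 times higher on set A and about 2.73 times higher on set B, and Hamming loss falls by roughly 59% and 55% respectively.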
Two honest caveats. The absolute numbers are not a trained classifier’s numbers; 0.6 exact match on a fine multi-label taxonomy is a starting point, not a finish line. And these were measured on a fixed snapshot of the example store, before the human-in-the-loop cycle described in the next section had run at all. That second caveat is the whole point: this is the floor, and the next section is about why it rises.
The flywheel: corrections are training data, available instantly
Routing the message is not the end of the story; it is the start of the loop that makes the whole approach pay off. A reviewer still looks at the routed cases, and sometimes the model got it wrong, or right but incomplete. With a trained classifier, that correction is inert: it sits in a spreadsheet until the next retraining batch, weeks of similar mistakes later. Here the correction is the mechanism. The corrected message is upserted straight into ChromaDB as one more labeled example:
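    import uuid

    @app.post("/add")
    def add(example: LabeledExample):
        # Store the reviewer's correction as one more labeled example.
        collection.upsert(
            embeddings=[embed(example.text)],
            documents=[example.text],
            metadatas=[{"category": example.category}],
            ids=[f"_id_{uuid.uuid4()}"],  # fresh id, so each correction is kept as a new example
        )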
The next time a similar message arrives, that freshly corrected example is sitting in the retrieved neighborhood, steering the model toward the answer a human just endorsed. There is no training run, no redeploy, no version bump. The example store is the model’s memory, and a person edits it directly by doing the review they were already going to do. A new category works the same way: add a few labeled examples for it and it becomes routable immediately, because “learning” it means nothing more than making it retrievable.
This is human-in-the-loop review done as a flywheel rather than a chore. The reviewers are not annotating a dataset for some future model; their everyday corrections improve the live system on the next request. Over time the most confusing, most-corrected message types accumulate the densest example coverage, exactly where the model needs the most help. The reported outcome of running this in production was case resolution time falling by roughly 18%, and the honest source of that number is less the cleverness of any single prediction than this compounding loop tightening routing week over week.
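The effect of a single correction can be seen in miniature with an in-memory stand-in for the store. Everything here is a toy under stated assumptions: `toy_embed` is a bag-of-words counter rather than a real embedding model, `ExampleStore` is a hypothetical replacement for ChromaDB, and the messages and category names are invented:

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real system uses a sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ExampleStore:
    """Hypothetical in-memory store: add labeled examples, query nearest neighbors."""

    def __init__(self) -> None:
        self.rows: list[tuple[Counter, str, str]] = []  # (vector, text, category)

    def add(self, text: str, category: str) -> None:
        self.rows.append((toy_embed(text), text, category))

    def nearest(self, text: str, n: int = 1) -> list[tuple[str, str]]:
        query = toy_embed(text)
        ranked = sorted(self.rows, key=lambda row: cosine(query, row[0]), reverse=True)
        return [(doc, category) for _, doc, category in ranked[:n]]


store = ExampleStore()
store.add("the car was dirty inside", "cleanliness")
store.add("the wait was too long", "turnaround_time")

incoming = "staff did not explain the wait"
before = store.nearest(incoming)  # closest example is about waiting time

# A reviewer corrects a similar case; the store changes, nothing is retrained.
store.add("staff did not explain the delay", "staff_explanation")
after = store.nearest(incoming)   # the correction is now the nearest neighbor
```

The same incoming message retrieves a different top example before and after the correction, which is the whole mechanism: the prompt for the next similar message changes the moment the store does.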
Packaging it so it can move
None of this earns its keep if seeding and expanding the example store means a deploy. The whole thing runs as a Dockerized FastAPI service with ChromaDB as a companion container, and the example store is editable at runtime through the API: /add for a single labeled message, a bulk CSV upload for backfilling history or seeding a fresh deployment, and /predict for routing. An operations team can paste a batch of historical labels in as a CSV and watch routing sharpen, with no engineer in the path. The thing that would have been a retraining job in the trained-model world is, here, an HTTP request.
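The CSV layout the bulk upload expects is not specified here. As a sketch only, assuming a header row with `message` and `category` columns, the parsing step that would precede the upsert might look like this (`parse_labeled_csv` is a hypothetical helper):

```python
import csv
import io


def parse_labeled_csv(raw: str) -> list[tuple[str, str]]:
    """Parse CSV text with 'message' and 'category' columns into (text, category) pairs."""
    reader = csv.DictReader(io.StringIO(raw))
    pairs = []
    for row in reader:
        text = (row.get("message") or "").strip()
        category = (row.get("category") or "").strip()
        if text and category:  # skip blank or half-filled rows
            pairs.append((text, category))
    return pairs


raw = (
    "message,category\n"
    '"Waited two hours, no update",turnaround_time\n'
    "รถสกปรก,cleanliness\n"
    ",cleanliness\n"
)
pairs = parse_labeled_csv(raw)
```

Using the `csv` module rather than splitting on commas matters for feedback text, since customer messages routinely contain commas and quotes of their own.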
When this is the right call, and when it stops being one
This architecture is the right first answer, not the eternal one, and being clear about the trade is the point. Every routed message is a model call, so there is per-request cost and latency a trained classifier running locally would not have, and a dependency on a hosted model. There is no hard accuracy guarantee; you are trading a model you can characterize precisely for one you can change instantly. For high, stable volume over a frozen taxonomy, those trades eventually favor training the dedicated classifier after all.
But here is the quiet bonus: by the time that day comes, you are not starting from nothing. Every correction the reviewers made has been accumulating in ChromaDB as a clean, human-verified, labeled example. The retrieval store you built to avoid training is precisely the labeled dataset you would need to train. The fast path and the eventual robust path are not in conflict; the first one quietly produces the raw material for the second.
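Turning the store into a training set is mostly a reshaping step. The sketch below assumes a dump shaped like the result of ChromaDB's `collection.get(include=["documents", "metadatas"])`, and assumes a multi-label message is stored as one entry per category, as the single-category /add endpoint suggests; `to_training_rows` is a hypothetical helper:

```python
from collections import defaultdict


def to_training_rows(dump: dict) -> list[tuple[str, list[str]]]:
    """Group a store dump into (text, sorted labels) rows for multi-label training."""
    labels_by_text: dict[str, set[str]] = defaultdict(set)
    for text, meta in zip(dump["documents"], dump["metadatas"]):
        labels_by_text[text].add(meta["category"])
    return [(text, sorted(labels)) for text, labels in labels_by_text.items()]


# Hand-written stand-in for a store dump, so the sketch runs without ChromaDB.
dump = {
    "ids": ["a", "b", "c"],
    "documents": ["late and dirty", "late and dirty", "very slow"],
    "metadatas": [
        {"category": "cleanliness"},
        {"category": "turnaround_time"},
        {"category": "turnaround_time"},
    ],
}
rows = to_training_rows(dump)
```

Each row pairs a message with every category reviewers attached to it, which is the shape a multi-label classifier's training loop would consume.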
What carried the project was refusing the reflex. “Multi-label text classification” sounds like it demands a trained model, and that framing would have spent the first month on data plumbing before anything routed a single message. Reframing it as retrieval plus constrained generation put a working router in front of reviewers almost immediately, and wiring their corrections back into the retrieval store turned the unglamorous review work into the engine that makes routing better. The model was never the interesting part. The loop was.