Building a Bilingual Semantic Product Search Engine

A second-hand marketplace is a search problem wearing a catalog’s clothes. The listings are written by sellers, not a content team. One person lists a handbag as a “bag”, another as a “purse”, another types the Thai word กระเป๋า (krapao, “bag”), and another writes ของแท้ มือสอง (khong thae mue song, “authentic, second-hand”). They are all the same object. A buyer who searches “purse” should find every one of them. Keyword search, the kind every database ships with, matches strings: it has no idea that “purse” and กระเป๋า point at the same thing, because the letters do not overlap. So the goal was the now-familiar one: make search understand meaning, not just spelling, and do it across Thai and English in the same index.

The reflex answer to “semantic search” is a vector database. Qdrant, Weaviate, LanceDB, pgvector: pick one, embed everything, query by cosine distance. That is the right answer for some teams. It was the wrong first answer here, and most of this post is about the two decisions that followed from saying so: where the search index actually comes from, and how to make its ranking trustworthy.

The infrastructure decision: boring on purpose

An early-stage product chasing its first users does not optimize for the most powerful retrieval stack. It optimizes for the least operational surface that still clears the bar. The data already lived in PostgreSQL, the system of record. Standing up and operating a separate vector store, tuning its indexing, keeping it alive at 3am, is a standing cost paid in the scarcest currency a small team has: attention.

A managed search engine inverts that. Meilisearch Cloud ships hybrid search, lexical and semantic in one query, with typo tolerance, faceting, and bilingual tokenization already in the box. You pour the data in, wait a few minutes, and it works. No embedding pipeline to babysit, no nearest-neighbor index to re-tune as the catalog grows. The trade is real: you give up some control over the retrieval internals. But for a team that needs good search this quarter rather than the best search next year, the managed engine is the correct boring choice.

Which leaves exactly one hard problem, and it is not the search engine.

The real work: data does not teleport

The catalog lives in Postgres. Search runs in Meilisearch. Nothing magical carries a row from one to the other. The moment a seller edits a price or a listing sells, the index is wrong until something reconciles it. That something is a synchronization mechanism, and it is where the engineering actually lives.

The naive version, having the application dual-write to both Postgres and Meilisearch on every change, is a trap. It scatters search code through the app, and the two stores drift the instant a write half-fails. The robust version treats the database’s own write-ahead log as the source of truth and follows it: change data capture (CDC). Postgres logical replication already emits every insert, update, and delete as a stream; a consumer tails that stream and pushes the changes into the index. The application keeps writing to Postgres exactly as before and never learns that search exists.
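To make the "follow the log" idea concrete, here is a minimal sketch of the consumer side. The `Change` shape and the in-memory `index` are hypothetical stand-ins, not MeiliBridge's actual event format or the Meilisearch client. The point is only that applying changes in log order as upserts and deletes is idempotent, so replaying a batch after a crash leaves the index in the same state.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Change:
    """One row-level change, in a hypothetical shape (not MeiliBridge's format)."""
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[dict[str, Any]] = None  # new row contents; None for deletes


def apply_changes(index: dict[int, dict[str, Any]], changes: list[Change]) -> None:
    """Apply changes in log order; replaying the same batch is harmless."""
    for change in changes:
        if change.op in ("insert", "update"):
            # Upsert: the latest row image replaces whatever the index held.
            index[change.key] = dict(change.row or {})
        elif change.op == "delete":
            # pop with a default, so deleting a missing key is a no-op.
            index.pop(change.key, None)
        else:
            raise ValueError(f"unknown operation: {change.op}")


# A plain dict stands in for the search index in this sketch.
index: dict[int, dict[str, Any]] = {}
batch = [
    Change("insert", 1, {"name": "กระเป๋า", "price": 900}),
    Change("update", 1, {"name": "กระเป๋า", "price": 750}),
    Change("insert", 2, {"name": "purse", "price": 500}),
    Change("delete", 2),
]
apply_changes(index, batch)
apply_changes(index, batch)  # replay after a crash: same final state
print(index)
```

A dual-write has no equivalent of that replay: if the second write fails, nothing records that it still needs to happen.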

The first tool I reached for was meilisync. On paper it does precisely this. In practice its config and API had aged out: a repo last meaningfully touched years ago, assumptions that no longer matched current Meilisearch, and a setup that simply would not come up. Fighting a stale tool into working is rarely worth it, so I switched to MeiliBridge, a younger CDC bridge written in Rust that tails a Postgres logical-replication slot and batches changes into Meilisearch.

MeiliBridge had the right design but a deployment that did not start. Its v0.1.6 container refused to boot: the config loader was still looking for keys from an earlier schema (metrics where the file now said monitoring, among others), and a Redis dependency that was documented as optional was, in practice, mandatory. This is the ordinary texture of adopting young infrastructure. The difference from the meilisync dead end was that here the fix was legible, so I fixed it and sent it upstream: binary-touch/meilibridge#9 corrects the config mounting and metrics-port handling and makes Redis genuinely optional. With that patch, the bridge deploys and the pipeline runs.

The whole sync side is a single line: Postgres, then MeiliBridge, then Meilisearch. Postgres stays the single writer. MeiliBridge is the only thing that knows both stores exist. Meilisearch builds the embeddings as documents land, using a small OpenAI model over a template of the name and description fields, so the same index answers both a keyword match and a semantic one. A row mutated in Postgres shows up in search within a couple of seconds, with no change to the host application. That property, the index follows the database, is the entire reason to prefer CDC over dual-writes.
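For a sense of what the embedding side of that setup looks like, here is a sketch of an embedder settings payload of the kind Meilisearch accepts on an index's settings route. The embedder name `default`, the model name, and the template wording are assumptions for illustration; the post only fixes that it is a small OpenAI model over the name and description fields.

```python
import json


def embedder_settings(api_key: str) -> dict:
    """Build an embedder settings payload (names and model are assumptions)."""
    return {
        "embedders": {
            "default": {  # embedder name: an arbitrary choice for this sketch
                "source": "openAi",
                "apiKey": api_key,
                "model": "text-embedding-3-small",  # assumed; not stated in the post
                # Liquid template rendered per document before embedding.
                "documentTemplate": "A product named {{doc.name}}: {{doc.description}}",
            }
        }
    }


payload = embedder_settings("OPENAI_API_KEY_PLACEHOLDER")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The template matters more than it looks: whatever text it renders is the only thing the embedding model ever sees of a listing, which is why it was one of the knobs adjusted during relevance tuning.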

The part where the value hides: relevance

Standing up the pipeline gets results onto the screen. It says nothing about whether they are the right results, and that is the half that actually decides whether anyone trusts the search box. Hybrid search has a dial, the semantic ratio, that weighs lexical matching against embedding similarity. Turn it all the way to lexical and you are back to keyword search: a synonym or a cross-language match returns nothing. Turn it all the way to semantic and the engine starts returning things that are plausibly related but wrong, because embeddings are confident about neighbors that a shopper would never consider the same product.
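In request terms the dial is a single field. The sketch below builds the body of a Meilisearch search request with the `hybrid` object; the helper name and the embedder name `default` are assumptions, and the payload would be posted to the index's search route.

```python
def hybrid_query(q: str, semantic_ratio: float,
                 embedder: str = "default", limit: int = 10) -> dict:
    """Build a hybrid search body; 0.0 is pure lexical, 1.0 is pure semantic."""
    if not 0.0 <= semantic_ratio <= 1.0:
        raise ValueError("semantic_ratio must be between 0.0 and 1.0")
    return {
        "q": q,
        "limit": limit,
        "hybrid": {"semanticRatio": semantic_ratio, "embedder": embedder},
    }


# The two ends of the dial, plus a point in between, for the same query.
for ratio in (0.0, 0.5, 1.0):
    print(hybrid_query("purse", ratio))
```

Because the ratio travels with each request rather than living in the index, comparing two settings means re-running the same queries, not re-indexing.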

The bilingual catalog makes that tension sharp in a way an English-only catalog hides. Consider two Thai queries:

จักรยาน (chakkrayan, “bicycle”)
จักรยานยนต์ (chakkrayan-yon, “motorcycle”)

Look at the romanizations: the word for motorcycle is literally the word for bicycle plus a syllable, the way “motor” rides on top of “cycle” in English. The two strings overlap almost completely, and to an embedding model the two meanings sit almost on top of each other too. Lean the dial toward semantics and a search for bicycle starts dragging in motorcycles, and the reverse, precisely the items a buyer did not mean. Lean it back toward lexical and the brand and synonym queries that needed semantics go cold.

I treated this as a measurement problem rather than a matter of taste. The setup is simple: take a fixed list of representative queries, run each one, and have a human look at the top 10 results and mark each as relevant or not. That judged set then lets you score any configuration with two standard search metrics:

Precision@10 asks the blunt question: of the 10 results on the first screen, how many are actually relevant? Eight out of ten is 0.8. It does not care about order, only the hit rate in the top 10.
NDCG@10 (normalized discounted cumulative gain) asks the sharper question: are the good results near the top? It rewards a relevant item at position 1 more than the same item at position 9, because a shopper reads top-down. A score of 1.0 means the ideal ordering; lower means good results are buried below worse ones.
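Both metrics fit in a few lines. This is a sketch under two assumptions: judgments are binary (1 for relevant, 0 for not), and the "ideal" ordering for NDCG is computed only from the judged top-10 results, since nothing below the first screen was labeled.

```python
import math


def precision_at_k(judgments: list[int], k: int = 10) -> float:
    """Fraction of the top-k slots holding a relevant result (order ignored)."""
    return sum(judgments[:k]) / k


def dcg_at_k(judgments: list[int], k: int = 10) -> float:
    """Discounted cumulative gain: each hit is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(judgments[:k], start=1))


def ndcg_at_k(judgments: list[int], k: int = 10) -> float:
    """DCG divided by the DCG of the same judgments sorted best-first."""
    ideal = dcg_at_k(sorted(judgments, reverse=True), k)
    return dcg_at_k(judgments, k) / ideal if ideal > 0 else 0.0


top_heavy = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # eight hits, all at the top
buried = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]     # same eight hits, pushed down

for name, judged in (("top_heavy", top_heavy), ("buried", buried)):
    print(name, precision_at_k(judged), round(ndcg_at_k(judged), 3))
```

The two lists have the same Precision@10 of 0.8, but only the first has an NDCG@10 of 1.0. The second scores lower because its misses sit in the positions a shopper reads first.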

Precision tells you whether the right things showed up; NDCG tells you whether they showed up in the right order. I scored each candidate configuration on both as I moved the dial and adjusted ranking rules, typo tolerance, and the embedding template. The aggregate numbers stayed respectably high (mean NDCG@10 around 0.90), but the aggregate was not the interesting part. The per-query breakdown was, because it showed the trade instead of hiding it. The columns below are Precision@10, the count of relevant results in each query’s top 10:

Query	What it tests	More lexical (relevant in top 10)	More semantic (relevant in top 10)
`furniture` (EN category)	broad category recall	6/10	6/10
`Nike` (EN brand)	brand recall	5/10	10/10
`Hermes` (brand)	brand recall	7/10	8/10
`จักรยาน` (TH chakkrayan, bicycle)	near-synonym precision	9/10	4/10
`จักรยานยนต์` (TH chakkrayan-yon, motorcycle)	near-synonym precision	6/10	2/10

Raising the semantic weight was not a clean win or a clean loss. It rescued the brand queries (Nike went from half-relevant to perfect) while it eroded the Thai near-synonym pairs that were already doing fine on lexical matching. There is no single ratio that wins both columns, because the two columns want opposite things. Brand recall wants the engine to generalize; near-synonym precision wants it to stop generalizing at exactly the wrong moment.
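The way an average hides that trade can be checked directly from the table. The sketch below uses only the five rows shown above, so the means it prints describe those rows and not the full query set behind the aggregate NDCG figure.

```python
# Relevant results in the top 10, copied from the table above:
# query -> (more lexical, more semantic)
results = {
    "furniture": (6, 6),
    "Nike": (5, 10),
    "Hermes": (7, 8),
    "จักรยาน": (9, 4),
    "จักรยานยนต์": (6, 2),
}


def mean_precision(column: int) -> float:
    """Mean Precision@10 over the listed queries for one configuration."""
    return sum(counts[column] for counts in results.values()) / (10 * len(results))


# Per-query change when moving from the lexical to the semantic setting.
deltas = {query: semantic - lexical
          for query, (lexical, semantic) in results.items()}

print(f"mean P@10, more lexical:  {mean_precision(0):.2f}")
print(f"mean P@10, more semantic: {mean_precision(1):.2f}")
for query, delta in deltas.items():
    print(f"{query}: {delta:+d}")
```

Over these rows the two means land within a few hundredths of each other, while individual queries swing by as much as five results in either direction. A single averaged score would report "roughly the same" for two configurations that behave very differently.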

So the resolution was not a magic number but a division of labor. The retrieval dial is set where brand and synonym recall pay off (the demo runs at a semantic ratio of 0.7), and the cases where semantics over-generalizes are handled by a different mechanism: faceted filtering. A shopper who wants bicycles and is shown a motorcycle does not need the ranker to be perfect; they need a category facet that lets them say “bicycles only” in one click. Tuning decided what the ranker should reach for; faceting caught what it reached too far for. The metrics existed to make that boundary visible, not to chase a single score upward.
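As a request, "bicycles only" is one extra field on the same hybrid query. This sketch assumes a `category` attribute that has been declared filterable on the index, an embedder named `default`, and a category value of `bicycles`; all three are illustrative names rather than the project's actual schema.

```python
def faceted_hybrid_query(q: str, category: str,
                         semantic_ratio: float = 0.7,
                         embedder: str = "default") -> dict:
    """Hybrid search body restricted to one category (field name assumed)."""
    # Escape backslashes and double quotes so the value is a valid filter string.
    escaped = category.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "q": q,
        "limit": 10,
        "hybrid": {"semanticRatio": semantic_ratio, "embedder": embedder},
        "filter": f'category = "{escaped}"',
        # Ask for per-category counts so the UI can render the facet list.
        "facets": ["category"],
    }


payload = faceted_hybrid_query("จักรยาน", "bicycles")
print(payload["filter"])
```

The filter is applied as a hard constraint, so a motorcycle listing that is semantically close to the query never reaches the ranked results once the shopper picks the facet.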

The demo: making the API legible

Alongside the pipeline I built a small search UI, so the behavior was something you could use rather than read about in a config file. It is deliberately thin, a single page over the Meilisearch search API, but it exercises the parts that matter:

Type-ahead, debounced, issuing hybrid queries as you type so suggestions reflect meaning and not just prefix matching.
Faceted filtering by category, the mechanism that backstops the over-generalization above.
Bilingual querying, where typing in Thai or English reaches the same documents because both the lexical and the semantic sides of the index are language-aware.

The point of the demo was never the UI. It was to make the shape of the search API concrete, so the people building the product could see what a query costs, what a result looks like, and where the tuning knobs actually live.

What the project was really about

It is tempting to file this under “added semantic search,” but that framing misses where the work was. Two decisions carried it. The first was refusing the reflex: for an early-stage product, the win was not the most powerful vector store but the engine you can stand up in an afternoon and stop thinking about, which turned the whole problem into a synchronization problem. The second was treating relevance as something you measure rather than something you feel, which turned a vague “make search better” into a specific, defensible boundary between what the ranker handles and what facets handle.

The unglamorous middle, a CDC bridge that follows the write-ahead log and a one-line-in-spirit upstream patch to make it deploy, is what holds the whole thing up. Semantic search, for a team chasing its first users, is less about the cleverness of the retrieval and more about the boring reliability of the plumbing underneath it and the honesty of the yardstick on top.

Peerapon Wechsuwanmanee
