Scoring Motorcycle Condition from Inspection Videos

A used motorcycle’s condition lives mostly in things a spreadsheet cannot hold: the way the engine sounds at idle, the colour of the exhaust smoke, an inspector’s scribbled note. The goal is a single grade an appraiser or a pricing system can act on. The obvious framing is regression: predict a 0-100 condition score. The inputs do not arrive as a feature matrix, though; they arrive as a video, a line of free text, and a short checklist.

So there are two problems, not one. The first is mechanical: turn three things of completely different shapes into one row a model can read. The second is subtler, and it is where most of the value turned out to be: decide what a “good” prediction even means. The regression framing has a trap in it, and stepping out of that trap is what made the result usable. We will build the feature pipeline first, then come back to the framing.

Figure: Multi-modal scoring pipeline. An inspection video, a Thai-language note, and a categorical checklist each collapse into a fixed-length vector, concatenate into one feature row, and pass through a single regression pipeline to produce a 0-100 condition score.

The whole feature side is that diagram. Three streams, each reduced to a fixed-length vector, concatenated, and fed to one model. The rest of this post is what happens inside each stream, why concatenation is the entire multi-modal trick, and then the reframing that decided whether any of it was trustworthy.

Three modalities, one rule

A model wants every row to have the same width. A 12-second clip and a 20-second clip cannot disagree about how many columns they occupy; neither can a one-word note and a paragraph. So there is exactly one rule the input layer has to obey: each modality must collapse to a fixed-length vector, no matter how much or how little raw data it started from. Get that, and concatenation is trivial. Miss it, and nothing downstream lines up.

Each stream earns its fixed length differently.

Audio: collapsing the time axis

The engine sound is recorded in the wild: a phone held near a running bike in a yard, not a microphone in a studio. The clip length varies and the background is noisy, so the extractor has to be robust to both.

The pipeline never feeds raw waveform to the model. It walks the audio down to a compact description of its texture:

Figure: Audio feature extraction. A video's audio track becomes a log-mel spectrogram, then MFCC plus delta and delta-delta coefficients form a 39-row matrix that still varies in length over time; averaging across the time axis collapses it to a fixed 39-number vector.

The video’s audio track is converted to a log-mel spectrogram (how energy is spread across frequency over time). From it the pipeline computes MFCCs, the standard compact summary of timbre, plus their first and second time-derivatives (delta and delta-delta), so that changes in the sound, such as a knock or a rattle, are captured and not just the average. Stacked, these form a matrix of 39 rows.

That matrix still has a time axis, which is the thing that varies between clips. The final move collapses it:

# 39 rows (13 MFCC + 13 delta + 13 delta-delta) x however many time frames.
all_features = np.vstack([mfccs, delta_mfccs, delta2_mfccs])
# Average across time -> one fixed 39-number vector, any clip length.
return np.mean(all_features, axis=1)

Averaging over time is a deliberate trade. It throws away when a sound happened and keeps how the engine sounds on average, and in exchange every clip, long or short, becomes the same 39 numbers. For a holistic condition grade that trade is the right one: the model needs the character of the engine, not a transcript of one particular rev.
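A quick way to convince yourself of the "any clip length" claim is to run the pooling step on matrices of different widths. The sketch below uses random numbers as stand-ins for real MFCC, delta and delta-delta outputs, and `pool_over_time` and `fake_clip` are hypothetical names, not the project's own:

```python
import numpy as np

N_MFCC = 13  # 13 MFCC + 13 delta + 13 delta-delta = 39 rows


def pool_over_time(mfccs, delta_mfccs, delta2_mfccs):
    """Stack the three 13-row blocks and average out the time axis."""
    all_features = np.vstack([mfccs, delta_mfccs, delta2_mfccs])
    return np.mean(all_features, axis=1)


rng = np.random.default_rng(0)


def fake_clip(n_frames):
    # Synthetic stand-in for one clip: three (13, n_frames) matrices.
    return [rng.normal(size=(N_MFCC, n_frames)) for _ in range(3)]


short_vec = pool_over_time(*fake_clip(n_frames=120))
long_vec = pool_over_time(*fake_clip(n_frames=900))

print(short_vec.shape, long_vec.shape)  # (39,) (39,)
```

The frame counts differ by a factor of more than seven, and both clips still land on the same 39 columns.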

Text: borrow a language model, do not build one

The inspector’s note is free Thai text. Thai is genuinely awkward to process: it is written without spaces between words, so the usual “split on whitespace” tokenisation simply does not work. Hand-rolling a word segmenter would be a project on its own.

So the pipeline does not. It hands the note to a pretrained Thai language model (WangchanBERTa) and takes the model’s internal representation as the feature vector. One pooling step turns the per-token outputs into a single fixed-length vector for the whole note:

outputs = text_model(**tokenizer(text, return_tensors="pt", truncation=True))
# Mean-pool the token embeddings -> one fixed-length vector per note.
embedding = outputs.last_hidden_state.mean(dim=1)

Two decisions matter here. First, use a model trained on the right language: a Thai-specific model already handles Thai script and segmentation, whereas a generic English model would see mush. Second, mean-pool to a fixed length, the same move as the audio stream, so a terse note and a wordy one produce vectors of identical width. A missing note is filled with an empty string rather than dropped, so the text block is always present.
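The pooling shown above handles one note at a time, so there is no padding to worry about. If notes were ever batched (an assumption here, not something the pipeline is described as doing), padded positions would need to be excluded from the mean. A minimal NumPy sketch with made-up shapes and a hypothetical `masked_mean_pool` helper:

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Mean over real tokens only.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against division by zero
    return summed / counts


rng = np.random.default_rng(0)
hidden = 8  # toy size; a real encoder's hidden size is much larger
emb = rng.normal(size=(2, 5, hidden))
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 0, 0, 0]])  # the second note has only 2 real tokens

pooled = masked_mean_pool(emb, mask)
print(pooled.shape)  # (2, 8)
```

The same idea carries over to tensors from a tokenizer's `attention_mask`; for a single unpadded note it reduces to the plain mean.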

Categorical: one-hot, with a safe default

The checklist is the easy stream: a handful of inspector judgements, engine sound category, exhaust colour, oil residue, fender alignment, each a small set of fixed choices. One-hot encoding turns them into columns directly.

The one production-minded touch is how unseen values are handled:

OneHotEncoder(handle_unknown="infrequent_if_exist", min_frequency=1)

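A category the encoder never saw in training does not crash the request. With handle_unknown="infrequent_if_exist", an unseen value is mapped to the infrequent bucket when one exists, and is otherwise encoded as all zeros for that feature. With min_frequency=1 as written, no training category is rare enough to form that bucket, so unseen values take the all-zeros route; a higher min_frequency would give them a bucket shared with rare labels. Either way the prediction degrades gracefully. It is the same fail-soft instinct as routing unknowns to a default elsewhere: a new label showing up next month should cost a little accuracy, not a 500.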

Concatenation is the whole trick

Here is the anticlimax that makes multi-modal learning approachable. Once every stream is a fixed-length vector, fusing them is one line:

features = np.hstack([categorical_values, audio_features, text_features])

That is “multi-modal” in its entirety: not a fancy fusion network, just three vectors laid end to end into one wide row, in a fixed order. The audio occupies 39 columns, the text its block, the checklist its one-hot columns, and the model sees a single flat feature space. It has no idea one chunk came from a microphone and another from a language model, and it does not need to.

The order has to be identical at training and serving time, because column 41 must mean the same thing in both. That ordering contract is the one piece of glue the whole approach rests on.
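One way to make that contract explicit is to declare the block order and widths once and have both training and serving build rows through the same function. This is a sketch under assumptions: only the audio width (39) comes from the text above, while the categorical and text widths, `FEATURE_BLOCKS`, and `build_row` are illustrative.

```python
import numpy as np

# Block order and widths. Only audio=39 is from the pipeline described above;
# the other two widths are placeholders.
FEATURE_BLOCKS = (("categorical", 12), ("audio", 39), ("text", 768))


def build_row(blocks):
    """Concatenate named blocks in the one agreed order, checking each width."""
    parts = []
    for name, width in FEATURE_BLOCKS:
        vec = np.asarray(blocks[name], dtype=float).ravel()
        if vec.shape[0] != width:
            raise ValueError(f"{name}: expected {width} values, got {vec.shape[0]}")
        parts.append(vec)
    return np.hstack(parts)


row = build_row({
    "text": np.zeros(768),        # dict order does not matter here...
    "audio": np.ones(39),
    "categorical": np.zeros(12),  # ...FEATURE_BLOCKS decides the layout
})
print(row.shape)  # (819,)
```

A width mismatch then fails loudly at the point of assembly instead of silently shifting every later column.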

Picking a model, and keeping it honest

With features settled, the model is a bake-off rather than a guess. Five regressors (ElasticNet, Random Forest, Gradient Boosting, SVR, and XGBoost) each get a randomised hyperparameter search with cross-validation, and the winner is serialised for serving.

The detail that keeps this from rotting in production is that the fitted preprocessors are part of the model. The one-hot encoder learned its categories and the scaler learned its means from the training data; those have to be saved and reused at prediction time, not re-derived. So everything (the encoder, the scaler, the regressor) lives inside a single scikit-learn Pipeline that is pickled as one unit, and the FastAPI layer at serve time runs the exact same audio and text extraction in the exact same column order before calling pipeline.predict. One code path for both training and serving is what prevents train/serve skew: the quiet way a model gets fed subtly different features in production than it learned on.
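A stripped-down illustration of "pickled as one unit", on synthetic numeric features. The step names and the choice of ElasticNet are assumptions for the sketch, and the categorical encoder is left out to keep it short:

```python
import pickle

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 6))  # synthetic features
y_train = X_train @ rng.normal(size=6) + rng.normal(size=200)

# Scaler and regressor are fitted together and travel together.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", ElasticNet(alpha=0.1)),
])
pipeline.fit(X_train, y_train)

blob = pickle.dumps(pipeline)  # in practice this would be written to a file
restored = pickle.loads(blob)  # what the serving process would load

X_new = rng.normal(loc=50.0, scale=10.0, size=(3, 6))
same = np.allclose(pipeline.predict(X_new), restored.predict(X_new))
print(same)
```

The restored object carries the scaler's training-time means with it, so serving never recomputes them from whatever data happens to arrive.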

That settles how to pick a winner, but not what to pick it on. And that is the real story.

The trap in the regression framing

A regressor outputs a number, so the instinct is to grade it with regression metrics: R² and RMSE. Both are fine for ranking candidates against each other, and both quietly mislead about whether the thing is usable.

The problem is false precision. A model that emits 43.7 is implying a confidence that does not exist; the human grader it learned from could not reliably tell 43 from 45 on the same bike. Optimising RMSE chases decimal places that are noise, and reporting an R² of 0.7 tells a business owner exactly nothing about whether they can trust the next assessment. The number looks scientific and answers the wrong question.

What the business actually asks is binary: for this bike, is the prediction close enough to act on, yes or no? That is not a regression question. It is a classification question wearing a regressor’s clothes.

Flip it: within the band is a pass

So the problem gets reframed. The model still produces a 0-100 number, but success is no longer measured in error. A prediction is scored as a simple pass/fail: if it lands within 5 points either side of the expert’s grade it is correct (true); anything further out is wrong (false).

Figure: Pass/fail scoring. Predicted condition scores plotted against an expert's grade on a 0-100 line, with a plus-or-minus 5 point band around the expert score. Predictions inside the band count as correct (true); predictions outside are wrong (false). The metric is the share that land inside.

def is_correct(y_true, y_pred, tol=5):
    # Within the band -> True (pass); outside -> False (fail).
    return np.abs(y_true - y_pred) <= tol

accuracy = np.mean(is_correct(y_true, y_pred)) * 100  # % of trustworthy calls

This one move changes everything about how the project reads. The headline metric is now the share of assessments that land within tolerance, a single percentage anyone can act on: “the model agrees with an expert nine times out of ten” is a sentence a business owner can make a decision with, in a way that “R² = 0.7” never was. It is also a more honest target, because it only rewards the model for being right to the precision that actually exists, and stops it chasing decimals that do not. And it is harder to flatter: every miss counts as a full miss, so a few bad calls cannot be diluted by many near-perfect ones the way they can in an averaged error. What the pass rate does not show is how far out a miss was, which is one reason to keep RMSE alongside it. The band, not the decimal, is the deliverable.
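A toy comparison makes the difference between the two views concrete. The numbers below are invented purely for illustration: one model is consistently just outside the band, the other is nearly always inside it but has one large miss.

```python
import numpy as np


def pass_rate(y_true, y_pred, tol=5):
    """Percentage of predictions within +/- tol of the expert grade."""
    return float(np.mean(np.abs(y_true - y_pred) <= tol) * 100)


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


y_true = np.full(10, 60.0)  # made-up expert grades

# Model A: always 6 points off, so never inside the band.
pred_a = y_true + 6.0
# Model B: 1 point off nine times, 30 points off once.
pred_b = y_true + np.array([1.0] * 9 + [30.0])

print("A:", rmse(y_true, pred_a), pass_rate(y_true, pred_a))  # RMSE 6.0, pass rate 0%
print("B:", round(rmse(y_true, pred_b), 2), pass_rate(y_true, pred_b))  # RMSE ~9.53, pass rate 90%
```

RMSE prefers A and the pass rate prefers B. Neither ranking is wrong; they answer different questions, which is why the choice of headline metric is a business decision and not a technicality.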

It even feeds back into model selection. That same pass/fail rate rides alongside R² and RMSE in cross-validation, so the bake-off optimises for the metric the business reads, not the one the textbook defaults to. The two can disagree, and when they do, the share of trustworthy assessments is the one that decides which model ships.
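In scikit-learn terms, one plausible way to wire this up is a custom scorer passed to the randomised search alongside the standard ones, with `refit` pointing at the pass rate. The sketch below runs on synthetic data with two of the five candidates and deliberately tiny search spaces; `within_tolerance_rate`, the scorer names, and the parameter grids are all assumptions, not the project's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV


def within_tolerance_rate(y_true, y_pred, tol=5):
    """Percentage of predictions within +/- tol points of the target."""
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return float(np.mean(errors <= tol) * 100)


scoring = {
    "r2": "r2",
    "rmse": "neg_root_mean_squared_error",  # scikit-learn negates error metrics
    "within_5": make_scorer(within_tolerance_rate, tol=5),
}

# Synthetic stand-in for the real feature rows and expert grades.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = np.clip(50 + 12 * X[:, 0] + 6 * X[:, 1] + rng.normal(scale=4, size=150), 0, 100)

candidates = {
    "elasticnet": (ElasticNet(max_iter=5000),
                   {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}),
    "random_forest": (RandomForestRegressor(n_estimators=30, random_state=0),
                      {"max_depth": [3, 6, None], "min_samples_leaf": [1, 3, 5]}),
}

searches = {}
for name, (model, space) in candidates.items():
    search = RandomizedSearchCV(model, space, n_iter=4, cv=5, scoring=scoring,
                                refit="within_5", random_state=0)
    searches[name] = search.fit(X, y)

# refit="within_5" makes best_score_ the cross-validated pass rate.
winner = max(searches, key=lambda name: searches[name].best_score_)
print(winner, searches[winner].best_params_, searches[winner].best_score_)
```

R² and RMSE are still recorded in `cv_results_` for every candidate, so they remain available for diagnosis even though they no longer pick the winner.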

The takeaway

Two lessons, one mechanical and one conceptual. The mechanical one: multi-modal scoring needs no bespoke fusion architecture. Reduce every modality to a fixed-length vector (the audio by averaging over time, the text by borrowing a pretrained language model, the checklist by one-hot encoding), and fusion is a concatenation feeding an ordinary model.

The conceptual one matters more. A problem handed to you as regression is not obligated to stay regression. The 0-100 score looked like the deliverable, but the thing the business could actually use was a binary verdict: is this call trustworthy or not. Drawing a tolerance band and grading pass/fail turned an uninterpretable error metric into a number a non-specialist can read and act on, made the target honest about the precision that really exists, and cost nothing but a change of perspective. The model was never the hard part; deciding what counts as right was.