A Full-Lifecycle MLOps System for Credit Default, Part 1: From Messy Data to a Self-Updating Model
A credit-default model is easy to demo and hard to operate. The notebook that hits a respectable AUC is the first 10% of the work; the other 90% is everything that keeps that number honest as new repayment data keeps arriving and nobody wants to babysit a retrain.
This is Part 1 of a two-part write-up on building a full-lifecycle MLOps system for predicting default on a vehicle-backed consumer-loan portfolio. It covers the machinery: how raw data becomes features, how dozens of model variants are tracked and compared, how the best one is promoted into production without downtime, and how the whole loop retrains itself. Part 2 covers the part that actually makes it pay: turning a probability into a profitable approve/reject decision.
At its core the model is a straight line from messy inputs to a single number: a default probability. Everything in Part 1 is about making each of those stages reproducible; everything in Part 2 is about what you do with that number.
The shape of the system
The design follows one principle: a model in production is a perishable good. The data distribution drifts, new repayment outcomes accumulate, and a model that was best last quarter may be mediocre now. So the system is built not as “train once, deploy” but as a loop that re-competes models against fresh data and swaps in the winner on its own.
Three pieces cooperate to make that loop safe:
- MLflow is the memory. Every model variant the system ever trains is logged with its settings and its score, so “which model is best, and why” is a query, not a guess.
- Prefect is the conductor. Cleaning, training, evaluation, and promotion are individual steps composed into one workflow that runs on a schedule.
- FastAPI is the front door. It always serves whatever model currently occupies the production slot, with no knowledge of how that model got there.
Everything downstream of this article hangs off those three roles. The rest of Part 1 is about what happens inside each one.
Messy data is the real problem
Real credit data does not arrive as a tidy feature matrix. It arrives as a spreadsheet export: free-text categories typed by humans, dates in two or three different formats, fields with dozens of near-duplicate labels, and missing values scattered everywhere. Before any modelling, this has to become something a learning algorithm can consume, and how you do that cleaning quietly determines how good the model can ever be. Three choices did most of the work.
Collapse high-cardinality categories on purpose. Some fields carried two dozen distinct labels, many of them rare or saying the same thing in different words. Feed those in raw and the model wastes capacity memorising noise. Folding them down to a handful of meaningful groups (and routing anything unrecognised to a single Unknown bucket) gives the model fewer, stronger signals and a safe default for values it has never seen. The unrecognised-goes-to-Unknown rule matters in production: a new category showing up next month degrades the prediction gracefully instead of crashing the request.
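As a rough illustration of the collapse-and-default rule, here is a minimal sketch. The field, the raw labels, and the group names are all hypothetical; only the Unknown fallback comes from the text above.

```python
# Hypothetical mapping: the raw labels and group names are illustrative only.
EMPLOYMENT_GROUPS = {
    "salaried": "Salaried",
    "salary": "Salaried",
    "full time": "Salaried",
    "self employed": "Self-employed",
    "self-employed": "Self-employed",
    "business owner": "Self-employed",
}


def collapse_category(raw, groups, default="Unknown"):
    """Map a free-text label to a coarse group; unrecognised values get `default`."""
    if raw is None:
        return default
    # Normalise case and surrounding whitespace before the lookup.
    key = str(raw).strip().lower()
    return groups.get(key, default)


labels = ["Salaried", " salary ", "Self-Employed", "gig worker", None]
collapsed = [collapse_category(x, EMPLOYMENT_GROUPS) for x in labels]
print(collapsed)
```

Because the lookup falls back to the default instead of raising, a label first seen at prediction time lands in the Unknown bucket like any other unrecognised value. On a pandas column the same function could be applied with `df[col].map(...)`.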
Treat “missing” as information, not absence. In credit data, the fact that a value is missing is often predictive in itself: a blank field can correlate with riskier applicants. So before filling any gap, the pipeline records that it was missing as its own feature, then fills the gap separately. The model gets both facts and can decide for itself whether absence matters:
# Remember the gap before you fill it.
df[f"{col}_filled"] = df[col].isnull().astype(int)
Let the data choose the fill, not a hunch. Numeric gaps are filled by looking at the most similar applicants (a KNN imputer) rather than slapping in a column average. How many neighbours to consult is itself decided empirically: the pipeline tries a few values and keeps whichever reconstructs held-out values most accurately. The recurring theme: wherever there’s a knob, the system tunes it from data instead of trusting a default.
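One way to choose the neighbour count empirically is to hide some values that are actually known, impute them, and measure how well each setting recovers them. The sketch below assumes scikit-learn's `KNNImputer`; the function name, the candidate values, the masking fraction, and the synthetic data are illustrative assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.impute import KNNImputer


def choose_n_neighbors(X, candidates=(3, 5, 10), mask_frac=0.1, seed=0):
    """Pick the k whose KNN imputation best reconstructs deliberately hidden values."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    known = ~np.isnan(X)
    # Hide a random fraction of the cells whose true value we know.
    hide = known & (rng.random(X.shape) < mask_frac)
    X_masked = X.copy()
    X_masked[hide] = np.nan

    errors = {}
    for k in candidates:
        X_hat = KNNImputer(n_neighbors=k).fit_transform(X_masked)
        # RMSE on the hidden cells only.
        errors[k] = float(np.sqrt(np.mean((X_hat[hide] - X[hide]) ** 2)))
    best_k = min(errors, key=errors.get)
    return best_k, errors


# Synthetic, correlated columns with some genuinely missing cells.
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])
X[rng.random(X.shape) < 0.05] = np.nan

best_k, errors = choose_n_neighbors(X)
print(best_k, errors)
```

KNN distances depend on feature scale, so in practice the numeric columns would usually be put on comparable scales before a comparison like this.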
Turning preprocessing choices into experiments
Here is the idea that ties the project together. There is no single “right” way to encode categories or fill gaps; mean-encoding might beat one-hot for one model and lose for another. Rather than argue about it, the system treats every such choice as a dial to be tested, not a decision to be made up front.
Concretely, the transform step is parameterised: you ask it for a specific combination of encoder, scaler, and imputer, and it builds features that way. That single lever lets the training flow sweep a grid of three encoders × two imputers × six model families and fit every combination. Each one is a candidate; the system trains them all and lets the scoreboard decide.
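The grid itself can be sketched as a plain Cartesian product. The option names below are hypothetical placeholders (the article fixes only the counts, plus mean and one-hot encoding and the KNN imputer); each yielded dict is what a parameterised transform-and-train step would receive.

```python
from itertools import product

# Hypothetical option names; only the 3 x 2 x 6 shape comes from the article.
ENCODERS = ["one_hot", "mean", "ordinal"]
IMPUTERS = ["knn", "median"]
MODELS = [
    "logreg", "random_forest", "gradient_boosting",
    "xgboost", "lightgbm", "catboost",
]


def candidate_grid():
    """Yield one settings dict per (encoder, imputer, model) combination."""
    for encoder, imputer, model in product(ENCODERS, IMPUTERS, MODELS):
        yield {"encoder": encoder, "imputer": imputer, "model": model}


candidates = list(candidate_grid())
print(len(candidates))  # 3 * 2 * 6 = 36
```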
That scoreboard is MLflow. Each candidate is logged as a run carrying the exact settings that produced it and the AUC it earned on held-out data. With everything recorded, picking the best model becomes a one-line query against the ledger rather than a note in someone’s notebook:
best_run = mlflow.search_runs(order_by=["metrics.auc_test DESC"]).iloc[0]
One non-obvious rule makes this trustworthy: the fitted preprocessors are part of the model. An encoder that learned its mapping from the training data, or a scaler that learned its means, must be saved with the model and reused at prediction time. Re-deriving them later, or worse, fitting them on different data, is how a model that looked great in evaluation quietly rots in production. So each logged run bundles the model and its exact preprocessors together as one unit.
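A common way to keep fitted preprocessors and the model together is a scikit-learn `Pipeline`, sketched here on synthetic data. The scaler, the model, and the data are stand-ins rather than the project's actual components, and the pickle round-trip stands in for logging a run and loading it back.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_train = (X_train[:, 0] > 50.0).astype(int)

# The scaler learns its means from the training data and travels with the model.
bundle = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
bundle.fit(X_train, y_train)

# Serialise and restore the whole unit, preprocessors included.
restored = pickle.loads(pickle.dumps(bundle))

X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 3))
same = np.allclose(bundle.predict_proba(X_new), restored.predict_proba(X_new))
print(same)
```

If the candidates are scikit-learn estimators, logging such a pipeline with `mlflow.sklearn.log_model` is one way to store it as a single artifact on the run.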
Blue-green promotion: never break the live model
Picking the best run automatically is risky if “best” is measured against the data it trained on. A challenger always looks good on its own homework. So the promotion rule is stricter: the reigning model is re-evaluated on the same fresh data as every challenger, thrown into the same scoreboard, and only displaced if something genuinely beats it. A model that’s still the best simply stays.
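The promotion rule can be written down in a few lines. This is a sketch under assumptions: the function name is made up, scores are AUCs computed on the same fresh hold-out set, and the optional `min_gain` margin is an addition here for illustrating "genuinely beats it", not something the article specifies.

```python
def pick_winner(incumbent, challengers, min_gain=0.0):
    """Return the name of the model that should hold the production slot.

    `incumbent` is a (name, auc) pair and `challengers` a list of such pairs,
    all scored on the same fresh data. The incumbent keeps the slot unless a
    challenger beats its score by more than `min_gain`.
    """
    best_name, incumbent_auc = incumbent
    bar = incumbent_auc + min_gain  # what a challenger has to clear
    for name, auc in challengers:
        if auc > bar:
            # From here on, later challengers must beat this one.
            best_name, bar = name, auc
    return best_name


print(pick_winner(("champion", 0.80), [("a", 0.79), ("b", 0.80)]))
print(pick_winner(("champion", 0.80), [("a", 0.82), ("b", 0.85), ("c", 0.83)]))
```

The first call keeps the champion, since a tie is not a win; the second promotes `b`.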
Swapping the winner in borrows the blue-green idea from web deployment. There is one production slot, a latest/ directory the serving layer always reads. Promotion never edits that slot in place. Instead it archives the current occupant to a dated folder, then writes the new champion into the slot.
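A minimal sketch of that archive-then-promote step, using only the standard library. The `archive/` folder name, the timestamp format, and the `latest.staged` staging directory are assumptions added for illustration; staging means the new bundle is fully written before it is renamed into the slot. The two renames are quick but not one atomic step, so a setup that cannot tolerate even that window might swap a symlink instead.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def promote(new_bundle_dir, root):
    """Archive the current `latest/` and move a fully written bundle into its place."""
    root = Path(root)
    latest = root / "latest"
    archive = root / "archive"
    archive.mkdir(parents=True, exist_ok=True)

    # Stage the challenger beside the slot so the final step is a rename,
    # not a file-by-file copy into the live directory.
    staged = root / "latest.staged"
    if staged.exists():
        shutil.rmtree(staged)
    shutil.copytree(new_bundle_dir, staged)

    archived_to = None
    if latest.exists():
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        archived_to = archive / stamp
        latest.rename(archived_to)  # former champion is kept, dated
    staged.rename(latest)           # new champion takes the slot
    return archived_to


# Demo with two throwaway "bundles".
tmp = Path(tempfile.mkdtemp())
for name in ("v1", "v2"):
    (tmp / name).mkdir()
    (tmp / name / "model.txt").write_text(name)

first_archive = promote(tmp / "v1", tmp / "models")
second_archive = promote(tmp / "v2", tmp / "models")
print((tmp / "models" / "latest" / "model.txt").read_text())
```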
This buys two things for free. There is never a moment when the production slot is empty or half-written, so the API never serves a broken model. And because every former champion is kept and dated, rolling back a bad promotion is a single directory move, not a retraining job.
Closing the loop with orchestration
None of this should require a human running scripts in order. Prefect wires the steps (load data, re-score the incumbent, train the challenger matrix, pick the winner, promote) into one workflow, and runs it on a schedule. The cadence is set in Prefect’s UI, so it can be matched to how fast new repayment outcomes actually arrive without anyone touching code.
The payoff is a system that maintains itself: as labelled outcomes accumulate, it quietly re-runs, re-competes every model variant against the newest data, and replaces the production model only when the evidence says to. No manual retrain, no risky hand-deploy.
Serving without surprises
The inference service has the easiest job and the strictest constraint. It loads the bundle from the production slot (model plus its exact preprocessors) and runs each incoming request through the same transform code used in training. That sharing is the whole point: the most common way prediction quality silently drifts is when serving cleans data even slightly differently than training did (train/serve skew). One code path for both makes that class of bug impossible by construction.
The response is deliberately minimal: a probability of default, and a label derived from it:
# Row 0 is the single request; column 1 is the probability of the positive class (default).
proba = model.predict_proba(features)[0, 1]
return {"label": int(proba > threshold), "probability": float(proba)}
Packaged in a container, the service boots, reads the production slot, and serves real-time predictions. Because promotion only ever swaps that slot, picking up a freshly retrained champion needs no redeploy; the next request simply loads the new model.
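The article does not spell out how the service notices a new champion. One possible mechanism, sketched below, is a small loader that caches the bundle and reloads it whenever the file in the slot changes. The class name, the `bundle.pkl` file name, and the use of pickle are all assumptions for illustration.

```python
import os
import pickle
import tempfile
from pathlib import Path


class SlotLoader:
    """Cache the bundle from the production slot; reload when the file changes."""

    def __init__(self, bundle_path):
        self.bundle_path = Path(bundle_path)
        self._stamp = None
        self._bundle = None

    def get(self):
        stat = self.bundle_path.stat()
        # A swapped-in file differs in inode, modification time, or size.
        stamp = (stat.st_ino, stat.st_mtime_ns, stat.st_size)
        if stamp != self._stamp:
            with self.bundle_path.open("rb") as fh:
                self._bundle = pickle.load(fh)
            self._stamp = stamp
        return self._bundle


# Demo: a dict stands in for the model bundle.
slot = Path(tempfile.mkdtemp()) / "bundle.pkl"
slot.write_bytes(pickle.dumps({"version": 1}))
loader = SlotLoader(slot)
first = loader.get()["version"]

# Simulate a promotion: write the new bundle elsewhere, then swap it in.
staged = slot.with_suffix(".staged")
staged.write_bytes(pickle.dumps({"version": 2}))
os.replace(staged, slot)
second = loader.get()["version"]
print(first, second)
```

A request handler would call `loader.get()` on each request, which costs one `stat` call unless the slot has actually changed.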
Where Part 1 ends
The system now produces, for every applicant, a calibrated probability of default, and keeps that model current on its own. But look again at that last snippet: a single threshold decides where a probability becomes a decision, approve or reject. Treating that as a default 0.5, or picking it by intuition, quietly leaves money on the table in both directions: rejecting good customers costs lost profit, approving bad ones costs the whole loan.
Part 2 is about that one number: why accuracy is the wrong thing for a lender to optimise, how to choose the threshold by simulating profit-and-loss instead, and what that change was worth on out-of-sample data.
Continue to Part 2: Why Accuracy Is the Wrong Goal for a Lender.