A Full-Lifecycle MLOps System for Credit Default, Part 2: Why Accuracy Is the Wrong Goal for a Lender

Part 1 built a system that keeps a credit-default model current on its own: messy data in, a calibrated probability of default out, retrained and promoted without a human in the loop. If you want the technical machinery behind that probability (the data cleaning, the MLflow experiment matrix, blue-green promotion, and the self-retraining loop), it is all in Part 1; this post assumes only its output. Part 1 ended on a single unresolved line of code, the one that turns that probability into an actual decision:

return {"label": int(proba > threshold), "probability": float(proba)}

Everything in Part 1 was in service of proba. This post is about threshold: the one number that decides where a probability becomes an approve or a reject. Leave it at the default 0.5, or pick it to maximise accuracy, and you quietly leave a large amount of money on the table. Choosing it by simulating profit-and-loss instead was worth a 30% profit increase on out-of-sample data versus a baseline cut-off policy.

Accuracy is the wrong scoreboard

The instinct is to tune a classifier for accuracy, or for AUC, and call the best one done. For a lender that instinct is actively harmful, because it assumes the two ways of being wrong cost the same. They do not.

A lending decision has two failure modes:

Approve a borrower who defaults. The loan goes out and does not come back. You lose the principal: the entire amount lent.
Reject a borrower who would have repaid. No loan, no harm done to the balance sheet, but you forgo the margin you would have earned: the interest across the installments, which is a fraction of the principal.

Figure: Asymmetric cost of the two errors. Approving a defaulter loses the full loan principal, while rejecting a good applicant forgoes only the smaller interest margin.

These are not the same size. Losing a whole principal dwarfs missing one customer’s margin. Accuracy is blind to that: it counts both errors as a single tick in the “wrong” column. A model tuned to be right most often will happily trade one catastrophic false approval for several cheap false rejections, because the scoreboard it was handed cannot see the difference. The fix is to hand it a scoreboard denominated in money.
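To make the blind spot concrete, here is a small sketch with made-up numbers. The 1,000 principal and 200 margin are assumptions for illustration, not figures from the project. Two policies get the same number of applications wrong, so accuracy cannot tell them apart, yet their mistakes cost very different amounts.

```python
# Illustrative numbers only; not taken from the project data.
PRINCIPAL = 1_000   # lost when an approved borrower defaults
MARGIN = 200        # forgone when a good borrower is rejected

def accuracy(n_correct, n_total):
    return n_correct / n_total

def error_cost(false_approvals, false_rejections):
    """Money lost or forgone through the two kinds of mistake."""
    return false_approvals * PRINCIPAL + false_rejections * MARGIN

# Both policies decide 100 applications and get 10 of them wrong.
policy_a = {"false_approvals": 8, "false_rejections": 2}
policy_b = {"false_approvals": 2, "false_rejections": 8}

for name, errors in (("A", policy_a), ("B", policy_b)):
    n_wrong = sum(errors.values())
    print(name, accuracy(100 - n_wrong, 100), error_cost(**errors))
```

Both policies score 0.9 on accuracy, but policy A's errors cost 8,400 against policy B's 3,600.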

Put a price on every outcome

Instead of counting right and wrong, assign every prediction its real financial consequence. With the probability of default in hand, each applicant falls into one of four cells, and each cell has a payoff:

profit =  installments_total - loan_amount   # margin earned on a good loan
loss   =  loan_amount                         # principal lost on a bad loan

choices = [
     profit,   # repays,    approved  ->  earn the margin        (win)
    -profit,   # repays,    rejected  ->  forgo the margin       (opportunity cost)
    -loss,     # defaults,  approved  ->  lose the principal     (the expensive error)
     0,        # defaults,  rejected  ->  avoided the loan        (no gain, no loss)
]

This 2x2 is the entire argument made concrete. The middle two rows are the errors, and they are wildly unequal: a wrong approval costs a full loss, a wrong rejection costs a much smaller profit. Summed across every applicant in a held-out set, this gives a single figure that accuracy never could: the total money a given threshold would have made or lost on real outcomes.
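One way the four-cell table could be turned into a per-applicant P&L is with np.select, which pairs each payoff with a condition. This is a minimal sketch, not the project's actual function: the column names (proba, defaulted, loan_amount, installments_total) are assumptions, and the input is any mapping of column name to array, such as a pandas DataFrame.

```python
import numpy as np

def calculate_pnl(scored_df, threshold):
    """Per-applicant P&L under one threshold (hypothetical column names)."""
    proba = np.asarray(scored_df["proba"], dtype=float)
    defaulted = np.asarray(scored_df["defaulted"]).astype(bool)
    loan_amount = np.asarray(scored_df["loan_amount"], dtype=float)
    installments_total = np.asarray(scored_df["installments_total"], dtype=float)

    rejected = proba > threshold      # label 1 = predicted default = reject
    approved = ~rejected

    profit = installments_total - loan_amount   # margin earned on a good loan
    loss = loan_amount                          # principal lost on a bad loan

    conditions = [
        ~defaulted & approved,    # repays,   approved
        ~defaulted & rejected,    # repays,   rejected
        defaulted & approved,     # defaults, approved
        defaulted & rejected,     # defaults, rejected
    ]
    choices = [profit, -profit, -loss, np.zeros_like(loss)]
    return np.select(conditions, choices)

# Four toy applicants, each borrowing 1,000 and owing 1,200 in installments.
scored = {
    "proba": [0.05, 0.10, 0.60, 0.80],
    "defaulted": [0, 1, 0, 1],
    "loan_amount": [1000, 1000, 1000, 1000],
    "installments_total": [1200, 1200, 1200, 1200],
}
pnl = calculate_pnl(scored, threshold=0.5)
print(pnl.tolist(), pnl.sum())
```

At a threshold of 0.5 the four toy applicants score 200, -1000, -200, and 0, for a total of -1000: one missed defaulter wipes out the margin from several good loans.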

Note what is and is not the model’s job here. The model produces the probability; the business supplies the payoffs. Keeping those two concerns separate is what lets the same model serve different risk appetites just by re-pricing the cells.

The profit curve has a peak

Once a threshold has a dollar value, finding the best one stops being a matter of taste and becomes a search. Sweep the threshold across its whole range, score the held-out portfolio at each step, and keep the one that books the most money:

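import numpy as np

# calculate_pnl returns per-applicant P&L at threshold t; sum it for the portfolio total
thresholds = np.arange(0, 1, 0.01)
pnl = [calculate_pnl(scored_df, t).sum() for t in thresholds]
best_threshold = thresholds[np.argmax(pnl)]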

Plot that sweep and the shape tells the whole story. Total profit is neither flat nor monotonic, and both ends of the range lose money. Set the threshold near zero and the policy approves almost no one: it books no defaults, but it forgoes the margin on every good customer it turns away, so it runs at a loss. Raise the threshold and it admits the safest applicants first, so profit climbs. It peaks where the next applicant let in is expected to cost as much as they earn: their chance of defaulting, weighted by a full principal, just cancels their chance of repaying, weighted by the margin gained by approving rather than rejecting them. Push the threshold higher still and the policy waves through ever riskier borrowers; defaults start to dominate, and profit falls away toward the larger loss of approving everyone.
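The shape described above can be reproduced on a toy portfolio. Everything in this sketch is made up for illustration: five score bands with roughly calibrated default counts, and a flat 1,000 principal and 200 margin on every loan. It is not the project's data, and the real curve will be smoother because real scores are continuous.

```python
import numpy as np

PRINCIPAL, MARGIN = 1_000.0, 200.0   # assumed, identical for every loan

# (score, applicants, of whom defaulted): a made-up, roughly calibrated portfolio
bands = [(0.025, 1000, 25), (0.105, 1000, 105), (0.305, 200, 61),
         (0.605, 200, 121), (0.905, 200, 181)]

proba = np.concatenate([np.full(n, p) for p, n, _ in bands])
# Mark the first d applicants in each band as defaulters.
defaulted = np.concatenate([np.arange(n) < d for _, n, d in bands])

def total_pnl(threshold):
    approved = proba <= threshold             # mirror of label = int(proba > threshold)
    per_loan = np.where(
        defaulted,
        np.where(approved, -PRINCIPAL, 0.0),  # defaults: lose principal, or avoid the loan
        np.where(approved, MARGIN, -MARGIN),  # repays: earn margin, or forgo it
    )
    return per_loan.sum()

thresholds = np.arange(0, 1, 0.01)
pnl = np.array([total_pnl(t) for t in thresholds])
best = thresholds[np.argmax(pnl)]

print(f"reject everyone  (t=0.00): {pnl[0]:.0f}")
print(f"peak             (t={best:.2f}): {pnl.max():.0f}")
print(f"approve everyone (t=0.99): {pnl[-1]:.0f}")
```

On this toy portfolio, rejecting everyone totals -421,400, approving everyone totals -71,600, and the peak of 196,600 first appears at a threshold of 0.11. Because the scores come in discrete bands, the peak is a plateau rather than a point, and np.argmax reports its left edge.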

Figure: Total profit plotted against the decision threshold, running from reject-everyone on the left to approve-everyone on the right. A single-peaked curve: both extremes sit below break-even, and profit climbs to a peak in between that lands away from 0.5.

The peak of that curve is the answer, and it does not sit at 0.5. Because the cost of a bad approval so outweighs the cost of a good rejection, the profit-maximising cut-off lands somewhere the accuracy-maximising one never would. Optimising the two objectives gives two different thresholds, and only one of them is denominated in the currency the business actually cares about.
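For a single loan, the location of the peak can be worked out by hand from the payoff table. The sketch below assumes a calibrated probability of default p and uses hypothetical margin and principal figures. It compares the expected P&L of approving with that of rejecting and solves for the p at which they are equal. A real portfolio mixes loan sizes and margins, which is why the post finds the cut-off by sweeping rather than by formula.

```python
def expected_pnl(p, margin, principal):
    """Expected P&L of approving vs rejecting one applicant with default probability p."""
    approve = (1 - p) * margin - p * principal   # earn the margin, or lose the principal
    reject = -(1 - p) * margin                   # forgo the margin, or avoid the loan (0)
    return approve, reject

def break_even_probability(margin, principal):
    """Solve (1 - p) * margin - p * principal == -(1 - p) * margin for p."""
    return 2 * margin / (2 * margin + principal)

p_star = break_even_probability(margin=200, principal=1_000)
print(round(p_star, 4))                          # 0.2857
print(break_even_probability(300, 1_000))        # 0.375
```

With a 200 margin on a 1,000 principal, approving beats rejecting only below a default probability of about 0.29, well short of 0.5. Re-pricing the cells moves the cut-off without touching the model: a 300 margin raises it to 0.375.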

What the change was worth

Swapping an intuition-picked cut-off for the peak of the profit curve was worth a 30% profit increase on out-of-sample data. Crucially, that figure is measured on data the threshold was not chosen on, the same discipline Part 1 applied to model selection: a number that only looks good on its own training data is a number that is lying to you. The lift held up on outcomes the search had never seen.
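The out-of-sample discipline described above amounts to choosing the threshold on one slice of data and reporting profit on another. The following is a sketch under stated assumptions: synthetic calibrated scores drawn from a beta distribution, flat loan economics, and a simple 50/50 split. None of it reproduces the project's data or its 30% figure.

```python
import numpy as np

PRINCIPAL, MARGIN = 1_000.0, 200.0    # assumed flat loan economics

def total_pnl(proba, defaulted, threshold):
    approved = proba <= threshold     # mirror of label = int(proba > threshold)
    per_loan = np.where(defaulted,
                        np.where(approved, -PRINCIPAL, 0.0),
                        np.where(approved, MARGIN, -MARGIN))
    return per_loan.sum()

def pick_threshold(proba, defaulted, grid):
    """Return the grid value with the highest total P&L on the given data."""
    return grid[np.argmax([total_pnl(proba, defaulted, t) for t in grid])]

# Synthetic, calibrated scores standing in for a scored hold-out set.
rng = np.random.default_rng(0)
proba = rng.beta(1, 6, size=4_000)
defaulted = rng.random(4_000) < proba

# Split once: choose the threshold on one half, report on the other.
select, report = slice(0, 2_000), slice(2_000, None)
grid = np.arange(0, 1, 0.01)
chosen = pick_threshold(proba[select], defaulted[select], grid)
baseline = grid[50]                   # the default 0.5 cut-off

in_sample = total_pnl(proba[select], defaulted[select], chosen)
out_of_sample = total_pnl(proba[report], defaulted[report], chosen)
baseline_oos = total_pnl(proba[report], defaulted[report], baseline)
print(chosen, in_sample, out_of_sample, baseline_oos)
```

Only the comparison between out_of_sample and baseline_oos is an honest estimate of the lift. The in-sample figure is optimistic by construction, because the threshold was picked to maximise it.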

Nothing about the underlying model changed to earn that. Same probabilities, same features, same AUC. The entire gain came from interpreting one number correctly: reading the probability through a cost model instead of a coin flip.

The threshold is perishable too

Part 1 argued that a model in production is a perishable good, re-competed against fresh data and swapped out when something beats it. The threshold is no different. The profit-maximising cut-off depends on the current mix of loan sizes, margins, and default rates, and all of those drift. A threshold chosen a year ago is as stale as a model trained a year ago.

So the threshold search is not a one-off analysis done in a notebook. It runs as a step inside the same orchestrated retraining loop: when a new champion is chosen, its profit-optimal threshold is computed on the same fresh data and saved with the model bundle. The serving layer from Part 1 reads it back at request time, no different from reading the model’s weights:

return {"label": int(proba > threshold), "probability": float(proba)}

The line that opened this post is the line that closes it, only now threshold is not a guess. It is the peak of a profit curve, re-derived every time the model is, travelling alongside the weights so the live decision always reflects both the latest data and the actual economics of a loan.
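The save-and-read-back step could look roughly like the sketch below. The file name decision.json, the helper names, and the 0.27 value are all hypothetical; the post only says the threshold is saved with the model bundle, and Part 1's actual storage mechanism may differ.

```python
import json
import tempfile
from pathlib import Path

def save_threshold(bundle_dir, threshold):
    """Write the threshold next to the model artifacts (hypothetical file layout)."""
    path = Path(bundle_dir) / "decision.json"
    path.write_text(json.dumps({"threshold": float(threshold)}))
    return path

def load_threshold(bundle_dir):
    """Read the threshold back, as the serving layer would at start-up or request time."""
    return json.loads((Path(bundle_dir) / "decision.json").read_text())["threshold"]

def predict(proba, threshold):
    # Same decision line as the serving layer: label 1 means predicted default.
    return {"label": int(proba > threshold), "probability": float(proba)}

with tempfile.TemporaryDirectory() as bundle_dir:
    save_threshold(bundle_dir, threshold=0.27)   # written by the retraining loop
    threshold = load_threshold(bundle_dir)       # read by the serving layer
    print(predict(0.31, threshold))   # {'label': 1, 'probability': 0.31}
    print(predict(0.12, threshold))   # {'label': 0, 'probability': 0.12}
```

Keeping the threshold in the same bundle as the weights means a promotion or rollback moves both together, so a model is never served with a cut-off derived for a different one.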

The takeaway

A classifier gives you a probability. A business needs a decision, and the bridge between them is a single threshold that almost every tutorial leaves at 0.5. For anything where the two kinds of error cost different amounts, and in lending they can differ by an order of magnitude, that default is the most expensive line in the pipeline. Price the outcomes, sweep the threshold, take the peak, and keep it fresh. The model was never the hard part.

Peerapon Wechsuwanmanee
