Credit Risk Assessment When the Data Was Never Built for It
Picture a big-ticket purchase, the kind too expensive to pay for at once, bought on an instalment plan. Because the sum is large and the buyer is a stranger, approval is not instant. There is a manual process: paperwork, a check of the applicant’s circumstances, and a field officer who visits the applicant’s home to size them up in person before anything is signed. The question the whole process exists to answer is simple to state and expensive to get wrong: if we approve this person, will they pay us back?
That is the job for a model: score an application before it is approved, so the
human process has a second opinion. But every credit-scoring tutorial starts from
a place this problem never gives you: a clean table, one row per applicant, a
tidy set of feature columns, and a default column at the end. Real operational
data has none of that. It is a database built to run a lending business, not to
train a model. The applicant’s signal is smeared across half a dozen tables, and
the one column the tutorial takes for granted, the label, does not exist
anywhere. Nobody recorded “this person defaulted” as a field. You have to
manufacture it.
So before any model, there are two unglamorous problems that decide the entire ceiling of the project. First, reconstruct a feature matrix from an operational schema that was never meant to produce one. Second, invent an honest target label out of repayment history. And running underneath both is a single discipline that turns out to be the technical spine of the whole thing: a hard line between what is knowable when the decision is made and what only the future reveals. The classifier at the end is the easy part.
The data is not a table
The instinct is to ask “which table holds the training data.” There is no such table. An applicant’s signal is distributed across the operational schema by the logic of the business, not the logic of modelling:
- the contract record (price, down payment, instalment count, payment size),
- the customer record (income, occupation, marital status, dependents),
- the referral records (how this person came to the business, and there can be several per customer),
- the home-visit record an officer fills in before approval,
- the selling item record (its list price, and the cost behind it).
Reconstructing one row per application is a JOIN you discover, not one you are handed. And which tables actually carry predictive signal is not knowable in advance. The fields a stakeholder is sure will matter (headline income, say) do not always dominate, and the only way to find out is to pull every table in, turn each into usable columns, and let the model weigh them. That second step, turning raw operational fields into features, is the next job.
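A minimal sketch of that reconstruction, in plain Python dictionaries so it runs without dependencies. Every table layout, key name (`contract_id`, `customer_id`, `item_id`) and value below is invented for illustration; the real schema will differ. The shape is what matters: one-to-one tables attach directly, while the one-to-many referral table has to be collapsed per customer first, or the join multiplies rows.

```python
from collections import defaultdict

# Hypothetical miniature tables; every key and value is made up.
contracts = [
    {"contract_id": "C1", "customer_id": "U1", "item_id": "I1", "instalments": 24},
    {"contract_id": "C2", "customer_id": "U2", "item_id": "I2", "instalments": 12},
]
customers = {"U1": {"income": 900}, "U2": {"income": 0}}
items = {"I1": {"list_price": 1200}, "I2": {"list_price": 400}}
visits = {"C1": {"visit_note": "rented house 18 months"}}  # C2 has no visit row
referrals = [  # several rows per customer are possible
    {"customer_id": "U1", "channel": 3},
    {"customer_id": "U1", "channel": 7},
    {"customer_id": "U2", "channel": 12},
]


def build_rows(contracts, customers, items, visits, referrals):
    """Return exactly one flat dict per contract (a left join onto each table)."""
    # Collapse the one-to-many referral table to one entry per customer
    # first; joining it raw would duplicate contracts.
    channels = defaultdict(set)
    for ref in referrals:
        channels[ref["customer_id"]].add(ref["channel"])

    rows = []
    for contract in contracts:
        row = dict(contract)
        row.update(customers.get(contract["customer_id"], {}))
        row.update(items.get(contract["item_id"], {}))
        row.update(visits.get(contract["contract_id"], {}))
        row["referral_channels"] = sorted(channels[contract["customer_id"]])
        rows.append(row)
    return rows


rows = build_rows(contracts, customers, items, visits, referrals)
```

Checking that the output has as many rows as there are contracts is a cheap guard worth keeping in the pipeline: a join that silently fans out is one of the easier ways to corrupt a training set.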
The features are the easy half
Turning these tables into a feature matrix is mostly mechanical, and it is worth being honest about that: almost none of it is clever. Each operational field gets the ordinary treatment. The free-text home-visit note an officer writes before approval (the kind of dwelling, whether it is owned or rented, how long the applicant has lived there) gets parsed into a residence type and a numeric tenancy length:
def residence_type(note: str) -> str:
    # Bucket the free-text dwelling description into known categories;
    # anything unrecognised falls through to the "unknown" default below.
    for category in RESIDENCE_CATEGORIES:
        if note.startswith(category):
            return category
    return "unknown"

def tenancy_months(note: str) -> int:
    # A note describing a rental often carries a duration worth keeping.
    if describes_rental(note):
        return parse_trailing_months(note)
    return 0
A customer can be referred through several channels, so a packed referral string
("3|7|12") expands into a block of binary columns, one per known channel. Income
is consolidated from whichever field is populated, a monthly figure when present,
otherwise a daily figure scaled up, with a recorded zero treated as “not stated”
rather than a real income of nothing. Standard stuff, the ordinary tax of working
from an operational database instead of a clean export.
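Those two transformations can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline itself: the channel ids in `KNOWN_CHANNELS` are hypothetical, and the factor used to scale a daily income to a monthly one (`WORKING_DAYS_PER_MONTH = 26`) is a guess, since the scaling actually used is not given here.

```python
KNOWN_CHANNELS = [3, 7, 12]   # hypothetical channel ids
WORKING_DAYS_PER_MONTH = 26   # assumption: no scaling factor is stated


def expand_referrals(packed):
    """Turn a packed string like "3|7|12" into one 0/1 column per known channel."""
    tokens = (packed or "").split("|")
    seen = {int(tok) for tok in tokens if tok.strip().isdigit()}
    return {f"referral_{ch}": int(ch in seen) for ch in KNOWN_CHANNELS}


def monthly_income(monthly, daily):
    """Prefer the monthly figure, else scale the daily one.

    A recorded zero (or None) means "not stated", so it is returned as
    None rather than as a real income of nothing.
    """
    if monthly:
        return float(monthly)
    if daily:
        return float(daily) * WORKING_DAYS_PER_MONTH
    return None
```

Returning `None` for an unstated income keeps the missingness visible to whatever imputation or missing-value handling comes later, instead of letting a zero pose as a measurement.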
The one decision in here worth flagging is small: keep the human-authored note at all. It is the closest thing in the dataset to a person’s read of the applicant, so rather than drop a messy text column, the pipeline mines it and then discards the raw text, which can carry personally identifying detail. The structured categories survive; the free text does not. The judgement is kept, the PII is not.
So the features are the easy half. The two parts that actually decide whether the project works are both harder and stranger: there is no label to train on, and not every column you have is one you are allowed to use. The rest of this post is those two.
The label does not exist
Here is the part that has no equivalent in a tutorial. There is no default
column. The warehouse records what a contract is currently doing, its servicing
status and its age, not a tidy verdict of “good” or “bad.” The target has to be
constructed from that, and the construction is where most of the judgement lives.
The skeleton is the servicing status. Some outcomes are already settled:
def label(status: str, age_months: float, term_months: int) -> str:
    if status == "CLOSED":                          # paid to term
        return "good"
    if status in ("WRITTEN_OFF", "REPOSSESSED"):    # total loss
        return "bad"
    if status in ("TERMINATED", "CANCELLED"):       # failed early
        return "bad"
    ...
The hard cases are the active contracts, the ones still being paid, which
have no settled outcome yet. You cannot wait years for every loan to close, so an
active contract is judged by how far it has come: if it is most of the way
through its term or simply old enough, it is behaving like a good loan and is
labelled good.
    if status == "ACTIVE":
        completion = age_months / term_months
        if completion >= 0.70 or age_months >= 18:
            return "good"       # far enough along to trust
        if age_months >= 6:
            return "monitor"    # real, but verdict not in yet
        return "exclude"        # too young to judge
Those two numbers, the 0.70 completion mark and the 18-month floor, are not
sacred; they are knobs. Loosen them and more active contracts qualify as good,
which buys you training rows at the cost of certainty. Tighten them and every
label is safer but the dataset shrinks. They also tune recency: some contracts
in the warehouse are nearly a decade old, so leaning on long-settled history gives
you volume, while leaning on younger, still-active contracts keeps the signal
current at the price of being less sure how those loans end. Where you set the two
thresholds is a deliberate trade between how much data you train on and how recent
that data is, and it is worth revisiting rather than hard-coding once.
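To make the trade concrete, here is a sketch of the active-contract rule with its thresholds exposed as parameters. The default values are the ones quoted above; the looser setting and the sample contracts are invented purely to show how the label counts move.

```python
from collections import Counter


def label_active(age_months, term_months,
                 min_completion=0.70, min_age=18, monitor_age=6):
    """Label an ACTIVE contract; the thresholds are parameters, not constants."""
    completion = age_months / term_months if term_months else 0.0
    if completion >= min_completion or age_months >= min_age:
        return "good"
    if age_months >= monitor_age:
        return "monitor"
    return "exclude"


# Hypothetical (age_months, term_months) pairs for active contracts.
active = [(3, 12), (7, 12), (9, 12), (10, 24), (15, 24), (20, 36)]

default = Counter(label_active(a, t) for a, t in active)
loose = Counter(
    label_active(a, t, min_completion=0.55, min_age=12) for a, t in active
)
# default -> good: 2, monitor: 3, exclude: 1
# loose   -> good: 4, monitor: 1, exclude: 1
```

On this toy sample, loosening the knobs doubles the number of usable "good" rows, and every one of the extra rows is a contract whose ending is less certain. That is the trade in miniature.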
The two exclusions are the whole point, and they are easy to get wrong. The first
is youth: a contract only a few months old gets thrown out, not labelled good,
because its outcome genuinely is not known, and labelling it either way would
teach the model a guess.
The second exclusion is sharper. Rejected applications are excluded entirely, because their outcome was never observed. The business turned them down, so the loan never happened and there is no repayment history to grade. This is not just missing data, it is a permanent blind spot. You can never learn whether a rejected applicant would in fact have paid, which means a false rejection, a good customer the old policy turned away, is literally unmeasurable. And it has a second consequence that colours everything downstream: every row that survives into training is one that was approved, so the dataset is a biased sample of the world. It describes only the population the existing policy already said yes to. The model learns the approved region of applicants well and the rejected region not at all, and no metric computed on this data can tell you otherwise. Labelling rejections as “bad” would not fix the bias, it would bake the old policy’s judgement in as if it were ground truth.
So only contracts with an outcome you can actually stand behind, settled or far
enough along, make it into training; the monitor middle ground is held back too.
That discipline, refusing to label what you cannot observe, is the difference
between a model that predicts default and a model that predicts what the old
policy already did. It costs you training rows. It is worth it.
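In code, that discipline is a filter applied after labelling. The sketch below assumes rows that already carry a `label` field from the rule above; the sample rows and the 0/1 encoding (1 for bad) are illustrative choices, not something the warehouse dictates.

```python
def training_examples(rows):
    """Keep only rows with a settled label and encode it as 1 = bad, 0 = good."""
    target = {"good": 0, "bad": 1}
    return [
        {**row, "default": target[row["label"]]}
        for row in rows
        if row["label"] in target  # drops "monitor" and "exclude"
    ]


# Hypothetical labelled rows. Rejected applications never appear here:
# they have no contract, hence no servicing status to label.
labelled = [
    {"contract_id": "C1", "label": "good"},
    {"contract_id": "C2", "label": "bad"},
    {"contract_id": "C3", "label": "monitor"},
    {"contract_id": "C4", "label": "exclude"},
]
train_rows = training_examples(labelled)
held_back = len(labelled) - len(train_rows)
```

Tracking `held_back` alongside the training count is worthwhile: it is the running price of the thresholds, and it shrinks on its own as contracts age.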
The field that builds the label can never be a feature
The label is built from the servicing status: CLOSED, WRITTEN_OFF, and the
rest. That makes the servicing status the single most predictive column in the
entire dataset, trivially so, because it is the answer rephrased. Leave it in
the feature matrix and the model will score beautifully in evaluation and be
worthless in production. It would be reading the back of the book.
This is target leakage, and on this problem it is not a subtle, one-column risk; it is structural. Recall the job: score an application before approval. At that moment, a brand-new application has no servicing status, no contract age, no repayment history. Those fields do not exist yet. They only come into being after the contract runs for months. So every field that was used to build the label, or that only materialises after the decision, has to be stripped out of the features:
# These describe how the contract turned out. They build the label,
# so they must never travel into the feature matrix.
POST_OUTCOME = ["servicing_status", "contract_age_months", "label", ...]
features = df.drop(columns=POST_OUTCOME)
The clean way to think about it is a line drawn through time. On one side is everything knowable at the moment of decision: income, the home-visit note, the selling item’s price, how the applicant was referred. On the other side is everything the future reveals: whether they paid, how the contract aged, how it ended. The label is built entirely from the future side. The features must be drawn entirely from the past side. Keeping that line clean is not a cleanup step; it is the definition of what “pre-approval scoring” even means.
This is the trap that catches newcomers most often. Handed a wide table, the instinct is to feed every column to the model and let it sort them out. But many of the most predictive-looking columns are recorded after the decision, and a column you will not have at prediction time is not a feature, it is the answer wearing a feature’s clothes. The only honest question to ask of each column is: would this value exist, for a brand-new application, at the moment the decision is made? If not, drop it, however much it flatters the score in evaluation. The discipline also reframes which surviving columns matter: among the fields genuinely available before approval, a human’s in-person read of the applicant, the home-visit note, is one of the richer ones, precisely because it is real signal measured on the right side of the line.
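One way to make that question hard to skip is to answer it once per column, in writing, and have the pipeline refuse anything unanswered. The registry below is a hypothetical sketch: the column names and the two tags (`"decision"`, `"post_outcome"`) are assumptions made for illustration.

```python
# Hypothetical registry: each column is tagged by when its value first exists.
AVAILABLE_AT = {
    "income": "decision",
    "residence_type": "decision",
    "list_price": "decision",
    "servicing_status": "post_outcome",
    "contract_age_months": "post_outcome",
}


def decision_time_features(columns):
    """Return the columns usable at approval time.

    Raises ValueError for any column nobody has classified yet, so a new
    warehouse field cannot drift into the feature matrix by default.
    """
    unknown = [c for c in columns if c not in AVAILABLE_AT]
    if unknown:
        raise ValueError(f"unclassified columns: {unknown}")
    return [c for c in columns if AVAILABLE_AT[c] == "decision"]
```

An allow-list of this kind fails closed: a column added to the schema next year is rejected until someone decides which side of the line it sits on, whereas a drop-list would let it through.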
The second leak: do not let a customer cross the split
There is a quieter leak between the labelled data and an honest score. A single customer can hold several contracts. Split the rows randomly into train and test and the same person can land on both sides, so the model is graded on people it already learned. The fix is to split on the customer, not the row: every contract belonging to one customer goes wholly into train, or wholly into test, never both.
train_customers, test_customers = split(df["customer_id"].unique())
train = df[df["customer_id"].isin(train_customers)]
test = df[df["customer_id"].isin(test_customers)]
Three lines, and it is the difference between a test score you can believe and one that is flattering you. Both leaks have the same shape, information crossing a boundary it should not: the first across time, the second across people. Close both and the eventual bake-off across candidate classifiers is finally measuring the only thing a credit model is ever asked to do, generalisation to new people whose outcomes are not yet known.
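The `split` helper in the snippet above is left unspecified. One possible way to fill it in, offered as an assumption rather than as the method used here, is to assign each customer by a stable hash of their id. The function name `in_test_set`, the 20% test fraction and the synthetic ids are all illustrative.

```python
import hashlib


def in_test_set(customer_id, test_fraction=0.2):
    """Assign a customer to train or test from a stable hash of their id."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < test_fraction


# Hypothetical contracts as (contract_id, customer_id); customers repeat.
contract_rows = [(f"C{i}", f"U{i % 40}") for i in range(100)]

train = [row for row in contract_rows if not in_test_set(row[1])]
test = [row for row in contract_rows if in_test_set(row[1])]

train_customers = {customer for _, customer in train}
test_customers = {customer for _, customer in test}
```

Because the assignment depends only on the id, a customer who takes out a new contract next year stays on the same side of the split, which a fresh random shuffle would not guarantee. With scikit-learn available, `GroupShuffleSplit` with `customer_id` as the group offers the same no-overlap property.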
Close the loop: predictions become labels
The model is served behind a small interface the field officers actually use. An officer enters an application reference, and the app returns the decision in plain terms: approve or reject, the probability of default behind it, and the threshold that decision was made against.
result = call_prediction_api(reference)
# -> {"prediction": "approve", "probability": 0.23, "threshold": 0.39}
A simple frontend, but it is doing something more than displaying a score. Every
prediction it makes is logged against its reference, and that reference points
at the same contract whose servicing status the label factory reads from. So
today’s prediction and tomorrow’s truth are keyed to the same record: as that
contract ages past the very thresholds the labelling rule uses (most of the way
through its term, or eighteen months old), it crosses from the future side of the
line through time back into settled fact, acquires a real good or bad label, and
the earlier prediction can finally be scored against it.
That closes the loop the whole project hangs on. The interface is not just a way to read the model; it is the mechanism that keeps the training set growing. Each prediction starts life as an unlabelled guess and, given enough months, matures into a labelled example the next model learns from. The feature archaeology and the label construction are not one-time setup steps run in a notebook. They are a standing process the interface feeds.
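The scoring half of that loop can be sketched as a join between the prediction log and whatever labels have matured so far. The log layout mirrors the API response shown earlier; the references, the label store and the summary fields are hypothetical.

```python
def score_matured(prediction_log, labels):
    """Compare logged predictions with labels that have since matured.

    A "reject" can only be scored if the human process approved the
    application anyway; applications that were actually turned down
    never acquire a label and stay pending.
    """
    expected = {"approve": "good", "reject": "bad"}
    matured = [
        (entry["prediction"], labels[ref])
        for ref, entry in prediction_log.items()
        if labels.get(ref) in ("good", "bad")
    ]
    correct = sum(expected[pred] == actual for pred, actual in matured)
    return {
        "scored": len(matured),
        "pending": len(prediction_log) - len(matured),
        "correct": correct,
    }


# Hypothetical log keyed by application reference.
prediction_log = {
    "A-001": {"prediction": "approve", "probability": 0.23, "threshold": 0.39},
    "A-002": {"prediction": "approve", "probability": 0.31, "threshold": 0.39},
    "A-003": {"prediction": "reject", "probability": 0.62, "threshold": 0.39},
}
labels = {"A-001": "good", "A-002": "bad", "A-003": "monitor"}

summary = score_matured(prediction_log, labels)
```

Here two predictions have matured, one of them correct, and the third is still waiting on its contract. Rerunning the same function each month is what turns the log into a growing, labelled evaluation set.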
The takeaway
A classifier needs two things handed to it: a feature matrix and a column of labels. Operational data hands you a business, and you build both yourself. The features are the mechanical half, every table turned into columns. The label is the part with no equivalent in a tutorial: you turn servicing history into an honest verdict, refuse to grade the cases whose outcome is not in yet, and accept that the applicants the old policy rejected are a blind spot no metric can see into. And the discipline that holds it together is a single line drawn through time: the label is made of the future, the features must be made of the past, and nothing crosses.
None of it is the model. In a real credit problem, the dataset is the work, and manufacturing it honestly, refusing to invent a label you cannot observe and refusing to let the answer leak into the question, is what separates a score you can lend against from one that just repeats yesterday’s policy back to you.