Model Sprawl Is the New Tech Debt
AI teams accumulate models faster than they build controls. How to manage model sprawl with registries, drift monitoring, rollbacks, and consolidation.
If you've been building AI features since 2024, you probably have more models in production than you planned. The support ticket classifier you shipped as a PoC that got promoted to production. The content moderation fine-tune added three months later. The RAG pipeline for internal search that runs on a different provider because it was cheaper for that workload. The sentiment model someone added for a dashboard metric. Each one was a reasonable individual decision. Collectively, they're infrastructure without a maintenance plan.
Modern AI systems are rarely single models anymore. They are orchestrations of foundation models, fine-tuned variants, retrieval pipelines, prompts, guardrails, routing logic, and feedback loops. Most of those layers get added one at a time by people who say, "it's just one model."
The model sprawl problem isn't a capability problem. It's a maintenance and observability problem. This issue is the practical guide to getting control of it.
How You End Up Here
The path is consistent enough to be a pattern.
First model ships as a proof of concept for one specific use case. It works well enough to get promoted to production. No versioning, because it's one model. No monitoring beyond infrastructure metrics, because it's stable.
Second model gets added for a different use case. Different provider, because the team evaluating it found a cost or quality advantage. Still no registry, because it's just two models and everyone knows what they are.
Third model is a fine-tune of the first, created to improve performance on a specific domain. Now there are two versions of a model serving different use cases, tracked in a spreadsheet.
By the time you have five models, the spreadsheet is out of date, two people have left the team, and nobody can answer with confidence which model version is currently serving production traffic for any given feature.
Three signals you're already in this situation: you can't identify which model version is in production for a given feature without checking the deployment logs, output quality on an older feature has been declining but you can't pin it to a specific change, and rolling back a model update involves a Slack thread with three people rather than a deployment flag.
Version Control for Models
The mental model shift is the one early software teams made with code: stop treating each model as a one-off artifact and start treating model versions the way you treat software releases. The principles carry over directly: tagged versions, changelogs, rollback capability. The implementation differs because models aren't code, but the tooling for this is mature.
A model registry is the central store that makes this tractable. Every model artifact that touches production is registered with a version tag before deployment. The registry captures training parameters, evaluation results, the dataset version used for training, and deployment status. When something degrades in production, the first question ("which version caused this?") has a 60-second answer instead of a Slack investigation.
For open-source tooling, MLflow is a strong default. MLflow's Model Registry provides a centralized store for versioning, stage transitions, and deployment workflows, and MLflow 3 extends the platform's GenAI support with tracking, evaluation, and observability for models, prompts, agents, and AI applications.
Weights & Biases is strong if you want experiment tracking tightly coupled with registry and lineage. W&B Registry is designed as a central repository for artifact versions, lineage, audit history, and governance across teams. The tradeoff is packaging and pricing: check the current W&B plan details before you standardize, because those change more often than your observability architecture should.
Two practices that make any registry useful: immutable version IDs (a hash or semantic version tied to training data, architecture, and eval results, not just "v2"), and staged rollouts where new versions go to a canary slice (5-10% of traffic) before full promotion. Define the rollback trigger before deployment. "If output quality metrics drop more than X% from baseline, revert" is a decision that should exist in writing before the deployment, not improvised when something goes wrong.
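Both practices fit in a few lines. This sketch is illustrative, assuming a quality metric where higher is better; the 12-character hash truncation and the 5% default drop threshold are arbitrary choices for the example, not recommendations.

```python
import hashlib
import json


def version_id(dataset_hash: str, architecture: str, eval_results: dict) -> str:
    """Derive an immutable version ID from the things that define the model,
    instead of a hand-edited label like "v2"."""
    payload = json.dumps(
        {"data": dataset_hash, "arch": architecture, "evals": eval_results},
        sort_keys=True,  # stable serialization so the hash is deterministic
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


def should_roll_back(baseline: float, current: float,
                     max_drop_pct: float = 5.0) -> bool:
    """The rollback trigger, written down before deployment: revert the canary
    if the quality metric drops more than max_drop_pct percent from baseline."""
    drop_pct = (baseline - current) / baseline * 100
    return drop_pct > max_drop_pct
```

Because the ID is derived from training data, architecture, and eval results, retraining on new data cannot silently reuse an old version tag.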
Monitoring That Catches Model Drift
Infrastructure monitoring tells you when a model is down or slow. It doesn't tell you when a model is producing significantly worse outputs than it was six months ago. That requires output monitoring, which most teams don't have.
Three signals to instrument.
Output distribution shift is the leading indicator most teams miss. For a classifier, track the ratio of each output class over time. For a text generator, track output length, vocabulary diversity, and any domain-specific quality signals you can define. A significant shift in distribution often precedes user complaints by days or weeks. You want to catch it before users catch it.
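For a classifier, the distribution check can be as simple as comparing class ratios between a baseline window and the current window. The sketch below uses total variation distance; the 0.15 alert threshold is an assumption you would tune against your own historical baselines.

```python
from collections import Counter


def class_ratios(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def distribution_shift(baseline: list[str], window: list[str]) -> float:
    """Total variation distance between the baseline output distribution and
    the current window: 0.0 = identical, 1.0 = completely disjoint."""
    b, w = class_ratios(baseline), class_ratios(window)
    classes = set(b) | set(w)
    return 0.5 * sum(abs(b.get(c, 0.0) - w.get(c, 0.0)) for c in classes)


SHIFT_THRESHOLD = 0.15  # assumption: calibrate against your own history
```

Run this over a rolling window of production outputs and alert when the distance crosses the threshold; that alert usually arrives well before the user complaints do.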
Human feedback integration closes the loop from production back to evaluation. If your product has any feedback mechanism (thumbs up/down, correction flows, support escalations where the user overrides an AI decision), that signal needs to route to your model monitoring stack as labeled data. Not just as a metric to observe, but as input to your evaluation suite. Every correction is a data point about where the model is wrong.
Scheduled eval runs against golden fixtures are the synthetic check that catches regressions systematically. Run your evaluation dataset against the production model endpoint on a schedule: daily for most features, hourly if the stakes are high enough to justify the cost. If a golden fixture that used to pass starts failing, something changed. You find out from the eval run, not from a user complaint. (Reference Issue #5's Builder Tip on golden fixture suites for the setup methodology.)
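The scheduled run itself is a small loop. In this sketch, `call_model` and `score` are stand-ins for your endpoint client and your scoring rubric (exact match, a grader model, whatever your suite uses); they are assumptions, not a prescribed interface.

```python
def run_golden_fixtures(fixtures: list[dict], call_model, score) -> list[dict]:
    """Run the golden fixture suite against the production endpoint.
    Returns the failing fixtures so the alert can name what regressed."""
    failures = []
    for fx in fixtures:
        output = call_model(fx["input"])
        if not score(output, fx["expected"]):
            failures.append({"id": fx["id"], "got": output})
    return failures
```

Wire the return value into your alerting: an empty list means the scheduled run passes silently, a non-empty list pages the model's owner with the specific fixtures that broke.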
For tooling: Evidently gives you open-source evaluation and observability for ML and LLM systems, with 100+ built-in evals and support for evaluations, monitoring, and tracing. Arize Phoenix gives you open-source LLM tracing and evaluation with OpenTelemetry-based instrumentation. LangSmith combines tracing, evaluation, and monitoring, especially if you're already in the LangChain ecosystem.
When to Consolidate, When to Keep Them Separate
Once you have visibility into your model fleet, you'll face the question of whether to consolidate onto fewer models or maintain the specialization. The answer should come from data, not from the original decision to build a fine-tune.
The consolidation case applies when two models are doing similar tasks and the performance delta between them is small. The maintenance savings from running one model on one provider are real: fewer monitoring dashboards, simpler rollout logic, lower cognitive overhead. If a fine-tuned model is beating the general model by 2% on the metrics that matter, that's probably not worth separate infrastructure.
The keep-separate case applies when a domain-specialized model genuinely outperforms a general model on the task it was built for. Measured, not assumed. "We fine-tuned it so it must be better" is not sufficient justification for maintaining separate infrastructure. Run the general model against your fine-tuned model on a real evaluation set drawn from production data. If the fine-tuned model wins by a meaningful margin on the metrics that matter for that use case, the specialization is worth the overhead. If it doesn't, consolidate and free up the infrastructure.
The practical test: pull 200 representative inputs from production traffic for that feature. Score both models on your evaluation rubric. If the fine-tuned model wins by more than 10 percentage points on the metrics that drive user outcomes, keep it separate. Below that threshold, the maintenance cost starts to outweigh the performance gain.
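The practical test reduces to a comparison of win rates. In this sketch, `score_general` and `score_finetuned` are placeholders for whatever rubric scores one input for each model (returning 1.0 for a pass, 0.0 for a fail); the function shape is an assumption for illustration.

```python
def consolidation_verdict(inputs: list, score_general, score_finetuned,
                          threshold_pp: float = 10.0) -> str:
    """Score both models on representative production inputs and compare
    pass rates. Keep the fine-tune only if its margin clears the threshold
    in percentage points."""
    general = sum(score_general(x) for x in inputs) / len(inputs)
    finetuned = sum(score_finetuned(x) for x in inputs) / len(inputs)
    margin_pp = (finetuned - general) * 100
    return "keep-separate" if margin_pp > threshold_pp else "consolidate"
```

Run it on the ~200 sampled inputs and let the verdict, not the sunk cost of the fine-tune, drive the decision.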
Builder Tip
Maintain a model inventory. One document (or database record) per production model, with six fields: what it does, which provider and version is currently serving production, when it was last updated, who owns it, what the rollback procedure is, and what monitoring is in place.
If you can't fill in all six fields for a model that's in production, that's where to start, before setting up a registry, before changing your monitoring stack. The inventory is the forcing function that surfaces which models are ungoverned. A Notion table, a YAML file in your infrastructure repo, or a registry entry all work. What matters is that it exists, it's current, and the team knows where it is.
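As a YAML file in an infrastructure repo, one entry might look like the following. All names, versions, and values here are made up for illustration; only the six fields matter.

```yaml
# One entry per production model. Values below are illustrative.
- model: support-ticket-classifier
  purpose: routes inbound tickets to the right queue
  serving: provider-x fine-tune, version a1b2c3d4e5f6
  last_updated: 2025-11-03
  owner: jane.doe            # the person paged when the drift alert fires
  rollback: point MODEL_VERSION at the previous registry entry; redeploy
  monitoring: daily golden-fixture run; output class-ratio dashboard
```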
Quick Hits
The model registry is a solved problem for ML models. For prompts, it's 18 months behind. Most teams version code and models but treat prompt changes as freeform edits. The gap is closing (see Issue #5 on prompt version control), but if your model fleet includes prompt-tuned applications, apply the same inventory discipline to prompts that you apply to model artifacts.
Evidently is no longer just a tabular-data monitoring tool. Its current open-source stack covers ML and LLM evaluation, monitoring, and tracing, which makes it viable for mixed model fleets.
The who-owns-it field in the model inventory turns out to be the most important one. Model drift investigations usually stall because nobody knows whose job it is to act on the signal. Ownership doesn't need to be formal. A named person who gets paged when the monitoring alert fires is enough.
Provider consolidation is often the highest-leverage simplification available. If you have three models on three providers, moving two of them to the same provider as the third reduces billing overhead, simplifies credential management, and often improves negotiating position on pricing. Evaluate this separately from model architecture decisions.