Risk adjustment was supposed to level the playing field. Instead, it’s become a financial drain—amplified by machine learning/NLP coding tools that plateau in accuracy, add operational complexity, and fail to scale cleanly.
What started as a system to ensure fair payments based on member health complexity has morphed into an expensive, error-prone process. Many organizations leaned on statistical NLP to find additional HCCs in medical records, only to hit accuracy and recall ceilings (often in the 70-80% range for real-world ICD-10 automation), suggested-code guesswork, confidence thresholds, and black-box logic that perpetuates historical errors rather than correcting them. See, for example, a hospital evaluation of a machine learning ICD-10 auto-coder that reported less than 70% accuracy.
The numbers don't lie. Missed HCCs drive revenue loss; confidence-based suggestions create audit exposure; and brittle models increase staffing strain and tech sprawl. It's time to face the root cause: traditional ML/NLP tools aren't just inefficient; they create unsustainable operations when accuracy must hold up at volume and under audit.
The Hidden Revenue Hemorrhage
Every missed HCC code represents lost revenue that compounds over time. A primary driver: ML/NLP auto-coding tools that cap out on recall and require humans to adjudicate low-confidence "suggested" codes. In practice, coders discard a meaningful share of those suggestions because the model's confidence is too low to justify in an audit, which means eligible HCCs never make it into RAF scores. In hospital ICD-10 automation studies, NLP systems commonly report accuracy in the 70-80% range: good for triage, not for closing high-stakes risk gaps end to end.
Operationally, that ceiling shows up as incomplete chart reviews: confidence-threshold triage, suggested-code guesswork, and black-box rationales that coders can't easily validate consume reviewer time while still missing subtle but risk-adjusting conditions. When teams rush thousands of charts under deadlines, they default to the model's confidence rules, further suppressing capture of edge-case HCCs that require precise language evidence rather than probability estimates.
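To make that mechanism concrete, here is a minimal, hypothetical sketch of the confidence-threshold triage pattern described above. The cutoff value, the example codes, and the suggest_codes function are illustrative assumptions, not any vendor's actual logic; the point is simply that every suggestion below the cutoff is routed away from submission, so the HCC evidence behind it never reaches the RAF score.

```python
# Illustrative only: a simplified confidence-threshold triage loop.
# suggest_codes() stands in for a generic ML/NLP auto-coder that returns
# (icd10_code, confidence) pairs; the 0.80 cutoff is a hypothetical setting.

from typing import List, Tuple

CONFIDENCE_CUTOFF = 0.80  # hypothetical "safe to submit" threshold


def suggest_codes(chart_text: str) -> List[Tuple[str, float]]:
    """Placeholder for a probabilistic auto-coder's output."""
    return [
        ("E11.22", 0.91),  # type 2 diabetes w/ diabetic CKD - above cutoff
        ("I50.32", 0.74),  # chronic diastolic heart failure - below cutoff
        ("N18.4", 0.58),   # CKD stage 4 - below cutoff
    ]


def triage(chart_text: str):
    submitted, discarded = [], []
    for code, confidence in suggest_codes(chart_text):
        # Anything under the cutoff is dropped from submission,
        # even if the chart actually documents the condition.
        (submitted if confidence >= CONFIDENCE_CUTOFF else discarded).append(code)
    return submitted, discarded


submitted, discarded = triage("...chart text...")
print("Submitted:", submitted)        # ['E11.22']
print("Never captured:", discarded)   # ['I50.32', 'N18.4'] -> missed HCCs
```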
For providers, the impact scales beyond individual misses. Incomplete risk data depresses RAF scores, creating systematic underpayment across populations. Lower RAF scores reduce per-member payments and can translate into weaker quality performance and fewer resources for care management. What looks like throughput optimization becomes a revenue leakage mechanism rooted in the limitations of traditional ML/NLP.
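The revenue math is easy to see even with back-of-the-envelope numbers. The sketch below uses hypothetical coefficients, base rate, and miss rate (not published CMS-HCC values) to show how a single dropped HCC compounds across a population.

```python
# Illustrative RAF arithmetic. The coefficients, base rate, and miss rate
# below are hypothetical placeholders, not published CMS-HCC values.

MONTHLY_BASE_RATE = 850.00   # hypothetical per-member, per-month base payment
DEMOGRAPHIC_RAF = 0.40       # hypothetical age/sex component
CAPTURED_HCC_RAF = 0.30      # hypothetical coefficient for a documented HCC
MISSED_HCC_RAF = 0.33        # hypothetical coefficient for the HCC that was dropped

complete_raf = DEMOGRAPHIC_RAF + CAPTURED_HCC_RAF + MISSED_HCC_RAF
submitted_raf = DEMOGRAPHIC_RAF + CAPTURED_HCC_RAF

# Payment is roughly base rate x RAF, so the gap per member per month is:
monthly_gap_per_member = MONTHLY_BASE_RATE * (complete_raf - submitted_raf)

# Assume (hypothetically) 10% of a 10,000-member population has one HCC
# dropped this way over a year.
annual_population_gap = monthly_gap_per_member * 12 * 10_000 * 0.10

print(f"Monthly underpayment for that member: ${monthly_gap_per_member:,.2f}")
print(f"Annualized across the population:     ${annual_population_gap:,.2f}")
```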
RADV Audits: The Compliance Threat That Keeps CFOs Awake
Risk Adjustment Data Validation (RADV) audits represent the other side of the equation. Many ML/NLP tools rely on confidence scores and opaque model logic; they suggest codes without deterministic, text-level evidence that a reviewer can point to in the chart. That's precisely what fails in audit.
According to CMS, RADV audit results show significant variation in documentation quality, with many health plans unable to substantiate submitted diagnosis codes. When code selection depends on model thresholds rather than explicit documentation matches, audit teams struggle to reproduce the logic—and financial recoveries follow.
The balancing act becomes untenable: push hard on model‑assisted capture and absorb audit exposure, or throttle to “safe” high‑confidence suggestions and accept missed HCCs. Either path is a direct consequence of probability‑based coding.
The compliance response—more reviewers, second‑level QA, additional vendor overlays—drives defensive spending without eliminating the root cause. As long as suggested/guess codes and confidence thresholds govern what gets submitted, audit risk remains built in.
The Operational Cost Spiral
Risk adjustment operations have become budget black holes. A major driver is the care and feeding of ML/NLP stacks: model retraining, threshold tuning, exception queues for low‑confidence suggestions, and manual audit rationalization. Each layer adds people, process, and platforms—without lifting the accuracy ceiling that created the problem in the first place.
Staffing costs climb because probability‑based tooling still requires experienced coders to adjudicate the “gray zone” the model cannot deterministically resolve. Turnover compounds the issue as institutional knowledge about model behavior is lost and must be rebuilt.
Technology investments haven’t delivered promised returns either. Multiple point solutions—triage NLP, separate audit tools, QA overlays—create a brittle, expensive stack to integrate and maintain. The literature on deep learning auto‑coding notes ongoing maintenance and generalization challenges that force continual operational babysitting.
The administrative burden extends beyond direct coding costs. Denials, appeals, and payment reconciliation trace back to codes that lack clear, text‑anchored justification. Finance teams struggle to forecast when capture depends on shifting confidence thresholds rather than deterministic evidence. Leadership time is diverted to managing tooling rather than improving a plan’s member care and operations.
Traditional Coding Tools: Built for Yesterday’s Challenges
Most HCC coding software platforms were designed for fee‑for‑service abstraction, not RAF‑sensitive, audit‑exposed risk adjustment. In that environment, ML/NLP tools run into inherent limits:
- Accuracy/recall ceilings: In real-world ICD-10 auto-coding, NLP systems commonly report accuracy in the 70-80% range (a JMIR hospital evaluation found less than 70% accuracy across all ICD-10 coding; a state-of-the-art study reported roughly 80% accuracy for identifying the top 50 codes). That leaves too many HCCs uncaptured without heavy human backstops.
- Suggested-code guesswork: Models propose candidates with confidence scores, not certainties. Low-confidence items either get discarded (missed revenue) or submitted (audit risk).
- Black‑box logic: Probabilistic inference is hard to reproduce in QA and during RADV. Coders and auditors need specific words/phrases that justify a code.
- Perpetuation of historical errors: Models trained on legacy labels learn yesterday’s mistakes and biases, carrying misses forward rather than correcting them.
- High operational overhead: Continuous retraining, threshold tuning, exception handling, and QA overlays add cost and complexity without breaking through the ceiling.
Workflow impact follows. Coders spend disproportionate time adjudicating model uncertainty and navigating UI layers instead of verifying clinical evidence. QA becomes a second (or third) review pass to reverse engineer model choices for audit defensibility.
A different approach is required—one that replaces confidence‑based suggestions with deterministic, text‑anchored identification.
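To illustrate the distinction, here is a minimal sketch of deterministic, text-anchored identification under simplified assumptions. The phrase-to-code table and sample chart are tiny hypothetical examples (real risk-adjustment logic must handle negation, synonyms, specificity, and far more phrases), and this is not any vendor's implementation; the point is that every returned code carries the exact chart text and position that justifies it, with no probability score to tune or defend.

```python
# Minimal sketch of deterministic, text-anchored code identification.
# The phrase table and chart text are hypothetical; a production system
# would handle negation, synonyms, and far more phrase/code pairs.

import re
from dataclasses import dataclass

PHRASE_TO_CODE = {
    "diabetes mellitus type 2 with diabetic nephropathy": "E11.21",
    "chronic systolic heart failure": "I50.22",
    "morbid obesity": "E66.01",
}


@dataclass
class Finding:
    code: str      # ICD-10 code the evidence supports
    evidence: str  # exact words from the chart
    start: int     # character offsets, so a reviewer can locate the evidence
    end: int


def find_codes(chart_text: str) -> list:
    findings = []
    lowered = chart_text.lower()
    for phrase, code in PHRASE_TO_CODE.items():
        for match in re.finditer(re.escape(phrase), lowered):
            findings.append(Finding(code,
                                    chart_text[match.start():match.end()],
                                    match.start(), match.end()))
    return findings


chart = ("Assessment: Diabetes mellitus type 2 with diabetic nephropathy, "
         "stable. Continues treatment for chronic systolic heart failure.")
for f in find_codes(chart):
    print(f'{f.code}: "{f.evidence}" (chars {f.start}-{f.end})')
```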
The Sustainability Breaking Point
All roads lead back to the tooling. Revenue losses (due to missed HCCs), audit exposure (unsupported suggestions), staffing strain (manual adjudication of low‑confidence codes), and tech sprawl (multiple overlays to tame model uncertainty) stem from ML/NLP’s core limitations. More spend on the same paradigm doesn’t change the ceiling.
Executives feel it. Leadership time shifts to triaging throughput and QA rather than improving care and member growth. Recruiting and training new coders to work around model limitations is expensive—and that institutional knowledge walks out the door with turnover.
The vicious cycle is predictable: underperformance triggers more reviewers, more QA, and more point solutions. Accuracy plateaus remain. Costs rise. Complexity grows. Without changing the underlying method, results stagnate.
How Cavo Breaks the Cycle
Cavo Health's Precise Word Matching AI is built to remove probability from high-stakes coding. It doesn't learn from historical labels or suggest guesses. It deterministically finds the exact words and phrases in the chart that justify risk-adjustable diagnoses and links them to the correct, most specific codes, including combination codes, every time the evidence appears.
What changes in practice:
- 98% HCC recall, as reported by Cavo Health.
- Deterministic word matching: no suggested/guess codes, no confidence thresholds to tune or defend.
- Transparent, audit‑ready logic: each code is backed by explicit text citations for reviewers and RADV.
- Consistent performance regardless of volume: the engine scales without adding reviewers or QA layers.
- Simpler operations: fewer exception queues, fewer overlays, fewer meetings to explain model behavior.
By replacing probability with precision, Cavo reduces dependence on scarce coding talent while improving HCC capture rates. Volume spikes don't require staffing adjustments. And because every code is text-anchored, audit preparation becomes a methodical verification step rather than a reconstruction of black-box model behavior.
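As a rough illustration of why text-anchored codes simplify audit prep, the sketch below assumes each submitted code is stored alongside a citation (document ID, character offsets, and quoted text; all field names here are hypothetical, not Cavo's data model). Verification then becomes a mechanical re-check against the source chart rather than a reconstruction of model reasoning.

```python
# Hypothetical audit-verification step for text-anchored codes.
# Each submission record carries the quoted evidence and its location,
# so re-validation is a simple string comparison against the source chart.

from typing import Dict, TypedDict


class Submission(TypedDict):
    code: str     # submitted ICD-10 code
    doc_id: str   # which chart the evidence came from
    start: int    # character offsets of the cited evidence
    end: int
    quoted: str   # the exact text that justified the code


def verify(submission: Submission, charts: Dict[str, str]) -> bool:
    """Return True if the cited text still appears at the cited location."""
    chart = charts.get(submission["doc_id"], "")
    return chart[submission["start"]:submission["end"]] == submission["quoted"]


charts = {"chart-001": "Assessment: chronic systolic heart failure, compensated."}
record: Submission = {
    "code": "I50.22",
    "doc_id": "chart-001",
    "start": 12,
    "end": 42,
    "quoted": "chronic systolic heart failure",
}
print("Audit-ready:", verify(record, charts))  # True
```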
The Path to Sustainable Risk Adjustment
Sustainability starts by removing the root cause of misses and audit exposure: confidence‑based, black‑box code suggestion. Deterministic, text‑anchored identification changes the math—higher recall, fewer exceptions, less QA, simpler audits.
Cavo Health's Precise Word Matching AI was designed for this exact problem. It consistently surfaces overlooked HCC opportunities while maintaining audit-ready transparency, and it scales while reducing the need for additional coders.
Organizations don’t need another overlay. They need a different method that delivers diagnostic coding truth—so revenue is accurate, audits are predictable, and operations stay lean.
If you’d like to see how this works in your own charts, contact Cavo Health. Our team will walk through a practical, data‑driven evaluation and let the results speak for themselves.
