How Spotify Uses Multi‑Model Voting to Give GenAI a Trustworthy Confidence Score
Spotify’s finance engineering team tackled the challenge of reliable GenAI outputs for invoice parsing by testing three confidence‑scoring methods, discarding two, and refining a multi‑model voting approach with weighted votes, calibration, and practical implementation details for high‑risk, SOX‑compliant scenarios.
If your team uses GenAI in “cannot‑fail” scenarios (finance, law, healthcare), you will run into a crucial question: how do you determine whether a GenAI output is trustworthy?
Spotify’s finance engineering team recently shared a practical case. They automated global invoice parsing with GenAI, but financial processes must comply with the Sarbanes‑Oxley Act (SOX), so they added a “confidence score” that decides whether a result is auto‑approved or routed to human review.
This article does not discuss complex theory; it distills reusable experience: selecting the optimal method from three candidates, implementation details, and remaining challenges.
1. Why “confidence score” is mandatory in serious scenarios
Regulatory requirement : Financial processes must comply with SOX, so a team cannot rely on AI that merely “feels right” without clear evidence.
Low tolerance for errors : A single mistake in invoice parsing can cause accounting mismatches and compliance risk.
Human hand‑off needed : AI alone is insufficient; a score acts as a switch to separate “AI‑handled” from “human‑reviewed” cases, improving efficiency.
2. Three confidence‑scoring methods tested – two eliminated, one kept
Spotify tried three mainstream methods and kept only one.
Method 1 (discarded): Calibrator model (AI evaluates AI)
Approach : Use an extra GenAI model to score the primary model’s output.
Pros : Independent judgment, can learn from human feedback.
Fatal issues :
Score is opaque – e.g., “80” without explanation, unacceptable for compliance.
Instability – the score for the same output can vary by more than 10 points, which is unsuitable for finance.
Method 2 (discarded): Log‑probability
Approach : Average the model’s token‑level log‑probabilities into a single confidence score.
Pros : Access to low‑level data, appears objective.
Fatal issues : Score does not correlate with actual accuracy; high scores can still be wrong.
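For readers unfamiliar with the discarded log‑probability method, here is a minimal sketch of one common way to compute such a score. The article only says the token‑level values are averaged; exponentiating the mean log‑probability (a geometric mean of token probabilities) is an assumption made here, and the function name is hypothetical.

```python
import math

def mean_logprob_confidence(token_logprobs):
    """Turn per-token log-probabilities into a single score in (0, 1].

    Assumption: the score is exp(mean log-prob), i.e. the geometric
    mean of the token probabilities. The source does not specify this.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities for a three-token answer.
example_logprobs = [math.log(0.9), math.log(0.8), math.log(0.95)]
score = mean_logprob_confidence(example_logprobs)
```

A score like this only measures how confident the model was while generating each token, which is why it can stay high for a wrong answer.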
Method 3 (kept): Majority voting (multi‑model consensus)
Approach : Run several different GenAI models on the same invoice; confidence = proportion of models agreeing on the answer.
Why kept :
Score strongly correlates with accuracy – more agreeing models → higher correctness.
Logic is simple and explainable for compliance teams.
Results are stable as long as models and data stay unchanged.
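The basic idea can be sketched in a few lines. This is an illustration only, not Spotify’s implementation: the function names, the sample answers, and the 0.8 routing threshold are all hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return (winning_answer, confidence) for a list of model answers.

    Confidence is the share of models that returned the winning answer.
    On a tie, Counter.most_common keeps the answer seen first.
    """
    if not answers:
        raise ValueError("need at least one answer")
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)

def route(confidence, threshold=0.8):
    """Hypothetical routing rule: auto-approve at or above the threshold."""
    return "auto_approve" if confidence >= threshold else "human_review"

# Five hypothetical models extract the invoice total; one disagrees.
answers = ["1250.00", "1250.00", "1250.00", "1250.00", "1280.00"]
winner, confidence = majority_vote(answers)
decision = route(confidence)
```

Because the score is just a count of agreeing models, a reviewer can trace any decision back to the individual model outputs.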
3. Implementing majority voting – three essential details
“Majority voting” sounds simple, but Spotify identified three key optimizations.
Model count: 5–6 models is optimal
Literature suggests 4–7 models balance diversity and cost.
With fewer than 5 models, a wrong answer can win the majority too easily (“majority‑wrong” failures).
More than 6 inflates cost with little accuracy gain.
Spotify settled on 5–6 models from different vendors to avoid homogeneous errors.
Weighted voting: more accurate models have higher influence
Not all models are equally reliable (e.g., Model A 90% accuracy vs. Model B 80%).
Weight each model by historical accuracy (A’s vote counts 1.2, B’s counts 1.0) and compute a weighted confidence score.
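A minimal sketch of weighted voting, using the 1.2 and 1.0 weights from the example above. The third model, the model names, and the sample answers are hypothetical, and normalizing by the sum of all weights is one reasonable choice rather than a confirmed detail of Spotify’s system.

```python
from collections import defaultdict

def weighted_vote(answers, weights):
    """Return (winning_answer, weighted_confidence).

    answers: dict mapping model name -> extracted answer
    weights: dict mapping model name -> vote weight
    Confidence is the winning answer's weight divided by the total weight.
    """
    totals = defaultdict(float)
    for model, answer in answers.items():
        totals[answer] += weights[model]
    winner = max(totals, key=totals.get)
    total_weight = sum(weights[model] for model in answers)
    return winner, totals[winner] / total_weight

# Hypothetical weights and outputs for three models.
weights = {"model_a": 1.2, "model_b": 1.0, "model_c": 1.0}
answers = {"model_a": "INV-001", "model_b": "INV-001", "model_c": "INV-007"}
winner, confidence = weighted_vote(answers, weights)
```

Here the two agreeing models carry 2.2 of the 3.2 total weight, so the score is about 0.69 instead of the unweighted 0.67.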
Score calibration: align voting score with real accuracy
Raw voting percentages can drift from true accuracy (e.g., 80% vote → 70% actual).
Apply Platt scaling, which fits a logistic function on labeled outcomes, to map raw voting scores to calibrated probabilities that track real accuracy more closely.
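The calibration step can be sketched as fitting p = sigmoid(a * score + b) on past invoices where the correct answer is known. This is a simplified illustration: it uses plain gradient descent and omits the target smoothing of the original Platt method, and the validation data, function names, and hyperparameters are all made up for the example.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.5, epochs=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss.

    scores: raw voting scores in [0, 1]
    labels: 1 if the voted answer was actually correct, else 0
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw voting score to a calibrated probability."""
    return _sigmoid(a * score + b)

# Hypothetical validation set: raw vote share and whether it was correct.
scores = [0.4, 0.4, 0.6, 0.6, 0.6, 0.8, 0.8, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0]
labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
a, b = fit_platt(scores, labels)
calibrated_80 = calibrate(0.8, a, b)
```

The fitted curve needs to be refreshed whenever the models or the invoice mix change, since the mapping is only as good as the labeled data behind it.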
4. Unsolved challenges – two temporary work‑arounds
1. Long‑text parsing
Problem: Long fields (e.g., full address) are expressed inconsistently across models.
Temporary fix: Split the text into atomic parts (city, street, number), vote on each piece, then aggregate.
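The split‑then‑vote workaround might look like the sketch below. The field names, the sample address, and the light normalization are hypothetical, and taking the minimum field confidence as the overall score is an assumption: the article does not say how the per‑field votes are aggregated.

```python
from collections import Counter

def vote_per_field(parsed_addresses):
    """Vote on each atomic field separately.

    parsed_addresses: list of dicts, one per model, all with the same keys.
    Returns {field: (winning_value, confidence)}.
    """
    result = {}
    for field in parsed_addresses[0]:
        # Light normalization so trivial formatting differences still agree.
        values = [p[field].strip().lower() for p in parsed_addresses]
        winner, votes = Counter(values).most_common(1)[0]
        result[field] = (winner, votes / len(values))
    return result

def aggregate(field_results):
    """Assumption: the weakest field bounds the confidence of the whole address."""
    return min(conf for _, conf in field_results.values())

# Hypothetical outputs from five models for the same address.
parsed = [
    {"city": "Springfield", "street": "Main Street", "number": "12"},
    {"city": "springfield", "street": "Main Street", "number": "12"},
    {"city": "Springfield", "street": "Main Street ", "number": "12"},
    {"city": "Springfield", "street": "Main St.", "number": "12"},
    {"city": "Springfield", "street": "Main Street", "number": "14"},
]
fields = vote_per_field(parsed)
overall = aggregate(fields)
```

Voting on short atomic values avoids penalizing models that agree on the content but phrase the full string differently.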
2. Score granularity
Problem: With 7 models, each vote moves an unweighted score by about 14% (1/7), so the only result that clears a 95% threshold is unanimous agreement (6 of 7 is only about 86%).
Temporary fix: Prompt each model with five different questions, expanding to 35 votes (≈3% steps).
Drawback: Cost increases fivefold; long‑term solution requires cheaper models.
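The granularity arithmetic is easy to check for unweighted votes. The helper names below are hypothetical; the 7‑model, 35‑vote, and 95% figures come from the text above.

```python
import math

def score_step(n_votes):
    """Smallest possible change in an unweighted voting score."""
    return 1 / n_votes

def min_votes_needed(n_votes, threshold):
    """Fewest agreeing votes whose share reaches the threshold."""
    return math.ceil(threshold * n_votes)

# 7 models, one prompt each: only a unanimous result reaches 95%.
step_7 = score_step(7)                   # about 0.143
needed_7 = min_votes_needed(7, 0.95)

# 7 models x 5 prompts = 35 votes: one dissenting vote is tolerated.
step_35 = score_step(35)                 # about 0.029
needed_35 = min_votes_needed(35, 0.95)
```

With 35 votes, 34 agreeing votes give a score of about 97%, so the threshold can be met without requiring every single vote to match.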
5. Three core takeaways
Select the method based on the scenario : In high‑risk, compliance‑driven domains, a transparent “majority voting” approach beats opaque complex models.
Details matter : Without weighting and calibration, a simple vote yields unreliable scores.
Embrace imperfection and iterate : Use temporary fixes for long texts or granularity, then refine as better solutions become available.
If your team is building GenAI for serious domains, start with a “majority voting” confidence score: Spotify’s experiments have already ruled out the two alternatives for you.
Continuous Delivery 2.0
Tech and case studies on organizational management, team management, and engineering efficiency