How Spotify Uses Multi‑Model Voting to Give GenAI a Trustworthy Confidence Score

Spotify’s finance engineering team tackled the challenge of reliable GenAI outputs for invoice parsing by testing three confidence‑scoring methods, discarding two, and refining a multi‑model voting approach with weighted votes, calibration, and practical implementation details for high‑risk, SOX‑compliant scenarios.

Continuous Delivery 2.0

Sep 1, 2025

If your team uses GenAI in “cannot‑fail” scenarios (finance, law, healthcare), you will run into a crucial problem: how do you determine whether GenAI’s output is trustworthy?

Spotify’s finance engineering team recently shared a practical case: they automated global invoice parsing with GenAI, but because financial processes must meet SOX compliance, they added a “confidence score” that decides whether a result is auto‑approved or routed to human review.

This article does not discuss complex theory; it distills reusable experience: selecting the optimal method from three candidates, implementation details, and remaining challenges.

1. Why “confidence score” is mandatory in serious scenarios

Regulatory requirement: Finance must comply with SOX, so the team cannot rely on a “feel‑good” AI whose outputs lack clear supporting evidence.

Low tolerance for errors: A single invoice‑parsing mistake can cause accounting mismatches and compliance risk.

Human hand‑off needed: AI alone is insufficient, so the score acts as a switch that separates “AI‑handled” cases from “human‑reviewed” ones, which keeps human effort on the uncertain invoices and improves efficiency.
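The “switch” role of the score can be sketched in a few lines. This is a minimal illustration, not Spotify’s implementation: the ParsedField type, the route function, and the default threshold are all hypothetical names and values (the 0.95 default simply mirrors the 95% threshold mentioned later in this article).

```python
from dataclasses import dataclass

# Hypothetical default; the real threshold is a business and compliance decision.
AUTO_APPROVE_THRESHOLD = 0.95


@dataclass
class ParsedField:
    """One extracted invoice field together with its confidence score."""
    name: str
    value: str
    confidence: float  # expected to lie in [0, 1]


def route(field: ParsedField, threshold: float = AUTO_APPROVE_THRESHOLD) -> str:
    """Return 'auto_approve' if the score clears the threshold, else 'human_review'."""
    if not 0.0 <= field.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return "auto_approve" if field.confidence >= threshold else "human_review"
```

Keeping the decision in one small function makes the rule easy to show to an auditor: the only inputs are the score and the threshold.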

2. Three confidence‑scoring methods tested – two eliminated, one kept

Spotify tried three mainstream methods and kept only one.

Method 1 (discarded): Calibrator model (AI evaluates AI)

Approach: Use an additional GenAI model to score the primary model’s output.

Pros: The judgment is independent of the primary model, and the scorer can learn from human feedback.

Fatal issues:

The score is opaque: the scorer returns, say, “80” with no explanation, which is unacceptable for compliance.

The score is unstable: the same output can receive scores that differ by more than 10 points, which is unsuitable for finance.
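One way to make the instability concrete is to score the same output several times and look at the spread. The sketch below assumes the scores have already been collected; the function names, the sample scores, and the 10‑point tolerance are illustrative, not taken from Spotify’s write‑up.

```python
def score_spread(scores: list[float]) -> float:
    """Difference between the highest and lowest score given to one output."""
    if not scores:
        raise ValueError("need at least one score")
    return max(scores) - min(scores)


def is_stable(scores: list[float], max_spread: float = 10.0) -> bool:
    """True if repeated scores for the same output stay within max_spread points."""
    return score_spread(scores) <= max_spread


# Hypothetical scores from asking the same scorer about the same output five times.
repeated_scores = [80.0, 86.0, 74.0, 91.0, 79.0]
print(score_spread(repeated_scores))  # 17.0
print(is_stable(repeated_scores))     # False
```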

Method 2 (discarded): Log‑probability

Approach: Average the model’s token‑level log‑probabilities into a single confidence score.

Pros: It uses low‑level data straight from the model, so it appears objective.

Fatal issues: The score does not correlate with actual accuracy; a high‑scoring output can still be wrong.
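For reference, a common way to turn token log‑probabilities into one number is the geometric mean of the token probabilities. The source only says “an average score”, so the exact averaging used by Spotify is an assumption here, and the function name is hypothetical.

```python
import math


def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability, computed as exp(mean(log-probabilities)).

    Assumes token_logprobs holds natural-log probabilities (values <= 0),
    one per generated token, as many model APIs can return.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities for a three-token answer.
confidence = mean_logprob_confidence([-0.05, -0.10, -0.02])
```

A score like this measures how fluent the model found its own text, not whether the extracted value matches the invoice, which is consistent with the poor correlation reported above.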

Method 3 (kept): Majority voting (multi‑model consensus)

Approach: Run several different GenAI models on the same invoice; the confidence is the proportion of models that agree on the answer.

Why kept:

The score correlates strongly with accuracy: the more models agree, the more likely the answer is correct.

The logic is simple and easy to explain to compliance teams.

Results are stable as long as the models and the data stay unchanged.
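The core of the method fits in a few lines. The sketch below is a minimal unweighted version; the normalization step (trimming whitespace and lowercasing) is an assumption added so that trivially different spellings count as the same answer, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of models that gave it."""
    if not answers:
        raise ValueError("need at least one answer")
    # Assumed normalization: ignore case and surrounding whitespace.
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Hypothetical invoice numbers returned by five different models.
answers = ["INV-1001", "INV-1001", "inv-1001 ", "INV-1007", "INV-1001"]
print(majority_vote(answers))  # ('inv-1001', 0.8)
```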

3. Implementing majority voting – three essential details

“Majority voting” sounds simple, but Spotify identified three key optimizations.

Model count: 5–6 models is optimal

The literature suggests that 4–7 models balance diversity against cost.

With fewer than 5 models, a wrong majority becomes more likely (“majority‑wrong” failures).

With more than 6 models, cost rises with little accuracy gain.

Spotify settled on 5–6 models from different vendors to avoid homogeneous errors.
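A back‑of‑the‑envelope calculation shows why adding models helps at first and then flattens out. It rests on a strong simplifying assumption that is not from the source: every model is correct independently with the same probability p. Real models share training data and failure modes, which is exactly why the article stresses using different vendors, so treat the numbers as an idealized upper bound. The function name is hypothetical.

```python
from math import comb


def p_majority_correct(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct,

    assuming each model is correct with the same probability p.
    """
    need = n_models // 2 + 1  # smallest strict majority
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(need, n_models + 1)
    )


for n in (3, 5, 7):
    print(n, round(p_majority_correct(n, 0.8), 4))
# 3 0.896
# 5 0.9421
# 7 0.9667
```

Under these assumptions, going from 3 to 5 models adds about 4.6 points, while going from 5 to 7 adds about 2.5, in line with the diminishing returns described above.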

Weighted voting: more accurate models have higher influence

Not all models are equally reliable (e.g., Model A 90% accuracy vs. Model B 80%).

Weight each model by historical accuracy (A’s vote counts 1.2, B’s counts 1.0) and compute a weighted confidence score.
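A weighted vote can be sketched as follows. The source gives only the example weights (1.2 and 1.0); defining the confidence as the winning answer’s share of the total weight is an assumption, as are the function name, the model names, and the sample amounts.

```python
from collections import defaultdict


def weighted_vote(answers: dict[str, str], weights: dict[str, float]) -> tuple[str, float]:
    """Return the answer with the largest total weight and its share of all weight.

    answers maps model name -> answer; weights maps model name -> vote weight.
    """
    if not answers:
        raise ValueError("need at least one answer")
    totals: dict[str, float] = defaultdict(float)
    for model, answer in answers.items():
        totals[answer] += weights[model]
    winner = max(totals, key=lambda answer: totals[answer])
    total_weight = sum(weights[model] for model in answers)
    return winner, totals[winner] / total_weight


# Hypothetical models: A is historically more accurate, so its vote counts 1.2.
weights = {"A": 1.2, "B": 1.0, "C": 1.0}
answers = {"A": "4200.00", "B": "4200.00", "C": "4800.00"}
winner, confidence = weighted_vote(answers, weights)  # "4200.00", 2.2 / 3.2 = 0.6875
```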

Score calibration: align voting score with real accuracy

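Raw voting percentages can drift from true accuracy (e.g., answers with an 80% vote share may be correct only 70% of the time).

Apply Platt scaling, which fits a logistic (sigmoid) curve to the raw scores using labeled outcomes, to map voting scores to calibrated values so that the confidence better reflects real accuracy.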
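Platt scaling fits a sigmoid, p = 1 / (1 + exp(-(a·s + b))), to pairs of raw score s and observed correctness. The sketch below fits a and b with plain gradient descent on the log loss so that it needs no third‑party library. It is a simplified illustration: Platt’s original method also smooths the target labels, which is omitted here, and the function names, training data, learning rate, and epoch count are all made up.

```python
import math


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores: list[float], labels: list[int],
              lr: float = 0.5, epochs: int = 5000) -> tuple[float, float]:
    """Fit a and b in p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            error = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += error * s
            grad_b += error
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw voting score to a calibrated probability of being correct."""
    return _sigmoid(a * score + b)


# Hypothetical history: raw vote share and whether the answer was actually correct.
raw = [0.6, 0.6, 0.6, 0.8, 0.8, 0.8, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0]
correct = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]
a, b = fit_platt(raw, correct)
calibrated_80 = calibrate(0.8, a, b)
```

In practice the fit needs a held‑out set of human‑verified invoices, and it has to be redone whenever the model line‑up or the weights change.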

4. Unsolved challenges – two temporary work‑arounds

1. Long‑text parsing

Problem: Long fields (e.g., full address) are expressed inconsistently across models.

Temporary fix: Split the text into atomic parts (city, street, number), vote on each piece, then aggregate.
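The split‑then‑vote idea might look like the sketch below. The source does not say how the per‑part votes are aggregated, so taking the minimum part confidence (the field is only as trustworthy as its weakest part) is an assumption, as are the function name, the normalization, and the sample address.

```python
from collections import Counter


def vote_per_part(parsed: list[dict[str, str]]) -> tuple[dict[str, str], float]:
    """Vote on each atomic part separately, then aggregate.

    parsed holds one dict per model, e.g. {"city": ..., "street": ..., "number": ...}.
    Returns the winning value per part and the minimum per-part agreement.
    """
    if not parsed:
        raise ValueError("need at least one model output")
    result: dict[str, str] = {}
    confidences: list[float] = []
    for part in parsed[0]:
        # Assumed normalization: ignore case and surrounding whitespace.
        values = [p.get(part, "").strip().lower() for p in parsed]
        winner, count = Counter(values).most_common(1)[0]
        result[part] = winner
        confidences.append(count / len(values))
    return result, min(confidences)


# Hypothetical outputs from three models for the same address.
outputs = [
    {"city": "Springfield", "street": "Main Street", "number": "12"},
    {"city": "springfield", "street": "Main Street", "number": "12"},
    {"city": "Springfield", "street": "Main St.", "number": "12"},
]
address, confidence = vote_per_part(outputs)  # street agreement is 2/3, so confidence is 2/3
```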

2. Score granularity

Problem: With 7 models, each vote moves the score by about 14% (1/7), so the highest score short of unanimity is about 86% and a 95% threshold can only be met by full agreement.

Temporary fix: Prompt each model with five different questions, which expands the pool to 35 votes (steps of about 3%).

Drawback: Cost increases fivefold; a long‑term solution requires cheaper models.
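The granularity arithmetic is easy to check. The sketch assumes unweighted votes, and the helper names are hypothetical.

```python
def step_size(n_votes: int) -> float:
    """Change in the score caused by a single vote."""
    return 1 / n_votes


def min_votes_needed(n_votes: int, threshold: float) -> int:
    """Smallest number of agreeing votes whose share reaches the threshold."""
    return next(k for k in range(n_votes + 1) if k / n_votes >= threshold)


for n in (7, 35):
    k = min_votes_needed(n, 0.95)
    print(n, round(step_size(n), 3), k, round(k / n, 3))
# 7 0.143 7 1.0
# 35 0.029 34 0.971
```

With 7 votes nothing short of unanimity reaches 95%; with 35 votes a single dissenting vote is tolerated (34/35 ≈ 97.1%), at five times the number of model calls.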

5. Three core takeaways

Select the method based on the scenario: In high‑risk, compliance‑driven domains, a transparent “majority voting” approach beats opaque, complex models.

Details matter: Without weighting and calibration, a simple vote yields unreliable scores.

Embrace imperfection and iterate: Use temporary fixes for long texts and score granularity, then refine as better solutions become available.

If your team is building GenAI for serious domains, start with a “majority voting” confidence score – Spotify has already cleared the first two hurdles.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
