Voice Translation Auto Scoring

Voice Translation Auto Scoring evaluates 100% of your team's translated calls. It compares the original and translated speech transcripts, measures how much of the conversation was affected by translation issues, and gives each call an aggregated score.

The goal is to evaluate how effectively meaning was preserved across languages and to identify where translation mistakes caused confusion or required a speaker to repeat or clarify information. This gives admins the data to monitor every call and improve performance at scale.

How the call score is calculated

When scoring the accuracy of a translated call, Auto Scoring focuses on the impact of the issues that occurred. The system follows a three-step process:

  1. Identify and categorize the issue

    • Identify the issue: The system pinpoints the exact moment a transcription or translation mistake occurred.
    • Define the affected segment: It marks the entire conversation segment impacted by the error, including the mistake itself and any sentences used to repeat or clarify information until the meaning was clear.
    • Label by impact: The system categorizes the issue and assigns it a weight based on its level of impact: Intent accuracy > Entity accuracy > Conversation flow > Native fluency. This ensures the score reflects the most impactful mistranslations.

      Read more about each type and their weights below.

  2. Measure the disruption

    The system then measures the percentage of the full call's conversation affected by each type of issue.

  3. Calculate the final score

    The system calculates the final call accuracy percentage as a weighted average across all issues. Each issue type has a specific weight, so high-impact errors contribute more to the score (see the sketch below).
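
For illustration, here is a minimal sketch in Python of how such a calculation could work. The 35/30/25/10 weights come from the table in the next section; the segment data, the names used, and the exact way per-issue percentages combine into the final score are assumptions made for this sketch, not the product's actual implementation.

```python
# Minimal sketch of the three-step scoring flow described above.
# The weights come from this article; the data model and the exact
# combination formula are illustrative assumptions.

# Step 1 output: each issue type carries a weight reflecting its impact.
WEIGHTS = {
    "intent_accuracy": 0.35,
    "entity_accuracy": 0.30,
    "conversation_flow": 0.25,
    "native_fluency": 0.10,
}

# Hypothetical affected segments: (issue type, seconds of conversation
# impacted, including any repeats and clarifications that followed).
segments = [
    ("entity_accuracy", 24.0),    # e.g., a misheard street number
    ("conversation_flow", 48.0),  # e.g., delays that caused talk-overs
]

call_duration = 480.0  # total seconds of conversation in the call

# Step 2: measure the share of the call impacted by each issue type.
impacted = {issue: 0.0 for issue in WEIGHTS}
for issue, seconds in segments:
    impacted[issue] += seconds / call_duration

# Step 3: weighted average across issue types. Each type contributes the
# share of the call it did NOT disrupt, scaled by its weight, so
# high-impact categories pull the score down harder.
score = sum(w * (1.0 - impacted[issue]) for issue, w in WEIGHTS.items())

print(f"Call accuracy: {score:.0%}")  # Call accuracy: 96%
```

In this reading, a category drags the score down in proportion to both its weight and the share of the call it disrupted; categories with no issues contribute their full weight.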

Translation issue labels

VT Auto Scoring evaluates translated calls across four criteria, each contributing a different weight to the final score based on its impact on the call.

  #  Metric             Weight  Focus Area
  1  Intent accuracy    35%     Did the meaning stay the same? Checks for changes in core actions (e.g., “cancel” vs “reschedule”) or polarity (“want” vs “don’t want”).
  2  Entity accuracy    30%     Were the details correct? Focuses on critical facts like names, dates, addresses, numbers, and technical terminology.
  3  Conversation flow  25%     Was the call smooth? Flags issues like garbled text, abrupt endings, or translation delays that cause interruptions.
  4  Native fluency     10%     Did it sound professional? Evaluates if phrasing is too literal; only flags issues that did not cause confusion.

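For a sense of how the weights play out: in a hypothetical call where entity issues affect 5% of the conversation and conversation flow issues affect 10% (with no intent or fluency issues), the weighted-average reading sketched above gives 0.35 × 100% + 0.30 × 95% + 0.25 × 90% + 0.10 × 100% = 96% accuracy. The numbers are illustrative; the point is that a higher-weight category lowers the score more for the same share of affected conversation.
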
  1. Intent accuracy (35% of score)

    Getting the meaning right: This checks if the system accurately captured and translated what the speaker actually meant.

    An error is flagged when the translation changes the core message. For example:

    • Flipped meanings: an original don’t want is translated into want.
    • Changed actions: an original cancel is translated into reschedule.
    • Shifted tone: a polite request like could you is translated into a command like you must.
    • Context errors: the translation says something that makes no sense based on the overall conversation context.
  2. Entity accuracy (30% of score)

    Getting the details right: This ensures key details were both transcribed and translated correctly.

    What it looks for: Accuracy of names, numbers, dates, addresses, specific terminology, etc.

  3. Conversation flow (25% of score)

    Smoothness of the call: This measures how natural and fluid the interaction felt.

    What it looks for: Garbled or nonsense words, sentences ending abruptly, or translation issues that cause talk-overs and awkward pauses.

  4. Native fluency (10% of score)

    Sounding natural: This checks whether the translation sounds right and feels professional for the situation.

    What it looks for: Phrasing that feels stiff or too word-for-word, a tone that is too direct for the situation, or culturally inappropriate wording.

      Info

    This metric only flags issues that did not cause confusion. If an issue disrupted the conversation or required clarification, it is assigned to Conversation flow instead.
