Hello everyone!
These days, it's common to recreate a single piece of video content in multiple languages to serve a global audience. This process is generally known as content localization.
At the core of localization lies transcription—the process of extracting spoken dialogue from a video and converting it into text. Since this involves turning video into text, the process is often referred to as VTT (Video-to-Text).
There are many services that use AI to provide VTT capabilities, and LETR WORKS is one of them.
Today, we’d like to share the results of a recent in-house test we conducted to evaluate the accuracy of VTT solutions, including LETR WORKS, and offer some practical tips on how to make the most of these tools.
📼 How to Use a VTT Solution with LETR WORKS
Let’s walk through how a VTT solution works using LETR WORKS as an example.
To begin, users can upload a video file using the "New Project" feature. LETR WORKS supports MP4 format, the most widely used video format.
<LETR WORKS Video Upload Interface-1>
<LETR WORKS Video Upload Interface-2>
Once uploaded, you can configure various settings such as the language and audio processing options according to the guidelines. After a short processing time, the transcript will be generated automatically.
<Language and setting configuration for transcription>
For our test, we used a 4-minute Korean-language video. In addition to LETR WORKS, we also tested another popular VTT solution from a global provider, which we’ll refer to as Solution A. Like LETR WORKS, it offered a user-friendly and intuitive interface.
📊 Accuracy Results and Key Takeaways
The test video contained a total of 621 words. LETR WORKS produced 40 errors, resulting in an accuracy of 93.6%, while Solution A had 33 errors, achieving an accuracy of 94.7%.
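For readers who want to check the arithmetic, here is a minimal sketch of how these percentages follow from the counts above. It assumes accuracy is simply the share of words without an error, (1 - errors / total words) × 100. The function name `transcription_accuracy` is our own choice for illustration and is not part of any LETR WORKS API.

```python
def transcription_accuracy(total_words: int, errors: int) -> float:
    """Return word-level accuracy as a percentage.

    Assumes accuracy = (1 - errors / total_words) * 100, i.e. every
    error counts against exactly one word of the reference transcript.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= errors <= total_words:
        raise ValueError("errors must be between 0 and total_words")
    return (1 - errors / total_words) * 100


# Counts reported in this post: 621 words in the test video.
TOTAL_WORDS = 621
error_counts = {"LETR WORKS": 40, "Solution A": 33}

for name, errors in error_counts.items():
    print(f"{name}: {transcription_accuracy(TOTAL_WORDS, errors):.1f}%")
# prints:
# LETR WORKS: 93.6%
# Solution A: 94.7%
```

With a 621-word sample, a gap of seven errors is about one percentage point, so the two results are best read as roughly comparable rather than as a decisive ranking.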
Reaching 100% accuracy would mean a dramatic shift in workflow efficiency, as it would eliminate the need for any manual review.
However, since we're not quite there yet, someone still needs to verify the transcription for errors—meaning review remains essential.
That said, by analyzing the types of errors, we discovered some important insights. The majority of mistakes fell into two categories:
- Incorrect word recognition
- Misrecognition of proper nouns
Interestingly, most commonly used words were recognized accurately. The errors primarily occurred with neologisms, compound terms, and proper nouns like people's names or organization names.
To reduce these types of errors, you can use a Term Base feature to predefine how the AI should interpret specific words and phrases—particularly new or brand-specific vocabulary. This function is useful not only for transcription but also for translation workflows.
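To make the idea concrete, here is a small sketch of what a term base can look like when applied as a post-processing step on a finished transcript. This is only an illustration under our own assumptions: it is not how the Term Base feature in LETR WORKS is implemented, and the dictionary entries and the function name `apply_term_base` are hypothetical. The examples are in English for readability. Korean text, where particles attach directly to nouns, would need different matching rules than the word boundaries used here.

```python
import re
from typing import Mapping

# Hypothetical term base: each key is a misrecognized variant, each value
# is the preferred spelling. These entries are invented for illustration.
TERM_BASE = {
    "letter works": "LETR WORKS",
    "later works": "LETR WORKS",
    "term bass": "Term Base",
}


def apply_term_base(transcript: str, term_base: Mapping[str, str]) -> str:
    """Replace known misrecognitions with their preferred terms.

    Matching is case-insensitive and limited to whole words/phrases.
    """
    if not term_base:
        return transcript
    # Try longer variants first so a long phrase wins over a shorter overlap.
    variants = sorted(term_base, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    lookup = {variant.lower(): term for variant, term in term_base.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], transcript)


raw = "We uploaded the video to Letter Works and enabled the term bass."
print(apply_term_base(raw, TERM_BASE))
# prints: We uploaded the video to LETR WORKS and enabled the Term Base.
```

A correction list like this only fixes variants you have already seen, which is why a term base that guides recognition up front is generally the more robust option when the tool supports it.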
While neither system reached 100%, accuracy of roughly 94 to 95% in our test suggests that VTT solutions are already practical for real work and are improving quickly, thanks to rapid advancements in AI technology.
🤖 LETR WORKS' Approach and Commitment
LETR WORKS is built to integrate seamlessly with speech recognition APIs from global AI providers.
We’re committed to choosing only the most advanced voice recognition models, and we continuously monitor performance to ensure our users enjoy greater efficiency and convenience in their work.
Thanks for reading!
We hope this post helps you better understand the current capabilities and practical use of VTT solutions for multilingual video content.