KIDOU speech recognition in comparison with Azure, Google Cloud, and OpenAI Whisper

Abstract

KIDOU stands out significantly in German speech recognition compared to the tested models: OpenAI Whisper, NVIDIA Nemo, Google Cloud, and Azure.

Especially in recognizing numbers and technical vocabulary, KIDOU demonstrates superior precision with the lowest error rates in the test.

Additionally, the model impresses with high speed, resource-efficient architecture, and flexible deployment options – from cloud environments to mobile devices. These features make KIDOU a powerful solution for demanding speech recognition applications.

Speech recognition has become an integral part of modern life. Whether in medicine, industry, or technical applications, precise and efficient solutions save time, facilitate documentation, and open new possibilities for process automation.

But which speech models deliver the best results?

This article examines KIDOU, our custom speech recognition model that can be deployed locally, in the cloud, and on mobile devices. We compare it with well-known models such as OpenAI Whisper and NVIDIA Nemo and with the cloud services from Google and Azure, and show how KIDOU performs in speed, accuracy, and versatility, including on numbers and on medical and technical terminology.

The test conditions and datasets were designed to be as realistic and application-relevant as possible. Alongside general speech data, we included medical and technical scenarios, as well as the often underestimated challenge of precise number recognition.

What makes KIDOU speech recognition stand out?

KIDOU speech models are characterized by four key features:

Speed:

KIDOU processes speech extremely fast: a 10-second recording is transcribed in just 0.2 seconds.

This means processing is up to 50 times faster than real time, that is, than the duration of the recording itself.
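The speed figure can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of how such a number is usually expressed (as a real-time factor and its inverse, the speed-up); the function names are hypothetical and are not part of any KIDOU API.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; values below 1 beat real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(processing_seconds: float, audio_seconds: float) -> float:
    """How many times faster than real time the transcription runs."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Figures quoted in the article: 10 s of audio transcribed in 0.2 s.
rtf = real_time_factor(0.2, 10.0)
factor = speedup(0.2, 10.0)
print(f"RTF = {rtf:.2f}, speed-up = {factor:.0f}x")
```

A real-time factor of 0.02 and a speed-up of 50 describe the same measurement from two directions; actual values will depend on hardware and model size.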

Compact sizes:

KIDOU is available in 250 MB and 40 MB versions.

Additional sizes can be tailored to specific needs, making KIDOU a perfect fit for various applications.

Data Security:

The models run not only in the cloud but also on-premises and locally on mobile devices like smartphones and laptops – an essential feature for data-sensitive applications.

Resource Efficiency:

KIDOU models are optimized to function seamlessly even on devices with limited processing power.

They deliver fast and precise results without excessive battery or hardware strain, even on smartphones or laptops.

With KIDOU, we offer a solution that is both technologically advanced and flexibly adaptable, a combination that is particularly attractive to companies and organizations with specific requirements.

Test Design

For our comparison, the speech models were tested under realistic conditions:

Comparison Models:

  • Whisper (openai/whisper-large-v3-turbo)
  • Nemo (RNNT-Hybrid-Model)
  • Google Cloud (as of August 2024)
  • Azure Cloud (as of August 2024)
  • KIDOU (our model, as of July 2024)

Test Datasets:

  • Mozilla Common Voice (Delta V19, 18 September 2024):
    Open speech data for general tests.
  • Medical Audio Data:
    Internal data from the medical field (e.g., “The patient has shown clear signs of endocarditis for five days.”)
  • Technical Audio Data:
    Speech recordings with specialized technical vocabulary (e.g., “Dust boot tie rod end torn.”)
  • Spoken Numbers:
    Numbers in various formats (e.g., phone numbers, decimal numbers, years).

This combination provides a comprehensive overview of the strengths and weaknesses of each model across different application areas.

Why different Datasets?

Each speech recognition model has different strengths and weaknesses, and each application has different requirements: fields differ in their terminology, and recording conditions and background noise vary. This article aims to explore where those strengths and weaknesses lie, so the tables below list error rates separately for each dataset and model.

Test results

Results of internal tests

Our test results are based on internal tests under specific conditions and may vary depending on the application scenario. The comparative models tested were evaluated in the versions specified in each case.

Test conditions and methodology

Specific data sets were analyzed in order to evaluate the performance of the models in different scenarios. This showed that each model has individual strengths and weaknesses that cannot be captured in their entirety in this test.

Important note

This test does not claim to present one model as fundamentally better or superior. Rather, it is intended to provide an insight into the results under the given test conditions and highlight the different focuses of the models.

 

Word Error Rate (WER):
Measures transcription accuracy as the share of word-level errors (substituted, missing, and inserted words) relative to the number of words in the reference transcript. Lower values are better.

Dataset | KIDOU | KIDOU-Technical | Nemo | Whisper | Google Cloud | Azure Cloud
Mozilla Common Voice DE | 8.44% | 8.92% | 6.46% | 10.02% | 15.78% | 9.80%
Medical Dataset | 8.42% | 9.58% | 13.69% | 21.57% | 23.79% | 12.29%
Technical Dataset | 23.99% | 4.15% | 38.93% | 39.35% | 22.59% | 28.30%
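For readers who want to compute WER on their own recordings, here is a minimal sketch of the standard calculation: the word-level edit distance between reference and transcript, divided by the number of reference words. The text normalization (lowercasing, splitting on whitespace) is an assumption made for this example and is not necessarily the one used in our tests; the faulty transcript is invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, kept one row at a time.
    # 'previous' is the row for the reference prefix that is one word shorter.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference word is missing
                current[j - 1] + 1,      # insertion: extra word in the transcript
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# Reference sentence from the medical dataset example; the transcript with
# one wrong word is made up for this sketch.
reference = "The patient has shown clear signs of endocarditis for five days"
hypothesis = "The patient has shown clear signs of endocardium for five days"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2%}")
```

One wrong word in an eleven-word sentence gives a WER of 1/11, or about 9%. Because insertions count as errors, WER can exceed 100% for very poor transcripts.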

 

Number Error Rate (NER):
Evaluates the precision of number recognition, which is particularly important in applications where mistakes can have significant consequences.

Metric | KIDOU | KIDOU-Technical | Nemo | Whisper | Google Cloud | Azure Cloud
NER | 1.35% | 0.96% | 6.27% | 10.18% | 8.59% | 18.52%

(Lower values are better)
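This article does not spell out the exact formula behind the number error rate. As a rough illustration only, one plausible definition is the share of numbers in the reference text that do not appear, digit for digit, in the transcript. The sketch below implements that assumed definition; the regular expression, the function name, and the sample sentences are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a "number" is a run of digits, optionally with . or , separators.
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def number_error_rate(references: list[str], hypotheses: list[str]) -> float:
    """Share of reference numbers that are not reproduced exactly (assumed definition)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total = 0
    missed = 0
    for reference, hypothesis in zip(references, hypotheses):
        expected = Counter(NUMBER_PATTERN.findall(reference))
        found = Counter(NUMBER_PATTERN.findall(hypothesis))
        total += sum(expected.values())
        # Counter subtraction keeps only numbers the transcript failed to reproduce.
        missed += sum((expected - found).values())
    if total == 0:
        raise ValueError("references contain no numbers")
    return missed / total


# Invented sample sentences: the second transcript drops the decimal point.
references = ["Call 0123 456789 in 2024", "The dose is 2.5 milligrams"]
hypotheses = ["Call 0123 456789 in 2024", "The dose is 25 milligrams"]
ner = number_error_rate(references, hypotheses)
print(f"NER = {ner:.2%}")
```

Under this definition, one wrong number out of four yields 25%. A production metric would also need to decide how to treat numbers written out as words and differing digit groupings.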

Observations and Analysis

KIDOU’s strength in number recognition:

Numbers play a crucial role in many applications. The cloud services from Google and Azure show weaknesses in this area, with number error rates of 8.59% and 18.52%, while both KIDOU models stay below 1.5%, even with complex formats like decimal numbers.

Technical vocabulary:

The technical dataset highlights the importance of specialized models. KIDOU-Technical, which is specifically tailored for this use case, reaches a WER of 4.15%, compared with 22.59% for the next-best model in the test (Google Cloud).

Strong performance in general speech recognition:

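KIDOU performs well not only in specialized scenarios but also on general speech data: on Mozilla Common Voice, its WER of 8.44% is second only to Nemo (6.46%) among the tested models. This makes it highly versatile, from everyday speech recognition to niche applications.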
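The observations above can be checked directly against the published WER values. The following sketch copies the numbers from the WER table and ranks the models per dataset; it assumes that the values in each table row follow the order of the column headers.

```python
MODELS = ["KIDOU", "KIDOU-Technical", "Nemo", "Whisper", "Google Cloud", "Azure Cloud"]

# WER in percent, copied row by row from the table in this article.
WER_PERCENT = {
    "Mozilla Common Voice DE": [8.44, 8.92, 6.46, 10.02, 15.78, 9.80],
    "Medical Dataset": [8.42, 9.58, 13.69, 21.57, 23.79, 12.29],
    "Technical Dataset": [23.99, 4.15, 38.93, 39.35, 22.59, 28.30],
}


def rank_models(values: list[float]) -> list[tuple[str, float]]:
    """Pair each value with its model name and sort by ascending error rate."""
    return sorted(zip(MODELS, values), key=lambda pair: pair[1])


# Model with the lowest WER for each dataset.
best = {dataset: rank_models(values)[0][0] for dataset, values in WER_PERCENT.items()}

for dataset, values in WER_PERCENT.items():
    ranking = ", ".join(f"{model} {wer:.2f}%" for model, wer in rank_models(values))
    print(f"{dataset}: {ranking}")
```

The ranking shows a different leader on each dataset, which is why the results are reported per dataset rather than as a single average.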

Testing on mobile devices

We’ve seen that KIDOU’s recognition rates are impressive, but how does it perform on smartphones in terms of battery consumption?

A test on a Samsung S23 yielded remarkable results: one hour of continuous speech recognition consumed only 5% of the battery.

Even with intensive use, speech recognition therefore places little load on the battery, which is ideal for mobile applications where reliability and endurance are crucial.

Application scenarios

KIDOU also offers text comprehension and the extraction of structured information. KIDOU thus enables a wide range of use cases.

Conclusion

KIDOU models combine speed, precision, and versatility, making them an ideal solution for companies relying on robust speech recognition. With compact sizes, impressive recognition rates, and excellent number recognition, KIDOU offers clear advantages over competitors.

We are happy to optimize our KIDOU speech model for your specific use case.

Fara Sendjaja, Marketing Manager

KENBUN IT AG
Haid-und-Neu-Straße 7
76131 Karlsruhe
+49 721 781 503 02
office@kenbun.de
