New AI model introduced to combat voice fraud

Scientists from MTUCI and the AIRI Institute have proposed AASIST3, a new model for detecting fake, synthetically generated voices. The architecture placed among the top 10 solutions in the international ASVspoof 2024 Challenge. The model can be used to counter voice fraud and to increase the security of systems that rely on voice authentication.

Voice biometric systems, also known as automatic speaker verification (ASV) systems, identify people by the characteristics of their voice. They are used to authenticate users for financial transactions, to control exclusive access to smart devices, and to combat next-generation telephone fraud.

Voice recognition models can be vulnerable to adversarial attacks, in which a small, specially crafted change to the input audio significantly changes the model's output while remaining imperceptible or insignificant to a human listener. In search of ways to bypass security barriers, attackers have also learned to generate synthetic voices using text-to-speech (TTS) and voice conversion (VC). Countering such attacks effectively requires voice anti-spoofing systems.
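The adversarial effect described above can be illustrated with a toy example. This is only a sketch: the linear scorer `w`, the random "audio" `x` and the step size `eps` are all hypothetical stand-ins, not part of AASIST3 or of any real voice recognition system. It applies a sign-of-gradient perturbation, which for a linear scorer is simply the sign of the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear scorer: score = w . x (a stand-in, not a real ASV model).
w = rng.normal(size=16000)
# Stand-in for one second of audio sampled at 16 kHz.
x = rng.normal(scale=0.1, size=16000)


def score(signal):
    """Return the toy model's score for a signal."""
    return float(w @ signal)


# Shift every sample by at most eps in the direction that raises the score.
# For a linear scorer the gradient with respect to the input is w itself.
eps = 1e-3
x_adv = x + eps * np.sign(w)

max_change = np.max(np.abs(x_adv - x))   # largest per-sample change
score_shift = score(x_adv) - score(x)    # equals eps * sum(|w|) for this model

print(f"largest per-sample change: {max_change:.4f}")
print(f"change in model score:     {score_shift:.2f}")
```

Each sample moves by no more than `eps`, yet the score moves by `eps` times the sum of the absolute weights, which grows with the length of the input. That is why a change too small to hear can still be large from the model's point of view.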

The AASIST AI model for audio analysis was presented by a team of scientists from South Korea and France in 2021 and showed high reliability, confirmed by numerous studies. However, with the rapid development of generative AI after 2022, it is no longer sufficiently effective at detecting synthetic voices. Using AASIST as a base, the Intelligent Solutions team at MTUCI and the Trusted and Secure Intelligent Systems team at AIRI, with the participation of a Skoltech graduate student, developed a new architecture for identifying fake synthesized voices.

The use of a Kolmogorov-Arnold network (KAN), additional layers, pre-training, a better feature extractor, and special training functions made it possible to improve the model's performance more than twofold compared with the baseline solution. In addition, the new model generalizes better to new types of attacks.

“It is important to use modern neural network methods to counteract voice spoofing, because attackers are constantly improving their tools. TTS and VC technologies make it possible to create synthetic voices that are already very difficult to distinguish from real ones. The advantage of KAN networks lies in their ability to take into account the context and knowledge of voice data, which makes it possible to distinguish a genuine voice from a fake one more effectively. Such networks not only recognize fakes with high accuracy, but are also able to adapt to new types of threats. The introduction of such advanced methods significantly increases the level of security and protection against voice spoofing attacks,” noted Oleg Rogov, head of the “Trusted and Secure Intelligent Systems” research group at AIRI.

The problem of voice anti-spoofing can be approached in two ways. The first is binary classification: deciding whether the speech in an audio recording is genuine human speech or artificially generated. The second works in conjunction with a voice biometrics system: authorization must be granted when the real voice of Speaker A is presented, but not when the speech of Speaker B or artificial speech imitating Speaker A is presented. Creating the model and choosing a training approach was an iterative process: the researchers tested different hypotheses, selected the best ones, and tried to combine approaches to improve the quality metrics. These include EER (equal error rate, the operating point at which the type I error rate equals the type II error rate) and t-DCF (tandem detection cost function), which weighs the contributions of errors for different authorization scenarios. For both metrics, lower is better.
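The EER definition above can be made concrete with a short sketch. It is not the challenge's official scoring code: the function name `compute_eer`, the convention that a higher score means "more likely genuine", and the toy scores are assumptions made for illustration. The function sweeps candidate thresholds and picks the one where the false rejection rate and the false acceptance rate are closest.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate with a simple threshold sweep.

    Assumes higher scores mean "more likely genuine". Returns (eer, threshold).
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection: genuine speech scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance: spoofed speech scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2, thresholds[idx]


# Toy scores: one genuine sample scores low and one spoof scores high.
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
print(f"EER = {eer:.2%} at threshold {threshold}")  # EER = 25.00% at threshold 0.6
```

With four genuine and four spoofed samples, one error of each kind at the threshold 0.6 gives an error rate of 1/4 on both sides, hence an EER of 25%.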

On validation data, the researchers achieved a t-DCF of 0.2657, compared with 0.5671 for the original AASIST. On test data, whose speakers and attack types were not represented in the training and validation sets, the models showed a t-DCF of 0.5357 and an EER of 22.67% in the closed scenario (in which additional data and pre-trained models cannot be used), and a t-DCF of 0.1414 and an EER of 4.89% in the open competition scenario.
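As a quick check of scale, the validation figures can be compared directly. This sketch assumes the reported "more than twofold" improvement refers to the validation t-DCF, which the article does not state explicitly; the variable names are illustrative.

```python
# Validation t-DCF values quoted in the article (lower is better).
baseline_tdcf = 0.5671   # original AASIST
aasist3_tdcf = 0.2657    # AASIST3

# How many times smaller the new cost is than the baseline cost.
ratio = baseline_tdcf / aasist3_tdcf
# The same comparison expressed as a relative reduction.
relative_reduction = 1 - aasist3_tdcf / baseline_tdcf

print(f"improvement factor: {ratio:.2f}x")            # improvement factor: 2.13x
print(f"relative reduction: {relative_reduction:.1%}")  # relative reduction: 53.1%
```

On these numbers the cost falls by a factor of about 2.13, or roughly 53%, which is consistent with an improvement of more than two times.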

“AASIST3 demonstrates potential for practical applications in various fields, including the financial sector and telecommunications. The main goal of the development is to combat voice fraud and improve the security of systems using voice authentication. Integration into a business can be accomplished in a variety of ways, from deploying a separate software solution to integrating into existing security systems via an API. The need for such technologies is high given the growing threat of attacks using synthetic voices,” explained Hrach Mkrtchan, Head of the Scientific and Research Institute “Intelligent Solutions” at MTUCI.
