Audio Cloning Can Take Over a Phone Call in Real Time Without the Speakers Knowing

02-07-2024 • https://www.activistpost.com, By Joel McConvey

In the wake of a Hong Kong fraud case that saw an employee transfer US$25 million in funds to five bank accounts after a virtual meeting with what turned out to be audio-video deepfakes of senior management, the biometrics and digital identity world is on high alert, and the threats are growing more sophisticated by the day.

A blog post by Chenta Lee, chief architect of threat intelligence at IBM Security, breaks down how researchers from IBM X-Force successfully intercepted and covertly hijacked a live conversation, using a large language model (LLM) to understand the conversation and manipulate it for malicious purposes without the speakers knowing it was happening.

"Alarmingly," writes Lee, "it was fairly easy to construct this highly intrusive capability, creating a significant concern about its use by an attacker driven by monetary incentives and limited to no lawful boundary."

Hack used a mix of AI technologies and a focus on keywords
By combining large language models (LLMs), speech-to-text, text-to-speech and voice cloning, X-Force was able to dynamically modify the context and content of a live phone conversation. Rather than using generative AI to fabricate an entire fake voice, the method focused on replacing keywords in context, for example masking a real spoken bank account number with an AI-generated one. The attack can be delivered through a number of vectors, such as malware or compromised VoIP services. A three-second audio sample is enough to create a convincing voice clone, and the LLM takes care of parsing and semantics.
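
To make the "replace keywords in context" idea concrete, here is a minimal Python sketch of the text-level step only. It is not X-Force's implementation: the blog post describes an LLM doing the parsing, while this sketch swaps in a simple regular expression, and it leaves out speech-to-text, voice cloning and call interception entirely. The pattern, the function name and the sample sentences are all hypothetical, chosen for illustration.

```python
import re

# Assumption: any standalone run of 8 to 12 digits is treated as an
# account number. A real system would rely on context, not on length alone.
ACCOUNT_PATTERN = re.compile(r"\b\d{8,12}\b")


def replace_account_numbers(transcript: str, substitute: str) -> str:
    """Return the transcript with each account-like number replaced.

    Everything outside the matched digits is left untouched, which is the
    point of the keyword approach: most of the sentence stays genuine.
    """
    return ACCOUNT_PATTERN.sub(substitute, transcript)


original = "Please send it to 12345678 today."
print(replace_account_numbers(original, "00000000"))
# Please send it to 00000000 today.
```

The sketch shows why the approach is hard to notice: only a few characters of the sentence change, so the rest of the conversation remains authentic.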