Audio Jacking: Man-in-the-Middle Voice Attack

Key Points

  • A simple conversation about a bank account number illustrates “audio jacking,” where the listener hears a different number than the speaker intended, revealing the attack’s subtle manipulation.
  • X‑Force researcher Chenta Lee devised “audio jacking,” a new man‑in‑the‑middle (MITM) attack that intercepts and alters spoken audio in real time, and built a proof of concept to demonstrate it.
  • The attacker can gain the MITM foothold via malware on a device, exploitation of VoIP services, or a spoofed three‑way call combined with a deep‑fake voice clone of one participant.
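  • Once positioned, the interceptor captures each speaker’s audio, converts it to text with a speech‑to‑text engine, and passes the text to a large language model that checks whether it contains target content such as a bank account number. If not, the audio passes through unchanged; if so, the number is swapped, the altered text is synthesized in a clone of the speaker’s voice, and the result is injected back into the conversation.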
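  • Defending against audio jacking starts with skepticism about what you hear. For critical information, paraphrase and repeat it, confirm it out of band (ideally split across separate channels or devices), and follow standard hygiene: keep systems patched, avoid untrusted links and attachments, install apps only from trusted stores, and use multi‑factor authentication or passkeys.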
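
The interception logic in the Key Points above can be illustrated on text alone. The following is a minimal sketch, not the researchers' implementation: it skips the speech-to-text and voice-cloning stages entirely, and it swaps in a simple regular expression (`ACCOUNT_PATTERN`, a hypothetical name) where the talk uses a large language model to recognize an account number in context. The function name `process_utterance` is likewise an assumption made for illustration.

```python
import re

# Hypothetical stand-in for the LLM step: match a run of 7 to 12 digits,
# allowing a single space or hyphen between digits (e.g. "31415 29").
ACCOUNT_PATTERN = re.compile(r"\d(?:[ -]?\d){6,11}")

# The substitute number heard in the demo call.
ATTACKER_ACCOUNT = "8675309"


def process_utterance(text: str) -> tuple[str, bool]:
    """Return (text to forward, whether it was altered).

    Utterances without an account number pass through unchanged,
    which is why the call sounds normal until a number is spoken.
    """
    altered, count = ACCOUNT_PATTERN.subn(ATTACKER_ACCOUNT, text)
    if count:
        return altered, True
    return text, False
```

Note that the phrase "bank account number" on its own is passed through untouched; only an actual digit sequence triggers the substitution. That mirrors the distinction the talk draws between mentioning an account number and stating one, although a regular expression is far cruder than the context-aware detection described there.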

Full Transcript

**Source:** [https://www.youtube.com/watch?v=xHRIjmx1_Fs](https://www.youtube.com/watch?v=xHRIjmx1_Fs)
**Duration:** 00:12:52

## Sections

- [00:00:00](https://www.youtube.com/watch?v=xHRIjmx1_Fs&t=0s) **Audio Jacking Man‑in‑the‑Middle Attack** - The segment introduces a novel “audio jacking” threat where man‑in‑the‑middle malware intercepts a voice call and alters spoken information (like a bank account number), and shows how attackers can exploit this to deceive listeners.
[0:00] "Hey Martin, it was great seeing you at the conference yesterday." "Great seeing you too. Hey, Jeff, I want to pay you back for that pizza that we shared. Can you send me your bank account details?" "Yeah, sure thing. My account number is 3141529." "That's 8675309. Got it, thanks." "You bet. Take care."

[0:25] Okay, what just happened? You heard me say this number, and Martin wrote down a different number. Why did he do that? Does he have a hearing problem? No, he doesn't. In fact, what he wrote down was exactly what he heard. You just didn't hear his side of the conversation.

[0:40] Welcome to the world of audio jacking. Yep, it's a thing. This is a new type of attack that one of our X-Force researchers, Chenta Lee, came up with and built a proof of concept for. Let's take a look and see how it works and, ultimately, what you can do to protect yourself against it.

[0:55] Okay, so how did this thing work? We're going to start with a diagram to explain it. Here we have Martin (looks exactly like him, right?) and this strapping young lad, yours truly. And here we have the attacker. This guy becomes what we call a man in the middle. In other words, he inserts a control point between the two of us in our conversation.

[1:17] Now, how could he do that? There are a lot of different ways, but one of the simplest would be through insertion of malware. In other words, if he sends malware to my system (my phone, my PC, my laptop, whichever I'm using to make the call from), that could establish the man-in-the-middle positioning, because what he's going to need is an interceptor, and that's what this will do. By the way, that malware could be embedded in an app that I download from an app store, for instance, and that then puts the target in place.

[1:54] Another way would be to exploit voice over
IP calling. In that case, if someone is able to insert themselves in the middle of the conversation, they might be able to take control.

[2:06] Yet another option would be a three-way call, where the attacker calls me, spoofing the number to make it look like it came from Martin, and calls Martin, spoofing my number to make it look like it came from me, and then inserts a deepfake of my voice, a copy or clone of my voice, to start the conversation. That way, neither of us realizes the other one didn't initiate the call. So there are a number of different ways that this might initially get kicked off.

[2:35] But once the attacker has established his position, his foothold, then what happens? You remember that in the call, I said something like, "It's good to see you at the conference, Martin." This is where the interceptor component comes in. It intercepts what I've said and sends it down to another component, a speech-to-text translator. Basically, it takes the audio of what I said and turns it into text, into readable words. It then takes that information and sends it on into a large language model.

[3:20] Now, why a large language model? Because these things are really good at natural language processing, so they can understand the context of a conversation and not just pick out single words. An LLM can look at what I've just said, because it's been translated into text, analyze it, and see the meaning in what I'm saying. This LLM will be looking specifically for bank account number information. It's going to want to know if I told a bank account number. In the first thing that I said to Martin, I didn't say anything about it, so the
answer in that case is going to be no, and it's just going to take what I said, allow it to go through the interceptor, and pass it along unimpeded, unchanged. So what I said is in fact what Martin hears. Sounds normal.

[4:11] Here's where it gets interesting. Martin then answers me back, and what he says is, "Oh yeah, good to see you too, but what I'd like to do is pay you back for the pizza." Okay, fine. So the interceptor takes his words, translates them into text, and sends those to the large language model. He said in the message, "Send me your bank account number." Now, the large language model is going to be smart enough to realize that just the mention of the words "bank account number" is not the same thing as a bank account number, because LLMs understand natural language. So in that case, again, the answer is no, and his message is passed along back to me unimpeded. Again, everything acts normal.

[4:58] Here's where it gets dicey. What is going to happen next is I'm going to tell him my number, 3141529. That's going to go through the interceptor, it's going to turn that into text, it's going to go into the LLM, and the LLM is going to say, "Oh, he just told a bank account number. Not just the words, but he actually gave a bank account number." It's then going to take that information, and this is where the attack gets interesting, and pass it on down to a text-to-speech component, which turns the words back into speech. But what it's also going to do is take what I just said (remember, there was an account number in there), take that number out, and put something else in. And what's it going to put? It's going to put 8675309.

[5:48] That then gets passed on to a deepfake generator that has already been able to clone what my voice sounds like. How could you do that? Well, it turns out you can generate deepfakes with some
of these language models that can operate with as little as 3 seconds of a sample of your voice. Some of them need 30 seconds, and some need more, but the point is it's not hard to get 3 seconds or even 30 seconds of audio of a person and then be able to create a very lifelike clone or deepfake of their voice. So it's going to substitute that into the message.

[6:22] Now, all of this processing takes a little bit of time. How do we cover that? Well, there's a little bit of social engineering that we could insert. You didn't hear it in our call, but in the real proof of concept we would need to do this: it's going to generate a message in my voice that says, "Oh yeah, sure, hold on a second while I look up the number." That's really just a delay tactic so that we can do this processing. Once it's processed, it's going to actually send this account number, the one Martin is going to take.

[6:56] Now, in the meantime, because there would be a delay on my side as I wait for this to happen, it's going to generate a message to me in Martin's voice that says, "Hold on a second while I write it down." So now both of us have a reasonable expectation that the other is doing something, but we're waiting for just a little bit of time, and that's the time we need for this process to occur.

[7:18] Then, once Martin gets that information, he has the wrong account number. That wrong account number, of course, points to the attacker. He wires the money to the attacker, and the attacker has been successful. So that's, in a nutshell, how this thing works. Pretty scary stuff, right?

[7:37] Well, that was just one scenario. Let's take a look at some other types of attacks that we might also see. What you just saw was a financial-based attack, where someone is substituting in account numbers or other types
of information like that. But there could be other implications and other possibilities. There could be health-based information being exchanged, something really sensitive that could affect, for instance, a patient's life if the wrong information is communicated from one doctor to another.

[8:05] Another thing that could happen would be censorship. Say that you're doing a talk and someone substitutes different words that you did not say into a video. Now, all of a sudden, you have said something terrible that you didn't actually say, and the implications of that could be devastating.

[8:23] One other to consider is real-time impersonation. In this case, the attacker has the deepfake. They call up the other individual and are able to speak to them in the voice of the person they're impersonating. What they say is in their own voice, and what comes out is in the voice of the person they want to spoof. So there could be a lot of scary implications for this technology if we're not prepared.

[8:48] So what should you do to defend against an audio jacking attack? Defending against this stuff is really hard, but we do have some tools, some strategies, that we can use to guard against it. We'll start off with the most important: be skeptical. Don't believe everything you hear, even if you're sure you heard the voice of the other person. In this world of deepfakes and audio jacking, you may not be hearing what the other person is actually saying. So think first.

[9:19] Then, if it's something really important, like sending bank account numbers or anything really sensitive like that, you want to paraphrase and repeat. That way, there may be a little bit of difficulty with the translation, and you'll be able to catch it, catch it a little bit off
guard. Say it in different ways, because the LLM is looking for certain keywords, certain phrases, certain ways of expressing things, and maybe you'll express it slightly differently.

[9:47] Another thing, if it's really important to you, is out-of-band communication. In other words, we were just talking on a cell phone. If this is really important, maybe don't include the bank account number in that call. Maybe say, "I'll send you the account number through email" (not the greatest), or "I'll text it to you," or "I'll send it to you in some other messaging app." Better still, divide the account number up: send half of the account number in one messaging app and half in another. Or switch devices: if you were doing it on a phone, switch over to a laptop. Anything that broadens the attack surface the attacker would have to compromise, that's what you're looking to do. Make the job hard for them.

[10:33] And then finally, the best practices, the standard stuff that we know we're always supposed to do, but not everyone does. What kinds of things do I mean by this? Well, for instance, keep your systems always patched with the latest level of software. Whether it's a laptop or a phone doesn't matter; make sure that you have all the available security patches in place.

[10:58] Also, when it comes to emails, attachments, links in messages, and things like that, don't open them if you don't really have to or if you don't really know what they're going to do, because those things could be the way that the attacker inserts the malware onto your system and then becomes the man in the middle.

[11:17] Then, when it comes to apps that you download (and who doesn't want to download a thousand apps on their phone?),
make sure that you get them from trusted sources. Even trusted sources can fail us every once in a while, but you put the odds in your favor if you get it from a trusted app store, as opposed to another one where there might be malware, a Trojan horse, something like that inserted into the app.

[11:39] And then finally, one of the things that might get exploited ultimately downstream would be if they get your credentials and try to log into your account or something like that. So use things like multi-factor authentication, or, you know, I'm a big fan of replacing passwords with passkeys. We have a video on that if you'd like to learn more, but passkeys are a stronger way of securing your account.

[12:05] AI can do some really amazing things for us, and I'm a huge fan. However, if we're not careful, it can also do some really devastating stuff to us. So be informed, keep learning, stay vigilant, and protect yourself against the attacks. If you want to know more about how this particular proof of concept works, click down in the description below and you'll see a link to a blog post where you can find out the details and actually listen to audio samples that were generated during the proof of concept. And by the way, when is Martin going to finally send me that money?

[12:41] Thanks for watching. If you found this video interesting and would like to learn more about cybersecurity, please remember to hit like and subscribe to this channel.
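
The out-of-band advice from the talk (divide the account number across two channels, then confirm what arrived) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `split_for_channels`, `recombine`, and `matches` are hypothetical, and actually delivering each half over a separate messaging app or device is left to the reader.

```python
def split_for_channels(account_number: str) -> tuple[str, str]:
    """Split a number into two halves, one per communication channel."""
    mid = len(account_number) // 2
    return account_number[:mid], account_number[mid:]


def recombine(first_half: str, second_half: str) -> str:
    """Rejoin the halves received over the two channels."""
    return first_half + second_half


def matches(sent: str, read_back: str) -> bool:
    """Compare digits only, so spacing or hyphens do not cause a mismatch."""
    def digits(value: str) -> str:
        return "".join(ch for ch in value if ch.isdigit())
    return digits(sent) == digits(read_back)
```

For example, the sender could transmit each half from `split_for_channels` through a different app, and the receiver could read the recombined number back over yet another channel so the sender can check it with `matches`. The point of the design is the one made in the talk: an attacker who controls only one channel sees only part of the number and cannot silently replace the whole.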