Machine Learning Approach to Guess Passwords via Microphones

Written on May 17, 2020

Most known attacks rely on software on the victim’s device, either a vulnerable program that can be exploited or a trojan that can be controlled remotely. But what if attackers could get the victim’s sensitive data, like passwords, over the phone without installing any software?

In this post, I demonstrate how we can use ML to get passwords over Skype/Discord or any VoIP application without the need to interact with the victim’s device.


In summer 2019, I worked in a lab with smart Ph.D. students who were applying machine learning to different problems. They introduced me to machine learning, and I loved the idea of using it to solve problems without explicitly programming the solution. So I started reading about ML and how it can be used in the security field. I came across a video on YouTube that shows some researchers trying to guess someone’s password over Skype using only the keyboard sounds picked up by the microphone. As a red teamer, I think it’s an amazing way to utilize ML. The researchers didn’t include any details about their implementation, so I decided to write my own :).

The code can guess which key was pressed using only the sound of the keystroke. I came up with the idea a year ago but forgot to publish the code and write about it. So today, I will talk about the idea and how it can be used in an actual attack.


The Goal

The idea is to record a Discord call while the target enters sensitive data and then recover that data from the recording.


The only changeable variables that are easy to determine are keyboard type, microphone type, how far the microphone is from the keyboard, and finally, the possible languages the target might use. To get the correct variables and train our classifier, we basically need the exact model of the laptop used by the target. By knowing the model, we can replicate the same environment and train the classifier. It’s easier if the target is using the laptop’s microphone since the distance between the keyboard and the microphone never changes. After determining the laptop’s model, we can generate as much data as we can to feed the classifier.

Training the Classifier

To train the classifier, we need as many samples as possible - the more, the merrier. In my case, my hypothetical target was using a 2011 MacBook Pro, which has a loud keyboard compared to the MacBook Pro models released after 2015. I was able to generate 20 samples for each letter and number key. Then I fed the data to the classifier.
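The post does not say which features or model were used, so the following is only a minimal sketch of what training on per-key samples might look like. It assumes each sample is a short list of audio amplitudes for a single keystroke, describes it by its energy at a few frequencies, and classifies with a nearest-centroid rule. The function names (`band_energies`, `train_centroids`, `predict_key`), the chosen frequencies, and the synthetic tones standing in for real recordings are all hypothetical.

```python
import math

def band_energies(samples, sample_rate, bands=(500, 1000, 2000, 4000)):
    """Return the normalized energy of `samples` at each frequency in `bands` (Hz)."""
    energies = []
    for freq in bands:
        # Naive single-frequency DFT: correlate the signal with a cosine and a sine.
        step = 2 * math.pi * freq / sample_rate
        re = sum(s * math.cos(step * n) for n, s in enumerate(samples))
        im = sum(s * math.sin(step * n) for n, s in enumerate(samples))
        energies.append(re * re + im * im)
    total = sum(energies) or 1.0
    return [e / total for e in energies]  # normalize so loudness does not dominate

def train_centroids(labeled_samples, sample_rate):
    """Map each key label to the mean feature vector of its recordings."""
    centroids = {}
    for key, recordings in labeled_samples.items():
        feats = [band_energies(r, sample_rate) for r in recordings]
        centroids[key] = [sum(col) / len(feats) for col in zip(*feats)]
    return centroids

def predict_key(samples, centroids, sample_rate):
    """Return the key whose centroid is closest to the features of `samples`."""
    feat = band_energies(samples, sample_rate)
    return min(centroids, key=lambda k: math.dist(feat, centroids[k]))

# Synthetic stand-in for real recordings: each "key" is a pure tone (not realistic).
RATE = 16000

def tone(freq, phase=0.0, n=256):
    return [math.sin(2 * math.pi * freq * i / RATE + phase) for i in range(n)]

training = {
    "a": [tone(1000, p) for p in (0.0, 0.5, 1.0)],
    "b": [tone(4000, p) for p in (0.0, 0.5, 1.0)],
}
centroids = train_centroids(training, RATE)
guess = predict_key(tone(4000, 0.3), centroids, RATE)
```

Real keystrokes are short broadband clicks rather than tones, so a practical version would likely need richer features and a stronger model; the structure (labeled samples in, a per-key prediction out) is the part this sketch is meant to show.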

Probability of Error

I think the probability of error is very high because this implementation doesn’t handle other possibilities, like holding Shift or pressing Caps Lock or Tab. I think solving that specific problem is not very hard, but it requires adding more logic to the code.

Overall, the probability of error is still high due to the nature of the problem. There are many variables I didn’t mention that can break this idea, but again, this is only a proof of concept showing that it can be implemented.

Other Interesting Solutions

We can also improve the results by using password patterns from well-known wordlists to help the classifier determine or confirm the guessed password, similar to this optimizer (Click here) for JWT Exfiltration Optimization.
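As a rough illustration of the wordlist idea, the sketch below ranks wordlist entries by how well they agree with the classifier's guesses. It assumes the classifier can output a probability for each candidate key at every keystroke, which the post does not confirm. The function name `rank_candidates`, the `floor` value, and the example probabilities are made up for illustration.

```python
import math

def rank_candidates(per_key_probs, wordlist, floor=1e-6):
    """Rank wordlist entries by how well they match the per-keystroke guesses."""
    scored = []
    for word in wordlist:
        if len(word) != len(per_key_probs):
            continue  # the number of detected keystrokes fixes the password length
        # Sum of log-probabilities; characters the classifier never proposed
        # get a small floor instead of zero so the logarithm stays defined.
        score = sum(math.log(probs.get(ch, floor))
                    for ch, probs in zip(word, per_key_probs))
        scored.append((score, word))
    return [word for score, word in sorted(scored, reverse=True)]

# Hypothetical classifier output for four keystrokes (probabilities are invented).
per_key_probs = [
    {"k": 0.7, "l": 0.3},
    {"a": 0.5, "s": 0.5},
    {"k": 0.4, "l": 0.6},
    {"a": 0.7, "s": 0.3},
]
wordlist = ["lala", "kaka", "password", "lsls"]
ranked = rank_candidates(per_key_probs, wordlist)
```

In this made-up example, picking the most likely key at each position would give "kala", which is not in the wordlist; scoring whole words instead lets a slightly less likely key at one position be corrected by a known password.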


In this demo, I asked my friend to use the laptop I had trained the classifier on and to type a password while I recorded the microphone’s input.


After cutting the part where we talk, we have these 8 peaks as shown below:


(The new record file)

Now we can take the file and split it into 8 WAV files.
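The post does not describe how the split was done, and it may well have been manual. One way to automate it is to look for short bursts of loudness separated by quiet gaps. The sketch below assumes the recording is already loaded as a list of mono amplitude values; the function name `find_keystrokes` and the frame size, threshold, and gap values are guesses that would need tuning on real audio.

```python
def find_keystrokes(samples, sample_rate, frame_ms=10,
                    threshold_ratio=0.2, min_gap_ms=100):
    """Return (start, end) sample indices of loud bursts, assumed to be keystrokes."""
    frame = max(1, sample_rate * frame_ms // 1000)
    # Mean absolute amplitude of each non-overlapping frame.
    levels = [sum(abs(s) for s in samples[i:i + frame]) / frame
              for i in range(0, len(samples), frame)]
    if not levels or max(levels) == 0:
        return []  # empty or completely silent recording
    threshold = threshold_ratio * max(levels)
    min_gap = max(1, min_gap_ms // frame_ms)
    segments = []
    quiet = min_gap  # number of quiet frames since the last loud frame
    for idx, level in enumerate(levels):
        if level >= threshold:
            if quiet >= min_gap:
                segments.append([idx, idx + 1])  # start a new keystroke
            else:
                segments[-1][1] = idx + 1        # extend the current keystroke
            quiet = 0
        else:
            quiet += 1
    return [(a * frame, min(b * frame, len(samples))) for a, b in segments]

# Synthetic recording: 8 bursts of 20 ms separated by 200 ms of silence.
RATE = 8000
silence = [0.0] * 1600
burst = [1.0 if i % 2 == 0 else -1.0 for i in range(160)]
recording = (silence + burst) * 8 + silence
segments = find_keystrokes(recording, RATE)
```

Each `(start, end)` pair can then be used to slice the recording, and each slice can be written to its own file, for example with Python's standard `wave` module. Fast typing would shrink the gaps between bursts and make this simple thresholding less reliable.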


Now we can test each file and try to identify which key was pressed, assuming that the keyboard uses the US English layout:


Peak 1:


Peak 2:


Peak 3:


Peak 4:


Peak 5:


Peak 6:


Peak 7:


Peak 8:


Hooray! The password is kaka2002. I intentionally asked my friend to enter this password because the keys are far from each other, which makes him take time to type them. If he had typed the password faster, I would probably not have been able to identify the peaks perfectly.

Source Code


If some organization were able to generate enough samples for every popular laptop on the market and improved the idea with better implementations, I think it would be able to guess any entered data only by listening to a microphone; and maybe.. just maybe.. it could be implemented on some IoT devices like Google Home and Amazon Alexa :)