Forensic Sciences

Speaker Recognition: On the Basis of their Habitual and Apprehensive Voice

Article Number: JUP002095 Volume 03 | Issue 01 | April - 2020 ISSN: 2581-4273
07th Mar, 2020
26th Mar, 2020
08th Apr, 2020
21st Apr, 2020


Gurpreet Kaur, Dr. Ranjeet Kumar Singh


Spoken language is the natural medium of communication. It carries linguistic information (accent, etc.), information about the speaker (emotions, etc.), and information about the environment (background noise, etc.). The human ability to extract and decode spoken language automatically has inspired researchers to study distinct aspects of it, including recognition of accent or changed accent, emotion, and gender. A “voiceprint” is a representation of the acoustic frequency spectrum that contains the significant features of human speech used to recognize a speaker. The voiceprint of an individual has the distinct qualities of uniqueness, durability, and strength. Beyond physiological dissimilarities, every speaker also has unique speaking features, such as the use of a specific accent or intonation style. An apprehensive speech is a disguised speech recorded while the speaker is under the influence of a threat, nervousness, etc., and it may be used for various criminal purposes such as fraud or spam calls. This paper focuses on the information about an individual that is observable in the speech signal, such as emotional state, intentional accent change, and belligerence, which can give the investigator better clues for differentiating an apprehensive voice from a habitual one. Some external factors (environmental noise, emotions, etc.) reduce the effectiveness of speaker identification. However, basic components of the original voice, such as the formant frequencies in the voiceprint, remain unchanged, which helps the recognition process even when an apprehensive voice is used. The intonation pattern of the formants of a speaker’s original voice is almost the same as the intonation pattern of the formants of the speaker’s deliberately apprehensive voice.

Keywords: Apprehensive Voice, Disguised Voice, Speaker Recognition, Voiceprint, Intonation Pattern.


Ears have the unique ability to receive and decipher spoken language. Among their diverse functions is the identification of people by their voices (Sharma and Bansal, 2013). Forensic auditory analysis has been a subject of methodological and scientific discussion for a long time. There is a growing global tendency for criminals to disguise their voices to hide their identity, particularly in cases of extortion, threatening calls, and emergency calls to the police (Zhang and Tan, 2007). Every individual has a unique voice: no two voices are the same, owing to physiological dissimilarities. This individuality can be employed to verify a person’s identity. Beyond physiological dissimilarities, every speaker also has unique speaking features, such as a specific accent, intonation style, rhythm, or a disorder that affects speaking ability and causes tremors (Zheng and Li, 2017). A speaker has many ways to manipulate his or her voice in order to deceive an automatic recognition system or even the human ear (Perrot, Aversano, and Chollet, 2007). Biometric access control systems, automatic speaker recognition systems, and forensic auditory analysis are several examples of speaker and voice recognition (Lal and Nath, 2015). Biometric security systems are built on unique human features such as voice and fingerprints. By detecting the user’s particular behavioral or physiological features, these systems provide an additional barrier against unauthorized access to data. Biometric security systems are more reliable than standard traditional methods.
There is a growing demand for speaker identification that models the vocal tract features of speakers, including changes caused by conditions such as illness, to provide more secure access to financial or other sensitive information. Speaker verification adds a further barrier against unauthorized access to data and also improves the protection provided by personal identification (Li, Yang, and Dai, 2014).

An apprehensive voice is a voice that reflects fear, anger, anxiety, or nervousness, or that trembles because of illness, and that is thereby disguised either intentionally or unintentionally. An apprehensive speech is a disguised speech recorded while the speaker is under the influence of a threat, fear, anger, nervousness, etc. It therefore falls under the disguised voice category. This type of voice is used for various criminal and illicit purposes such as fraud or spam calls and threatening calls, and it also occurs during the collection of voice samples from suspects. In forensic science, a method for differentiating apprehensive voices from normal voices would give the investigator better indications during the investigation. Voice disguise is an intentional act by a speaker to alter, distort, or manipulate their normal voice in order to hide or falsify their identity (Klevans & Rodman, 1998). In acoustic analysis as well as in forensic science, speaker recognition techniques are indispensable, and they are used for speaker identification. Speaker recognition based on voiceprints is the recognition of a speaker’s identity from their voiceprint. Various studies have proposed that the voiceprint of an individual has the distinct qualities of uniqueness, durability, and strength, and that it remains stable throughout adulthood except in the case of some disorders. It has also been suggested that no two individuals have the same voice, owing to physiological dissimilarities. Beyond those dissimilarities, every speaker has unique speaking features, such as a specific accent, intonation style, rhythm, or a disorder that affects speaking ability and causes tremors. Intonation patterns are the patterns of variation generated by the rise and fall in the pitch of the voice.
These intonation patterns are also helpful in speaker recognition when the voice is disguised or apprehensive.
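The notion of an intonation pattern as the rise and fall of pitch can be illustrated with a minimal sketch. It is not the authors' method: it assumes a plain autocorrelation pitch estimator, and the function names, the 75–400 Hz search range, the 50 ms frame length, and the 2 Hz tolerance are all hypothetical choices made for illustration.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate the pitch (F0, in Hz) of one frame by autocorrelation."""
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Similarity between the frame and a copy of itself shifted by `lag`.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # 0.0 marks a frame with no positive autocorrelation peak.
    return sample_rate / best_lag if best_lag else 0.0

def intonation_contour(signal, sample_rate, frame_len=400, hop=200):
    """One F0 estimate per overlapping frame: the pitch track over time."""
    return [estimate_f0(signal[s:s + frame_len], sample_rate)
            for s in range(0, len(signal) - frame_len + 1, hop)]

def contour_direction(contour, tolerance=2.0):
    """Label each step of the contour as 'rise', 'fall', or 'level'."""
    labels = []
    for prev, cur in zip(contour, contour[1:]):
        if cur - prev > tolerance:
            labels.append("rise")
        elif prev - cur > tolerance:
            labels.append("fall")
        else:
            labels.append("level")
    return labels

# Synthetic check: 0.2 s at 120 Hz followed by 0.2 s at 180 Hz, sampled at 8 kHz.
rate = 8000
tone = [math.sin(2 * math.pi * (120 if i < 1600 else 180) * i / rate)
        for i in range(3200)]
contour = intonation_contour(tone, rate)
print(contour_direction(contour))
```

On real recordings, the contours of a habitual and an apprehensive sample from the same speaker could be compared in the same way, although practical pitch trackers add voicing detection and smoothing that this sketch omits.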

Different techniques are available for speaker identification: some are based on phonetic approaches, while a few others are based on automatic algorithms. It is important to determine whether a voice is apprehensive or normal before proceeding to further voice analysis. An apprehensive voice is also a form of disguised voice, that is, an intentional action to hide a person’s identity. However, not all apprehensive voices are deliberately disguised, as some arise when a person is under stress, is suffering from a disease, or is under the influence of someone’s threat. An apprehensive voice also affects the performance of speaker identification systems, which can be degraded either by variations in the communication channel or by variations in the individual’s voice. There are two possible variations in the communication channel: handset variations and environmental variations. Both have been investigated in depth by several researchers, who have suggested various normalization methods to compensate for them.
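The text does not name a specific normalization method. As one commonly used illustration, the sketch below assumes cepstral mean normalization: a fixed handset or channel filter multiplies the spectrum, which becomes a constant additive offset in the cepstral domain, so subtracting the per-coefficient mean over a recording removes it. The function name and the toy feature values are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames of one recording."""
    if not frames:
        return []
    n_coeffs = len(frames[0])
    means = [sum(frame[k] for frame in frames) / len(frames)
             for k in range(n_coeffs)]
    return [[frame[k] - means[k] for k in range(n_coeffs)] for frame in frames]

# Toy cepstral vectors (three frames, three coefficients each); values are made up.
clean = [[1.0, 0.5, -0.2], [0.8, 0.1, 0.3], [1.2, -0.3, 0.1]]

# A stationary channel is modeled here as a constant offset on every frame.
channel_offset = [0.4, -0.1, 0.25]
through_channel = [[c + o for c, o in zip(frame, channel_offset)]
                   for frame in clean]

# After normalization the two versions coincide up to rounding error.
normalized_clean = cepstral_mean_normalize(clean)
normalized_channel = cepstral_mean_normalize(through_channel)
```

This only compensates for a channel that stays constant over the recording; time-varying environmental noise needs other techniques.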

Unintentional modifications are caused by emotional conditions such as stress, fear, and excitement, or by physical illness such as a cold, and these unintentional modifications fall under the category of “apprehensive voice”. Voice disguise also has legitimate applications, for example in radio and television interviews, where information is conveyed without disclosing the speaker’s identity. It is also used in speech coding, speech synthesis, entertainment, etc. Speakers can alter their voice in various ways, such as adopting a foreign accent or changing their speaking rate. There are three main ways to produce an apprehensive voice intentionally: placing a hand over the mouth, producing a high or low pitch, and producing a strained-nostril voice (Künzel, 2004). To distinguish an apprehensive voice from a normal voice, the spectrographic formants of the voice should first be analyzed and compared with those of the habitual voice. An automatic classification is then obtained, and these results and observations in turn give investigators useful clues for differentiation.

An individual’s voice can vary in two ways (Mireia Farrus, 2017):

1. Deliberate Variation – A deliberately disguised voice is mainly speaker-dependent. Künzel (1) described the differences in approach between women and men. Deliberate disguise is further categorized into:

• Non-Electronic Disguise – Voice imitation is natural in human beings and is observed in human communication in language acquisition, voice transformation, and impersonation. Impersonation is a type of imitation that aims to reproduce someone else’s voice or speech (Markham, 1997).

• Electronic Disguise – In electronic disguise, a device is employed to alter someone’s natural voice or speech. When used deliberately, it most often takes the form of voice conversion, in which the source speaker’s voice is modified so that it mimics the target speaker’s voice.

2. Non-deliberate Variation – Several changes in people’s voices occur for reasons outside their control. Most of these changes are caused naturally by factors that affect the usual development of the body, such as illness and age. Other modifications can be introduced by the electronic devices used in communication. These are known as “Non-deliberate Non-electronic Disguise” and “Non-deliberate Electronic Disguise”, respectively.

• Non-Electronic Disguise – The best example of a non-electronic, non-deliberate disguise is a hoarse voice. Hoarseness is a change in voice quality, usually manifested as a rough or breathy voice, and is typically caused by illnesses such as laryngitis (Sulica, 2011). A second example is emotional change: the impact of emotions on speech has been studied widely. A third example is intoxication, under whose influence speech is also altered. Ageing is a further example, since speech production undergoes physiological as well as anatomical changes throughout life (Schoetz, 2007).

• Electronic Disguise – Non-deliberate electronic disguise is any distortion of speech caused by channel effects, for example the microphone used. Apart from the size of the population used in the automatic speaker identification task, the distortion produced by noisy communication channels is regarded as a major factor affecting the performance of the system.

In deliberate variation, the speaker tries to mimic another individual’s voice to confuse the listener. In non-deliberate variation, the changes are due to emotional or physical conditions such as illness or a sore throat. Variations in the human voice can also be categorized according to how the disguise is produced:

1. Electronic Disguised/Apprehensive Voice

2. Non-electronic Disguised/Apprehensive Voice

In the former, the disguise is produced electronically using software tools such as Praat or Audacity. In the latter, the individual’s voice is changed mechanically, for example by placing a hand over the mouth or by straining the nostrils. Mel-frequency cepstral coefficients (MFCCs) are the main features widely employed for apprehensive voice identification and speaker identification. In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on the non-linear mel scale of frequency. The MFCCs are the coefficients that collectively make up an MFC (Kurian and Kurup, 2016; George and George, 2015). Various techniques, ranging from entirely auditory to automatic, have been employed in the past to identify the speaker of an unidentified recording in several scenarios. It was mentioned by Künzel (1) in the survey paper in 1975 that an average of 15% of cases was provided to the Department of Speaker Recognition of the German Federal Police Office. Internal statistics covering more than 20 years disclosed that disguise was found in most cases of some specific offense types, such as kidnapping. Interestingly, the use of electronic equipment (voice changers) was exceptionally rare in Germany.
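The MFCC computation mentioned above can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the cited studies: it uses the standard log-power formulation, a Hamming window, a naive DFT, the common mel formula 2595·log10(1 + f/700), and hypothetical defaults of 20 filters and 12 coefficients. Pre-emphasis, liftering, and delta features are omitted.

```python
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Short-term power spectrum of one Hamming-windowed frame (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        spectrum.append((re * re + im * im) / n)
    return spectrum

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    points = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * hz / sample_rate)) for hz in points]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):      # rising edge of the triangle
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # falling edge, peak of 1.0 at `mid`
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    """MFCCs of one frame: power spectrum, mel filterbank, log, then DCT-II."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Small constant guards against log(0) for an empty band.
    log_energies = [math.log(sum(w * p for w, p in zip(row, spectrum)) + 1e-10)
                    for row in bank]
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energies))
            for c in range(n_coeffs)]

# Synthetic 32 ms frame at 8 kHz: a 440 Hz tone plus a broadband chirp.
frame = [math.sin(2 * math.pi * 440 * i / 8000) + 0.5 * math.sin(0.01 * i * i)
         for i in range(256)]
coeffs = mfcc(frame, 8000)
louder = mfcc([10.0 * x for x in frame], 8000)
```

In this sketch, scaling the signal changes only coefficient 0, because a gain adds the same constant to every log band energy and the higher cosine basis functions sum to zero over a constant. That is one reason the higher coefficients are often preferred as loudness-independent features.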


C. Zhang, T. Tan. “Voice Disguise and Automatic Speaker Recognition”, Forensic Science International, Vol. 175, 2007, pp. 118–122.

G. S. Didla and H. Hollien. “Voice Disguise and Speaker Identification”, Acoustical Society of America, Proceedings of Meetings on Acoustics, 02-06 November 2015, Vol. 25, doi: 10.1121/2.0000239.

Gangamohan, P., et al. “Analysis of Emotional Speech—A Review.” Toward Robotic Socially Believable Behaving Systems - Volume I Intelligent Systems Reference Library, 2016, pp. 205–238., doi: 10.1007/978-3-319-31056-5_11.

George, A. M. et al. “Detection of Voice Disguise by Various Disguising Factors”, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 3, Issue 8, August 2015, doi: 10.15680/IJIRCCE.2015.0308050.

Künzel, H. J. “Effects of Voice Disguise on Speaking Fundamental Frequency”, International Journal of Speech, Language and the Law, vol. 7(2), 2000, pp.150-179, doi:

Künzel, Hermann J., et al. “Effect of Voice Disguise on the Performance of a Forensic Automatic Speaker Recognition System.” ODYSSEY04 -- The Speaker and Language Recognition Workshop Toledo, Spain. 2004

Kurian, S. “Recognition of Electronic Disguised Voices by the Means of MFCC”, International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, Vol. 5, Issue 6, June 2016, doi:10.15662/IJAREEIE.2016.0506014.

Li, Dongdong, et al. “Cost-Sensitive Learning for Emotion Robust Speaker Recognition.” The Scientific World Journal, vol. 2014, 2014, pp. 1–9., doi:10.1155/2014/628516.

Lini T Lal et al. “Identification of Disguised Voices using Feature Extraction and Classification”, International Journal of Engineering Research and General Science, Volume 3, Issue 2, Part 2, March-April, 2015 ISSN 2091-2730.

Perrot, Chollet et al. “Detection and Recognition of Voice Disguise”, Conference: IAFPA International Association for Forensic Phonetics and Acoustics, 2007.

Perrot, Patrick, et al. “Voice Disguise and Automatic Detection: Review and Perspectives.” Lecture Notes in Computer Science Progress in Nonlinear Speech Processing, 2007, pp. 101–117. doi: 10.1007/978-3-540-71505-4_7.

Pohjalainen, Jouni, and Paavo Alku. “Automatic Detection of Anger in Telephone Speech with Robust Autoregressive Modulation Filtering.” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, doi:10.1109/icassp.2013.6639128.

Sharma V, Bansal P.K, “A Review on Speaker Recognition Approaches and Challenges”, International Journal of Engineering Research & Technology (IJERT), Vol. 2, Issue 5, May – 2013.

Tiejun Tan. “The Effect of Voice Disguise on Automatic Speaker Recognition”, 3rd International Congress on Image and Signal Processing (CISP2010), Volume 8, 16-18 October 2010, doi: 10.1109/CISP.2010.5647131.

Tim Polzehl et al., “Anger Recognition in Speech Using Acoustic and Linguistic Cues”, Speech Communication, Volume 53, Issues 9–10, November–December 2011, Pages 1198-1209.

How to cite this article?

APA Style: Kaur, G., & Singh, R. K. (2020). Speaker Recognition: On the Basis of their Habitual and Apprehensive Voice. Academic Journal of Forensic Sciences, 3 (1), 12–19.
