Analysis of Acoustic Anomalies and Speech Artefacts in Synthetic Content

Abstract

Purpose: The study aimed to confirm empirically that acoustic anomalies and speech artefacts can serve as interpretable and robust descriptors for detecting audio deepfakes. It focused on identifying typical deviations in voice, prosody and acoustic spectrum parameters that result from synthetic speech generation by TTS and voice conversion models, in particular in the layers that model the sound source and the vocal tract. The next stage assessed the stability of these characteristics under realistic content distribution conditions (in the wild), including recompression, variable bandwidth and typical noise.

Design and methods: A complete framework was developed for the unification, extraction and selection of acoustic characteristics, independent of any classifier. The analysis accounted for the signal-to-noise ratio (SNR), which determines recording quality: a low SNR indicates a strong influence of background noise and significantly reduces the effectiveness of phase, cepstral and modulation characteristics. A total of 46,371 clips from the DeepFake RealWorld (DFRW) set were analysed; the set includes authentic recordings and synthetic recordings generated with various technologies (GAN, diffusion models, TTS, voice conversion). Five descriptor families were defined: tonal-glottal, cepstral-spectral, phase, energy-dynamic and prosodic-modulation. Selection was performed without neural networks, using the differential indicators Δp = p_df − p_real and PR = p_df / p_real with thresholds Δp ≥ 0.15 or PR ≥ 1.5 and false discovery rate (FDR) control (q < 5%).
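
The selection rule described above can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: p_df and p_real are treated as the fractions of synthetic and authentic clips in which a descriptor flags an anomaly, significance is obtained from a two-proportion z-test, and FDR is controlled with the Benjamini-Hochberg step-up procedure. The descriptor names and counts are hypothetical and serve only as an illustration.

```python
import math


def two_proportion_p_value(k_df, n_df, k_real, n_real):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    pooled = (k_df + k_real) / (n_df + n_real)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_df + 1 / n_real))
    if se == 0:
        return 1.0
    z = (k_df / n_df - k_real / n_real) / se
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank  # largest rank that satisfies the step-up condition
    keep = [False] * m
    for i in order[:cutoff]:
        keep[i] = True
    return keep


def select_descriptors(counts, delta_min=0.15, ratio_min=1.5, q=0.05):
    """Keep descriptors with (delta_p >= delta_min or PR >= ratio_min) that pass FDR control."""
    names = list(counts)
    p_values = [two_proportion_p_value(*counts[name]) for name in names]
    significant = benjamini_hochberg(p_values, q)
    selected = {}
    for name, is_significant in zip(names, significant):
        k_df, n_df, k_real, n_real = counts[name]
        p_df, p_real = k_df / n_df, k_real / n_real
        delta_p = p_df - p_real
        pr = p_df / p_real if p_real > 0 else math.inf
        if is_significant and (delta_p >= delta_min or pr >= ratio_min):
            selected[name] = (delta_p, pr)
    return selected


# Hypothetical counts: (flagged synthetic, total synthetic, flagged authentic, total authentic)
counts = {
    "lfcc_flag": (600, 1000, 350, 1000),
    "noise_flag": (510, 1000, 500, 1000),
}
selected = select_descriptors(counts)
for name, (delta_p, pr) in selected.items():
    print(f"{name}: delta_p={delta_p:.2f}, PR={pr:.2f}")
```

In this toy input only the first descriptor passes, because the second shows almost the same prevalence in both classes. Because the rule works on per-descriptor proportions rather than on classifier outputs, each retained descriptor stays individually auditable.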

Results: The analysis revealed significant differences between authentic and synthetic speech. The strongest discrimination was obtained for the LFCC, CQCC and MFCC characteristics (Δp up to 0.25; PR ≈ 1.6–1.8), which remained stable after the degradations typical of social media. The jitter/shimmer, HNR/CPP and modulation characteristics revealed prosody smoothing and excessive voice regularity (Δp ≈ 0.17–0.23). Phase characteristics were useful for detecting harmonic discontinuities, but their effectiveness dropped at low SNR. Combining acoustic analysis with audio-video synchronisation metrics (LSE-C/LSE-D) increased resistance to attacks that disturb a single modality.

Conclusions: The identified speech anomalies and artefacts provide a credible and interpretable foundation for audio deepfake detection. The results have direct practical value for public safety and civil protection, because they make it possible to build an auditable audio content analysis layer for cases of voice impersonation and message manipulation. In operational scenarios, such as crisis communication by public institutions, verification of the authenticity of recordings disseminated on social media and analysis of social engineering incidents, interpretable descriptors may shorten triage time, support early warning and reduce the risk that voice misinformation escalates. They can serve as a basis for hybrid forensic systems that combine classic acoustic descriptors with deep learning models to ensure interpretability and resistance to technological drift. The DFRW set and the applied selection method enable comparable and repeatable evaluation of the effectiveness of characteristics under various distribution conditions. The continuation of the project (DFRWv2) will extend the database to ≥ 500,000 clips and add multimodal audio-video analyses, which will enable standardised reporting of the indicators Δp, PR, p_real, p_df and 95% CI in forensic studies and security engineering.
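
As an illustration of how a 95% CI for Δp could be reported, the sketch below uses a percentile bootstrap. This is an assumption: the abstract does not state how the intervals are computed, and the function name, the binary per-clip flags and the counts are hypothetical.

```python
import random


def bootstrap_delta_p_ci(flags_df, flags_real, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for delta_p = p_df - p_real.

    flags_df / flags_real are lists of 0/1 values, one per clip, marking
    whether the descriptor flagged the clip (hypothetical representation).
    """
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        # Resample each class independently, with replacement.
        sample_df = rng.choices(flags_df, k=len(flags_df))
        sample_real = rng.choices(flags_real, k=len(flags_real))
        deltas.append(sum(sample_df) / len(sample_df)
                      - sum(sample_real) / len(sample_real))
    deltas.sort()
    lower_idx = max(0, round(alpha / 2 * n_boot))
    upper_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return deltas[lower_idx], deltas[upper_idx]


# Hypothetical data: 600 of 1000 synthetic and 350 of 1000 authentic clips flagged.
flags_df = [1] * 600 + [0] * 400
flags_real = [1] * 350 + [0] * 650

p_df = sum(flags_df) / len(flags_df)
p_real = sum(flags_real) / len(flags_real)
ci_low, ci_high = bootstrap_delta_p_ci(flags_df, flags_real)
print(f"p_df={p_df:.2f}, p_real={p_real:.2f}, delta_p={p_df - p_real:.2f}, "
      f"PR={p_df / p_real:.2f}, 95% CI for delta_p=({ci_low:.3f}, {ci_high:.3f})")
```

Fixing the random seed keeps the reported interval repeatable between runs, which matters when the indicators are compared across studies.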

Keywords: audio deepfake, speech artefacts, acoustic anomalies, modulation spectrum, audio-video synchronisation, public safety, cybersecurity, civil protection, crisis communication, acoustic forensics

Type of article: original scientific article

Bibliography:

Amodei D., Hernandez D., AI and Compute, OpenAI, 2018.
Kaplan J., McCandlish S., Henighan T. et al., Scaling Laws for Neural Language Models, arXiv:2001.08361, 2020.
Brown T.B., Mann B., Ryder N. et al., Language Models Are Few-Shot Learners, “Advances in Neural Information Processing Systems” 2020, 33, 1877–1901.
Achiam J., Adler S., Agarwal S. et al., GPT-4 Technical Report, arXiv:2303.08774, 2023.
Chesney R., Citron D.K., Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security, “California Law Review” 2019, 107, 1753–1819, https://doi.org/10.15779/Z38RV0D15J.
Verdoliva L., Media Forensics and Deepfakes: An Overview, “IEEE Journal of Selected Topics in Signal Processing” 2020, 14(5), 910–932, https://doi.org/10.1109/JSTSP.2020.3002101.
Wang C., Chen S., Wu Y., Zhang Z., Zhou L., Liu S., Wei F., Neural Codec Language Models Are Zero-Shot Text-to-Speech Synthesizers, arXiv:2301.02111, 2023.
Ren Y., Hu C., Tan X., Qin T., Zhao S., Zhao Z., Liu T.Y., FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, arXiv:2006.04558, 2020.
Korshunov P., Marcel S., Vulnerability of Face Recognition to Deep Morphing, in: ICB Workshops, 2019.
Nautsch A., Wang X., Todisco M. et al., ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech, “Computer Speech & Language” 2022, 72, 101309, https://doi.org/10.1016/j.csl.2021.101309.
Sumsub, Identity Fraud Report 2023: Trends and Forecasts, Sumsub, 2023 [electronic document], https://sumsub.com/blog/guides-reports/identity-fraud-report-2023/ [accessed: 01.12.2024].
Regula Forensics, Deepfake Trends 2024: Business Identity Fraud Survey, 2024 [electronic document], https://regulaforensics.com/resources/deepfake-trends-2024-report/ [accessed: 01.12.2024].
Matern F., Riess C., Stamminger M., Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations, in: WACV Workshops, 2019.
Guarnera L., Giudice O., Battiato S., DeepFake Detection by Analyzing Convolutional Traces, in: CVPR Workshops, 2020.
Mittal T., Bhattacharya U., Chandra R., Bera A., Manocha D., Emotions Don’t Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, 2823–2832, https://doi.org/10.1145/3394171.3413530.
Frank J., Schönherr L., WaveFake: A Data Set to Facilitate Audio Deepfake Detection, arXiv:2111.02813, 2021.
Kumar S.V., Reddy S.T.A., Kalyani V., Deepfake Detection on Social Media, “International Journal of Communication Networks and Information Security” 2024, 16(5), 776–782.
DiResta R., The Supply of Disinformation Will Soon Be Infinite, The Atlantic 2020.
Europol, AI and Policing: The Benefits and Challenges of Artificial Intelligence for Law Enforcement, Spotlight Report, 2024 [electronic document].
TNO, Generative AI and the Information Domain: Scenario-Based Analysis, 2023 [electronic document].
Haliassos A., Vougioukas K., Petridis S., Pantic M., Lips Don’t Lie: A Generalisable and Robust Approach to Face Forgery Detection, in: CVPR, 2021, 5039–5049, https://doi.org/10.1109/CVPR46437.2021.00500.
Li M., Ahmadiadli Y., Zhang X.P., A Survey on Speech Deepfake Detection, “ACM Computing Surveys” 2025, 57(7), 1–38, https://doi.org/10.1145/3657424.
Dolhansky B., Howes R., Pflaum B. et al., The DeepFake Detection Challenge (DFDC) Dataset, arXiv:2006.07397, 2020.
Rössler A., Cozzolino D., Verdoliva L., Riess C., Thies J., Nießner M., FaceForensics++: Learning to Detect Manipulated Facial Images, in: ICCV, 2019, https://doi.org/10.1109/ICCV.2019.00076.
Jiang L., Li R., Wu W. et al., DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection, in: CVPR, 2020, https://doi.org/10.1109/CVPR42600.2020.00288.
Li Y., Yang X., Sun P., Qi H., Lyu S., Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics, in: CVPR, 2020, https://doi.org/10.1109/CVPR42600.2020.00327.
Kinnunen T., Sahidullah M., Delgado H. et al., The ASVspoof 2017 Challenge, in: Interspeech, 2017, https://doi.org/10.21437/Interspeech.2017-1111.
Todisco M., Wang X., Vestman V. et al., ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection, arXiv:1904.05441, 2019.
Wu Z., Yamagishi J., Kinnunen T. et al., ASVspoof: The Automatic Speaker Verification Spoofing and Countermeasures Challenge, “IEEE Journal of Selected Topics in Signal Processing” 2017, 11(4), 588–604, https://doi.org/10.1109/JSTSP.2017.2739166.
Eom J., Lee J., Kim H. et al., AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks, in: INTERSPEECH 2022, 2398–2402, https://doi.org/10.21437/Interspeech.2022-11067.
Baevski A., Zhou Y., Mohamed A., Auli M., wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, “NeurIPS” 2020, 33, 12449–12460.
Durall R., Keuper M., Keuper J., Watch Your Up-Convolution: CNN-Based Generative Deep Neural Networks Are Failing to Reproduce Spectral Distributions, in: CVPR, 2020, https://doi.org/10.1109/CVPR42600.2020.00795.
Guo C., Pleiss G., Sun Y., Weinberger K.Q., On Calibration of Modern Neural Networks, in: ICML, 2017, 1321–1330.
Lu Y., Luo R., Ebrahimi T., Improving Deepfake Detectors against Real-World Perturbations, Applications of Digital Image Processing XLVI, SPIE 2023, https://doi.org/10.1117/12.2676695.
Carlini N., Tramer F., Wallace E. et al., Poisoning Web-Scale Training Datasets Is Practical, arXiv:2302.10149, 2023.
Qian Y., Yin G., Sheng L., Chen Z., Shao J., Thinking in Frequency: Face Forgery Detection by Mining Frequency-Aware Clues, in: ECCV, 2020, 86–103, https://doi.org/10.1007/978-3-030-58577-8_6.
Doshi-Velez F., Kim B., Towards a Rigorous Science of Interpretable Machine Learning, arXiv:1702.08608, 2017.
Wang T., Liao X., Chow K.P., Lin X., Wang Y., Deepfake Detection: A Comprehensive Survey from the Reliability Perspective, arXiv:2211.10881, 2022.
Selvaraju R.R., Cogswell M., Das A. et al., Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, in: ICCV, 2017, https://doi.org/10.1109/ICCV.2017.74.
Neves J.C., Tolosana R., Vera-Rodriguez R. et al., GANprintR: Improved Fakes and Evaluation of the State of the Art in Face Manipulation Detection, “IEEE Journal of Selected Topics in Signal Processing” 2020, 14(5), 1038–1048, https://doi.org/10.1109/JSTSP.2020.2999165.
Paullada A., Raji I.D., Bender E.M., Denton E., Hanna A., Data and Its (Dis)Contents, “Patterns” 2021, 2(11), 100336, https://doi.org/10.1016/j.patter.2021.100336.
Gebru T., Morgenstern J., Vecchione B. et al., Datasheets for Datasets, “Communications of the ACM” 2021, 64(12), 86–92, https://doi.org/10.1145/3458723.
Ho J., Jain A., Abbeel P., Denoising Diffusion Probabilistic Models, “NeurIPS” 2020, 33, 6840–6851.
Mittal T., Sinha R., Swaminathan V., Collomosse J., Manocha D., Video Manipulations Beyond Faces, in: WACV, 2023, 643–652, https://doi.org/10.1109/WACV56688.2023.00073.
Mittal A., Soundararajan R., Bovik A.C., Making a “Completely Blind” Image Quality Analyzer, “IEEE Signal Processing Letters” 2013, 20(3), 209–212, https://doi.org/10.1109/LSP.2012.2227726.
Zhang R., Isola P., Efros A.A., Shechtman E., Wang O., The Unreasonable Effectiveness of Deep Features as a Perceptual Metric, in: CVPR, 2018, 586–595, https://doi.org/10.1109/CVPR.2018.00068.
Tu Z., Wang Y., Birkbeck N., Adsumilli B., Bovik A.C., UGC-VQA, “IEEE Transactions on Image Processing” 2021, 30, 4449–4464, https://doi.org/10.1109/TIP.2021.3070508.
Teed Z., Deng J., RAFT: Recurrent All-Pairs Field Transforms for Optical Flow, in: ECCV, 2020, https://doi.org/10.1007/978-3-030-58536-5_24.
Khalid H., Tariq S., Kim M., Woo S.S., FakeAVCeleb, arXiv:2108.05080, 2021.
Chung J.S., Zisserman A., Out of Time: Automated Lip Sync in the Wild, in: ACCV, 2016, 251–263, https://doi.org/10.1007/978-3-319-54184-6_16.
Prajwal K.R., Mukhopadhyay R., Namboodiri V.P., Jawahar C.V., A Lip Sync Expert Is All You Need, in: ACM Multimedia, 2020, 484–492, https://doi.org/10.1145/3394171.3413532.
Baltrušaitis T., Zadeh A., Lim Y.C., Morency L-P., OpenFace 2.0, in: IEEE FG, 2018, https://doi.org/10.1109/FG.2018.00019.
Frank J., Eisenhofer T., Schönherr L. et al., Leveraging Frequency Analysis for Deep Fake Image Recognition, in: ICML, 2020, 3247–3258.
Tolosana R., Vera-Rodriguez R., Fierrez J., Morales A., Ortega-Garcia J., Deepfakes and Beyond, “Information Fusion” 2020, 64, 131–148, https://doi.org/10.1016/j.inffus.2020.06.014.
Cohen J., A Coefficient of Agreement for Nominal Scales, “Educational and Psychological Measurement” 1960, 20(1), 37–46, https://doi.org/10.1177/001316446002000104.
Krippendorff K., Content Analysis: An Introduction to Its Methodology, 4th ed., SAGE, 2019.
Artstein R., Poesio M., Inter-Coder Agreement for Computational Linguistics, “Computational Linguistics” 2008, 34(4), 555–596, https://doi.org/10.1162/coli.07-034-R2.
Northcutt C.G., Athalye A., Mueller J., Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, arXiv:2103.14749, 2021.
Güera D., Delp E.J., Deepfake Video Detection Using Recurrent Neural Networks, in: AVSS, 2018, https://doi.org/10.1109/AVSS.2018.8639163.
EBU, R 128: Loudness Normalisation and Permitted Maximum Level of Audio Signals, European Broadcasting Union, 2014.
ITU-R, BS.1770-4: Algorithms to Measure Audio Programme Loudness and True-Peak Audio Level, ITU, 2015.
ITU-T, P.56: Objective Measurement of Active Speech Level, ITU, 2011.
Bianchi T., Piva A., Image Forgery Localization via Block-Grained Analysis of JPEG Artifacts, “IEEE Transactions on Information Forensics and Security” 2012, 7(3), 1003–1017, https://doi.org/10.1109/TIFS.2012.2187516.
de Cheveigné A., Kawahara H., YIN: A Fundamental Frequency Estimator for Speech and Music, “Journal of the Acoustical Society of America” 2002, 111(4), 1917–1930, https://doi.org/10.1121/1.1458024.
Titze I.R., Principles of Voice Production, Prentice Hall, 1994.
Boersma P., Accurate Short-Term Analysis of the Fundamental Frequency, “Proceedings of the Institute of Phonetic Sciences” 1993, 17, 97–110.
Krom G.D., A Cepstrum-Based Technique for Determining a Harmonics-to-Noise Ratio, “Journal of Speech and Hearing Research” 1993, 36(2), 254–266, https://doi.org/10.1044/jshr.3602.254.
De Leon P.L., Stewart B., Yamagishi J., Synthetic Speech Discrimination Using Pitch Pattern Statistics, in: Interspeech, 2012.
Muda L., Begam M., Elamvazuthi I., Voice Recognition Algorithms Using MFCC and DTW, arXiv:1003.4083, 2010.
Sahidullah M., Kinnunen T., Hanilçi C., A Comparison of Features for Synthetic Speech Detection, in: Interspeech, 2015.
Todisco M., Delgado H., Evans N., Constant Q Cepstral Coefficients, “Computer Speech & Language” 2017, 45, 516–535, https://doi.org/10.1016/j.csl.2017.01.001.
Markel J.D., Gray A.J., Linear Prediction of Speech, Springer, 2013.
Rabiner L., Schafer R., Theory and Applications of Digital Speech Processing, Prentice Hall, 2010.
Peeters G., A Large Set of Audio Features for Sound Description, CUIDADO IST Project Report, 2004.
Murthy H.A., Yegnanarayana B., Formant Extraction from Group Delay Function, “Speech Communication” 1991, 10(3), 209–221, https://doi.org/10.1016/0167-6393(91)90008-K.
McAuliffe M., Socolof M., Mihuc S. et al., Montreal Forced Aligner, in: Interspeech, 2017, 498–502, https://doi.org/10.21437/Interspeech.2017-1386.
Patel Y., Tanwar S., Gupta R. et al., Deepfake Generation and Detection, “IEEE Access” 2023, 11, 143296–143323, https://doi.org/10.1109/ACCESS.2023.3342844.
Shen J., Pang R., Weiss R.J. et al., Natural TTS Synthesis by Conditioning WaveNet, in: ICASSP, 2018, 4779–4783, https://doi.org/10.1109/ICASSP.2018.8461368.
Kim J., Kong J., Son J., Conditional Variational Autoencoder with Adversarial Learning, in: ICML, 2021, 5530–5540.
Jędrasiak K., Audio Stream Analysis for Deep Fake Threat Identification, “Civitas et Lex” 2024, 41(1), 21–35.
Elliott T.M., Theunissen F.E., The Modulation Transfer Function for Speech Intelligibility, “PLoS Computational Biology” 2009, 5(3), e1000302, https://doi.org/10.1371/journal.pcbi.1000302.
Houtgast T., Steeneken H.J., A Review of the MTF Concept, “Journal of the Acoustical Society of America” 1985, 77(3), 1069–1077, https://doi.org/10.1121/1.392224.
Hillenbrand J., Houde R.A., Acoustic Correlates of Breathy Vocal Quality, “Journal of Speech, Language, and Hearing Research” 1996, 39(2), 311–321, https://doi.org/10.1044/jshr.3902.311.
Benjamini Y., Hochberg Y., Controlling the False Discovery Rate, “Journal of the Royal Statistical Society B” 1995, 57(1), 289–300, https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.
Efron B., Tibshirani R.J., An Introduction to the Bootstrap, Chapman & Hall/CRC, 1994.
Tomashenko N., Wang X., Vincent E. et al., Supplementary Material to the Paper “The VoicePrivacy 2020 Challenge”, 2022.

Analysis of Acoustic Anomalies and Speech Artefacts in Synthetic Content

Karol Jędrasiak, Ph.D. Eng.

Abstract