Analysis of Acoustic Anomalies and Speech Artifacts in Synthetic Content


Abstract

Purpose: The study aimed to confirm empirically that acoustic anomalies and speech artifacts can serve as interpretable and robust descriptors for detecting audio deepfakes. It focused on identifying characteristic deviations in voice parameters, prosody, and the acoustic spectrum that arise when speech is generated synthetically by TTS and voice conversion models, particularly in how these models reproduce the sound source and the vocal tract. A further step assessed the stability of these features under realistic distribution conditions (in the wild), including recompression, variable bandwidth, and typical noise.

Design and methods: A complete framework for the unification, extraction, and selection of acoustic features was developed, independent of any classifier. The analysis accounted for the signal-to-noise ratio (SNR), a measure of recording quality: a low SNR means strong background noise and substantially reduces the effectiveness of phase, cepstral, and modulation features. The analysis covered 46,371 clips from the DeepFake RealWorld (DFRW) dataset, which comprises authentic recordings and synthetic ones generated with various technologies (GANs, diffusion models, TTS, voice conversion). Five families of descriptors were defined: tonal-glottal, cepstral and spectral, phase, energy-dynamic, and prosodic-modulation. Selection was carried out without neural networks, using the differential indicators Δp = p_df − p_real and PR = p_df / p_real with thresholds Δp ≥ 0.15 or PR ≥ 1.5 and false discovery rate (FDR) control (q < 0.05).
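The selection rule described above can be illustrated with a minimal sketch. It assumes that p_df and p_real denote the fraction of deepfake and authentic clips in which a given binary anomaly indicator fires; the abstract does not define them further. The pooled two-proportion z-test, the Benjamini-Hochberg step-up procedure used for FDR control, and all function and variable names (`select_features`, `counts`, and so on) are illustrative assumptions, not the study's actual implementation.

```python
from math import erfc, sqrt


def two_proportion_p_value(k_df, n_df, k_real, n_real):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test choice)."""
    pooled = (k_df + k_real) / (n_df + n_real)
    se = sqrt(pooled * (1 - pooled) * (1 / n_df + 1 / n_real))
    if se == 0:
        return 1.0
    z = (k_df / n_df - k_real / n_real) / se
    # 2 * (1 - Phi(|z|)) expressed through the complementary error function
    return erfc(abs(z) / sqrt(2))


def benjamini_hochberg(p_values, q=0.05):
    """Return booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest rank k with p_(k) <= (k / m) * q
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


def select_features(counts, n_df, n_real, min_delta=0.15, min_ratio=1.5, q=0.05):
    """counts maps a feature name to (k_df, k_real): clips where its indicator fired."""
    names = list(counts)
    p_vals = [
        two_proportion_p_value(counts[name][0], n_df, counts[name][1], n_real)
        for name in names
    ]
    significant = benjamini_hochberg(p_vals, q)
    selected = {}
    for name, is_significant in zip(names, significant):
        k_df, k_real = counts[name]
        p_df, p_real = k_df / n_df, k_real / n_real
        delta_p = p_df - p_real
        pr = p_df / p_real if p_real > 0 else float("inf")
        # Keep a feature if it passes FDR control and either effect-size threshold
        if is_significant and (delta_p >= min_delta or pr >= min_ratio):
            selected[name] = (delta_p, pr)
    return selected
```

Under these assumptions, a call such as `select_features({"a": (400, 200), "b": (210, 200)}, n_df=1000, n_real=1000)` keeps only features whose difference is both statistically significant and large enough in absolute or relative terms. The counts here are made-up illustration values, not results from DFRW.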

Results: The analysis revealed significant differences between authentic and synthetic speech. The LFCC, CQCC, and MFCC features discriminated best (Δp up to 0.25; PR ≈ 1.6–1.8) and remained stable after degradations typical of social media. Jitter/shimmer, HNR/CPP, and modulation features indicated smoothed prosody and excessive vocal regularity (Δp ≈ 0.17–0.23). Phase features were useful for detecting harmonic discontinuities, although their effectiveness decreased at low SNR. Combining acoustic analysis with audio-video consistency metrics (LSE-C/LSE-D) increased robustness to attacks that disrupt a single modality.
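To make "excessive vocal regularity" concrete, the sketch below computes a local perturbation measure in the commonly used form: the mean absolute difference between consecutive cycles divided by the mean value. Applied to glottal periods it gives local jitter; applied to cycle peak amplitudes it gives local shimmer. The period values are invented for illustration, the function name is hypothetical, and the extraction of cycles from audio (for example with a pitch tracker) is not shown.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Illustrative glottal periods in seconds (made-up values, not DFRW data)
periods_natural = [0.0100, 0.0103, 0.0098, 0.0102, 0.0099]
periods_smooth = [0.0100, 0.0100, 0.0101, 0.0100, 0.0100]

jitter_natural = local_perturbation(periods_natural)
jitter_smooth = local_perturbation(periods_smooth)
```

In this toy example the over-regular sequence yields a markedly lower jitter value than the naturally varying one. That is the direction of the effect reported for synthetic speech.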

Conclusions: The identified anomalies and speech artifacts provide a reliable, interpretable foundation for detecting audio deepfakes. The results have direct practical value for public safety, cybersecurity, and civil protection, because they make it possible to build an auditable pre-screening layer that checks audio material for voice impersonation and message manipulation. In operational scenarios such as crisis communication by public institutions, verification of the authenticity of recordings circulating on social media, and analysis of social engineering incidents, interpretable descriptors can shorten triage time, support early warning, and limit the risk of escalating voice-based disinformation. They can also underpin hybrid forensic systems that combine classical acoustic descriptors with deep learning models, providing interpretability and robustness to technological drift. The DFRW dataset and the selection method used here allow comparable, repeatable evaluation of feature effectiveness under different distribution conditions. The continuation of the project (DFRWv2) will expand the database to ≥ 500,000 clips and add multimodal audio-video analyses, which will make it possible to standardize the reporting of Δp, PR, p_real, p_df, and 95% CI in forensic and security engineering research.
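A report of Δp, PR, p_real, p_df, and 95% CI for one feature could look like the sketch below. The abstract does not state how the confidence intervals are obtained; a percentile bootstrap over clips is assumed here. The per-clip 0/1 flags, the function name `bootstrap_report`, and its parameters are hypothetical.

```python
import random


def bootstrap_report(flags_df, flags_real, n_boot=2000, alpha=0.05, seed=0):
    """Point estimates and percentile-bootstrap CIs from per-clip 0/1 flags."""
    rng = random.Random(seed)
    deltas, ratios = [], []
    for _ in range(n_boot):
        # Resample each class independently, with replacement
        p_df = sum(rng.choices(flags_df, k=len(flags_df))) / len(flags_df)
        p_real = sum(rng.choices(flags_real, k=len(flags_real))) / len(flags_real)
        deltas.append(p_df - p_real)
        if p_real > 0:  # PR is undefined when no real clip is flagged
            ratios.append(p_df / p_real)

    def percentile_interval(values):
        if not values:
            return None
        ordered = sorted(values)
        low = ordered[int((alpha / 2) * (len(ordered) - 1))]
        high = ordered[int((1 - alpha / 2) * (len(ordered) - 1))]
        return low, high

    p_df = sum(flags_df) / len(flags_df)
    p_real = sum(flags_real) / len(flags_real)
    return {
        "p_df": p_df,
        "p_real": p_real,
        "delta_p": p_df - p_real,
        "delta_p_ci": percentile_interval(deltas),
        "PR": p_df / p_real if p_real > 0 else float("inf"),
        "PR_ci": percentile_interval(ratios),
    }
```

Fixing the seed keeps the report repeatable across runs. Other interval constructions, such as analytic intervals for a difference of proportions, would serve the same reporting purpose.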

Keywords: audio deepfakes, speech artifacts, acoustic anomalies, modulation spectrum, audio-video consistency, public safety, cybersecurity, civil protection, crisis communication, acoustic forensics

Article type: original research article



Karol Jędrasiak, PhD Eng.
