An overview of Automatic Speaker Recognition


Gérard CHOLLET, chollet@tsi.enst.fr
GET-ENST/CNRS-LTCI, 46 rue Barrault, 75634 PARIS cedex 13
http://www.tsi.enst.fr/~chollet

Outline
- Motivations, applications
- Speech production background
- Speaker characteristics in the speech signal
- Automatic speaker verification: decision theory, text dependent, text independent
- Databases, evaluation, standardization
- Audio-visual speaker verification
- Conclusions, perspectives

Why should a computer recognize who is speaking?
- Protection of individual property (home, bank account, personal data, messages, mobile phone, PDA, ...)
- Limited access (secured areas, databases)
- Personalization (only respond to its master's voice)
- Locating a particular person in an audio-visual document (information retrieval)
- Who is speaking in a meeting?
- Is a suspect the criminal? (forensic applications)

Domains of Automatic Speaker Recognition
Your voice is a signature.
- Speaker verification (voice biometrics): are you really who you claim to be?
- Identification within an open set: is this speech segment coming from a known speaker?
- Identification within a closed set
- Speaker detection, segmentation, indexing, retrieval: looking for recordings of a particular speaker
- Combining speech and speaker recognition: adaptation to a new speaker, personalization in dialogue systems

Applications
- Access control: physical facilities, computer networks, websites
- Transaction authentication: telephone banking, e-commerce
- Speech data management: voice messaging, search engines
- Law enforcement: forensics, home incarceration

Voice Biometric
Advantages:
- Often the only modality available over the telephone
- Low cost (microphone, A/D converter), ubiquity
- Possible integration on a smart (SIM) card
- Natural bimodal fusion: the speaking face
Disadvantages:
- Lack of discretion
- Possibility of imitation and electronic imposture
- Lack of robustness to noise, distortion, ...
- Temporal drift

Speaker Identity in Speech
- Differences in vocal tract shapes and muscular control
- Fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child)
- Glottal waveform
- Phonotactics
- Lexical usage, idiolects
- The difference between the voices of twins is a limiting case
- Voices can also be imitated or disguised

Speaker Identity
[Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axis labels f and A]
- Segmental factors (~30 ms):
  - glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  - vocal tract: characterized by its transfer function and represented by MFCCs (Mel Frequency Cepstral Coefficients)
- Suprasegmental factors:
  - speaking speed (timing and rhythm of speech units)
  - intonation patterns
  - dialect, accent, pronunciation habits

Speech production

Speech analysis

Inter-speaker Variability: "We were away a year ago."

Intra-speaker Variability: "We were away a year ago."

Mel Frequency Cepstral Coefficients

Speaker Verification
Typology of approaches (EAGLES Handbook):
- Text dependent: public password, private password, customized password, text prompted
- Text independent
- Incremental enrolment
- Evaluation

Automatic Speaker Verification
[Diagram labels: speech processing; claimed identity; verification system (biometric technology); acceptance / rejection]

What are the sources of difficulty?
- Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, ...)
- Recording conditions (filtering, noise, ...)
- Temporal drift
- Intentional imposture
- Voice disguise

Decision theory for identity verification
- Two types of errors:
  - false rejection (a client is rejected)
  - false acceptance (an impostor is accepted)
- Decision theory: given an observation O and a claimed identity,
  - H0 hypothesis: O comes from an impostor
  - H1 hypothesis: O comes from our client
- H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law) as P(O|H1) / P(O|H0) > P(H0) / P(H1)
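
A minimal Python sketch of the rule "choose H1 when P(H1|O) > P(H0|O)", evaluated in the log domain so that small likelihoods do not underflow. The function name `accept_claim`, its parameters, and the default prior of 0.5 are illustrative assumptions, not taken from the slides.

```python
import math


def accept_claim(log_lik_client, log_lik_impostor, prior_client=0.5):
    """Return True when the client hypothesis H1 is more probable than H0.

    log_lik_client   : log P(O | H1), observation scored by the client model
    log_lik_impostor : log P(O | H0), observation scored by the impostor model
    prior_client     : assumed prior P(H1); P(H0) is taken as 1 - P(H1)
    """
    if not 0.0 < prior_client < 1.0:
        raise ValueError("prior_client must lie strictly between 0 and 1")
    log_likelihood_ratio = log_lik_client - log_lik_impostor
    # Bayes' law turns P(H1|O) > P(H0|O) into LR > P(H0) / P(H1).
    log_threshold = math.log((1.0 - prior_client) / prior_client)
    return log_likelihood_ratio > log_threshold
```

With equal priors the threshold on the log-likelihood ratio is 0; lowering `prior_client` raises the threshold, which trades false acceptances for false rejections.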

Decision

Distribution of scores

Receiver Operating Characteristic (ROC) curve

Detection Error Tradeoff (DET) Curve
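
Both ROC and DET curves are traced by sweeping a decision threshold over the verification scores and recording the false rejection and false acceptance rates at each setting (a DET curve plots the two error rates on normal-deviate axes). The sketch below is a hypothetical helper under simple assumptions: higher scores mean "more likely the client", and the equal error rate (EER) is approximated by trying every observed score as a threshold.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at one threshold.

    A trial is accepted when its score is strictly greater than the threshold.
    """
    frr = sum(s <= threshold for s in client_scores) / len(client_scores)
    far = sum(s > threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far


def equal_error_rate(client_scores, impostor_scores):
    """Approximate the EER; returns (eer, threshold).

    Every observed score is tried as a threshold and the one where the two
    error rates are closest is kept (the first such threshold on ties).
    """
    best = None
    for threshold in sorted(set(client_scores) | set(impostor_scores)):
        frr, far = error_rates(client_scores, impostor_scores, threshold)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0, threshold)
    return best[1], best[2]
```

Collecting the `(frr, far)` pairs for all thresholds instead of keeping only the closest pair gives the points of the curve itself.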

History of Speaker Recognition

Current approaches

Text-dependent Speaker Verification
- Uses automatic speech recognition techniques (DTW, HMM, ...)
- Client model adapted from a speaker-independent HMM ("world" model)
- Synchronous alignment of client and world models for the computation of a score

Dynamic Time Warping (DTW)

Dynamic Time Warping (DTW)
DODDINGTON 1974, ROSENBERG 1976, FURUI 1981, etc.
[Diagram labels: "Bonjour" speaker 1; "Bonjour" speaker 2; "Bonjour" speaker n; "Bonjour" test speaker Y; "Bonjour" speaker X; best path]
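
A minimal sketch of the DTW alignment cost between two utterances, each given as a sequence of feature vectors. The Euclidean local distance and the three-way step pattern (insertion, deletion, match) are common textbook choices assumed here; the slides do not specify which variant the cited systems used.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Cumulative cost of the best warping path between two sequences.

    seq_a, seq_b: sequences of equal-dimension feature vectors (tuples/lists).
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances alone
                cost[i][j - 1],      # seq_b advances alone
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

In a verification setting, the cost between the test utterance and the claimed speaker's reference would be compared with a threshold; in practice the cost is often normalized by path length, which this sketch omits.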

Vector Quantization (VQ)
SOONG, ROSENBERG 1987
[Diagram labels: codebook speaker 1; codebook speaker 2; codebook speaker n; "Bonjour" test speaker Y; codebook speaker X; best quantization]
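
A sketch of VQ-based scoring: each speaker is represented by a codebook of feature vectors, and a test utterance is attributed to the codebook that quantizes its frames with the lowest average distortion. Codebook training (for example with k-means) is not shown; the function names and the Euclidean distortion measure are assumptions made for illustration.

```python
import math


def average_distortion(frames, codebook):
    """Mean distance from each frame to its nearest codeword."""
    if not frames or not codebook:
        raise ValueError("frames and codebook must be non-empty")
    total = sum(min(math.dist(f, c) for c in codebook) for f in frames)
    return total / len(frames)


def identify(frames, codebooks):
    """Return the label of the codebook giving the best quantization.

    codebooks: dict mapping a speaker label to a list of codewords.
    """
    return min(codebooks, key=lambda name: average_distortion(frames, codebooks[name]))
```

For verification rather than identification, the distortion against the claimed speaker's codebook would be compared with a threshold instead of taking the minimum over all speakers.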

Hidden Markov Models (HMM)
ROSENBERG 1990, TSENG 1992
[Diagram labels: "Bonjour" test speaker Y; "Bonjour" speaker X; "Bonjour" speaker 1; "Bonjour" speaker 2; "Bonjour" speaker n; best path]

Ergodic HMM
PORITZ 1982, SAVIC 1990
[Diagram labels: HMM speaker 1; HMM speaker n; "Bonjour" test speaker Y; HMM speaker X; best path]

Gaussian Mixture Models (GMM) REYNOLDS 1995

An example of a text-dependent speaker verification system: the PICASSO project
- Sequences of digits
  - Speaker-independent HMM of each digit
  - Adaptation of these HMMs to the client's voice (during enrolment and incremental enrolment)
  - An EER of less than 1 % can be achieved
- Customized password
  - The client chooses his password using some feedback from the system
- Deliberate imposture

Deliberate imposture
- The impostor has some recordings of the target client's voice.
- He can record the same sentences and align these speech signals with the recordings of the client.
- A transformation (Multiple Linear Regression) is computed from these aligned data.
- The impostor has heard the target client's password.
- He records that password and applies the transformation to this recording.
- The PICASSO reference system, with less than 1 % EER, is defeated by this procedure (more than 30 % EER).

Incremental enrolment of customised password
- The client chooses his password using some feedback from the system.
- The system attempts a phonetic transcription of the password.
- Incremental enrolment is achieved on further repetitions of that password.
- Speaker-independent phone HMMs are adapted with the client's enrolment data.
- Synchronous alignment likelihood ratio scoring is performed on access trials.

HMM structure depends on the application

Speaker Verification (text independent)
- The ELISA consortium: ENST, LIA, IRISA, DDL, Uni-Fribourg, Uni-Balamand, ... http://elisa.ddl.ish-lyon.cnrs.fr/
- NIST evaluations: http://www.nist.gov/speech/tests/spk/index.htm
- Gaussian Mixture Models, graphical models
- Segmental approaches (ALISP)

Gaussian Mixture Model
Parametric representation of the probability distribution of observations:
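
For reference, a Gaussian mixture density is conventionally written as below. The symbols (M components, weights w_i, means \mu_i, covariances \Sigma_i, feature dimension D) are standard notation assumed here, not recovered from the slide.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0,

\mathcal{N}(x;\, \mu_i, \Sigma_i) =
\frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right).
```

The model parameters are \lambda = \{ w_i, \mu_i, \Sigma_i \}_{i=1}^{M}; diagonal covariance matrices are a common simplification.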

Gaussian Mixture Models: 8 Gaussians per mixture

GMM speaker modeling
[Diagram: world data -> front-end -> GMM modeling -> world GMM model; target speaker -> front-end -> GMM model adaptation (from the world GMM model) -> target GMM model]

Baseline GMM method
[Diagram: test speech -> front-end -> scoring against the hypothesized target GMM model and the world GMM model -> LLR score]
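
The LLR (log-likelihood ratio) score of this baseline can be sketched as the average, over test frames, of the difference between the log-likelihood under the target GMM and under the world GMM. The code below assumes diagonal-covariance GMMs stored as lists of `(weight, mean, variance)` tuples; that representation and the function names are illustrative choices, not the system described in the slides.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log p(x | gmm), with gmm a list of (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, mean, var) for w, mean, var in gmm]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def llr_score(frames, target_gmm, world_gmm):
    """Average per-frame log-likelihood ratio: target model versus world model."""
    total = sum(
        gmm_log_likelihood(x, target_gmm) - gmm_log_likelihood(x, world_gmm)
        for x in frames
    )
    return total / len(frames)
```

A positive score favours the hypothesized target speaker; the accept/reject decision compares the score with a threshold tuned on development data.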

Support Vector Machines and Speaker Verification
- A hybrid GMM-SVM system is proposed
- The SVM scoring model is trained on development data to separate true-target speaker accesses from impostor accesses, using a new feature representation based on GMMs
- Modeling: GMM; scoring: SVM

SVM principles
[Diagram labels: X in the input space; y(X) in the feature space; separating hyperplanes H, with the optimal hyperplane Ho; Class(X)]

Results

State of the art – research directions (3)
- World model: speaker independent, trained on all available speakers using the EM algorithm
- Client model: obtained as an adaptation of the world model, either by MAP (with a prior distribution) or by MLLR (with a transform function)
- Unified approach
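
One widely used form of MAP adaptation updates only the component means of the world model, interpolating between the world mean and the mean of the client data assigned to that component. The sketch below uses scalar features and a "relevance factor" to control the interpolation; both are simplifying assumptions for illustration, and the slides do not say which MAP variant was used.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_adapt_means(world, data, relevance=16.0):
    """Derive a client model from a world GMM by MAP adaptation of the means.

    world     : list of (weight, mean, var) tuples for scalar features
    data      : client enrolment observations (floats)
    relevance : assumed relevance factor; larger values trust the world model more
    Weights and variances are copied unchanged from the world model.
    """
    counts = [0.0] * len(world)
    sums = [0.0] * len(world)
    for x in data:
        # Posterior probability of each component given x (one E step).
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in world]
        norm = sum(joint)
        if norm == 0.0:
            continue  # observation too far from every component
        for i, p in enumerate(joint):
            counts[i] += p / norm
            sums[i] += (p / norm) * x
    client = []
    for (w, m, v), n_i, s_i in zip(world, counts, sums):
        alpha = n_i / (n_i + relevance)
        data_mean = s_i / n_i if n_i > 0.0 else m
        client.append((w, alpha * data_mean + (1.0 - alpha) * m, v))
    return client
```

Components that see little client data keep a mean close to the world model, which is what makes this kind of adaptation usable with short enrolment sessions.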

Adaptation
- Variable degree of freedom -> variable partitioning of the distributions
- After each E step of EM -> a partitioning that gives a sufficient amount of data per class
[Figure: example per-class counts 12, 9, 17, 6, 23, 21, 33, 56]

Hierarchical - MLLR adapted System

National Institute of Standards & Technology (NIST) Speaker Verification Evaluations
- Annual evaluation since 1995
- Common paradigm for comparing technologies

NIST evaluations: overview
- Recognized standard for evaluating speaker verification systems
- Several hundred different speakers, several tens of thousands of test accesses
- Participation of the best laboratories worldwide: MIT, IBM, Nuance, ...
- ENST has participated since 1997

NIST evaluations: protocol
- Training phase: 2 minutes of spontaneous speech; telephone condition, cellular network
- Test phase: files containing 5 s to 50 s of spontaneous speech

NIST evaluations: results
- The results are presented and discussed at an annual workshop.
- Steady improvement of ENST's performance (18 % to 9 %) despite increasing difficulty:
  - shorter training duration
  - move from the switched telephone network to the cellular network

NIST evaluations: results

Combining Speech Recognition and Speaker Verification
- Speaker-independent phone HMMs
- Selection of segments or segment classes which are speaker specific
- Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)

1.1 Speech Segmentation
- Large Vocabulary Continuous Speech Recognition (LVCSR):
  - needs a huge amount of transcribed speech data
  - language (and task) dependent
  - good results for a small set of languages (with existing AND available transcripts)
  - we do not have such a system
- Data-driven speech segmentation:
  - not yet usable for speech recognition purposes
  - no annotated databases needed
  - language and task independent
  - we could use it to segment the speech data for a text-independent speaker verification task and for language identification
- ALISP (Automatic Language Independent Speech Processing) method

1.2 ALISP data-driven speech segmentation

3. Data-driven Speech Segmentation for Speaker Verification
- Current best speaker verification systems are based on Gaussian Mixture Models (each speech frame is treated independently, and no temporal information is taken into account); improvements are still necessary
- Speech is composed of different sounds
- Phonemes have different discriminant characteristics for speaker verification:
  - nasals and vowels convey more speaker characteristics than other speech classes
  - we would like to exploit this idea, but with data-driven ALISP units
- An automatic speech segmentation tool is needed

3.1 Advantages and disadvantages of the speech segmentation step
Problems:
- Need for an automatic speech segmentation tool
- Speaker modeling per speech class => more data needed
- More classes => more complicated systems
Advantages:
- Possibility of using it in combination with dialogue-based systems
- Text-prompted speaker verification
- Better accuracy if enough speech data is available

3.2 Proposed system: ALISP-based segmental speaker verification using DTW
- Speaker-specific information is extracted from the ALISP-based speech segments => client dictionary
- Non-speaker (world speakers): ALISP-based speech segments => world dictionary
- Dynamic Time Warping (DTW) was already used for speaker verification, but in a text-dependent mode:
  - comparison of two speech data with a similar linguistic content
  - the DTW distance measure between two speech segments conveys some speaker-specific characteristics
- Originality: use DTW in text-independent mode
  - the speech data are first segmented into ALISP classes, in order to remove the linguistic variability
  - measure the distances among speaker and non-speaker speech segments

3.3 Searching in client and world speech dictionaries for speaker verification purposes

3.4 Database and experimental setup for the speaker verification experiments
- Development data: NIST 2001 cellular data (American English)
  - world speakers (60 female + 59 male): train the ALISP speech segmenter, model the non-speakers
- Evaluated on:
  - a small subset (14 female + 14 male speakers) from NIST 2001 cellular data
  - the full set of NIST 2002 cellular data (??? speakers)
- Speech parameterization: LPCC for the initial ALISP segmentation and MFCC afterward
- 64 ALISP speech classes

3.5 Results: example of data-driven speech segmentation for speaker verification
- Comparison of a manual transcription with the ALISP segmentation ("I think my my daughter")
- 2 occurrences of the English phone sequence m - ay; corresponding ALISP sequences: HM-Hf-Ha and HM-Hz-Ha-HC

3.6 Results: another example of data-driven speech segmentation for speaker verification
- 2 other occurrences of the English phone ay; the corresponding ALISP sequences: HX-Hf and Hf-Ha
- Previous slide: Hf-Ha and Ha-Hz

3.7 Speaker Verification DET curves

3.8 Conclusions
- State-of-the-art NIST 2002 results for EER: best 8 % to worst 28 %
- Problem with the small data set results: influence of the size of the test set and/or mismatched train/test conditions
- What we have NOT done:
  - exploit the speech classes (silence classes are also included)
  - normalization (with pseudo-impostors)
  - exploit the DTW distance value, not only the "preference" result

SuperSID experiments

GMM with cepstral features

Selection of nasals in words ending in -ing: being, everything, getting, anything, thing, something, things, going

Fusion

Fusion results

Talking faces and identity verification
- The face and speech offer complementary information about a person's identity.
- Many PCs, PDAs and telephones are, and will be, equipped with a camera and a microphone.
- Imposture is more difficult to carry out.
- Research theme developed at ENST within the IST-SecurePhone project.

Talking faces and identity verification
- Digit sequence (PIN code)
- Customized password

Fusion of speech and face (PhD thesis of Conrad Sanderson, August 2002)

Example application
[Diagram labels: insecure network; remote server (access to secure services, transaction validation, etc.)]
- Acquisition of the biometric signals for each modality
- Computation of the decision score for each system
- Computation of a final decision score based on the fusion of the single-modality scores
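
The last step, fusing single-modality scores into a final decision score, can be sketched with a weighted sum. This is only one possible fusion rule and is an assumption here (the slides do not state which rule is used); the sketch also assumes that the per-modality scores have already been normalized to a comparable scale.

```python
def fuse_scores(scores, weights):
    """Weighted-sum fusion of per-modality scores.

    scores  : dict mapping a modality name to its (normalized) decision score
    weights : dict mapping the same modality names to non-negative weights
    The weights are normalized so that they sum to 1.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same modalities")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[m] * scores[m] for m in scores) / total_weight


def decide(scores, weights, threshold=0.0):
    """Accept the identity claim when the fused score exceeds the threshold."""
    return fuse_scores(scores, weights) > threshold
```

For example, `fuse_scores({"speech": 1.0, "face": -0.5}, {"speech": 0.5, "face": 0.5})` gives 0.25; the weights and the threshold would normally be tuned on development data.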

Conclusions and Perspectives
- Speech allows identity verification over the telephone.
- Combining text-dependent and text-independent approaches improves reliability.
- If the face is used to verify identity, adding speech costs little (and pays off handsomely!).
- More and more PCs, PDAs and telephones are equipped with a microphone and a camera.
- Audio-visual recognition should become widespread.

Perspectives
- Speech is often the only usable biometric modality (over the telephone network).
- Fusion of modalities.
- A number of R&D projects within the EU.