Neural Diarization based on Permutation-invariant Training

Case ID:
C15665

Unmet Need

Speech enhancement and speaker adaptation are essential for many applications, particularly in noisy settings where multiple people speak at once and individual voices are hard to distinguish. Such overlapping-speech environments fall into two broad categories according to how severely they degrade recognition. The worst is the "dinner-party" setting, where overlap has been shown to degrade Automatic Speech Recognition (ASR) performance by up to 50%; the second is the meeting setting, where roughly 30% of ASR performance is affected. There is therefore a need for speech enhancement and speaker adaptation in these environments based on time annotations of speaker segments, with applications ranging from telephone conversations to especially noisy venues. In short, accurate speaker diarization is needed to improve ASR performance on overlapping speech. Current state-of-the-art systems suffer from complicated building blocks, an inability to perform diarization online, and a need for large amounts of speaker variation to achieve high performance.

 

Technology Overview

The inventors have proposed a novel speaker diarization method based on an end-to-end learning framework. Current approaches to improving ASR performance rely on speaker identification models and clustering modules, which are complex and inefficient. The proposed solution integrates with online speech activity detection, enabling real-time diarization. It also addresses the overlapping-speech problem directly by using real overlapping speech as the training set, which removes the need for any additional module and increases efficiency, since this training data most accurately reflects real-world conditions. The proposed framework uses a voice activity detector that receives audio input and feeds a neural network trained on the full variety found in real overlapping speech. Specifically, the inventors developed two neural networks that work in tandem: the first minimizes a permutation-invariant loss value, and the second, initiated after a time delay, operates on the outputs of the first in the same fashion. This allows for efficient training, eliminates clustering, and yields a simplified model that improves ASR performance.
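The permutation-invariant loss mentioned above addresses the fact that reference speaker labels have no natural ordering: the loss is computed under every possible assignment of network outputs to speakers, and the smallest value is used for training. As a rough illustration only (the function name, shapes, and binary cross-entropy formulation below are our own assumptions, not taken from the patent), a minimal sketch might look like this:

```python
import itertools
import numpy as np

def pit_bce_loss(preds, labels):
    """Illustrative permutation-invariant loss for diarization (not the
    patented implementation).

    preds:  (T, S) array of per-frame speaker-activity probabilities.
    labels: (T, S) array of 0/1 reference activities; the column (speaker)
            order of the reference is arbitrary.
    Returns the minimum mean binary cross-entropy over all S! permutations
    of the output-to-speaker assignment.
    """
    _, num_speakers = preds.shape
    eps = 1e-8  # avoid log(0)
    best = np.inf
    for perm in itertools.permutations(range(num_speakers)):
        p = preds[:, perm]  # reorder output columns under this assignment
        bce = -(labels * np.log(p + eps)
                + (1 - labels) * np.log(1 - p + eps)).mean()
        best = min(best, bce)
    return best
```

Because the minimum is taken over assignments, a network whose outputs match the reference speakers in a swapped order is not penalized, which is what makes end-to-end training on unordered speaker labels possible.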

 

Stage of Development

The inventors have developed this novel speaker diarization method and are testing its validity.

Patent Information:
Title (all filings): MULTI-SPEAKER DIARIZATION OF AUDIO INPUT USING A NEURAL NETWORK

App Type | Country | Serial No. | Patent No. | File Date | Issued Date | Expire Date | Status
Provisional | United States | 62/896,392 | - | 9/5/2019 | - | - | Expired
PCT (Patent Cooperation Treaty) | PCT | PCT/US2020/048730 | - | 8/31/2020 | - | - | Pending
PCT (Patent Cooperation Treaty) | Japan | 2021-575505 | 7340630 | 8/31/2020 | 8/30/2023 | 8/31/2040 | Granted
PCT (Patent Cooperation Treaty) | United States | 17/595,472 | - | 11/17/2021 | - | - | Pending
For Information, Contact:
Andrew Wichmann
wichmann@jhu.edu
410-614-0300
2017 - 2022 © Johns Hopkins Technology Ventures. All Rights Reserved.