[Proposal] Architecture for crowd-sourcing wake word voice data #13

Open
HarvsG opened this issue Apr 21, 2024 · 0 comments

HarvsG commented Apr 21, 2024

I've just watched Mike's piece from the recent State of the Open Home livestream and it got me thinking.

This is almost certainly the wrong place to discuss this, so please do redirect me.

The problem

Users' problems:

  • Inaccuracies in wake word detection or STT are frustrating and degrade the user experience
  • You can train your own wake word, but the training data is limited to what the sample generator produces

Developers' problems:

  • Voice data is hard and expensive to collect, clean and train on
  • There is a wide variety in microphone quality and hardware that is not easily simulated
  • The variety of voices produced by the sample generator does not match the variety of users' real voices and recording conditions (accents, environments, reverberation, microphones, volumes, etc.)

Proposed solution

  • A user sets up a local voice pipeline. If they wish to use a custom wake word, they are signposted to a webpage I will call "VoiceTrainer"
  • At VoiceTrainer they go through an onboarding process:
    • Confirm they want to train a custom wake word
    • A privacy page explains that recordings of their voice will be contributed to open-source databases, that other people might be able to hear samples of their voice, and that the recordings will also be used to train other people's wake words.
      • A user who refuses is redirected to the old wake word training guide
    • The user is provided with an endpoint URL that can be added to home assistant/wyoming
  • The user connects their voice pipeline to the VoiceTrainer endpoint, ideally through an option within the HA or add-on UI.
  • The user defines their custom wake word, e.g. "Doris"
  • A check is run to see if the wake word already exists in the open database
  • The user is then asked to record several samples of their own wake word on one or more voice satellites and listen back to them
  • The user is then asked to record a random selection of X other people's wake words. (Random selection weighted towards newer wake words)
    • These recordings will be used to help other people (re)train their own voice assistants
    • The user's recordings of other people's wake words will also be used as negative samples for the user's own wake word.
  • The user is then asked to record some random passage of text that does not contain wake-words (for use as a negative wake word sample, and possibly for STT training)
  • The user is then asked to listen to a small number of other people's contributions to verify their quality and accuracy.
  • The user is then directed to a Google Colab instance or similar and is provided with an up-to-date database of their voice samples (and everyone else's)
  • Once 100 other users have contributed samples of "Doris", the user is notified and it is suggested that they retrain their model using the new crowd-sourced data.
  • The model is uploaded and open-sourced
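
To make a few of the steps above more concrete, here is a rough Python sketch of three of them: checking whether a wake word already exists (via a normalised key), picking other people's wake words with a bias towards newer ones, and deciding when to suggest a retrain. None of this exists yet. The names `WakeWord`, `normalise`, `recency_weight`, `pick_words_to_record` and `ready_for_retrain` are hypothetical, and the 30-day half-life is an arbitrary assumption. Only the threshold of 100 contributors comes from the proposal.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class WakeWord:
    text: str             # e.g. "Doris"
    created_at: datetime  # when the wake word was first registered
    contributors: int     # distinct other users who have recorded it


def normalise(text: str) -> str:
    """Key used to check whether a wake word is already in the database."""
    return " ".join(text.split()).casefold()


def recency_weight(word: WakeWord, now: datetime,
                   half_life_days: float = 30.0) -> float:
    """Weight that halves every `half_life_days` (assumed value, not tuned)."""
    age_days = max((now - word.created_at).total_seconds() / 86400.0, 0.0)
    # Small floor so very old words can still be picked occasionally.
    return max(0.5 ** (age_days / half_life_days), 1e-9)


def pick_words_to_record(words, own_word, count, now, rng):
    """Weighted sampling without replacement, skipping the user's own word."""
    own_key = normalise(own_word)
    pool = [w for w in words if normalise(w.text) != own_key]
    chosen = []
    while pool and len(chosen) < count:
        weights = [recency_weight(w, now) for w in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen


def ready_for_retrain(word: WakeWord, threshold: int = 100) -> bool:
    """True once enough other users have contributed samples of this word."""
    return word.contributors >= threshold
```

A service could call `pick_words_to_record(all_words, "Doris", X, datetime.now(timezone.utc), random.Random())` during onboarding, and check `ready_for_retrain` each time a new contribution arrives. An exponential decay is only one way to express "weighted towards newer wake words"; weighting by how few samples a word has would be another reasonable choice.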

The result

  • An increasing database of wake-words with positive and negative samples
  • A database of open-source STT samples for use in training datasets
    • A wide variety of voices, on a wide variety of hardware
    • From a sample population of people who use local voice hardware
  • An incentive structure so that people help train each other's wake words
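
As a sketch of how the resulting database could be turned into training data, the Python below splits samples into positives and negatives for one wake word, using only samples that other users have verified. Everything here is an assumption for illustration: the `Sample` fields, the function names `is_verified` and `build_training_set`, and the voting thresholds (at least 3 votes, 80% approval) are hypothetical and not part of any existing project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    contributor_id: str       # pseudonymous contributor ID (hypothetical)
    device: str               # hardware the clip was recorded on
    wake_word: Optional[str]  # None for passages that contain no wake word
    audio_path: str           # where the recording is stored
    approvals: int = 0        # listeners who confirmed the clip is accurate
    rejections: int = 0       # listeners who flagged the clip


def is_verified(sample: Sample, min_votes: int = 3,
                min_approval: float = 0.8) -> bool:
    """A clip counts as verified once enough listeners have approved it."""
    votes = sample.approvals + sample.rejections
    if votes == 0 or votes < min_votes:
        return False
    return sample.approvals / votes >= min_approval


def build_training_set(samples, target: str):
    """Split verified samples into (positives, negatives) for `target`.

    Positives are recordings of the target wake word. Negatives are
    recordings of other wake words and of wake-word-free passages.
    """
    key = " ".join(target.split()).casefold()
    positives, negatives = [], []
    for s in samples:
        if not is_verified(s):
            continue
        word = None if s.wake_word is None else " ".join(s.wake_word.split()).casefold()
        if word == key:
            positives.append(s)
        else:
            negatives.append(s)
    return positives, negatives
```

Keeping `device` on every sample is what would let a trainer balance or filter by hardware later, which is the main advantage over generated samples.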

Issues

  • This is a bit of a 'pyramid scheme', in that the first people may benefit the most and the last people will have no one to train their models. This would be fine if there are enough contributors and a steady stream of new users
  • Building this infrastructure is probably quite a substantial undertaking
  • Who pays for the training compute if it can't be done within Google Colab?