Support for Streaming Conformer Transducer #178

Draft · wants to merge 35 commits into base: main

Conversation

@andreselizondo-adestech commented Apr 15, 2021

This PR is an attempt at adding support for the Streaming Conformer Transducer network.

The changes that have been identified are:

  1. DepthwiseConv2D inside ConvModule needs to have padding='causal'
  2. MHSA layer must receive a mask that indicates which chunks to use for each timestep.
    2.1. A parameter for history_window_size needs to be added to config and dataset preprocessing.
  • Create StreamingConformer class
  • Set ConvModule -> DepthwiseConv2D padding to causal when streaming=True
  • Create ASRMaskedDataset. It must compute the required mask using history_window_size.
  • Pass mask to MHSA layer (Create wrapper StreamingConformerEncoder for this?).
  • Create custom DepthwiseConv1D with support for padding=causal
  • Create ASRMaskedTFRecordDataset for working with TFRecords.
  • Load time_reduction_factor into ASRMaskedDataset dynamically. (currently hardcoded)
  • Add pure TF mask creation function with optional pre-compute (see the sketch at the end of this description).
  • Clean up StreamingConformer class. Remove unnecessary methods copied from StreamingTransducer.
  • Use the correct optimizer and hyperparameters
  • Inference???
  • Cleanup

Deferred:

  • Create MaskedTransducerTrainerGA for working with gradient accumulation. (GA is no longer supported)

All comments and edits are welcome.
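For illustration, here is a minimal sketch of the kind of chunk-based mask described above. The function name, signature, and chunking scheme are assumptions for this sketch, not the PR's actual code: each frame attends to all frames in its own chunk and in the previous history_window_size chunks, and never to future chunks.

import tensorflow as tf

def create_streaming_mask(num_frames, chunk_size, history_window_size):
    # Sketch only (illustrative assumption, not the PR's implementation).
    # Chunk index of every frame along the time axis.
    chunks = tf.range(num_frames) // chunk_size
    q = chunks[:, None]  # chunk of each query frame
    k = chunks[None, :]  # chunk of each key frame
    # Attend within the current chunk and the previous
    # history_window_size chunks; never attend to future chunks.
    return (k <= q) & (k >= q - history_window_size)  # [num_frames, num_frames] bool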

@andreselizondo-adestech

This PR aims to advance the TODO list in #14.

@andreselizondo-adestech

@usimarit Hello!
I've run into a problem.
The Streaming Conformer Transducer (SCT) paper states that we need to convert the depthwise convolution inside the ConvModule to a "causal" depthwise convolution. However, this requires it to be a Conv1D, not a Conv2D.
Take a look at: tensorflow_asr/models/conformer.py#L158

My question is: why is a Conv2D being used? I've double-checked with the original Conformer paper and it's supposed to be a Conv1D.
Maybe there's something I'm missing.
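For context, a minimal sketch of why the DepthwiseConv2D still computes a 1D convolution here, assuming the module inserts a dummy spatial axis before the depthwise layer (shapes are illustrative assumptions):

import tensorflow as tf

# Illustrative shapes: batch=2, time=100, channels=144.
x = tf.random.normal([2, 100, 144])

# With a dummy spatial axis of size 1, a DepthwiseConv2D with kernel
# (32, 1) slides only along the time axis, i.e. it behaves as a
# depthwise 1D convolution over the channels.
x4d = tf.expand_dims(x, axis=2)  # [2, 100, 1, 144]
dw2d = tf.keras.layers.DepthwiseConv2D(kernel_size=(32, 1), padding="same")
y = tf.squeeze(dw2d(x4d), axis=2)  # [2, 100, 144]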

@andreselizondo-adestech commented Apr 16, 2021

I noticed you're using a DepthwiseConv2D with kernel_size=(32,1).
Would you consider replacing this with a SeparableConv1D with kernel_size=(32)?
We would also need to specify a value for filters; I'd guess filters=input_dim would be good enough, no? 🤷‍♂️

Here's what that would look like:

self.dw_conv = tf.keras.layers.SeparableConv1D(
    filters=input_dim,
    kernel_size=kernel_size,
    strides=1,
    padding="same" if not streaming else "causal",
    name=f"{name}_dw_conv",
    depth_multiplier=depth_multiplier,
    depthwise_regularizer=kernel_regularizer,
    bias_regularizer=bias_regularizer,
)

@andreselizondo-adestech

Good news @usimarit, most of the initial work is done. The model is now trainable, though a few things are still missing.
Please take a look and let me know what you think 😃

I had to modify the base Conformer class, but the changes should not affect anything else.

@nglehuy commented Apr 17, 2021

> I noticed you're using a DepthwiseConv2D with kernel_size=(32,1).
> Would you consider replacing this with a SeparableConv1D with kernel_size=(32)? [...]

Sorry for the late reply. SeparableConv1D is the DepthwiseConv2D combined with a Conv1D after it, so the architecture would be wrong if you applied SeparableConv1D.

@nglehuy commented Apr 17, 2021

@andreselizondo-adestech We will have a big change in the repo structure as in PR #177, please be aware of that 😄 These changes will split the conformer file into an encoder file and a model file like this. I'm about to finish that PR, so you will have to pull main, create a new branch, and cherry-pick what you've done into the new structure 😄

@andreselizondo-adestech

> We will have a big change in the repo structure as in PR #177. [...] you will have to pull main, create a new branch, and cherry-pick what you've done into the new structure 😄

Understood, I'll look into the new format :)

Regarding the SeparableConv1D: I now see what you mean. It seems odd to me that DepthwiseConv1D only exists as part of that combination of layers; the operation is supported internally, it's just not exposed for us to use.

I found this issue/PR (tensorflow/tensorflow#48557) on the TensorFlow repository. They intend to add support for the layer we need. However, the issue was opened less than 24 hours ago, so we'll have to wait and see how long it takes to be released into tf-nightly.

@nglehuy commented Apr 17, 2021

@andreselizondo-adestech We can build our own DepthwiseConv1D 😄 no need to wait until TensorFlow supports it.
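A minimal sketch of one way to build it, wrapping DepthwiseConv2D and left-padding manually for the causal case (an illustrative assumption, not necessarily the layer this PR ends up with):

import tensorflow as tf

class DepthwiseConv1D(tf.keras.layers.Layer):
    # Sketch only: a depthwise-only 1D convolution built on DepthwiseConv2D.
    def __init__(self, kernel_size, strides=1, padding="same",
                 depth_multiplier=1, depthwise_regularizer=None,
                 bias_regularizer=None, **kwargs):
        super().__init__(**kwargs)
        self.causal = padding == "causal"
        self.kernel_size = kernel_size
        self.dw_conv = tf.keras.layers.DepthwiseConv2D(
            kernel_size=(kernel_size, 1),
            strides=strides,
            # For the causal case we left-pad manually and run "valid".
            padding="valid" if self.causal else padding,
            depth_multiplier=depth_multiplier,
            depthwise_regularizer=depthwise_regularizer,
            bias_regularizer=bias_regularizer,
        )

    def call(self, inputs):  # inputs: [batch, time, channels]
        outputs = inputs
        if self.causal:
            # Left-pad by (kernel_size - 1) so each frame only sees the past.
            outputs = tf.pad(outputs, [[0, 0], [self.kernel_size - 1, 0], [0, 0]])
        outputs = tf.expand_dims(outputs, axis=2)  # [batch, time, 1, channels]
        outputs = self.dw_conv(outputs)
        return tf.squeeze(outputs, axis=2)  # back to [batch, time, channels]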

@nglehuy commented Apr 17, 2021

@andreselizondo-adestech The refactor PR is merged 😄

@andreselizondo-adestech

@usimarit I'm merging my changes into the refactored code; however, there appears to be an issue when using SentencePiece for training.
Specifically at line tensorflow_asr/featurizers/text_featurizers.py#L342.
It seems this function is nowhere to be found, and at the same time the default value for model in that function is None. So when it is called from examples/conformer/train.py#L68, model is not specified and is therefore None.

@nglehuy commented Apr 19, 2021

@andreselizondo-adestech Ah yeah, I missed that part, I'll update it.

@andreselizondo-adestech

@usimarit I've adapted my changes to the refactored repo and everything seems to be working. The next step is to create our own implementation of DepthwiseConv1D.

I've been digging into how TF implements SeparableConv1D, but it just calls SeparableConv2D (similar to how you did it).
So I looked into SeparableConv2D and DepthwiseConv2D, but I couldn't find the implementation of the underlying TF operation.

Could you help me out with this?

@nglehuy commented Apr 21, 2021

> I've been digging into how TF implements SeparableConv1D [...] I couldn't find the implementation of the underlying TF operation. Could you help me out with this?

Seems like it's from the TF C/C++ library 😄

@andreselizondo-adestech

I'm currently running a test on two VMs: regular Conformer vs. DepthwiseConv1D Conformer.
We'll see the results in maybe ~30 hrs. (I'm training on the CommonVoice2 dataset though, so WER results won't be directly comparable to the paper.)

@usimarit In the meantime, I'm not sure how inference should work for the Streaming Conformer.
Can you guide me? Do you see something that's missing in the PR?

@andreselizondo-adestech

@usimarit Good news! The two Conformer models converge to the same CER, meaning performance was not negatively impacted by the custom DepthwiseConv1D layer.
I trained on chars and the best CER I got was ~5.2.
I'll be training on subwords shortly.

In the meantime, I think we should look at how to do streaming inference on the Streaming Conformer Transducer.

@andreselizondo-adestech

@usimarit The next step is looking at the StreamingConformer class.
I based the class on the StreamingTransducer class, so I don't know if there are methods that should be different.
Would you mind requesting whatever changes are necessary? Or you could explain to me how it should work.

@andreselizondo-adestech marked this pull request as ready for review May 4, 2021 14:25
@nglehuy commented May 16, 2021

> The next step is looking at the StreamingConformer class. I based the class on the StreamingTransducer class [...] Or you could explain to me how it should work.

I haven't had time to dive into how the StreamingConformer works in inference mode, but I think it's quite different from the RnnTransducer (formerly StreamingTransducer). I'll try to make time for this.

But anyway, we should complete the whole pipeline (training, inference, testing, tflite) before merging 😄

@andreselizondo-adestech marked this pull request as draft May 18, 2021 19:20
@andreselizondo-adestech

@usimarit Hey there, this is just a gentle ping.
Do you have anything that might guide me on implementing inference for Streaming Conformer? :)

@nglehuy commented Jul 23, 2021

@andreselizondo-adestech Sorry, I'm currently a bit busy until the end of July. After that, I can get back to supporting this feature 😄

@andreselizondo-adestech

Hello @usimarit
Are you still interested in implementing this?
Let me know if you need my help 😄

@nglehuy commented Oct 28, 2021

@andreselizondo-adestech Of course, I'll find some free time to help implement the inference for this.
In the meantime, can you help me resolve the conflicts? It's just some conflicts in code format and imports: I switched from autopep8 to black and to absolute imports instead of relative imports (which are generally recommended).

@nglehuy commented May 6, 2024

@andreselizondo-adestech Hi, are you still working on this?
I think we should compute the attention mask in the MHA layer instead of in the dataset, which is somewhat similar to the causal attention masking in the v2.x version of tfasr, but with history truncation and limited future context.
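A minimal sketch of what that could look like, computing the mask inside the attention layer; the function name, frame counts, and the use of Keras MultiHeadAttention are illustrative assumptions:

import tensorflow as tf

def truncated_causal_mask(num_frames, history, future):
    # Causal mask with truncated history and limited look-ahead, in frames.
    i = tf.range(num_frames)[:, None]  # query positions
    j = tf.range(num_frames)[None, :]  # key positions
    return (j >= i - history) & (j <= i + future)  # [num_frames, num_frames]

# Computed inside the MHA wrapper rather than in the dataset, e.g.:
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=36)
x = tf.random.normal([2, 50, 144])
mask = truncated_causal_mask(50, history=16, future=4)[None, ...]  # add batch dim
y = mha(x, x, attention_mask=mask)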
