In this session, we’ll discuss MatchboxNet, an end-to-end neural network for speech command recognition. MatchboxNet is built from blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout layers. It reaches state-of-the-art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models. We’ll demonstrate how intensive data augmentation with an auxiliary noise dataset improves robustness to background noise, and how the model’s small size makes it viable for voice activity detection.
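
To give a feel for why time-channel separable convolutions keep the parameter count low, here is a minimal pure-Python sketch. It is not MatchboxNet's implementation: the function names are hypothetical, the layer sizes (64 channels, kernel width 13) are illustrative assumptions, and biases, padding, and stride are left out. A time-channel separable convolution is assumed here to mean a depthwise convolution along time (one filter per channel) followed by a pointwise (1x1) convolution that mixes channels.

```python
def regular_conv1d_params(c_in, c_out, kernel_size):
    """Weight count of a standard 1D convolution (bias omitted)."""
    return c_in * c_out * kernel_size


def separable_conv1d_params(c_in, c_out, kernel_size):
    """Weight count of a time-channel separable 1D convolution (bias omitted)."""
    depthwise = c_in * kernel_size  # one temporal filter per input channel
    pointwise = c_in * c_out        # 1x1 convolution that mixes channels
    return depthwise + pointwise


def separable_conv1d(x, depthwise, pointwise):
    """Apply a depthwise-then-pointwise 1D convolution.

    x:         c_in channels, each a list of T samples
    depthwise: c_in filters, each a list of K weights
    pointwise: c_out rows, each a list of c_in weights
    Uses 'valid' padding, stride 1, and cross-correlation (no kernel flip),
    which is the usual convention in deep learning frameworks.
    """
    k = len(depthwise[0])
    t_out = len(x[0]) - k + 1
    # Depthwise step: filter every channel independently along time.
    filtered = [
        [sum(ch[t + j] * w[j] for j in range(k)) for t in range(t_out)]
        for ch, w in zip(x, depthwise)
    ]
    # Pointwise step: mix the channels at each time step.
    return [
        [sum(row[c] * filtered[c][t] for c in range(len(filtered)))
         for t in range(t_out)]
        for row in pointwise
    ]


# Illustrative sizes only (assumed, not taken from the session).
regular = regular_conv1d_params(64, 64, 13)
separable = separable_conv1d_params(64, 64, 13)
print(f"regular: {regular}, separable: {separable}")
```

With these assumed sizes the separable layer needs roughly a tenth of the weights of a standard convolution of the same shape, which is the kind of saving that makes a small-footprint model plausible.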