Voice User Interfaces (VUIs) have become popular thanks to their ease of use, which makes them accessible to the elderly and to people with disabilities. Nevertheless, their use in embedded systems for portable devices is limited by the computational complexity, memory requirements, and power consumption of keyword spotting (KWS) algorithms, which are usually based on deep neural networks. In this paper we propose a new KWS algorithm based on convolutional neural networks that offers a good trade-off among accuracy, power consumption, and memory footprint. We arrived at this solution by comparing different neural network architectures and selecting the one with the best trade-off among these metrics. To further improve performance, we implemented our solution on a dedicated hardware platform, the Myriad 2 by Movidius. Using this chip reduced inference time and energy per inference by 50%.
Keywords: keyword spotting, speech recognition, neural network, convolutional neural network, neural compute stick, low-power, memory footprint, machine learning