SpeechPipeline in iOS
If you’ve read any of our other documentation, you know that the speech pipeline is the main way you interact with Spokestack. This guide is here to explain in a little more detail how the iOS version of Spokestack uses this collection of components to recognize wakewords and user speech.
What is it?
As the name implies, SpeechPipeline is a collection of distinct modular components that work together to process user speech. All components are established at initialization time and can be broken down into two main areas of functionality: processing audio and communicating with your app.
Here’s a simple picture of how the components and callbacks work together:
Now let’s look at the SpeechPipeline initializer and talk about each component one at a time. For historical reasons, the initializer lists components in a different order than the one laid out in the previous diagram.
    @objc public init(
        _ speechService: SpeechProcessor,          // 1
        speechConfiguration: SpeechConfiguration,  // 2
        speechDelegate: SpeechEventListener,       // 3
        wakewordService: SpeechProcessor,          // 4
        pipelineDelegate: PipelineDelegate         // 5
    )
The speech service is responsible for performing automatic speech recognition (ASR) on arbitrary user audio. It’s the component that calls didRecognize on your SpeechEventListener (which we’ll talk about later). Currently, the only ASR that Spokestack provides is Apple’s built-in one, but speechService can be any class that adopts the SpeechProcessor protocol, so you’re free to incorporate any ASR provider you choose. Singleton instances of all SpeechProcessors provided by Spokestack are available via the SpeechProcessors enum; for the default Apple ASR, simply pass SpeechProcessors.appleSpeech to the initializer above.
The speech configuration is comprehensive enough to have its own guide, but in summary, this is where most of the fine-tuning for both wakeword and ASR happens. See the configuration guide or API reference for more details on each of these, but here are a few examples of the parameters you can change (by instantiating the class and setting the relevant property):
wakePhrases: If you’re using ASR-based wakeword detection, these properties let you change your app’s wakeword(s).
wakeActiveMax: The maximum amount of time (in milliseconds) that ASR will remain active to capture a single user utterance.
ModelName: Names for custom TensorFlow Lite wakeword models. Training custom models is outside the scope of this guide, but you can find a description of their requirements here.
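To make “instantiating the class and setting the relevant property” concrete, here is a minimal sketch. It uses a local stand-in class so that it compiles without the framework; the stand-in’s name (SketchSpeechConfiguration), the property types, and the default values are assumptions made for illustration, so check the API reference for the real declarations.

```swift
// Local stand-in for Spokestack's SpeechConfiguration so this sketch compiles
// on its own. The property names come from the list above; the types and
// default values are assumptions, not the framework's actual declarations.
final class SketchSpeechConfiguration {
    /// Wake phrase(s) for ASR-based wakeword detection (type assumed).
    var wakePhrases: String = "spokestack"
    /// Maximum time, in milliseconds, that ASR stays active for one utterance.
    var wakeActiveMax: Int = 5000
}

// Instantiate the configuration, then override only what you need.
let sketchConfiguration = SketchSpeechConfiguration()
sketchConfiguration.wakePhrases = "hey assistant"
sketchConfiguration.wakeActiveMax = 8000
```

With the real class, you would pass the configured instance as the speechConfiguration argument of the initializer shown earlier.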
The speech delegate receives most of the interesting system events from the pipeline, including notifications that the wakeword has been recognized (activate), ASR has completed or timed out (deactivate), and that ASR has recognized a user utterance (didRecognize). Your implementations of all these methods will be very important to how your app handles voice interactions, because the delegate acts as a kind of gatekeeper between speech events and pipeline operation.
If you’re using Spokestack’s wakeword feature without ASR, you don’t need to interact with the pipeline in these methods, but if you do want ASR, it’s important to call pipeline.activate() in your delegate’s activate implementation so the pipeline can activate the ASR component. You may also wish to display a “listening” indicator in your UI when you receive activate and remove it when deactivate is called.
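The gatekeeper role described above might look something like the following sketch. StubPipeline and ListeningDelegate are hypothetical stand-ins written so the example runs without the framework; only the method names activate, deactivate, and didRecognize come from this guide, and the real SpeechEventListener signatures (including the argument type of didRecognize) may differ.

```swift
// Stand-in for the part of SpeechPipeline this sketch needs. Only the method
// names activate() and deactivate() come from the guide; the rest is assumed.
final class StubPipeline {
    private(set) var isASRActive = false
    func activate() { isASRActive = true }
    func deactivate() { isASRActive = false }
}

// Stand-in speech delegate. The real SpeechEventListener protocol may declare
// more methods, and its exact signatures may differ from these.
final class ListeningDelegate {
    let pipeline: StubPipeline
    private(set) var showsListeningIndicator = false
    private(set) var lastTranscript: String?

    init(pipeline: StubPipeline) { self.pipeline = pipeline }

    // Wakeword recognized: hand off to ASR and show the indicator.
    func activate() {
        pipeline.activate()
        showsListeningIndicator = true
    }

    // ASR completed or timed out: release ASR and hide the indicator.
    func deactivate() {
        pipeline.deactivate()
        showsListeningIndicator = false
    }

    // ASR produced a result (parameter type assumed to be a plain String).
    func didRecognize(_ transcript: String) {
        lastTranscript = transcript
    }
}

let stubPipeline = StubPipeline()
let listeningDelegate = ListeningDelegate(pipeline: stubPipeline)
listeningDelegate.activate()  // simulates the wakeword being recognized
```

If you only use the wakeword feature, the two calls into the pipeline can be dropped and the delegate reduces to UI updates.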
The wakeword service is in charge of, you guessed it, recognizing that the user has said your app’s wakeword. On iOS, this defaults to using Apple’s built-in ASR to detect your chosen wakeword, but for better performance you might want to experiment with a customized TensorFlow Lite model. Spokestack comes with a set of models trained to detect “Spokestack” as a wakeword, but you’re also free to train your own and configure Spokestack to use them at runtime. You can find descriptions of the models’ requirements here, but if building and training them isn’t something you want to take on, send us an email, and we can discuss customization options.
The pipeline delegate receives system events from the pipeline itself, including notifications of both successful initialization and errors during setup. This is where any error handling will occur, and its start and stop notifications tell you when Spokestack is using the microphone. Earlier we mentioned that a UI “listening” indicator would go in the speech delegate’s activate method, and that’s true in many cases: often you’ll want to alert the user that you’re expecting a voice command from them (that the ASR component is active and interpreting their speech). If, however, you want to let the user know that the device’s microphone itself is active (which it naturally will be when waiting for a wakeword), you’ll want to condition that indicator on the pipeline delegate’s start and stop notifications instead.
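Putting the five arguments together, the wiring might look roughly like this. Every type below is a local stand-in (prefixed Sketch or Placeholder) that mirrors only the shape of the initializer shown earlier, so the snippet compiles on its own; the empty protocols, the weak delegate storage, and the idea of one object playing both delegate roles are assumptions for illustration, not statements about the framework.

```swift
// Empty stand-ins for the framework's protocols and configuration class.
// The real declarations have requirements that are omitted here.
protocol SketchSpeechProcessor {}
protocol SketchSpeechEventListener: AnyObject {}
protocol SketchPipelineDelegate: AnyObject {}
final class SketchConfiguration {}

// Stand-in pipeline whose initializer mirrors the argument order of the
// SpeechPipeline initializer shown earlier in this guide.
final class SketchSpeechPipeline {
    let speechService: SketchSpeechProcessor
    let speechConfiguration: SketchConfiguration
    let wakewordService: SketchSpeechProcessor
    // Delegates are held weakly here, a common pattern (assumed, not confirmed).
    weak var speechDelegate: SketchSpeechEventListener?
    weak var pipelineDelegate: SketchPipelineDelegate?

    init(_ speechService: SketchSpeechProcessor,
         speechConfiguration: SketchConfiguration,
         speechDelegate: SketchSpeechEventListener,
         wakewordService: SketchSpeechProcessor,
         pipelineDelegate: SketchPipelineDelegate) {
        self.speechService = speechService
        self.speechConfiguration = speechConfiguration
        self.speechDelegate = speechDelegate
        self.wakewordService = wakewordService
        self.pipelineDelegate = pipelineDelegate
    }
}

// Placeholder processors; in a real app these would be Spokestack-provided
// or custom SpeechProcessor implementations.
struct PlaceholderASR: SketchSpeechProcessor {}
struct PlaceholderWakeword: SketchSpeechProcessor {}

// One hypothetical object that plays both delegate roles.
final class VoiceController: SketchSpeechEventListener, SketchPipelineDelegate {}

let voiceController = VoiceController()
let sketchPipeline = SketchSpeechPipeline(
    PlaceholderASR(),
    speechConfiguration: SketchConfiguration(),
    speechDelegate: voiceController,
    wakewordService: PlaceholderWakeword(),
    pipelineDelegate: voiceController
)
```

Using a single object for both delegates keeps speech events and pipeline events in one place; splitting them is just as valid if your UI and error handling live in different layers.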
OK, so that covers pipeline construction. What about the other methods available?
This acts as a health check for the pipeline, letting you know whether its delegates are properly set so that speech events can be processed appropriately.
SpeechPipeline itself, as a manager of shared resources, is a prime candidate for the singleton pattern. Its delegates, however, might change as the user’s context in the app itself changes. Thus, the delegates can be swapped out at will if desired.
As mentioned before, the activate and deactivate methods control the pipeline’s ASR component. Call activate in your speech delegate’s activate implementation if ASR is desired; you can also call it manually, for example to enable “tap to talk” instead of (or in addition to) using a wakeword. Similarly, call deactivate when your speech delegate’s deactivate implementation is called, or when the user manually cancels ASR via a button you’ve provided for such a purpose.
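A “tap to talk” button with a manual cancel could be sketched as follows. TapToTalkPipeline and TapToTalkController are hypothetical stand-ins so the example runs by itself; the isASRActive flag and the button titles are illustration devices, not part of the Spokestack API.

```swift
// Stand-in exposing only the two pipeline methods discussed above.
final class TapToTalkPipeline {
    private(set) var isASRActive = false
    func activate() { isASRActive = true }
    func deactivate() { isASRActive = false }
}

// Hypothetical controller: the first tap starts ASR, a second tap cancels it.
final class TapToTalkController {
    private let pipeline: TapToTalkPipeline

    init(pipeline: TapToTalkPipeline) { self.pipeline = pipeline }

    /// Call from the button's action handler.
    /// Returns the title the button should show after the tap.
    func micButtonTapped() -> String {
        if pipeline.isASRActive {
            pipeline.deactivate()   // user cancelled ASR manually
            return "Tap to talk"
        } else {
            pipeline.activate()     // skip the wakeword and go straight to ASR
            return "Cancel"
        }
    }
}

let tapPipeline = TapToTalkPipeline()
let tapController = TapToTalkController(pipeline: tapPipeline)
```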
The start and stop methods control the pipeline as a whole. When stopped, the pipeline consumes fewer resources, but no speech recognition, either wakeword or ASR, can happen. Call start to begin listening for a wakeword, and follow it immediately with a call to activate if you want to jump straight to ASR (this is most likely to be useful in an app that doesn’t use the wakeword feature at all). Call stop to deactivate the microphone.
Once created, a single instance of SpeechPipeline can be stopped and restarted at will; just remember that Spokestack cannot use the device’s microphone when it is stopped.
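One way to picture how start, activate, deactivate, and stop relate is as a small state machine. The sketch below is a model of the behavior described in this guide, not the framework’s implementation: LifecyclePipeline, its State enum, and the usesMicrophone property are hypothetical names, and treating activate as a no-op while stopped is a modeling assumption.

```swift
// Stand-in that models the lifecycle described above. The state enum is an
// illustration device; the real SpeechPipeline may track state differently.
final class LifecyclePipeline {
    enum State { case stopped, awaitingWakeword, recognizingSpeech }
    private(set) var state: State = .stopped

    /// True whenever the pipeline holds the microphone.
    var usesMicrophone: Bool { state != .stopped }

    func start() {
        if state == .stopped { state = .awaitingWakeword }
    }

    func activate() {
        // Modeled assumption: ASR cannot run while the pipeline is stopped.
        if state == .awaitingWakeword { state = .recognizingSpeech }
    }

    func deactivate() {
        if state == .recognizingSpeech { state = .awaitingWakeword }
    }

    func stop() { state = .stopped }
}

let lifecycle = LifecyclePipeline()
lifecycle.start()     // microphone on, waiting for the wakeword
lifecycle.activate()  // jump straight to ASR without a wakeword
```

The same instance can go through start and stop repeatedly, which matches the note above about stopping and restarting a single pipeline.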