Getting Started

This guide will get you up and running with Spokestack for Android, and you’ll be hearing and talking to your users in no time.

One caveat before we start, though: This is not a collection of best practices. We’re going to be trading thoughtful organization for convenience here, so when we say something like “put this in your main activity”, just know that you might not want to leave it there long-term. OK, now that that’s out of the way, let’s jump in.

Installation

First, you’ll need to declare the Spokestack dependencies in your project. Because Spokestack includes native libraries, this is slightly more involved than a normal dependency. You’ll need to add the following to your app’s top-level build.gradle:

// inside the buildscript block
dependencies {
    // (other dependencies)
    classpath 'com.nabilhachicha:android-native-dependencies:0.1.2'
}

and this to your module’s build.gradle:

// before the android block:
apply plugin: 'android-native-dependencies'

// in the android block:
compileOptions {
    sourceCompatibility JavaVersion.VERSION_1_8
    targetCompatibility JavaVersion.VERSION_1_8
}

// in the dependencies block
dependencies {
    // (other dependencies)
    implementation 'io.spokestack:spokestack-android:4.0.0'

    // if you plan to use Google ASR, also include these
    implementation 'com.google.cloud:google-cloud-speech:1.22.2'
    implementation 'io.grpc:grpc-okhttp:1.25.0'

    // for TensorFlow Lite-powered wakeword detection, add this one too
    implementation 'org.tensorflow:tensorflow-lite:1.14.0'
}

// a new top-level block if you don't already have native dependencies
native_dependencies {
    artifact 'io.spokestack:spokestack-android:4.0.0'
}

Integration

To enable voice control, your app needs three things:

  1. the proper system permissions
  2. an instance of Spokestack’s SpeechPipeline
  3. a place to receive speech events from the pipeline

1. Permissions

To accept voice input, you need at least the RECORD_AUDIO permission, and to perform speech recognition and TTS, you’ll need network access, so add these lines to the manifest element of your app’s AndroidManifest.xml:

<uses-permission android:name="android.permission.INTERNET"/>
<uses-permission android:name="android.permission.RECORD_AUDIO"/>

Starting with Android 6.0, the RECORD_AUDIO permission requires you to request it from the user at runtime; see the Android developer documentation for more information on how to do this.
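
If you haven’t done a runtime permission request before, here’s a rough sketch using the AndroidX ActivityCompat and ContextCompat helpers (you’ll need the android.Manifest, android.content.pm.PackageManager, androidx.core.app.ActivityCompat, and androidx.core.content.ContextCompat imports). The helper names and request code are ours, not part of Spokestack; put them in the same activity that will own the speech pipeline.

// inside your activity; the request code is arbitrary
private val audioPermissionRequest = 1

// true if the user has already granted RECORD_AUDIO
private fun hasRecordAudioPermission(): Boolean =
    ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO) ==
        PackageManager.PERMISSION_GRANTED

// asks for RECORD_AUDIO; the result arrives in onRequestPermissionsResult()
private fun requestRecordAudioPermission() {
    ActivityCompat.requestPermissions(
        this, arrayOf(Manifest.permission.RECORD_AUDIO), audioPermissionRequest)
}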

Note that sending audio over the network can use a considerable amount of data, so you may also want to look into WiFi-related permissions and allow the user to disable voice control when using cellular data.
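
There’s no single right way to handle that, but one sketch is to check whether the active network is metered before enabling cloud-backed voice features. This assumes you’re inside an activity and have imported android.content.Context and android.net.ConnectivityManager; userAllowsCellularVoice stands in for a preference your app would manage.

// a rough sketch: decide whether voice features should be enabled on the current network
private fun voiceAllowedOnThisNetwork(userAllowsCellularVoice: Boolean): Boolean {
    val connectivity =
        getSystemService(Context.CONNECTIVITY_SERVICE) as ConnectivityManager
    // metered networks are typically cellular connections or capped WiFi
    return !connectivity.isActiveNetworkMetered || userAllowsCellularVoice
}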

Also note that the Android emulator cannot record audio. You’ll need to test the voice input parts of your app on a real device.

2. SpeechPipeline

With the proper permissions in place, it’s time to decide where you’d like to receive and process speech input. In a single-activity app, the easiest place for this is your main activity. Import io.spokestack.spokestack.SpeechPipeline at the top of the file and add a SpeechPipeline member:

private var pipeline: SpeechPipeline? = null

You’ll probably want to build the pipeline when the activity is created. Remember that you’ll need to have the RECORD_AUDIO permission for this, so make sure you check that permission before trying to start a pipeline.

pipeline = SpeechPipeline.Builder()
    .setInputClass("io.spokestack.spokestack.android.MicrophoneInput")
    .addStageClass("io.spokestack.spokestack.webrtc.AutomaticGainControl")
    .addStageClass("io.spokestack.spokestack.webrtc.VoiceActivityDetector")
    .addStageClass("io.spokestack.spokestack.webrtc.VoiceActivityTrigger")
    .addStageClass("io.spokestack.spokestack.android.AndroidSpeechRecognizer")
    .setAndroidContext(applicationContext)
    .addOnSpeechEventListener(this)
    .build()

There are many options for configuring the speech pipeline. This particular setup will begin capturing audio when pipeline.start() is called and use a Voice Activity Detection (VAD) component to send any audio determined to be speech through on-device ASR using Android’s SpeechRecognizer API. In other words, the app is always actively listening, and no wakeword detection is performed. See the configuration guide for more information about pipeline building options.
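
To make that concrete, here’s one way to kick things off once the permission check passes. hasRecordAudioPermission() and requestRecordAudioPermission() are the helpers sketched in the permissions section above, not Spokestack APIs.

// start listening only once RECORD_AUDIO has been granted
if (hasRecordAudioPermission()) {
    pipeline?.start()
} else {
    requestRecordAudioPermission()
}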

Note also the addOnSpeechEventListener(this) line. This is necessary to receive speech events from the pipeline, which is our next step.

3. OnSpeechEventListener

We’ve declared that the class housing the speech pipeline will also receive its events, so scroll back to the top and make sure it implements the OnSpeechEventListener interface.

class MyActivity : AppCompatActivity(), OnSpeechEventListener {

    // ...

    override fun onEvent(event: SpeechContext.Event?, context: SpeechContext?) {
        when (event) {
            SpeechContext.Event.ACTIVATE -> println("ACTIVATED")
            SpeechContext.Event.DEACTIVATE -> println("DEACTIVATED")
            SpeechContext.Event.RECOGNIZE -> context?.let { handleSpeech(it.transcript) }
            SpeechContext.Event.TIMEOUT -> println("TIMEOUT")
            SpeechContext.Event.ERROR -> context?.let { println("ERROR: ${it.error}") }
            else -> {
                // do nothing
            }
        }
    }

    private fun handleSpeech(transcript: String) {
        // do something with the text
    }
}

We’ve listed all possible speech events here; see the documentation for a description of what each event means. Briefly, though, ACTIVATE and DEACTIVATE reflect the state of ASR—if you want to show any special UI components while your app is actively listening to the user, these events would be useful for showing/hiding them.

If the event is RECOGNIZE, context.transcript will give you the raw text of what the user just said. Translating that raw text into an action in your app is the job of an NLU, or natural language understanding, component. Spokestack currently leaves the choice of NLU up to the app: There’s a variety of NLU services out there (DialogFlow, LUIS, or wit.ai, to name a few), or, if your app is simple enough, you can make your own with string matching or regular expressions.

We know that NLU is an important piece of the puzzle, and we’re working on a full-featured NLU component for Spokestack based on years of research and lessons learned from working with the other services; we’ll update this space when it’s ready.

For the sake of our demo, though, let’s say you’re creating a voice-controlled timer. handleSpeech might look something like this:

private fun handleSpeech(transcript: String) {
    when {
        Regex("(?i)start").matches(transcript) -> {
            // start the timer and change the UI accordingly
        }
        Regex("(?i)stop").matches(transcript) -> {
            // stop the timer and change the UI accordingly
        }
        Regex("(?i)reset|start over").matches(transcript) -> {
            // reset the timer and change the UI accordingly
        }
    }
}

It’s important to note that the speech pipeline runs on a background thread, so any UI changes related to speech events should be wrapped in a runOnUiThread { } block.
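
For example, if you show a “listening” indicator while ASR is active, a small helper like the one below could be called from the ACTIVATE and DEACTIVATE branches of onEvent. Here listeningIndicator is a hypothetical View in your layout, not something Spokestack provides.

// hops back to the main thread before touching the UI
private fun showListening(listening: Boolean) {
    runOnUiThread {
        listeningIndicator.visibility = if (listening) View.VISIBLE else View.GONE
    }
}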

Talking back to your users

If you want full hands- and eyes-free interaction, you’ll want to deliver responses via voice as well. This requires a text-to-speech (TTS) component, and Spokestack has one of these too!

See the TTS guide for detailed information about configuration options, but the most basic usage of the TTS subsystem looks like this:

val tts = TTSManager.Builder()
    .setTTSServiceClass("io.spokestack.spokestack.tts.SpokestackTTSService")
    .setOutputClass("io.spokestack.spokestack.tts.SpokestackTTSOutput")
    .setProperty("spokestack-id", "f0bc990c-e9db-4a0c-a2b1-6a6395a3d97e")
    .setProperty("spokestack-secret",
                 "5BD5483F573D691A15CFA493C1782F451D4BD666E39A9E7B2EBE287E6A72C6B6")
    .setAndroidContext(applicationContext)
    .setLifecycle(lifecycle)
    .build()

// ...

val request = SynthesisRequest.Builder("hello world").build()
tts.synthesize(request)

The API credentials in this example set you up to use the demo voice available for free with Spokestack; for more configuration options and details about controlling pronunciation, see the TTS concept guide.

Conclusion

That’s all there is to it! Your app is now configured to accept and respond to voice commands. Obviously there’s more we could tell you, and you can have much more control over the speech recognition process (including, but not limited to, configuring the pipeline’s sensitivity and adding your own custom wakeword models). If you’re interested in these advanced topics, check out our other guides. We’ll be adding to them as Spokestack grows.

Thanks for reading!