Porting the Alexa Minecraft Skill to Python Using Spokestack
This is a tutorial on how to port a simple Minecraft recipe skill to Spokestack using the spokestack-python library. It is similar to our mobile tutorial series, but the Python version does not have any GUI components, which makes the experience closer to that of a smart speaker. We will briefly discuss the concepts behind each part of the user interaction; for a full description, check out our documentation. Before we get into the programming, we will need API keys from our Spokestack account.
Signing Up for a Spokestack Account
- Create a Spokestack account.
- Click “Add token” in the API Credentials dashboard.
- Copy the secret key when it is displayed; you’ll need it later.
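You’ll use the identity key and secret key in a couple of places below. One way to keep them out of your source code is to read them from environment variables; a minimal sketch (the variable names here are our own choice, not a Spokestack convention):
import os

# SPOKESTACK_ID and SPOKESTACK_SECRET are names we chose for this tutorial;
# export them in your shell before running the app.
SPOKESTACK_ID = os.environ["SPOKESTACK_ID"]
SPOKESTACK_SECRET = os.environ["SPOKESTACK_SECRET"]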
Setting up the Project
First, let’s clone the project repository.
git clone https://github.com/spokestack/minecraft-skill-python
cd minecraft-skill-python
Next, install the system dependencies.
macOS
brew install lame portaudio
Debian/Ubuntu
sudo apt-get install portaudio19-dev libmp3lame-dev
Now let’s set up the Python virtual environment. We use pyenv and pyenv-virtualenv to manage virtual environments, but any virtual environment tool will work.
pyenv install 3.7.6
pyenv virtualenv 3.7.6 minecraft
pyenv local minecraft
Then install the Python dependencies.
pip install -r requirements.txt
TFLite Runtime
In addition to the Python dependencies, you will need to install the TFLite Interpreter. You can install it for your platform by following the instructions at TFLite Interpreter. Note: this is not the full TensorFlow package.
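Once it’s installed, a one-line import is enough to confirm the runtime is available; note that the import path differs from full TensorFlow:
# Sanity check: the standalone TFLite runtime exposes the interpreter directly.
from tflite_runtime.interpreter import Interpreter

print("TFLite runtime is available")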
If you would like to try out the final version of the Minecraft app, you can now run it with:
python app.py
To follow along with the tutorial, we recommend making a new Python file named myapp.py (or similar) so you can compare it to the original app.py.
Using the Speech Pipeline
An essential piece to any voice interface is the ability to detect when the user is speaking, then convert the spoken phrase into a text transcript. Spokestack has an easy-to-use speech pipeline that will handle this for us. The speech pipeline consists of three major components: a voice detection module, a wake word trigger, and a speech recognizer.
Microphone Input
Accepting audio input is always the first step in the pipeline. For this demo, we will use the included input class that leverages PyAudio to stream microphone input to the pipeline. The class is initialized like this:
from spokestack.io.pyaudio import PyaudioMicrophoneInput
mic = PyaudioMicrophoneInput()
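If the pipeline seems deaf later on, it’s worth confirming that PyAudio can see an input device at all; a quick standalone check:
import pyaudio

# Print the input device PyAudio will use by default.
audio = pyaudio.PyAudio()
info = audio.get_default_input_device_info()
print(f"Default input: {info['name']} ({int(info['defaultSampleRate'])} Hz)")
audio.terminate()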
Voice Activity Detection
The second component we are adding to the pipeline is the VoiceActivityDetector. This module analyzes a single frame of audio to determine whether speech is present, and it is what allows audio to flow through the rest of the pipeline. For simplicity, we will use the default voice activity detection settings. The voice activity component can be initialized with the following:
from spokestack.vad.webrtc import VoiceActivityDetector
vad = VoiceActivityDetector()
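As the module path suggests, this component is built on WebRTC’s voice activity detector. To get a feel for what a single frame-level decision looks like, here is a standalone sketch using the webrtcvad package directly (not part of the Spokestack pipeline itself):
import webrtcvad

# WebRTC VAD classifies 10, 20, or 30 ms frames of 16-bit mono PCM audio.
vad = webrtcvad.Vad(1)  # aggressiveness from 0 (permissive) to 3 (strict)
sample_rate = 16000
frame = b"\x00\x00" * int(sample_rate * 0.02)  # 20 ms of silence
print(vad.is_speech(frame, sample_rate))  # False: silence is not speech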
Now that we have a way to determine if the audio contains speech, let’s move on to the component that activates the pipeline when it hears a specific phrase.
Wake Word Activation
The wake word component of the pipeline looks for a specific phrase in the audio input and signals the pipeline to activate ASR when it is recognized. For our purposes, we will be using “Spokestack” as the wake word. As with most voice assistants, “Hey Spokestack” will work as well. The process to initialize this component mirrors the way we set up voice activity detection. The directory passed to model_dir should contain three .tflite files: encode.tflite, detect.tflite, and filter.tflite. These can be found inside the tflite directory of the project GitHub repository.
from spokestack.wakeword.tflite import WakewordTrigger
wakeword = WakewordTrigger(model_dir="tflite")
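If the trigger fails to initialize, a quick check that all three models are where it expects them can save some head-scratching:
from pathlib import Path

# The wake word trigger needs all three models in the directory passed above.
for name in ["filter.tflite", "encode.tflite", "detect.tflite"]:
    assert (Path("tflite") / name).exists(), f"missing {name} in tflite/"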
Once the skill is actively listening for user speech, all we have to do is transcribe what the user says.
Automatic Speech Recognition (ASR)
ASR is the most critical piece of the speech pipeline, because it produces the transcript that is used to turn speech into actions. However, critical components do not have to be difficult to add. The following initializes the ASR component.
Note: This is where you will need your API keys from the account console.
from spokestack.asr.speech_recognizer import CloudSpeechRecognizer
recognizer = CloudSpeechRecognizer(
    spokestack_id="your_spokestack_key",
    spokestack_secret="your_secret_key",
)
Activation Timeout (Optional)
An issue you may run into is the ASR deactivating too quickly or staying active too long. To configure this for your use case, you can add the ActivationTimeout component to the pipeline with minimum and maximum values in milliseconds. This component can be initialized with the following:
from spokestack.activation_timeout import ActivationTimeout
timeout = ActivationTimeout(min_active=100, max_active=5000)
Speech Pipeline
Now, we can put it all together in the pipeline. After this step, you will be able to wake the assistant by saying “Spokestack” and produce a text transcript of what is said next. For the Minecraft skill, you would say something like, “Spokestack, what is the recipe for a snow golem?”
from spokestack.pipeline import SpeechPipeline
# without timeout
pipeline = SpeechPipeline(input_source=mic, stages=[vad, wakeword, recognizer])
# with timeout
pipeline = SpeechPipeline(
    input_source=mic, stages=[vad, wakeword, recognizer, timeout]
)
Events
We know that the goal of the pipeline is to produce a transcript of the user’s speech. However, we haven’t discussed how to access that transcript. The pipeline is designed to run continuously, but we can use event handlers to access the transcript without stopping the pipeline. For this tutorial, we want to pass the completed transcript to a module that helps us understand what the user has said. To accomplish this, we register an event handler with the pipeline:
@pipeline.event
def on_speech(context):
    transcript = context.transcript
    print(transcript)
In the application, we don’t want to print the transcript, but we’ve added that so you can see the results if you’ve been running the code as you follow along. In the subsequent sections, we will discuss a couple of new components and flesh out this event handler so the Minecraft skill can understand the user’s request and select an appropriate response.
Natural Language Understanding (NLU)
The Natural Language Understanding, or NLU, component takes a transcript of user speech and distills it into unambiguous instructions for an app. The paradigm used in most systems is the intent and slot model. Essentially, an intent is the function the user intends to invoke, and the slots are the arguments the intent needs to accomplish its action. For example, a user may say “What is the recipe for dark prismarine?” In this case, the intent is RecipeIntent, and the slot value is dark prismarine. The initialization of the TFLiteNLU should look familiar at this point. The directory passed to model_dir contains three files: vocab.txt, metadata.json, and nlu.tflite. These files are necessary to run our on-device NLU model and are in the tflite directory of the GitHub repository.
from spokestack.nlu.tflite import TFLiteNLU
nlu = TFLiteNLU(model_dir="tflite")
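Based on how the dialogue manager below consumes it, the result exposes the classified intent and a mapping of slots. For the example utterance above, it looks roughly like this (the printed values are illustrative):
# Illustrative only: the approximate shape of an NLU result.
results = nlu("what is the recipe for dark prismarine")
print(results.intent)  # e.g. "RecipeIntent"
print(results.slots)   # e.g. {"Item": {"name": "Item", "raw_value": "dark prismarine"}}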
Now is a good time to add the NLU to our on_speech event handler:
@pipeline.event
def on_speech(context):
    transcript = context.transcript
    results = nlu(transcript)
Now that we know what recipe the user is looking for, you may be wondering how we turn this into a response. The following section will explain just that.
Dialogue Management
The Minecraft dialogue manager is fairly simple. The basic component necessary is a way to look up the recipes, and the rest is just string interpolation. Since we have a relatively limited number of recipes, we can implement the lookup as a simple dictionary in Python. Below is a snippet of the recipe “database”.
from typing import Dict

DB: Dict[str, str] = {
    "snow golem": "A snow golem can be created by placing a pumpkin on top of two "
    "snow blocks on the ground.",
    "pillar quartz block": "A pillar of quartz can be obtained by placing a block of "
    "quartz on top of a block of quartz in Minecraft.",
}
This makes looking up a recipe very concise: DB.get("snow golem"). There can be an issue with using the dictionary lookup alone, though. Let’s say that, due to an ASR error, the parsed slot isn’t a full match for snow golem but is instead something like sow golem. A simple dictionary lookup will not be able to resolve that slot. However, there is a method we can add to deal with small errors like this: fuzzy matching. Based on the similarity between snow golem and sow golem, we can make sure the latter resolves to the actual entity. In this tutorial, we will use the Python library fuzzywuzzy to make these matches. Below is the way it is used in the tutorial repository.
from fuzzywuzzy import process

matched, score = process.extractOne(slot["raw_value"], self._names)
if score > self._threshold:
    recipe = self._recipes.get(matched)
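For a sense of what extractOne returns, here is a quick standalone check; fuzzywuzzy similarity scores are integers from 0 to 100:
from fuzzywuzzy import process

names = ["snow golem", "pillar quartz block"]
# extractOne returns the closest choice along with its similarity score.
matched, score = process.extractOne("sow golem", names)
print(matched, score)  # "snow golem" with a high score, e.g. 95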
We are simply overwriting the parsed entity with the one that is the closest match from the set of possible entities. The full dialogue manager can be seen below:
from fuzzywuzzy import process  # type: ignore

from minecraft import recipes
from minecraft.responses import Response


class DialogueManager:
    """Simple dialogue manager

    Args:
        threshold (int): fuzzy match threshold; fuzzywuzzy scores range from 0 to 100
    """

    def __init__(self, threshold=50):
        self._recipes = recipes.DB
        self._names = list(self._recipes.keys())
        self._threshold = threshold
        self._response = Response

    def __call__(self, results):
        """Maps an NLU result to a dialogue response.

        Args:
            results (Result): classification results from the NLU

        Returns: a string response to be synthesized by TTS
        """
        intent = results.intent
        if intent == "RecipeIntent":
            return self._recipe(results)
        elif intent == "AMAZON.HelpIntent":
            return self._help()
        elif intent == "AMAZON.StopIntent":
            return self._stop()
        else:
            return self._error()

    def _recipe(self, results):
        slots = results.slots
        if slots:
            for key in slots:
                slot = slots[key]
                if slot["name"] == "Item":
                    return self._fuzzy_lookup(slot["raw_value"])
            return self._not_found(slot["raw_value"])
        else:
            return self._response.RECIPE_NOT_FOUND_WITHOUT_ITEM_NAME.value

    def _help(self):
        return self._response.HELP_MESSAGE.value

    def _stop(self):
        return self._response.STOP.value

    def _error(self):
        return self._response.ERROR.value

    def _fuzzy_lookup(self, raw_value):
        # extractOne returns the closest entity name and a similarity score (0-100).
        matched, score = process.extractOne(raw_value, self._names)
        if score > self._threshold:
            recipe = self._recipes.get(matched)
            return recipe
        return raw_value

    def _not_found(self, raw_value):
        return self._response.RECIPE_NOT_FOUND_WITH_ITEM_NAME.value.format(raw_value)
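The Response enum comes from the repository’s minecraft.responses module. As a rough, hypothetical sketch of its shape (the actual message text in the repo differs), it shows why the manager reads .value and why _not_found calls str.format:
from enum import Enum

# Hypothetical sketch of the Response enum used above; the repository's actual
# message text differs. RECIPE_NOT_FOUND_WITH_ITEM_NAME carries a {} placeholder
# that _not_found fills with the raw item name.
class Response(Enum):
    HELP_MESSAGE = "Try asking me how to craft something, like a snow golem."
    STOP = "Goodbye."
    ERROR = "Sorry, I didn't catch that. Try asking for a recipe."
    RECIPE_NOT_FOUND_WITH_ITEM_NAME = "I couldn't find a recipe for {}."
    RECIPE_NOT_FOUND_WITHOUT_ITEM_NAME = "Which item would you like the recipe for?"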
Now we can add the dialogue manager to the event handler.
@pipeline.event
def on_speech(context):
    transcript = context.transcript
    results = nlu(transcript)
    response = dialogue_manager(results)
OK, that was a lot to cover, but we are almost to the finish line. In the next section, we will learn how to convert the app’s text responses into speech.
Text to Speech (TTS)
Much like the name suggests, TTS translates written text into its spoken form with a synthetic voice. This tutorial assumes you are using our default voice, but if you have a paid plan you can replace demo-male with the name of a custom voice. To initialize the TTSClient, you simply do the following:
Note: This is another part where you will need your Spokestack API keys. However, notice that the URL for TTS is slightly different than for ASR.
from spokestack.tts.clients.spokestack import TextToSpeechClient
client = TextToSpeechClient("your_key", "your_secret_key")
Another important aspect of this section is playback. We have a PyAudio-based output class that will play through your system’s default playback device. As a convenient way to manage speech synthesis and playback, we have the TTSManager. Look below to see how to initialize that with an output source.
from spokestack.io.pyaudio import PyAudioOutput
from spokestack.tts.manager import TextToSpeechManager
output = PyAudioOutput()
manager = TextToSpeechManager(client, output)
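Before wiring it into the event handler, you can give the manager a quick smoke test with a fixed phrase, using the same call signature as below:
# Synthesize a fixed phrase and play it through the default output device.
manager.synthesize("Welcome to the Minecraft skill", "text", "demo-male")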
Now we can update our event handler to read the response.
@pipeline.event
def on_speech(context):
    transcript = context.transcript
    results = nlu(transcript)
    response = dialogue_manager(results)
    manager.synthesize(response, "text", "demo-male")
Let’s Run it!
We now have a fully working example except for two final commands: we have to start and run the pipeline!
pipeline.start()
pipeline.run()
Then you can start the skill by running this command in the terminal. The skill will remain running until you say “stop”.
python myapp.py
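For reference, here is a sketch of how the complete myapp.py fits together. Everything is assembled from the pieces above; reading the API keys from environment variables and the DialogueManager import path are our own conventions, so check the repository’s app.py for the canonical version.
import os

from spokestack.activation_timeout import ActivationTimeout
from spokestack.asr.speech_recognizer import CloudSpeechRecognizer
from spokestack.io.pyaudio import PyaudioMicrophoneInput, PyAudioOutput
from spokestack.nlu.tflite import TFLiteNLU
from spokestack.pipeline import SpeechPipeline
from spokestack.tts.clients.spokestack import TextToSpeechClient
from spokestack.tts.manager import TextToSpeechManager
from spokestack.vad.webrtc import VoiceActivityDetector
from spokestack.wakeword.tflite import WakewordTrigger

from minecraft.dialogue_manager import DialogueManager  # import path assumed

# Assemble the speech pipeline: mic -> VAD -> wake word -> ASR -> timeout.
mic = PyaudioMicrophoneInput()
pipeline = SpeechPipeline(
    input_source=mic,
    stages=[
        VoiceActivityDetector(),
        WakewordTrigger(model_dir="tflite"),
        CloudSpeechRecognizer(
            spokestack_id=os.environ["SPOKESTACK_ID"],
            spokestack_secret=os.environ["SPOKESTACK_SECRET"],
        ),
        ActivationTimeout(min_active=100, max_active=5000),
    ],
)

# NLU, dialogue management, and TTS with playback.
nlu = TFLiteNLU(model_dir="tflite")
dialogue_manager = DialogueManager()
client = TextToSpeechClient(
    os.environ["SPOKESTACK_ID"], os.environ["SPOKESTACK_SECRET"]
)
manager = TextToSpeechManager(client, PyAudioOutput())

@pipeline.event
def on_speech(context):
    # Transcript -> NLU result -> response text -> synthesized speech.
    transcript = context.transcript
    results = nlu(transcript)
    response = dialogue_manager(results)
    manager.synthesize(response, "text", "demo-male")

pipeline.start()
pipeline.run()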
Now you should be able to ask your skill how to craft things in Minecraft. If you get a chance, I recommend trying it out in-game; despite its simplicity, it is actually very useful.
At this point you may have already cloned the repository, but if you have not, check out the full example on GitHub.
Contact Us
If you have any questions while getting this set up we have a forum, or you can open an issue on GitHub. In addition, I am more than happy to help if you want to reach out to me personally via email or Twitter.
Originally posted October 05, 2020