Lab 06a

The questions below are due on Thursday March 22, 2018; 09:55:00 PM.



Goals:

  • First, we'll explore the audio signal from our microphone as it exists on the ESP32.
  • Second, we'll look at how to store that audio signal, before introducing two encodings, base64 and mu-law, which are used in different parts of the transcription process for our speech-to-text system.
  • Third, we'll examine the difference in run time between code that isn't optimized for speed and code that is, in the context of the mu-law encoder.
  • Fourth, we'll integrate with Google Speech.
  • Finally, we'll use the returned transcriptions to control our ESP32, make it display different things, and build up some other features.

1) Getting Started

The code for today's lab can be found HERE. There are a number of files in it. First is one new library:

  • base64: A library that handles base64 encoding. Place it in your Documents/libraries folder and restart Arduino.

There are also a number of starter scripts we'll be using:

  • audio_displayer.ino: A script that we'll use first to explore capturing audio using analog measurements.
  • mulaw_tester.ino: A script that allows us to do a speed test with the mu-law encoding function we'll write.
  • speech_to_text.ino: A script that allows us to interface with the Google Speech API.

2) Looking at Audio

Grab a microphone from the front of the class. This is the last standard part we'll be integrating into the system. (Exciting, isn't it?)

Our microphone

Connect your microphone as shown in the schematic below (of course please keep everything else you've added previously in place).

Connect your microphone as shown. Please refer to the ESP32 documentation for pin numbering!

3) Visualizing Audio

The ESP32 is capable of running quite fast. With minimal background processes, the loop function can run at 1 MHz, meaning it can sample data more than fast enough to record audio. Audio usually needs to be sampled at at least a few kHz before it starts to be useful. We'll eventually sample at 8 kHz when interfacing with Google Speech.

Just as we've regulated loop speed previously with the millis() function, for timing events that need even faster rates we can use the micros() function, which is very similar except that it increments at 1 MHz (rather than 1 kHz). In audio_displayer.ino, the main body of the code (loop()) has the following within it:

void loop() {
  while (!digitalRead(button_pin)){  // while the button is held down...
    uint16_t raw_reading = analogRead(A0);  // 12-bit reading: 0 to 4095
    uint8_t scaled_reading_for_oled = 0;  // incomplete and wrong (you'll fix this below)
    float scaled_reading_for_serial = raw_reading*3.3/4096;  // reading in volts
    serial_counter++;
    oled_counter++;
    if (serial_counter%serial_update_period==0) Serial.println(scaled_reading_for_serial);
    if (oled_counter%oled_update_period==0){
      oled.clearBuffer();
      oled.drawBox(50, 63-scaled_reading_for_oled, 30,scaled_reading_for_oled);  // draw a bar of that height
      oled.sendBuffer();
    }
    while (micros()>timer && micros()-timer < sample_period); // wait out the sample period (prevents rollover glitch)
    timer = micros();
  }
}

In this code, if the button is held down, a reading is taken and stored in a data type that is new to us: a uint16_t, a two-byte unsigned integer (rather than the four bytes we usually get with a regular int). We can use a uint16_t to hold this analog reading since our analog read resolution is 12 bits, and 12 is less than the 16 bits this data type gives us to play with. Remember that the analog reading is expressed as a 12-bit number representing voltage from 0 to 3.3 V in a linear mapping, as discussed in Lecture 05.

What is the voltage per increment (bin) of our analogRead?

This simple script then converts the analog reading into two different forms for display. The first (which is incomplete and wrong, and which you must fix) is for display on our OLED. The second is for display on our serial plotter, rendered in volts (open the Serial Plotter under Tools to see this data). For the first conversion, we'd like to express our reading using the full scale of the OLED: if the voltage from the microphone is 0 V, the rectangle should have a height of 0, and if the signal is maxed out at 3.3 V, the rectangle should be at its full height. Keeping in mind the range of possible values an analog reading can produce, what expression do we need to put in place so that scaled_reading_for_oled uses the full range of the OLED height (and no more or less), as shown in the video below:

Enter a C++ expression (ignore the semicolon) to set the value of scaled_reading_for_oled such that the box displayed on the oled takes advantage of the full height of the screen.

Try Now:

Using the serial plotter and/or serial monitor, study the nature of the analog signal. What voltage is present when there is little sound? What about when there is a lot? Does the signal seem to float around a particular value?

It'd be nice to start storing this audio data, and that's what we'll do in the next section: add a way to store that recorded audio for later use.

3.1) Google Speech

Open up speech_to_text.ino from this week's code distribution. The overall flow diagram for what we're trying to accomplish is shown below:

Flowchart of System Design.

We will be using one of our buttons to trigger recording (when it gets pushed, we'll start to record three seconds of audio). We record some audio, process it, and then store it in memory. Then we check to see if we're finished recording. If not, we record/process/store more audio data. Once we are finished, the system sends the audio to the Google Speech API, awaits the response, and finally processes it.

3.2) Memory

Let's first spend some time thinking about memory. Our ESP32 has ~500 kB (kilobytes, where a byte is 8 bits) of RAM, which is a lot compared to an Arduino Uno (2 kB) but a lot less than your laptop (4+ GB). Since we're recording audio messages, we need to think about where to store that audio data, and we need to make sure we don't try to store too much. Memory constraints are one of the fun challenges of working with microcontrollers.

First, we can't even use all of that 500 kB of RAM for our audio, since other program variables (global and local) need to be stored there, as well as any libraries we bring in. To be safe, let's assume our memory budget for our audio message is 60 kB (60 kilobytes).1

If we have 60 kB, how long of a message can we store? It depends on two things: 1) how frequently we sample the signal, e.g., the sample rate, and 2) how much resolution we use when sampling, e.g., the bit depth or sample resolution.

Let's first think about getting great audio. Humans can hear audio frequencies up to about 20 kHz, and we could show mathematically that to capture these frequencies we need to sample at more than twice that maximum frequency, i.e., at over 40 kHz. This is why CDs (remember what a CD is?) use audio sampled at 44 kHz2.

Let's also assume that we sample audio at our ADC's resolution. After all, we want the highest fidelity audio! Our ADC can sample at 12 bits. So in one second we will use:

12 bits/sample * 1 Byte/8 bits * 40 ksamples/sec * 1 sec = 60 kBytes
This is quite a bit and means we can only save one second of audio if we want to stick to our 60kB budget.

What to do? Well, speech is usually recorded at a much lower sampling rate, because human speech is dominated by low frequencies. Modern telephone systems use a sample rate of 8 kHz, which is good enough to transmit audio frequencies up to about 4 kHz3.

How long (in sec) of a message can we store in 60 kB of memory at an 8 kHz sample rate and 12-bit depth?

OK, that's a bit better. Engineers worry about this issue enough that there is a specific term for \text{sampleRate}\times\text{bitDepth}: the bit rate, which in our case is 96 kbps (kilobits per second). You'll see different bit rates used for different audio purposes.
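As a quick check of that number:

8000 samples/sec * 12 bits/sample = 96,000 bits/sec = 96 kbps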

3.3) base64 encoding

On top of the sample-rate issues above, we also need to think about our ultimate goal here, which is to send our audio to Google for processing, and they have certain requirements. This data is fundamentally a bunch of 1's and 0's, but over HTTP we are used to sending things as strings (like our numbersapi or Wikipedia GET requests). Luckily, there is a widely used encoding called base64 encoding that takes a set of 6 bits and translates it into an ASCII character. This will turn our audio data into a long string that can be decoded by Google, as long as we both agree on the protocol ahead of time.

Let's dig into base64 a bit. Let's look at three analog measurements we acquired from some random sensor, currently sitting in memory as two-byte signed integers: 28650, 16890, 8955. The memory map might look like this:

Raw Data in Memory.

In terms of binary, it looks like this:

The same data, shown in binary.

What base64 encoding does is turn each set of 6 binary digits into one of 2^6 = 64 different ASCII symbols. To do this, the 6 bytes of data in memory are split into eight 6-bit groups, like so:

The data split into eight 6-bit groups.

Then each 6-bit group is translated into an ASCII symbol using an index table. You can find a base64 index table on Wikipedia.

Based on that index table, what is the base64 encoding of our sample above? Assume that the left-most group will end up as the left-most character of the base64 string.

Base64 encoding is very commonly used to send data around and so there are, naturally, libraries in both Python and C to do the work for us.
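For instance, with the base64 library from this lab's distribution, encoding the six bytes above looks roughly like the sketch below (a hedged example: we're assuming the header name base64.h, and the exact characters printed depend on how the ESP32 lays those integers out in memory). The base64_encode(output, input, length) call is the same one we'll use later in this lab.

#include <base64.h>  // the base64 library from this lab's distribution

void encodeExample() {
  int16_t readings[3] = {28650, 16890, 8955};  // the three example measurements
  char encoded[9];  // 6 input bytes become 8 base64 characters, plus a null terminator
  base64_encode(encoded, (char *) readings, 6);
  Serial.println(encoded);  // the 8-character base64 string
}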

But before we can do that, we run into two issues. First, base64 encoding takes 6 bytes of audio data and turns them into 8 bytes of ASCII data, so we need to redo our calculation of how long a message we can store.

How long (in sec) of a message can we store in 60 kB of memory at an 8 kHz sample rate and 12-bit depth if the data is stored as base64 ASCII?

That's a pretty big hit, but we need to do this since the Google Speech API wants our data encoded that way (an unfortunate side effect of transferring what is essentially binary data over a system originally intended to convey textual information). The second issue is another stipulation of Google's: if we send up raw (unmodified) data like we are collecting, it needs to be in 16-bit form. So even though we only have 12 bits of data from our measurements (thanks to our ADC), we'll need to package each reading into 16 bits (padding it with four leading 0's) and then do the base64 encoding. This unfortunately increases the size of our readings and decreases how much we can store, with little benefit other than compatibility with Google's API.

How long (in sec) of a message can we store in 60 kB of memory at an 8 kHz sample rate, stored in 16 bits, if the data is then converted to base64 ASCII?

It isn't wise to record all that audio data into memory and then process it into base64, since we'd need to hold two large arrays at once: one for the raw audio and one for the encoded result. So we need to be smart about how we do this. What we'll do instead is, as we record samples, store them in a small local array, and every time we have 3 samples in that local array (comprising 6 bytes of memory, since the readings are held in 2-byte signed integers), convert them to 8 bytes of base64 and store those 8 bytes in our big array.

This piece of code, which is part of a larger while loop, does that:

value = analogRead(AUDIO_IN);
raw_samples[sample_num%3] = value - 1241;  // see the Try Now below for the 1241
sample_num++;
if (sample_num%3 == 0) {  // every 3 samples (6 bytes), encode
  base64_encode(enc_samples, (char *) raw_samples, 6);
  for (int i = 0; i < 8; i++) {
    speech_data[enc_index+i] = enc_samples[i];  // copy the 8 base64 chars into the big array
  }
  enc_index += 8;
}

sample_num is a local variable that increments each time we record a sample and store it into the local array raw_samples. When we get to 3 raw samples, sample_num%3==0 will be true, and we will use base64_encode to turn the six bytes of data in the raw_samples array into base64 and store it in enc_samples. We then use a for loop to move the enc_samples data into speech_data, the larger array holding the base64 data.

Try Now:

Why are we subtracting 1241 off of each sample? Where does this come from?

Checkoff 1:
Explain your memory calculations and base64 encoding to a staff member and the predicament we are in.

3.4) mulaw encoding

Is there a way to reduce our memory footprint so that we can fit more audio into our 60 kB? Reducing the sampling rate or bit depth even further would be one way. Unfortunately, we run into two issues with the system we're working with. One is that the Google Speech API doesn't allow sampling rates lower than 8 kHz, nor does it allow raw audio samples at less than 16 bits/sample (and on top of this it requires the data to be sent as base64). Even if we used a different speech transcription service, we'd still have the issue that lowering the sampling rate or bit depth degrades the audio, making it difficult to analyze.

Google does allow us alternative ways to reduce the space required for a single measurement, using some industry-standard forms of compression, although this usually comes at the cost of decreased quality (and loss of information).

We can mitigate the loss from data compression by realizing that we are interested in speech, not just any generic signal (not video, or music, etc.), and can therefore use the specific qualities of the speech signal to help us out. In particular, human hearing tends to respond linearly to logarithmic changes in intensity; this is why we express sound intensity in decibels, a logarithmic scale. To take advantage of this, we'll take our 16-bit sample (which, again, comes from our 12-bit reading with 4 leading 0's in place) and decrease it to 8 bits using what's known as \mu-law encoding (or mulaw, or ulaw). In \mu-law encoding, each input value is assigned an output value based on the input-output characteristic shown below. Google's Speech API will accept mu-law encoded data, so this is a way to greatly decrease how much data we need to send up!

Mu-Law Encoding.

We see from the plot that input signals over a very large range are converted into output signals over a smaller range, letting us fit 16-bit numbers (with 65,536 values) into an 8-bit number (with 256 bins). \mu-law companding (or encoding) is used commercially in several modern telecom standards, including on your mobile phone.

Mathematically the algorithm is implemented by taking an input value x and converting it to an output value z via:

y = \frac{x}{32768}

followed by:

z = \text{sign}(y) \cdot \frac{\ln\left(1+ \mu \left|y\right| \right)}{\ln\left(1+\mu\right)}

where \mu = 255 (for 8-bit output), and \text{sign}(y) is the sign function.

First we normalize the input sample x so it lies between -1 and 1. Then we put it through the \mu-law equation; the output of this equation also lies between -1 and 1. In practice we'd like to turn this into an integer lying between -128 and 127, which is the range of an int8_t, aka a signed 8-bit integer value.

Note that we are doing a decent amount of math on each sample: we are dividing by two different constants and taking a natural logarithm. We need to perform this computation in real time, faster than we sample, so that we don't accumulate unprocessed data. Our sample rate is 8 kHz, so we have 1/(8000\text{ Hz}) = 125 microseconds to do this computation. Let's see how long it actually takes.

First, create a function below called mulaw_encode that:

  • Takes in an int16_t integer, aka a signed 16-bit integer that ranges from -32768 to 32767
  • Computes the mulaw-encoded value according to the equation above, and
  • Returns the mulaw-encoded value as an int8_t integer that ranges from -128 to 127.

For testing, note that:

  • mulaw_encode(-32768) = -128
  • mulaw_encode(32767) = 127
  • mulaw_encode(0) = 0
  • mulaw_encode(13110) = 106 (or very close to it; see the worked arithmetic below)
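As a sanity check on that last value, here's the arithmetic for x = 13110:

y = \frac{13110}{32768} \approx 0.400

z = \frac{\ln\left(1 + 255 \cdot 0.400\right)}{\ln\left(1+255\right)} \approx \frac{4.63}{5.55} \approx 0.836

Mapping z \approx 0.836 onto the signed 8-bit output range then lands at (or very near) 106.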

Be careful when doing math on these values: cast when needed, and use floats or doubles when doing non-integer math. We've included the C math.h library, so you can use abs(), log(), etc. Note that the C abs() function only works correctly with integers, but we're including a version that works with floats as well in this checker. A list of functions in this library is here.

int8_t mulaw_encode(int16_t sample) {
  int mu = 255;
  // YOUR CODE HERE
}

Now, let's profile your solution, meaning figure out how long (and how many clock cycles) it takes. To do this, we've created a skeleton, mu_law_profiler.ino, that allows you to test the code. Let's take a look at the key part:

// Fill the input array with "random" data by sampling audio
for (int i=0; i<1000; i++) value[i] = analogRead(AUDIO_IN);

// Run mulaw_encode 1000 times, timing the whole batch
long start = micros();
for (int i=0; i<1000; i++) {
  out[i] = mulaw_encode(value[i]);
}
long duration = micros() - start; // Stop

Serial.println("processing time: " + String(duration));

// Print out the sum, so the compiler doesn't optimize the benchmark away
for (int i=0; i<1000; i++) {
  sum += out[i];
}
Serial.println(sum);

delay(1000);

When profiling a piece of code, we want to know how long that piece of code takes and only that piece of code. So we set up our profiler to do as little work as possible in-between setting our timer and running our code.

In the code above, we first fill up an array value with "random" data; a simple way to do that is to sample audio. Then we use a timer and record the starting time in start. Then we run our code, but we do it 1000 times. Why? So that we get enough resolution on the timing. If the code takes 0.8 microseconds to run and we run it once, we'll measure a duration of 1 microsecond. But if we run it 1000 times, we'll measure a duration of 800 microseconds, and can get the 0.8 microsecond running time by dividing by 1000. The for loop adds a couple of extra instructions, but that should be negligible compared to our main code.

In addition, we also sum all the encoded values outside of the code we are profiling. Why? Sometimes the compiler is too smart and realizes that certain results are never used later in the code, so it just optimizes a bunch of operations away. By printing the total sum of the encoded values, we ensure that the optimizer doesn't remove the operations we want to measure.

Insert your mulaw_encode algorithm into mu_law_profiler.ino and run it on your ESP32.

Try Now:

Based on the observed results, how long does each run of mulaw_encode take?

Try Now:

Knowing that the ESP32 clock runs at 240 MHz, how many clock cycles does that duration correspond to?

Try Now:

Does that duration allow us to do on-the-fly mulaw encoding with audio sampling at 8kHz?

As you can see, running this small piece of code actually takes some time and clock cycles. That's because of the divisions and floating-point math involved. Interestingly, our algorithm is potentially fast enough to be used for mu-law encoding of audio, but we're going to put it aside right now, since it won't work with the industry-standard mu-law encoding. Instead, commercial uses of \mu-law encoding rely on a much simpler and faster algorithm and protocol optimized for use on computers. One implementation, shown below, is from the G.711 telephony standard. It is a slightly different mu-law implementation than the one we had you write above, working on 14-bit signed data and producing an 8-bit signed value. Details of this implementation can be found HERE for those interested.

int8_t mulaw_encode_fast(int16_t sample){
   const uint16_t MULAW_MAX = 0x1FFF;  // 13-bit magnitude ceiling
   const uint16_t MULAW_BIAS = 33;
   uint16_t mask = 0x1000;
   uint8_t sign = 0;
   uint8_t position = 12;
   uint8_t lsb = 0;
   if (sample < 0)
   {
      sample = -sample;  // work with the magnitude...
      sign = 0x80;       // ...and remember the sign in the top bit
   }
   sample += MULAW_BIAS;
   if (sample > MULAW_MAX)
   {
      sample = MULAW_MAX;  // clip to the maximum magnitude
   }
   // Scan down from bit 12 to bit 5 to find the highest set bit (a rough log2)
   for (; ((sample & mask) != mask && position >= 5); mask >>= 1, position--)
        ;
   lsb = (sample >> (position - 4)) & 0x0f;  // keep the 4 bits below that position
   return (~(sign | ((position - 5) << 4) | lsb));
}

This is the type of mulaw encoding that is often used commercially. You can see that there are no logarithms or divisions; instead, there are lots of ">>", "&", etc. This algorithm takes advantage of the fact that each bit in a byte is a factor of 2 larger or smaller than the one next to it, which is a log_2 scaling. By shifting bits we can do a rough approximation of logarithms, as well as division by factors of 2 (there's a tiny illustration of this after the questions below). Change your profiler to use this function instead of the one you wrote. How long does it take?

How long (in nanoseconds) does this improved algorithm take for each encoding?

How many clock cycles does this duration correspond to?
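As promised above, here's the bit-shifting idea in isolation (a tiny sketch; the value 336 is just a hypothetical example):

uint16_t x = 336;              // 0b101010000 in binary; highest set bit is bit 8
int pos = 0;
while (x >> (pos + 1)) pos++;  // shift right until everything above bit pos is gone
// pos is now 8, since 2^8 <= 336 < 2^9. That's floor(log2(336)) computed
// with shifts alone, no calls to log() required.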

Pretty nice speed-up, eh? Creative algorithms that take advantage of the inherent nature of our digital environment are a very exciting area of research. Furthermore, Google expects this algorithm and its base-two-logarithmic treatment to be used, so using our own mu-law algorithm from earlier will not work in production. Therefore, we'll stick with this one going forward.

To review: by using \mu-law encoding, we get a 2x reduction in required storage from our previous low point, allowing us to record more audio with the numbers we wanted to use. Not bad.

4) WiFi library

We'll be using a more secure WiFi library with our ESP32 in order to interface with the Google Speech API. Google only allows access to its resources over HTTPS (HTTP Secure), which our standard WiFi client library doesn't support. To fix that, we use an HTTPS-compatible client (we'll discuss this more in Weeks 7/8).
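As a rough sketch of what using such a client looks like (hedged: the hostname here is included just for illustration, and the starter code handles the real connection for you):

#include <WiFiClientSecure.h>  // HTTPS-capable client in the ESP32 Arduino core

WiFiClientSecure client;  // takes the place of our usual WiFiClient

void connectExample() {
  // 443 is the standard HTTPS port
  if (client.connect("speech.googleapis.com", 443)) {
    Serial.println("Connected over HTTPS!");
  } else {
    Serial.println("Connection failed.");
  }
}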

We have this:

const String PREFIX = "{\"config\":{\"encoding\":\"MULAW\",\"sampleRate\":8000},\"audio\": {\"content\":\"";

We are declaring and initializing a String called PREFIX that starts creating the JSON. The first part is the config section, where you can see that we specify MULAW encoding at a sample rate of 8000 Hz. Then we start the content section. The base64-encoded data will be appended to this String, and the entire thing placed into the array speech_data. Note that in the POST portion of the code we need to send the data up in pieces: the client print infrastructure lacks a sufficiently large buffer to handle the amount of data we need, so we send it up in 3000-byte chunks, with little pauses in between, until we're done (sketched below).
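Here's roughly what that chunking looks like (hedged: names like client, speech_data, and data_length are assumptions here; see speech_to_text.ino for the real version):

const int CHUNK_SIZE = 3000;
for (int offset = 0; offset < data_length; offset += CHUNK_SIZE) {
  // never hand the client more than CHUNK_SIZE bytes at once
  int len = min(CHUNK_SIZE, data_length - offset);
  client.write((const uint8_t *) (speech_data + offset), len);
  delay(20);  // a brief pause so the WiFi stack can drain (exact value assumed)
}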

Unlike numbersapi.com or the Open Trivia API, which were fine with allowing limited access without a key, Google is closer to the Open Weather API in that it requires an API key for its service. More unfortunately, the Google Speech functionality is not free, so we are providing you an API key, which we are paying for, to use with your device for 6.08 this semester. Do not abuse it. We expect you to use this API key only for this class and not provide it to anyone else.

5) Let's transcribe

OK, now that we know what's going on, let's transcribe. Compile/upload speech_to_text.ino (with your mu-law function in place) to the ESP32. Then open the serial monitor.

For initial testing, we've set SAMPLE_DURATION to 3 seconds. Think of a short phrase to say, then press the push-button switch. When the ESP32's OLED prompts you, say your phrase.

The Google Speech API then transcribes that audio message. Transcription takes about as long as the message itself (so 3 seconds in this case), and then it sends back its results.

5.1) Decoding the results

Check out the Serial Monitor and find where the data from the Google Speech API comes back. It comes back as, yes, a JSON object. It should look something like the following:4

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "What has gone wrong?",
          "confidence": 0.71339726
        }
      ]
    }
  ]
}

You can see what Google thought the transcript was, and its confidence (from 0 to 1) in the transcription. It is rarely 1, but it can get pretty close.

We want to pull the transcript out of the returned JSON object. If this were Python, we'd convert it into a dictionary object and then search through it using indexing and keys. Here in C++ we can't do that (at least, not without adding a JSON library), so instead we pull out the part of the string holding the transcript. We do that in the code around the return in the function; a sketch of the idea is shown below.
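A sketch of the idea (hedged: the helper name getTranscript is made up for illustration, and the starter code's version may differ in its details):

// Pull the transcript value out of the raw JSON response by finding
// the "transcript" key and grabbing the quoted string after it.
String getTranscript(const String &response) {
  int key = response.indexOf("\"transcript\":");  // locate the key
  if (key == -1) return "";                       // no transcript came back
  int start = response.indexOf('"', key + 13);    // opening quote of the value
  int end = response.indexOf('"', start + 1);     // closing quote
  return response.substring(start + 1, end);
}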

Once your system is working, feel free to increase SAMPLE_DURATION up to 5 seconds to transcribe longer messages, and modify the system so that it uses either a profanity filter or some guidance feature like "hints."

The Google Speech API documentation CAN BE FOUND HERE. Add one additional feature to your system (profanity filter, hint words, etc.).

Checkoff 2:
Show your working transcriber to a staff member, with one additional feature to modify/aid in the transcribing.

6) Limiting What We Send

For an initial modification, we'd like to edit the way we collect data so that we only collect audio when we are actively holding down the button (up to a limit of maybe 5.5 seconds total). Study the code as it currently exists and make the appropriate changes.

Checkoff 3:
Show your working limited-recording system to a staff member.

7) Drawing on Command

Now that we have a working transcription system, we should be able to use spoken commands to control what our ESP32 does. Edit the code so that if your transcription includes a certain word or phrase, it will cause a certain action.

In particular, we'd like you to implement the following three behaviors:

  • transcript includes "draw" (or a similar word) and "rectangle" ==> draw a rectangle on the screen
  • transcript includes "draw" (or a similar word) and "circle" ==> draw a circle on the screen
  • a third behavior of your choosing that has some complexity to it

Feel free to adjust the stringency of your recognition to improve performance. For example, checking for the complete string "draw a rectangle" is more stringent than looking for the words "draw" and "rectangle", which is in turn more stringent than looking for "draw" and "rect", and so on. Use your judgement! You may also want to look into the Google Speech API docs so that you can provide "hints" which will help it lock onto words you're expecting. Finally, watch the capitalization of the phrases as they come back. The Arduino String class (DOCS HERE) has some member functions that might aid in getting around such complications; a small example follows.
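For example, one hedged way to do a case-insensitive keyword check (assuming a String named transcript holds the returned text):

String t = transcript;  // work on a copy, since toLowerCase() modifies in place
t.toLowerCase();
if (t.indexOf("draw") != -1 && t.indexOf("rect") != -1) {
  // draw a rectangle on the OLED here
}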

Checkoff 4:
Show your working Drawing Device to a staff member and discuss how you've just outsourced graphic design to AI.

 
Footnotes

1 There are also a few other limitations we want to put in place that we don't need to go into here, involving limitations of the String class as well as our desire to limit the amount of data you use with this API key. (click to return to text)

2 The exact sample rate of CDs, 44.1 kHz, has a lot more history behind it than we need to go into here. (click to return to text)

3 A little less, since nothing is ideal, but that's OK. (click to return to text)

4 What famous first message did I try to say here and this is Google Speech's interpretation? (click to return to text)