{"id":13427016,"url":"https://github.com/atomic14/diy-alexa","last_synced_at":"2025-03-15T22:31:36.057Z","repository":{"id":41897014,"uuid":"291019886","full_name":"atomic14/diy-alexa","owner":"atomic14","description":"DIY Alexa","archived":false,"fork":false,"pushed_at":"2023-12-07T16:44:50.000Z","size":7350,"stargazers_count":523,"open_issues_count":15,"forks_count":183,"subscribers_count":20,"default_branch":"master","last_synced_at":"2024-10-28T06:01:43.071Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/atomic14.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null},"funding":{"github":["atomic14"],"ko_fi":"atomic14"}},"created_at":"2020-08-28T10:36:23.000Z","updated_at":"2024-10-27T04:01:42.000Z","dependencies_parsed_at":"2023-01-20T02:18:29.435Z","dependency_job_id":null,"html_url":"https://github.com/atomic14/diy-alexa","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atomic14%2Fdiy-alexa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atomic14%2Fdiy-alexa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atomic14%2Fdiy-alexa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atomic14%2Fdiy-alexa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/atomic14","download_url":"https://codeload.github.com/atomic14/diy-alexa/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.
com","kind":"github","repositories_count":243801601,"owners_count":20350105,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T00:01:50.993Z","updated_at":"2025-03-15T22:31:31.027Z","avatar_url":"https://github.com/atomic14.png","language":"Jupyter Notebook","funding_links":["https://github.com/sponsors/atomic14","https://ko-fi.com/atomic14","https://ko-fi.com/Z8Z734F5Y"],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"[![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/Z8Z734F5Y)\n# DIY Alexa With the ESP32 and Wit.AI\n\nAll the source code for this tutorial is in [GitHub](https://github.com/atomic14/diy-alexa)\n\n## Introduction\n\nThis tutorial will guide you through the process of creating your own DIY Alexa using the ESP32 and Wit.ai.\n\nThere's a full video tutorial to accompany this available here:\n\n[![Demo Video](https://img.youtube.com/vi/re-dSV_a0tM/0.jpg)](https://www.youtube.com/watch?v=re-dSV_a0tM)\n\nFirst off, let's define what an Alexa is? What are we going to build?\n\nThe first thing we're going to need is some kind of \"wake word detection system\". 
This will continuously listen to audio, waiting for a trigger phrase or word.\n\nWhen it hears this word it will wake up the rest of the system and start recording audio to capture whatever instructions the user has.\n\nOnce the audio has been captured it will send it off to a server to be recognised.\n\nThe server processes the audio and works out what the user is asking for.\n\n![An Alexa System](https://blog.cmgresearch.com/assets/marvin/alexa.png)\n\nIn some systems the server may process the user request, calling out to other services to execute the user's wishes. In the system we are going to build we'll just be using the server to work out what the user's intention was and then our ESP32 will execute the command.\n\nWe'll need to build three components:\n\n- Wake word detection\n- Audio capture and Intent Recognition\n- Intent Execution\n\nWe'll wire these together to build our complete system.\n\n---\n\n## Getting Started\n\nWe're going to be using some hardware for our project - most of these components can be readily sourced from Amazon, eBay and Adafruit. You may also have local stockists in your own country who can supply the components.\n\nWe will need:\n\n### An ESP32 dev kit\n\nThese are readily available from a number of suppliers, including [Adafruit](https://www.adafruit.com/product/4693).\n\n![ESP32 Dev Kit](https://blog.cmgresearch.com/assets/marvin/esp32.jpg)\n\nA good environment for developing for the ESP32 is [Platform.io](https://platformio.org/) and [Visual Studio Code](https://code.visualstudio.com/).\n\n### A microphone break out board\n\nI recommend using an I2S MEMS microphone board. These are very low noise microphones that can be connected directly to the ESP32 using a digital interface and require only a few wires. 
A good choice is either the INMP441 microphone (available from Amazon or eBay) or the ICS-43434 (available from [Tindie](https://www.tindie.com/products/21519/)).\n\n![MEMS Microphone Board](https://blog.cmgresearch.com/assets/marvin/mems.jpg)\n\n### A Speaker\n\nTo get our Alexa to talk to us we'll need an amplifier and a speaker. For the amplifier I recommend an I2S breakout board such as this one from [Adafruit](https://www.adafruit.com/product/3006). This will drive any 4Ω or 8Ω speaker.\n\n![Amplifier Board](https://blog.cmgresearch.com/assets/marvin/amp.jpg)\n\n### Python 3+\n\nFor the machine learning part of this project you'll need Python 3+ installed. To check what you have available, try running:\n\n```\npython --version\n```\n\nor\n\n```\npython3 --version\n```\n\nIf you need to install Python 3 please follow the instructions [here](https://www.python.org/).\n\n---\n\n## Wake Word Detection\n\nLet's start off with the wake word detection. We need to create something that will tell us when a \"wake\" word is heard by the system. This will need to run on our embedded device - an ideal option for this is to use TensorFlow and TensorFlow Lite.\n\n### Training Data\n\nOur first port of call is to find some data to train a model against. We can use the [Speech Commands Dataset](https://www.tensorflow.org/datasets/catalog/speech_commands). This dataset contains over 100,000 audio files consisting of a set of 20 core command words such as \"Up\", \"Down\", \"Yes\", \"No\" and a set of extra words. 
Each of the samples is 1 second long.\n\nOne of these words in particular looks like a good candidate for a wake word - I've chosen to use the word \"Marvin\" for my wake word as a tribute to the android from The Hitchhiker's Guide to the Galaxy.\n\nHere are a couple of samples of the word \"Marvin\":\n\n[Marvin1](https://blog.cmgresearch.com/assets/marvin/marvin1.wav)\n|[Marvin2](https://blog.cmgresearch.com/assets/marvin/marvin2.wav)\n\nAnd here are a few of the other random words from the dataset:\n\n[Forward](https://blog.cmgresearch.com/assets/marvin/forward.wav)\n|[Left](https://blog.cmgresearch.com/assets/marvin/left.wav)\n|[Right](https://blog.cmgresearch.com/assets/marvin/right.wav)\n\nTo augment the dataset you can also record ambient background noise. I recorded several hours of household noises and TV shows to provide a large amount of random data.\n\n### Features\n\nWith our training data in place we need to think about what features we are going to train our neural network against. It's unlikely that feeding a raw audio waveform into our neural network will give us good results.\n\n![Audio Waveform](https://blog.cmgresearch.com/assets/marvin/waveform.jpg)\n\nA popular approach for word recognition is to translate the problem into one of image recognition.\n\nWe need to turn our audio samples into something that looks like an image - to do this we can take a spectrogram.\n\nTo get a spectrogram of an audio sample we break the sample into small sections and then perform a discrete Fourier transform on each section. This will give us the frequencies that are present in that slice of audio.\n\nPutting these frequency slices together gives us the spectrogram of the sample.\n\n![Spectrogram](https://blog.cmgresearch.com/assets/marvin/spectrogram.webp)\n\nIn the `model` folder you'll find several Jupyter notebooks. 
Follow the setup instructions in the `README.md` to configure your local environment.\n\nThe notebook `Generate Training Data.ipynb` contains the code required to extract our features from our audio data.\n\nThe following function can be used to generate a spectrogram from an audio sample:\n\n```python\ndef get_spectrogram(audio):\n    # normalise the audio\n    audio = audio - np.mean(audio)\n    audio = audio / np.max(np.abs(audio))\n    # create the spectrogram\n    spectrogram = audio_ops.audio_spectrogram(audio,\n                                              window_size=320,\n                                              stride=160,\n                                              magnitude_squared=True).numpy()\n    # reduce the number of frequency bins in our spectrogram to a more sensible level\n    spectrogram = tf.nn.pool(\n        input=tf.expand_dims(spectrogram, -1),\n        window_shape=[1, 6],\n        strides=[1, 6],\n        pooling_type='AVG',\n        padding='SAME')\n    spectrogram = tf.squeeze(spectrogram, axis=0)\n    spectrogram = np.log10(spectrogram + 1e-6)\n    return spectrogram\n```\n\nThis function first normalises the audio sample to remove any variance in volume in our samples. It then computes the spectrogram - there is quite a lot of data in the spectrogram so we reduce this by applying average pooling.\n\nWe finally take the log of the spectrogram so that we don't feed extreme values into our neural network which might make it harder to train.\n\nBefore generating the spectrogram we add some random noise and variance to our sample. 
We randomly shift the audio sample within the 1-second segment - this makes sure that our neural network generalises around the audio position.\n\n```python\n# randomly reposition the audio in the sample\nvoice_start, voice_end = get_voice_position(audio, NOISE_FLOOR)\nend_gap=len(audio) - voice_end\nrandom_offset = np.random.uniform(0, voice_start+end_gap)\naudio = np.roll(audio,-random_offset+end_gap)\n```\n\nWe also add in a random sample of background noise. This helps our neural network work out the unique features of our target word and ignore background noise.\n\n```python\n# get the background noise files\nbackground_files = get_files('_background_noise_')\nbackground_file = np.random.choice(background_files)\nbackground_tensor = tfio.audio.AudioIOTensor(background_file)\nbackground_start = np.random.randint(0, len(background_tensor) - 16000)\n# normalise the background noise\nbackground = tf.cast(background_tensor[background_start:background_start+16000], tf.float32)\nbackground = background - np.mean(background)\nbackground = background / np.max(np.abs(background))\n# mix the audio with the scaled background\naudio = audio + background_volume * background\n```\n\nTo make sure we have a balanced dataset we add more samples of the word \"Marvin\" to our dataset. 
This also helps our neural network generalise as there will be multiple samples of the word with different background noises and in different positions in the 1-second sample.\n\n```python\n# process all the words and all the files\nfor word in tqdm(words, desc=\"Processing words\"):\n    if '_' not in word:\n        # add more examples of marvin to balance our training set\n        repeat = 70 if word == 'marvin' else 1\n        process_word(word, repeat=repeat)\n```\n\nWe then add in samples from our background noise: we run through each background noise file, chop it into 1-second samples, compute the spectrogram, and add these to our negative examples.\n\nWith all of this data we end up with a reasonably sized training, validation and testing dataset.\n\n![Marvin Spectrograms](https://blog.cmgresearch.com/assets/marvin/marvin.jpg)\n\nHere are some example spectrograms of the word \"Marvin\", and here are some examples of the word \"yes\".\n\n![Yes Spectrograms](https://blog.cmgresearch.com/assets/marvin/yes.jpg)\n\nThat's our training data prepared. Let's have a look at how we train our model.\n\n### Model Training\n\nIn the `model` folder you'll find another Jupyter notebook `Train Model.ipynb`. 
This takes the training, test and validation data that we generated in the previous step.\n\nFor our system we only really care about detecting the word Marvin so we'll modify our Y labels so that it is a 1 for Marvin and 0 for everything else.\n\n```python\nY_train = [1 if y == words.index('marvin') else 0 for y in Y_train_cats]\nY_validate = [1 if y == words.index('marvin') else 0 for y in Y_validate_cats]\nY_test = [1 if y == words.index('marvin') else 0 for y in Y_test_cats]\n```\n\nWe feed this raw data into TensorFlow datasets - we set up our training data to repeat forever, shuffle randomly, and come out in batches.\n\n```python\n# create the datasets for training\nbatch_size = 30\n\ntrain_dataset = Dataset.from_tensor_slices(\n    (X_train, Y_train)\n).repeat(\n    count=-1\n).shuffle(\n    len(X_train)\n).batch(\n    batch_size\n)\n\nvalidation_dataset = Dataset.from_tensor_slices((X_validate, Y_validate)).batch(X_validate.shape[0])\n\ntest_dataset = Dataset.from_tensor_slices((X_test, Y_test)).batch(len(X_test))\n```\n\nI've played around with a few different model architectures and ended up with this as a trade-off between time to train, accuracy and model size.\n\nWe have a convolution layer, followed by a max-pooling layer, followed by another convolution layer and max-pooling layer. 
The result of this is fed into a densely connected layer and finally to our output neuron.\n\n```python\nmodel = Sequential([\n    Conv2D(4, 3,\n           padding='same',\n           activation='relu',\n           kernel_regularizer=regularizers.l2(0.001),\n           name='conv_layer1',\n           input_shape=(IMG_WIDTH, IMG_HEIGHT, 1)),\n    MaxPooling2D(name='max_pooling1', pool_size=(2,2)),\n    Conv2D(4, 3,\n           padding='same',\n           activation='relu',\n           kernel_regularizer=regularizers.l2(0.001),\n           name='conv_layer2'),\n    MaxPooling2D(name='max_pooling2', pool_size=(2,2)),\n    Flatten(),\n    Dropout(0.2),\n    Dense(\n        40,\n        activation='relu',\n        kernel_regularizer=regularizers.l2(0.001),\n        name='hidden_layer1'\n    ),\n    Dense(\n        1,\n        activation='sigmoid',\n        kernel_regularizer=regularizers.l2(0.001),\n        name='output'\n    )\n])\nmodel.summary()\n```\n\nWhen I train this model against the data I get the following accuracy:\n\n| Dataset            | Accuracy |\n| ------------------ | -------- |\n| Training Dataset   | 0.9683   |\n| Validation Dataset | 0.9567   |\n| Test Dataset       | 0.9562   |\n\nThese are pretty good results for such a simple model.\n\nIf we look at the confusion matrix using the high threshold (0.9) for the true class we see that we have very few examples of background noise being classified as a \"Marvin\" and quite a few \"Marvin\"s being classified as background noise.\n\n|        | Predicted Noise | Predicted Marvin |\n| ------ | --------------- | ---------------- |\n| Noise  | 13980           | 63               |\n| Marvin | 1616            | 11054            |\n\nThis is ideal for our use case as we don't want the device waking up randomly.\n\n### Converting the model to TensorFlow Lite\n\nWith our model trained we now need to convert it for use in TensorFlow Lite. 
This conversion process takes our full model and turns it into a much more compact version that can be run efficiently on our micro-controller.\n\nIn the `model` folder there is another notebook `Convert Trained Model To TFLite.ipynb`.\n\nThis notebook passes our trained model through the `TFLiteConverter` along with examples of input data. Providing the sample input data lets the converter quantise our model accurately.\n\nOnce the model has been converted we can run a command-line tool to generate C code that we can compile into our project.\n\n```\nxxd -i converted_model.tflite \u003e model_data.cc\n```\n\n---\n\n## Intent Recognition\n\nWith our wake word detection model complete we now need to move onto something that can understand what the user is asking us to do.\n\nFor this, we will use the [Wit.ai](https://wit.ai) service from Facebook. This service will \"Turn What Your Users Say Into Actions\".\n\n![Wit.ai Landing Page](https://blog.cmgresearch.com/assets/marvin/witai.png)\n\nThe first thing we'll do is create a new application. You just need to give the application a name and you're all set.\n\n![Wit.ai Create Application](https://blog.cmgresearch.com/assets/marvin/create_app.png)\n\nWith our application created we need to train it to recognise what our users will say. There are three main building blocks of a Wit.ai application:\n\n- Intents\n- Entities\n- Traits\n\nWe'll give our application sample phrases and train it to recognise what intent it should map the phrase onto.\n\nFor our project we want to be able to turn devices on and off. Some sample phrases that we can use to train Wit.ai are:\n\n    \"Turn on bedroom\"\n    \"Turn off kitchen\"\n    \"Turn on the lights\"\n\nWe feed these phrases into Wit.ai - for the first phrase we enter we'll create a new intent \"Turn_on_device\".\n\nAs we add more phrases we'll assign them to this new intent. 
As we give [Wit.ai](https://wit.ai) more examples it will learn what kind of phrase should map onto the same intent. In the future when it sees a new phrase it has never seen before - e.g. \"Turn on the table\" - it will be able to recognise that this phrase should belong to the Turn_on_device intent.\n\n![Wit.ai Create Intent](https://blog.cmgresearch.com/assets/marvin/create_intent.png)\n\nThis gives us the user's intention - what are they trying to do? - we now need to work out what the object is that they are trying to affect. This is handled by creating entities.\n\n![Wit.ai Entity](https://blog.cmgresearch.com/assets/marvin/entity.png)\n\nWe want to turn off and on devices so we will highlight the part of the phrase that corresponds to the device name. In the following phrase \"bedroom\" is the device: \"Turn on **bedroom**\". When we highlight a piece of text in the utterance Wit.ai will prompt us to assign it to an existing or a new Entity.\n\n![Wit.ai Entity](https://blog.cmgresearch.com/assets/marvin/entity2.png)\n\nFinally we want to be able to detect what the user is trying to do to the device. For this we use Traits. Wit.ai has a built-in trait for detecting \"on\" and \"off\" so we can use this for training.\n\n![Wit.ai Entity](https://blog.cmgresearch.com/assets/marvin/trait.png)\n\nOnce we've trained [Wit.ai](https://wit.ai) on a few sample phrases it will start to automatically recognise the Intent, Entity and Trait. 
If it fails to recognise any of these then you can tell it what it should have done and it will correct itself.\n\n![Wit.ai Entity](https://blog.cmgresearch.com/assets/marvin/trained.png)\n\nOnce we are happy that [Wit.ai](https://wit.ai) is performing we can try it out with either text or audio files and see how it performs on real audio.\n\nHere's a sample piece of audio:\n\n[Test Audio](https://blog.cmgresearch.com/assets/marvin/turn_on.wav)\n\nTo send this file to Wit.ai we can use `curl` from the command line.\n\n```\ncurl -XPOST -H 'Authorization: Bearer XXX' -H \"Content-Type: audio/wav\" \"https://api.wit.ai/speech?v=20201015\" --data-binary \"@turn_on.wav\"\n```\n\nThis `curl` command will post the contents of the audio file `turn_on.wav` to Wit.ai.\n\nYou can get the exact values for the `Authorization` header and the `URL` from the settings page of your Wit.ai application.\n\nWit.ai will process the audio file and send us back some JSON that contains the intent, entity and trait that it recognised.\n\nFor the audio sample above we get back:\n\n```json\n{\n  \"text\": \"turn on kitchen\",\n  \"intents\": [\n    {\n      \"id\": \"796739491162506\",\n      \"name\": \"Turn_on_device\",\n      \"confidence\": 0.9967\n    }\n  ],\n  \"entities\": {\n    \"device:device\": [\n      {\n        \"id\": \"355753362231708\",\n        \"name\": \"device\",\n        \"role\": \"device\",\n        \"start\": 8,\n        \"end\": 15,\n        \"body\": \"kitchen\",\n        \"confidence\": 0.9754,\n        \"entities\": [],\n        \"value\": \"kitchen\",\n        \"type\": \"value\"\n      }\n    ]\n  },\n  \"traits\": {\n    \"wit$on_off\": [\n      {\n        \"id\": \"535a80f0-6922-4680-b678-0576f248cdcc\",\n        \"value\": \"on\",\n        \"confidence\": 0.9875\n      }\n    ]\n  }\n}\n```\n\nAs you can see, it's worked out the intent \"Turn_on_device\", it's recognised the name of the device as \"kitchen\" and it's worked out that we want 
to turn the device \"on\".\n\nPretty amazing!\n\n---\n\n## Wiring it all up\n\nSo that's our building blocks completed. We have something that will detect a wake word and we have something that will work out what the user's intention was.\n\nLet's have a look at how this is all wired up on the ESP32 side of things.\n\nI've created a set of libraries for the main components of the project.\n\n![ESP32 Libraries](https://blog.cmgresearch.com/assets/marvin/libraries.png)\n\nThe `tfmicro` library contains the code from TensorFlow Lite and includes everything needed to run a TensorFlow Lite model.\n\nThe `neural_network` library contains a wrapper around the TensorFlow Lite code making it easier to interface with the rest of our project.\n\nTo get audio data into the system we use the `audio_input` library. We can support both I2S microphones directly and analogue microphones using the analogue to digital converter. Samples from the microphone are read into a circular buffer with room for just over 1 second's worth of audio.\n\nOur audio output library `audio_output` supports playing WAV files from SPIFFS via an I2S amplifier.\n\nTo actually process the audio we need to recreate the same process that we used for our training data. 
This is the job of the `audio_processor` library.\n\nThe first thing we need to do is work out the mean and max values of the sample so that we can normalise the audio.\n\n```c++\nint startIndex = reader-\u003egetIndex();\n// get the mean value of the samples\nfloat mean = 0;\nfor (int i = 0; i \u003c m_audio_length; i++)\n{\n    mean += reader-\u003egetCurrentSample();\n    reader-\u003emoveToNextSample();\n}\nmean /= m_audio_length;\n// get the absolute max value of the samples taking into account the mean value\nreader-\u003esetIndex(startIndex);\nfloat max = 0;\nfor (int i = 0; i \u003c m_audio_length; i++)\n{\n    max = std::max(max, fabsf(((float)reader-\u003egetCurrentSample()) - mean));\n    reader-\u003emoveToNextSample();\n}\n```\n\nWe then step through the 1 second of audio extracting a window of samples on each step and computing the spectrogram at each step.\n\nThe input samples are normalised and copied into our FFT input buffer. The input to the FFT is a power of two so there is a blank area that we need to zero out.\n\n```c++\n// extract windows of samples moving forward by step size each time and compute the spectrum of the window\nfor (int window_start = startIndex; window_start \u003c startIndex + 16000 - m_window_size; window_start += m_step_size)\n{\n    // move the reader to the start of the window\n    reader-\u003esetIndex(window_start);\n    // read samples from the reader into the fft input normalising them by subtracting the mean and dividing by the absolute max\n    for (int i = 0; i \u003c m_window_size; i++)\n    {\n        m_fft_input[i] = ((float)reader-\u003egetCurrentSample() - mean) / max;\n        reader-\u003emoveToNextSample();\n    }\n    // zero out whatever else remains in the top part of the input.\n    for (int i = m_window_size; i \u003c m_fft_size; i++)\n    {\n        m_fft_input[i] = 0;\n    }\n    // compute the spectrum for the window of samples and write it to the output\n    
get_spectrogram_segment(output_spectrogram);\n    // move to the next row of the output spectrogram\n    output_spectrogram += m_pooled_energy_size;\n}\n```\n\nBefore performing the FFT we apply a Hamming window and then once we have done the FFT we extract the energy in each frequency bin.\n\nWe follow that by the same average pooling process as in training. And then finally we take the log.\n\n```c++\n// apply the hamming window to the samples\nm_hamming_window-\u003eapplyWindow(m_fft_input);\n// do the fft\nkiss_fftr(\n    m_cfg,\n    m_fft_input,\n    reinterpret_cast\u003ckiss_fft_cpx *\u003e(m_fft_output));\n// pull out the magnitude squared values\nfor (int i = 0; i \u003c m_energy_size; i++)\n{\n    const float real = m_fft_output[i].r;\n    const float imag = m_fft_output[i].i;\n    const float mag_squared = (real * real) + (imag * imag);\n    m_energy[i] = mag_squared;\n}\n// reduce the size of the output by pooling with average and same padding\nfloat *output_src = m_energy;\nfloat *output_dst = output;\nfor (int i = 0; i \u003c m_energy_size; i += m_pooling_size)\n{\n    float average = 0;\n    for (int j = 0; j \u003c m_pooling_size; j++)\n    {\n        if (i + j \u003c m_energy_size)\n        {\n            average += *output_src;\n            output_src++;\n        }\n    }\n    *output_dst = average / m_pooling_size;\n    output_dst++;\n}\n// now take the log to give us reasonable values to feed into the network\nfor (int i = 0; i \u003c m_pooled_energy_size; i++)\n{\n    output[i] = log10f(output[i] + EPSILON);\n}\n```\n\nThis gives us the set of features that our neural network is expecting to see.\n\nFinally we have the code for talking to [Wit.ai](https://wit.ai). 
To avoid having to buffer the entire audio sample in memory we need to perform a chunked upload of the data.\n\nWe create the connection to wit.ai and then upload the chunks of data until we've collected sufficient audio data to capture the user's command.\n\n```c++\nm_wifi_client = new WiFiClientSecure();\nm_wifi_client-\u003econnect(\"api.wit.ai\", 443);\nm_wifi_client-\u003eprintln(\"POST /speech?v=20200927 HTTP/1.1\");\nm_wifi_client-\u003eprintln(\"host: api.wit.ai\");\nm_wifi_client-\u003eprintf(\"authorization: Bearer %s\\n\", access_key);\nm_wifi_client-\u003eprintln(\"content-type: audio/raw; encoding=signed-integer; bits=16; rate=16000; endian=little\");\nm_wifi_client-\u003eprintln(\"transfer-encoding: chunked\");\nm_wifi_client-\u003eprintln();\n```\n\nWe decode the results from [Wit.ai](https://wit.ai) and extract the pieces of information that we are interested in - we care about the intent, the device and whether the user wants to turn the device on or off.\n\n```c++\nconst char *text = doc[\"text\"];\nconst char *intent_name = doc[\"intents\"][0][\"name\"];\nfloat intent_confidence = doc[\"intents\"][0][\"confidence\"];\nconst char *device_name = doc[\"entities\"][\"device:device\"][0][\"value\"];\nfloat device_confidence = doc[\"entities\"][\"device:device\"][0][\"confidence\"];\nconst char *trait_value = doc[\"traits\"][\"wit$on_off\"][0][\"value\"];\nfloat trait_confidence = doc[\"traits\"][\"wit$on_off\"][0][\"confidence\"];\n```\n\nOur application consists of a very simple state machine - we can be in one of two states - we can either be waiting for the wake word, or we can be recognising a command.\n\nWhen we are waiting for the wake word we process the audio as it streams past, grabbing a 1-second window of samples and feeding it through the audio processor and neural network.\n\nWhen the neural network detects the wake word we switch into the command recognition state.\n\nThis state makes a connection to Wit.ai - this can 
take up to 1.5 seconds as making an SSL connection on the ESP32 is quite slow.\n\nWe then start streaming samples up to the server - to allow for the SSL connection time we rewind 1 second into the past so that we don't miss too much of what the user said.\n\nOnce we've streamed 3 seconds of samples we ask wit.ai what the user said. We could get more clever here and wait until the user has stopped speaking.\n\n[Wit.ai](https://wit.ai) processes the audio and tells us what the user asked. We pass that on to our intent processor to interpret the request and move to the next state, which will put us back into waiting for the wake word.\n\nOur intent processor simply looks at the intent name that wit.ai provides us and carries out the appropriate action.\n\n---\n\n## What's next?\n\nSo there we have it, a DIY Alexa.\n\nAll the source code is in [GitHub](https://github.com/atomic14/diy-alexa). It's MIT licensed so feel free to take the code and use it for your own projects.\n\nHow well does it actually work?\n\nReasonably well. We have a very lightweight wake word detection system; it runs in around 100ms and still has room for optimisation.\n\nAccuracy is ok. We need more training data to make it really robust: you can easily trick it into activating by using similar words to \"Marvin\" such as \"marvellous\", \"martin\", \"marlin\" etc. More negative example words would help with this problem.\n\nYou may want to try changing the wake word for a different one or using your own audio samples to train the neural network.\n\nThe [Wit.ai](https://wit.ai) system works very well and you can easily add your own intents and traits to build a very powerful system. 
I've added additional intents to my own project to tell me jokes and you could easily hook the system up to a weather forecast service if you wanted to.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fatomic14%2Fdiy-alexa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fatomic14%2Fdiy-alexa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fatomic14%2Fdiy-alexa/lists"}