
Series: Exploring MCUs
Running machine learning models on microcontrollers — TinyML — is one of the most exciting frontiers in embedded systems. In this post, I’ll walk through the process of training a simple model and deploying it on an STM32F4 Discovery board.
Why TinyML?
Microcontrollers are everywhere: sensors, wearables, industrial controllers. Adding ML inference at the edge means:
- No cloud latency
- Privacy-preserving (data never leaves the device)
- Low power consumption
- Offline operation
The Workflow
1. Train a Model
I used TensorFlow to train a simple 2-layer dense network for gesture recognition using accelerometer data.

    import tensorflow as tf

    # Two dense layers: 6 input features in, 4 output classes out
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(6,)),
        tf.keras.layers.Dense(4, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')

2. Quantize and Convert
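To see what quantization has to work with, here is a back-of-the-envelope count of the parameters in the network from step 1. It covers raw parameter storage only; the `.tflite` file and the runtime add overhead on top. One detail worth flagging: in TFLite's int8 scheme the weights are stored as int8 but the biases stay int32, so the sketch counts them separately. The `layers` list is my own shorthand for the model's shapes.

```python
# (inputs, outputs) for each dense layer in the model from step 1
layers = [(6, 16), (16, 4)]

weights = sum(n_in * n_out for n_in, n_out in layers)
biases = sum(n_out for _, n_out in layers)

# float32: every parameter takes 4 bytes
float32_bytes = 4 * (weights + biases)

# int8 scheme: 1 byte per weight, biases kept as 4-byte int32
int8_bytes = 1 * weights + 4 * biases

print(f"weights={weights}, biases={biases}")
print(f"float32: {float32_bytes} bytes, int8: {int8_bytes} bytes")
```

For a model this small the int32 biases hold the overall ratio at 3x. As layers get wider the weights dominate and the savings approach 4x.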
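Full-precision floats are too heavy for an MCU. Post-training quantization converts weights from float32 to int8. Full-integer quantization also needs a representative dataset so the converter can calibrate activation ranges:

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # representative_data_gen is assumed here: a generator yielding sample input batches
    converter.representative_dataset = representative_data_gen
    converter.target_spec.supported_types = [tf.int8]
    tflite_model = converter.convert()

3. Deploy with CMSIS-NN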
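The converted model has to be linked into the firmware somehow. A common approach is to embed the flatbuffer as a C array, which is what `xxd -i` produces. Below is a minimal Python sketch of the same idea. The helper `to_c_array` and the array name `g_model` are made-up names, and the 16-byte alignment is a common convention rather than something this post specifies.

```python
def to_c_array(data: bytes, name: str = "g_model", per_line: int = 12) -> str:
    """Render raw bytes as a C++ source snippet, similar to `xxd -i`."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        f"alignas(16) const unsigned char {name}[] = {{\n{body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )

# Stand-in bytes; in practice this would be tflite_model from the converter.
example = to_c_array(bytes(range(5)))
print(example)
```

The generated text can be saved as a `.cc` file and compiled alongside the rest of the firmware.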
The STM32F4 has no hardware acceleration for neural networks, but ARM’s CMSIS-NN library provides optimized software kernels for matrix multiplication and activation functions.
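To make concrete what an int8 fully connected kernel computes, here is a plain-Python sketch of the arithmetic. This is not CMSIS-NN's implementation, which uses fixed-point multipliers and SIMD instructions. It only illustrates the affine scheme TFLite uses, `real = scale * (q - zero_point)`, with weights quantized symmetrically (zero point 0). The scales, zero point, and sample values are made up for the example.

```python
def quantize(values, scale, zero_point):
    """Map floats to int8 so that real ~= scale * (q - zero_point)."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def int8_dot(q_x, x_zero_point, q_w):
    """Integer-only dot product; on the MCU the accumulator would be int32."""
    return sum((x - x_zero_point) * w for x, w in zip(q_x, q_w))

x = [0.5, -1.0, 0.25]    # example activations
w = [0.25, 0.5, -0.75]   # example weights
x_scale, x_zero_point = 1 / 64, -10
w_scale = 1 / 128        # weights are symmetric: zero point is 0

q_x = quantize(x, x_scale, x_zero_point)
q_w = quantize(w, w_scale, 0)

acc = int8_dot(q_x, x_zero_point, q_w)   # pure integer math
approx = acc * x_scale * w_scale         # rescale back to a real value
exact = sum(a * b for a, b in zip(x, w))

print(q_x, q_w, acc, approx, exact)
```

I picked power-of-two scales so the values quantize without rounding error and `approx` matches `exact`. With real data the two differ slightly, and that difference is the accuracy cost of quantization.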
The inference loop on the MCU is straightforward:

    static tflite::MicroInterpreter interpreter(
        model, resolver, tensor_arena, kTensorArenaSize);
    interpreter.AllocateTensors();  // once at startup, before the first Invoke()
    interpreter.Invoke();           // run one inference

Results
The quantized model used only 12 KB of flash and 4 KB of RAM. Inference took approximately 3 ms per frame — well within the requirements for real-time gesture recognition.
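To put those numbers in context, here is a quick headroom calculation. Two things in it are assumptions, not measurements from this post: the board capacities (1 MB flash and 192 KB RAM, as on the STM32F407VG used on the common STM32F4 Discovery) and the 50 Hz frame rate, which I picked purely for illustration.

```python
# Measured values from the results above
inference_ms = 3.0
flash_kb, ram_kb = 12, 4

# Assumptions: STM32F407VG capacities and an illustrative frame rate
board_flash_kb, board_ram_kb = 1024, 192
frame_rate_hz = 50

frame_period_ms = 1000 / frame_rate_hz
cpu_share = inference_ms / frame_period_ms
flash_share = flash_kb / board_flash_kb
ram_share = ram_kb / board_ram_kb

print(f"CPU time per frame: {cpu_share:.0%}")
print(f"Flash used: {flash_share:.1%}, RAM used: {ram_share:.1%}")
```

Under these assumptions inference takes 15% of each frame period, which leaves most of the CPU time for sensor reads and application logic.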
Key Takeaways
- TinyML makes edge intelligence practical
- Quantization is essential: storing weights as int8 instead of float32 cuts their size by 4x
- CMSIS-NN provides excellent performance on ARM Cortex-M cores
- Start simple and iterate: a small model that runs is better than a large model that doesn’t fit