Running a Neural Network on an STM32 Microcontroller


Running machine learning models on microcontrollers — TinyML — is one of the most exciting frontiers in embedded systems. In this post, I’ll walk through the process of training a simple model and deploying it on an STM32F4 Discovery board.

Why TinyML?

Microcontrollers are everywhere: sensors, wearables, industrial controllers. Adding ML inference at the edge means:

  • No cloud latency
  • Privacy-preserving (data never leaves the device)
  • Low power consumption
  • Offline operation

The Workflow

1. Train a Model

I used TensorFlow to train a simple 2-layer dense network for gesture recognition using accelerometer data.

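import tensorflow as tf

# 6 input features -> 16 hidden units -> 4 output classes
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(6,)),
    tf.keras.layers.Dense(4, activation='softmax')
])
# categorical_crossentropy expects one-hot encoded labels
model.compile(optimizer='adam', loss='categorical_crossentropy')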
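To make the architecture concrete, here is a rough, framework-free sketch of what this network computes for a single input: a 6-value feature vector passes through 16 ReLU units and then a 4-way softmax. The weights are random placeholders rather than trained values, the input sample is made up, and the helper names (`dense`, `relu`, `softmax`, `random_layer`, `forward`) are my own for illustration.

```python
import math
import random

def dense(x, weights, bias):
    """Compute y = x @ W + b for one input vector (W is in_dim x out_dim)."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    # Subtract the max before exponentiating for numerical stability.
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(in_dim, out_dim, rng):
    # Placeholder weights; a trained model would supply real values.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(out_dim)]
               for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def forward(x, layer1, layer2):
    hidden = relu(dense(x, *layer1))        # 6 -> 16
    return softmax(dense(hidden, *layer2))  # 16 -> 4

rng = random.Random(0)
layer1 = random_layer(6, 16, rng)
layer2 = random_layer(16, 4, rng)

sample = [0.1, -0.2, 0.98, 0.0, 0.05, -0.03]  # made-up 6-value input
probs = forward(sample, layer1, layer2)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The output is a list of four probabilities that sum to 1, and the predicted class is the index of the largest one.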

2. Quantize and Convert

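Float32 weights take four times the flash and RAM of int8 weights, which matters on an MCU with limited memory. Post-training quantization converts the weights from float32 to int8; given a representative dataset for calibration, it quantizes the activations as well:

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Full-integer quantization needs calibration data. representative_data is an
# assumed generator (not shown) yielding [float32 array of shape (1, 6)] samples.
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()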
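What "converting to int8" means numerically can be sketched in a few lines. The example below shows generic affine quantization, where a real value r is approximated as scale * (q - zero_point). It is an illustration of the idea, not the converter's actual code. TensorFlow Lite's int8 scheme, for instance, uses a symmetric variant (zero point fixed at 0) for weights. The function names and sample weights are made up.

```python
def quant_params(values, qmin=-128, qmax=127):
    """Pick scale and zero point so the value range maps onto [qmin, qmax]."""
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all values are zero; any scale works
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    # Round to the nearest step, shift by the zero point, clamp to int8.
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    return [(q - zero_point) * scale for q in q_values]

weights = [-0.62, -0.1, 0.0, 0.33, 0.91]  # made-up float32 weights
scale, zero_point = quant_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error bounded by roughly one quantization step (`scale`).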

3. Deploy with CMSIS-NN

The STM32F4 has no dedicated neural-network accelerator, but its Cortex-M4 core includes DSP (SIMD) instructions, and Arm’s CMSIS-NN library provides software kernels for matrix multiplication and activation functions that are optimized to use them.
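To give a feel for the arithmetic such a kernel performs, here is a plain-Python reference of an int8 fully connected layer: integer multiply-accumulate into a wide accumulator, then rescaling back to int8. This is a sketch of the math only, not CMSIS-NN's implementation. The real kernels (for example `arm_fully_connected_s8`) requantize with a fixed-point multiplier and shift instead of the float `multiplier` used here, and all scales, zero points, and values below are invented for the example.

```python
def fully_connected_int8(x_q, w_q, bias_q, x_zp, out_zp, multiplier):
    """Reference int8 dense layer.

    x_q:        int8 inputs
    w_q:        int8 weights, one row per output (symmetric, zero point 0)
    bias_q:     integer biases at scale input_scale * weight_scale
    multiplier: input_scale * weight_scale / output_scale
    """
    out = []
    for row, bias in zip(w_q, bias_q):
        acc = bias  # a real kernel keeps this in a 32-bit accumulator
        for xq, wq in zip(x_q, row):
            acc += (xq - x_zp) * wq
        y = round(acc * multiplier) + out_zp  # rescale to the output scale
        out.append(max(-128, min(127, y)))    # saturate to int8
    return out

# Invented quantization parameters for the example.
input_scale, weight_scale, output_scale = 0.02, 0.01, 0.05
x_zp, out_zp = -10, 0

x_q = [15, -25, 40]                      # real inputs 0.5, -0.3, 1.0
w_q = [[20, -40, 10], [-50, 30, 60]]     # real weights 0.2, -0.4, 0.1 / -0.5, 0.3, 0.6
bias_q = [500, -1000]                    # real biases 0.1, -0.2

multiplier = input_scale * weight_scale / output_scale
y_q = fully_connected_int8(x_q, w_q, bias_q, x_zp, out_zp, multiplier)
y_real = [(q - out_zp) * output_scale for q in y_q]
print(y_q, y_real)
```

The float result for these inputs would be 0.42 and 0.06; the int8 path yields about 0.40 and 0.05, within one output quantization step.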

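The code below uses the TensorFlow Lite for Microcontrollers interpreter, which can be built with CMSIS-NN kernels. The inference loop on the MCU is straightforward:

// model, resolver, tensor_arena and kTensorArenaSize are set up earlier.
static tflite::MicroInterpreter interpreter(
    model, resolver, tensor_arena, kTensorArenaSize);
interpreter.AllocateTensors();  // once, at startup

// For each frame: fill interpreter.input(0), then run the model.
if (interpreter.Invoke() != kTfLiteOk) {
    // handle the error
}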

Results

The quantized model used only 12 KB of flash and 4 KB of RAM. Inference took approximately 3 ms per frame — well within the requirements for real-time gesture recognition.

Key Takeaways

  • TinyML makes edge intelligence practical
  • Quantization is essential: storing weights as int8 instead of float32 cuts their memory footprint by 4x
  • CMSIS-NN provides excellent performance on ARM Cortex-M cores
  • Start simple and iterate: a small model that runs is better than a large model that doesn’t fit
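
The 4x figure applies to the weights themselves. A quick back-of-the-envelope count for the network in this post, assuming int8 weights and int32 biases (which is how TensorFlow Lite's int8 scheme stores them), looks like this. The variable names are my own, and the count covers raw parameters only, not the rest of the model file or the runtime.

```python
def dense_params(in_dim, out_dim):
    """Return (weight count, bias count) for one dense layer."""
    return in_dim * out_dim, out_dim

layers = [(6, 16), (16, 4)]  # the two Dense layers from the model above

n_weights = sum(dense_params(i, o)[0] for i, o in layers)  # 96 + 64 = 160
n_biases = sum(dense_params(i, o)[1] for i, o in layers)   # 16 + 4 = 20

float32_bytes = 4 * (n_weights + n_biases)      # 720 bytes
quantized_bytes = 1 * n_weights + 4 * n_biases  # 160 + 80 = 240 bytes

weight_ratio = (4 * n_weights) / (1 * n_weights)  # 4.0
overall_ratio = float32_bytes / quantized_bytes   # 3.0
print(float32_bytes, quantized_bytes, weight_ratio, overall_ratio)
```

For a model this small the biases are a noticeable share of the parameters, so the overall saving is closer to 3x; as layers grow, weights dominate and the ratio approaches 4x.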