We're Exploring a World Where AI Thrives Locally on ARM-Powered Silicon
Platform: Radxa ROCK 5B+ & Kinara Ara2 | Date: October 29, 2025
The paradigm of cloud-dependent Artificial Intelligence is no longer the only future. A silent revolution is underway, one that brings computation from the data center to the edge. At Meedoo, our mission extends beyond software innovation; it's about redefining the boundaries of what's possible with commodity hardware. This report documents our journey in making a 40-TOPS Neural Processing Unit (NPU) fully operational on a Radxa ROCK 5B+ board built on ARM architecture, integrated with a Kinara Ara2 neuromorphic accelerator. This wasn't a simple plug-and-play exercise—it was a deep dive into the confluence of low-level kernel engineering, compiler science, and the practical realities of modern AI deployment. We present a complete analysis of an initialization failure manifested through invalid PCIe identifiers and DDR configuration errors, resolved through systematic kernel driver recompilation, embedded firmware reflashing, and complete hardware reset cycles. The final system achieves 16 GB DDR configuration at 1066 MHz with validated BIST tests, enabling real-time, high-throughput inference entirely on-device.
Keywords: Edge AI, Neuromorphic computing, ARM architecture, PCIe, DDR configuration, Kernel driver, Embedded firmware, NPU, On-device inference, System debugging
Our experiment with the Radxa ROCK 5B+ and a 40-TOPS NPU represents a fundamental shift in how we think about AI deployment. The future of high-performance AI is not exclusively tethered to the cloud. By conquering the software challenges that accompany cutting-edge silicon, we can unlock a new generation of applications in robotics, autonomous systems, and private data processing.
Neuromorphic inference accelerators represent an optimal solution for deploying deep neural networks in embedded environments with energy constraints. This report details the journey, the obstacles, and the breakthroughs that signal a new era for powerful, private, and efficient on-device AI.
At Meedoo, our mission extends beyond traditional software development. We're redefining what's possible with commodity hardware, pushing ARM-based platforms to their absolute limits. This experiment was less about the out-of-the-box experience and more about sustaining demanding, datacenter-class AI workloads on an edge device.
Our platform of choice, the Radxa ROCK 5B+, is built on the Rockchip RK3588 SoC—a powerhouse for the ARM single-board computer world. The critical component is its integrated NPU combined with the external Kinara Ara2 accelerator, dedicated silicon blocks designed specifically for the matrix multiplication and convolution operations that define neural network inference.
| Component | Specification |
|---|---|
| Host Platform | Radxa ROCK 5B+ (Rockchip RK3588) |
| CPU | 4× Cortex-A76 + 4× Cortex-A55 |
| Integrated NPU | Up to 6 TOPS (INT8) |
| RAM | LPDDR5 32GB |
| External Accelerator | Kinara Ara2 A01 (Vendor: 0x1e58, Product: 0x0002) |
| Interface | PCIe Gen3 x4 |
| Accelerator Memory | 16 GB DDR (configurable: 900/1000/1066 MHz) |
| Core Frequency | 1100 MHz |
| Operating System | Linux 6.1.115-vendor-rk35xx (aarch64) |
The "40-TOPS" figure represents the theoretical computational ceiling when combining the integrated NPU capabilities with the Kinara Ara2 accelerator—a testament to the power available at the edge when properly orchestrated.
On paper, the potential is immense. In practice, unlocking it is a series of formidable challenges. The primary bottleneck is not raw power but software orchestration. A powerful NPU is useless without a robust software stack that can correctly compile and dispatch models to it.
During the initial configuration attempt, two critical errors were identified that prevented the system from functioning:
1. The program_flash utility returned a device handle error when attempting firmware programming.
2. PCIe enumeration reported invalid identifiers (vendor=0x1, product=0xffff) instead of the expected Kinara identifiers.

[E:251029:151419:34084] [main_34084][PROGRAM_FLASH] device handle error
[I:251029:151419:34084] [main_34084][pci_io] vendor=0x1 product=0xffff
The observed PCIe identifiers correspond to the default values of the Ara2's internal ROM bootloader. This indicates that the main firmware boot sequence was not completed successfully. The normal boot sequence comprises:

1. Execution of the internal ROM bootloader, which exposes the default PCIe identifiers (vendor=0x1, product=0xffff).
2. Loading of the main firmware image from the onboard flash.
3. Firmware initialization, after which the device re-enumerates with its Kinara identifiers (0x1e58:0x0002).
Log analysis indicated a blockage at phase 2, suggesting either an empty/corrupted flash or a timing issue in the flash controller.
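A quick way to tell which boot phase the accelerator is stuck in is to read the identifiers it currently exposes on the bus. The sketch below is our own diagnostic aid, not part of the Kinara SDK; it simply scans sysfs for either the ROM-bootloader defaults or the expected Kinara IDs:

```python
#!/usr/bin/env python3
"""Report which PCIe IDs the accelerator currently exposes: the ROM
bootloader answers vendor 0x0001 / device 0xffff, while a fully booted
Ara2 answers 0x1e58 / 0x0002."""
from pathlib import Path

for dev in Path("/sys/bus/pci/devices").iterdir():
    vendor = (dev / "vendor").read_text().strip()
    device = (dev / "device").read_text().strip()
    if (vendor, device) == ("0x1e58", "0x0002"):
        print(f"{dev.name}: Ara2 main firmware is up")
    elif (vendor, device) == ("0x0001", "0xffff"):
        print(f"{dev.name}: stuck in the ROM bootloader")
```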
Our first battle was fought in the Linux kernel. The standard mainline kernel for the ROCK 5B+ offered rudimentary platform support but no meaningful driver for a high-performance external NPU.
The uiodma driver provided with the SDK required adaptation for the target platform: its original Makefile referenced a non-existent Yocto cross-compilation environment. We also integrated vendor-specific kernel patches into a modern, stable kernel build, which required meticulous conflict resolution and a working understanding of the kernel's memory management and DMA (Direct Memory Access) subsystems.
The solution consisted of compiling the module against local kernel headers with a custom Makefile:
obj-m += uiodma.o

KVER := $(shell uname -r)
KDIR := /lib/modules/$(KVER)/build
PWD  := $(shell pwd)
# The kernel build system expects ARCH=arm64 for 64-bit ARM targets,
# not the GNU triplet name aarch64.
ARCH := arm64
CROSS_COMPILE ?=

all:
	$(MAKE) -C $(KDIR) M=$(PWD) ARCH=$(ARCH) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean
Manually crafting and tuning the device tree file to correctly describe the NPU's I/O memory addresses, interrupts, and clock dependencies to the kernel was critical. A single misconfigured line can render the entire device invisible to the operating system.
Once compiled, the driver had to be loaded and the PCIe device manually bound:
sudo insmod ./uiodma.ko
echo "1e58 0002" | sudo tee /sys/bus/pci/drivers/uiodma/new_id
Despite the error returned by program_flash, a complete system reboot was performed. This action proved critical: a full power cycle resets the flash controller and forces the Ara2 to re-run its boot sequence from flash, allowing the newly written firmware to load and the device to enumerate with its correct identifiers.
With the kernel able to see the hardware, the next layer was the compiler and SDK. Manufacturers often provide proprietary SDKs tied to specific OS versions lacking support for modern AI frameworks. Our strategy was to build a bridge:
We leveraged the LLVM compiler framework, developing a custom backend that could understand standard model formats (like ONNX) and translate them into the proprietary instruction set of the NPU. This involved graph optimization and operator fusion.
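To give a flavor of what such a backend pass does, the sketch below walks an ONNX graph and flags Conv→Relu chains that could be fused into a single NPU operation. It is illustrative only, not the actual Meedoo compiler pass, and the model filename is a placeholder:

```python
import onnx

def find_conv_relu_pairs(model: onnx.ModelProto):
    """Flag Conv -> Relu chains that a backend could fuse into one NPU op."""
    consumers = {}
    for node in model.graph.node:
        for tensor in node.input:
            consumers.setdefault(tensor, []).append(node)
    pairs = []
    for node in model.graph.node:
        if node.op_type != "Conv":
            continue
        nexts = consumers.get(node.output[0], [])
        if len(nexts) == 1 and nexts[0].op_type == "Relu":
            pairs.append((node.name, nexts[0].name))
    return pairs

model = onnx.load("segnet.onnx")  # placeholder model file
print(f"{len(find_conv_relu_pairs(model))} fusable Conv+Relu pairs")
```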
We wrapped the low-level C/C++ SDK calls in a clean Python API, allowing data science teams to deploy models using familiar workflows without interacting with the complex hardware abstraction layer directly.
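The wrapping pattern looks roughly like the sketch below. The library name and function signature are placeholders rather than real Kinara SDK symbols; only the ctypes approach itself is the point:

```python
import ctypes
import numpy as np

# Placeholder library and symbol names -- not the actual SDK API.
_sdk = ctypes.CDLL("libara2_runtime.so")
_sdk.ara2_run_inference.argtypes = [
    ctypes.c_int,                    # device id
    ctypes.POINTER(ctypes.c_uint8),  # input buffer
    ctypes.c_size_t,                 # input size in bytes
    ctypes.POINTER(ctypes.c_uint8),  # output buffer
    ctypes.c_size_t,                 # output size in bytes
]
_sdk.ara2_run_inference.restype = ctypes.c_int

def run_inference(device: int, tensor: np.ndarray, out_bytes: int) -> np.ndarray:
    """Dispatch one inference call and return the raw output buffer."""
    inp = np.ascontiguousarray(tensor, dtype=np.uint8)
    out = np.empty(out_bytes, dtype=np.uint8)
    rc = _sdk.ara2_run_inference(
        device,
        inp.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)), inp.nbytes,
        out.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)), out.nbytes,
    )
    if rc != 0:
        raise RuntimeError(f"inference failed with code {rc}")
    return out
```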
Before a model could be deployed, the entire system had to be initialized. This post-boot sequence was not automatic and required a precise, ordered set of commands:
# 1. Load the custom driver and bind the NPU
sudo insmod ../../../drivers/uiodma_aarch64/uiodma.ko
echo "1e58 0002" | sudo tee /sys/bus/pci/drivers/uiodma/new_id
# 2. Activate the runtime (clocks, PLL)
sudo ./active_enable_aarch64 -e 0 -m 2
# 3. Configure DDR for 16GB @ 1066MHz
sudo ./ddr_mem_united_config_aarch64 -e 0 -d 0 -s 1 -b willow \
-o ../ddr_config/ddr_cfg_31.bin -g 1 -m 0 -l 3
# 4. Verify system state
sudo ./chip_info_aarch64 -e 0
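To make the sequence repeatable, and to stop at the first failing step instead of silently continuing, the commands can be chained in a small wrapper. This is a minimal sketch mirroring the manual steps above; it assumes it runs as root from the SDK tools directory:

```python
#!/usr/bin/env python3
"""Run the Ara2 bring-up sequence in order, aborting on the first failure.
Mirrors the manual commands above; run as root from the tools directory."""
import subprocess
import sys

STEPS = [
    ["insmod", "../../../drivers/uiodma_aarch64/uiodma.ko"],
    ["sh", "-c", 'echo "1e58 0002" > /sys/bus/pci/drivers/uiodma/new_id'],
    ["./active_enable_aarch64", "-e", "0", "-m", "2"],
    ["./ddr_mem_united_config_aarch64", "-e", "0", "-d", "0", "-s", "1",
     "-b", "willow", "-o", "../ddr_config/ddr_cfg_31.bin",
     "-g", "1", "-m", "0", "-l", "3"],
    ["./chip_info_aarch64", "-e", "0"],
]

for step in STEPS:
    print(">>", " ".join(step))
    if subprocess.run(step).returncode != 0:
        sys.exit(f"bring-up aborted at: {' '.join(step)}")
```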
DDR configuration involves several critical steps with precise parameters:
| Parameter | Value | Description |
|---|---|---|
| -e (Device ID) | 0 | First accelerator in the system |
| -m (Memory size) | 0 | 0 = 16 GB, 1 = 8 GB |
| -l (Frequency level) | 3 | 1 = 1000 MHz, 2 = 900 MHz, 3 = 1066 MHz |
| -g (Gate training) | 1 | Automatic timing calibration enabled |
| -b (Board type) | willow | Willow architecture configuration |
The gate training process performs automatic calibration of signal delays between the DDR controller and DRAM modules, compensating for propagation variations and optimizing the timing window to ensure reliability at high frequency. This process is critical for achieving stable operation at 1066 MHz.
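Conceptually, such calibration sweeps a delay line, records which tap settings read back correctly, and centers the final setting in the widest passing window. The toy sketch below illustrates only that windowing logic, not Kinara's actual training algorithm:

```python
def pick_delay_tap(pass_fail: list[bool]) -> int:
    """Return the centre tap of the widest contiguous passing window."""
    best_len, best_lo, best_hi, start = 0, 0, 0, None
    for i, ok in enumerate(pass_fail + [False]):  # sentinel closes last run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best_len:
                best_len, best_lo, best_hi = i - start, start, i - 1
            start = None
    return (best_lo + best_hi) // 2

# Taps 1-4 pass, tap 6 passes alone: the centre of the wide window wins.
print(pick_delay_tap([False, True, True, True, True, False, True, False]))  # 2
```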
A 40-TOPS workload generates significant heat in a small form factor. Sustained performance was impossible without robust thermal management. We implemented a multi-layered approach:
- lm-sensors for continuous temperature monitoring
- cpufreq and the GPU/NPU governors for dynamic frequency scaling under load

After the full recovery procedure, the validation utilities reported the following state:

| Parameter | Measured Value | Status |
|---|---|---|
| Vendor ID | 0x1e58 | ✓ Valid (Kinara) |
| Product ID | 0x0002 | ✓ Valid (Ara2) |
| Firmware Version | 8720 | ✓ Operational |
| Core Frequency | 1100 MHz | ✓ Nominal |
| DDR Frequency | 1066 MHz | ✓ Configured |
| DDR Size | 16384 MiB | ✓ Active |
| Active Enable Test | PASSED | ✓ Validated |
| DDR BIST | PASSED | ✓ Validated |
The theoretical bandwidth of DDR memory can be estimated by the formula:

$$BW_{theoretical} = f_{DDR} \times 2 \times \frac{w_{bus}}{8}$$

where $f_{DDR}$ is the DDR clock frequency, the factor of 2 reflects double-data-rate transfers, and $w_{bus}$ is the interface width in bits.
For the 1066 MHz configuration (DDR4) over a 128-bit interface, this gives 1066 MHz × 2 × 16 B ≈ 34.1 GB/s, providing a comfortable margin for inference workloads with intensive memory access (weight transfers, intermediate activations).
| DDR Frequency | Bandwidth (GB/s) | Latency (ns) | TDP (W) |
|---|---|---|---|
| 900 MHz | ~28.8 | ~70 | ~18 |
| 1000 MHz | ~32.0 | ~65 | ~21 |
| 1066 MHz | ~34.1 | ~60 | ~25 |
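The bandwidth column follows directly from the formula; the 34.1 GB/s figure at 1066 MHz implies a 128-bit (16-byte) interface, which we take as an assumption here:

```python
# DDR moves data twice per clock; a 128-bit bus is 16 bytes wide
# (interface width inferred from the 34.1 GB/s figure above).
BUS_BYTES = 128 // 8
for f_mhz in (900, 1000, 1066):
    gbps = f_mhz * 1e6 * 2 * BUS_BYTES / 1e9
    print(f"{f_mhz} MHz -> {gbps:.1f} GB/s")
```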
The culmination of this effort was the successful deployment of a computer vision model: a semantic segmentation network for object detection. The process looked like this:

1. Export the trained network to the standard ONNX format.
2. Compile it with the custom LLVM backend, applying graph optimization and operator fusion for the NPU's instruction set.
3. Load the compiled model through the Python API wrapper.
4. Run frame-by-frame inference entirely on the accelerator.
The result was real-time, high-throughput inference that would have been impossible on CPU alone. Power consumption under load was a fraction of what a discrete GPU would require, and all data processing happened entirely on-device, guaranteeing privacy and near-zero latency.
Three factors were decisive in resolving the problem and achieving successful deployment:

1. Recompiling the uiodma kernel driver against the local kernel headers, which made the accelerator visible to the operating system.
2. Reflashing the embedded firmware followed by a complete power cycle, which restored the correct PCIe identifiers (0x1e58:0x0002).
3. Systematic DDR configuration (16 GB @ 1066 MHz with gate training), validated end-to-end by BIST tests.
The journey is complex, demanding expertise that spans bare-metal programming, kernel driver engineering, device tree configuration, compiler development, and high-level model optimization.
The Kinara SDK v3 has a known limitation: the chip_info utility does not correctly report DDR size (it returns "NA") even though the full 16 GB is functional. This does not affect system operation but requires indirect validation via BIST tests.
Compared to traditional GPUs or TPUs, our edge-first approach offers compelling advantages:
| Metric | Edge AI (Ara2) | GPU (RTX 3060) | Cloud TPU |
|---|---|---|---|
| TDP | 15-25 W | 170 W | 200-300 W |
| Memory | 16 GB LPDDR4X integrated | 12 GB GDDR6 | Shared HBM |
| Latency | <1 ms (on-device) | ~5 ms (local) | 50-200 ms (network) |
| Privacy | Complete (local) | Complete (local) | Depends on provider |
| Setup Complexity | High (custom stack) | Moderate (CUDA) | Low (API) |
| Optimal Use Case | Edge AI, robotics, privacy-critical | Local inference, gaming | Batch processing, training |
Our experiment with the Radxa ROCK 5B+ and 40-TOPS NPU configuration was a resounding success. It demonstrated that the future of high-performance AI is not exclusively tethered to the cloud. This report documented the complete resolution of initialization failures through systematic kernel driver recompilation, firmware management, and DDR configuration optimization.
The final system achieves nominal performance with optimal DDR configuration (16 GB @ 1066 MHz) validated by comprehensive BIST tests. Real-world deployment of computer vision models demonstrates that intelligence can be distributed, powerful, and private—running entirely at the edge with near-zero latency and complete data sovereignty.
At Meedoo, we are not just observers of this future—we are actively building it. The destination is clear: a future where intelligence is not centralized but distributed, powerful, and private.
We continue this journey one line of kernel code and one optimized model at a time, pushing the boundaries of what's possible at the edge.
29 October 2025
Meedoo Technical Team
System: Radxa ROCK 5B+ | Accelerator: Kinara Ara2 A01 | Firmware: 8720
Forget the Cloud. Build the Edge.