Intro
What can you do with Parallella? – that question still haunts me at night. Even though we wrote the scheduler application before, I think we all agree that there must be a better way to utilize the Epiphany chip…
In the search for a better use case, I found the TensorFlow port for microcontrollers. That got me thinking… If we can run at least a subset of TF operations on low-powered devices such as the Arduino Nano or an STM32, we should be able to do the same on Epiphany. Wouldn't it be great to turn Parallella into a legit ML accelerator?
Yes, it would! But again, it likely won't be as straightforward as running the Arduino IDE.
Before you read…
You should know I failed to run TensorFlow on Epiphany, so if you're looking for a tutorial-style post, you've hit the wrong page. My objective for this post was to run a modern ML/AI library on Parallella's coprocessor, and even though I didn't make it, I feel it would be harsh to treat this endeavor as a failure. The win here is the experience itself, and how it builds up to the broader story of "What happened to Parallella?".
Though, if I were to spend even more time on the matter, I'd likely get the uTensor library running (bold statement, duh). As it stands, the code silently fails during execution, and as usual, the issue is likely in the lower layers, somewhere between the compiler, the linker, and the driver. Also, even if I succeeded, the performance would be too poor to bother with (explained later), so instead of pursuing the goal, I decided to wrap up the miniseries and move on to another project.
Tensorflow
TensorFlow is Google’s Open Source Machine Learning Framework for dataflow programming across a range of tasks. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
An example tensor dataflow that we’d like to execute on the board is this:
As you see, the graph is just a collection of mathematical operations such as add, sub, multiply, or even more complex ones like matrix multiplication. This workload is a good fit for Epiphany because:
- graphs are good candidates for parallelization
- plenty of CPU-bound operations (math)
- usually, only small chunks of data need to be present at a time
The applications range from simple object detection, through speech recognition, to complex neural networks supporting MRI brain-scan analysis.
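To make the mapping concrete, here's a minimal sketch of how a single graph node (an element-wise add) could look as a per-core kernel. The tile size and the node_add name are made up for illustration; each of the 16 eCores would work on its own slice of the tensor.

#define TILE 64   /* hypothetical per-core tile size */

/* One graph node, e.g. "add", executed by a single eCore on its own tile.
 * Each core gets a different slice of the full tensor.                   */
void node_add(const float *a, const float *b, float *out)
{
    for (int i = 0; i < TILE; i++)
        out[i] = a[i] + b[i];
}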
TF Micro
TensorFlow Lite for Microcontrollers is an experimental port of TensorFlow Lite designed to run machine learning models on microcontrollers and other devices with only kilobytes of memory.
It doesn’t require operating system support, any standard C or C++ libraries, or dynamic memory allocation. The core runtime fits in 16 KB on an Arm Cortex M3, and with enough operators to run a speech keyword detection model, takes up a total of 22 KB.
Wow! 22 KB would fit into the 32 KB Epiphany local memory bank, right? Well, it would, except that once you compile the hello_world example (without symbols, size-optimized, etc.) you'll notice that the binary is ~26x larger (584 KB). A readelf lookup shows that ~486 KB is the actual program code (.text), which is ~83% of the whole binary:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[10] .text PROGBITS 8e00ec38 01ec38 07997c 00 AX 0 0 8
So even if we relink some libraries to DRAM, we still might be short of local memory. On a side note, I also compiled the program for the Arduino Nano, and the binary size for the speech recognition example is around ~900 KB, which is fine for the Arduino since it has:
- CPU Flash Memory 1MB (nRF52840)
- SRAM 256KB (nRF52840)
uTensor
uTensor is an extremely light-weight machine learning inference framework built on Tensorflow and optimized for Arm targets. It consists of a runtime library and an offline tool that handles most of the model translation work. This repo holds the core runtime and some example implementations of operators, memory managers/schedulers, and more, and the size of the core runtime is only ~2KB!
~2 KB sounds even more promising! I used the MNIST handwriting recognition demo and compiled it with the Epiphany C/C++ compiler. The final binary size is 952 KB after optimizations, which is more than TF Lite. The demo does include test data, so we can discount a few kilobytes… still a lot, right?
The .text section alone is ~760 KB:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 7] .text PROGBITS 8e01a208 02a208 0be244 00 AX 0 0 8
Linking options
In the previous post we learned how the choice of linker script affects where data is placed on the heap, but we haven't run any benchmarks yet. In other words, we might have a sense of how much slower placing data in DRAM is compared to SRAM, but it would be just a guess.
Since we can't fit the whole program into the local eCore memory, we need to evaluate the performance of the other linking options, so we ought to run some tests. In short, we have to estimate how many floating-point operations per second we can run on a single core, but with a caveat… Typically, we'd write a short piece of assembly that performs a single floating-point operation and measure the execution time. This way, we'd get a per-core FLOPS approximation, which might be useful to assess raw eCore performance, but it wouldn't help us differentiate between linker scripts.
A better idea is to use a floating-point operation from some library, e.g. math.h, and relocate that library with the linker scripts, such as:
- legacy.ldf – both library and the main code are placed in DRAM
- internal.ldf – both library and the main code are placed in SRAM
- fast.ldf – the library is stored in DRAM and the main code (i.e., stack, .text) in SRAM
The last method (fast.ldf) sounds similar to the text-on-huge-pages trick, where hot functions are placed on huge pages to reduce iTLB pressure. Here, we operate on "raw memory", meaning there is no paging involved, but the general idea is the same: we put the hot symbols in the eCore's local memory bank for faster access. Given that the local eCore memory bank is only 32 KB, the "hybrid" linking makes a lot of sense; otherwise we would quickly run out of space (internal.ldf) or get far worse performance (legacy.ldf).
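To illustrate the hybrid idea in code, bulky cold data can be pushed out to external DRAM explicitly, while the hot code stays in the default .text that fast.ldf keeps in the local bank. This is only a sketch: it assumes the linker script defines a shared_dram output section (the stock SDK examples use that name), so double-check the section name against your .ldf.

/* Cold, bulky data: explicitly placed in external DRAM via the
 * (assumed) shared_dram output section defined by the .ldf script. */
static float big_lut[4096] __attribute__((section("shared_dram")));

/* Hot code: left in the default .text, which fast.ldf keeps in the
 * eCore's local 32 KB bank for fast access.                         */
static float hot_dot(const float *a, const float *b, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

int main(void)
{
    volatile float *result = (float *)0x7000;   /* scratch word in local memory */
    *result = hot_dot(big_lut, big_lut, 64);
    return 0;
}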
Speaking of performance, that’s the code we are about to benchmark with:
#include <stdlib.h>
#include <math.h>

/* loop counter kept in eCore local memory, readable from the host */
volatile unsigned long long *count = (void *) 0x7000;

int main(void) {
    (*count) = 0;
    while (1) {
        sqrt(2000);
        (*count)++;
    }
    return EXIT_SUCCESS;
}
Once you compile the above program with the three linker scripts, you might want to compare the section placement:
Linker script | NEW_LIB_RO | .text
---|---|---
internal.ldf | 0000010c | 00000708
fast.ldf | 8e000000 | 00000130
legacy.ldf | 8e000000 | 8e000600
As you see, all is as we expected. It’s time to run the benchmark!
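On the host side, the counter at 0x7000 can be sampled over a fixed interval to turn the endless loop into an operations-per-second figure. Below is a minimal sketch using the e-hal API; the ELF name, core coordinates, and sampling window are placeholders:

#include <stdio.h>
#include <unistd.h>
#include <e-hal.h>

int main(void)
{
    e_platform_t platform;
    e_epiphany_t dev;
    unsigned long long before = 0, after = 0;

    e_init(NULL);
    e_reset_system();
    e_get_platform_info(&platform);

    e_open(&dev, 0, 0, 1, 1);                   /* a single eCore at (0,0) */
    e_load("e_bench.elf", &dev, 0, 0, E_TRUE);  /* load the benchmark and start it */

    sleep(1);                                    /* let the loop warm up */
    e_read(&dev, 0, 0, 0x7000, &before, sizeof(before));
    sleep(10);                                   /* sampling window */
    e_read(&dev, 0, 0, 0x7000, &after, sizeof(after));

    printf("~%llu sqrt() iterations/s\n", (after - before) / 10);

    e_close(&dev);
    e_finalize();
    return 0;
}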
The results are conclusive… storing the whole program in DRAM sucks, resulting in a ~14x slowdown compared to fast/internal linking. But do we have any other choice? No.
Linking to DRAM
Okay, DRAM access is slow, but it's the only option we have. To place the whole program in external memory, we need to use the legacy.ldf linker script, but with some extra setup:
- the linker script's base address is set to `0x8e000000`, meaning the program will be located at this address for every core. Obviously, if all 16 eCores use the same (read/write) sections, the program will fail unexpectedly. To fix that, we pre-allocate a memory region for each program (e.g., 1 MB) and place each one at the right offset (see the build script below)
- each binary is loaded separately via `e_load("elf/e_prime_X_X.elf", &dev, X, X, E_FALSE);` (a host-side loading sketch follows the build script)
A short script to correctly place the executable(s) on the external DRAM:
BASE_ADDR=0x8e000000
OFFSET=0x100000

for row in $(seq 0 3); do
    for col in $(seq 0 3); do
        absolute_row=$(expr 32 + $row)
        absolute_col=$(expr 8 + $col)
        echo "building $row $col -> $absolute_row $absolute_col -> $BASE_ADDR"

        e-gcc -g -O2 -T legacy.ldf src/e_main.c -o "elf/e_prime_${row}_${col}.elf" -le-lib -lm \
            -Xlinker --defsym=_CORE_ROW_=$absolute_row \
            -Xlinker --defsym=_CORE_COL_=$absolute_col \
            -Xlinker --defsym=_SHARED_DRAM_=$BASE_ADDR

        printf -v BASE_ADDR '%#x' "$((BASE_ADDR + OFFSET))"
    done
done
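The host then loads each per-core ELF individually (the e_load call from the list above) and only starts the workgroup once everything is in place. A rough sketch, assuming a 4x4 workgroup and the file naming from the build script:

#include <stdio.h>
#include <e-hal.h>

int main(void)
{
    e_platform_t platform;
    e_epiphany_t dev;
    char elf[64];

    e_init(NULL);
    e_reset_system();
    e_get_platform_info(&platform);
    e_open(&dev, 0, 0, 4, 4);                     /* the whole 4x4 workgroup */

    /* each core gets its own ELF, linked at its own 1 MB DRAM slot */
    for (int row = 0; row < 4; row++) {
        for (int col = 0; col < 4; col++) {
            snprintf(elf, sizeof(elf), "elf/e_prime_%d_%d.elf", row, col);
            e_load(elf, &dev, row, col, E_FALSE); /* load, but don't start yet */
        }
    }
    e_start_group(&dev);                          /* start all 16 cores at once */

    /* ... wait for / collect results here ... */

    e_close(&dev);
    e_finalize();
    return 0;
}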
To build TensorFlow libraries you’d need to modify the code, for example:
- TensorFlow Lite – https://gist.github.com/mkaczanowski/2552763d44e5002a875eb181c9bb927f
- uTensor – https://gist.github.com/mkaczanowski/117d6e011eea751865654687b190bd33 (legacy-master branch)
The takeaway
Looking at the hardware specs, the Parallella board might seem like a great fit for AI / machine learning workloads, but as we've seen in this post, hardware without reliable software isn't that useful. Having first-class support for the C/C++ compiler and the most commonly used libraries, like TensorFlow, is vital for platform adoption. Unfortunately, that's what's lacking here.
The key problems we saw in this post:
- program size – modern libraries tend to have a footprint larger than 32 KB. Even if we relocate selected parts of the binary to external memory, the local memory might not be enough to hold the stack, heap, or writable sections. Solution: expand the local memory bank, or add flash memory (~1 MB) on the NoC (like fast.ldf, but with faster access than DRAM)
- reliable toolchain/platform – while it's technically possible to run a large program stored in DRAM, I didn't manage to do it, and I'd consider myself an experienced user. Troubleshooting is still quite hard since you need to cut through every layer (software, compiler, linker, driver).
Deep down I hoped it would work, but well, maybe I'll come back to this one day…
See other posts!
- Parallella (part 1): Case study
- Parallella (part 2): Hardware
- Parallella (part 3): Kernel
- Parallella (part 4): ISA
- Parallella (part 5): elibs
- Parallella (part 6): FreeRTOS
- Parallella (part 10): Power efficiency
- Parallella (part 11): malloc
- Parallella (part 12): Tensorflow?
- Parallella (part 13): Closing notes