# Parallella (part 11): malloc

memory allocator

Intro

If I were to name one thing that makes modern programming languages great, I’d pick dynamic memory allocation (usually interfaced via the malloc/free functions). Imagine keeping track of memory allocations yourself, drowning in seemingly random addresses, and handling fragmentation manually… Madness, to say the least.

The Parallella board comes with a C/C++ SDK, but unfortunately without a reliable memory allocator. Long story short: calling malloc from an eCore won’t work for you.

If you ask me, the lack of a reliable malloc is the worst mistake Adapteva made, and it could well have been the turning point in the platform’s adoption. Let me explain why…

Memory allocation

We already covered a good bit of memory management in the kernel section, but let’s do a quick recap here.

A memory allocator is responsible for… memory allocation. Thank you, Mr. Obvious! Imagine you have one block of memory, i.e., RAM, and X processes, each of which requires a "chunk" of it. How would you divide the main block into pieces?

You could take the naive approach and split memory into equal-sized chunks, but then you’d waste a lot of space. A better solution is to divide memory into size classes (e.g., 4, 8, 16… bytes) that are handed out based on the requested block size. That way we increase overall utilization, but also fragmentation, so it’d be great to have a piece of code that slices and dices the memory automagically.
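The size-class idea can be sketched with a simple round-up function (plain C, not Epiphany-specific; the 4-byte minimum and power-of-two classes here are illustrative choices, not a real allocator's policy):

```c
#include <assert.h>
#include <stddef.h>

/* Round a requested size up to the nearest power-of-two size class,
 * starting at a 4-byte minimum. Requests then share a small set of
 * chunk sizes instead of getting exact-fit blocks. */
static size_t size_class(size_t request)
{
    size_t sz = 4;              /* smallest class */
    while (sz < request)
        sz <<= 1;               /* 4, 8, 16, 32, ... */
    return sz;
}
```

For example, a 5-byte request lands in the 8-byte class, and a 17-byte request in the 32-byte class; the slack between request and class size is exactly the internal fragmentation mentioned above.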

That piece of code is called a "memory allocator," and you have certainly used one in your C/C++, Go, Python, you_name_it applications via the malloc/free functions.

    AD 1                 AD 2                 AD 3
+----------+         +----------+         +----------+
|          |         |          |         |  stack   |
|          |         |          |         +----+-----+
|          |         |          |         |    |     |
|          |         +----------+         |    v     |
|          |         |          |         |          |
|          |         |          |         |          |
|          |         +----------+         |          |
|          |         |          |         |          |
|  MEMORY  | +-----> |          | +-----> |          |
|          |         |          |         |          |
|          |         |          |         |          |
|          |         +----------+         |    ^     |
|          |         |          |         |    |     |
|          |         +----------+         +----+-----+
|          |         |          |         |   heap   |
|          |         +----------+         +----------+
|          |         |          |         |   bss    |
|          |         |          |         +----------+
|          |         |          |         |   data   |
|          |         +----------+         +----------+
|          |         |          |         |   text   |
+----------+         +----------+         +----------+

The above illustration should help you understand the program/process memory layout:

  • AD 1 – every application gets a piece of memory
  • AD 2 – which then gets split into pieces
  • AD 3 – in detail: the ELF format specifies where each segment ("text", "data", etc.) is placed

Dynamic allocations happen on the heap, which grows via the (s)brk call (or mmap, but that’s not relevant to Epiphany).


Why malloc?

Epiphany III comes with 32kb of memory per core and ~32MB of shared (slow) DRAM. That’s not a lot, so does it make sense to use dynamic memory allocation at all?

I must admit, dynamically slicing and dicing 32kb sounds like a bad idea. But let’s put the performance considerations aside and focus on the portability aspect of modern software.

The Parallella is a bit of a hybrid that sits somewhere in the middle of the embedded-to-HPC spectrum, which gives it a wide range of use cases. Given the things we can do with the board, I feel static/manual memory management is not enough.
For example, if we were to use Epiphany as a compression/decompression accelerator, we would likely reach for well-established libraries such as zlib, bzip2, zstd, etc., instead of writing our own. It’s fair to assume the library of our choosing:

  • calls malloc somewhere in the source code
  • likely has fine-grained control over the buffer sizes etc. (so it’ll fit in 32kb)

But what if Parallella doesn’t provide the malloc call? Our attempt fails… The code won’t even compile, and most users will get frustrated, thinking, "why can’t I do such a simple thing?".

Quoting myself:

I think a lack of reliable malloc is the worst mistake Adapteva made and could be the turning point in the platform adoption.

IMHO, a reliable malloc present at the release date would have changed history, and today the platform could be thriving. Not because dynamic allocation on Epiphany is a particularly sensible thing, but because it provides a portability layer for existing software.

malloc doesn’t work?

Straighten up your story, dude!

doesn’t work != doesn’t exist

Well, the Parallella SDK comes with a libc (newlib) that provides most of the standard library calls, such as malloc, free, etc.

But here is the $100 question: how does it work on Epiphany? Will it execute a system call? Call mmap? What memory range is returned?… Wait! There is no "kernel" or virtual memory present on the chip…

A quick look at the source reveals where the heap is likely to be allocated:

void *
_sbrk (int  incr)
{
    extern char __heap_start;//set by linker
    extern char __heap_end;//set by linker
     ....
}    /* _sbrk () */
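For illustration, here is a sketch of how a newlib-style _sbrk typically fills in the elided part: a bump pointer moving between the two linker symbols. The arrays below are local stand-ins for __heap_start/__heap_end so the sketch is self-contained and testable; on Epiphany the real symbols come from the linker script.

```c
#include <stddef.h>

/* Stand-ins for the linker-provided symbols (assumption: the real
 * ones are placed by fast.ldf, as shown below). */
static char fake_heap[4096];
static char *heap_start = fake_heap;
static char *heap_end   = fake_heap + sizeof fake_heap;

static char *heap_ptr;  /* current program break, NULL until first use */

/* Newlib-style sbrk: move the break by incr bytes and return the old
 * break; fail with (void *)-1 when the heap region is exhausted. */
void *my_sbrk(int incr)
{
    if (heap_ptr == NULL)
        heap_ptr = heap_start;
    if (incr > heap_end - heap_ptr)
        return (void *) -1;         /* out of heap */
    char *prev = heap_ptr;
    heap_ptr += incr;
    return prev;                    /* old break = start of new block */
}
```

malloc then carves its chunks out of whatever region sbrk hands back, which is why the placement of __heap_start decides everything that follows.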

A second look at the linker script (fast.ldf) gives us a definitive answer:

EXTERNAL_DRAM_0 (WXAI)   : ORIGIN = 0x8e000000, LENGTH = 0x01000000 /* .text, data, rodata, bss and .stack  */
EXTERNAL_DRAM_1 (WXAI)   : ORIGIN = 0x8f000000, LENGTH = 0x01000000 /* .heap                                */

_HEAP_SIZE_FOR_CORE_ = 512K;
  PROVIDE (__heap_start = ORIGIN(EXTERNAL_DRAM_1) + _HEAP_SIZE_FOR_CORE_ * _CORE_NUM_);
  PROVIDE (___heap_start = __heap_start);

Reading the above:

  • the heap is placed in DRAM (0x8e000000 - 0x8fffffff), precisely in the upper half starting at 0x8f000000
  • the heap size is 512kb per core
  • the _CORE_NUM_ suggests the heap address is different for each core

The last point got my attention: a separate heap for each core is totally the right approach, but we compile and link only once, right? Reading further:

_FIRST_CORE_ROW_     = 32;
_FIRST_CORE_COL_     = 8;
_CORE_ROW_ = DEFINED(_CORE_ROW_) ? _CORE_ROW_ : _FIRST_CORE_ROW_;
_CORE_COL_ = DEFINED(_CORE_COL_) ? _CORE_COL_ : _FIRST_CORE_COL_;

_CORE_NUM_ = (_CORE_ROW_ - _FIRST_CORE_ROW_) * _NUM_COLS_IN_CHIP_ + (_CORE_COL_ - _FIRST_CORE_COL_);

As you can see, the default _CORE_ROW_ is equal to _FIRST_CORE_ROW_ unless we set it explicitly. This implies that every core gets the same __heap_start unless we compile and load the program 16 times (once for each eCore).
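Plugging numbers into the formula shows the intent. Assuming an Epiphany-III layout (4 columns per chip, first core at row 32, column 8), the core at (33, 9) would get core number 5 and therefore its own 512kb slice of EXTERNAL_DRAM_1:

```c
#include <assert.h>

/* Constants lifted from the linker script; NUM_COLS = 4 is an
 * assumption matching the 4x4 Epiphany-III chip. */
enum { FIRST_ROW = 32, FIRST_COL = 8, NUM_COLS = 4 };
enum { HEAP_SLICE = 512 * 1024 };               /* _HEAP_SIZE_FOR_CORE_ */
#define DRAM1_ORIGIN 0x8f000000u                /* EXTERNAL_DRAM_1      */

/* Same arithmetic as the linker script's _CORE_NUM_. */
static int core_num(int row, int col)
{
    return (row - FIRST_ROW) * NUM_COLS + (col - FIRST_COL);
}

/* Per-core heap base, as PROVIDE(__heap_start = ...) computes it. */
static unsigned heap_start_for(int row, int col)
{
    return DRAM1_ORIGIN + (unsigned) core_num(row, col) * HEAP_SLICE;
}
```

With the defaults, core_num is always 0, so every core computes the very same base address, which is exactly the bug described above.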

Just to be sure, we can check the address on each core by running this snippet:

extern char __heap_start; /* set by the linker */

/* write the heap address to a fixed local-memory spot
   so it can be read back from the host */
volatile int* debug = (int*) 0x7036;
*debug = (int) &__heap_start;

Yup, no matter which eCore we run on, the heap starts at the same address: 0x8f000000. Boo!

Potential improvements

Okay, we know it’s broken and we know why. But can we fix it? Yes, we can!

Option 1: Update loader

We could fix the existing implementation by tweaking the loader, I think. In a nutshell, the loader sets up the program in each eCore’s memory, so we could use it to update the __heap_* symbols on the fly.

The clear advantage of allocating the heap in DRAM is the "large" memory space. The drawback is speed, so if we are looking for a faster solution, we should consider Option 2.

Option 2: Custom allocator

This is the easiest option, and it doesn’t require any changes to the SDK. We can simply use an alternative memory allocator and place the heap in the local memory bank (32kb), which results in much faster memory access.

I managed to successfully set up umm_malloc on the local memory. It is indeed faster, but this alternative brings other challenges.

Here is how to "override" stdlib functions:

#include "umm_malloc/src/umm_malloc.h"

#define malloc umm_malloc
#define calloc umm_calloc
#define free   umm_free

/* note: umm_init() must run once before the first allocation */
int* test = (int*) malloc(4 * sizeof(int));

Before we go any further, let’s take a moment to recap addressing schema:

  • 0x0 - 0x7fff = 32kb of local memory (split into 4 banks)
  • 0x7fff is the top of the stack (__stack_start_); the stack grows downwards
  • 0x0 is the bottom address, and the program code might be placed at the low addresses

In case you’re wondering where those values come from:

e-readelf -a e_binary.elf | grep main
94: 00000130   146 FUNC    GLOBAL DEFAULT   10 main

Hopefully, you have guessed why we are looking at addressing again. If the selected linker script places the IVT, .text, .bss, etc. segments in the eCore’s local memory, we need to carefully adjust UMM_HEAP_ADDR and UMM_HEAP_SIZE to avoid stepping on already-occupied memory.
In addition to that, every malloc call needs to be overridden by the preprocessor macro; otherwise the newlib malloc symbols will be linked and used instead of the allocator of our choosing.
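To see why the macro matters, here is a portable demonstration of the same trick; my_malloc is a hypothetical stand-in for umm_malloc that just counts how many calls were redirected away from libc:

```c
#include <stddef.h>
#include <stdlib.h>

static int redirected_calls = 0;

/* Hypothetical stand-in for umm_malloc: count the call, then
 * delegate to the real allocator (purely for the demo). */
static void *my_malloc(size_t n)
{
    redirected_calls++;
    return malloc(n);   /* "malloc" here is still the libc symbol */
}

/* From this line on, every textual "malloc" resolves to my_malloc. */
#define malloc my_malloc

/* Code compiled after the #define is redirected... */
static void *grab(size_t n) { return malloc(n); }
```

Any translation unit compiled without the #define (or any library precompiled against libc) still calls the newlib symbol, which is exactly why a few stray malloc calls can silently land on the wrong heap.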

This is the configuration I have used, but note that the boundaries might differ as your program’s size or placement changes.

-DHEAP_ADDR=0x1000 -DHEAP_SIZE=0x5000

Option 3: tcmalloc

To amortize memory access costs, we could perform small allocations in the local memory and spill over to DRAM only when needed. In fact, this is, in a nutshell, how tcmalloc (Thread-Caching Malloc) works.

TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.

The implementation would have to be tuned to Epiphany’s needs (e.g., do we keep the GC?), but the idea of local caching would stay, as illustrated below:

0x7fff +---> +----------+   +---------------------+ <---+  0x8fffffff
             |  stack   |   |                     |
0x5000 +---> +----+-----+   |                     |
             |    |     |   |                     |
             |    v     |   |                     |
             |          |   |                     |
             |          |   |                     |
             |  LOCAL   |   |                     |
             |  CACHE   |   |                     |
             |          |   |                     |
             |          |   |      DRAM HEAP      |
             |    ^     |   |       CENTRAL       |
             |    |     |   |                     |
0x1000 +---> +----+-----+   |                     |
             |   heap   |   |                     |
             +----------+   |                     |
             |   bss    |   |                     |
             +----------+   |                     |
             |   data   |   |                     |
             +----------+   |                     |
             |   text   |   |                     |
0x0000 +---> +----------+   +---------------------+ <---+ 0x8e000000

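A minimal sketch of that spill-over idea, with two static arenas standing in for the local bank and the DRAM heap (the sizes and the simple bump strategy are illustrative; they are not the real Epiphany layout or tcmalloc’s design):

```c
#include <stddef.h>

/* Stand-ins: a small "local" arena and a large "DRAM" arena. On
 * Epiphany these would live around 0x1000 and 0x8f000000. */
static char local_arena[256];
static char dram_arena[4096];
static size_t local_used, dram_used;

/* Bump-allocate from the fast arena first; spill to the slow one
 * only when the fast arena cannot satisfy the request. */
void *two_level_alloc(size_t n)
{
    n = (n + 3) & ~(size_t) 3;               /* 4-byte alignment */
    if (local_used + n <= sizeof local_arena) {
        void *p = local_arena + local_used;
        local_used += n;
        return p;                            /* fast path: local bank */
    }
    if (dram_used + n <= sizeof dram_arena) {
        void *p = dram_arena + dram_used;
        dram_used += n;
        return p;                            /* slow path: DRAM spill */
    }
    return NULL;                             /* both arenas exhausted */
}
```

A real implementation would also need free lists and a policy for migrating blocks back to the local cache, which is where the tuning questions above come in.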
One day I'll work on the implementation, but for now I'll leave the idea here. Has anyone else played with malloc on Epiphany?

Summary

Without further ado: Adapteva could have done a better job on the Epiphany libc port.

I am convinced that a robust libc with a spotless malloc function would attract more developers and ease development by an order of magnitude.

The worst part is not the lack of documentation, nor the "untold development story," but the fact that we are given a function that should work but doesn't. Troubleshooting such low-level issues across systems (Epiphany is a separate chip connected via FPGA) is not trivial and requires a lot of patience.

I've put substantial effort into solving all the "silent failures" caused by stepping on the wrong memory region. It's been fun, and I am happy to dive into compilers and linkers even more, but I can imagine other users' frustration.

In the end, I'd rather have a compile-time failure with the message "sbrk unimplemented" than undefined behavior.

See other posts!