# Parallella (part 7): Application


"I have parallella board… gathering dust somewhere on the shelf" – that’s a comment I found while scrolling through a random Reddit post. Who hasn’t been there? How many of your Raspberry Pis are left forgotten?

Making use of hardware is much harder than owning it, especially when we speak of uniquely designed SBCs such as the Parallella. Why? Because the problems best solved with parallel computing are hard, for example:

  • imaging (e.g., face detection, object tracking, video analytics) – lots of math and academic research!
  • robotics (e.g., space electronics, multi-sensor inertial navigation) – lots of physics, duh!
  • speech (e.g., real-time speech recognition, real-time translation) – lots of signal theory!
  • medical (e.g., ultrasound, DNA sequencing) – you know programming for sure, but do you know biology?!

Okay, we agree the problem domain is hard, so let’s make it harder by adding other physical constraints such as data transfer/access costs, low available memory, or limited power consumption. So how can an intermediate-level developer write a "useful" application?
Well, truth be told, I think it’s going to be a difficult task, but there are a few open-source examples anyone can look up to get started…

In this post, we’ll write an application that I think does something useful, yet isn’t too difficult to get started with. Treat this as the gist of an idea that you can build on top of if you want!


Before we start thinking about possible use cases, we should understand the capabilities and, more importantly, the limitations of the hardware we work with. It’s worth mentioning that communication with the Epiphany coprocessor goes through the eLink device "rendered" in the FPGA’s programmable logic (PL). As a result, whenever we need to process data on the chip, it has to be transferred there first. The available options are:

  • 32 KB of local SRAM per eCore
  • 32 MB of external shared DRAM

To avoid wasting precious CPU cycles on data transfer, we should use the DMA controller. As you can see below, it does make a difference:

[root@parallella2 e-bandwidth-test]# ./bin/e-bandwidth-test-host.elf 
e-bandwidth-test-host.elf: e_init(): No Hardware Definition File (HDF) is specified. Trying "platform.hdf".
ARM Host    --> eCore(0,0) write speed       =   52.61 MB/s
ARM Host    --> eCore(0,0) read speed        =    8.10 MB/s
ARM Host    --> ERAM write speed             =   86.87 MB/s
ARM Host    <-- ERAM read speed              =  132.12 MB/s
ARM Host    <-> DRAM: Copy speed             =  345.29 MB/s
eCore (0,0) --> eCore(1,0) write speed (DMA) = 1271.70 MB/s
eCore (0,0) <-- eCore(1,0) read speed (DMA)  =  333.72 MB/s
eCore (0,0) --> ERAM write speed (DMA)       =  238.54 MB/s
eCore (0,0) <-- ERAM read speed (DMA)        =  161.10 MB/s

As we design our application, we have to be aware of:

  • data locality – is it faster to split the data into small chunks (32 KB) and copy them rapidly, or to load large blocks into shared memory?
  • data type – data is often relational. For instance, we might need pieces A, B, and D to process block E. Can our data be distributed, and will it fit on one node?
  • overall transfer costs

Given the benchmark results, I would say the Parallella board is the best fit for CPU-bound applications with a small data footprint.


I have been thinking about a real-life problem that doesn’t require very specific domain knowledge, is CPU-intensive, and doesn’t depend on too much data… Scheduler!

Imagine you’re running a Kubernetes cluster with hundreds of tasks/jobs and machines. A scheduler needs to allocate tasks to specific hosts, taking multiple factors into account, for example:

  • resource usage (RAM, HDD, CPU time, network)
  • location (region, pod)
  • exclusions (task A can’t be colocated with task B on the same machine)
  • role-awareness (one replica master per pod)
  • fairness (don’t overload or starve machines)
  • availability (moving tasks too frequently lowers service availability)

There are many ways we can implement the scheduling algorithm (i.e., this is roughly a standard bin-packing problem), but first we need to decide what our priority is. Do we want the most correct solution, or should the algorithm be very responsive (i.e., when a host goes down, find a replacement quickly)? Or maybe a compromise?
In any case, this is a CPU-bound problem that can be nicely partitioned across many cores.

The example code is here: https://github.com/mkaczanowski/epiphany-scheduler-example


In brief, the application works as follows:

  1. allocate (in shared RAM) 300 services with 100 tasks and 50 hosts per service
  2. iterate through the services and dispatch each one for "processing" to the first idle eCore
  3. in the meantime, randomly mutate the state to simulate real-life conditions, for example a "host going down" event
  4. on the eCore side, use the worst-fit algorithm to assign tasks to hosts
  5. write the updated service state back to shared memory and read it on the host side

As you can see, there is nothing extraordinary here… Yet we’ve built a highly parallelizable, low-memory application that processes multiple services independently to provide a better response time than a program running on the host ARM CPU with a bunch of pthreads.


First, we’ll define three entities:

  • service – a group of tasks and hosts
  • task – represents a job/task with basic constraints defined (i.e., required RAM)
  • host – represents a machine with basic resources defined (i.e., available RAM)
typedef struct {
    int id;
    int max_ram;
    int curr_ram;
} __attribute__((__packed__)) host_t;

typedef struct {
    int id;
    int reserved_ram;
    int allocated_host_id;
} __attribute__((__packed__)) task_t;

typedef struct {
    int id;

    host_t host_list[HOSTS_PER_SERVICE];
    task_t task_list[TASKS_PER_SERVICE];
} __attribute__((__packed__)) service_t;

It’s important to use the packed attribute to avoid padding. Otherwise, e_write / e_read calls will give you invalid data!

Initialize data

First, we need to allocate a chunk of shared memory big enough to hold 300 services:

const unsigned total_iterations = 10000;
const unsigned total_services = 300;
const unsigned shm_size = sizeof(service_t) * total_services;

rc = e_shm_alloc(&mbuf, shm_name, shm_size);
if (rc != E_OK)
    rc = e_shm_attach(&mbuf, shm_name);

The e_shm_alloc call comes from the eSDK’s host library and is a convenient method for allocating shared memory segments. Once that’s done, we need to initialize the chunk with the actual structures:

void initialize_services(e_mem_t* mbuf, int total_services) {
    for (int i = 0; i < total_services; i++) {
        service_t service = { .id = i };

        for (int j = 0; j < HOSTS_PER_SERVICE; j++) { ... }
        for (int j = 0; j < TASKS_PER_SERVICE; j++) { ... }

        e_write(mbuf, 0, 0, sizeof(service_t) * i, (void*) &service, sizeof(service_t));
    }
}

Free eCore?

The question is, how do we let an eCore know which service to pick up for processing? We could use some fancy queue or special shared memory state, but for simplicity, we write a service_id to the eCore’s local memory. The device program loops forever, waiting for the service_id != -1 condition to occur.

int find_free_cpu(e_epiphany_t* dev, int rows, int cols) {
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            int service_id;
            e_read(dev, i, j, 0x7000, &service_id, sizeof(int));
            if (service_id == -1) {
                return (int) ((i << 3) | j);
            }
        }
    }

    return -1;
}

Here we read the service_id local variable placed at eCore address 0x7000:

e_read(dev, i, j, 0x7000, &service_id, sizeof(int));

And because in C it’s easier to return one value instead of two, we encode both coordinates in a single integer:

return (int) ((i << 3) | j); // shift the row into the higher bits and store the col in the lower bits

Host side

On the host side we need to send services for processing to available eCores:

while (1) {
    int service_to_be_updated = counter % total_services;
    int free_cpu = find_free_cpu(&dev, platform.rows, platform.cols);

    if (free_cpu == -1) continue;

    int row = (free_cpu >> 3) & 0x7;
    int col = free_cpu & 0x7;

    // simulate 1 host going down
    simulate_hosts_ram_change(&mbuf, service_to_be_updated, 1, 0);

    // dispatch the service
    e_write(&dev, row, col, 0x7000, &service_to_be_updated, sizeof(int));

    counter++;
    if (counter > total_iterations) break;
}

Device side

On the device side (the Epiphany chip) we do the analogous thing and read the scheduled service_id:

    while (1) {
        // wait for a new service id to process
        if (*service_id == -1) continue;

        // read the service state
        service_t service;
        e_read((void*) &emem, (void*) &service, (*service_id) * sizeof(service_t), 0, 0, sizeof(service_t));

Once we’ve loaded the service into local memory, we can process it with the algorithm:

        // sort hosts by amount of available ram

        // allocate a task to a host (worst fit)
        for (int i = 0; i < TASKS_PER_SERVICE; i++) {
            int j;
            for (j = 0; j < HOSTS_PER_SERVICE; j++) {
                if (service.host_list[j].curr_ram >= service.task_list[i].reserved_ram) {
                    service.host_list[j].curr_ram -= service.task_list[i].reserved_ram;
                    service.task_list[i].allocated_host_id = service.host_list[j].id;
                    break;
                }
            }

            if (j == HOSTS_PER_SERVICE) {
                service.task_list[i].allocated_host_id = -1;
            }
        }

Finally, we write the updated service back to shared memory:

e_write((void*)&emem, (void*) &service, 0, 0, (*service_id) * sizeof(service_t), sizeof(service_t));


In the end, we can check whether our algorithm works as expected. As we see below, tasks are correctly allocated to hosts, respecting the resource limits, but unfortunately the algorithm doesn’t spread them out nicely. We could improve that in the future:

[root@parallella2 hello-world]# ./run.sh | column -t -s,
0        0               512       0         1024
1        0               512       0         1024
2        1               512       0         1024
3        1               512       0         1024
24       1               512       1024      1024

In this example program, we cheat a lot… For instance, in real life you would use dynamically allocated memory, but instead we use statically sized arrays. This is one of many shortcuts present in the code, so please treat it merely as a gist!

See other posts!