# Parallella (part 5): elibs

Intro

In previous articles, such as Kernel or ISA, we saw how complicated interacting with the Epiphany chip can be: a whole bunch of registers, memory banks, and a funky addressing scheme. It’s a lot to know for someone who just wants to run a "hello world" program… If only we had some C library that would hide all this complex logic from us!

Thankfully, the Epiphany SDK provides the e-lib shim C library, which simplifies the development workflow by an order of magnitude. What does it do? Among many things, it provides:

  • hardware abstraction functions (e.g. program loader, data transfer, system control)
  • register and interrupt access functions
  • timer access functions
  • DMA management
  • locking primitives

A stable C API is a commonly used pattern in the embedded world. Why C? While there are many reasons, the most notable one is code reusability via a foreign function interface (FFI). Higher-level languages, e.g. Python, often leverage FFI to call into C/C++ libraries rather than (re)writing the logic again and again.

In this article, we’ll touch base with the e-lib library, selectively looking into its functions.

Loader

One might ask, how do we execute a program on the Epiphany chip? The compiler spits out the executable file, and then what? We need to somehow configure an eCore to execute the compiled code, but how?
Previously we looked into the compilation process, but we have never taken a dive into the binary. In fact, understanding the ELF binary format gives us a lot of clues about what the loading process is, so buckle up, here comes ELF!

ELF

"Executable and Linkable Format" is a common standard file format for executable files, object code, shared libraries, and core dumps. Unlike many proprietary executable file formats, ELF is very flexible and extensible, and it is not bound to any particular processor or Instruction set architecture. This has allowed it to be adopted by many different operating systems on many different platforms.

In short, when an OS tries to load a program into memory, it needs to understand where to place different sections, such as program code, constants, functions, etc. ELF defines a way of organizing the program data so that the OS can load it into memory, find the first function (usually main) and execute it.

If you followed the series, you surely remember the linker scripts that came in three flavors: legacy, fast, and internal. Each flavor defines where the program code should be placed, i.e. in internal (SRAM) or external (DRAM) memory. But where does the linker store that information? Yup, it must be stored as some kind of metadata along with the actual program…

File layout

Each ELF file is made up of one ELF header, followed by file data. The data can include:

  • Program header table
  • Section header table
  • Data referred to by entries in the program header table or section header table

Epiphany ELF

We are going to use the e_prime example program and look at its ELF header and section table.

$ e-gcc -g -O2 -T internal.ldf src/isprime.c src/e_prime.c -o e_prime.elf -le-lib -lm
$ e-readelf -a e_prime.elf

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] ivt_reset         PROGBITS        00000000 008000 000004 00  AX  0   0  1
  [ 2] workgroup_cfg     PROGBITS        00000028 008028 000028 00  WA  0   0  8
  [ 3] ext_mem_cfg       PROGBITS        00000050 008050 000008 00  WA  0   0  8
  [ 4] loader_cfg        PROGBITS        00000058 008058 000010 00  WA  0   0  8
  [ 5] .reserved_crt0    PROGBITS        00000100 008100 00000c 00  AX  0   0  4
  [ 6] NEW_LIB_RO        PROGBITS        0000010c 00810c 0005a0 00  AX  0   0  4
  [ 7] NEW_LIB_WR        PROGBITS        000006b0 0086b0 000458 00  WA  0   0  8
  [ 8] GNU_C_BUILTIN_LIB PROGBITS        00000b08 008b08 0011b4 00  AX  0   0  8
  [ 9] .init             PROGBITS        00001cbc 009cbc 000024 00  AX  0   0  2
  [10] .text             PROGBITS        00001ce0 009ce0 0005bc 00  AX  0   0  8
  [15] .data             PROGBITS        000022cc 00a2cc 000018 00  WA  0   0  4
  ...

Symbol table '.symtab' contains 150 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND 
   136: 00007ff0     0 NOTYPE  GLOBAL DEFAULT   28 __stack

Okayyy, we see some unexpected sections: ivt_reset, workgroup_cfg, ext_mem_cfg, loader_cfg. Where do those come from?

IVT_RESET

The first 40 bytes of local memory (0x0 - 0x28) are reserved for the IVT (Interrupt Vector Table), so that space needs to remain reserved and untouched by our program. If you don’t know what interrupts are, don’t worry, we cover them in the next article. For now, it’s enough to remember how the IVT is defined and where it’s placed in the eCore local memory.

This is the relevant part of the linker script:

ivt_reset               0x00 : {*.o(IVT_RESET)}                    > IVT_RAM
ivt_software_exception  0x04 : {*.o(ivt_entry_software_exception)} > IVT_RAM
...
ivt_user                0x24 : {*.o(ivt_entry_user)}               > IVT_RAM

WORKGROUP_CFG

The workgroup is defined in terms of coordinates relative to the platform’s effective chip area and can be as small as a single core or as large as the whole available effective chip. This way we can load multiple programs onto different workgroups, so that the chip area becomes a multi-tenant environment for programs.

But how does your program know which workgroup it belongs to? As you can imagine, this information must somehow be passed from the host side (ARM) to the device (Epiphany) at load time. In fact, the loader fills that section of memory with the struct:

typedef struct {
    e_objtype_t  objtype;            // 0x28
    e_chiptype_t chiptype;           // 0x2c
    e_coreid_t   group_id;           // 0x30
    unsigned     group_row;          // 0x34
    unsigned     group_col;          // 0x38
    unsigned     group_rows;         // 0x3c
    unsigned     group_cols;         // 0x40
    unsigned     core_row;           // 0x44
    unsigned     core_col;           // 0x48
    unsigned     alignment_padding;  // 0x4c
} e_group_config_t;

On the library side we read the structure via:

#define SECTION(x)  __attribute__ ((section (x)))
e_group_config_t const e_group_config SECTION("workgroup_cfg");

The workgroup is an essential structure for a few reasons:

  • provides locality information, such as "who am I", "where am I" or "what are my closest neighbours"
  • provides an addressing reference
  • can be used for synchronisation methods, e.g. barriers
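
As a rough illustration of the "who am I" and addressing points, the sketch below derives a core ID from absolute coordinates, builds a global address, and finds a neighbour inside a workgroup. The `row * 64 + col` formula is the same one the host program at the end of this article uses; the 12-bit core ID in the upper bits of a 32-bit address is my reading of the Epiphany address map, and all function names here are hypothetical, not e-lib API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 12-bit core ID = 6-bit row followed by 6-bit column. */
uint32_t coreid_from_coords(uint32_t row, uint32_t col)
{
    assert(row < 64 && col < 64);
    return (row << 6) | col;          /* same as row * 64 + col */
}

void coords_from_coreid(uint32_t coreid, uint32_t *row, uint32_t *col)
{
    *row = (coreid >> 6) & 0x3f;
    *col = coreid & 0x3f;
}

/* Global address of a byte in a core's local memory: the upper 12 bits
 * select the core, the lower 20 bits are the offset in local memory. */
uint32_t global_address(uint32_t coreid, uint32_t local_offset)
{
    assert(local_offset < (1u << 20));
    return (coreid << 20) | local_offset;
}

/* Core ID of the neighbour to the east inside a workgroup, wrapping
 * around at the right edge of the group. group_row/group_col are the
 * absolute coordinates of the group origin, core_row/core_col are
 * relative to that origin (as in e_group_config_t). */
uint32_t east_neighbour(uint32_t group_row, uint32_t group_col,
                        uint32_t group_cols,
                        uint32_t core_row, uint32_t core_col)
{
    uint32_t next_col = (core_col + 1) % group_cols;
    return coreid_from_coords(group_row + core_row, group_col + next_col);
}
```

With a group origin of (32, 8), `coreid_from_coords(32, 8)` gives 0x808, and `global_address(0x808, 0x2000)` gives 0x80802000, i.e. offset 0x2000 in that core’s local memory as seen from any other core.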

EXT_MEM_CFG

Depending on the linker script, we place the program in either external or internal memory. We need to allocate a piece of memory per core and pass it over to the program for all sorts of read/write operations. Thus the struct:

typedef struct {
    e_objtype_t objtype;            // 0x50
    unsigned    base;               // 0x54
} e_emem_config_t;

Quiz time! What is the e_emem_config.base value assuming we use legacy.ldf? No idea? Let’s use e-readelf again:

  [ 9] .text             PROGBITS        8e001b50 011b50 0004dc 00  AX  0   0  8

0x8e001b50 looks familiar: it falls into the 0x8e000000 - 0x8fffffff shared memory range we defined while writing the kernel module. Everything works as expected, the code is placed in the shared DRAM. Now let’s take a look at a counterexample, fast.ldf:

  [10] .text             PROGBITS        00001c60 009c60 0004dc 00  AX  0   0  8

Right, now we see the program stored in local memory at address 0x00001c60.
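
The distinction between the two .text addresses can be sketched as a tiny classifier. The shared DRAM range is the one quoted above; treating everything below 1 MB as core-local is an assumption based on a 20-bit local address space, and the names are invented for this example.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

enum mem_region { MEM_CORE_LOCAL, MEM_SHARED_DRAM, MEM_OTHER };

#define LOCAL_SPACE_SIZE  0x00100000u   /* assumed: 20-bit local address space */
#define SHARED_DRAM_BASE  0x8e000000u   /* range quoted in the text above      */
#define SHARED_DRAM_END   0x8fffffffu

static_assert(SHARED_DRAM_BASE >= LOCAL_SPACE_SIZE, "regions must not overlap");

/* Tells where an address taken from the section table ends up. */
enum mem_region classify_address(uint32_t addr)
{
    if (addr < LOCAL_SPACE_SIZE)
        return MEM_CORE_LOCAL;
    if (addr >= SHARED_DRAM_BASE && addr <= SHARED_DRAM_END)
        return MEM_SHARED_DRAM;
    return MEM_OTHER;   /* e.g. another core's memory via a global address */
}
```

Fed with the two .text addresses from the e-readelf listings, 0x8e001b50 lands in the shared DRAM bucket and 0x00001c60 in the core-local one.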

LOADER_CFG

Nothing to see here, it’s just a structure used to pass some loader-specific flags:

struct loader_cfg {
    uint32_t flags;
    uint32_t __pad1;
    uint32_t args_ptr;
    uint32_t __pad2;
} __attribute__((packed));
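
A quick way to convince yourself that the three structs really line up with the section table is to let the compiler check the arithmetic. The sketch below uses stand-in structs (the `*_mirror` names are hypothetical) and assumes every field, including the enum-typed ones, is 32 bits wide; the addresses and sizes are copied from the e-readelf output above.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical mirrors of the SDK structs, assuming 32-bit fields. */
struct group_cfg_mirror {
    uint32_t objtype, chiptype, group_id;
    uint32_t group_row, group_col, group_rows, group_cols;
    uint32_t core_row, core_col, alignment_padding;
};
struct emem_cfg_mirror   { uint32_t objtype, base; };
struct loader_cfg_mirror { uint32_t flags, pad1, args_ptr, pad2; };

/* Section start addresses taken from the section header listing. */
#define WORKGROUP_CFG_ADDR 0x28u
#define EXT_MEM_CFG_ADDR   0x50u
#define LOADER_CFG_ADDR    0x58u

/* Struct sizes must match the "Size" column of the section table. */
static_assert(sizeof(struct group_cfg_mirror)  == 0x28, "workgroup_cfg size");
static_assert(sizeof(struct emem_cfg_mirror)   == 0x08, "ext_mem_cfg size");
static_assert(sizeof(struct loader_cfg_mirror) == 0x10, "loader_cfg size");

/* Local-memory address of a field = section address + offset in struct. */
#define GROUP_FIELD_ADDR(field) \
    (WORKGROUP_CFG_ADDR + (uint32_t) offsetof(struct group_cfg_mirror, field))

static_assert(GROUP_FIELD_ADDR(core_row) == 0x44, "matches the struct comment");
static_assert(WORKGROUP_CFG_ADDR + sizeof(struct group_cfg_mirror)
              == EXT_MEM_CFG_ADDR, "sections sit back to back");
```

If any of these assumptions were off, e.g. an enum compiled to a different width, the file would simply fail to compile, which is a cheap guard when host and device code share a layout.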

Stack

In the symbol table we find the stack entry:

   136: 00007ff0     0 NOTYPE  GLOBAL DEFAULT   28 __stack

Since the linker decides where the stack is placed, it’s worthwhile to understand where this address comes from.
In particular, we have 32 KB of local memory per eCore, and we know that the stack usually grows downwards, so the stack must be placed at the top of the local address space:

0x8000 (32 KB) - 0x10 (16) = 0x7ff0

However, we need to remember that the lower parts of the local memory are occupied by the IVT, program code and extra sections, so the stack is effectively smaller than 32 KB and we need to watch out not to overrun it.
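
The same arithmetic, plus an estimate of how much room the stack actually has, can be written down as a small sketch. The 0x10 gap at the top is taken from the calculation above; `stack_headroom` is a hypothetical helper, and the value it returns is only an upper bound because anything else placed above the program image would share that space.

```c
#include <assert.h>
#include <stdint.h>

#define SRAM_SIZE     0x8000u   /* 32 KB of local memory per eCore        */
#define STACK_RESERVE 0x10u     /* gap left at the very top of the memory */

/* Initial stack pointer: just below the reserved bytes at the top. */
uint32_t stack_top(void)
{
    return SRAM_SIZE - STACK_RESERVE;
}

/* Upper bound on the stack space, given the first free address after
 * the program image (the end of the last loaded segment). */
uint32_t stack_headroom(uint32_t image_end)
{
    assert(image_end <= stack_top());
    return stack_top() - image_end;
}
```

For e_prime.elf built with internal.ldf, the last LOAD segment in the program headers shown later in this article ends at 0x100 + 0x2210 = 0x2310, so `stack_headroom(0x2310)` comes out as 0x5ce0, roughly 23 KB.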

Loading …

Finally! We can leave the theory behind and focus on the loader implementation. To load a test program, e.g. e_prime.elf, onto an eCore we need to (leaving out sanity checks etc.):

  1. read the file and map it to specified memory location
  2. reset eCore registers
  3. reset eCore local memory
  4. parse the ELF header to setup the *_cfg sections
  5. copy file contents into memory location specified in ELF header
  6. run the program

First, we need to load e_prime.elf into memory, so that we can inspect the ELF headers:

fd = open(executable, O_RDONLY);
fstat(fd, &st);
file = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
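
The three calls above skip error handling for brevity. A slightly more defensive variant could look like the sketch below; it uses only POSIX calls, and the wrapper name `map_file_readonly` is made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps a whole file read-only. Returns NULL on failure; on success *size
 * holds the file length and the caller releases it with munmap(). */
const void *map_file_readonly(const char *path, size_t *size)
{
    struct stat st;
    void *addr;
    int fd;

    assert(path != NULL && size != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;            /* stat failed or the file is empty */
    }

    addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (addr == MAP_FAILED)
        return NULL;

    *size = (size_t) st.st_size;
    return addr;
}
```

Closing the descriptor right after mmap is fine because the mapping keeps its own reference to the file.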

The next step is to reset the eCore registers, but first we need to halt the eCore by writing to the DEBUGCMD register:

cmd = 0x1; // 0x1 halts the eCore, 0x0 resumes it
e_write(dev, row, col, E_REG_DEBUGCMD, &cmd, sizeof(int));

Once we have checked the status and are sure the eCore is halted, we start resetting various registers:

// all the DMA0* and DMA1* registers
ee_write_reg(dev, row, col, E_REG_<REGISTER>, 0);

// reset timers
ee_write_reg(dev, row, col, E_REG_CONFIG, 0);

// clear interrupt related registers
ee_write_reg(dev, row, col, E_REG_ILATCL, ~0); // ILAT records all interrupt events; writing to ILATCL clears them
ee_write_reg(dev, row, col, E_REG_IMASK, 0);
ee_write_reg(dev, row, col, E_REG_IRET, 0x2c); // clear_ipend (see below)
ee_write_reg(dev, row, col, E_REG_PC, 0x2c); // clear_ipend

And now… magic:

uint8_t soft_reset_payload[] = {
    0xe8, 0x16, 0x00, 0x00, 0xe8, 0x14, 0x00, 0x00, 0xe8, 0x12, 0x00, 0x00,
    0xe8, 0x10, 0x00, 0x00, 0xe8, 0x0e, 0x00, 0x00, 0xe8, 0x0c, 0x00, 0x00,
    0xe8, 0x0a, 0x00, 0x00, 0xe8, 0x08, 0x00, 0x00, 0xe8, 0x06, 0x00, 0x00,
    0xe8, 0x04, 0x00, 0x00, 0xe8, 0x02, 0x00, 0x00, 0x1f, 0x15, 0x02, 0x04,
    0x7a, 0x00, 0x00, 0x03, 0xd2, 0x01, 0xe0, 0xfb, 0x92, 0x01, 0xb2, 0x01,
    0xe0, 0xfe
};

ee_write_buf(dev, row, col, 0, soft_reset_payload, sizeof(soft_reset_payload));

One thing to notice about that snippet: it writes 62 bytes starting at address 0x0. As we learned before, IVT entries occupy the first 40 bytes, so the payload must "initialize" that space.
The bytes are the machine-code representation of the following assembly:

ivt:
     0:    b.l     clear_ipend
     4:    b.l     clear_ipend
     8:    b.l     clear_ipend
     c:    b.l     clear_ipend
    10:    b.l     clear_ipend
    14:    b.l     clear_ipend
    18:    b.l     clear_ipend
    1c:    b.l     clear_ipend
    20:    b.l     clear_ipend
    24:    b.l     clear_ipend
    28:    b.l     clear_ipend
clear_ipend:
    2c:    movfs   r0, ipend
    30:    orr     r0, r0, r0
    32:    beq     1f
    34:    rti
    36:    b       clear_ipend
1:
    38:    gie
    3a:    idle
    3c:    b       1b

The IPEND is a status register that keeps track of the interrupt service routines currently being processed. The above code sets clear_ipend as the default interrupt handler, and the clear_ipend routine resets the IPEND register with a simple loop. At the end, interrupts are enabled again with gie and the eCore is put into the idle state.
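
If you don’t want to take the byte array on faith, you can decode the branch words yourself. The sketch below assumes the 32-bit b.l encoding that the payload and the listing suggest: the low byte 0xe8 carries the opcode and the "always" condition, and the upper 24 bits are a signed offset counted in 16-bit halfwords relative to the branch itself. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a little-endian 32-bit instruction word from a byte buffer. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Target address of an unconditional 32-bit branch located at pc,
 * under the encoding assumption described above. */
uint32_t branch_target(uint32_t insn, uint32_t pc)
{
    int32_t imm24;

    assert((insn & 0xffu) == 0xe8u);      /* only b.l is handled here */
    imm24 = (int32_t) (insn >> 8);
    if (imm24 & 0x800000)
        imm24 -= 0x1000000;               /* sign-extend 24 bits */
    return pc + (uint32_t) (imm24 * 2);   /* offset is in halfwords */
}

/* Returns 1 if the first `entries` words of buf all branch to `target`. */
int ivt_points_to(const uint8_t *buf, size_t entries, uint32_t target)
{
    size_t i;

    for (i = 0; i < entries; i++)
        if (branch_target(read_le32(buf + 4 * i), (uint32_t) (4 * i)) != target)
            return 0;
    return 1;
}
```

The first entry, bytes e8 16 00 00 at address 0x0, has an offset of 0x16 halfwords (44 bytes) and therefore lands on 0x2c. The last one, e8 02 00 00 at address 0x28, has an offset of 2 halfwords and lands on 0x2c as well, which matches the clear_ipend label in the listing.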

What’s left is to reset the remaining registers:

for (i = E_REG_R0; i <= E_REG_R63; i += 4)
    ee_write_reg(dev, row, col, i, 0);
...

Okay, we are done with clearing registers, so let’s move to the next step, "clear eCore local memory." Unsurprisingly that part is quite simple:

empty = alloca(sram_size);
memset(empty, 0, sram_size);

for (i = row; i < row + rows; i++)
    for (j = col; j < col + cols; j++)
        e_write(dev, i, j, 0, empty, sram_size);

At this point, the eCore is all good and ready for action, but it still doesn’t have any program to run… To "copy a program" we need to read the ELF program headers and copy the specified memory ranges:

Elf32_Ehdr *ehdr;
Elf32_Phdr *phdr;
int        ihdr;
uintptr_t  dst;

uint8_t   *src = (uint8_t *) file;

ehdr = (Elf32_Ehdr *) &src[0];
phdr = (Elf32_Phdr *) &src[ehdr->e_phoff];

for (ihdr = 0; ihdr < ehdr->e_phnum; ihdr++) {
    // core_mem_base is a pointer to local memory "mapped" via mmap call (this assumes internal.ldf)
    dst = ((uintptr_t) *core_mem_base) + phdr[ihdr].p_vaddr;
    memcpy((void *) dst, &src[phdr[ihdr].p_offset], phdr[ihdr].p_filesz);
}

Since the program operates on virtual memory, dst is going to be a virtual address that is mapped to a physical address via the kernel module and the mmap syscall (see the previous article). The data we copy is determined by the ELF program headers, and it’s much easier to visualize once you look at an example:

$ e-readelf -a e_prime.elf

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x008000 0x00000000 0x00000000 0x00004 0x00004 R E 0x8000
  LOAD           0x008028 0x00000028 0x00000028 0x00040 0x00040 RW  0x8000
  LOAD           0x008100 0x00000100 0x00000100 0x02208 0x02210 RWE 0x8000

For the above binary, the memcpy calls could look like this (assuming the base is 0xb6fad000, with src given as an offset into the file):

  1. dst = 0xb6fad000; src = 0x00008000; size = 0x00000004
  2. dst = 0xb6fad028; src = 0x00008028; size = 0x00000040
  3. dst = 0xb6fad100; src = 0x00008100; size = 0x00002208
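
The copy loop shown earlier trusts the ELF headers completely. As a sketch of what a more careful version might do, the function below copies only PT_LOAD segments and refuses headers that point outside the file or outside the destination window. It works on plain buffers, so `mem` merely stands in for the mmap’ed eCore memory; it assumes a Linux `<elf.h>`, assumes program header entries have the size of `Elf32_Phdr`, and the function name is invented.

```c
#include <assert.h>
#include <elf.h>      /* Elf32_Ehdr, Elf32_Phdr, PT_LOAD (Linux header) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies every PT_LOAD segment of an in-memory ELF image into mem.
 * Returns the number of segments copied, or -1 if a header points
 * outside either buffer. */
int copy_load_segments(const uint8_t *file, size_t file_size,
                       uint8_t *mem, size_t mem_size)
{
    Elf32_Ehdr ehdr;
    Elf32_Phdr phdr;
    int copied = 0;
    unsigned i;

    assert(file != NULL && mem != NULL);
    if (file_size < sizeof(ehdr))
        return -1;
    memcpy(&ehdr, file, sizeof(ehdr));

    for (i = 0; i < ehdr.e_phnum; i++) {
        size_t off = (size_t) ehdr.e_phoff + (size_t) i * sizeof(phdr);

        if (off > file_size || file_size - off < sizeof(phdr))
            return -1;                  /* program header outside the file */
        memcpy(&phdr, file + off, sizeof(phdr));

        if (phdr.p_type != PT_LOAD)
            continue;                   /* only loadable segments are copied */
        if (phdr.p_offset > file_size ||
            file_size - phdr.p_offset < phdr.p_filesz)
            return -1;                  /* segment data outside the file */
        if (phdr.p_vaddr > mem_size ||
            mem_size - phdr.p_vaddr < phdr.p_filesz)
            return -1;                  /* destination outside the window */

        memcpy(mem + phdr.p_vaddr, file + phdr.p_offset, phdr.p_filesz);
        copied++;
    }
    return copied;
}
```

Note that only FileSiz bytes are copied. The third segment above has a MemSiz that is 8 bytes larger than its FileSiz; those extra bytes have no data in the file and simply keep the zeros written when we cleared the local memory.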

Now, my dear reader, we can finally execute our program!!! How, you might ask? Simply trigger the SYNC interrupt:

ee_write_reg(dev, row, col, E_REG_ILATST, SYNC);

e-lib functions

The goal of the e-libs library is to abstract the low-level bits away from the user and provide a general workflow, such as:

  • open & initialize the device
  • configure workgroup
  • run program
  • handle memory management

If you got this far, you likely noticed that dealing with memory is a bit complicated and feels like a waste of time if you have to do it on your own. We even used shorthand memory access functions (e.g. e_write, e_read) throughout the Loader section to make things simple(r). So I hope you agree with me that memory management is the most crucial part of the whole library.

Speaking of other memory access functions, we should highlight the DMA (direct memory access) functions that make interacting with the DMA controller a whole lot easier:

  • e_dma_start()
  • e_dma_copy()
  • e_dma_wait()
  • e_dma_busy()
  • e_dma_set_desc()

Without those abstractions, we would have to deal with interrupts, registers, and memory maps manually… sucks! Just to let you know, there are a few other function families available in the SDK, such as:

  • Interrupt Service Functions
  • Timer Functions
  • Mutex and Barrier Functions
  • Core ID and Workgroup functions

We won’t cover those here, but I would encourage you to look into them as homework, since we will use those functions more often in the next blog posts.

hello world

The last thing I want to show you is the most boring example ever, "hello world." Why? Again? Wut?
As we went through all this stuff (hardware, machine code, assembly, compilers, registers, memory, kernel …) you may think that writing applications for the Epiphany CPU must be incredibly hard… and you’d be right, unless you use e-libs. See for yourself:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <e-hal.h>

const unsigned ShmSize = 128;
const char ShmName[] = "hello_shm";
const unsigned SeqLen = 20;

int main(int argc, char *argv[])
{
    unsigned row, col, coreid, i;
    e_platform_t platform;
    e_epiphany_t dev;
    e_mem_t   mbuf;
    int rc;

    srand(1);

    e_init(NULL);
    e_reset_system();
    e_get_platform_info(&platform);

    rc = e_shm_alloc(&mbuf, ShmName, ShmSize);
    if (rc != E_OK)
        rc = e_shm_attach(&mbuf, ShmName);

    for (i=0; i<SeqLen; i++) {
        char buf[ShmSize];

        // Draw a random core
        row = rand() % platform.rows;
        col = rand() % platform.cols;
        coreid = (row + platform.row) * 64 + col + platform.col;
        printf("%3d: Message from eCore 0x%03x (%2d,%2d): ", i, coreid, row, col);

        e_open(&dev, row, col, 1, 1);
        e_reset_group(&dev);

        e_load("e_hello_world.elf", &dev, 0, 0, E_TRUE);

        // Wait for core program execution to finish
        usleep(10000);

        e_read(&mbuf, 0, 0, 0, buf, ShmSize);

        printf("\"%s\"\n", buf);
        e_close(&dev);
    }

    // Release the allocated buffer
    e_shm_release(ShmName);
    e_finalize();

    return 0;
}

On the Epiphany end we load:

coreid = e_get_coreid();
e_coords_from_coreid(coreid, &my_row, &my_col);

if ( E_OK != e_shm_attach(&emem, ShmName) )
    return EXIT_FAILURE;

snprintf(buf, sizeof(buf), Msg, coreid);

if ( emem.size >= strlen(buf) + 1 ) {
    e_write((void*)&emem, buf, my_row, my_col, NULL, strlen(buf) + 1);
} else {
    return EXIT_FAILURE;
}

I think the program doesn’t need a long explanation. The host program simply reads from a shared buffer (shmem) that is populated with a message from each eCore.
As you can see, the whole program is quite compact and easy to write and read, all thanks to e-lib.

Summary

I hope it’s clear to you why shim libraries such as e-libs are crucial for new architectures/devices such as Epiphany. Without them, the barrier to entry for the average developer becomes very high rather than moderate. That lowers the adoption rate by an order of magnitude, which often leads to a lack of clients and to financial issues. Not to mention compatibility issues or hundreds of libraries doing the same one thing.

I am glad Adapteva invested time into providing a solid SDK! I am not a fortune teller, but I can imagine what things would look like if they hadn’t…
