Intro
In previous articles, such as Kernel or ISA, we saw how complicated interacting with the Epiphany chip can be: a whole bunch of registers, memory banks, and a funky addressing scheme. It's a lot to know for someone who just wants to run a "hello world" program… If only we had some C library that would hide all this complex logic from us?!
Thankfully, the Epiphany SDK provides e-lib, a shim C library that simplifies the development workflow by an order of magnitude. What does it do? Among many things, it provides:
- hardware abstraction functions (e.g. program loader, data transfer, system control)
- register and interrupt access functions
- timer access functions
- DMA management
- locking primitives
A stable C API is a commonly used pattern in the embedded world. Why C? While there are many reasons, the most notable one is code reusability via the foreign function interface (FFI). Higher-level languages such as Python often use FFI to call into C/C++ libraries rather than (re)writing the logic again and again.
In this article, we'll touch base with the e-lib library, selectively looking into its functions.
Loader
One might ask, how do we execute a program on the Epiphany chip? The compiler spits out the executable file, and then what? We need to somehow configure an eCore to execute the compiled code, but how?
Previously we looked into the compiling process, but we never took a dive into the binary. In fact, understanding the ELF binary format gives us a lot of clues about the loading process, so buckle up, here comes ELF!
ELF
"Executable and Linkable Format" is a common standard file format for executable files, object code, shared libraries, and core dumps. Unlike many proprietary executable file formats, ELF is very flexible and extensible, and it is not bound to any particular processor or Instruction set architecture. This has allowed it to be adopted by many different operating systems on many different platforms.
In short, when the OS loads a program into memory, it needs to understand where to place the different sections, such as program code, constants, and data. ELF defines how the program data is organized so that the OS can load it into memory, find the entry point (which eventually calls main) and execute it.
If you followed the series, you surely remember the linker scripts that came in three flavors: legacy, fast, and internal. Each flavor defines where the program code should be placed, i.e., in internal (SRAM) or external (DRAM) memory. But where does the linker store that information? Yup, it must be stored as some kind of metadata along with the actual program…
File layout
Each ELF file is made up of one ELF header, followed by file data. The data can include:
- Program header table
- Section header table
- Data referred to by entries in the program header table or section header table
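To make the layout a bit more concrete, here is a minimal sketch of the first thing a loader typically does with such a file: check that the buffer really starts with an ELF header and that the program header table fits inside it. It uses the standard <elf.h> types available on Linux/glibc and assumes, as for Epiphany binaries, a 32-bit little-endian file; elf32_check is a hypothetical helper name, not an e-lib function.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Returns 0 if buf looks like a 32-bit little-endian ELF file whose
 * program header table lies entirely inside the buffer, -1 otherwise. */
int elf32_check(const unsigned char *buf, size_t len)
{
    Elf32_Ehdr ehdr;

    if (len < sizeof ehdr)
        return -1;
    memcpy(&ehdr, buf, sizeof ehdr);          /* avoids unaligned reads */

    if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
        return -1;                            /* no 0x7f 'E' 'L' 'F' magic */
    if (ehdr.e_ident[EI_CLASS] != ELFCLASS32 ||
        ehdr.e_ident[EI_DATA] != ELFDATA2LSB)
        return -1;                            /* not ELF32 little-endian */

    /* The program header table must fit inside the file. */
    if (ehdr.e_phentsize != sizeof(Elf32_Phdr) ||
        ehdr.e_phoff > len ||
        (size_t)ehdr.e_phnum * sizeof(Elf32_Phdr) > len - ehdr.e_phoff)
        return -1;
    return 0;
}
```

The memcpy into a local header is deliberate: the mapped file gives no alignment guarantee, so copying avoids unaligned struct accesses.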
Epiphany ELF
We are going to use the e_prime example program to look at its ELF header and section table.
$ e-gcc -g -O2 -T internal.ldf src/isprime.c src/e_prime.c -o e_prime.elf -le-lib -lm
$ e-readelf -a e_prime.elf
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] ivt_reset PROGBITS 00000000 008000 000004 00 AX 0 0 1
[ 2] workgroup_cfg PROGBITS 00000028 008028 000028 00 WA 0 0 8
[ 3] ext_mem_cfg PROGBITS 00000050 008050 000008 00 WA 0 0 8
[ 4] loader_cfg PROGBITS 00000058 008058 000010 00 WA 0 0 8
[ 5] .reserved_crt0 PROGBITS 00000100 008100 00000c 00 AX 0 0 4
[ 6] NEW_LIB_RO PROGBITS 0000010c 00810c 0005a0 00 AX 0 0 4
[ 7] NEW_LIB_WR PROGBITS 000006b0 0086b0 000458 00 WA 0 0 8
[ 8] GNU_C_BUILTIN_LIB PROGBITS 00000b08 008b08 0011b4 00 AX 0 0 8
[ 9] .init PROGBITS 00001cbc 009cbc 000024 00 AX 0 0 2
[10] .text PROGBITS 00001ce0 009ce0 0005bc 00 AX 0 0 8
[15] .data PROGBITS 000022cc 00a2cc 000018 00 WA 0 0 4
...
Symbol table '.symtab' contains 150 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 00000000 0 NOTYPE LOCAL DEFAULT UND
136: 00007ff0 0 NOTYPE GLOBAL DEFAULT 28 __stack
Okayyy, we see some unexpected sections: ivt_reset, workgroup_cfg, ext_mem_cfg and loader_cfg. Where do those come from?
IVT_RESET
The first 40 bytes of local memory (0x0 - 0x28) are reserved for the IVT (Interrupt Vector Table), so that space needs to remain untouched by our program. If you don't know what interrupts are, don't worry, we cover them in the next article. For now, it's enough to remember how the IVT is defined and where it's placed in the eCore local memory:
This is the relevant part of the linker script:
ivt_reset 0x00 : {*.o(IVT_RESET)} > IVT_RAM
ivt_software_exception 0x04 : {*.o(ivt_entry_software_exception)} > IVT_RAM
...
ivt_user 0x24 : {*.o(ivt_entry_user)} > IVT_RAM
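Since each IVT entry is a single 4-byte branch instruction, the table can be described with a few lines of C. The sketch below lists the ten vectors with their local addresses and the matching interrupt-latch bit. Only ivt_reset, ivt_software_exception and ivt_user appear in the linker script excerpt; the names of the vectors in between follow my reading of the Epiphany architecture reference and are an assumption, as are the helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Interrupt vectors in IVT order. Names between 0x08 and 0x20 are an
 * assumption based on the Epiphany architecture reference. */
enum ivt_vector {
    IVT_SYNC = 0,        /* reset / SYNC        -> 0x00 (ivt_reset) */
    IVT_SW_EXCEPTION,    /* software exception  -> 0x04 */
    IVT_MEM_FAULT,       /* memory fault        -> 0x08 */
    IVT_TIMER0,          /* timer 0             -> 0x0c */
    IVT_TIMER1,          /* timer 1             -> 0x10 */
    IVT_MESSAGE,         /* message             -> 0x14 */
    IVT_DMA0,            /* DMA channel 0       -> 0x18 */
    IVT_DMA1,            /* DMA channel 1       -> 0x1c */
    IVT_WAND,            /* wired-AND           -> 0x20 */
    IVT_USER,            /* user interrupt      -> 0x24 (ivt_user) */
    IVT_ENTRIES
};

#define IVT_ENTRY_SIZE 4u   /* each entry holds one 32-bit branch */

/* Local address of the IVT entry for vector v. */
static inline uint32_t ivt_entry_addr(enum ivt_vector v)
{
    return (uint32_t)v * IVT_ENTRY_SIZE;
}

/* Bit in ILAT / ILATST / IMASK for vector v
 * (assumed: one bit per vector, in IVT order). */
static inline uint32_t ivt_irq_bit(enum ivt_vector v)
{
    return 1u << v;
}
```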
WORKGROUP_CFG
The workgroup is defined in terms of coordinates relative to the platform's effective chip area and can be as small as a single core or as large as the whole effective chip. This way we can load multiple programs onto different workgroups, so that the chip becomes a multi-tenant environment for programs.
But how does your program know which workgroup it belongs to? As you can imagine, this information must somehow be passed from the host side (ARM) to the device (Epiphany) at load time. In fact, the loader fills that section of memory with the following struct:
typedef struct {
e_objtype_t objtype; // 0x28
e_chiptype_t chiptype; // 0x2c
e_coreid_t group_id; // 0x30
unsigned group_row; // 0x34
unsigned group_col; // 0x38
unsigned group_rows; // 0x3c
unsigned group_cols; // 0x40
unsigned core_row; // 0x44
unsigned core_col; // 0x48
unsigned alignment_padding; // 0x4c
} e_group_config_t;
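The commented offsets are easy to double-check: the struct is ten 32-bit fields placed at 0x28, so it ends exactly at 0x50, where ext_mem_cfg begins in the section table above. A minimal sketch of that check, assuming every field (including the enum-typed ones and e_coreid_t) is 32 bits wide on the eCore; group_config_mirror_t is a hypothetical stand-in for the real type:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout-only mirror of e_group_config_t; assumes all fields are 32-bit. */
typedef struct {
    uint32_t objtype, chiptype, group_id;
    uint32_t group_row, group_col, group_rows, group_cols;
    uint32_t core_row, core_col, alignment_padding;
} group_config_mirror_t;

#define WORKGROUP_CFG_ADDR 0x28u  /* from the e-readelf section table */

/* Local-memory address at which a given field ends up. */
#define CFG_ADDR(field) \
    (WORKGROUP_CFG_ADDR + (uint32_t)offsetof(group_config_mirror_t, field))

/* The struct must fill exactly the 0x28-byte workgroup_cfg section. */
_Static_assert(sizeof(group_config_mirror_t) == 0x28,
               "workgroup_cfg is 0x28 bytes in the e-readelf output");
```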
On the library side we read the structure via:
#define SECTION(x) __attribute__ ((section (x)))
e_group_config_t const e_group_config SECTION("workgroup_cfg");
The workgroup is an essential structure for a few reasons:
- it provides locality information, such as "who am I", "where am I" or "who are my closest neighbours"
- it provides an addressing reference
- it can be used for synchronisation methods, e.g. barriers
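To illustrate the "who am I / where am I" part: an eCore's ID packs its absolute row and column into 6 bits each (the host example at the end of this article computes it as row * 64 + col), and that ID also forms the top 12 bits of the global address of the core's 1 MB local address window. A sketch under those assumptions, with hypothetical helper names; the coordinates used in the test (row 32, column 8) are just example values.

```c
#include <assert.h>
#include <stdint.h>

/* Core ID from absolute chip coordinates: 6 bits of row, 6 bits of column. */
static inline uint32_t coreid_of(uint32_t abs_row, uint32_t abs_col)
{
    return (abs_row << 6) | abs_col;
}

/* Global address of local address 'offset' on core 'coreid':
 * the core ID occupies the top 12 bits of the 32-bit address. */
static inline uint32_t global_addr(uint32_t coreid, uint32_t offset)
{
    return (coreid << 20) | (offset & 0xfffffu);
}

/* Hypothetical helper: workgroup-relative coordinates of the eastern
 * neighbour, wrapping around at the edge of the group. */
static inline void east_neighbour(uint32_t core_row, uint32_t core_col,
                                  uint32_t group_cols,
                                  uint32_t *n_row, uint32_t *n_col)
{
    *n_row = core_row;
    *n_col = (core_col + 1) % group_cols;
}
```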
EXT_MEM_CFG
Depending on the linker script, we place the program either in external or internal memory. We need to allocate a piece of memory per core and pass it over to the program for all sorts of read/write operations. Hence the struct:
typedef struct {
e_objtype_t objtype; // 0x50
unsigned base; // 0x54
} e_emem_config_t;
Quiz time! What is the e_emem_config.base value, assuming we use legacy.ldf? No idea? Let's use e-readelf again:
[ 9] .text PROGBITS 8e001b50 011b50 0004dc 00 AX 0 0 8
0x8e001b50 looks familiar: it fits into the 0x8e000000 - 0x8fffffff shared memory range we defined while writing the kernel module. Everything works as expected and the code is placed in the shared DRAM. Now let's take a look at a counterexample, fast.ldf:
[10] .text PROGBITS 00001c60 009c60 0004dc 00 AX 0 0 8
Right, now we see the program stored in local memory at address 0x00001c60.
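The quiz boils down to checking which address range a section landed in. A minimal sketch, assuming the 0x8e000000 - 0x8fffffff shared DRAM window from the kernel-module article and 32 KB of local SRAM per eCore; classify is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

#define SHM_BASE        0x8e000000u  /* shared DRAM window (32 MB) */
#define SHM_END         0x8fffffffu  /* inclusive */
#define LOCAL_SRAM_SIZE 0x8000u      /* 32 KB per eCore */

enum placement { IN_LOCAL_SRAM, IN_SHARED_DRAM, ELSEWHERE };

/* Classify an address taken from the e-readelf "Addr" column. */
enum placement classify(uint32_t addr)
{
    if (addr < LOCAL_SRAM_SIZE)
        return IN_LOCAL_SRAM;
    if (addr >= SHM_BASE && addr <= SHM_END)
        return IN_SHARED_DRAM;
    return ELSEWHERE;
}
```

With the two .text addresses above, legacy.ldf lands in shared DRAM and fast.ldf in local SRAM.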
LOADER_CFG
Nothing to see here, it's just a structure to pass some loader-specific flags:
struct loader_cfg {
uint32_t flags;
uint32_t __pad1;
uint32_t args_ptr;
uint32_t __pad2;
} __attribute__((packed));
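As with workgroup_cfg, the sizes of these two structs can be checked against the section table (ext_mem_cfg: 0x8 bytes at 0x50, loader_cfg: 0x10 bytes at 0x58). A sketch, again assuming the enum-typed objtype is 32 bits wide; emem_config_mirror_t is a hypothetical stand-in for e_emem_config_t:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout-only mirror of e_emem_config_t (objtype assumed 32-bit). */
typedef struct {
    uint32_t objtype;   /* 0x50 */
    uint32_t base;      /* 0x54 */
} emem_config_mirror_t;

struct loader_cfg {
    uint32_t flags;
    uint32_t __pad1;
    uint32_t args_ptr;
    uint32_t __pad2;
} __attribute__((packed));

#define EXT_MEM_CFG_ADDR 0x50u
#define LOADER_CFG_ADDR  0x58u

/* Sizes must match the e-readelf section table (0x8 and 0x10). */
_Static_assert(sizeof(emem_config_mirror_t) == 0x8, "ext_mem_cfg size");
_Static_assert(sizeof(struct loader_cfg) == 0x10, "loader_cfg size");
```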
Stack
In the symbol table we find the stack entry:
136: 00007ff0 0 NOTYPE GLOBAL DEFAULT 28 __stack
Since the linker decides where the stack is placed, it’s worthwhile to understand where this address comes from.
In particular, we have 32 KB of local memory per eCore, and the stack usually grows downwards, so the stack must start at the top of the local address space:
0x8000 (32 KB) - 0x10 (16) = 0x7ff0
However, we need to remember that the lower part of the local memory is taken by the IVT, the program code and the extra sections, so the stack is effectively smaller than 32 KB and we need to watch out that it does not grow into them.
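To put a number on "effectively smaller": the program headers of e_prime.elf (shown in the Loading section) end at 0x100 + 0x2210 = 0x2310, so roughly 0x7ff0 - 0x2310 = 0x5ce0 (23,776) bytes remain for the stack. The sketch below does that arithmetic; it ignores the heap and anything else that may live above the image, and stack_budget is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define LOCAL_SRAM_SIZE 0x8000u  /* 32 KB */
#define STACK_TOP_GAP   0x10u    /* __stack = 0x8000 - 0x10 */

/* Bytes between the end of the loaded image (highest p_vaddr + p_memsz
 * of the LOAD segments) and the top of the stack; 0 if they overlap. */
static inline uint32_t stack_budget(uint32_t image_end)
{
    uint32_t stack_top = LOCAL_SRAM_SIZE - STACK_TOP_GAP;
    return image_end < stack_top ? stack_top - image_end : 0;
}
```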
Loading …
Finally! We can leave the theory behind and focus on the loader implementation. To load a test program, e.g. e_prime.elf, onto an eCore we need to (skipping sanity checks etc.):
- read the file and map it to specified memory location
- reset eCore registers
- reset eCore local memory
- parse the ELF header to set up the *_cfg sections
- copy the file contents into the memory locations specified in the ELF header
- run the program
First, we need to load e_prime.elf into memory so that we can inspect the ELF headers:
fd = open(executable, O_RDONLY);
fstat(fd, &st);
file = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
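The three calls above skip error handling for brevity. A slightly more defensive version of the same open/fstat/mmap sequence could look like this; map_file is a hypothetical helper, not part of e-lib, and it relies only on POSIX calls.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns the mapping (and its length via
 * *size) or NULL on error. The fd can be closed right after mmap():
 * the mapping stays valid until munmap(). */
void *map_file(const char *path, size_t *size)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {  /* mmap rejects length 0 */
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return NULL;
    *size = (size_t)st.st_size;
    return p;
}
```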
The next step would be to reset the eCore registers, but first we need to halt the eCore by writing to the DEBUGCMD register:
cmd = 0x1; // 0x0 is used to resume the CPU
e_write(dev, row, col, E_REG_DEBUGCMD, &cmd, sizeof(int));
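Before resetting anything, the loader has to confirm that the core really stopped. That step can be sketched as a bounded polling loop. The register read is abstracted behind a callback because the exact host call is not shown here; the assumption that bit 0 of the DEBUGSTATUS register reports the halted state is mine, and the fake core exists only to make the sketch testable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register-read callback (wraps whatever host call reads
 * an eCore register on real hardware). */
typedef uint32_t (*read_reg_fn)(void *ctx, uint32_t reg);

/* Assumption: bit 0 of DEBUGSTATUS reads 1 once the core has halted. */
#define DEBUGSTATUS_HALTED 0x1u

/* Poll until the halt bit is set, giving up after max_polls reads.
 * Returns 0 when halted, -1 on timeout. */
int wait_for_halt(read_reg_fn read_reg, void *ctx, uint32_t debugstatus_reg,
                  unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++)
        if (read_reg(ctx, debugstatus_reg) & DEBUGSTATUS_HALTED)
            return 0;
    return -1;
}

/* Illustration only: a fake core that reports "halted" from the
 * halts_after-th status read onwards. */
struct fake_core { unsigned reads, halts_after; };

static uint32_t fake_read_reg(void *ctx, uint32_t reg)
{
    struct fake_core *c = ctx;
    (void)reg;
    return ++c->reads >= c->halts_after ? DEBUGSTATUS_HALTED : 0;
}
```

Bounding the loop matters: a core that never halts (for example because of a wrong row/col) should produce an error rather than hang the host.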
Once we have checked the status and the eCore is definitely halted, we start resetting various registers:
// all the DMA0* and DMA1* registers
ee_write_reg(dev, row, col, E_REG_, 0);
// reset timers
ee_write_reg(dev, row, col, E_REG_CONFIG, 0);
// clear interrupt related registers
ee_write_reg(dev, row, col, E_REG_ILATCL, ~0); // ILAT records all interrupt events; writing ones to ILATCL clears them
ee_write_reg(dev, row, col, E_REG_IMASK, 0);
ee_write_reg(dev, row, col, E_REG_IRET, 0x2c); // clear_ipend (see below)
ee_write_reg(dev, row, col, E_REG_PC, 0x2c); // clear_ipend
And now… magic:
uint8_t soft_reset_payload[] = {
0xe8, 0x16, 0x00, 0x00, 0xe8, 0x14, 0x00, 0x00, 0xe8, 0x12, 0x00, 0x00,
0xe8, 0x10, 0x00, 0x00, 0xe8, 0x0e, 0x00, 0x00, 0xe8, 0x0c, 0x00, 0x00,
0xe8, 0x0a, 0x00, 0x00, 0xe8, 0x08, 0x00, 0x00, 0xe8, 0x06, 0x00, 0x00,
0xe8, 0x04, 0x00, 0x00, 0xe8, 0x02, 0x00, 0x00, 0x1f, 0x15, 0x02, 0x04,
0x7a, 0x00, 0x00, 0x03, 0xd2, 0x01, 0xe0, 0xfb, 0x92, 0x01, 0xb2, 0x01,
0xe0, 0xfe
};
ee_write_buf(dev, row, col, 0, soft_reset_payload, sizeof(soft_reset_payload));
One thing to notice about that snippet: it writes 62 bytes starting at address 0x0. As we learned before, the IVT entries occupy the first 40 bytes, so the payload "initializes" that space and continues a little past it.
The bytes above encode the following assembly code:
ivt:
0: b.l clear_ipend
4: b.l clear_ipend
8: b.l clear_ipend
c: b.l clear_ipend
10: b.l clear_ipend
14: b.l clear_ipend
18: b.l clear_ipend
1c: b.l clear_ipend
20: b.l clear_ipend
24: b.l clear_ipend
28: b.l clear_ipend
clear_ipend:
2c: movfs r0, ipend
30: orr r0, r0, r0
32: beq 1f
34: rti
36: b clear_ipend
1:
38: gie
3a: idle
3c: b 1b
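IPEND is a status register that keeps track of the interrupt service routines currently being processed. The code above installs clear_ipend as the handler for every interrupt. The clear_ipend routine is a simple loop: as long as IPEND is non-zero, it executes rti, which ends the current interrupt (clearing its IPEND bit) and jumps to the address in IRET; since IRET was set to 0x2c earlier, that lands right back at clear_ipend. Once IPEND reads zero, the interrupts are enabled again with gie and the eCore is put into the idle state.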
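The branch targets in the listing can be recovered from the raw bytes. Judging from the payload, a 32-bit branch has 0x8 in its low nibble and a signed 24-bit halfword offset in bits [31:8], while a 16-bit branch has 0x0 in its low nibble and a signed 8-bit halfword offset in its second byte; the upper nibble of the first byte is the condition (0xe for "always", 0x0 for "equal"). The sketch below decodes them under that reading and confirms that every IVT slot jumps to clear_ipend at 0x2c.

```c
#include <assert.h>
#include <stdint.h>

static const uint8_t soft_reset_payload[] = {
    0xe8, 0x16, 0x00, 0x00, 0xe8, 0x14, 0x00, 0x00, 0xe8, 0x12, 0x00, 0x00,
    0xe8, 0x10, 0x00, 0x00, 0xe8, 0x0e, 0x00, 0x00, 0xe8, 0x0c, 0x00, 0x00,
    0xe8, 0x0a, 0x00, 0x00, 0xe8, 0x08, 0x00, 0x00, 0xe8, 0x06, 0x00, 0x00,
    0xe8, 0x04, 0x00, 0x00, 0xe8, 0x02, 0x00, 0x00, 0x1f, 0x15, 0x02, 0x04,
    0x7a, 0x00, 0x00, 0x03, 0xd2, 0x01, 0xe0, 0xfb, 0x92, 0x01, 0xb2, 0x01,
    0xe0, 0xfe
};

/* Target of a 32-bit branch at 'pc' (little-endian instruction word). */
static uint32_t branch32_target(const uint8_t *code, uint32_t pc)
{
    uint32_t insn = (uint32_t)code[pc] | (uint32_t)code[pc + 1] << 8 |
                    (uint32_t)code[pc + 2] << 16 | (uint32_t)code[pc + 3] << 24;
    int32_t imm = (int32_t)(insn >> 8);       /* 24-bit offset field */
    if (imm & 0x800000)
        imm -= 0x1000000;                     /* sign-extend */
    assert((insn & 0xfu) == 0x8u);            /* 32-bit branch form */
    return pc + (uint32_t)(imm * 2);          /* offset is in halfwords */
}

/* Target of a 16-bit branch at 'pc'. */
static uint32_t branch16_target(const uint8_t *code, uint32_t pc)
{
    int32_t imm = code[pc + 1];               /* 8-bit offset field */
    if (imm & 0x80)
        imm -= 0x100;                         /* sign-extend */
    assert((code[pc] & 0xfu) == 0x0u);        /* 16-bit branch form */
    return pc + (uint32_t)(imm * 2);
}
```

For instance, the first entry's offset is 0x16 halfwords, i.e. 0x2c bytes from address 0x0, and the "b clear_ipend" at 0x36 has offset 0xfb (-5 halfwords), landing at 0x36 - 10 = 0x2c.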
What’s left is to reset the remaining registers:
for (i = E_REG_R0; i <= E_REG_R63; i += 4)
ee_write_reg(dev, row, col, i, 0);
...
Okay, we are done with clearing registers, so let's move to the next step, "clear eCore local memory." Unsurprisingly that part is quite simple:
empty = alloca(sram_size);
memset(empty, 0, sram_size);
for (i = row; i < row + rows; i++)
for (j = col; j < col + cols; j++)
e_write(dev, i, j, 0, empty, sram_size);
At this point, eCore is all good and ready for action, but still, it doesn't have any program to run... To "copy a program" we need to read the ELF program headers and copy specified memory ranges:
Elf32_Ehdr *ehdr;
Elf32_Phdr *phdr;
int ihdr;
uintptr_t dst;
uint8_t *src = (uint8_t *) file;
ehdr = (Elf32_Ehdr *) &src[0];
phdr = (Elf32_Phdr *) &src[ehdr->e_phoff];
for (ihdr = 0; ihdr < ehdr->e_phnum; ihdr++) {
if (phdr[ihdr].p_type != PT_LOAD) // only loadable segments are copied
continue;
// core_mem_base points to the eCore local memory "mapped" via the mmap call (this assumes internal.ldf)
dst = ((uintptr_t) core_mem_base) + phdr[ihdr].p_vaddr;
memcpy((void *) dst, &src[phdr[ihdr].p_offset], phdr[ihdr].p_filesz);
}
Since the host program operates on virtual memory, dst is going to be a virtual address that is mapped to a physical address via the kernel module and the mmap syscall (see the previous article). The data we copy is determined by the ELF program headers, and it's much easier to visualize once you look at the example:
$ e-readelf -a e_prime.elf
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x008000 0x00000000 0x00000000 0x00004 0x00004 R E 0x8000
LOAD 0x008028 0x00000028 0x00000028 0x00040 0x00040 RW 0x8000
LOAD 0x008100 0x00000100 0x00000100 0x02208 0x02210 RWE 0x8000
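For the above binary, the memcpy calls could look like this (assuming the mapped base is 0xb6fad000; src is an offset into the mapped file):
- dst = 0xb6fad000; src = 0x00008000; size = 0x00000004
- dst = 0xb6fad028; src = 0x00008028; size = 0x00000040
- dst = 0xb6fad100; src = 0x00008100; size = 0x00002208
Note that the last segment has MemSiz 0x2210 but FileSiz 0x2208: the remaining 8 bytes are not stored in the file and stay zero because we cleared the local memory beforehand.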
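The loop shown earlier trusts the headers and copies only p_filesz bytes. A more defensive sketch of the same idea: it skips non-PT_LOAD entries, bounds-checks every segment against both the file and the target memory, and zeroes the p_memsz - p_filesz tail (.bss) explicitly instead of relying on the earlier memory clear. Here mem stands in for the mmap()ed local memory of one eCore (the internal.ldf case), the ELF header is assumed to be validated already, and load_segments is a hypothetical helper name.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy every PT_LOAD segment of an in-memory ELF image into mem and
 * zero its .bss tail. Returns 0 on success, -1 if a header or segment
 * lies outside the file or outside mem. */
int load_segments(const uint8_t *file, size_t file_len,
                  uint8_t *mem, size_t mem_len)
{
    Elf32_Ehdr ehdr;
    Elf32_Phdr phdr;

    if (file_len < sizeof ehdr)
        return -1;
    memcpy(&ehdr, file, sizeof ehdr);

    for (unsigned i = 0; i < ehdr.e_phnum; i++) {
        size_t off = ehdr.e_phoff + (size_t)i * sizeof phdr;
        if (off > file_len || file_len - off < sizeof phdr)
            return -1;
        memcpy(&phdr, file + off, sizeof phdr);
        if (phdr.p_type != PT_LOAD)
            continue;                        /* skip non-loadable entries */
        if (phdr.p_filesz > phdr.p_memsz ||
            phdr.p_offset > file_len ||
            phdr.p_filesz > file_len - phdr.p_offset ||
            phdr.p_vaddr > mem_len ||
            phdr.p_memsz > mem_len - phdr.p_vaddr)
            return -1;
        memcpy(mem + phdr.p_vaddr, file + phdr.p_offset, phdr.p_filesz);
        memset(mem + phdr.p_vaddr + phdr.p_filesz, 0,
               phdr.p_memsz - phdr.p_filesz);  /* .bss */
    }
    return 0;
}
```

The test below feeds it three segments with the same offsets and sizes as e_prime.elf above, then checks that the last file byte, the zeroed .bss tail, and the untouched byte past MemSiz all end up where expected.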
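Now, my dear reader, we can finally execute our program!!! How, you might ask? Simply trigger the SYNC interrupt:
ee_write_reg(dev, row, col, E_REG_ILATST, SYNC);
SYNC is the interrupt behind the IVT entry at 0x0, which now holds the program's ivt_reset branch (the first LOAD segment above), so the idle eCore wakes up and jumps straight into the program's start-up code.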
e-lib functions
The goal of the e-libs library is to abstract the low-level bits away from the user and provide a general workflow, such as:
- open & initialize the device
- configure workgroup
- run program
- handle memory management
If you got this far, you have likely noticed that dealing with memory is a bit complicated and feels like a waste of time if you have to do it on your own. We even used shorthand memory access functions (e.g. e_write, e_read) throughout the Loader section to keep things simple(r). So I hope you agree with me that memory management is the most crucial part of the whole library.
Speaking of other memory access functions, we should highlight the DMA (direct memory access) functions, which make interacting with the DMA controller a whole lot easier:
e_dma_start()
e_dma_copy()
e_dma_wait()
e_dma_busy()
e_dma_set_desc()
Without those abstractions, we would have to deal with interrupts, registers and memory maps manually... sucks! Just so you know, there are a few other function families available in the SDK, such as:
- Interrupt Service Functions
- Timer Functions
- Mutex and Barrier Functions
- Core ID and Workgroup functions
We won't cover those here, but I encourage you to explore them as homework, since we use those functions more often in the upcoming blog posts.
hello world
The last thing I want to show you is the most boring example ever, "hello world." Why? Again? Wut?
As we went through all this stuff (hardware, machine code, assembly, compilers, registers, memory, kernel...), you may think that writing applications for the Epiphany CPU must be incredibly hard... and you'd be right, unless you use the e-libs.
See it for yourself:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <e-hal.h>
const unsigned ShmSize = 128;
const char ShmName[] = "hello_shm";
const unsigned SeqLen = 20;
int main(int argc, char *argv[])
{
unsigned row, col, coreid, i;
e_platform_t platform;
e_epiphany_t dev;
e_mem_t mbuf;
int rc;
srand(1);
e_init(NULL);
e_reset_system();
e_get_platform_info(&platform);
rc = e_shm_alloc(&mbuf, ShmName, ShmSize);
if (rc != E_OK)
rc = e_shm_attach(&mbuf, ShmName);
for (i=0; i < SeqLen; i++) {
char buf[ShmSize];
// Draw a random core
row = rand() % platform.rows;
col = rand() % platform.cols;
coreid = (row + platform.row) * 64 + col + platform.col;
printf("%3d: Message from eCore 0x%03x (%2d,%2d): ", i, coreid, row, col);
e_open(&dev, row, col, 1, 1);
e_reset_group(&dev);
e_load("e_hello_world.elf", &dev, 0, 0, E_TRUE);
// Wait for core program execution to finish
usleep(10000);
e_read(&mbuf, 0, 0, 0, buf, ShmSize);
printf("%s", buf);
e_close(&dev);
}
// Release the allocated buffer
e_shm_release(ShmName);
e_finalize();
return 0;
}
On the Epiphany side, the loaded program (e_hello_world.elf) boils down to:
coreid = e_get_coreid();
e_coords_from_coreid(coreid, &my_row, &my_col);
if ( E_OK != e_shm_attach(&emem, ShmName) )
return EXIT_FAILURE;
snprintf(buf, sizeof(buf), Msg, coreid);
if ( emem.size >= strlen(buf) + 1 ) {
e_write((void*)&emem, buf, my_row, my_col, NULL, strlen(buf) + 1);
} else {
return EXIT_FAILURE;
}
The program hardly needs a long explanation: in each iteration the host loads the program onto a randomly chosen eCore, which writes a greeting into a shared memory buffer (shm); the host then reads the buffer and prints the message.
As you can see, the whole program is compact, easy to write, and easy to read, all thanks to e-lib.
Summary
I hope it's clear to you why shim libraries such as e-libs are crucial for new architectures/devices such as Epiphany. Without them, the barrier to entry for the average developer becomes very high rather than moderate. That can lower the adoption rate by an order of magnitude, which often means a lack of customers and financial trouble, not to mention compatibility issues or hundreds of libraries doing the same thing.
I am glad Adapteva invested time into providing a solid SDK! I am not a fortune teller, but I can imagine what things would look like if they hadn't...