# Parallella (part 4): ISA

Intro

In previous posts, we were introduced to the hardware and made it operational with the kernel module. Soooo… (drum rolls) it’s time to write and run a program on the Epiphany chip!
But how can one do it? What compiler do I use? What "language" does the Epiphany chip understand? Is it C, C++, assembly, binary, or what?

The Epiphany chip on our Parallella board comes with 16 high-performance CPU cores (eCores). Similarly to the power-efficient ARM (Advanced RISC Machine), the eCore has a RISC architecture. The choice comes as no surprise, since a RISC architecture typically requires fewer transistors than a CISC architecture (e.g. x86), which reduces power consumption, and that is a top priority for the Epiphany design (up to 70 GFLOPS/Watt processing efficiency at 28nm).

This means we can write programs similarly to how we do it for an ARM CPU. In fact, the SDK comes with a prebuilt GCC, so we can write C/C++ without any hassle.

Instruction set

Every processor or processor family has its own instruction set. Instructions are patterns of bits that, by physical design, correspond to different commands to the machine. Systems may also differ in other details, such as memory arrangement, operating systems, or peripheral devices. Because a program normally relies on such factors, different systems will typically not run the same machine code, even when the same type of processor is used (source: wikipedia).

With this being said, we understand that the machine code produced by the compiler will be different on each processor. So, if the machine code is different, the assembly language is going to be different too (the assembly tightly corresponds to the machine level code).

Let’s take a look at the compilation stages:
[Figure: compilation stages]

As you can see, the C compiler produces intermediate assembly code, which is later assembled into machine code. Following that trail, we realize that in order to compile "C" code for a specific CPU we need to:

  • write a C compiler backend that produces valid assembly code (C -> assembly)
  • implement an assembler (assembly -> machine code)

Here is an example of the machine language code:

0001100000001011 0000000001110010

Since it’s really hard to work with such long numbers, assembly presents the machine code to the user as a set of mnemonics, such as:

mov r0,0x7c0

RISC

We have 16 RISC CPUs available, but what does the "RISC" mean?

RISC stands for "reduced instruction set computer". The term "reduced" in that phrase was intended to describe the fact that the amount of work any single instruction accomplishes is reduced (at most a single data memory cycle) compared to the "complex instructions" of CISC CPUs, which may require dozens of data memory cycles in order to execute a single instruction (source: wikipedia).

In short, a RISC CPU has a small set of simple and general instructions rather than a large set of complex ones.

Instructions

The Epiphany assembler instructions are presented in the chart below. If you have ever written any assembly, you will notice similarities but also some differences, such as the "mov" family.
Please note that I shamelessly copied the image from the Epiphany-V datasheet (not Epiphany-III), so there are some minor differences.

One quite interesting thing you will notice is the split-width instruction encoding method used in the Epiphany ISA. That is, all instructions are available in both 16- and 32-bit forms, with the instruction width depending on the registers used in the operation. Any command that uses only registers 0 through 7 and does not have a large immediate constant is encoded as a 16-bit instruction.
Commands that use higher-numbered registers are encoded as 32-bit instructions.

For example, a move command with a large immediate constant (i.e. a value that does not fit in 8 bits) is translated to a 32-bit instruction:

mov r0,0x7c0 -> 0001100000001011 0000000001110010 (<- 32 bits)

whereas a small immediate is translated to a 16-bit instruction:

mov r0,0xff -> 0001111111100011 (<- 16 bits)
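The width rule can be written down as a tiny Python sketch. It only models the mov immediate case, the function name mov_width is made up for illustration, and the limits (registers 0 through 7, an immediate that fits in 8 bits) are taken from the description above rather than from the assembler sources.

```python
def mov_width(rd: int, imm: int) -> int:
    """Return the encoding width (in bits) for `mov r<rd>, #imm`.

    Hypothetical helper: 16 bits when the register is r0..r7 and the
    immediate fits in 8 bits, 32 bits otherwise.
    """
    if not 0 <= rd <= 63:
        raise ValueError("the register file has 64 registers (r0..r63)")
    if not 0 <= imm <= 0xFFFF:
        raise ValueError("mov takes at most a 16-bit unsigned immediate")
    return 16 if rd <= 7 and imm <= 0xFF else 32


print(mov_width(0, 0xFF))   # small immediate, low register
print(mov_width(0, 0x7C0))  # immediate too large for 8 bits
print(mov_width(8, 0x19))   # small immediate, but a higher-numbered register
```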

Okayyy… Where do those bits come from? And what registers?

eCore

Epiphany ISA is a load-store architecture that divides instructions into two categories:

  • memory access (load and store between memory and registers),
  • ALU operations (which only occur between registers)

For instance, to sum two integers placed in memory (RAM), the following needs to happen:

  1. load value_1 and value_2 from memory into two registers, e.g. r0 and r1
  2. add the values of the two registers with the appropriate instruction
  3. likely store the result back to memory (e.g., place it on the stack)
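The three steps above can be mimicked with a toy model in Python. This is only a sketch of the load-store idea: the memory addresses and the helper names load, add and store are invented for illustration and do not correspond to real Epiphany mnemonics or addresses.

```python
# Toy load-store machine: ALU operations only ever touch registers.
memory = {0x100: 2, 0x104: 40, 0x108: 0}  # hypothetical addresses
regs = [0] * 64                           # 64 general-purpose registers


def load(rd: int, addr: int) -> None:
    """Memory access: copy a word from memory into a register."""
    regs[rd] = memory[addr]


def store(rs: int, addr: int) -> None:
    """Memory access: copy a register back to memory."""
    memory[addr] = regs[rs]


def add(rd: int, rn: int, rm: int) -> None:
    """ALU operation: register = register + register (32-bit wraparound)."""
    regs[rd] = (regs[rn] + regs[rm]) & 0xFFFFFFFF


load(0, 0x100)    # step 1: value_1 -> r0
load(1, 0x104)    #         value_2 -> r1
add(2, 0, 1)      # step 2: r2 = r0 + r1
store(2, 0x108)   # step 3: write the result back to memory
print(memory[0x108])
```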

From the hardware point of view, the CPU needs a few registers and the already mentioned ALU. In fact, the eCore looks like this:

The processor includes:

  • a general-purpose program sequencer that supports all standard program flows (e.g., loops, functions, interrupts)
  • large general-purpose register file – 64 word-sized registers serve as temporary, power-efficient storage instead of memory
  • integer ALU (IALU) – performs a single 32-bit integer operation per clock cycle
  • floating-point unit (FPU) – complies with the single-precision floating-point IEEE754 standard and executes one floating-point instruction per clock cycle
  • debug unit – provides multicore debug capabilities such as single stepping, breakpoints, halt and resume
  • interrupt controller – supports up to 10 interrupts and exceptions, with full support for nested interrupts and interrupt masking

Assembly to machine code

The assembler translates the "lowest-human-readable" assembly syntax into machine code. But how does that happen?

The Epiphany architecture uses a split-width instruction encoding method; an instruction is either 16 or 32 bits wide. The general type of instruction is given by the op (operation) field in the lowest bits. The other bits are used for the remaining parameters, such as "RD" (destination register) or "IMM<SIZE>" (immediate). Let’s take a look at the "MOV" operation:

Description:
    The MOV immediate instruction copies an unsigned
    immediate constant in the destination register (RD).

Syntax:  MOV <RD>, #<IMM8>; MOV <RD>, #<IMM16>;

<RD> Destination register for the move operation.
<IMM8> An 8-Bit unsigned immediate value.
<IMM16> A 16-Bit unsigned immediate value.

To move immediate value "25" to register "R0" we execute:

main:
    MOV R0,#25

The assembly uses mnemonic codes to refer to machine code instructions. Usually, the code instructions are provided by the vendor in the form of a decoding table, such as this:

It seems we have everything we need to compile assembly to machine code:

# Step 1: Because "25" fits into 8-bit boundary, we'll select the 16-bit move instruction
25 (decimal) = 11001 (binary)
output = xxxxxxxxxxxxxxxx (16-bit binary)

# Step 2: Write opcode to the output (00011 -> see the decoding table)
output = xxxxxxxxxxx00011

# Step 3: Write the immediate value (I7 -> I0)
output = xxx 00011001 00011

# Step 4: Write the destination register r0
output = 000 00011001 00011 = 0x0323
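The four steps can be replayed in a few lines of Python. This is a minimal sketch that assumes the 16-bit layout used in the walkthrough (register in bits 15..13, immediate in bits 12..5, opcode 00011 in bits 4..0); the function name encode_mov_imm8 is made up for illustration.

```python
def encode_mov_imm8(rd: int, imm: int) -> int:
    """Encode `mov r<rd>, #imm` as a 16-bit word (layout assumed from the steps above)."""
    if not 0 <= rd <= 7:
        raise ValueError("the 16-bit form only reaches r0..r7")
    if not 0 <= imm <= 0xFF:
        raise ValueError("the 16-bit form takes an 8-bit unsigned immediate")
    opcode = 0b00011                        # step 2
    return (rd << 13) | (imm << 5) | opcode  # steps 4, 3 and 2 combined


word = encode_mov_imm8(0, 25)
print(f"{word:016b} = 0x{word:04x}")  # 0000001100100011 = 0x0323
```

The same function reproduces the earlier example as well: encode_mov_imm8(0, 0xff) gives 0x1fe3, i.e. 0001111111100011.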

Hopefully that wasn’t too difficult, so let’s try the same for a number that needs a 16-bit immediate: 1984.

# Step 1: Because "1984" does not fit in 8 bits but fits in 16, we'll select the 32-bit move instruction
1984 (decimal) = 11111000000 (binary)
output = xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx (32-bit binary)

# Step 2: Write opcode (01011 -> see the decoding table)
output = xxxxxxxxxxxxxxxx xxxxxxxxxxx01011

# Step 3: Write the low byte of the immediate value (I7 -> I0)
I7 -> I0 = 11000000
output = xxxxxxxxxxxxxxxx xxx11000000 01011

# Step 4: Write the destination register r0 (its lower three bits)
output = xxxxxxxxxxxxxxxx 00011000000 01011 // 0x180b

# Step 5: Write the high byte of the immediate value (I15 -> I8); the upper register bits and bit 28 are 0
I15 -> I8 = 00000111
output = 0000 00000111 xxxx 00011000000 01011 // 0x007? 0x180b

Let’s check whether our calculations are correct:

e-as test.S -o test.out
e-objdump -D test.out

00000000 <main>:
   0:   0323            mov r0,0x19
   2:   180b 0072       mov r0,0x7c0

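Okayyy… mov r0,0x19 looks fine, but the upper halfword of mov r0,0x7c0 doesn’t match: we left bits 19..16 undefined, and objdump shows 0x0072 where zero-filling them would give 0x0070. It looks like the assembler puts 0010 into those bits. Evaluating again:

output = 0000 00000111 xxxx 00011000000 01011 // 0x007? 0x180b
output = 0000 00000111 0010 00011000000 01011 // 0x0072 0x180b

The 0010 also appears in "MOV<COND> (32)", and it is not padding: the CGEN description of this instruction, shown later in this post, sets these bits explicitly with (f-opc-19-4 #x2), so they are additional opcode bits.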
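To double-check the 32-bit case, here is a minimal Python sketch of the encoding. The layout is an assumption pieced together from the walkthrough and the objdump output: register bits in 31..29 and 15..13, bit 28 left at 0, the immediate split over bits 27..20 and 12..5, 0010 in bits 19..16 and opcode 01011 in bits 4..0. The function name encode_mov_imm16 is made up for illustration.

```python
def encode_mov_imm16(rd: int, imm: int) -> int:
    """Encode `mov r<rd>, #imm` as a 32-bit word (layout assumed, see lead-in)."""
    if not 0 <= rd <= 63:
        raise ValueError("register must be r0..r63")
    if not 0 <= imm <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    low = ((rd & 0b111) << 13) | ((imm & 0xFF) << 5) | 0b01011
    high = ((rd >> 3) << 13) | ((imm >> 8) << 4) | 0b0010  # bit 28 stays 0
    return (high << 16) | low


word = encode_mov_imm16(0, 1984)
# objdump prints the low halfword first, then the high halfword
print(f"{word & 0xFFFF:04x} {word >> 16:04x}")  # 180b 0072
```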

One interesting fact about the Epiphany ISA is the movt instruction, which copies a 16-bit immediate into the upper 16 bits of the register. Why not simply use the mov instruction? Well, as you can see in the example above, the maximum instruction size is 32 bits, so there is no room for a 32-bit immediate. According to the Epiphany reference, mov and movt can both be executed within one clock cycle, not bad!
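Here is a small Python model of how a mov/movt pair can build a full 32-bit constant. It is a sketch under two assumptions: mov zero-extends its 16-bit immediate, and movt replaces only the upper 16 bits while leaving the lower 16 untouched. The function names are illustrative, not part of any SDK.

```python
def mov(imm16: int) -> int:
    """Model of mov: the register becomes the zero-extended 16-bit immediate."""
    return imm16 & 0xFFFF


def movt(reg: int, imm16: int) -> int:
    """Model of movt: set the upper 16 bits, keep the lower 16 bits (assumed)."""
    return ((imm16 & 0xFFFF) << 16) | (reg & 0xFFFF)


def load_constant(value: int) -> int:
    """Load a 32-bit constant with two instructions: mov (low half), movt (high half)."""
    low, high = value & 0xFFFF, (value >> 16) & 0xFFFF
    reg = mov(low)
    reg = movt(reg, high)
    return reg


print(hex(load_constant(0xDEADBEEF)))  # 0xdeadbeef
```

The order matters in this model: running mov after movt would wipe the upper half again, because mov zero-extends.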

Binutils

I bet you understand the assembling process now. But where does the assembler come from? Do we need to write it on our own?

Fortunately, we don’t have to do it! Instead, we can use binutils with cgen:

  • CGEN is a framework for developing generators of CPU-related tools such as assemblers, disassemblers, and simulators. It specifies a description language for describing the architecture and organization of a CPU without reference to any particular application. Additional applications can be written within the framework. CGEN is written in Scheme and can be run under the GNU Guile interpreter (source)
  • binutils are a set of programming tools for creating and managing binary programs, object files, libraries, profile data, and assembly source code. The package usually includes tools such as: nm, as, objdump, ld, etc.

Where cgen and binutils cross paths is the assembler (GAS – "GNU Assembler"), as shown below.

source: https://www.youtube.com/watch?v=ciR3Fw6U85o

CGEN & SCHEME

We won’t dive too deep into the Scheme language, but let’s try to decode the mov (32) instruction defined in epiphany.cpu:

(dni_wrapper mov16
         "mov imm16"
         ()
         "mov.l $rd6,$imm16"
         (+ OP4_IMM32 (f-opc-4-1 #x0) (f-opc-19-4 #x2) (f-dc-28-1 #x0) rd6 imm16)
         (set rd6 (zext SI imm16))
         ()
         )

Huh, it looks familiar, doesn’t it? It’s quite similar to the decoding table we used to manually translate assembly to machine code. The longest line stands out, so let’s take a closer look:

# definition
(+ OP4_IMM32 (f-opc-4-1 #x0) (f-opc-19-4 #x2) (f-dc-28-1 #x0) rd6 imm16)

# opcode enum
OP4_IMM32 = 1011

# define the fields of the instruction.
;;   name            description              ATTR  MSB LEN
(dnf f-opc-4-1    "secondary opcode"           ()     4 1)
(dnf f-opc-19-4   "additional opcode bits"     ()    19 4)
(dnf f-dc-28-1    "DC"                 (RESERVED)    28 1)

# destination register
(dnop rd6 "destination register" () h-registers f-rd6)

Even if you’re not familiar with the language (I am not either), the above snippet resembles the decoding table format. For instance, 1011 is the opcode used for mov, and imm16 stands for a 16-bit immediate value.
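The MSB/LEN pairs from the dnf lines are enough to pull the fixed fields back out of an encoded word. The Python sketch below does that for the mov r0,0x7c0 encoding from earlier. It assumes bit 0 is the least significant bit and that the primary opcode sits in bits 3..0; the "primary-opcode" entry and the helper names are made up for illustration.

```python
# name -> (MSB, LEN); the last three entries are copied from the dnf lines above,
# "primary-opcode" is a hypothetical name for the OP4 field (assumed bits 3..0).
FIELDS = {
    "primary-opcode": (3, 4),
    "f-opc-4-1": (4, 1),
    "f-opc-19-4": (19, 4),
    "f-dc-28-1": (28, 1),
}


def extract(word: int, msb: int, length: int) -> int:
    """Return the `length`-bit field whose most significant bit is `msb`."""
    lsb = msb - length + 1
    return (word >> lsb) & ((1 << length) - 1)


def decode(word: int) -> dict:
    """Extract every fixed field from a 32-bit instruction word."""
    return {name: extract(word, msb, length) for name, (msb, length) in FIELDS.items()}


# mov r0,0x7c0: objdump shows "180b 0072", low halfword first
for name, value in decode(0x0072180B).items():
    print(f"{name:14} = {value:#x}")
```

The extracted values line up with the definition: 0xb (1011) for the opcode, 0x0 for f-opc-4-1, 0x2 for f-opc-19-4 and 0x0 for the reserved bit.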

Using binutils

The binutils package already has Epiphany architecture support upstreamed, so we can build it from the main repo (or use the AUR package instead):

pkgver=2.33.1
target=epiphany-elf

wget ftp://ftp.gnu.org/gnu/binutils/binutils-$pkgver.tar.bz2
tar -xjf binutils-$pkgver.tar.bz2
cd binutils-$pkgver

./configure \
    --target=$target \
    --prefix=/usr \
    --enable-multilib \
    --enable-interwork \
    --with-gnu-as \
    --with-gnu-ld \
    --disable-nls \
    --disable-werror

make configure-host
make

GCC

Compiling C is the next thing in the stack, but it’s far more complicated than just assembly. If you’re interested in writing a new backend, I would recommend the tutorial by Krister Walfridsson.

The above graph illustrates a high-level view of the GCC compilation process. It’s non-trivial to understand all its parts, and we are not going to discuss them here. It’s important, however, to understand that we need to write a backend to enable a new architecture. For instance, the compiler needs to know how many registers are available or how to perform some optimizations (e.g., hardware loops).

Thanks to Adapteva & Embecosm cooperation the GCC backend is written by professionals in the field and upstreamed for everyone’s use. To build it, you additionally need libc (newlib):

_snapshot=9-20200111
_newlibver=3.1.0
_target=epiphany-elf

wget ftp://sourceware.org/pub/newlib/newlib-$_newlibver.tar.gz
wget ftp://gcc.gnu.org/pub/gcc/snapshots/$_snapshot/gcc-$_snapshot.tar.xz
...

../configure --target=$_target \
            --prefix=/usr \
            --libexecdir=/usr/lib \
            --with-pkgversion='Arch Repository' \
            --with-bugurl='https://bugs.archlinux.org/' \
            --enable-multilib \
            --enable-interwork \
            --enable-languages=c,c++ \
            --with-newlib \
            --with-gnu-as \
            --with-gnu-ld \
            --disable-nls \
            --disable-libcc1 \
            --with-headers=newlib/libc/include \
            --disable-werror

make

LLVM

Alternatively, there is an LLVM backend written by the community, but it doesn’t have feature parity with GCC.

LIBC

Huh, did I mention that libc needs to be ported to the given platform? Yes, it’s not particularly difficult, but it takes some time to wrap your head around it.

LD

The Epiphany GCC linker comes with linker scripts. The purpose of a linker script is to describe how the sections in the input files should be mapped into the output file and to control the memory layout of the output file.
In the Kernel section, we discussed the different memory regions available, such as internal memory (32 KB SRAM) and external memory (32 MB DRAM). With linker scripts, we can define where to place .text, .data or the stack.

This is how you pass the linker script (pick one of internal.ldf, legacy.ldf or fast.ldf):

e-gcc -g -O2 -T /opt/adapteva/esdk/bsps/current/{internal, legacy, fast}.ldf src.c -o out.elf -le-lib

Summary

You should treat this article as a mere introduction to the world of compilers. We took a ride through machine code and assembly, but we haven’t touched GCC much, because such an article would be much longer than this one. I’d encourage you to take a look at the GCC sources to learn more about compiler optimizations and how they relate to a given architecture.

I am quite impressed by the work done by the Adapteva & Embecosm team. It’s just a handful of people who did all this work in quite a short period of time (afaik). Given the exhaustive documentation that comes with it, I can only say, "Thank you!".
