Intro
Power consumption is undoubtedly one of the most important factors we look at today. Whether it’s a smartphone, tablet, laptop, or a machine in the datacenter, the goal is the same: get the best possible performance at the lowest power cost.
Today’s datacenters hold tens of thousands of machines in a building complex. Why not more?
The problem is not the physical space, nor the cost of the machines, but the heat and power! Things to consider:
- local power station capabilities x2 (redundancy is quite important)
- environmental factors (green energy)
- equipment efficiency (PDUs, UPS, AC/DC conversion costs, etc.)
- backup capabilities (e.g. diesel gensets)
- cooling (fans, water, air tunnels)
With all that in mind it’s clear why the performance-per-watt race has accelerated over the past decades. But, hey! How does it all relate to the Parallella board?
Well, the Epiphany chip is (or was) yet another product that joined the race. Its strong selling point was low power consumption: a maximum of 2 watts for the 16-core chip (up to 1 GHz per core), which works out to roughly 35 GFLOPS per watt. The next chip versions pushed that figure to 70 GFLOPS per watt but unfortunately were never released to the public.
In this blog post I’ll try to evaluate the Parallella’s performance with a special focus on power consumption.
What’s the plan?
Overall, our goal is to investigate the power usage under a real-world application workload. In the end, we would like to know whether:
- the Epiphany’s performance-per-watt claims check out
- the Epiphany chip’s many-core design improves performance, and by how much
- the Epiphany chip still holds up against other architectures (e.g. CPU, GPU)
While I was at university I published a note on bilinear interpolation with CUDA, where I used the algorithm to scale up images. Since the algorithm is not too complex, easily parallelizable, and commonly used in computer vision, I thought it would be a good pick to experiment with on CPU, GPU, and many-core architectures. Simply stated, we’ll scale up images in three different ways, measure the runs, and draw conclusions at the end.
In case you’re not familiar with bilinear interpolation, check out this video:
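And if you prefer code over video, here is a minimal, single-channel sketch of the idea (illustrative only, not the repository’s exact implementation): each output pixel maps back to a fractional source coordinate and blends the four surrounding source pixels.

// Minimal single-channel bilinear scaling sketch (illustrative, not the repository code)
#include <math.h>
#include <stdint.h>

static uint8_t sample_bilinear(const uint8_t *src, int srcW, int srcH, float x, float y) {
    int x0 = (int)floorf(x), y0 = (int)floorf(y);
    int x1 = (x0 + 1 < srcW) ? x0 + 1 : x0;     // clamp to the image border
    int y1 = (y0 + 1 < srcH) ? y0 + 1 : y0;
    float fx = x - x0, fy = y - y0;             // fractional offsets in [0, 1)

    float top    = (1 - fx) * src[y0 * srcW + x0] + fx * src[y0 * srcW + x1];
    float bottom = (1 - fx) * src[y1 * srcW + x0] + fx * src[y1 * srcW + x1];
    return (uint8_t)((1 - fy) * top + fy * bottom + 0.5f);
}

void scale_bilinear(const uint8_t *src, int srcW, int srcH,
                    uint8_t *dst, int dstW, int dstH) {
    for (int y = 0; y < dstH; y++) {
        for (int x = 0; x < dstW; x++) {
            float sx = x * (float)(srcW - 1) / (dstW - 1);   // map output -> source coords
            float sy = y * (float)(srcH - 1) / (dstH - 1);
            dst[y * dstW + x] = sample_bilinear(src, srcW, srcH, sx, sy);
        }
    }
}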
Implementation(s)
The high-level flow (a rough host-side sketch follows the list):
- initialize the device (i.e. the Epiphany chip)
- open a video file with opencv and pull out one frame
- convert the frame to SDL_Surface (not strictly required, but I took advantage of an old implementation)
- copy the frame to remote memory (shared mem for Epiphany, device mem for CUDA)
- allocate a buffer for the scaled image (shared mem for both Epiphany and CUDA)
- run bilinear interpolation on the device (CPU, GPU, Epiphany)
- save the new image to disk
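As a rough host-side sketch of the first few steps (hypothetical file name, OpenCV’s C++ API; in the repository this is wired up to SDL and the per-device runtimes instead of just saving the frame):

#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("video.mp4");          // hypothetical input file
    if (!cap.isOpened()) return 1;

    cap.set(cv::CAP_PROP_POS_FRAMES, 105);      // jump to frame 105
    cv::Mat frame;
    if (!cap.read(frame)) return 1;             // pull out one frame (8-bit BGR)

    // from here the frame would be converted to an SDL_Surface and copied
    // to the device's remote/shared memory before launching the kernel
    cv::imwrite("frame.png", frame);
    return 0;
}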
In the repository you’ll find the following implementations:
- CPU – the interpolation is executed on the CPU (single thread). No additional copies to remote memory are required.
- CPU Threaded – the workload is split across the available CPU threads.
- GPU (CUDA) – the interpolation is executed on the GPU. Remote memory holds the original and the new image (grid size: newImageWidth x newImageHeight).
- Epiphany – the original image is split into 16 pieces, and each piece is handled by one chip core (see the sketch below). Remote memory holds the original and the new image.
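For the Epiphany variant, the work split could look roughly like this (a sketch with hypothetical names; the real code computes per-core offsets on the host and passes them to the cores via e_write):

// Illustrative split of the image rows across the 16 Epiphany cores
// (hypothetical helper, not the repository's exact code)
#define CORES 16

typedef struct { int startRow; int numRows; } strip_t;

strip_t strip_for_core(int core, int height) {
    int rowsPerCore = height / CORES;
    strip_t s;
    s.startRow = core * rowsPerCore;
    // the last core picks up any remainder rows
    s.numRows = (core == CORES - 1) ? height - s.startRow : rowsPerCore;
    return s;
}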
It’s unlikely that you own an Epiphany board, but you surely have a device with a CPU. Here is how to build and run the code (the CPU variant is shown; the other devices follow the same steps):
# install deps
pacman -S sdl2 sdl2_image opencv libpng cmake
# clone repo & build
git clone https://github.com/mkaczanowski/interpolation
cd interpolation/cpu
mkdir build && cd build
cmake ../
make
# select 105th frame
./interpolation 105
file output.png
The output.png file is a scaled-up version (width: 1100px) of the original 105th frame (width: 640px).
Developer experience
The one place where CUDA stands out compared to Epiphany is ease of use. To me that’s understandable, since Nvidia has hundreds of engineers working on the platform and Adapteva had just a few. However, in the spirit of not taking sides, I feel obliged to point it out.
CUDA kernel function invocation:
cudaTransform<<<gridSize, blockSize>>>(
    -1,
    -1,
    *newPixels,
    ....
);
Passing arguments to Epiphany is more manual and harder to debug:
e_write(&dev, row, col, 0x7000, &((*newImage)->pitch), sizeof(uint16_t)); // pitchOutput
e_write(&dev, row, col, 0x7004, &(image->pitch), sizeof(uint16_t)); // pitchInput
....
// unlock work on the core
e_write(&dev, row, col, 0x7032, &(cores[row][col]), sizeof(uint8_t));
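On the device side the kernel has to mirror this layout by hand. Roughly, it could look like the sketch below (assuming the same offsets as above; the real kernel also reads image pointers, dimensions, and the rest of the arguments):

#include <stdint.h>

int main(void) {
    // read the arguments from the agreed core-local addresses; the host
    // writes the "unlock" byte last (sketch only, offsets taken from the host code above)
    volatile uint16_t *pitchOutput = (volatile uint16_t *)0x7000;
    volatile uint16_t *pitchInput  = (volatile uint16_t *)0x7004;
    volatile uint8_t  *unlock      = (volatile uint8_t  *)0x7032;

    while (*unlock == 0) { }   // spin until the host signals that work can start

    // ... run bilinear interpolation using *pitchInput / *pitchOutput ...
    return 0;
}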
Hardware
Before we jump to the "benchmark" let me introduce you to the hardware:
- GPU: Jetson Nano
  - GPU: 128-core Maxwell
  - CPU: quad-core ARM A57 @ 1.43 GHz
  - RAM: 4 GB 64-bit LPDDR4, 25.6 GB/s
- Epiphany: Parallella
  - Epiphany: 16-core chip, up to 1 GHz (~700 MHz)
  - CPU: Zynq Z-7010 dual-core ARM A9, up to 1 GHz
  - RAM: 1 GB
- CPU: Odroid XU4
  - CPU: Samsung Exynos 5422, Cortex-A15 @ 2 GHz and Cortex-A7 octa-core CPUs
  - RAM: 2 GB
Measurements
Comparing apples to oranges… some might say. I’d agree there is some truth to that, because we are looking at three different architectures (different data bus widths, for instance) that are widely separated in time (the Parallella is ~6 years older than the Jetson Nano).
But from a high-level perspective the data should highlight the strong and weak spots of all three devices, as we compare them against one target application.
Which one is the fastest, the "easiest", or the most power-efficient implementation? Let’s find out!
How is the experiment performed?
The program (relevant to the given platform) is run in a loop, once for each frame (100–105):
time for i in {100..105}; do ./interpolation $i; done
We measure two things:
- function duration – includes the memory transfer costs but excludes the device initialization time and loading/saving the image from/to the HDD
- kernel duration – the time taken by the kernel function (Epiphany, CPU, GPU) to run the bilinear interpolation
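To make the split explicit, here is a sketch of the timing harness (hypothetical helper names standing in for the per-platform code):

#include <stdio.h>
#include <time.h>

// stubs standing in for the per-platform implementation
static void copy_to_device(void)   { /* host -> remote memory */ }
static void run_kernel(void)       { /* bilinear interpolation */ }
static void copy_from_device(void) { /* remote memory -> host */ }

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(void) {
    double t0 = now_ms();
    copy_to_device();                 // counted in "function duration"
    double t1 = now_ms();
    run_kernel();                     // "kernel duration" covers only this call
    double t2 = now_ms();
    copy_from_device();
    double t3 = now_ms();

    printf("kernel duration:   %.2f ms\n", t2 - t1);
    printf("function duration: %.2f ms\n", t3 - t0);
    return 0;
}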
How is power usage measured?
Unfortunately, no fancy equipment is used here… I simply probe the TP-Link HS110 smart plug API every second.
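For reference, a rough sketch of what such a probe can look like, assuming the plug’s commonly documented local API (TCP port 9999, JSON obfuscated with a rolling XOR keyed at 171); this is illustrative and not the exact script used for the measurements:

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// rolling-XOR "autokey" obfuscation used by the plug's local protocol
static void xor_crypt(const char *in, char *out, size_t len, int decrypt) {
    unsigned char key = 171;
    for (size_t i = 0; i < len; i++) {
        out[i] = (char)(key ^ (unsigned char)in[i]);
        key = (unsigned char)(decrypt ? in[i] : out[i]);
    }
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <plug-ip>\n", argv[0]); return 1; }

    const char *cmd = "{\"emeter\":{\"get_realtime\":{}}}";   // ask for the current power draw
    size_t len = strlen(cmd);
    char enc[128], buf[2048], dec[2048];
    xor_crypt(cmd, enc, len, 0);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9999);
    inet_pton(AF_INET, argv[1], &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) return 1;

    uint32_t nlen = htonl((uint32_t)len);                     // 4-byte length prefix
    send(fd, &nlen, sizeof(nlen), 0);
    send(fd, enc, len, 0);

    ssize_t n = recv(fd, buf, sizeof(buf), 0);                // reply: length prefix + payload
    if (n > 4) {
        xor_crypt(buf + 4, dec, (size_t)(n - 4), 1);
        dec[n - 4] = '\0';
        printf("%s\n", dec);                                  // JSON with voltage/current/power
    }
    close(fd);
    return 0;
}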
Test 1: CPU
Run the CPU implementation on the Parallella, Odroid, and Jetson Nano.
Test 2: Epiphany vs GPU
Run the Epiphany implementation on the Parallella and the GPU implementation on the Jetson Nano.
Test 3: CPU vs GPU duration time vs image width
Run the CPU and GPU implementations on the Jetson Nano, but increase the image size with every sample.
Summary
The CPU implementation gives us a baseline we can compare against in the further analysis. For what it’s worth, the Parallella’s Zynq CPU is… too slow. The second place goes to the Odroid, whose speed is alright but whose power usage jumps to 8 watts, which is 4 watts above its idle draw. The first place on the podium therefore goes to the Jetson Nano, where the performance-to-power ratio is the best so far.
Now it’s time for the crown guest of the show… Epiphany vs GPU!
Speed-wise the Jetson is faster by ~180 ms (~40% of its total time), but its power usage is roughly twice as high (Epiphany ~1.5 watts, Jetson ~3.5 watts).
The third test shows (to some degree) that image-processing acceleration makes sense for heavier calculations, where the cost of the memory transfers is outweighed by the gains in computation speed. Also note that the results shift slightly with the threaded CPU implementation.
Finally, I am amazed by the Epiphany once again. Based on the results, I must admit the chip still stands strong in the race: it offers good performance for genuinely low power usage even after, what, ~6 years. In that time the slow external memory access could have been improved, along with many other things, such as the developer experience.
UPDATE:
The graphs now also include the CPU threaded implementation. Speed-wise it beats the other implementations, but its power usage surges up to 11 watts. In terms of power efficiency the data doesn’t look too good, so I am silently ignoring that implementation in the summary above.
See other posts!
- Parallella (part 1): Case study
- Parallella (part 2): Hardware
- Parallella (part 3): Kernel
- Parallella (part 4): ISA
- Parallella (part 5): elibs
- Parallella (part 6): FreeRTOS
- Parallella (part 10): Power efficiency
- Parallella (part 11): malloc
- Parallella (part 12): Tensorflow?
- Parallella (part 13): Closing notes