Intro
At this point, we should have a basic understanding of the Parallella board design. However, we don’t know how to reference the board resources from the software… With standard Zynq kernel configuration, we are able to boot the board and use most of the peripherals, but the Epiphany chip stays hidden because there is no driver that would manage it.
In this article, we dive into the existing Epiphany kernel module, break it into parts, and investigate its functions. Analyzing a few thousand lines of code is not that exciting, so we are going to pretend the driver doesn’t exist and write it from scratch. The empirical approach will help us understand the Kernel <-> Epiphany communication better, and that will become crucial for the project we’ll implement at the end of the series.
Kernel
Before we dive into the kernel sources, let’s ask ourselves: what do we want to achieve? And why do we need to touch the kernel at all?
In essence, we want to provide communication between the OS and the Epiphany chip. From the previous article, we know that each epiphany core (eCore) has a local SRAM memory bank and global memory map.
Now imagine you want to write an application that would write data to local SRAM, and the host application (running on ARM) would read from it. Sounds simple but here are a few things to think about:
- Linux somehow needs to discover the device and perform some basic setup (register values, interrupts, clocks, power, etc.)
- Linux uses virtual memory and paging, Epiphany has flat contiguous memory
- Linux splits memory into Kernel and userspace, so accessing hardware memory can’t be done directly from userspace
Okay, so we need to write a kernel module that would basically allow us to access memory on the device (thinking from userspace perspective). A very naive API could look like this:
device = open_epiphany_device(core_id);
// write op
device->write(start_address, data, data_size);
// read op
data = device->read(start_address, size);
return data;
The above snippet is our goal to implement, but first, we have to expose the device to the OS.
Device Tree
The Device Tree in Linux provides a way to describe non-discoverable hardware. This information was previously hardcoded in source code.
You have likely heard about ACPI or UEFI before, as those are more common options on x86. With the rise of embedded devices, a simple(r) and flexible interface was required, mainly because there was no way to produce a generic kernel that would boot on a whole family of devices.
Nowadays, Device Tree is very popular on ARM boards and is somewhat of a standard (even though there is active work on ARM UEFI). The device tree is represented in several different formats:
- .dts source file (human-readable)
- .dtsi source include files (human-readable)
- .dtb compiled blob file (binary format)
Building device tree
Usually, SoC vendors provide a device tree that we can build on. In our case, we use the Zynq-7000 series, which has an upstreamed configuration, so let’s use it in our parallella.dts file:
/include/ "zynq-7000.dtsi"
The zynq-7000.dtsi file includes a bunch of definitions, such as VCC regulators, the ARM CPU, or AMBA (Advanced Microcontroller Bus Architecture) with the peripherals setup.
Parallella board has some additional hardware that is not covered by base configuration, such as:
- eLink device (on FPGA fabric)
- Ethernet port
- LEDs
- USB ports
- power management IC
- MicroSD port
- …
To keep this post compact, I will focus primarily on eLink device driver configuration because it’s tightly coupled with the device driver we are going to write.
Setting up memory
The memory configuration is not present in zynq-7000 since the DRAM is not part of the SoC. Thus we need to configure it ourselves:
- the compatible property specifies the name of the system. It contains a string in the form <manufacturer>,<model>
- the /memory node provides basic information about the address and size of the physical memory
- the reg property denotes the memory region, in this case the range from 0 (0x0) to 1 GB (0x40000000 == 1073741824 bytes)
/ {
    compatible = "adapteva,parallella", "xlnx,zynq-7000";
    memory {
        device_type = "memory";
        reg = <0x0 0x40000000>;
    };
In addition to the /memory node, we would like to reserve a portion of DRAM to be shared between the Epiphany chip and the OS:
- the reserved-memory node makes the operating system exclude reserved memory from normal usage
- #address-cells and #size-cells are set to 1 because the reg values are single uint32 cells (the reg format is a tuple of <address size>). For a 64-bit address, #address-cells would be equal to 2.
- ranges is used for address translation; an empty ranges property means addresses in the child address space are mapped 1:1 onto the parent address space
- no-map indicates the operating system must not create a virtual mapping of the region as part of its standard mapping of system memory, nor permit speculative access to it under any circumstances other than under the control of the device driver using the region.
reserved-memory {
    #address-cells = <1>;
    #size-cells = <1>;
    ranges;
    emem: emem@3e000000 {
        reg = <0x3e000000 0x2000000>;
        no-map;
    };
};
That piece of configuration tells the kernel to "back off" from the 32 MB starting at 0x3e000000, that is, the top of the range used by kernel space. The Linux kernel (usually) splits memory in a 3/1 ratio, where the user space is placed in the lower range (0x00000000 - 0x2fffffff) and the kernel space in the upper range (0x30000000 - 0x40000000). The address 0x3e000000 is the result of 0x40000000 - 0x2000000.
You might be curious why we prevent virtual mapping with no-map?
The kernel uses virtual memory, so the memory block would have to be divided into pages and then accessed by virtual memory address. However, eLink (in the FPGA) understands only physical addresses. This Verilog snippet does the memory translation:
assign ext_mem_access = (elink_dstaddr_tmp[31:28] == `VIRT_EXT_MEM) &
                        ~(elink_dstaddr_tmp[31:20] == `AXI_COORD);
assign elink_dstaddr_inb[31:28] = ext_mem_access ? `PHYS_EXT_MEM : elink_dstaddr_tmp[31:28];
assign elink_dstaddr_inb[27:0] = elink_dstaddr_tmp[27:0];
In later FPGA bitstream releases, an MMU was added to translate between physical and virtual addresses. However, using it is optional because of some bugs that stall the system.
Setting up eLink
The elink0 node is going to be used by our kernel module. In essence, it describes the eLink device on the FPGA fabric and sets up some basic things. Let’s see what’s there:
- the compatible property specifies the name of the device. It contains a string in the form <manufacturer>,<model>
- the interrupt properties: unlike address range translation, which follows the natural structure of the tree, interrupt signals can originate from and terminate on any device in a machine, so they are expressed as links between nodes independent of the tree. Those properties are used to describe interrupt connections:
  - interrupt-parent – a link to the interrupt controller, that is, the device that receives the interrupt. In this case, we use the Generic Interrupt Controller (Zynq TRM), a centralized resource for managing interrupts sent to the CPUs from the PS and PL (the intc node is defined in the zynq dtsi file)
  - interrupts – a property of a device node containing a list of interrupt specifiers, one for each interrupt output signal on the device. Frankly speaking, I haven’t figured out why the interrupt code is 55, but it’s related to the Vivado IRQ_F2P attribute (Zynq TRM)
- the clocks property describes the clock bindings. In our bindings, clocks represents a device with four clock inputs, named "fclk0…3". The fclk clocks are connected to outputs 15 to 18 of the clkc device and are routed into the PL
- the memory-region property references the shared DRAM we created before
&elink0 {
compatible = "adapteva,elink";
interrupts = <0 55 1>;
interrupt-parent = <&intc>;
clocks = <&clkc 15>, <&clkc 16>, <&clkc 17>, <&clkc 18>;
clock-names = "fclk0", "fclk1", "fclk2", "fclk3";
memory-region = <&emem>;
Now we need to let the system know which memory regions are going to be owned by the driver. We do it via the reg property, where:
- eLink registers are placed at 0x81000000 (allocated size: 1 MB)
- the mappable eMesh region is placed at 0x80000000 (allocated size: 256 MB). From what I could find, the base address is the AXI address, and the size is defined as 4 available eLinks * 64 cores (Epiphany comes in two flavors, 16 or 64 cores) * 1 MB per core = 256 MB
#address-cells = <2>; /* */
#size-cells = <2>; /* <#chip rows, #chip cols> */
reg = <0x81000000 0x100000>, <0x80000000 0x10000000>; // 1MB, 256MB
At this point we reserved the shared DRAM block but we didn’t pass that information to the driver, let’s do it here:
adapteva,mmu = <0x8e000000 0x3e000000 0x02000000>;
Hold on! Where is 0x8e000000 coming from? Isn’t that supposed to be 0x3e000000 as we declared before?
The physical address of this memory segment is 0x3e000000 - 0x3fffffff. To overcome some system limitations, this range is aliased to address 0x8e000000 - 0x8fffffff, as seen from the Epiphany side. For example, when a buffer of 8KB is allocated at offset 64KB on that segment, the host sees this buffer as occupying addresses 0x3e010000 - 0x3e012000. For accessing the buffer from the Epiphany program, this range is aliased to 0x8e010000 - 0x8e012000 (according to the SDK Manual).
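The aliasing boils down to swapping one base address for another. Here is a minimal sketch of that translation in plain C. It assumes the three cells of adapteva,mmu are the Epiphany-side address, the host physical address, and the size, which is how the text above reads them; the helper name and macros are hypothetical and not part of the driver.

```c
#include <stdint.h>

/* Values copied from the adapteva,mmu property (cell order is an assumption). */
#define EMEM_EPIPHANY_BASE 0x8e000000u /* segment start as seen by the Epiphany */
#define EMEM_HOST_BASE     0x3e000000u /* segment start as seen by the host */
#define EMEM_SIZE          0x02000000u /* 32 MB */

/*
 * Hypothetical helper: translate a host physical address that lies inside
 * the shared segment into its Epiphany-side alias.
 * Returns 0 when the address is outside the segment.
 */
static inline uint32_t host_to_epiphany(uint32_t host_addr)
{
    if (host_addr < EMEM_HOST_BASE || host_addr - EMEM_HOST_BASE >= EMEM_SIZE)
        return 0;
    return host_addr - EMEM_HOST_BASE + EMEM_EPIPHANY_BASE;
}
```

With the buffer from the example, host_to_epiphany(0x3e010000) gives 0x8e010000, which matches the range quoted from the SDK Manual.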
In the eMesh, we might have a few chips "linked together", and such a group is described as an array.
We’ll register a single 16-core chip array with the north-westmost core ID 0x808.
array0@808 {
compatible = "adapteva,chip-array";
reg = <0x808 1 1 1>;
};
};
The origin of the 808 core ID is nicely visualized below:
At last, we will set up the voltage regulator that can be used to control the elink TX speed based on the temperature:
vdd-supply = <&regulator_epiphany>;
regulator_epiphany: dcd1 {
regulator-name = "VDD_DSP";
regulator-min-microvolt = <900000>;
regulator-max-microvolt = <1200000>;
};
Kernel module
There are a few ways you can write device drivers in the Linux Kernel… In our case, it makes a lot of sense to use the platform device driver framework, since the Epiphany chip is not part of the Zynq SoC, but rather appears as an autonomous device in the system. A quote from the documentation:
Platform devices are devices that typically appear as autonomous entities in the system. This includes legacy port-based devices and host bridges to peripheral buses, and most controllers integrated into system-on-chip platforms. What they usually have in common is direct addressing from a CPU bus. Rarely, a platform_device will be connected through a segment of some other kind of bus; but its registers will still be directly addressable.
Platform devices are given a name, used in driver binding, and a list of resources such as addresses and IRQs.
Platform drivers follow the standard driver model convention, where discovery/enumeration is handled outside the drivers, and drivers provide probe() and remove() methods. They support power management and shutdown notifications using the standard conventions.
struct platform_driver {
int (*probe)(struct platform_device *);
int (*remove)(struct platform_device *);
void (*shutdown)(struct platform_device *);
int (*suspend)(struct platform_device *, pm_message_t state);
int (*suspend_late)(struct platform_device *, pm_message_t state);
int (*resume_early)(struct platform_device *);
int (*resume)(struct platform_device *);
struct device_driver driver;
};
The above structure is pretty clear, so for a start there is nothing more to do than implement the probe and remove functions.
NOTE: The kernel module is quite large, so for the purposes of this article, I am going to refer only to the important pieces (skip initialization code, etc.).
Building kernel
First, let’s fetch a recent kernel:
mkdir ~/workspace
curl -L "http://www.kernel.org/pub/linux/kernel/v5.x/linux-5.4.tar.xz" | tar -C ~/workspace -J -xf -
curl -L "http://www.kernel.org/pub/linux/kernel/v5.x/patch-5.4.8.xz" | unxz - > ~/workspace/patch-5.4.8
cd ~/workspace/linux-5.4
git apply --whitespace=nowarn ../patch-5.4.8
Our target is ARM board so we’ll cross-compile the kernel:
ARCH=arm CROSS_COMPILE=/usr/bin/arm-none-eabi- make UIMAGE_LOADADDR=0x8000 uImage modules dtbs
Kernel config
Defconfig is just a configuration file for the kernel build system. There is nothing special about the defconfig, so I would recommend using this one:
curl https://raw.githubusercontent.com/mkaczanowski/pkgbuilds/master/linux-parallella/config > ~/workspace/linux-5.4/.defconfig
In addition to standard hardware setup, there is one extra line we need to add in order to build the Epiphany module:
CONFIG_EPIPHANY=m
Also, we have to extend the Kconfig and Makefile:
# drivers/misc/Kconfig
+config EPIPHANY
+ tristate "Adapteva Epiphany device driver."
+ depends on OF
+ default n
+
+ help
+ The epiphany device driver
# drivers/misc/Makefile
+obj-$(CONFIG_EPIPHANY) += epiphany.o
Basic module structure
Let’s write the module structure with some debug messages:
#define DEBUG 1
#include <linux/module.h>
static int __init epiphany_module_init(void) {
pr_debug("epiphany module init\n");
return 0;
}
module_init(epiphany_module_init);
static void __exit epiphany_module_exit(void) {
pr_debug("epiphany module exit\n");
}
module_exit(epiphany_module_exit);
MODULE_DESCRIPTION("Adapteva Epiphany driver");
MODULE_VERSION("0.1");
MODULE_LICENSE("GPL");
Now compile, ship to the board and load the module:
ARCH=arm CROSS_COMPILE=/usr/bin/arm-none-eabi- make UIMAGE_LOADADDR=0x8000 modules
scp ./drivers/misc/epiphany.ko root@parallella:~/
insmod epiphany.ko && rmmod -f epiphany.ko
We should see debug messages in dmesg:
[root@parallella ~]# dmesg | grep epiphany
[ 87.065968] epiphany module init
[ 100.649254] epiphany module exit
Class
Taken from kernel labs:
A class is a high-level view of the Linux Device Model, which abstracts implementation details. For example, there are SCSI and ATA drivers, but all belong to the class of drives. Classes provide a grouping of devices based on functionality, not how they are connected or how they work. Classes have a correspondent in /sys/class.
There are two main structures that describe the classes: struct class and struct device. The class structure describes a generic class, while the structure struct device describes a class associated with a device. There are functions for initializing / deinitializing and adding attributes for each of these, described in include/linux/device.h.
The advantage of using classes is that the udev program in userspace allows the automatic creation of devices in the /dev directory based on class information. For example, we’d like to have /dev/epiphany/elink[0-9] created automatically based on the class struct.
static struct epiphany {
struct class class;
struct list_head elink_list;
struct list_head chip_array_list;
struct list_head mesh_list;
dev_t devt;
struct idr minor_idr;
spinlock_t minor_idr_lock;
/* For device naming */
atomic_t elink_counter;
atomic_t mesh_counter;
atomic_t array_counter;
} epiphany = {};
The above structure should be pretty clear, maybe with the exception of minor_idr, which hands out minor numbers. A minor number identifies a specific device within a class. The major number, on the other hand, identifies the general class of the device and is used by the kernel to look up the appropriate driver for this type of device.
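To make the major/minor split a bit more tangible, here is a small userspace sketch. It uses the glibc makedev, major and minor macros from sys/sysmacros.h, which are the userspace counterparts of the kernel's MKDEV, MAJOR and MINOR; the helper name is hypothetical, and the numbers you pass in are only examples because the kernel assigns the major dynamically.

```c
#include <sys/types.h>
#include <sys/sysmacros.h>

/*
 * Hypothetical helper: pack a major and a minor number into a dev_t,
 * the same way the kernel's MKDEV() does on its side.
 */
static inline dev_t elink_devt(unsigned int major_id, unsigned int minor_id)
{
    return makedev(major_id, minor_id);
}
```

Unpacking works the other way round: major(elink_devt(241, 3)) returns 241 and minor(elink_devt(241, 3)) returns 3, so every elink device shares the major and differs only in the minor.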
We need to initialize the class now:
static char *epiphany_devnode(struct device *dev, umode_t *mode) {
return kasprintf(GFP_KERNEL, "epiphany/%s", dev_name(dev));
}
static void epiphany_device_release(struct device *dev) {}
static void __init init_epiphany(void) {
epiphany.class.name = "epiphany";
epiphany.class.owner = THIS_MODULE;
epiphany.class.devnode = epiphany_devnode;
epiphany.class.dev_release = epiphany_device_release;
...
and register:
class_register(&epiphany.class);
alloc_chrdev_region(&epiphany.devt, 0, MINORMASK, "epiphany");
Once we load the module we should check if it all worked:
[root@parallella ~]# cat /proc/devices | grep epiphany
241 epiphany
[root@parallella ~]# find /sys/class -name 'epiphany'
/sys/class/epiphany
eLink driver
At this stage, our module is pretty useless; it only registers the class into the system. It would be nice to set up the actual eLink device, and for that, we’ll have to read the device tree!
static const struct of_device_id elink_of_match[] = {
{ .compatible = "adapteva,elink" },
{ }
};
MODULE_DEVICE_TABLE(of, elink_of_match);
static struct platform_driver elink_driver = {
.probe = elink_platform_probe,
.remove = elink_platform_remove,
.driver = {
.name = "elink",
.of_match_table = of_match_ptr(elink_of_match)
}
};
The above snippet will execute elink_platform_probe when adapteva,elink is present in the device tree. This way we can easily enable/disable the eLink driver by simply modifying the DT. Neat, isn’t it?
eLink device
Simply stated, the goal of elink_platform_probe is to translate the device tree elink node into the elink_device structure. That involves setting up the things mentioned in the elink0 DT node, such as:
- eLink control registers
- eMesh memory region
- shared memory region
- power regulator
- clocks
At first, we’ll define the elink_device structure. It will be a container for all the things we initialize, and in general, it’ll represent the actual device.
struct elink_device {
...
};
Control registers
The Epiphany chip has a set of registers used to configure the operating mode of the product. The link registers additionally have an offset that depends on the link in question, as follows:
- North link offset: 0x00200000
- East link offset: 0x08300000
- South link offset: 0x0c200000
- West link offset: 0x08000000
In order to write to these memory-mapped registers, the store transaction must be configured with a special control mode that allows the transaction to bypass the regular eMesh routing protocol. In our case, we use only the West link, so the control mode would have to be set to 13 (1101 in binary). You can find the actual list of registers in the elink_regmap.vh Verilog file.
So, we need to read the reg offset from the DT in order to calculate the correct address:
ret = of_address_to_resource(pdev->dev.of_node, 0, &res);
if (ret) {
dev_err(&pdev->dev, "no control reg resource\n");
return ERR_PTR(ret);
}
elink->regs_start = res.start;
elink->regs_size = resource_size(&res);
Similarly, we’ll do the same (with index + 1) for the eMesh:
ret = of_address_to_resource(pdev->dev.of_node, 1, &res);
Once we successfully read the values, we should let the kernel know that our driver will use the given range of I/O addresses. The request_mem_region function does that and will prevent other drivers from making any overlapping call to the same region through this function.
devm_request_mem_region(&pdev->dev, elink->regs_start, elink->regs_size, pdev->name)
devm_request_mem_region(&pdev->dev, elink->emesh_start, elink->emesh_size, pdev->name)
Like user space, the Kernel accesses memory through page tables; as a result, when kernel code needs to access memory-mapped I/O devices, it must first set up an appropriate kernel page-table mapping. The in-kernel tool for that job has long been ioremap(), which has a number of variants.
A successful call to ioremap() returns a kernel virtual address corresponding to the start of the requested physical address range. This address is not normally meant to be dereferenced directly, though, for a number of (often architecture-specific) reasons. Instead, accessor functions like readb() or iowrite32() should be used. To enforce this rule, the return address from ioremap() is annotated with the __iomem marker; that will cause the sparse checker to complain about accesses that do not use the proper functions. (source: lwn.net)
Given the above explanation, we’ll have to use kernel virtual memory to access the registers:
elink->regs = devm_ioremap_nocache(&pdev->dev, elink->regs_start, elink->regs_size);
and that is how we read the register via virtual address:
int offset = 0xF020C; // ELINK_VERSION
ioread32((u8 __iomem *) elink->regs + offset);
// example read execution
regs: 0xf0a00000 ;
offset: 0x000f020c;
address: 0xf0af020c;
Power regulator
Surprisingly, there is not a lot to be done here; we simply use the "Voltage and current API":
supply = devm_regulator_get_optional(&pdev->dev, "vdd");
Clocks
Same here, we utilize the "Common Clock Framework":
static const char * const names[] = {
"fclk0", "fclk1", "fclk2", "fclk3"
};
for (i = 0; i < ARRAY_SIZE(names); i++) {
elink->clocks[i] = devm_clk_get(&pdev->dev, names[i]);
...
}
Reserved memory
The reserved memory setup is pretty much the same as presented in the Control registers section, that is:
- read the address range from the memory-region DT node
- call devm_request_mem_region to reserve the memory
Register the device
Alrighty, the elink device seems to be initialized, so it’s time to register it. In other words, we would like to see the device file elink0 present under /dev/epiphany/ and perform various operations on it, like read or write.
Implementation-wise we should:
- implement file_operations (defined in linux/fs.h)
- get the first available minor number for the device
- add a char device with cdev_add
- probe for the elink device. This is to fetch: chip type, platform version, chip ID. We need this information to properly set up voltage, tx divider, and other chip-specific properties.
- finally register the device with device_register
The file operation functions are the essence of the driver, so we’ll discuss them in a separate section. For now, we are good with stubs:
static const struct file_operations elink_char_driver_ops = {
.owner = THIS_MODULE,
.open = char_open,
.release = char_release,
.mmap = elink_char_mmap,
.unlocked_ioctl = elink_char_ioctl
};
Character device creation goes as follows:
elink->minor = minor_get(elink); // wrapper around idr_alloc
devt = MKDEV(MAJOR(epiphany.devt), elink->minor);
cdev_init(&elink->cdev, &elink_char_driver_ops);
elink->cdev.owner = THIS_MODULE;
cdev_add(&elink->cdev, devt, 1);
And finally we can register the device:
elink->dev.class = &epiphany.class;
elink->dev.parent = NULL;
elink->dev.devt = devt;
dev_set_name(&elink->dev, "elink%d", atomic_inc_return(&epiphany.elink_counter) - 1);
device_register(&elink->dev);
The probe function includes a few reads:
elink_reset(); // see next section
elink->version = reg_read(elink->regs, ELINK_VERSION);
elink->coreid_pinout = reg_read(elink->regs, ELINK_CHIPID);
eLink reset
To reset the overall system, we ought to:
- clear the registers by asserting the RESET pin (setting the value of the register to 1 and then to 0)
- setup eLink TX
- setup eLink RX
union elink_reset {
u32 reg;
struct {
unsigned tx_reset:1;
unsigned rx_reset:1;
};
} __packed;
union elink_reset reset = {0};
reset.tx_reset = 1;
reset.rx_reset = 1;
reg_write(reset.reg, elink->regs, ELINK_RESET);
usleep_range(500, 600);
... // de-assert here
You might wonder why we write reset.reg but set the reset.{tx, rx} values. It’s a C "trick": the {tx,rx}_reset members are 1-bit bit-fields (see the colon) that share storage with reg through the union and are laid out next to each other without padding (the "__packed" qualifier rules out any extra padding). Thus the actual value of reg is 00000000000000000000000000000011 in binary.
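If you want to convince yourself, the trick is easy to reproduce outside the kernel. Below is a minimal userspace sketch; the union and the helper are hypothetical stand-ins for the driver's types. Keep in mind that the C standard leaves bit-field ordering to the implementation, so the result below assumes GCC or Clang on a little-endian target (which covers both the Zynq's ARM cores and a typical x86 workstation).

```c
#include <stdint.h>

/* Userspace stand-in for the driver's reset register layout. */
union elink_reset_demo {
    uint32_t reg;
    struct {
        unsigned tx_reset:1; /* bit 0 on little-endian GCC/Clang */
        unsigned rx_reset:1; /* bit 1 on little-endian GCC/Clang */
    };
};

/* Hypothetical helper: build the raw register word from the two flags. */
static inline uint32_t reset_word(unsigned tx, unsigned rx)
{
    union elink_reset_demo r = { 0 }; /* zero the whole 32-bit word first */

    r.tx_reset = tx & 1u;
    r.rx_reset = rx & 1u;
    return r.reg;
}
```

Under those assumptions reset_word(1, 1) returns 3, the same binary 11 as in the driver.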
Remember the shared memory access and the no-map DT attribute? Well, this is how we set up re-mapping on the eLink device:
rxcfg.remap_mode = 1;
rxcfg.remap_sel = 0xfe0;
rxcfg.remap_pattern = 0x3e0;
reg_write(rxcfg.reg, elink->regs, ELINK_RXCFG);
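To get a feeling for what remap_sel and remap_pattern do, here is a sketch of a select/pattern remap on the upper 12 address bits. Treat the formula as my assumption about how such a static remap behaves (bits chosen by the select mask are replaced with the pattern, the rest pass through); the authoritative logic lives in the eLink Verilog sources, and the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Assumed behaviour of the static remap: for each of the upper 12 address
 * bits, take the bit from `pattern` when the matching `sel` bit is set,
 * otherwise keep the incoming address bit. The lower 20 bits are untouched.
 */
static inline uint32_t elink_static_remap(uint32_t addr, uint32_t sel, uint32_t pattern)
{
    uint32_t hi = (addr >> 20) & 0xfffu; /* upper 12 bits */
    uint32_t lo = addr & 0x000fffffu;    /* offset within 1 MB */

    hi = (sel & pattern & 0xfffu) | (~sel & hi);
    return (hi << 20) | lo;
}
```

With sel = 0xfe0 and pattern = 0x3e0, the Epiphany-side address 0x8e010000 comes out as 0x3e010000, which is exactly the aliasing of the shared DRAM segment described earlier.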
The alternative is to use the eMMU, but it seems to be broken for now (causing the system to freeze):
rxcfg.mmu_enable = 1;
rxcfg.remap_mode = 0;
eLink file operations
This is an important section, if not the most important one. With file operations, we are going to implement the memory mapping (the mmap call) between the device and userspace. Being able to read/write data from the eCore is what we wanted in the first place!
Right, but first, we need to revisit the addressing. For instance, at what phys address do we keep data for core (0, 0)?
Nah, sorry I meant core (32, 8)?
Ah again, I meant core ID 808…
Damn! But what offset should I choose for EAST_ELINK?
Last try! What about kernel VM pages? I can’t refer to the phys address directly…
I am just playing with you, but as you see, the addressing might give us a headache. Let’s write a few helper functions!
Core ID
The Core ID is a number that identifies a core in the system. Each core is associated with a unique number that is related to the core’s coordinates in the global mesh. The ID is a 12-bit number where the 6 high order bits are the core row coordinate, and the 6 low order bits are the core column coordinates. This number also indicates the core’s 1MB slice in the global memory space, where it comprises the most significant bits of the core’s globally addressable space.
What is the core ID with coords (32, 8) – north-westmost?
row = 32 = 100000 (binary)
col = 8 = 001000 (binary)
core_id = 100000_001000 (12-bit) = 2056 (dec) = 0x808 (hex)
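The same packing can be written down as a couple of tiny helpers. This is just a sketch of the 6-bit row / 6-bit column layout described above; the function names are mine and are not taken from the driver.

```c
#include <stdint.h>

/* Pack (row, col) into a 12-bit core ID: row in bits 11..6, col in bits 5..0. */
static inline uint16_t core_id(unsigned row, unsigned col)
{
    return (uint16_t)(((row & 0x3fu) << 6) | (col & 0x3fu));
}

/* Extract the row coordinate (upper 6 bits). */
static inline unsigned core_row(uint16_t id)
{
    return (id >> 6) & 0x3fu;
}

/* Extract the column coordinate (lower 6 bits). */
static inline unsigned core_col(uint16_t id)
{
    return id & 0x3fu;
}
```

For the north-westmost core, core_id(32, 8) returns 0x808, and core_row(0x808) and core_col(0x808) give back 32 and 8.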
Core ID phys address
In order to calculate the phys address of each core (the previously mentioned 1 MB), we need to know where the chip array starts. Do you remember the reg = <0x808 1 1 1>; in the array DT node? That’s the start of our array!
What is the phys address for core (32, 9)?
core_id = 100000_001001 = 2057 (dec) = 0x809
rel_core_id = core_id - array_id = 0x809 - 0x808 = 1
// 1 MB shift
offs_from_array_start = rel_core_id << 20 = 0x00100000
// Adjust for offset from elink mem region (align by row)
offs = offs_from_array_start + (COL(array_id) << 20) = 0x00100000 + (8 << 20) = 0x00100000 + 0x00800000 = 0x00900000
// Phys addr in the eMesh
phys_addr = offs + elink->emesh_start = 0x80000000 + 0x00900000 = 0x80900000
The phys address range for core (32, 9), or (0, 1) in relative coordinates, starts at 0x80900000 and ends at 0x809fffff. However, we only have 32 KB per core, so the actual end address is going to be lower (0x80908000).
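The walk-through above translates almost one-to-one into code. The sketch below only mirrors the worked example (same steps, same constants); the function name and parameters are hypothetical, and the real driver takes more factors into account.

```c
#include <stdint.h>

/*
 * Sketch of the calculation above:
 *   core      - 12-bit core ID, e.g. 0x809
 *   array     - core ID of the north-westmost core of the array, e.g. 0x808
 *   emesh_start - start of the mappable eMesh region, e.g. 0x80000000
 */
static inline uint32_t core_phys_addr(uint32_t core, uint32_t array, uint32_t emesh_start)
{
    uint32_t rel_core_id = core - array;      /* position relative to the array */
    uint32_t offs = rel_core_id << 20;        /* 1 MB per core */

    offs += (array & 0x3fu) << 20;            /* column of the array start */
    return emesh_start + offs;
}
```

For core (32, 9), core_phys_addr(0x809, 0x808, 0x80000000) returns 0x80900000, the same value we derived by hand.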
Accessing virtual memory
The userspace program leverages the mmap system call to access the device memory. The requested memory is served in the form of pages, usually 4 KB in size.
Taking into account that we have 32 KB of SRAM per eCore, we’ll end up with 32 / 4 = 8 pages, so we need to provide some way to enumerate the pages.
For example, the phys address range of each page frame for core (32, 9) is:
- 0x80900000 – 0x80901000
- 0x80901000 – 0x80902000
- 0x80902000 – 0x80903000
- 0x80903000 – 0x80904000
- 0x80904000 – 0x80905000
- 0x80905000 – 0x80906000
- 0x80906000 – 0x80907000
- 0x80907000 – 0x80908000
With that explanation, let’s look into the elink_char_mmap function (a member of elink_char_driver_ops):
static int elink_char_mmap(struct file *file, struct vm_area_struct *vma)
A struct vm_area_struct is created at each mmap() call issued from user space. A driver that supports the mmap() operation must complete and initialize the associated struct vm_area_struct. The most important fields of this structure are:
- vm_start, vm_end – the beginning and the end of the memory area, respectively (these fields also appear in /proc/<pid>/maps)
- vm_file – the pointer to the associated file structure (if any)
- vm_pgoff – the offset of the area within the file
- vm_flags – a set of flags
- vm_ops – a set of working functions for this area
- vm_next, vm_prev – the areas of the same process are chained by a list structure
Let’s initialize the vma:
static const struct vm_operations_struct epiphany_vm_ops = {
.open = epiphany_vm_open,
.close = epiphany_vm_close,
.fault = epiphany_vm_fault
};
vma->vm_ops = &epiphany_vm_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
A word of explanation:
- pgprot_noncached will make the physical memory pages non-cacheable
- flags:
  - VM_PFNMAP – page ranges managed without "struct page", just pure PFN
  - VM_DONTEXPAND – cannot expand with mremap
  - VM_DONTDUMP – do not include in the core dump
VM fault
When the program tries to access a memory page that is not currently mapped into the virtual memory space of a process, the page fault exception is going to be triggered. We’ll need to write the fault handler to make that page accessible.
I admit the flow is going to be a bit long and quite convoluted since a lot of addressing is involved… but with an example, it’s going to be easier to follow.
So, imagine we want to read 32 KB of data from the eCore (32, 9), or (0, 1) in relative format.
// Step 1: Setup initial data
core_id = 0x809 // eCore (32, 9)
map_size = 0x00008000 // 32kb
phy_base = 0x80900000
The above should be quite clear at this point. Have you noticed we are still operating on phys addresses? Isn’t the userspace using the VM?
Yup, so we need to set up the mapping.
// Step 2: Setting up the mapping
dev_fd = open(EPIPHANY_DEV, O_RDWR | O_SYNC);
mapped_base = mmap(
NULL, // addr - let the kernel choose the (page-aligned) address at which to create the mapping
map_size, // 32kb of SRAM per eCore
PROT_READ|PROT_WRITE, // read/write operations allowed
MAP_SHARED, // self explanatory
dev_fd, // device file descriptor
phy_base // this is where we want to start the mapping from
);
The mapped_base is going to be some virtual address, say… 0xb6c29000. Every time you execute the program, you will likely get a different address, but for now, let us use that one.
// Step 3: Read from the buffer
unsigned char buf[map_size];
memcpy(buf, mapped_base, map_size); // read 32kb starting at 0xb6c29000 (virt address)
And that’s all when it comes to the userspace part. Now, what do you think should happen on the kernel side?
Yes, you’re right! We will get 8 fault handler executions that should load the pages. What pages, you might ask:
- 0xb6c29000 virt addr, 0x80900000 phys addr
- 0xb6c2a000 virt addr, 0x80901000 phys addr
- 0xb6c2b000 virt addr, 0x80902000 phys addr
- …
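The pairing of virtual and physical pages above follows a simple rule: find out which page of the mapping faulted and pick the page with the same index behind phy_base. Here is a userspace sketch of that arithmetic, assuming 4 KB pages; the macro and the function are hypothetical, and the real fault handler works with struct vm_fault rather than raw integers.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12u /* 4 KB pages, as assumed throughout the text */

/*
 * Given the start of the mapping (vm_start), the physical base passed to
 * mmap (phy_base) and the faulting virtual address, return the physical
 * address of the page that has to back the faulting page.
 */
static inline uint32_t fault_to_phys(uint32_t vm_start, uint32_t phy_base, uint32_t fault_addr)
{
    uint32_t page_index = (fault_addr - vm_start) >> DEMO_PAGE_SHIFT;

    return phy_base + (page_index << DEMO_PAGE_SHIFT);
}
```

A fault anywhere inside the second page, say at 0xb6c2a010, resolves to 0x80901000, which is the second entry of the list above.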
Since we know the exact mappings we should let the kernel know about them and we do it via:
// vm_fault_t vmf_insert_pfn(struct vm_area_struct * vma, unsigned long addr, unsigned long pfn)
vmf_insert_pfn(
vma,
(unsigned long)vmf->address,
phys_pfn
);
Almost there: the third argument is a page frame number, not a physical address, so how do we calculate it? In short, we take the phys addr and shift it right by PAGE_SHIFT (12 for 4 KB pages), so for instance, 0x80900000 >> 12 is 0x80900. It’s a bit more complicated though, if you take other factors into account, so please check the mesh_pfn_to_phys_pfn function for more information.
Troubleshooting
On Linux, we can look up the memory map of a process under /proc/$PID/maps. This is quite handy while troubleshooting kernel / user space mapping issues.
$ cat /proc/$PID/maps
address perms offset dev inode pathname
0xb6c29000-0xb6c31000 rw-s 80900000 00:06 7254 /dev/epiphany/mesh0
The above corresponds to what we described in the past 2 sections:
- address: 0xb6c29000-0xb6c31000 gives us 0x8000 (32 KB) of memory
- perms: rw-s = read / write / shared
- offset: 0x80900000 is the offset we used in mmap
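If you end up checking such mappings often, the interesting fields are easy to pull out programmatically. Below is a small sketch that parses the first four columns of a maps line with sscanf; the struct and function names are hypothetical, and only the columns discussed above are handled.

```c
#include <stdio.h>

/* The subset of a /proc/<pid>/maps line that we care about here. */
struct map_entry {
    unsigned long start;  /* first address of the mapping */
    unsigned long end;    /* first address past the mapping */
    unsigned long offset; /* offset passed to mmap */
    char perms[5];        /* e.g. "rw-s" plus the terminating NUL */
};

/* Parse one maps line; returns 0 on success and -1 on a malformed line. */
static inline int parse_maps_line(const char *line, struct map_entry *e)
{
    int n = sscanf(line, "%lx-%lx %4s %lx",
                   &e->start, &e->end, e->perms, &e->offset);

    return n == 4 ? 0 : -1;
}
```

For the mapping shown above, e.end - e.start comes out as 0x8000, e.offset as 0x80900000, and the last character of e.perms is 's' for a shared mapping.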
Summary
Uff, we have covered quite a lot! Dealing with hardware certainly is not the easiest task, but it is definitely worth understanding. Hopefully, you now have an idea of what role the kernel plays in the operating system, what registers are, and how to interact with the VM. If you’re interested in learning more, there are a few things we skipped in this article:
- eMesh device setup
- array clock gating
- device unregister
- DMA
- IOCTL
If you don’t have parallella board, there are still a few at DigiKey, so grab one before all are gone!