Sunday, May 21, 2006

[L4] Speeding up L4Linux task creation

It's L4Linux once again. As mentioned last time, L4Linux loses performance whenever it needs services provided by other L4 servers, since it then has to call out to them. One example is task creation, which I tried to improve recently. Let's first have a look at what L4Linux does when a user-space application calls fork():
  1. Allocate an L4 task from the task server. At initialization the task server requested all available tasks from the resource manager (RMGR), but only to learn how many tasks exist in the system; it returned them to RMGR immediately afterwards.
  2. Setup Linux-internal task data.
  3. Call the task server once again to start the task.
  4. The task server now calls RMGR to really start the task.
The same goes for terminating a task: once again RMGR is called to kill the task, with the task server acting as a proxy. So what is this task server good for anyway? - Well, it provides ownership management for tasks. This could also be integrated into RMGR, saving the indirection, but no one has done so up to now.
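
Sketched in C, the original fork() path looks roughly like this. Mind that task_alloc(), task_start() and setup_linux_task_struct() are hypothetical placeholders for the real task server stubs, not their actual names:

    /* Hypothetical sketch of the original L4Linux fork() path.
     * task_alloc() and task_start() stand in for the real
     * task server IPC stubs; the actual interface differs. */

    #include <l4/sys/types.h>   /* l4_taskid_t, l4_umword_t */

    extern l4_taskid_t task_alloc(void);               /* IPC to task server */
    extern void task_start(l4_taskid_t task,
                           l4_umword_t eip,
                           l4_umword_t esp);           /* IPC to task server */
    extern void setup_linux_task_struct(l4_taskid_t task);  /* local, no IPC */

    static l4_taskid_t l4linux_fork(l4_umword_t eip, l4_umword_t esp)
    {
        l4_taskid_t task = task_alloc();  /* IPC 1: allocate a task        */
        setup_linux_task_struct(task);    /* Linux-internal bookkeeping    */
        task_start(task, eip, esp);       /* IPC 2: task server ...        */
                                          /* IPC 3: ... forwards to RMGR,  */
                                          /* which runs l4_task_new()      */
        return task;
    }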

A look at what the task server and RMGR actually do to start a task shows that it boils down to a single l4_task_new() system call. So the first idea for improvement is to let L4Linux issue this call itself. I therefore added a new allocate_chief() call to the task server interface. It allocates a task for the client and makes the client the task's chief, which is necessary because only a task's chief is allowed to start and stop it.
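
With allocate_chief() the picture simplifies: L4Linux obtains the task (and chief rights) once and can then start the task with l4_task_new() directly. A minimal sketch, assuming an argument-less allocate_chief() stub; the l4_task_new() parameters follow the L4 version 2 API:

    #include <l4/sys/types.h>
    #include <l4/sys/syscalls.h>   /* l4_task_new() */

    extern l4_taskid_t allocate_chief(void);   /* assumed stub signature */

    static l4_taskid_t l4linux_fork_fast(l4_umword_t eip, l4_umword_t esp,
                                         l4_threadid_t pager)
    {
        /* One IPC to the task server; we become the task's chief. */
        l4_taskid_t task = allocate_chief();

        /* As chief we may start the task ourselves --
         * no task server/RMGR round trip is needed. */
        return l4_task_new(task, /* mcp = */ 0xff, esp, eip, pager);
    }

Terminating a task works analogously: as chief, L4Linux can recycle the task with another l4_task_new() call instead of going through the servers.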

This alone speeds up task creation and termination by around 10%, since we save one L4Linux -> task server -> RMGR call chain for both starting and stopping a task. A further improvement is to cache unused tasks: when a task terminates, we do not return it to the task server, but put it into a cache instead, from which it can later be restarted as a new L4Linux task. This removes all interaction with L4 servers from the start/stop path of a task. All in all, the new task management and task caching result in a 20% performance increase for task creation.
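
The cache itself can be as simple as a small free list. A sketch, assuming L4Linux stays chief of every cached task; sizes and names are mine:

    #include <l4/sys/types.h>

    #define TASK_CACHE_SIZE 32

    static l4_taskid_t task_cache[TASK_CACHE_SIZE];
    static unsigned    task_cache_count;

    /* Called on task exit: keep the task instead of returning it
     * to the task server. Returns 0 if the cache is full and the
     * caller must really free the task. */
    static int task_cache_put(l4_taskid_t task)
    {
        if (task_cache_count >= TASK_CACHE_SIZE)
            return 0;
        task_cache[task_cache_count++] = task;
        return 1;
    }

    /* Called on fork(): reuse a cached task if one is available.
     * Returns 0 on a miss; the caller then falls back to
     * allocate_chief(). */
    static int task_cache_get(l4_taskid_t *task)
    {
        if (task_cache_count == 0)
            return 0;
        *task = task_cache[--task_cache_count];
        return 1;
    }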

Saturday, May 13, 2006

[L4] L4Linux performance issues

L4Linux is a paravirtualized version of the Linux kernel running on top of the L4 microkernel family. For my diploma thesis I am evaluating L4Linux against native Linux to find out where the former's performance problems come from.

One of the problems is that Linux applications run as L4 user-space apps alongside the kernel. Every system call issued by an application leads to an IPC to the Linux server, which then answers back, once again via IPC. This means that two context switches are involved in each system call, resulting in a much larger number of TLB and cache misses than on native Linux.
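
In code, the path of a system call looks roughly as below, using the L4 version 2 IPC API. The message layout is invented here - the real L4Linux syscall ABI differs - but the point is the blocking call/reply, which costs two context switches per system call:

    #include <l4/sys/types.h>
    #include <l4/sys/ipc.h>

    extern l4_threadid_t linux_server;   /* the L4Linux kernel thread */

    static long l4linux_syscall(l4_umword_t nr, l4_umword_t arg)
    {
        l4_umword_t ret, dummy;
        l4_msgdope_t result;

        /* Switch to the Linux server, block for its answer,
         * switch back: two context switches per system call. */
        l4_ipc_call(linux_server,
                    L4_IPC_SHORT_MSG, nr, arg,
                    L4_IPC_SHORT_MSG, &ret, &dummy,
                    L4_IPC_NEVER, &result);
        return (long)ret;
    }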

I measured this by performing a nearly-null system call, sys_getpid(). From Linux user space its average execution time is around 260 cycles on my test computer (an AMD Duron 800 MHz with 256 MB RAM, Linux 2.6.16 booted from a ramdisk). Performing the same task on the same computer with the same ramdisk setup in L4Linux takes around 3,700 cycles per call to sys_getpid().
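
For reference, a minimal version of such a benchmark on native Linux; the same loop runs unchanged under L4Linux. It goes through syscall(SYS_getpid) to force a real kernel entry, in case the C library caches the PID:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Read the x86 time stamp counter. */
    static inline unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        enum { N = 100000 };
        unsigned long long start, end;
        int i;

        start = rdtsc();
        for (i = 0; i < N; i++)
            syscall(SYS_getpid);   /* force a real kernel entry */
        end = rdtsc();

        printf("avg cycles per getpid(): %llu\n", (end - start) / N);
        return 0;
    }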

I then counted cache and TLB misses for both setups and learned that for 100,000 calls to sys_getpid() there were around 200 TLB misses in native Linux - probably from the points where my benchmark was interrupted by some other app. On L4Linux there were about 6 TLB misses for each system call - considerably more. These misses delay the execution of L4Linux system calls, because the caches need to be refilled.

However, this is not the only source of performance loss. Losing 3,500 cycles per system call is not a lot when you see that a blocking sys_read() needs 2.7 million cycles on average. There are other sources, for instance the points where L4Linux needs to use L4 system services to get work done. I will discuss this in another post soon.

Conclusion: Context switches for system calls reduce L4Linux performance.

Solutions have already been proposed:
  • Processors with tagged TLBs do not need to flush their TLBs on context switches and will therefore see fewer TLB misses for system calls.
  • Cache coloring can be used to reduce overlapping between L4Linux and its applications, so that both do not thrash each other's caches while running in parallel.
  • Small address spaces are a concept for running multiple applications inside the same virtual address space, so that no context switch is needed for Linux system calls.