Release status
Kernel release status
The current stable 2.6 release is 2.6.12.4, which was
announced on August 5.
The current 2.6 prepatch is 2.6.13-rc6, released by Linus on
August 7. This prepatch contains a fix for recent aic7xxx performance
problems (so extra testing by people with the relevant hardware is being
requested), the removal of a few patches which caused regressions, and a
number of fixes. The long-format changelog
has the details.
Linus's git repository contains a very small number of fixes added since
-rc6. It appears that the August 12 to 19 time frame for 2.6.13
found in Andrew Morton's kernel
status report may be just about right.
The current -mm tree is 2.6.13-rc5-mm1. Recent
additions to -mm include a relayfs update, a new kzalloc()
function (see below), some debugging helpers from the realtime preemption
patch set, some architecture updates, and lots of fixes.
The current 2.4 prepatch is 2.4.32-pre3, released by Marcelo on August 8. This
prepatch adds a handful of fixes and a 2.6 serial ATA backport.
Kernel development news
Quote of the week
I have to say, with tcl/tk, "google" + "random typing" can make you
appear to know what the hell you're doing.
-- Linus Torvalds
Toward more robust network-based block I/O
One thing which came out of this year's Kernel Summit is that the kernel
still does not deal well with
network-based block devices when memory gets tight. If the system is full
of dirty memory, the kernel must write some of those dirty pages to their
backing store so that the memory may be reused. But the act of writing
that data over the network can require the allocation of more memory. Even
worse, completing network-based I/O requires the ability to receive the
acknowledgment packets back from the remote device. Not only does that
packet reception require memory, but the system must contend with the fact
that the network could also be the source of vast numbers of packets which
are completely unrelated to the problem at hand. If the system cannot find
a way to receive the packets it needs while ignoring unrelated packets,
extreme memory pressure will eventually lead to a lockup.
Solving this problem is hard. At the Summit, Linus suggested that it might
not even make sense to try; instead, users should be directed toward I/O
hardware which does not present this sort of problem. In reality, however,
Linux will do its best to support network-based block devices. Daniel
Phillips has recently been working on a patch which tries to make some
progress in that direction.
Like many before him, Daniel bases his approach on the use of preallocated
memory pools - a chunk of memory which is set aside for use when no other
memory is available. Daniel has tried to take things a little further by
quantifying how much memory should be set aside. To that end, each network
driver should, when an interface is brought up, make a call to:
int adjust_memalloc_reserve(int pages);
Where pages is the number of pages required to be able to continue
to receive packets on the given interface. A helper function,
estimate_skb_pages(), can come up with a guess for how many pages
will be required to hold a given number of packets with a specified maximum
size. The call to adjust_memalloc_reserve() will cause the
virtual memory subsystem to set aside the given number of pages for
emergency use by the driver. In this way, it is hoped, the system will
reserve a sufficient amount of memory without being overly wasteful.
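To give a feel for the arithmetic behind such an estimate, here is a userspace sketch. The function name estimate_pages_for_packets(), the 4096-byte page size, and the per-packet overhead figure are all assumptions made for this example; the real estimate_skb_pages() in Daniel's patch may well compute its guess differently.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE    4096ul  /* assumed page size */
#define SKETCH_SKB_OVERHEAD 256ul   /* assumed per-packet bookkeeping cost */

/*
 * Hypothetical estimate: how many whole pages are needed to hold
 * 'packets' packets of at most 'max_size' bytes each, assuming that
 * small packets are packed several to a page and that a large packet
 * occupies its own run of whole pages.
 */
unsigned long estimate_pages_for_packets(unsigned long packets,
                                         unsigned long max_size)
{
    unsigned long per_packet = max_size + SKETCH_SKB_OVERHEAD;
    unsigned long per_page;

    if (per_packet >= SKETCH_PAGE_SIZE) {
        /* Large packets: round each one up to whole pages. */
        unsigned long pages_each =
            (per_packet + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
        return packets * pages_each;
    }

    /* Small packets: several fit in one page; round the total up. */
    per_page = SKETCH_PAGE_SIZE / per_packet;
    return (packets + per_page - 1) / per_page;
}
```

A driver following the scheme described above would then hand a number of this sort to adjust_memalloc_reserve() when its interface comes up.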
Memory can be allocated from the reserve by adding the new
__GFP_MEMALLOC flag to the allocation request. A new networking
helper function, dev_memalloc_skb(), will use that flag if
necessary to obtain a packet. Before doing so, however, it checks a count
of packets allocated from the reserve; no interface is allowed to allocate
beyond a maximum count, which defaults to 50. Unlike previous versions of
the patch, the current code does not attempt to track which packets, in
particular, were allocated from reserve memory. Any packets which
originate from a given device will, when returned to the system, be
credited to that device's reserve.
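The accounting described here can be modeled in a few lines. What follows is a sketch only: the structure and function names are invented for illustration, and the patch's own bookkeeping is not shown in this article.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-interface accounting; field names are illustrative. */
struct iface_reserve {
    int in_use;   /* packets currently charged to the reserve */
    int max;      /* ceiling on reserve packets; the default cited is 50 */
};

/* Try to charge one packet to the reserve; refuse once the ceiling is hit. */
bool reserve_charge(struct iface_reserve *r)
{
    if (r->in_use >= r->max)
        return false;
    r->in_use++;
    return true;
}

/*
 * Credit a returned packet. No attempt is made to know whether this
 * particular packet came from the reserve: any packet from the device
 * reduces the count, which is never allowed to drop below zero.
 */
void reserve_credit(struct iface_reserve *r)
{
    if (r->in_use > 0)
        r->in_use--;
}
```

The point of counting rather than tracking individual packets is simplicity: the interface only needs to know how close it is to its limit, not which buffers pushed it there.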
A longstanding problem with the reserve approach is that, if one is not
careful, the reserve simply gets depleted and the system runs out of memory
anyway. In a situation where memory use is not entirely within the system's
control - when dealing with incoming network data, for example - this sort
of depletion is especially likely. Your system may be doing its best to
flush dirty pages to your home iSCSI array, but the network memory reserves
are full of incoming music being downloaded by your children, so the entire
system comes to a halt. Such an outcome may please the RIAA, but the
kernel developers are trying to satisfy a different audience.
Daniel's answer to this problem is to add a special flag to network sockets
which are involved in block I/O. Only sockets marked with
SOCK_MEMALLOC are entitled to use packet memory from the
reserves. When the packet arrives on the interface, the system cannot know
whether it is useful or not, so that packet must be received (possibly
using reserve memory) and fed into the system in
the usual way. The protocol code, however, is expected to check each
packet to see whether it comes from a device which is currently using
reserve memory. If so, and the packet does not belong to a suitably-marked
socket, that packet is to be dropped immediately. In this way, it is
hoped, the system will be able to focus its remaining resources on
recovering from its memory crunch.
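The check expected of the protocol code boils down to a two-part test. The sketch below models it with made-up stand-in types (sketch_dev, sketch_sock) and a hypothetical should_drop() helper; none of these names come from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the device and socket state involved. */
struct sketch_dev  { int reserve_in_use; };  /* >0: device is drawing on the reserve */
struct sketch_sock { bool memalloc; };       /* models the SOCK_MEMALLOC flag */

/*
 * Return true if the packet should be dropped: its device is currently
 * using reserve memory and the destination socket is not marked as
 * taking part in block I/O.
 */
bool should_drop(const struct sketch_dev *dev, const struct sketch_sock *sk)
{
    if (dev->reserve_in_use == 0)
        return false;          /* no memory crunch: deliver as usual */
    return !sk->memalloc;      /* under pressure: only marked sockets pass */
}
```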
This approach may have some promise. This patch needs some work, however,
before it is ready for serious stress testing. Once it has been worked
into shape, the patch can be applied to a suitably-equipped system, which
can then be pushed into a state of serious memory pressure. That point
has been the downfall of a number of other approaches to this problem;
whether Daniel's work is up to this test remains to be seen.
kzalloc()
The kernel code base is full of functions which allocate memory with
kmalloc(), then zero it with
memset(). Recently, Pekka
Enberg concluded that much of this code could be cleaned up by using
kcalloc() instead.
kcalloc() has this prototype:
void *kcalloc(size_t n, size_t size, unsigned int __nocast gfp_flags);
This function will allocate an array of n items, and will zero the
entire array before returning it to the caller. Pekka's patch converted a
number of kmalloc()/memset() pairs over to
kcalloc(), but that patch drew a
complaint from Andrew Morton:
Notice how every conversion you did passes in `1' in the first
argument? And that's going to happen again and again and again.
Each callsite needlessly passing that silly third argument, adding
more kernel text.
Very few callers actually need to allocate an array of items, so the extra
argument is unneeded in most cases. Each instance of that argument adds a
bit to the size of the kernel, and, over time, that space adds up. The
solution was to create yet another allocation function:
void *kzalloc(size_t size, unsigned int __nocast gfp_flags);
This function returns a single, zeroed item. It has been added to -mm,
with its appearance in the mainline likely to happen for 2.6.14.
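The relationship between the three idioms is easy to see in a userspace analogue. This sketch uses malloc() and calloc() in place of the kernel allocator and leaves out the gfp_flags argument entirely; the sketch_ names are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* The pattern being replaced: allocate, then clear by hand. */
void *alloc_then_clear(size_t size)
{
    void *p = malloc(size);

    if (p)
        memset(p, 0, size);
    return p;
}

/* Array-style helper in the spirit of kcalloc(): n items of 'size' bytes. */
void *sketch_kcalloc(size_t n, size_t size)
{
    return calloc(n, size);
}

/* Single-object helper in the spirit of kzalloc(): no count to pass. */
void *sketch_kzalloc(size_t size)
{
    return sketch_kcalloc(1, size);
}
```

Callers that want one zeroed object simply write sketch_kzalloc(sizeof(*obj)), with no constant 1 repeated at every call site.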
Time to merge GFS?
Red Hat recently announced that Fedora Core 4 was available with the
Global Filesystem (GFS).
Like Oracle's OCFS2, GFS allows a tightly-linked cluster to manage
filesystems stored on a shared disk. Now that GFS is actually shipping,
Red Hat would like to see it merged into the mainline kernel. Thus,
recently, David Teigland posted the patches for review and asked for
feedback. He got some.
One issue has to do with locking. Since the filesystem is kept on shared
storage, the nodes of the cluster must take care to avoid stepping on each
others' toes and corrupting things. The distributed lock manager (DLM)
subsystem is used to that end; whenever a node wishes to access a
particular block on the filesystem, it first obtains a cluster-wide lock on
that block. As long as the filesystem only supports the read()
and write() system calls, this locking works reasonably well. The
filesystem code can obtain the locks it needs, perform the operation, then
return the locks, and all works well.
The problem comes in when the filesystem supports mmap() as well.
Access to memory mapped with mmap() does not happen with the
read() and write() system calls; it is, instead, done
with regular memory operations. Locking in this case is handled in
conjunction with the virtual memory subsystem; the permissions on any
particular page are set to be consistent with the level of lock currently
held by the local node. If the node does not have a lock for a specific
block in the filesystem, the page table entry for the corresponding page
will show that page as being absent. If the process which made the mapping
tries to access the page, it will incur a page fault; the filesystem's
nopage() method can then set up the mapping, acquiring whatever
locks are required.
Page faults are asynchronous events. In particular, a page fault could
happen while the kernel is busy handling a read() or
write() operation somewhere else in the filesystem. In this case,
the kernel will be acquiring two independent locks in the filesystem, and
in an arbitrary order. It does not take much experience with locking to
learn that, when multiple locks are to be acquired, the order in which they
are taken is critical. Consider a case where there are two locks (call
them "A" and "B") and two processes needing them. Imagine that one process
acquires A, while the other acquires B. Each process then attempts to grab
the remaining lock. At this point, both processes will wait forever; this
situation is called an "ABBA deadlock." Contrary to what some may believe,
the term has nothing to do with 1970's Swedish rock bands.
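The textbook way out of an ABBA deadlock is to make every caller take the locks in one agreed order. The userspace sketch below orders two POSIX mutexes by address; the helper names are invented for this example, and a cluster filesystem sorting DLM locks has rather more to worry about than this.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Pick whichever of two locks comes first in the agreed order (by address). */
pthread_mutex_t *first_in_order(pthread_mutex_t *x, pthread_mutex_t *y)
{
    return ((uintptr_t)x <= (uintptr_t)y) ? x : y;
}

/*
 * Take both locks, always lowest address first, so that two callers
 * naming the same pair in opposite orders cannot each end up holding
 * one lock while waiting for the other.
 */
void lock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
    pthread_mutex_t *first = first_in_order(x, y);
    pthread_mutex_t *second = (first == x) ? y : x;

    pthread_mutex_lock(first);
    if (second != first)
        pthread_mutex_lock(second);
}

/* Release both locks; the order of release does not matter. */
void unlock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
    pthread_mutex_unlock(x);
    if (y != x)
        pthread_mutex_unlock(y);
}
```

This only works when all of the locks are known before the first one is taken, which is exactly what a page fault arriving in the middle of a read() or write() makes difficult.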
Avoiding this kind of deadlock requires a fair amount of ugly filesystem
trickery; Zach Brown put it this way:
So clustered file systems in Linux (GFS, Lustre, OCFS2, (GPFS?))
all walk vmas in their file->{read,write} to discover mappings that
belong to their files so that they can preemptively sort and
acquire the locks that will be needed to cover the mappings that
might be established in ->nopage. As you point out, this both
relies on the mappings not changing and gets very exciting when you
mix files and mappings between file systems that are each sorting
and acquiring their own DLM locks.
Sorting this situation out properly will probably require some sort of
support at the VFS layer. In that way, one hopes, a single, working
solution would be found. The alternative seems to be a bunch of brittle
and complicated code in each filesystem which has this problem.
Another glitch encountered by GFS is its support for "context-dependent
path names." These are, in essence, symbolic links with magic properties.
The GFS code, if it encounters "@hostname" as a component in a
symbolic link, will substitute the name of the current host. Similar
substitutions will happen for @mach, @os, @uid,
and others. There is also support for an alternative syntax
("{hostname}"), for whatever reason.
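The substitution itself is simple string work. The following userspace sketch handles only the "@hostname" case and assumes that the marker must make up an entire path component; the function name expand_component() is invented here, and the actual GFS code may match and expand these markers differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Expand one path component. If it is exactly "@hostname", copy the
 * supplied host name into 'out'; otherwise copy the component unchanged.
 * Returns 0 on success, -1 if 'out' is too small for the result.
 */
int expand_component(const char *comp, const char *host,
                     char *out, size_t outlen)
{
    const char *src = (strcmp(comp, "@hostname") == 0) ? host : comp;
    int n = snprintf(out, outlen, "%s", src);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With a helper like this, a link target of /private/@hostname/tmp would resolve to /private/node3/tmp on a host named node3, and to a different directory on every other node.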
This mechanism exists to allow cluster nodes to establish private areas on
a shared disk. It can also be used, for example, to create
architecture-specific directories full of binaries on a common path. In
the past, administrators have used automounter trickery to a very similar
end. The filesystem hackers, who do not like to see this sort of magic
buried within individual filesystems, suggest that bind mounts should be
used instead. That technique, however, is relatively cumbersome and
error-prone, so there is some interest in finding a way to maintain the
sort of functionality implemented by context-dependent links.
The objections to context-dependent links include the addition of magic to
parts of the filesystem namespace and the fact that they are specific to
one filesystem. Moving the resolution of these links up to the VFS layer
could be a part of the solution, since it would then at least function the
same way for all filesystems. Adding this kind of semantics may always be
a hard sell, however, since it changes the way Linux filesystems are
expected to behave. The old, automounter-based approach may end up being
the recommended technique for those needing this sort of behavior.
A realtime preemption overview
August 10, 2005
This article was contributed by Paul McKenney
There have been a considerable number of papers describing a number of
different aspects of and approaches to realtime, a few of which were
listed in the RESOURCES section of my "realtime
patch acceptance summary" from July.
However, there does not appear to be a similar description of the realtime
preemption (PREEMPT_RT) patch. This document attempts to fill this gap, using
the V0.7.52-16 version of this patch. However, please note that the
PREEMPT_RT patch evolves very quickly!
Philosophy of PREEMPT_RT
The key point of the PREEMPT_RT patch is to minimize the amount of kernel
code that is non-preemptible, while also minimizing the amount of code
that must be changed in order to provide this added preemptibility. In
particular, critical sections, interrupt handlers, and interrupt-disable
code sequences are normally preemptible. The PREEMPT_RT patch leverages
the SMP capabilities of the Linux kernel to add this extra preemptibility
without requiring a complete kernel rewrite. In a sense, one can loosely
think of a preemption as the addition of a new CPU to the system, and
then use the normal locking primitives to synchronize with any action
taken by the preempting task.
Note that this statement of philosophy should not be taken too literally;
for example, the PREEMPT_RT patch does not actually perform a CPU
hot-plug event for each preemption. Instead, the point is that the
underlying mechanisms used to tolerate (almost) unlimited preemption
are those that must be provided for SMP environments. More information
on how this philosophy is applied is given in the following sections.
Features of PREEMPT_RT
This section gives an overview of the features that the PREEMPT_RT
patch provides.
- Preemptible critical sections
- Preemptible interrupt handlers
- Preemptible "interrupt disable" code sequences
- Priority inheritance for in-kernel spinlocks and semaphores
- Deferred operations
- Latency-reduction measures
Each of these topics is covered in the following sections.
Preemptible critical sections
In PREEMPT_RT, normal spinlocks (spinlock_t and rwlock_t) are
preemptible, as are RCU read-side critical sections (rcu_read_lock()
and rcu_read_unlock()). Semaphore critical sections are preemptible,
but they already are in both PREEMPT and non-PREEMPT kernels (but more
on semaphores later). This preemptibility means that you can block
while acquiring a spinlock, which in turn means that it is illegal to
acquire a spinlock with either preemption or interrupts disabled (the
one exception to this rule being the _trylock variants, at least as long
as you don't repeatedly invoke them in a tight loop). This also means
that spin_lock_irqsave() does -not- disable hardware interrupts when
used on a spinlock_t.
Quick Quiz #1: How can semaphore critical sections be preempted in
a non-preemptible kernel?
So, what to do if you need to acquire a lock when either interrupts
or preemption are disabled? You use a raw_spinlock_t instead of
a spinlock_t, but continue invoking spin_lock() and friends on
the raw_spinlock_t. The PREEMPT_RT patch includes a set of macros
that cause spin_lock() to act like a C++ overloaded function -- when
invoked on a raw_spinlock_t, it acts like a traditional spinlock, but
when invoked on a spinlock_t, its critical section can be preempted.
For example, the various _irq primitives (e.g., spin_lock_irqsave())
disable hardware interrupts when applied to a raw_spinlock_t, but do not
when applied to a spinlock_t. However, use of raw_spinlock_t (and its
rwlock_t counterpart, raw_rwlock_t) should be the exception, not the rule.
These raw locks should not be needed outside of a few low-level areas,
such as the scheduler, architecture-specific code, and RCU.
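The "overloading" effect can be demonstrated in a few lines of standard C. The sketch below uses C11's _Generic selection purely as a stand-in: the PREEMPT_RT macros themselves are not shown in this article and must use some other compiler mechanism, since the patch predates C11. All of the sketch_ names and the toy lock types are invented for the example.

```c
#include <assert.h>

/* Toy lock types; each just records which implementation handled it. */
typedef struct { int last_op; } sketch_spinlock_t;      /* preemptible flavor */
typedef struct { int last_op; } sketch_raw_spinlock_t;  /* traditional flavor */

enum { OP_NONE, OP_SLEEPING_LOCK, OP_RAW_LOCK };

void sleeping_lock(sketch_spinlock_t *l)   { l->last_op = OP_SLEEPING_LOCK; }
void raw_lock(sketch_raw_spinlock_t *l)    { l->last_op = OP_RAW_LOCK; }

/* One name, two implementations, chosen at compile time by argument type. */
#define sketch_spin_lock(l) _Generic((l),          \
        sketch_spinlock_t *:     sleeping_lock,    \
        sketch_raw_spinlock_t *: raw_lock)(l)
```

Because the choice is made from the lock's declared type, converting a lock from one flavor to the other means changing its declaration only; the call sites stay as they are.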
Since critical sections can now be preempted, you cannot rely on a
given critical section executing on a single CPU -- it might move
to a different CPU due to being preempted. So, when you are using
per-CPU variables in a critical section, you must separately handle
the possibility of preemption, since spinlock_t and rwlock_t are
no longer doing that job for you. Approaches include:
- Explicitly disable preemption, either through use of
get_cpu_var(), preempt_disable(), or disabling hardware
interrupts.
- Use a per-CPU lock to guard the per-CPU variables. One
way to do this is by using the new DEFINE_PER_CPU_LOCKED()
primitive -- more on this later.
Since spin_lock() can now sleep, an additional task state was added.
Consider the following code sequence (supplied by Ingo Molnar):
spin_lock(&mylock1);
current->state = TASK_UNINTERRUPTIBLE;
spin_lock(&mylock2); // [*]
blah();
spin_unlock(&mylock2);
spin_unlock(&mylock1);
Since the second spin_lock() call can sleep, it can clobber the value
of current->state, which might come as quite a surprise to the blah()
function. The new TASK_RUNNING_MUTEX bit is used to allow the scheduler
to preserve the prior value of current->state in this case.
Although the resulting environment can be a bit unfamiliar, it
permits critical sections to be preempted with minimal code changes,
and allows the same code to work in the PREEMPT_RT, PREEMPT, and
non-PREEMPT configurations.
Preemptible interrupt handlers
Almost all interrupt handlers run in process context in the PREEMPT_RT
environment. Although any interrupt can be marked SA_NODELAY to cause it
to run in interrupt context, only the fpu_irq, irq0, irq2, and lpptest
interrupts have SA_NODELAY specified. Of these, only irq0 (the per-CPU
timer interrupt) is normally used -- fpu_irq is for floating-point
co-processor interrupts, and lpptest is used for interrupt-latency
benchmarking. Note that software
timers (add_timer() and friends) do not run in hardware interrupt
context; instead, they run in process context and are fully preemptible.
Note that SA_NODELAY is not to be used lightly, as it can greatly degrade
both interrupt and scheduling latencies. The per-CPU timer interrupt
qualifies due to its tight tie to scheduling and other core kernel
components. Furthermore, SA_NODELAY interrupt handlers must be coded
very carefully, as noted in the following paragraphs; otherwise, you
will see oopses and deadlocks.
Since the per-CPU timer interrupt (e.g., scheduler_tick()) runs in
hardware-interrupt context, any locks shared with process-context
code must be raw spinlocks (raw_spinlock_t or raw_rwlock_t), and,
when acquired from process context, the _irq variants must be used,
for example, spin_lock_irqsave(). In addition, hardware interrupts
must typically be disabled when process-context code accesses per-CPU
variables that are shared with the SA_NODELAY interrupt handler, as
described in the following section.
Preemptible "interrupt disable" code sequences
The concept of preemptible interrupt-disable code sequences may seem
to be a contradiction in terms, but it is important to keep in mind the
PREEMPT_RT philosophy. This philosophy relies on the SMP capabilities
of the Linux kernel to handle races with interrupt handlers, keeping in
mind that most interrupt handlers run in process context. Any code
that interacts with an interrupt handler must be prepared to deal with
that interrupt handler running concurrently on some other CPU.
Therefore, spin_lock_irqsave() and related primitives need not disable
preemption. The reason this is safe is that if the interrupt handler
runs, even if it preempts the code holding the spinlock_t, it will block
as soon as it attempts to acquire that spinlock_t. The critical section
will therefore still be preserved.
However, local_irq_save() still disables preemption, since there is no
corresponding lock to rely on. Using locks instead of local_irq_save()
therefore can help reduce scheduling latency, but substituting locks in
this manner can reduce SMP performance, so be careful.
Code that must interact with SA_NODELAY interrupts cannot use
local_irq_save(), since this does not disable hardware interrupts.
Instead, raw_local_irq_save() should be used. Similarly, raw spinlocks
(raw_spinlock_t, raw_rwlock_t, and raw_seqlock_t) need to be used when
interacting with SA_NODELAY interrupt handlers. However, raw spinlocks
and raw interrupt disabling should -not- be used outside of a few
low-level areas, such as the scheduler, architecture-dependent code,
and RCU.
Priority inheritance for in-kernel spinlocks and semaphores
Realtime programmers are often concerned about priority inversion, which
can happen as follows:
- Low-priority task A acquires a resource, for example, a lock.
- Medium-priority task B starts executing CPU-bound, preempting
low-priority task A.
- High-priority task C attempts to acquire the lock held by
low-priority task A, but blocks because of medium-priority
task B having preempted low-priority task A.
Such priority inversion can indefinitely delay a high-priority task.
There are two main ways to address this problem: (1) suppressing
preemption and (2) priority inheritance. In the first case, since there
is no preemption, task B cannot preempt task A, preventing priority
inversion from occurring. This approach is used by PREEMPT kernels
for spinlocks, but not for semaphores. It does not make sense to
suppress preemption for semaphores, since it is legal to block while
holding one, which could result in priority inversion even in the absence
of preemption. For some realtime workloads, preemption cannot be
suppressed even for spinlocks, due to the impact to scheduling latencies.
Priority inheritance can be used in cases where suppressing preemption
does not make sense. The idea here is that high-priority tasks
temporarily donate their high priority to lower-priority tasks that
are holding critical locks. This priority inheritance is transitive:
in the example above, if an even higher priority task D attempted to
acquire a second lock that high-priority task C was already holding,
then both tasks C and A would be temporarily boosted to the priority
of task D. The duration of the priority boost is also sharply limited:
as soon as low-priority task A releases the lock, it will immediately
lose its temporarily boosted priority, handing the lock to (and being
preempted by) task C.
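The boosting and its transitivity can be modeled with a toy data structure. The sketch below is a simplified illustration, not the rt_mutex code: the names are invented, larger numbers are assumed to mean higher priority, and a task releasing a lock simply drops back to its base priority (a real implementation must also account for waiters on any other locks that task still holds).

```c
#include <assert.h>
#include <stddef.h>

struct pi_lock;

struct pi_task {
    int base_prio;              /* priority the task was given */
    int eff_prio;               /* priority after any boosting */
    struct pi_lock *blocked_on; /* lock this task waits for, or NULL */
};

struct pi_lock {
    struct pi_task *owner;      /* current holder, or NULL */
};

/* Record that 'waiter' blocks on 'lock', boosting the chain of owners. */
void pi_block(struct pi_task *waiter, struct pi_lock *lock)
{
    struct pi_task *t;

    waiter->blocked_on = lock;
    /* Walk from the owner to the owner of whatever it waits on, and so on. */
    for (t = lock->owner; t != NULL;
         t = t->blocked_on ? t->blocked_on->owner : NULL) {
        if (t->eff_prio >= waiter->eff_prio)
            break;              /* already at least as important */
        t->eff_prio = waiter->eff_prio;
    }
}

/* Release 'lock': the owner immediately loses any boost. */
void pi_release(struct pi_lock *lock)
{
    if (lock->owner) {
        lock->owner->eff_prio = lock->owner->base_prio;
        lock->owner = NULL;
    }
}
```

Running the example from the text through this model, task C blocking on A's lock raises A to C's priority, and task D then blocking on a lock held by C raises both C and A to D's priority.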
However, it may take some time for task C to run, and it is quite possible
that another higher-priority task E will try to acquire the lock in the
meantime. If this happens, task E will "steal" the lock from task C,
which is legal because task C has not yet run, and has therefore not
actually acquired the lock. On the other hand, if task C gets to run
before task E tries to acquire the lock, then task E will be unable to
"steal" the lock, and must instead wait for task C to release it, possibly
boosting task C's priority in order to expedite matters.
In addition, there are some cases where locks are held for extended
periods. A number of these have been modified to add "preemption points"
so that the lock holder will drop the lock if some other task needs it.
The JBD journaling layer contains a couple of examples of this.
It turns out that writer-to-reader priority inheritance is particularly
problematic, so PREEMPT_RT simplifies the problem by permitting only
one task at a time to read-hold a reader-writer lock or semaphore,
though that task is permitted to recursively acquire it. This makes
priority inheritance doable, though it can limit scalability.
Quick Quiz #2: What is a simple and fast way to implement priority
inheritance from writers to multiple readers?
In addition, there are some cases where priority inheritance is
undesirable for semaphores, for example, when the semaphore is being
used as an event mechanism rather than as a lock (you can't
tell who will post the event before the fact, and therefore have no
idea which task to priority-boost). There are compat_semaphore and
compat_rw_semaphore variants that may be used in this case. The various
semaphore primitives (up(), down(), and friends) may be used on either
compat_semaphore or semaphore, and, similarly, the reader-writer
semaphore primitives (up_read(), down_write(), and friends) may be used
on either compat_rw_semaphore or rw_semaphore. Often, however, the
completion mechanism is a better tool for this job.
So, to sum up, priority inheritance prevents priority inversion, allowing
high-priority tasks to acquire locks and semaphores in a timely manner,
even if the locks and semaphores are being held by low-priority tasks.
PREEMPT_RT's priority inheritance provides transitivity, timely removal
of inheritance, and the flexibility required to handle cases when high
priority tasks suddenly need locks earmarked for low-priority tasks.
The compat_semaphore and compat_rw_semaphore declarations can be used
to avoid priority inheritance for semaphores for event-style usage.
Deferred operations
Since spin_lock() can now sleep, it is no longer legal to invoke it while
preemption (or interrupts) are disabled. In some cases, this has been
solved by deferring the operation requiring the spin_lock() until
preemption has been re-enabled:
In all of these situations, the solution is to defer an action until
that action may be more safely or conveniently performed.
Latency-reduction measures
There are a few changes in PREEMPT_RT whose primary purpose is to reduce
scheduling or interrupt latency.
The first such change involves the x86 MMX/SSE hardware. This hardware
is handled in the kernel with preemption disabled, and this sometimes
means waiting until preceding MMX/SSE instructions complete. Some
MMX/SSE instructions are no problem, but others take overly long amounts
of time, so PREEMPT_RT refuses to use the slow ones.
The second change applies per-CPU variables to the slab allocator,
as an alternative to the previous wanton disabling of interrupts.
Summary of PREEMPT_RT primitives
This section gives a brief list of primitives that are either added
by PREEMPT_RT or whose behavior is significantly changed by PREEMPT_RT.
Locking Primitives
- spinlock_t
Critical sections are preemptible. The _irq operations
(e.g., spin_lock_irqsave()) do -not- disable hardware
interrupts. Priority inheritance is used to prevent
priority inversion. An underlying rt_mutex is used
to implement spinlock_t in PREEMPT_RT (as well as
to implement rwlock_t, struct semaphore, and struct
rw_semaphore).
- raw_spinlock_t
Special variant of spinlock_t that offers the traditional
behavior, so that critical sections are non-preemptible
and _irq operations really disable hardware interrupts.
Note that you should use the normal primitives (e.g.,
spin_lock()) on raw_spinlock_t. That said, you shouldn't
be using raw_spinlock_t -at- -all- except deep within
architecture-specific code or low-level scheduling and
synchronization primitives. Misuse of raw_spinlock_t
will destroy the realtime aspects of PREEMPT_RT.
You have been warned.
- rwlock_t
Critical sections are preemptible. The _irq operations
(e.g., write_lock_irqsave()) do -not- disable hardware
interrupts. Priority inheritance is used to prevent
priority inversion. In order to keep the complexity of
priority inheritance down to a dull roar, only one task
may read-acquire a given rwlock_t at a time, though that
task may recursively read-acquire the lock.
- RW_LOCK_UNLOCKED(mylock)
The RW_LOCK_UNLOCKED macro now takes the lock itself as
an argument, which is required for priority inheritance.
Unfortunately, this makes its use incompatible with the
PREEMPT and non-PREEMPT kernels. Uses of RW_LOCK_UNLOCKED
should therefore be changed to DEFINE_RWLOCK().
- raw_rwlock_t
Special variant of rwlock_t that offers the traditional
behavior, so that critical sections are non-preemptible
and _irq operations really disable hardware interrupts.
Note that you should use the normal primitives (e.g.,
read_lock()) on raw_rwlock_t. That said, as with
raw_spinlock_t, you shouldn't be using raw_rwlock_t -at-
-all- except deep within architecture-specific code or
low-level scheduling and synchronization primitives.
Misuse of raw_rwlock_t will destroy the realtime aspects
of PREEMPT_RT. You have once again been warned.
- seqlock_t
Critical sections are preemptible. Priority inheritance
has been applied to the update side (the read-side
cannot be involved in priority inversion, since seqlock_t
readers do not block writers).
- SEQLOCK_UNLOCKED(name)
The SEQLOCK_UNLOCKED macro now takes the lock itself as
an argument, which is required for priority inheritance.
Unfortunately, this makes its use incompatible
with the PREEMPT and non-PREEMPT kernels. Uses of
SEQLOCK_UNLOCKED should therefore be changed to use
DECLARE_SEQLOCK(). Note that DECLARE_SEQLOCK() defines
the seqlock_t and initializes it.
- struct semaphore
The struct semaphore is now subject to priority
inheritance.
- down_trylock()
This primitive can schedule, so cannot be invoked with
hardware interrupts disabled or with preemption disabled.
However, since almost all interrupts run in process
context with both preemption and interrupts enabled,
this restriction has no effect thus far.
- struct compat_semaphore
A variant of struct semaphore that is -not- subject to
priority inheritance. This is useful for cases when
you need an event mechanism, rather than a sleeplock.
- struct rw_semaphore
The struct rw_semaphore is now subject to priority
inheritance, and only one task at a time may read-hold.
However, that task may recursively read-acquire the
rw_semaphore.
- struct compat_rw_semaphore
A variant of struct rw_semaphore that is -not- subject
to priority inheritance. Again, this is useful for cases
when you need an event mechanism, rather than a sleeplock.
Quick Quiz #3: Why can't event mechanisms use priority
inheritance?
Per-CPU Variables
- DEFINE_PER_CPU_LOCKED(type, name)
- DECLARE_PER_CPU_LOCKED(type, name)
Define/declare a per-CPU variable with the specified
type and name, but also define/declare a corresponding
spinlock_t. If you have a group of per-CPU variables
that you want to be protected by a spinlock, you can
always group them into a struct.
- get_per_cpu_locked(var, cpu)
Return the specified per-CPU variable for the specified
CPU, but only after acquiring the corresponding spinlock.
- put_per_cpu_locked(var, cpu)
Release the spinlock corresponding to the specified
per-CPU variable for the specified CPU.
- per_cpu_lock(var, cpu)
Returns the spinlock corresponding to the specified
per-CPU variable for the specified CPU, but as an lvalue.
This can be useful when invoking a function that takes
as an argument a spinlock that it will release.
- per_cpu_locked(var, cpu)
Returns the specified per-CPU variable for the specified
CPU as an lvalue, but without acquiring the lock,
presumably because you have already acquired the lock
but need to get another reference to the variable.
Or perhaps because you are making an RCU-read-side
reference to the variable, and therefore do not need
to acquire the lock.
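The get/put pairing described above can be imitated in userspace. This sketch is only an analogue: it uses a POSIX mutex where the patch uses a spinlock_t, assumes four CPUs, and all of its names (locked_counter, get_counter_locked(), put_counter_locked()) are invented rather than taken from the PREEMPT_RT macros.

```c
#include <assert.h>
#include <pthread.h>

#define SKETCH_NR_CPUS 4   /* assumed CPU count for the example */

/* A per-CPU slot pairing the variable with the lock that guards it. */
struct locked_counter {
    pthread_mutex_t lock;
    long value;
};

struct locked_counter counters[SKETCH_NR_CPUS] = {
    { PTHREAD_MUTEX_INITIALIZER, 0 }, { PTHREAD_MUTEX_INITIALIZER, 0 },
    { PTHREAD_MUTEX_INITIALIZER, 0 }, { PTHREAD_MUTEX_INITIALIZER, 0 },
};

/* Lock the slot for 'cpu' and hand back a pointer to its variable. */
long *get_counter_locked(int cpu)
{
    pthread_mutex_lock(&counters[cpu].lock);
    return &counters[cpu].value;
}

/* Release the lock taken by get_counter_locked() for the same 'cpu'. */
void put_counter_locked(int cpu)
{
    pthread_mutex_unlock(&counters[cpu].lock);
}
```

Since the lock travels with the data, a task that is preempted and migrated in the middle of an update still holds the lock for the slot it started on, which is the property that disabling preemption used to provide.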
Interrupt Handlers
- SA_NODELAY
Used in the struct irqaction to specify that the
corresponding interrupt handler should be directly invoked
in hardware-interrupt context rather than being handed
off to an irq thread. The function redirect_hardirq()
does the wakeup, and the interrupt-processing loop may
be found in do_irqd().
Note that SA_NODELAY should -not- be used for normal
device interrupts: (1) this will degrade both interrupt
and scheduling latency and (2) SA_NODELAY interrupt
handlers are much more difficult to code and maintain
than are normal interrupt handlers. Use SA_NODELAY
only for low-level interrupts (such as the timer tick)
or for hardware interrupts that must be processed with
extreme realtime latencies.
- local_irq_enable()
- local_irq_disable()
- local_irq_save(flags)
- local_irq_restore(flags)
- irqs_disabled()
- irqs_disabled_flags()
- local_save_flags(flags)
-
The local_irq*() functions do not actually disable
hardware interrupts; instead, they simply disable
preemption. These are suitable for use with normal
interrupts, but not for SA_NODELAY interrupt handlers.
However, it is usually even better to use locks (possibly
per-CPU locks) instead of these functions for PREEMPT_RT
environments -- but please also consider the effects on
SMP machines using non-PREEMPT kernels!
- raw_local_irq_enable()
- raw_local_irq_disable()
- raw_local_irq_save(flags)
- raw_local_irq_restore(flags)
- raw_irqs_disabled()
- raw_irqs_disabled_flags()
- raw_local_save_flags(flags)
-
These functions disable hardware interrupts, and are
therefore suitable for use with SA_NODELAY interrupts
such as the scheduler clock interrupt (which, among
other things, invokes scheduler_tick()).
These functions are quite specialized, and should only
be used in low-level code such as the scheduler,
synchronization primitives, and so on. Keep in mind
that you cannot acquire normal spinlock_t locks while
under the effects of raw_local_irq*().
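The difference between the two families can be pictured with a toy state machine. This is a user-space sketch under simplifying assumptions, not kernel code: the two state variables and all model_*() functions are hypothetical, and the model ignores nesting of the raw variants.

```c
/* Toy model only: none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>

static bool hw_irqs_enabled = true;	/* models the CPU's interrupt mask */
static int preempt_disable_depth;	/* models the preemption-off nesting */

/* Under PREEMPT_RT, local_irq_disable() is modeled as preemption-off only. */
static void model_local_irq_disable(void) { preempt_disable_depth++; }
static void model_local_irq_enable(void)  { preempt_disable_depth--; }

/* The raw variants are modeled as really masking hardware interrupts. */
static void model_raw_local_irq_disable(void) { hw_irqs_enabled = false; }
static void model_raw_local_irq_enable(void)  { hw_irqs_enabled = true; }

/*
 * An SA_NODELAY handler runs in hardware-interrupt context, so only
 * the hardware mask can hold it off.
 */
static bool model_nodelay_handler_can_run(void)
{
	return hw_irqs_enabled;
}

/*
 * A threaded handler is an ordinary task: it needs the interrupt to be
 * delivered and also needs preemption to be permitted.
 */
static bool model_threaded_handler_can_preempt(void)
{
	return hw_irqs_enabled && preempt_disable_depth == 0;
}
```

In this model, model_local_irq_disable() keeps a threaded handler from preempting the current code but leaves model_nodelay_handler_can_run() true, which is why code that must exclude an SA_NODELAY handler needs the raw variants.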
Miscellaneous
- wait_for_timer()
-
Wait for the specified timer to expire. This is
required because timers run in process context in the PREEMPT_RT
environment, and can therefore be preempted, and can
also block, for example during spinlock_t acquisition.
- smp_send_reschedule_allbutself()
-
Sends a reschedule IPI to all other CPUs. This is used in
the scheduler to quickly find another CPU to run a newly
awakened realtime task that is high priority, but not
sufficiently high priority to run on the current CPU.
This capability is necessary to do the efficient global
scheduling required for realtime. Non-realtime tasks
continue to be scheduled in the traditional per-CPU
manner, sacrificing some priority exactness for greater
efficiency and scalability.
- INIT_FS(name)
-
This now takes the name of the variable as an argument so
that the internal rwlock_t can be properly initialized
(given the need for priority inheritance).
- local_irq_disable_nort()
- local_irq_enable_nort()
- local_irq_save_nort(flags)
- local_irq_restore_nort(flags)
- spin_lock_nort(lock)
- spin_unlock_nort(lock)
- spin_lock_bh_nort(lock)
- spin_unlock_bh_nort(lock)
- BUG_ON_NONRT()
- WARN_ON_NONRT()
-
These do nothing (or almost nothing) in PREEMPT_RT, but
have the normal effect in other environments. These
primitives should not be used outside of low-level code
(e.g., in the scheduler, synchronization primitives,
or architecture-specific code).
- spin_lock_rt(lock)
- spin_unlock_rt(lock)
- in_atomic_rt()
- BUG_ON_RT()
- WARN_ON_RT()
-
Conversely, these have the normal effect in PREEMPT_RT,
but do nothing in other environments. Again, these
primitives should not be used outside of low-level code
(e.g., in the scheduler, synchronization primitives,
or architecture-specific code).
- smp_processor_id_rt(cpu)
-
This returns "cpu" in the PREEMPT_RT environment, but
acts the same as smp_processor_id() in other environments.
This is intended for use only in the slab allocator.
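One way to picture the _nort and _rt families is as a pair of wrappers selected at build time. The sketch below is a user-space illustration, not the definitions used by the patch: MODEL_PREEMPT_RT, the two counters, and every model_*() name are hypothetical.

```c
/* Illustration only: none of these names exist in the kernel. */
#include <assert.h>

static int irq_disable_calls;	/* counts "real" interrupt disables */
static int rt_lock_calls;	/* counts lock acquisitions done only for RT */

static void model_irq_disable(void) { irq_disable_calls++; }
static void model_rt_lock(void)     { rt_lock_calls++; }

/* Build with -DMODEL_PREEMPT_RT to model a PREEMPT_RT configuration. */
#ifdef MODEL_PREEMPT_RT
/* _nort primitives do nothing when PREEMPT_RT is configured... */
#define model_irq_disable_nort()	do { } while (0)
/* ...while _rt primitives have their normal effect. */
#define model_lock_rt()			model_rt_lock()
#else
/* Without PREEMPT_RT the roles are reversed. */
#define model_irq_disable_nort()	model_irq_disable()
#define model_lock_rt()			do { } while (0)
#endif

/* A code path that must behave differently in the two configurations. */
static void model_critical_path(void)
{
	model_irq_disable_nort();
	model_lock_rt();
}
```

Built without MODEL_PREEMPT_RT, a call to model_critical_path() bumps only irq_disable_calls; built with it, only rt_lock_calls changes. The call site itself stays free of #ifdef clutter, which is the appeal of this style in low-level code.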
PREEMPT_RT configuration options
High-Level Preemption-Option Selection
- PREEMPT_NONE selects the traditional no-preemption case for
server workloads.
- PREEMPT_VOLUNTARY enables voluntary preemption points, but
not wholesale kernel preemption. This is intended
for desktop use.
- PREEMPT_DESKTOP enables voluntary preemption points along with
non-critical-section preemption (PREEMPT). This is
intended for low-latency desktop use.
- PREEMPT_RT enables full preemption, including critical sections.
Feature-Selection Configuration Options
- PREEMPT enables non-critical-section kernel preemption.
- PREEMPT_BKL causes big-kernel-lock critical sections to be
preemptible.
- PREEMPT_HARDIRQS causes hardirqs to run in process context,
thus making them preemptible. However, the irqs
marked as SA_NODELAY will continue to run in hardware
interrupt context.
- PREEMPT_RCU causes RCU read-side critical sections to be
preemptible.
- PREEMPT_SOFTIRQS causes softirqs to run in process context,
thus making them preemptible.
Debugging Configuration Options
These are subject to change, but give a rough idea of the
sorts of debug features available within PREEMPT_RT.
- CRITICAL_PREEMPT_TIMING measures the maximum time that the
kernel spends with preemption disabled.
- CRITICAL_IRQSOFF_TIMING measures the maximum time that the
kernel spends with hardware irqs disabled.
- DEBUG_IRQ_FLAGS causes the kernel to validate the "flags"
argument to spin_unlock_irqrestore() and similar
primitives.
- DEBUG_RT_LOCKING_MODE enables runtime switching of spinlocks
from preemptible to non-preemptible. This is useful
to kernel developers who want to evaluate the overhead
of the PREEMPT_RT mechanisms.
- DETECT_SOFTLOCKUP causes the kernel to dump the current stack
trace of any process that spends more than 10 seconds
in the kernel without rescheduling.
- LATENCY_TRACE records function-call traces representing
long-latency events. These traces may be read
out of the kernel via /proc/latency_trace. It is
possible to filter out low-latency traces via
/proc/sys/kernel/preempt_thresh.
This config option is extremely useful when tracking
down excessive latencies.
- LPPTEST enables a device driver that performs parallel-port
based latency measurements, such as used by Kristian
Benoit for measurements posted on LKML in June 2005.
Use scripts/testlpp.c to actually run this test.
- PRINTK_IGNORE_LOGLEVEL causes -all- printk() messages to be
dumped to the console. Normally a very bad idea, but
helpful when other debugging tools fail.
- RT_DEADLOCK_DETECT finds deadlock cycles.
- RTC_HISTOGRAM generates data for latency histograms for applications
using /dev/rtc.
- WAKEUP_TIMING measures the maximum time, in microseconds,
from when a high-priority thread is awakened to the time
it actually starts running. The result is accessed
from /proc/sys/kernel/wakeup_timing, and the test may
be restarted via:
echo 0 > /proc/sys/kernel/preempt_max_latency
Some unintended side-effects of PREEMPT_RT
Because the PREEMPT_RT environment relies heavily on Linux being coded
in an SMP-safe manner, use of PREEMPT_RT has flushed out a number of
SMP bugs in the Linux kernel, including some timer deadlocks,
lock omissions in ns83820_tx_timeout() and friends, an ACPI-idle
scheduling latency bug, a core networking locking bug, and a number
of preempt-off-needed bugs in the block IO statistics code.
Quick quiz answers
Quick Quiz #1: How can semaphore critical sections be preempted in
a non-preemptible kernel?
Strictly speaking, preemption simply does not happen in a
non-preemptible kernel (e.g., non-CONFIG_PREEMPT). However,
roughly the same thing can occur due to things like page
faults while accessing user data, as well as via explicit
calls to the scheduler.
Quick Quiz #2: What is a simple and fast way to implement priority
inheritance from writers to multiple readers?
If you come up with a way of doing this, I expect that Ingo
Molnar will be very interested in learning about it. However,
please check the LKML archives before getting too excited, as this
problem is extremely non-trivial, there are no known solutions,
and it has been discussed quite thoroughly. In particular, when
thinking about writer-to-reader priority boosting, consider the
case where a reader-writer lock is read-held by numerous readers,
and each reader is blocked attempting to write-acquire some
other reader-writer lock, each of which again is read-held by
numerous readers. Of course, the time required to boost (then
un-boost) all these readers counts against your scheduling latency.
Of course, one solution would be to convert the offending
code sequences to use RCU. ;-) [Sorry, couldn't resist!!!]
Quick Quiz #3: Why can't event mechanisms use priority inheritance?
There is no way for Linux to figure out which task to boost.
With sleeping locks, the task that acquired the semaphore would
presumably be the task that will release it, so that is the task
whose priority gets boosted. In contrast, with events, any
task might do the up() that awakens the high-priority task.
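The same ownership distinction is visible in the POSIX user-space API, which serves here as an analogy rather than as kernel code. A mutex can be given the PTHREAD_PRIO_INHERIT protocol because the thread that locks it is its owner; sem_init() accepts no comparable attribute, since a semaphore has no owner to boost. The helper names make_pi_mutex() and pi_mutex_roundtrip() are hypothetical, and the sketch assumes a platform that supports PTHREAD_PRIO_INHERIT.

```c
/* User-space analogy only; the helper names are hypothetical. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Create a mutex that requests priority inheritance; returns 0 on success. */
static int make_pi_mutex(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int ret = pthread_mutexattr_init(&attr);

	if (ret)
		return ret;
	ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	if (ret == 0)
		ret = pthread_mutex_init(m, &attr);
	pthread_mutexattr_destroy(&attr);
	return ret;
}

/* Lock and unlock a priority-inheritance mutex; returns 0 on success. */
static int pi_mutex_roundtrip(void)
{
	pthread_mutex_t m;
	int ret = make_pi_mutex(&m);

	if (ret)
		return ret;
	/* While the lock is held, the calling thread is the known owner. */
	ret = pthread_mutex_lock(&m);
	if (ret == 0)
		ret = pthread_mutex_unlock(&m);
	pthread_mutex_destroy(&m);
	return ret;
}
```

A nonzero return from pi_mutex_roundtrip() is an errno-style code, most likely indicating that the platform does not support the inheritance protocol.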
[Thanks to Ingo Molnar for his thorough review of a previous draft of this
document.]
Patches and updates
Device drivers
- dmitry pervushin: spi.
(August 8, 2005)
Page editor: Jonathan Corbet