[SGVLUG] Microsoft Reports OSS Unix Beats Windows XP

Chris Smith cbsmith at gmail.com
Mon Nov 14 14:04:45 PST 2005


On 11/13/05, Dustin <laurence at alice.caltech.edu> wrote:
> On Sun, 13 Nov 2005, Chris Smith wrote:
> > On 11/13/05, Dustin <laurence at alice.caltech.edu> wrote:
> > > The last sentence is key.  Yes, you can get higher throughput with "real"
> > > GC (actually, I think ref counting is as "real" as any other method,
> > > unless reaping cycles is fundamental to your definition of GC reality),
> >
> > Hence why I tend to call it "poor man's GC" ;-), although that's a tad
> > unfair as there are circumstances where ref counting is the ideal
> > choice.
>
> I guess I would argue that if you want to make the degree of
> bulletproofing part of the definition, why stop there?

Hey, you're getting me wrong here. I'm agreeing with you that it's a
form of GC, just not a particularly clever one (which is orthogonal as
to whether it might be best for a particular situation).

> > The real benchmarks for this stuff are run in real world scenarios,
> > and so they very much take this in to account. Indeed, you'll find a
> > lot of the higher quality automatic memory management systems
> > specifically have a first generation sized to fit in cache.
>
> It doesn't help as much as advertised:
>
> http://www.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf
>
> concludes via a very clever test rig that you have to occupy about 5x the
> amount of memory needed in order for (their test programs) to run
> comparably (or a bit faster--at least sometimes the usual
> back-of-the-envelope calculation seems to work) to manual deallocation.
> When you don't have that much, it looked like order of magnitude penalties
> compared to manual allocation is realistic.

Hehe. I see we read the same papers. ;-)

That paper has several shortcomings that limit its usefulness, which
I actually pointed out to the folks who designed the test. In
particular, the "clever test rig" really removes it from reality.
Basically, they set up the runtime so that it's completely
deterministic (i.e. threads always get scheduled in a specific order
and context switches always happen at a specific time). Then they
analyze a run to determine the last point in the code where an
object was touched, and insert a delete at that point. This is then
presented as the "what if we were using manual memory management"
example. However, what they've effectively created is an ideal "mark"
automatic memory manager with a sub-optimal "sweep": it runs in zero
time (all the analysis is done statically before the program is run),
and it identifies unused memory even before it is no longer reachable,
just so long as it will no longer be used. The only potentially
sub-optimal aspect of the system is the "sweep" phase, which isn't
able to coalesce the deletes. So, they compare this to a real world GC
algorithm (or as close as you can get, since this is on a research
VM)... one whose design worries about a number of concerns that
they've eliminated from the system. But most importantly, they've
basically created a comparison between two different GC algorithms,
rather than real world manual memory management and real world
automatic memory management. Not surprisingly, the real world GC
algorithm performs rather poorly, as it is operating with several
disadvantages, and the only aspect of its design where it has an
advantage is coalescing the deletes.
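
To make that concrete, here's a tiny hypothetical C fragment of the
kind of thing their oracle effectively produces: the free() lands at
the point of last *use*, even though a pointer to the object is still
reachable afterwards, which is not where a human (or a tracing
collector) would normally reclaim it.

    /* Hypothetical illustration of the paper's "oracle": free() is
     * inserted at the point of last *use*, not at the point where the
     * object becomes unreachable. */
    #include <stdio.h>
    #include <stdlib.h>

    struct record { int value; };

    int main(void)
    {
        struct record *r = malloc(sizeof *r);
        if (!r)
            return 1;
        r->value = 42;

        printf("%d\n", r->value);   /* last use of *r */
        free(r);                    /* the oracle frees it right here... */

        /* ...even though 'r' is still in scope (i.e. the object is
         * still reachable) for the rest of the function.  A tracing
         * collector could not reclaim it yet, and a human would
         * typically free at a structural point such as end of scope,
         * not at the exact last use. */
        return 0;
    }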

In the real world, particularly with kernels, you don't have a
completely deterministic system. This is why reference counting is so
common in kernels. In the real world, you actually have to devote a
significant amount of time just trying to determine if it's the right
time to do a manual deletion (indeed, this is where reference counting
systems can be beaten by mark-and-sweep collectors, because they
impose a significant constant overhead with each acquisition and
release of a reference in order to determine when it's safe to delete
an object). You also end up having to do significant amounts of
synchronization while you are doing this. That stuff turns out to
be very expensive, particularly in increasingly sophisticated modern
systems.
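
To give a feel for where that constant overhead comes from, here is a
minimal sketch of manual reference counting with C11 atomics
(hypothetical names, loosely in the style of the kernel's kref): every
piece of code that so much as looks at the object pays for an atomic
read-modify-write on acquire and another on release.

    /* Minimal reference-counting sketch (hypothetical API).  Every
     * acquire and release is an atomic read-modify-write on memory
     * shared by all users of the object. */
    #include <stdatomic.h>
    #include <stdlib.h>

    struct object {
        atomic_int refcount;
        /* ... payload ... */
    };

    static void obj_get(struct object *obj)
    {
        atomic_fetch_add_explicit(&obj->refcount, 1, memory_order_relaxed);
    }

    static void obj_put(struct object *obj)
    {
        /* Release/acquire ordering so the thread that drops the last
         * reference sees all prior writes before freeing. */
        if (atomic_fetch_sub_explicit(&obj->refcount, 1,
                                      memory_order_release) == 1) {
            atomic_thread_fence(memory_order_acquire);
            free(obj);
        }
    }

    static void use_object(struct object *obj)
    {
        obj_get(obj);   /* an atomic RMW just to look at the object */
        /* ... read-only work with obj ... */
        obj_put(obj);   /* and another one on the way out */
    }

    int main(void)
    {
        struct object *obj = calloc(1, sizeof *obj);
        if (!obj)
            return 1;
        atomic_init(&obj->refcount, 1);   /* creator holds one reference */
        use_object(obj);
        obj_put(obj);                     /* drop the creator's reference */
        return 0;
    }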

> I also love their point that nobody seems to have ever measured
> malloc/free pauses against GC pauses (a case of the no-GC crowd liking to
> assume that they are nearly no-ops).  I believe it's usually a lot better,
> but it would be nice to test that since free() can do quite a lot of
> unexpected work.

Actually, there has been a lot of work here, and it's worth noting
that "GC pauses" don't necessarily exist at all. It very much depends
on the algorithm.

> In the interests of full disclosure, it might be worth saying that for a
> lot of this stuff I'm of the assume-the-worst-case camp.  I always liked
> heapsort better than quicksort because paying a 2x penalty seemed worth
> paying for guaranteed N lg N behavior. :-)  So you can see why I consider
> the usual arguments for GC to be at least slight lies if they don't
> discuss *all* kinds of safety.  Sudden extreme pathological behavior
> beyond what an explicit allocator would suffer under heavy loads seems
> like a recipe for deploying something that works wonderfully until the day
> the company needs it the most.  I regard that as at least as hazardous as
> tracking down free() bugs (which, while very bad, at least tend to offer a
> warning in the form of steadily growing memory footprints).

Your pessimism seems to be a bit selective though. Imagine the joy of
a system with say 100 cores, with each core receiving thousands of
interrupts per second, and in the middle of this all the cores are
acquiring read-only references to some global object. Now, a GC system
is basically going to impose no overhead here, but a reference
counting system is going to require atomic updates of the counter for
that object 100x, in the midst of all kinds of interrupts that have to
be handled. The net effect is that this could easily make the system
run at 1/1000th of its expected speed. What should take maybe 10
cycles is likely going to expend thousands, if not tens of thousands,
of cycles (actually, now that I think about it, it could be
substantially worse than even that).
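
A rough way to see the effect without writing kernel code is a
hypothetical user-space microbenchmark along these lines: the single
cache line holding the counter bounces between cores on every atomic
operation, whereas with a tracing collector the readers would touch
nothing shared at all.

    /* Hypothetical microbenchmark: N threads repeatedly taking and
     * dropping "references" to one shared object.  All the contention
     * is on the one cache line holding the counter. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define ITERS    1000000L

    static atomic_long shared_refcount = 1;

    static void *reader(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++) {
            atomic_fetch_add(&shared_refcount, 1);  /* "get" */
            /* ... pretend to read the shared object ... */
            atomic_fetch_sub(&shared_refcount, 1);  /* "put" */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, reader, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        printf("final refcount: %ld\n", atomic_load(&shared_refcount));
        return 0;
    }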

There are, unfortunately, pathological cases for both strategies. So
this isn't so much a case of choosing between "slow but no
pathological cases" and "fast except for pathological cases". It's a
situation where
you *are* going to have pathological cases, and the best you can do is
hope to minimize their frequency and impact.

> > A lot of kernels don't ever swap out their heap, so this need not be a
> > concern in kernel space. In general the small size of kernel heaps
> > tends to minimize the impact of any memory overhead, although one
> > would be a fool to ignore that kernel heaps are kept small for very,
> > very good reasons.
>
> Mmm.  Again, I can't claim to know a great deal about kernel allocation
> patterns, but if you need 5x/N the footprint (where N is the usual
> fragmentation penalty with a good malloc()--I'm not sure what it is,
> actually, but I don't think it is 5x, so I'm assuming that here) to get
> good performance then I am uncomfortable of having it in there.

Yeah, though you don't need anywhere close to 5x. Worst case you need 2x.
I'm sure their concurrent mark-and-sweep algorithm keeps it much lower
than that.

> Again you may laugh at my belief that a collector is less predictable, but
> how many realtime systems use GC?

Well, for hard realtime systems I suspect the answer is pretty close
to zero unless you are counting reference counting as GC. However, for
the more common soft realtime systems, you'll find the number is
surprisingly high.

It's worth noting that in general, realtime systems tend to eschew
complexity wherever possible (for obvious reasons). As such, aside
from the realtime scheduling needs, their systems tend to be far
simpler than the "average". I get the impression that this kernel was
designed for more complex systems, where realtime developers would
likely throw up their hands and give up.

> (There's a very good reason why all functional languages
> (I am aware of) are fully GC'd.)

Well, functional languages have a serious problem with the notion of
state, so it'd be hard to imagine having manual memory management in
that context (what, use a monad for each chunk of memory?!).

> > > For my own purposes, the times I care most about kernel performance is for
> > > responsiveness, and last I checked "real-time" and "garbage collection"
> > > didn't belong in the same sentence, and even making it tolerable requires
> > > a great deal of sophistication.
> >
> > Check again. ;-) They actually have ways of doing this now.
>
> It's a fair point, but saying it's been done isn't the same as saying it
> was a great plan you'd recommend to your friends. :-)

It's a great plan I'd recommend to friends with lots of CPU cores. ;-)

Note that the paper itself delineates a number of significant
advantages that come from automated memory management, and quantifies
the "pauses" as being around 100ms. While not ideal, that's about the
impact of having to page in some memory and I suspect it most often
occurs while the CPUs are otherwise blocked while waiting for IO
anyway.

> > So here's a thought experiment for you... think about the direction
> > CPUs and systems are taking. More and more cores. More and more
>
> > Now tell me that reference counting kernels provides predictable
> > response rates. ;-)
>
> This is an excellent point because my reasoning is quite
> uniprocessor-biased (darn it, your collector wouldn't work well on a 6502,
> so what good is it?!?), and I agree that ref counting should suffer a
> great deal more in an SMP world.

It actually applies pretty well even in uniprocessor cases where you
have a high number of interrupts being thrown (welcome to the hell
that is the modern operating system kernel). Either you design a
system with horrid response rates to interrupts (read: forget about
multimedia or high performance computing) or you end up with a ton of
synchronization points in your kernel around reference-count
tracking.

> If you have so many cores you can just (effectively) give one to the collector
> permanently, it might well be that fancy collection wins in almost all scenarios.

The kernel in Singularity uses a concurrent mark-and-sweep collector
(presumably compacting as well), so this is in fact what they are
doing. They don't say "parallel" explicitly, but it's possible it
might even be able to have multiple cores doing collection at once.
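
For anyone who hasn't stared at one of these, here is a toy
stop-the-world mark-and-sweep in C (hypothetical, single-threaded,
non-moving). A concurrent collector like Singularity's does
essentially these same two phases, just interleaved with the running
program and, given enough cores, on a core of its own.

    /* Toy stop-the-world mark-and-sweep (hypothetical, single-threaded,
     * non-moving).  Real concurrent collectors interleave these phases
     * with running code, which is where all the hard parts live. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        struct node *left, *right;   /* outgoing references */
        struct node *next_alloc;     /* intrusive list of all allocations */
        int marked;
    };

    static struct node *all_allocs;  /* every allocated node, live or dead */
    static struct node *roots[16];   /* simplified root set */

    static struct node *gc_alloc(void)
    {
        struct node *n = calloc(1, sizeof *n);
        if (!n)
            abort();
        n->next_alloc = all_allocs;
        all_allocs = n;
        return n;
    }

    static void mark(struct node *n)
    {
        if (!n || n->marked)
            return;
        n->marked = 1;
        mark(n->left);
        mark(n->right);
    }

    static void sweep(void)
    {
        struct node **link = &all_allocs;
        while (*link) {
            struct node *n = *link;
            if (n->marked) {
                n->marked = 0;          /* clear for the next cycle */
                link = &n->next_alloc;
            } else {
                *link = n->next_alloc;  /* unlink and reclaim */
                free(n);
            }
        }
    }

    static void collect(void)
    {
        for (int i = 0; i < 16; i++)
            mark(roots[i]);
        sweep();
    }

    int main(void)
    {
        roots[0] = gc_alloc();
        roots[0]->left = gc_alloc();
        gc_alloc();                     /* unreachable: reclaimed below */
        collect();
        puts("collected");
        return 0;
    }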

> OTOH, the gaps between CPU speeds, memory speeds, and storage speeds are
> increasing, which is worse for automatic collection.

Actually, it's the other way around. If every time you free up memory
you pay a huge performance penalty while you sync with memory, it's
much better to coalesce these operations so that the penalty can be
amortized across multiple deallocations. This also means you have
more and more essentially "unused" CPU cycles, which can instead be
coalesced together and spent on an increasingly sophisticated memory
manager.
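
This is also why even hand-rolled manual schemes tend to defer and
batch their frees; a hypothetical sketch of the idea:

    /* Hypothetical sketch of deferring and batching deallocations so
     * the expensive memory-system work (and any locking around it) is
     * paid once per batch rather than once per object. */
    #include <stdlib.h>

    #define BATCH 128

    static void *pending[BATCH];
    static int npending;

    static void flush_pending(void)
    {
        for (int i = 0; i < npending; i++)
            free(pending[i]);
        npending = 0;
    }

    static void deferred_free(void *p)
    {
        if (npending == BATCH)
            flush_pending();
        pending[npending++] = p;
    }

    int main(void)
    {
        for (int i = 0; i < 1000; i++)
            deferred_free(malloc(64));
        flush_pending();
        return 0;
    }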

> > I suspect these researchers were looking at more sophisticated
> > mechanisms than reference counting for exactly this reason, and
>
> No, it's because they belong to a cult.  They might or might not be right,
> but I claim only half-joking that it's an accident because they're
> worshipping a God and not writing code. :-)

I think you're projecting on to them. They are very much writing code
(this is a functioning OS, perhaps not production quality, but it
actually does do stuff). More importantly, they are experimenting with
a very real idea: what happens if you get rid of the overhead of
hardware memory protection and let the kernel enforce memory
protection in a more flexible and sophisticated manner?

> > One should keep in mind that the "Linus" mindset only applies until
> > things become hideously complicated (and the kernel has undergone
> > revisions to reflect this).
>
> Well, explicitly so, since it assumes that writing correct (monolithic
> kernel/explicitly deallocated) code is an option.  When that's no longer
> true then the argument is comparing the possible with the impossible.

I was talking about sticking with simple design principles that just
require a lot of manual labor to get things right. There are several
places in the kernel where the initial "simple" designs that Linus
defended rather aggressively have essentially been thrown out, often
repeatedly (SMP would be a pretty clear example).

> > If you look at kernels designed for large
> > numbers of cores, you'll find that they get increasingly complex (and
> > therefore difficult to thoroughly test and tune, let alone being
> > stable) unless some "performance sapping" abstractions are applied.
>
> Fine, but are you arguing that this complexity is greatly reduced by
> introducing kernelspace GC?

Not explicitly. I'm saying the complexity requires a very complex
memory management scheme, which is going to be at least as complex as
a sophisticated GC. As such, having one doesn't necessarily increase
complexity, and it may actually reduce it.

> I'd have thought it mostly stemmed from other
> issues entirely and wouldn't be affected much by the kernel's memory
> allocation strategy.  I suppose I can see how memory management might
> become a big deal, but can't really say anything useful without learning a
> lot more about kernel memory usage.

You can rest assured that in just about any serious OS kernel these
days, memory management is a HUGE deal.

> Mmmm.  Yeah, there is something to that argument.  About those guys we're
> assuming would be writing the collector, though--I have the feeling that
> suggesting GC on LKML would be an easy way to commit suicide. :-)

The LKML isn't necessarily the best place to float trial balloons
about new approaches to kernel design. They tend to recognize the
value of such approaches only after trying all others. ;-)

> > sizes. Since the kernel needs to run forever with consistent
> > performance, it needs to be able to continue to allocate potentially
> > large contiguous structures even after billions of allocations (the
> > Linux kernel actually isn't terribly great at this).
>
> Mmm.  How does kmalloc compare to the regular GNU malloc (which IIRC
> provides rather good performance)?

In the Linux kernel, it's very, very hard to get large contiguous
memory blocks (something that video capture folks gripe about all the
time), particularly if the kernel has been running for a while.

> > So, all in all, you tend to end up with something that looks
> > suspiciously like automated memory management, even if it isn't.
>
> It's not clear to me that it's any more like that than writing a
> general-purpose malloc(), however, since that is really just a description
> of writing a general-purpose manager of any type.  Or were you suggesting
> that a *copying* (or hybrid) collector in particular would be a win when
> you're uptime is in the hundreds of days?

Yes, I am suggesting that. I've actually seen situations (indeed with
realtime embedded systems) where that was a compelling argument for
going with GC.

> Could be, but mark-and-sweep is "real" GC and it shouldn't do any better
> than malloc() since it can't move allocated objects.

I think you are referring to a C/C++ mark-and-sweep collector. Now,
by definition a mark-and-sweep collector doesn't *have* to move
allocated objects, as opposed to a copying collector, but that doesn't
mean that it *can't*. Indeed, for languages where memory isn't
directly addressable, it's pretty common for mark-and-sweep
collectors to be compacting (Sun's JVM, for example).
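
The simplest way to see why moving is possible when raw addresses
aren't exposed is an indirection table: user code holds handles, only
the runtime holds real addresses, so the collector can slide live
objects together and patch a single table. This is a hypothetical toy
(production collectors typically rewrite the references themselves
when they move objects), but it shows the idea:

    /* Hypothetical toy of compaction behind handles: programs hold
     * indices, only the handle table holds real pointers, so objects
     * can be moved freely.  This toy relies on live handles being in
     * ascending address order. */
    #include <stdio.h>
    #include <string.h>

    #define HEAP_SIZE   4096
    #define MAX_HANDLES 64

    static unsigned char heap[HEAP_SIZE];
    static size_t heap_top;

    struct handle { unsigned char *addr; size_t size; int live; };
    static struct handle table[MAX_HANDLES];

    static int h_alloc(size_t size)
    {
        for (int h = 0; h < MAX_HANDLES; h++) {
            if (!table[h].addr && heap_top + size <= HEAP_SIZE) {
                table[h] = (struct handle){ heap + heap_top, size, 1 };
                heap_top += size;
                return h;
            }
        }
        return -1;
    }

    static void compact(void)
    {
        size_t top = 0;
        for (int h = 0; h < MAX_HANDLES; h++) {
            if (!table[h].addr)
                continue;
            if (!table[h].live) {           /* dead: just drop it */
                table[h].addr = NULL;
                continue;
            }
            /* Slide the live object down and patch its handle. */
            memmove(heap + top, table[h].addr, table[h].size);
            table[h].addr = heap + top;
            top += table[h].size;
        }
        heap_top = top;
    }

    int main(void)
    {
        int a = h_alloc(100), b = h_alloc(200), c = h_alloc(100);
        table[b].live = 0;                  /* b becomes garbage */
        compact();                          /* a and c are now contiguous */
        printf("a=%p c=%p top=%zu\n",
               (void *)table[a].addr, (void *)table[c].addr, heap_top);
        return 0;
    }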

> Aside from anything else, I'm going to have to contemplate the pain of
> tracking down bugs in a sophisticated kernel space collector.

In general, tracking down memory bugs in kernel space can be a real
pain. I know a poor developer who literally spent months trying to
track down a particularly nasty one in his NT driver. The Linux kernel
tends to be a bit simpler than NT in this regard, but it keeps getting
more and more complicated.

--
Chris
