[SGVLUG] Microsoft Reports OSS Unix Beats Windows XP

Dustin laurence at alice.caltech.edu
Sun Nov 13 22:58:59 PST 2005


On Sun, 13 Nov 2005, Chris Smith wrote:

> On 11/13/05, Dustin <laurence at alice.caltech.edu> wrote:

> > The last sentence is key.  Yes, you can get higher throughput with "real"
> > GC (actually, I think ref counting is as "real" as any other method,
> > unless reaping cycles is fundamental to your definition of GC reality),
> 
> Hence why I tend to call it "poor man's GC" ;-), although that's a tad
> unfair as there are circumstances where ref counting is the ideal
> choice.

I guess I would argue that if you want to make the degree of
bulletproofing part of the definition, why stop there?  Why not say
copying or mark & sweep aren't "real" because it's possible to leave a
pointer to the root of a huge no-longer-live data structure?  OTOH if the
definition is simple, "automated memory reclamation," it seems to me that
ref counting counts.
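To make the "automated memory reclamation" reading concrete, here is a minimal intrusive ref-counting sketch in C (the names `obj_ref`/`obj_unref` are just illustrative, not from any real codebase): memory is reclaimed automatically the moment the last reference drops -- and, true to the caveat above, it would never reap a cycle.

```c
#include <assert.h>
#include <stdlib.h>

/* A minimal intrusive reference count: "automated" in the sense that
 * the object is reclaimed as soon as the last reference is dropped.
 * A cycle (a holds the only reference to b and vice versa) would
 * never be reclaimed -- the usual objection to calling this "real" GC. */
typedef struct {
    int refs;
    int payload;
} Obj;

Obj *obj_new(int payload) {
    Obj *o = malloc(sizeof *o);
    o->refs = 1;                 /* the creator holds one reference */
    o->payload = payload;
    return o;
}

Obj *obj_ref(Obj *o) {           /* take another reference */
    o->refs++;
    return o;
}

void obj_unref(Obj *o) {         /* drop one; free on the last */
    if (--o->refs == 0)
        free(o);
}
```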

However, I will grant you that the literature talks like ref counting 
isn't "real" at least half the time, so you have numbers on your side at 
least.  I blame the fact that ref. counting is possible in C and C++ and 
therefore automatically evil. :-)  In fact, I believe I sometimes talk 
that way too, in spite of my theoretical objection to the idea.  So I 
can't really get too high up on my horse....

For convenience I'll go along with the whole silly convention and mostly use
"GC" to mean mark-and-sweep/copying collection, because there isn't a concise
term for "GC other than ref counting". :-)

> Actually, GC tends to be a bigger win on the responsiveness side if
> you are using a parallel collector, although it can get ugly if the
> collector can't keep up with the allocator.

See below.

> > But...the memory overhead of GC is startling, and downplayed by GC bigots
> > (just like the time and space overhead of malloc is downplayed by anti-GC
> > bigots).
> 
> Well wait... how can both GC and non-GC approaches impose time and
> space overheads that each side tends to downplay? I mean, surely one
> approach clearly has more overhead than the other, or there isn't a
> clear overhead in which case your assertion doesn't make any sense....

Nothing so complex.  I wasn't taking a side on the answer, just taking a
cheap shot at all & sundry.  I was saying that GC bigots never go beyond
casually dismissing the performance costs and maybe mentioning the
asymptotic behavior, while anti-GC bigots talk as though free() is no more
expensive than nulling a pointer.

In other words, both habitually compare their imaginary optimal system 
against the other guy's imaginary pessimal system.

> > Further, I am not aware of any "real"  collector that doesn't stomp all over the
> > heap,
> 
> Well, a compacting or copying collector will potentially "stomp" all
> over the heap. However, there are a number of approaches which can
> mitigate the impact of this.

The one I know of is this:

http://www.cs.umass.edu/~emery/pubs/f034-hertz.pdf

but half of what is interesting about the paper is that it more or less
says that half of what has been claimed about GC is a lie in the real
world.  We've been told for a *long* time that there is no need for
anything but GC'd dynamic allocation, ever, yet even the best ones must
have behaved at least as poorly under memory pressure during that time as
those they benchmark their bookmarking collector against.  I'm not very
interested in the opinions of people who make claims like that.

Let's be fair and mention that I have a built-in prejudice against people
who want to tell me that I should use GC in a program that allocates a
small number of large, long-lived matrices and then flogs the daylights
out of some difference equations for hours.  It's silly, of course, but
that is more or less the message of the GC crowd.  (The message of the
anti-GC crowd could be called premature optimization by perfect
programmers, among other things.)

I sort of advocate burning all the GC zealots at the stake for the Good Of 
Mankind, regardless of which way they're zealous. :-)

> ...More importantly, inside a kernel the
> heap is small enough that the impact of said "stomping" might be tiny
> compared to the benefits in terms of defragmenting memory.

I'll grant you I'm not really familiar with the kernel's allocation 
patterns.  Also, I'm not used to thinking about GC in kernel space--and of 
course that's the one place heap-stomping doesn't matter unless you 
actually swap kernel memory, which I know Linux doesn't.  So it was 
somewhat of a silly objection.

If the kernel can swap, then I say again that it seems worse to me to have 
the kernel become unresponsive than userspace code.  Arguments to the 
contrary are welcome if you think that's such an irrational prejudice.

> The real benchmarks for this stuff are run in real world scenarios,
> and so they very much take this in to account. Indeed, you'll find a
> lot of the higher quality automatic memory management systems
> specifically have a first generation sized to fit in cache.

It doesn't help as much as advertised:

http://www.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf

concludes, via a very clever test rig, that (their test programs) have to
occupy about 5x the memory actually needed in order to run comparably to
manual deallocation (or a bit faster--at least sometimes the usual
back-of-the-envelope calculation seems to work).  When you don't have that
much, order-of-magnitude penalties relative to manual allocation looked
realistic.

I also love their point that nobody seems to have ever measured
malloc/free pauses against GC pauses (a case of the no-GC crowd liking to
assume that they are nearly no-ops).  I believe it's usually a lot better,
but it would be nice to test that since free() can do quite a lot of
unexpected work.
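For what it's worth, the measurement they call for is easy to sketch.  This little harness is my own illustration, not something from the paper: time each individual free() and keep the worst single pause, which is the number you'd put next to a collector's worst-case pause.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* free() is usually fast, but coalescing or returning pages to the OS
 * can make some calls much slower than others -- hence measuring the
 * worst case, not the average. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
}

/* Frees every pointer in ptrs[0..n-1]; returns the longest single
 * free() pause in nanoseconds. */
uint64_t max_free_pause_ns(void **ptrs, size_t n) {
    uint64_t worst = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = now_ns();
        free(ptrs[i]);
        uint64_t dt = now_ns() - t0;
        if (dt > worst)
            worst = dt;
    }
    return worst;
}
```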

In the interests of full disclosure, it might be worth saying that for a
lot of this stuff I'm of the assume-the-worst-case camp.  I always liked
heapsort better than quicksort because a 2x penalty seemed a fair price 
for guaranteed N lg N behavior. :-)  So you can see why I consider
the usual arguments for GC to be at least slight lies if they don't
discuss *all* kinds of safety.  Sudden extreme pathological behavior
beyond what an explicit allocator would suffer under heavy loads seems 
like a recipe for deploying something that works wonderfully until the day 
the company needs it the most.  I regard that as at least as hazardous as 
tracking down free() bugs (which, while very bad, at least tend to offer a 
warning in the form of steadily growing memory footprints).

> > so the extra memory that is required to start reaping the benefits
> > of that calculation is larger than one might think.
> 
> Despite my point-by-point responses to your points, I agree with this
> final point here. In general, memory management issues are quite
> complex, and it takes a lot of study and expertise to really determine
> what is the most efficient approach.

I agree with that, but the above study is pretty nice.  Particularly in 
that they get a handle on comparing the two methods with programs designed 
for GC, rather than the usual method of comparing C programs designed for 
explicit allocation with and without the Boehm collector.

> A lot of kernels don't ever swap out their heap, so this need not be a
> concern in kernel space. In general the small size of kernel heaps
> tends to minimize the impact of any memory overhead, although one
> would be a fool to ignore that kernel heaps are kept small for very,
> very good reasons.

Mmm.  Again, I can't claim to know a great deal about kernel allocation 
patterns, but if you need 5x/N the footprint (where N is the usual 
fragmentation penalty with a good malloc()--I'm not sure what it actually 
is, but I don't think it's 5x, so I'm assuming that here) to get 
good performance, then I'm uncomfortable with having it in there.

Again you may laugh at my belief that a collector is less predictable, but
how many realtime systems use GC?  (But see comments about MP systems
below.)  Probably some, but I bet a lot fewer than use malloc with or
without ref counting, and for fairly good reasons.

> management). Also, I wouldn't necessarily assume that this makes it
> more vulnerable to memory pressure. Indeed, it's quite possible the
> added flexibility of a sophisticated automatic memory management
> system may make it easier to respond to memory pressure.

I find the above paper quite convincing that it *does* suffer *much* worse
under memory pressure.  That said, I haven't lived enough in kernel space
to be quite sure that the conclusions are still valid there.  
Particularly since one of the ugliest habits of collectors is to step on a
lot more swapped-out or uncached pages, and in a kernel like Linux that
doesn't happen because the pages would never swap out (but the cache
misses would still happen at a level I couldn't really guess).

> > In userland you have a lot more choices about what performance trade-offs
> > work for your application, but you're *going* to be running a kernel no matter
> > what so you have less flexibility (an argument that, logically, would argue for a
> > GC/noGC compile time option, an interesting idea).
> 
> Instead what tends to happen is that one builds a memory management
> scheme designed to perform as well as possible in as many different
> scenarios as possible.... sounds exactly what automatic memory
> management systems are designed for, doesn't it? ;-)

Well, in theory.  In practice, I do not trust them to be as predictable as
malloc and free for typical real-world programs.  I'm willing to make
exceptions, though, especially for programs that allocate a huge number of
small, short-lived objects, where I'd tend to be very very worried about
depending on malloc.  (There's a very good reason why all the functional
languages I'm aware of are fully GC'd.)
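For that allocation pattern, the classic manual-management dodge is an arena.  Here's a simplified sketch (names are mine; real arenas chain multiple chunks): allocation is a pointer bump and the whole batch dies at once, which is roughly what a generational collector's nursery buys you automatically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A bump-pointer arena for many small, short-lived objects:
 * allocation is an increment, and everything is released together. */
typedef struct {
    char  *base;
    size_t used;
    size_t cap;
} Arena;

int arena_init(Arena *a, size_t cap) {
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = cap;
    return a->base != NULL;
}

void *arena_alloc(Arena *a, size_t n) {
    n = (n + 7) & ~(size_t)7;            /* round up to 8-byte alignment */
    if (a->used + n > a->cap)
        return NULL;                     /* out of arena space */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

void arena_release(Arena *a) {           /* frees every allocation at once */
    free(a->base);
    a->base = NULL;
    a->used = a->cap = 0;
}
```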

> > For my own purposes, the times I care most about kernel performance is for
> > responsiveness, and last I checked "real-time" and "garbage collection"
> > didn't belong in the same sentence, and even making it tolerable requires
> > a great deal of sophistication.
> 
> Check again. ;-) They actually have ways of doing this now.

It's a fair point, but saying it's been done isn't the same as saying it 
was a great plan you'd recommend to your friends. :-)

> So here's a thought experiment for you... think about the direction
> CPUs and systems are taking. More and more cores. More and more

> Now tell me that reference counting kernels provides predictable
> response rates. ;-)

This is an excellent point because my reasoning is quite
uniprocessor-biased (darn it, your collector wouldn't work well on a 6502,
so what good is it?!?), and I agree that ref counting should suffer a
great deal more in an SMP world. If you have so many cores you can just
(effectively) give one to the collector permanently, it might well be that
fancy collection wins in almost all scenarios.
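To put the SMP pain point in code (a sketch using C11 atomics; the names are mine): every ref and unref from every core is an atomic read-modify-write on the *same* cache line, so the line ping-pongs between cores even when the sharing is read-only, while a tracing collector touches no per-object count on that hot path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* A thread-safe refcount: correct, but every shared_ref/shared_unref
 * is a contended atomic RMW on o->refs. */
typedef struct {
    atomic_int refs;
    int payload;
} SharedObj;

SharedObj *shared_new(int payload) {
    SharedObj *o = malloc(sizeof *o);
    atomic_init(&o->refs, 1);
    o->payload = payload;
    return o;
}

void shared_ref(SharedObj *o) {
    /* relaxed is enough for an increment: it can't race the free */
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

void shared_unref(SharedObj *o) {
    /* acq_rel so the thread that frees sees all prior writes */
    if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel) == 1)
        free(o);
}
```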

OTOH, the gaps between CPU speeds, memory speeds, and storage speeds are 
increasing, which is bad for automatic collection.  Maybe not by as much 
as tons of hardware threads is good for it--I don't really know, of course.

> I suspect these researchers were looking at more sophisticated
> mechanisms than reference counting for exactly this reason, and

No, it's because they belong to a cult.  They might or might not be right, 
but I claim only half-joking that it's an accident because they're 
worshipping a God and not writing code. :-)

> One should keep in mind that the "Linus" mindset only applies until
> things become hideously complicated (and the kernel has undergone
> revisions to reflect this).

Well, explicitly so, since it assumes that writing correct (monolithic
kernel/explicitly deallocated) code is an option.  When that's no longer 
true then the argument is comparing the possible with the impossible.

> If you look at kernels designed for large
> numbers of cores, you'll find that they get increasingly complex (and
> therefore difficult to thoroughly test and tune, let alone being
> stable) unless some "performance sapping" abstractions are applied.

Fine, but are you arguing that this complexity is greatly reduced by
introducing kernelspace GC?  I'd have thought it mostly stemmed from other
issues entirely and wouldn't be affected much by the kernel's memory
allocation strategy.  I suppose I can see how memory management might
become a big deal, but can't really say anything useful without learning a
lot more about kernel memory usage.

> The allocation patterns it is a win for are very much dependent on how
> the memory manager was tuned, which ironically is also the case for
> explicit memory management. The only difference is that explicit
> memory management gives an application developer who can't tweak the
> memory manager more knobs to turn to try to get things right. Aside
> from third party device driver authors (who tend to have common
> allocation patterns) the folks in control of the kernel code are the
> same people who'd be in control of the memory manager.

Mmmm.  Yeah, there is something to that argument.  About those guys we're
assuming would be writing the collector, though--I have the feeling that
suggesting GC on LKML would be an easy way to commit suicide. :-)

> sizes. Since the kernel needs to run forever with consistent
> performance, it needs to be able to continue to allocate potentially
> large contiguous structures even after billions of allocations (the
> Linux kernel actually isn't terribly great at this).

Mmm.  How does kmalloc compare to the regular GNU malloc (which IIRC 
provides rather good performance)?

> So, all in all, you tend to end up with something that looks
> suspiciously like automated memory management, even if it isn't.

It's not clear to me that it's any more like that than writing a
general-purpose malloc(), however, since that is really just a description
of writing a general-purpose manager of any type.  Or were you suggesting
that a *copying* (or hybrid) collector in particular would be a win when
your uptime is in the hundreds of days?  Could be, but mark-and-sweep is
"real" GC and it shouldn't do any better than malloc() since it can't move
allocated objects.
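To see why, here's a toy mark-and-sweep over a fixed pool (purely illustrative, with names of my own invention): survivors keep their addresses, so dead slots just become holes in place -- the same fragmentation a pattern of free() calls would leave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL 8

/* One toy heap object: a "live" flag plays the role of allocation,
 * and each object may reference one child. */
typedef struct Node {
    bool live;            /* slot is allocated */
    bool marked;          /* set during the mark phase */
    struct Node *child;
    int value;
} Node;

static Node pool[POOL];

/* Mark everything reachable by following child links; the marked
 * check also terminates cleanly on cycles. */
static void mark(Node *n) {
    while (n && !n->marked) {
        n->marked = true;
        n = n->child;
    }
}

/* Mark from the roots, then sweep: unreachable live slots are freed
 * IN PLACE (no compaction).  Returns the number of slots reclaimed. */
int gc(Node **roots, int nroots) {
    for (int i = 0; i < POOL; i++)
        pool[i].marked = false;
    for (int i = 0; i < nroots; i++)
        mark(roots[i]);
    int freed = 0;
    for (int i = 0; i < POOL; i++) {
        if (pool[i].live && !pool[i].marked) {
            pool[i].live = false;    /* the slot becomes a hole */
            freed++;
        }
    }
    return freed;
}
```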

Aside from anything else, I'm going to have to contemplate the pain of
tracking down bugs in a sophisticated kernel space collector.  It sounds
like a very, ah, *interesting* experience.  Remind me again why Linus says
we don't need a kernel debugger? :-)

Dustin


