[SGVLUG] bit order (was Linux based web-server appliance)

Dustin Laurence dustin at laurences.net
Wed May 24 16:06:24 PDT 2006


On Wed, May 24, 2006 at 02:58:25PM -0700, David Lawyer wrote:

> On Fri, May 19, 2006 at 01:51:01PM -0700, Dustin Laurence wrote:
> > 
> > Quite to the contrary, the provably optimal algorithm is inherently
> > big-endian,
> Not necessarily.  See my remarks later on.

Yes, necessarily.  You just don't account for the costs of your
algorithm.

> I use a RPN calculator and never thought about the state.  There's
> just the contents of the 4 stack registers.  I always thought of it as
> just using postfix notation where 2 3 + results in 5.  You just place
> the operator + after the two operands 2 and 3. 

It isn't just the RPN notation, because Horner's rule is a particular
factoring.
Consider evaluating

    a x^3 + b x^2 + c x + d

on a stack machine (RPN calculator).  The naive algorithm works just as
well in RPN:

    x x x * * a * x x * b * + x c * + d +

(naturally you'll use <enter> to duplicate x, but written that way it is
easier to compare to the Horner's rule version below).  That is a
literal translation of the expression, obeying normal precedence rules.
It takes six multiplications and three additions.  Now factor out
according to Horner's rule:
to Horner's rule:

    d + x( c + x( b + x a ) )

That is the infix form, since standard notation is infix.  Now evaluate
on a stack machine:

    x a * b + x * c + x * d +

That requires three multiplications and three additions; it turns out
to be provable that no algorithm can get by with fewer.

The connection between Horner's rule and postfix notation is just that
by making parentheses and explicit '=' evaluation unnecessary it takes
fewer keystrokes to apply Horner's rule in RPN, and probably also the
fact that RPN users are generally sophisticated enough to do it very
easily (whereas you don't know with algebraic users--they may be very
sophisticated, or they may be using algebraic calculators because they
couldn't understand RPN).

> But back to the question of efficiency.  But first, putting the case in
> context, most of the time there is no need to extract a number from
> digits because the number is in binary and it's just kept in binary
> form.  So extracting a number from its digits is really just changing
> the base, like from binary to decimal or hexadecimal.

You fail to understand the point.  Bytestream-oriented network protocols
effectively work in base 256; I mentioned that, but perhaps it wasn't
clear enough.  I then couched my example in decimal digits simply
because it is more familiar.  If that confuses you, go back and re-read
it using base 256.  Since bytes are the smallest directly addressable
unit in many contexts--including modern CPUs, C, and (I think) network
protocols at the level of IP and higher, which is what we were
discussing; I believe it's true down at the link layer (think Ethernet)
as well--bytes are the right unit to use.  That is also another
demonstration that the bit-endianness of the physical layer is entirely
separate from the byte-endianness of the higher layers.

The point is: just reading a multi-byte number from the stream is
*necessarily* a polynomial evaluation.

> Now the value of the number "def" (with base b) is d x b^2  +  e x b^1
> + f x b^0.  But if we are going to do a lot of such calculations with
> base b, we precalculate b^0, b^1, b^2, etc.  So let B be the vector of
> the b's (..., b^3, b^2, b^1, b^0) and X be the vector of the digits
> (d, e, f, ....).  The value of the number is B.X where . is inner
> product.  B is a constant vector.  Now the efficiency of calculating
> B.X doesn't depend on whether or not we start multiplying with the low
> order digit of X or with the high order.

> Above has n -1 multiplications and n -1 additions, the same as 
> with the inner product of B and X vectors as I just described.

Sure it does, you just have to read what I wrote.  Trading space
efficiency for time efficiency doesn't impress anyone when you could
have had both by calculating big-endian.  You are allocating extra
storage for your vector of precomputed powers, and therefore must make a
decision in advance about the maximum size of numbers you can interpret
or must add in some (comparatively) complex code to dynamically resize
your array when you get a sufficiently long number.

Let's actually implement these algorithms.  First, if digits are in
strict big-endian order:

    int digits_left = next_byte() - 1;  /* Assume the protocol disallows
                                         * "0-digit" numbers. */
    unsigned long value = next_byte();

    while (digits_left > 0) {
        value = 0x100*value + next_byte();
        digits_left--;
    }

And I could even have written the calculation as "value = (value << 8) |
next_byte()" for the benefit of brain-dead compilers.  I've been
deliberately vague about some types because they don't matter in the
example as long as they are "sufficiently large".  Notice how the loop
needs nothing beyond the running value and a simple countdown--no
table, no extra state.

Now little endian:

    static const unsigned long powers[] = {0x100, 0x10000, 0x1000000};

    int max_digit_index = next_byte() - 1;

    unsigned long value = next_byte();

    int digit_index = 0;
    while (digit_index < max_digit_index) {
        value = value + powers[digit_index]*next_byte();
        digit_index++;
    }

Notice that I had to introduce an extra vector of storage for the
precomputed powers to avoid the multiplication and an extra local
variable to count *up* the digits.  In this example neither is a big
deal because we're implementing in software, but there is no question
that the big-endian calculation is more efficient no matter how you
trade off space and time.  Worse, the little-endian example was harder
to write correctly (I didn't take the time to compile these, BTW, so I
don't guarantee them much) and harder to read, because the big-endian
version is just more *natural*.  Anyone who prefers the code in the second
example has no taste and had better go back to PHP.

That was the best-case scenario for little-endian, where the maximum
number of digits that can ever occur and the actual number of digits
are known in advance (quite likely, since the encoding doesn't allow
guard values).  If it were a terminated bytestream rather than a
counted bytestream, then the big-endian code is almost unchanged but the
little endian code grows a few more warts.

And if we're doing this in hardware, and we don't just throw an
oversized controller at it because they're so cheap these days, it
*does* matter.  The hardware engineers are going to have some
unprintable things to say about the incompetence of people who define
protocols that require extra registers for nothing.

> O(n^2) is true if we didn't store the values of b^n in a table.  But
> for repeated operations one would store them in a table so it's O(n).
> In fact the b^n values can be hard-coded into the hardware or
> software.

Sure, and now the little-endian version takes O(n) space while the
big-endian one continues to take O(1).  I just assumed I didn't need to
explain obvious time-space tradeoffs explicitly.

> For little endian, we can wait till we get all the digits and then use
> Horner's, inspecting the most significant digit first.  OK, I admit I
> was wrong stating an advantage for little endian.  I guess it really
> doesn't make much difference.

But it does--if you do that, you're again using O(n) space, this time for
the byte buffer rather than the precomputed powers.  And what are you
doing?  You are reversing the byte order to big-endian, the order that
it should have been in the first place, to compensate for an
incompetently defined little-endian protocol.

The argument that endianness doesn't matter much has to do with
endianness *in memory*, where we have random access to the bytes (and
thus can step through backwards with no time or space penalty, and apply
Horner's rule no matter what the order).  Plus the architecture
endianness is about ordering within certain machine-defined words, not
arbitrary length bignums.  It doesn't really matter there.  But for
sequential access, it does, which is likely why the transmission
protocols tend to be big-endian.

> -----------------------topic change-----------------------------------
> > 
> > Anyway, nobody cared about CPU consumption in that era and so it
> > wasn't optimized.
> 
> The people who made my 486 cared since they had a "green" model which
> had very low consumption which was advertised in the MB manual with
> estimates for the HD and monitor.

While it's a good point about the drive and monitor, and I'm sure you're
right that the monitor was designed to save energy, I doubt very much
that applies to the CPU design itself.  Do you care to argue that Intel,
which didn't publicly care about power consumption on the desktop until
the P4s started to dissipate ~200W, was worrying about power
consumption in the 486 era?  I really don't think so.

> But what about the consumption per hour?

If that were the only consideration I'd just use a slide rule.

Dustin, who dissipates >= 70W.
