From: Le Chaud Lapin on
kanze wrote:
> Le Chaud Lapin wrote:
> > In any case, again, you have to be fastidious in watching the
> > spec in this section. In general, you cannot slap on an
> > entire 64-bit chunk onto the message. For example, if the
> > message is 440 bits long, you're only allowed to add 1 byte
> > for padding: the bit-pattern of a 1 followed by 7 zeros. This
> > will bring you to congruency mod 448, at which point you can
> > tack on the 64-bit length to get to a full 512 bits.
>
> His interface forced the user to provide the 32 bit chunks.

Not sure if that is the best way to approach this. Viewing the input
data to the hasher as a stream of bytes is far more general than
forcing a view of it as a vector of 32-bit chunks, and in most cases
it eliminates unnecessary copying.
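The padding rule quoted above can be sketched for a byte-granular
message. This is a minimal illustration, not anyone's posted code; note
that the byte order of the appended length is algorithm-specific
(little-endian for MD5, big-endian for SHA-1), and little-endian is
shown here:

```cpp
#include <cstdint>
#include <vector>

// Append 0x80 (a 1 bit followed by 7 zero bits), then zero bytes until
// the length is congruent to 56 mod 64 (448 mod 512 bits), then the
// original bit length as a 64-bit value, least-significant byte first.
void pad_md5(std::vector<unsigned char>& msg)
{
    std::uint64_t bitLen = static_cast<std::uint64_t>(msg.size()) * 8;
    msg.push_back(0x80);                  // mandatory first pad byte
    while (msg.size() % 64 != 56)         // fill to 448 bits mod 512
        msg.push_back(0x00);
    for (int i = 0; i < 8; ++i)           // append bit length, LSB first
        msg.push_back(static_cast<unsigned char>(bitLen >> (8 * i)));
}
```

For the 440-bit (55-byte) example above, exactly one pad byte (0x80) is
added before the 8-byte length, bringing the block to a full 512 bits.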

> Technically, of course, you might have to pad as little as a
> single bit. In practice, you can pad using the same multiple of
> bits as the user is required to provide (supposing that is a
> power of 2). Of course, for most uses, you'll want to support
> at least as small as bytes.

This is true, and it also hints at a fact that many of us overlook in
certain kinds of programming (embedded, communications, etc.): when you
send a byte from one machine to another, there is already a guarantee
that the bit order will be preserved for you. Imagine a world where
you had to worry about bit order too.

> > Technically, you have to be prepared for crunching a digest
> > whose size is greater than (2^32). Here you're only taking
> > care of lengths up to 2^32.
>
> Depends on the application. My own implementation of MD5 is
> limited to the SIZE_T_MAX * 8 bits. Perhaps internally, there
> are other limits which means that the messages will be small.

Hmm... not sure you can call this an implementation of MD5. I would
make the concession that it is ok to view the data as a string of bytes
instead of a string of bits, which, as you pointed out, is what the
spec calls for for true compliance. But I think tossing out the
upper half of the 64-bit word for the length and calling it an
implementation of the algorithm is stretching it. This presumes, of
course, that you are making a generally reusable component, which you
intend to polish off and make available for others to use.

> > 2. You want the output of the hashing to be an object itself
> > so that you can let it construct, assign it, compare it, store
> > it in containers, toss it around, etc., let it destruct.
>
> Maybe. In my case, my MD5 calculator is an accumulator. I
> suppose that there is some argument for defining a result type,
> but in practice, I've almost always read the results using the
> << operator, e.g.:
>
> file << myText
> << std::accumulate( myText.begin(), myText.end(), MD5() ) ;
>
> Internally, the << operator uses a member function:
> template< typename OutIter >
> void resultsTo( OutIter dest ) const ;
> which I made public, more because there wasn't any reason not
> to, and I thought it might be useful to someone.

Where is the digest? Certainly it would be nice to write
map<Digest,Foo>, for example. This again assumes that you are making a
class that someone else would want to use. I can hardly think of a
more opportune moment to let the object-oriented features of C++ bear
down on the design of a component.

> > 4. You want the hash algorithm itself to be fast.
>
> You want it to be correct first. It isn't, so it's not yet time
> to worry about speed.

Well, when I do my designs, I like to take everything into account all
at once, including performance (macro, not micro). I spend most of my
time mulling. Then I fidget to verify propositions, then back off, then
walk around in the woods some, then eat a snack, then tinker, and only
when I think I have a visceral feel for what's going on do I begin to
bear down on the implementation. For me the code always comes out
cleaner than it would have if I had taken the incremental approach, and
it takes less time to get a finished product (contrary to what others
claim). You're going to have to think about these things at some point
anyway, so you might as well mull them over as much as possible before
you start typing, because the mind is faster than the hand (though each
works better with the help of the other).

> > I would define the algorithm itself as a namespace.
>
> Why?

Well, to give you a real example, let's say you had a massive software
project that depended on primitives involved in security, including
hashing. One master #include file contained:

namespace Security_Set
{
    using MD5::Hasher;
    using MD5::Digest;

    using namespace RSA;

    typedef RC6::Cipher<32,20,16> Symmetric_Cipher;
    typedef RC6::Key<16> Symmetric_Key;
}

Now let's say two years later, someone found a flaw in your chosen
Hasher! Good grief!! No worries, you could simply modify your
namespace:

namespace Security_Set
{
    using SHA_256::Hasher;
    using SHA_256::Digest;

    using namespace RSA;

    typedef RC6::Cipher<32,20,16> Symmetric_Cipher;
    typedef RC6::Key<16> Symmetric_Key;
}

If you have been using Hasher and Digest objects throughout your
system, all you would have to do is recompile, and everything should
work, data stored on disk notwithstanding. As a matter of fact, this
is precisely what I intend to do: selecting from multiple Hashers,
multiple symmetric cipher systems, multiple asymmetric cipher systems,
and multiple nonce sizes (128-bit versus 192-bit versus 256-bit), both
to test performance and to verify that it is possible to do a simple
recompile of a very large system and have it work with the new cipher
primitives without incident. I cannot express in words how convenient
this is. With this technique, if someone were to find a flaw in
Rijndael 2 days before we shipped our product, we could actually make
the change I just mentioned, recompile, and ship using an alternative
symmetric cipher. The marketing people would only have to update their
collateral.
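To sketch why the swap is a recompile-only change: caller code written
purely against the Security_Set aliases never names the concrete hash.
The MD5 namespace below is a dummy stand-in, and its hash/digest
interface is an assumption for illustration, not a real library's API:

```cpp
#include <string>

// Dummy stand-in for a real hasher namespace; interface is hypothetical.
namespace MD5
{
    struct Digest { std::string value; };
    struct Hasher
    {
        void hash(const void* /*data*/, unsigned /*len*/) {}
        Digest digest() const { return Digest{"md5-digest"}; }
    };
}

// The indirection layer from the post: callers see only these aliases.
namespace Security_Set
{
    using MD5::Hasher;
    using MD5::Digest;
}

// Caller code: names Security_Set only, never MD5 or SHA_256 directly,
// so retargeting the using-declarations and recompiling is the whole fix.
Security_Set::Digest fingerprint(const std::string& s)
{
    Security_Set::Hasher h;
    h.hash(s.data(), static_cast<unsigned>(s.size()));
    return h.digest();
}
```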

> After a quick glance at SHA-1, I suspect that you could use some
> sort of common machine for the entire family, including SHA-1
> and MD-5, with a policy class to customize, much as is standard
> practice for CRC's. I'd certainly evaluate this route before
> implementing my second digest in the family. (As mentioned
> above, I've already implemented MD5.)

Yikes. This is a good example of where the whole reusability hunt is
too aggressive, IMO. Hashers are very well defined. You know exactly
what they are, the sizes of their digests, the expectations on input.
They are small in terms of code size. Trying to force them to be
common just so you can say, "it took me a week but I finally tweaked
these things enough to use common code" is a bit much, don't you think?

> This is IMHO the big question with regards to the interface. In
> my current implementation, extracting the value would modify the
> internal state, so that the accumulator itself could not be
> reused. [snipped]

This is the fun part of software design - imagining all the situations
in which the primitive might be used, and then designing it so that it
will feel good for the user once the primitive is wielded in the
construction of a system. Note that "feel good" comes from eliminating
as much uncertainty surrounding the primitive as possible. Said a
different way, "feel good" comes from a reduction in requisite mental
burden.

1. Hash.
2. Hash.
3. Hash.
4. Hash.
5. Hash, but get the digest and reset the hasher.

This feels right, and it will work in almost any context, including
hashing a sequence of 32-bit words stored in a vector<>. The
programmer does not have to worry about inefficiencies from copying.
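The five steps above might look like this in code. The Hasher below is
a stand-in whose internals are fake (it merely counts bytes); only the
accumulate-then-finalize interface is the point being sketched:

```cpp
#include <cstddef>
#include <string>

// Sketch of the interface pattern: repeated hash() calls on a
// (const void*, length) pair, then digest() yields the result and
// resets the internal state so the hasher can be reused.
class Hasher
{
public:
    void hash(const void* data, std::size_t len)
    {
        (void)data;
        total_ += len;        // real code would update the MD5 state here
    }

    std::string digest()
    {
        std::string d = std::to_string(total_);  // fake "digest"
        total_ = 0;           // step 5: yield the digest AND reset
        return d;
    }

private:
    std::size_t total_ = 0;
};
```

Usage mirrors the list: call hash() four times, then on the fifth call
take the digest, and the object is immediately ready for a new message.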

> However, all of the functions which one would normally
> use to do this work on a copy, and not the original object, so
> the original object remains unmodified. This copy isn't free, of
> course, but the object was designed to be as cheap to copy as
> possible -- all data members are arrays of integral types, so
> basically, the copy constructor just does a memcpy. And given
> the cost of calculating an MD5 digest anyway, I suspect that it
> doesn't matter.

Hmm... I still prefer the idea of letting the programmer supply a const
void * to the data and its length (unsigned, of course), then doing a
hash, asking for the digest if it is time to get the digest, and if
not, letting the hasher maintain internal state until such a digest is
wanted. Also, if you allow explicit construction of the Digest from a
std::string, and allow it to yield itself as a std::string, you've
pretty much finished off the Digest as a class (unless you are doing
crypto, in which case allowing the Digest to yield itself as a big
integer can be very nice).

-Le Chaud Lapin


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Jakob Bieling on
Le Chaud Lapin <unoriginal_username(a)yahoo.com> wrote:
> gkoreman(a)gmail.com wrote:

>> // Step 3
>> message.push_back(messageLength);

> Technically, you have to be prepared for crunching a digest whose size
> is greater than (2^32). Here you're only taking care of lengths up to
> 2^32.

I might be missing something... but why 2^32?

regards
--
jb

(reply address in rot13, unscramble first)




From: gkoreman on
I know it is shorter; I was just mentally adding 0s when needed. The
digest isn't really going to be printed out.

I am not running this on an x86. I am running it on a big endian
machine to start out, then once I get it working on big-endian, I will
make it work on little-endian.



From: Thomas J. Gritzan on
kanze schrieb:
> Thomas J. Gritzan wrote:
>>Works for me on a little-endian machine (only for zero-length messages).
>
> Only for zero-length messages. Of course, his interface only
> allows for message lengths which are a multiple of 32 bits, so
> this might be OK.
>
> And how does the endian-ness make a difference. I implemented
> MD5 on a Sparc (big-endian) -- I've since compiled the code on a
> Linux based PC (little-endian), and it works perfectly well. Is
> there something fundamentally different between MD5 and SHA-1
> which means that SHA-1 is endian-dependent? They look very
> similar in principle to me.

I changed the interface, adding a length parameter to the function.
Now it works for the text examples on Wikipedia as well.
The endian-ness makes no difference except for the order of the
high part and the low part of the 64-bit "length of the message". Since
the OP did it as 32-bit, he bypasses this problem.

So the code works without change on little & big endian machines.

Thomas


From: Le Chaud Lapin on

kanze wrote:
> And how does the endian-ness make a difference. I implemented
> MD5 on a Sparc (big-endian) -- I've since compiled the code on a
> Linux based PC (little-endian), and it works perfectly well. Is
> there something fundamentally different between MD5 and SHA-1
> which means that SHA-1 is endian-dependent? They look very
> similar in principle to me.

It definitely makes a difference, but first, let me rant:

Big-endian is silly, IMO. It was created as a substitute for
little-endian to help those who, for whatever reason, find
little-endian confusing when it's printed out. Ironically, it doesn't
help, because it is precisely these people who are most likely to get
confused no matter what, so what you end up with is two camps: one of
people who are not confused in the least but have to acknowledge the
nuisance that is big-endian, and one of people who are still confused
and upset that big-endian didn't rid them of the confusion:

http://groups.google.com/group/comp.protocols.tcp-ip/browse_frm/thread/3f89dd1fb5a3fa66

Getting off my rant box: if the input data is 0x01234567, that's 4
bytes. In a buffer, beginning at offset 0, either the 0x01 will be
there or the 0x67 will be there.

When the programmer goes to form a word from this buffer to be used in
the plethora of Boolean operations of the hash algorithm, s/he will
have to choose how to extract the bytes from this stream to form the
word. Note that implicit in all of this is that when a stream of
bytes moves from one machine to another, the bit order is preserved for
individual bytes, and the byte order is preserved for the stream. So if
0x01 is the byte at offset 0 (big-endian), it will be the byte at
offset 0 when it is sent as a sequence of bytes. But if a
little-endian receiving machine gets these 4 bytes, puts them in
increasing memory locations starting at 0, and then extracts the 4
bytes as a single word, the word will come out as 0x67452301.

So there are two subtleties here. The first is that bit order is
guaranteed no matter what from machine to machine. We take that for
granted, because it is. The second is that no one ever said that we
were allowed to treat the input stream as a sequence of words. The
most that we are allowed is bytes. So during both extraction and
implantation of words, we have to be careful about byte sequences.
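The careful extraction described here can be written portably with
shifts, which operate on values rather than memory and therefore
behave identically on big- and little-endian hosts. A generic sketch,
not anyone's posted code:

```cpp
#include <cstdint>

// Assemble a 32-bit word from four bytes in a fixed, specified order,
// regardless of the host's endianness. SHA-1 treats its input stream
// as big-endian words; MD5 treats it as little-endian words.
std::uint32_t load_be32(const unsigned char* p)   // big-endian stream
{
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16)
         | (std::uint32_t(p[2]) <<  8) |  std::uint32_t(p[3]);
}

std::uint32_t load_le32(const unsigned char* p)   // little-endian stream
{
    return (std::uint32_t(p[3]) << 24) | (std::uint32_t(p[2]) << 16)
         | (std::uint32_t(p[1]) <<  8) |  std::uint32_t(p[0]);
}
```

Given the byte sequence 0x01 0x23 0x45 0x67, the big-endian load yields
0x01234567 and the little-endian load yields 0x67452301, on any host.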

-Le Chaud Lapin-


