From: Le Chaud Lapin on

Jakob Bieling wrote:
> Le Chaud Lapin <unoriginal_username(a)yahoo.com> wrote:
> > Technically, you have to be prepared for crunching a digest whose size
> > is greater than (2^32). Here you're only taking care of lengths up to
> > 2^32.
>
> I might be missing something ... but why 2^32?

Well, SHA-1 calls for taking a message of N bits and padding it with a
1 bit, followed by however many zero bits are necessary to reach
congruence to 448 mod 512, i.e. 512 - 64, where the 64 bits are
reserved to store the original length of the message. You can see that
the OP is padding the message with a 1 bit, followed by however many
zero bits are necessary to get congruence to 512 - 32 = 480 (he wrote
490 but meant 480).
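
For the common byte-oriented case, here is a minimal sketch of the
padding the spec requires (pad_sha1 is a hypothetical helper, not the
OP's code):

    #include <vector>

    // Hypothetical helper: pads a whole-byte message per the SHA-1
    // spec. Appends the mandatory 1 bit (0x80 when the message is
    // byte-aligned), then zero bytes until the length is congruent
    // to 448 mod 512 bits (56 mod 64 bytes), then the original bit
    // length as a 64-bit big-endian integer.
    std::vector<unsigned char> pad_sha1(std::vector<unsigned char> msg)
    {
        unsigned long long bitLen =
            static_cast<unsigned long long>(msg.size()) * 8;
        msg.push_back(0x80);
        while (msg.size() % 64 != 56)
            msg.push_back(0x00);
        for (int shift = 56; shift >= 0; shift -= 8)
            msg.push_back(static_cast<unsigned char>(bitLen >> shift));
        return msg;
    }

    int main()
    {
        std::vector<unsigned char> empty;
        return pad_sha1(empty).size() == 64 ? 0 : 1;  // one 512-bit block
    }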

This is why the result will come out right for a zero-length string,
but not necessarily right for other strings. OTOH, he mentions in a
follow-up post that he knew this and was just trying to get the thing
working first.

-Le Chaud Lapin-


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Carl Barron on
In article <1132616256.047169.129400(a)g44g2000cwa.googlegroups.com>, Le
Chaud Lapin <unoriginal_username(a)yahoo.com> wrote:

>
> Big-endian is silly, IMO. It was created as a substitute for
> little-endian to help those who, for whatever reason, find
> little-endian confusing when it's printed out.
I am not so sure about the factuality of this; it seems only that
you are used to assuming the world should be little endian. I think
historically big endian was around long before little endian. I don't
recall any endian conversations before the Z80/M6502 era.


From: gkoreman on
[...]
> But if a little-endian receiving machine gets these 4 bytes, and
> puts them in increasing memory locations starting at 0, and then
> extracts the 4 bytes as a word, the word will become 0x76543210.
[...]

Close. The word will actually become 0x67452301.
I am sure that is what you meant though.
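
For concreteness, here is a small sketch of the extraction. The byte
values 0x01, 0x23, 0x45, 0x67 are my assumption about the snipped
example (a big-endian sender emitting the word 0x01234567):

    #include <cstdio>

    int main()
    {
        // Bytes as received, stored at increasing addresses from 0.
        unsigned char bytes[4] = { 0x01, 0x23, 0x45, 0x67 };

        // A little-endian reader takes the byte at the lowest address
        // as the least significant, so the word is 0x67452301.
        unsigned long word = 0;
        for (int i = 3; i >= 0; --i)
            word = (word << 8) | bytes[i];

        std::printf("0x%08lX\n", word);   // prints 0x67452301
        return 0;
    }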



From: kanze on
Le Chaud Lapin wrote:

> kanze wrote:

> > Le Chaud Lapin wrote:


> > > In any case, again, you have to be fastidious in watching
> > > the spec in this section. In general, you cannot slap an
> > > entire 64-bit chunk onto the message. For example, if the
> > > message is 440 bits long, you're only allowed to add 1
> > > byte for padding: the bit-pattern of a 1 followed by 7
> > > zeros. This will bring you to congruency mod 448, at
> > > which point you can tack on the 64-bit length to get to a
> > > full 512 bits.


> > His interface forced the user to provide the 32-bit chunks.


> Not sure if that is the best way to approach this.


I'm pretty sure it isn't :-).


> To view the input data to the hasher as a stream of bytes is
> far more general than forcing the view as a vector of 32-bit
> chunks, and in most cases eliminates unnecessary copying.


I'm not sure about the "unnecessary copying" bit. At some
point, you do have to put the data in blocks of 32-bit chunks to
process it, and that involves some copying. The only difference
is whether the user has to handle the problem, or whether it is
handled in the Digest code.

Other than that, technically, the basis of the Digest is a
stream of bits, and you should provide an interface which allows
the user to pass in only a certain number of bits, at least for
the last byte. Practically, I've never found any use for this,
and haven't bothered to implement it. My MD5 code restricts the
user to a multiple of 8 bits -- whole bytes. It's a restriction,
but in my
experience, it's not a bothersome one.
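
To make the "handled in the Digest code" point concrete, here is a
rough sketch of the buffering; the class and member names are invented
for illustration, not taken from anyone's posted code:

    #include <cstddef>
    #include <cstring>

    class ByteHasher
    {
        unsigned char buffer[64];   // one 512-bit block
        std::size_t   filled;       // bytes currently buffered

        void processBlock()
        {
            // The real compression function would go here; omitted,
            // since the sketch only shows the byte-to-block marshalling.
        }

    public:
        ByteHasher() : filled(0) {}

        // Accepts arbitrary byte runs; the 32-bit chunking is internal.
        void append(unsigned char const* data, std::size_t n)
        {
            while (n > 0) {
                std::size_t take = 64 - filled;
                if (take > n) take = n;
                std::memcpy(buffer + filled, data, take);
                filled += take;
                data += take;
                n -= take;
                if (filled == 64) {
                    processBlock();
                    filled = 0;
                }
            }
        }
    };

    int main()
    {
        ByteHasher h;
        unsigned char data[100] = { 0 };
        h.append(data, sizeof data);   // spans a block boundary
        return 0;
    }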


> > Technically, of course, you might have to pad as little as a
> > single bit. In practice, you can pad using the same
> > multiple of bits as the user is required to provide
> > (supposing that is a power of 2). Of course, for most uses,
> > you'll want to support granularity at least as small as bytes.


> This is true, and it also hints at a fact that many of us
> overlook in certain kinds of programming (embedded,
> communications, etc.): when you send a byte from one machine
> to another, there is already a guarantee that the bit order
> will be preserved for you. Imagine a world where you had to
> worry about bit order too.


Been there. Done that. There is universal agreement that bit
order is little endian. Except in certain IBM documentation, at
least.


> > > Technically, you have to be prepared for crunching a
> > > digest whose size is greater than (2^32). Here you're
> > > only taking care of lengths up to 2^32.


> > Depends on the application. My own implementation of MD5 is
> > limited to SIZE_T_MAX * 8 bits. Perhaps internally,
> > there are other limits which mean that the messages will be
> > small.


> Hmm... not sure you can call this an implementation of MD5. I
> would make a concession that it is OK to view the data as a
> string of bytes instead of a string of bits, which, as you
> pointed out, is what the spec calls for, for true compliance.
> But I think tossing out the upper half of the 64-bit word for
> the length and calling it an implementation of the algorithm
> is stretching it. This presumes of course that you are making
> a generally reusable component, which you intend to polish off
> and make available for others to use.


Woah. I don't toss out anything. I just require that the data
set over which I calculate the Digest is not larger than 2^32
bytes. In practice, it has to be smaller than that if it is to
fit into memory. (On the other hand, when I implemented it, it
was for a particular field in a message of the Radius protocol.
32 bytes maximum, if I remember correctly.)


> > > 2. You want the output of the hashing to be an object
> > > itself so that you can let it construct, assign it,
> > > compare it, store it in containers, toss it around, etc.,
> > > let it destruct.


> > Maybe. In my case, my MD5 calculator is an accumulator. I
> > suppose that there is some argument for defining a result
> > type, but in practice, I've almost always read the results
> > using the << operator, e.g.:


> >     file << myText
> >          << std::accumulate( myText.begin(), myText.end(), MD5() ) ;


> > Internally, the << operator uses a member function:
> >     template< typename OutIter >
> >     void resultsTo( OutIter dest ) const ;
> > which I made public, more because there wasn't any reason not
> > to, and I thought it might be useful to someone.


> Where is the digest? Certainly it would be nice to write
> map<Digest, Foo>, for example.


You can extract it into a string (as hex) if you want. But
you're right -- some sort of Digest type would be cleaner. I
just didn't need it.


> This again assumes that you are making a class that someone
> else would want to use. I can hardly think of a more
> opportune moment to let the object-oriented features of C++
> bear down on the design of a component.


I based my own class more on the STL -- it's an accumulator for
std::accumulate, and it supports operator << for extracting the
current value.

That corresponded to my needs at the time. Your example with
map is a very good one, however, and as soon as I get time...


> > > 4. You want the hash algorithm itself to be fast.


> > You want it to be correct first. It isn't, so it's not yet
> > time to worry about speed.


> Well, when I do my designs, I like to take everything into
> account all at once, including performance (macro, not micro).


If by macro, you mean big-O complexity, perhaps. Otherwise...

I also keep context in mind. The poster is having problems
getting code to work. No point in worrying about speed yet.


> I spend most of my time mulling. Then I fidget to verify
> propositions, then back off, then walk around in the woods some,
> then eat a snack, then tinker, and only when I think I have a
> visceral feel for what's going on do I begin to bear down on
> the implementation. For me the code always comes out cleaner
> than it would have if I had taken the incremental approach,
> and takes less time to get a finished product (contrary to
> what others claim).


You won't find me arguing against design first. You should be
able to do incremental development, for the all too frequent
times when you have to, because the customer doesn't know what
he wants until he starts seeing things. But even then, I'll
often knock out a throwaway prototype, using what I learn from
it in order to make a clean design for the final product.


> You're going to have to think about these things at some point
> anyway, so you might as well mull it over as much as possible
> before you start typing, because the mind is faster than the
> hand (though both work better with the help of the other).


Well, I don't like to go looking for problems until they occur.
Some things are obvious -- if I have to process millions of data
elements, an O(n^2) algorithm isn't going to cut it. On the
other hand, I don't waste my time worrying about constant
factors until the profiler says I have to.


> > > I would define the algorithm itself as a namespace.


> > Why?


> Well, to give you a real example, let's say you had a massive
> software project that depended on primitives involved in
> security, including hashing. One master #include file
> contained:


> namespace Security_Set
> {
>     using MD5::Hasher;
>     using MD5::Digest;
>
>     using namespace RSA;
>
>     typedef RC6::Cipher<32,20,16> Symmetric_Cipher;
>     typedef RC6::Key<16>          Symmetric_Key;
> }


> Now let's say two years later, someone found a flaw in your
> chosen Hasher! Good grief!! No worries, you could simply
> modify your namespace:


> namespace Security_Set
> {
>     using SHA_256::Hasher;
>     using SHA_256::Digest;
>
>     using namespace RSA;
>
>     typedef RC6::Cipher<32,20,16> Symmetric_Cipher;
>     typedef RC6::Key<16>          Symmetric_Key;
> }


> If you have been using Hasher and Digest objects throughout
> your system, all you would have to do is recompile, and
> everything should work, data stored on disk notwithstanding.


The same thing is true if you use typedefs. And if you can
implement the different Digest algorithms using a Policy
template parameter, then you're going to be using templates. No
one wants to write Gabi::Hasher< MD5 > all over the place.
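
For comparison, a sketch of the typedef variant; the MD5 and SHA_256
namespaces here are empty stand-ins, just to make the sketch
self-contained:

    // Stand-ins for real hasher implementations.
    namespace MD5     { class Hasher {}; class Digest {}; }
    namespace SHA_256 { class Hasher {}; class Digest {}; }

    namespace Security_Set
    {
        // Point these two typedefs at SHA_256 instead and recompile;
        // client code written against Security_Set::Hasher is
        // unchanged.
        typedef MD5::Hasher Hasher;
        typedef MD5::Digest Digest;
    }

    int main()
    {
        Security_Set::Hasher h;
        (void)h;
        return 0;
    }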

[...]

> > After a quick glance at SHA-1, I suspect that you could use
> > some sort of common machine for the entire family, including
> > SHA-1 and MD5, with a policy class to customize, much as is
> > standard practice for CRCs. I'd certainly evaluate this
> > route before implementing my second digest in the family.
> > (As mentioned above, I've already implemented MD5.)


> Yikes. This is a good example of where the whole reusability
> hunt is too aggressive, IMO.


I can sort of understand your reaction. There's no point in
making code completely illegible, just so you can gain two lines
when you use it twice. On the other hand, CRC lent itself quite
well to this sort of thing -- the policy class is relatively
simple, and only contains a few constants and tables; all of the
actual code is in the template itself.

Before this thread, I'd not looked at any digest but MD5,
because that was the only one I needed professionally until now.
A quick glance at SHA gave me the impression that it was pretty
much the same, modulo a few parameters, and maybe a primitive
function or so. So the possibility is certainly in my head.
Especially as there is a good deal of convenience boilerplate
in my MD5 class, which would definitely be the same in an SHA
class.
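
A sketch of how that policy route could look; DigestEngine and
Md5Policy are invented names, and the members shown are only
representative:

    // Hypothetical policy: the algorithm-specific constants and
    // primitive functions.
    struct Md5Policy
    {
        enum { digestBytes = 16, blockBytes = 64 };
        // initial chaining values, round constants, primitives...
    };

    // Hypothetical common machine: the padding, buffering and
    // extraction code lives here once, parameterized by the policy.
    template <typename Policy>
    class DigestEngine
    {
    public:
        enum { digestBytes = Policy::digestBytes };
        // append(), resultsTo(), etc., implemented in terms of
        // Policy's constants and primitives.
    };

    typedef DigestEngine<Md5Policy> Md5;

    int main()
    {
        return Md5::digestBytes == 16 ? 0 : 1;
    }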


> Hashers are very well defined. You know exactly what they
> are, the sizes of their digests, the expectations on input.
> They are small in terms of the code size. Trying to force them
> to be common just so you can say, "it took me a week but I
> finally tweaked these things enough to use common code" is a
> bit much, don't you think?


The final decision will be based simply on: will it cost me less
time to make my current implementation generic than to write a
new one? My first quick glance at SHA gave me the impression
that most of my MD5 code would be applicable. This is true
precisely because, as you say, the hashers are small in terms of
code size; a significant part of my implementation is involved
in providing a comfortable interface (e.g. supporting insertion
of char, unsigned char, and int for the return values of
functions like istream::get(), or several different formats of
output).


> > This is IMHO the big question with regards to the interface.
> > In my current implementation, extracting the value would
> > modify the internal state, so that the accumulator itself
> > could not be reused. [snipped]


> This is the fun part of software design -- imagining all the
> situations in which the primitive might be used, and then
> designing it so that it will feel good for the user once the
> primitive is wielded in the construction of a system. Note
> that "feel good" comes from eliminating as much uncertainty
> surrounding the primitive as possible. Said in a different
> way, "feel good" comes from a reduction of the requisite mental
> burden.


Exactly. Generally speaking, I doubt that it is that useful to
be able to continue hashing after having extracted the digest.
But you never know, and it was simple enough to avoid the
problem.


> 1. Hash. 2. Hash. 3. Hash. 4. Hash. 5. Hash, but get the
> digest and reset the hasher.


> This feels right, and it will work in almost any context,
> including hashing a sequence of 32-bit words stored in a
> vector<>. The programmer does not have to worry about
> inefficiencies from copying.


I've separated getting the digest from resetting. A user can get
the digest and continue hashing. I don't know if this is
practically useful, but why not?


> > However, all of the functions which one would normally use
> > to do this work on a copy, and not the original object, so
> > the original object remains unmodified. This copy isn't
> > free, of course, but the object was designed to be as cheap
> > to copy as possible -- all data members are arrays of
> > integral types, so basically, the copy constructor just
> > does a memcpy. And given the cost of calculating an MD5
> > digest anyway, I suspect that it doesn't matter.


> Hmm... I still prefer the idea of letting the programmer supply
> a const void * to the data and its length (unsigned, of course),
> then doing a hash, and asking for the digest if it is time to
> get the digest, and if not, letting the hasher maintain
> internal state until such a digest is wanted.


That's more or less what I do. There are functions to add data
to the hash, including a global operator+, so that the object
can be used with std::accumulate, and functions which extract
the digest. The functions which extract the digest are all
const, and do not modify the state in any way.
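
A sketch of that convention; the Hasher class is a skeleton invented
to show the shape of the interface, not anyone's actual code:

    #include <numeric>
    #include <string>

    class Hasher
    {
    public:
        Hasher() {}
        void append(unsigned char byte)
        {
            (void)byte;   // real state update omitted in this sketch
        }
        // const extraction functions go here; they read the state
        // (or work on a copy) but never modify it
    };

    // A global operator+ lets std::accumulate fold a character
    // sequence into the hasher; it works on, and returns, a copy.
    inline Hasher operator+(Hasher acc, char c)
    {
        acc.append(static_cast<unsigned char>(c));
        return acc;
    }

    int main()
    {
        std::string text("The quick brown fox");
        Hasher h = std::accumulate(text.begin(), text.end(), Hasher());
        (void)h;
        return 0;
    }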


> Also, if you allow explicit construction of the Digest from a
> std::string, and allow it to yield itself as a std::string,
> you've pretty much finished off the Digest as a class (unless
> you are doing crypto, in which case allowing the Digest to
> yield itself as a big integer can be very nice.)


I like your idea of a Digest type, especially with the explicit
constructor from string and the ability to yield itself as a
string. A separate digest type is missing in my current
implementation. On the other hand, I do have the possibility of
extracting the digest as a raw binary -- all the user has to do
is provide an output iterator to which I can write a uint8_t.
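
A sketch of that extraction path, using the resultsTo signature quoted
earlier in the thread; the digest contents here are dummies:

    #include <iterator>
    #include <vector>
    #include <cstddef>

    class Md5Hasher
    {
        unsigned char digest[16];   // dummy final value for the sketch
    public:
        Md5Hasher()
        {
            for (std::size_t i = 0; i != 16; ++i)
                digest[i] = 0;
        }

        // Writes the raw digest bytes to any output iterator; const,
        // so it leaves the internal state untouched.
        template <typename OutIter>
        void resultsTo(OutIter dest) const
        {
            for (std::size_t i = 0; i != 16; ++i)
                *dest++ = digest[i];
        }
    };

    int main()
    {
        Md5Hasher h;
        std::vector<unsigned char> raw;
        h.resultsTo(std::back_inserter(raw));   // raw holds 16 bytes
        return raw.size() == 16 ? 0 : 1;
    }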

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34



From: kanze on
Carl Barron wrote:
> In article
> <1132616256.047169.129400(a)g44g2000cwa.googlegroups.com>, Le
> Chaud Lapin <unoriginal_username(a)yahoo.com> wrote:

> > Big-endian is silly, IMO. It was created as a substitute
> > for little-endian to help those who, for whatever reason,
> > find little-endian confusing when it's printed out.

> I am not so sure about the factuality of this; it seems only
> that you are used to assuming the world should be little
> endian. I think historically big endian was around long
> before little endian.

Historically, machines were word addressed, and there wasn't any
Internet, so there wasn't any endian-ness. The IBM 360 was
big-endian, even to the numbering of the bits in its words: bit 0
was the most significant, so the low order bit of a 32-bit word
was bit 31, and the low order bit of a byte was bit 7. The PDP-11
was little-endian.

It's curious that the network should have standardized on
big-endian byte order. Bit order is always little endian, and
most of the protocols were originally developed on VAXen, which
are little endian.

> I don't recall any endian conversations before the Z80/M6502
> era.

That's because machines didn't communicate with machines of
other types before then.

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


