From: Mok-Kong Shen on
robertwessel2(a)yahoo.com:
[snip]
> Cooking down each day's New York Times (or a Usenet feed) is probably
> a perfectly acceptable source of entropy for a simulation, but I would
> harbor severe doubts about its value if you need cryptographically
> secure random bits.

If one has the entropy but is not sure of its security (because
the opponent might correctly guess the materials one used), the issue
isn't big in my humble view. With the first method I mentioned, the
content of the polyalphabetic substitution matrix is one's secret,
so the result is protected. With the second method, one can pass the
result through, e.g., a transformation via a permutation polynomial
(whose coefficients are one's secret), which conserves the entropy
because the mapping is bijective.
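The second method described above can be sketched as follows. This is a minimal illustration, assuming 32-bit words: an affine map a*x + b mod 2^32 with odd a is the simplest permutation polynomial, and the constants shown are illustrative placeholders, not values from the thread.

```python
# Sketch: pass entropy words through a bijective (hence
# entropy-preserving) transform whose coefficients are secret.
MOD = 2**32
A = 0x9E3779B1   # secret coefficient; must be odd so the map is invertible mod 2^32
B = 0x7F4A7C15   # secret offset

def transform(x):
    # Affine permutation polynomial: a bijection on [0, 2^32)
    return (A * x + B) % MOD

def inverse(y):
    # Modular inverse of A exists because A is odd
    a_inv = pow(A, -1, MOD)
    return (a_inv * (y - B)) % MOD
```

Because the map is a bijection, no entropy is lost, but an opponent who does not know A and B cannot relate the output back to the source material.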

M. K. Shen

From: unruh on
On 2010-04-06, robertwessel2(a)yahoo.com <robertwessel2(a)yahoo.com> wrote:
>> On Apr 6, 12:57 am, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
>> On 2010-04-06, robertwess...(a)yahoo.com <robertwess...(a)yahoo.com> wrote:
>>
>> > On Apr 5, 9:03 pm, Earl_Colby_Pottinger
>> ><earlcolby.pottin...(a)sympatico.ca> wrote:
>> >> On Apr 5, 8:17 pm, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
>>
>> >> > On 2010-04-05, Earl_Colby_Pottinger <earlcolby.pottin...(a)sympatico.ca> wrote:
>>
>> >> > > Text is a VERY VERY non-random source.
>>
>> >> > Well, very very is perhaps an overstatement. Certainly it has a fair
>> >> > amount of redundancy, but estimates of 2-2.5 bits of randomness per
>> >> > character are often quoted (rather than the 8 bits/byte, or the 6
>> >> > bits/character assuming only the ASCII printable characters or so).
>> >> > Thus if you squeeze the text down by a factor of 3 or so, you should get
>> >> > pretty good randomness (i.e., use MD5 on each 50 characters or so, and use the
>> >> > 128-bit output as your random source).
>>
>> >> That does not sound right to me; I thought English text when guessed
>> >> by humans has a far lower bit rate than that. Or am I
>> >> misunderstanding how well MD5 will hash the input? It seems to me that
>> >> there are far fewer than 2 to the power of 128 possible ways to arrange 50
>> >> characters of text that will be valid English text (with trailing
>> >> and leading fragments).
>>
>> > It's actually more along the lines of 0.6-1.5 bits/character, depending
>> > on who did the estimate (Shannon measured 0.6-1.3).
>>
>> Fine make it 150 characters.
>
>
> But still, it's not very random in the cryptographic sense. Let's say
> there are on the order of a trillion English-equivalent words
> published per day. If I know you got your entropy from some N-byte
> sequence in Tuesday's collection of 1T words, that's at best about 40

??? How is that 40 bits' worth? Your statement is like saying "I know you
used decimal digits, and there are only 10 of them, so that is only 3
bits' worth of entropy."
How do you know he got the entropy from what was published Tuesday,
rather than from Mark Twain's Tom Sawyer? How do you know they were
consecutive? Since you do not know the order in which I arranged those
trillion words, you also do not know what it was.
How do you know those N bytes were consecutive? How do you know how big
N was?


> bits worth. And if I can monitor the traffic to your PC, you couldn't
> retrieve more than about 1TB of source material per day even if you
> fully utilized a 100Mb/s link.

So?

The machine has the words (a few GB) on board.



>
> FWIW, total Usenet traffic is about 20 million messages, and 20GB, per
> day. If you wanted to use an external stream as an entropy source,
> that's a highly available (and fairly high-rate) source. That does
> contain a noteworthy binary component, of course.
>
> Cooking down each day's New York Times (or a Usenet feed) is probably
> a perfectly acceptable source of entropy for a simulation, but I would
> harbor severe doubts about its value if you need cryptographically
> secure random bits.
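The compression scheme discussed upthread (squeeze the text by a factor large enough to cover the entropy estimate, hashing each chunk down to 128 bits) can be sketched as follows. This is a rough illustration only: MD5 appears because the thread names it, not as a recommendation, and the 150-character chunk size follows the ~1 bit/character estimates quoted above.

```python
import hashlib

def entropy_pool(text, chunk=150):
    # Hash each chunk of characters down to a 128-bit digest.
    # At ~1 bit of entropy per character of English, 150 characters
    # should comfortably cover the 128 bits of output per chunk.
    out = b""
    for i in range(0, len(text) - chunk + 1, chunk):
        out += hashlib.md5(text[i:i + chunk].encode()).digest()
    return out
```

Each full 150-character chunk contributes 16 bytes to the pool; any trailing partial chunk is discarded rather than hashed at a lower entropy density.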
From: robertwessel2 on
On Apr 6, 10:30 am, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
> On 2010-04-06, robertwess...(a)yahoo.com <robertwess...(a)yahoo.com> wrote:
>
> > On Apr 6, 12:57 am, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
> >> On 2010-04-06, robertwess...(a)yahoo.com <robertwess...(a)yahoo.com> wrote:
>
> >> > On Apr 5, 9:03 pm, Earl_Colby_Pottinger
> >> ><earlcolby.pottin...(a)sympatico.ca> wrote:
> >> >> On Apr 5, 8:17 pm, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
>
> >> >> > On 2010-04-05, Earl_Colby_Pottinger <earlcolby.pottin...(a)sympatico.ca> wrote:
>
> >> >> > > Text is a VERY VERY non-random source.
>
> >> >> > Well, very very is perhaps an overstatement. Certainly it has a fair
> >> >> > amount of redundancy, but estimates of 2-2.5 bits of randomness per
> >> >> > character are often quoted (rather than the 8 bits/byte, or the 6
> >> >> > bits/character assuming only the ASCII printable characters or so).
> >> >> > Thus if you squeeze the text down by a factor of 3 or so, you should get
> >> >> > pretty good randomness (i.e., use MD5 on each 50 characters or so, and use the
> >> >> > 128-bit output as your random source).
>
> >> >> That does not sound right to me; I thought English text when guessed
> >> >> by humans has a far lower bit rate than that. Or am I
> >> >> misunderstanding how well MD5 will hash the input? It seems to me that
> >> >> there are far fewer than 2 to the power of 128 possible ways to arrange 50
> >> >> characters of text that will be valid English text (with trailing
> >> >> and leading fragments).
>
> >> > It's actually more along the lines of 0.6-1.5 bits/character, depending
> >> > on who did the estimate (Shannon measured 0.6-1.3).
>
> >> Fine make it 150 characters.
>
> > But still, it's not very random in the cryptographic sense.  Let's say
> > there are on the order of a trillion English-equivalent words
> > published per day.  If I know you got your entropy from some N-byte
> > sequence in Tuesday's collection of 1T words, that's at best about 40
>
> ??? How is that 40 bits' worth? Your statement is like saying "I know you
> used decimal digits, and there are only 10 of them, so that is only 3
> bits' worth of entropy."
> How do you know he got the entropy from what was published Tuesday,
> rather than from Mark Twain's Tom Sawyer? How do you know they were
> consecutive? Since you do not know the order in which I arranged those
> trillion words, you also do not know what it was.
> How do you know those N bytes were consecutive? How do you know how big
> N was?


Because I stated those as assumptions. And that's reasonable since we
generally assume that the attacker knows the algorithm in question.
You might have a better algorithm, but you're still deriving your
entropy from sources available to me.
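As a quick check of the arithmetic behind the 40-bit figure: if the attacker's remaining uncertainty is only which of roughly 10^12 published words the sequence started at, that choice is worth about log2(10^12) bits, regardless of the span length N.

```python
import math

# Uncertainty about which of ~10^12 starting positions was used.
# To an attacker who knows the source corpus and the selection
# algorithm, the position choice alone contributes only about
# log2(10^12) bits of entropy.
position_bits = math.log2(1e12)
print(round(position_bits, 2))   # ~39.86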


> > bits worth.  And if I can monitor the traffic to your PC, you couldn't
> > retrieve more than about 1TB of source material per day even if you
> > fully utilized a 100Mb/s link.
>
> So?


So I can reasonably monitor all of the "text" you're accumulating for
your entropy generation process. Combine that with my knowledge of
the algorithm you're using for selecting source bits, and I'm a long
way towards determining what random numbers you're generating.
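The bandwidth bound quoted above checks out: a fully utilized 100 Mb/s link sustained for one day moves roughly a terabyte.

```python
# 100 Mb/s for 86400 seconds, converted to terabytes of source
# material retrievable per day.
bits_per_day = 100e6 * 86400
terabytes = bits_per_day / 8 / 1e12
print(round(terabytes, 2))   # ~1.08 TB
```

So an attacker able to record one day of traffic at that rate can, in principle, capture every candidate source text the machine fetched externally.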
From: Mok-Kong Shen on
robertwessel2(a)yahoo.com wrote:
> unruh wrote:
[snip]

> So I can reasonably monitor all of the "text" you're accumulating for
> your entropy generation process. Combine that with my knowledge of
> the algorithm you're using for selecting source bits, and I'm a long
> way towards determining what random numbers you're generating.

No problem, even against such a powerful opponent, who knows exactly
which characters of the collected texts are processed.
Firstly, one can mix in some text of one's own (perhaps some random
keying by one's child). Secondly, and that's definitely better, one
encrypts the result.

M. K. Shen
From: WTShaw on
On Apr 5, 8:17 pm, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
> On 2010-04-05, Earl_Colby_Pottinger <earlcolby.pottin...(a)sympatico.ca> wrote:
>
> > Text is a VERY VERY non-random source.
>
> Well, very very is perhaps an overstatement. Certainly it has a fair
> amount of redundancy, but estimates of 2-2.5 bits of randomness per
> character are often quoted (rather than the 8 bits/byte, or the 6
> bits/character assuming only the ASCII printable characters or so).
> Thus if you squeeze the text down by a factor of 3 or so, you should get
> pretty good randomness (i.e., use MD5 on each 50 characters or so, and use the
> 128-bit output as your random source).

Your math makes no sense, as popular as it is, because text characters
have little to do with bits. Rather than apples and oranges, it's
peanuts and grapefruit, or taking a worm's-eye view of the shape of
the planet. So much has been done by people who really did not
understand.
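The character-to-bits relationship the thread argues over can at least be bounded empirically: a general-purpose compressor gives an upper bound on a text's entropy rate in bits per character. A minimal sketch follows; DEFLATE is only a crude model of English, so this overestimates the true rate (Shannon's human-guessing experiments gave roughly 0.6-1.3 bits/char).

```python
import zlib

def bits_per_char(text):
    # Compressed size in bits divided by character count is an
    # upper bound on the entropy rate of the text; the true rate
    # for English is lower still, since DEFLATE cannot exploit
    # long-range structure the way a human guesser can.
    return 8 * len(zlib.compress(text.encode(), 9)) / len(text)
```

Highly repetitive text compresses to well under a bit per character, while varied English lands in the low single digits, consistent with the estimates quoted upthread.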