From: J.D. on
On Apr 5, 10:03 pm, Earl_Colby_Pottinger
<earlcolby.pottin...(a)sympatico.ca> wrote:
>
> Are there any other languages that are far denser than English?

As far as I know, all natural languages have a fair amount of
redundancy, both in their grammatical structure and, more importantly,
in the ratio of morphemes to phonemes (i.e., there are enormously more
possible combinations of sounds, even within the constraints of
English phonology, than there are English words; serdly foon,
shayep?). This redundancy is readily apparent in our ability to
understand s###ences eve# #ver ver# #oisy ch##nels -- and I have never
heard of a natural language that does not have similar redundancy
(diachronic sound change alone should prevent such a language from
ever arising naturally). There are invented languages that are
expressly designed to minimize redundancy (supposedly to maximize
transmission rate), but generally all of these projects turn out to be
unlearnable -- as in, even their creators cannot ever actually use
them conversationally.
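The redundancy claim above can be checked mechanically: English prose compresses well precisely because it is redundant, while random bytes do not compress at all. A minimal sketch (the sample paragraph is filler text of my own, not from the thread):

```python
import os
import zlib

# English prose is redundant, so a general-purpose compressor shrinks it;
# random bytes carry no redundancy, so they do not shrink.
english = (
    "All natural languages carry a fair amount of redundancy, both in "
    "their grammar and in the small fraction of possible sound "
    "combinations that are actually words. This redundancy is what lets "
    "us understand speech over noisy channels, and it is also what "
    "makes ordinary text compress so well."
).encode("utf-8")

compressed = zlib.compress(english, 9)
ratio = len(compressed) / len(english)
print(f"English: {len(english)} -> {len(compressed)} bytes (ratio {ratio:.2f})")

random_bytes = os.urandom(len(english))
print(f"Random:  {len(random_bytes)} -> {len(zlib.compress(random_bytes, 9))} bytes")
```

The English ratio comes out well below 1, while the random input typically grows slightly from container overhead.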

From: unruh on
On 2010-04-06, robertwessel2(a)yahoo.com <robertwessel2(a)yahoo.com> wrote:
> On Apr 5, 9:03 pm, Earl_Colby_Pottinger
><earlcolby.pottin...(a)sympatico.ca> wrote:
>> On Apr 5, 8:17 pm, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
>>
>> > On 2010-04-05, Earl_Colby_Pottinger <earlcolby.pottin...(a)sympatico.ca> wrote:
>>
>> > > Text is a VERY VERY non-random source.
>>
>> > Well, very very is perhaps an overstatement. Certainly it has a fair
>> > amount of redundancy, but estimates of 2-2.5 bits of randomness per
>> > character are often quoted (rather than the 8 bits/byte, or the 6
>> > bits/character assuming only the ASCII printable characters or so).
>> > Thus if you squeeze the text down by a factor of 3 or so, you should get
>> > pretty good randomness (i.e., use MD5 on each 50 characters or so, and use the
>> > 128-bit output as your random source).
>>
>> That does not sound right to me; I thought English text, when guessed
>> by humans, has a far lower bit rate than that. Or am I
>> misunderstanding how MD5 will hash the input? It seems to me that
>> there are far fewer than 2 to the power of 128 possible ways to arrange 50
>> characters of text that will be valid English text (with trailing
>> and leading fragments).
>
>
> It's actually more along the lines of 0.6-1.5 bits/character, depending
> on who did the estimate (Shannon measured 0.6-1.3).

Fine make it 150 characters.
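The scheme being discussed, hashing fixed-size chunks of English text and keeping the digest, can be sketched as follows. The 150-character chunk size and the use of MD5 come from the thread; the function name `extract_random_bits` is mine, and in practice SHA-256 would be the safer choice:

```python
import hashlib

def extract_random_bits(text, chunk_size=150):
    """Hash consecutive chunk_size-character chunks of text.

    At roughly 1 bit of entropy per English character (the thread's
    revised estimate), a 150-character chunk carries on the order of
    150 bits, enough to justify keeping a 128-bit hash output per chunk.
    MD5 is used here only because the thread names it; substitute
    hashlib.sha256 for anything that matters.
    """
    out = []
    for i in range(0, len(text) - chunk_size + 1, chunk_size):
        chunk = text[i:i + chunk_size]
        out.append(hashlib.md5(chunk.encode("utf-8")).digest())
    return b"".join(out)

sample = "the quick brown fox jumps over the lazy dog " * 10  # 440 chars
bits = extract_random_bits(sample)  # 2 full chunks -> 32 bytes
```

Each 150-character chunk yields 16 bytes of output, so the "squeeze by a factor of 3 or so" in the earlier estimate becomes closer to a factor of 9 here.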
From: robertwessel2 on
On Apr 6, 12:57 am, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
> On 2010-04-06, robertwess...(a)yahoo.com <robertwess...(a)yahoo.com> wrote:
> > It's actually more along the lines of .6-1.5 bits/character, depending
> > on who did the estimate (Shannon measured .6-1.3).
>
> Fine make it 150 characters.


But still, it's not very random in the cryptographic sense. Let's say
there are on the order of a trillion English-equivalent words
published per day. If I know you got your entropy from some N-byte
sequence in Tuesday's collection of 1T words, that's at best about 40
bits' worth. And if I can monitor the traffic to your PC, you couldn't
retrieve more than about 1 TB of source material per day even if you
fully utilized a 100 Mb/s link.
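The "about 40 bits" figure is just the size of the attacker's search space. If the seed is known to come from some position in a day's ~10^12 published words, then, even before modeling the text itself, there are only about 10^12 candidate starting points to try:

```python
import math

# Effective entropy if the attacker must only guess *where* in a day's
# ~1 trillion words the seed material came from:
candidate_positions = 10**12
effective_bits = math.log2(candidate_positions)
print(f"{effective_bits:.1f} bits")  # about 39.9
```

Exhausting 2^40 candidates is trivial work for an attacker, which is why the post calls this "not very random in the cryptographic sense."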

FWIW, total Usenet traffic is about 20 million messages, and 20 GB, per
day. If you wanted to use an external stream as an entropy source,
that's a highly available (and fairly high-rate) one. It does
contain a substantial binary component, of course.

Cooking down each day's New York Times (or a Usenet feed) is probably
a perfectly acceptable source of entropy for a simulation, but I would
harbor severe doubts about its value if you need cryptographically
secure random bits.
From: David Eather on
On 6/04/2010 12:25 PM, robertwessel2(a)yahoo.com wrote:
> It's actually more along the lines of 0.6-1.5 bits/character, depending
> on who did the estimate (Shannon measured 0.6-1.3).

I think Shannon also pointed out that the amount of entropy depends
on the length of the text, the per-character entropy dropping as the text gets longer.
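Shannon's point is about conditional entropy: the more context you have, the easier the next character is to predict. A naive n-gram estimator shows the per-character figure falling as the block length grows. This is only a sketch (the function name `entropy_per_char` and the sample string are mine), and finite-sample bias makes the higher-order estimates optimistic, but the direction matches Shannon's result:

```python
import math
from collections import Counter

def entropy_per_char(text, order=1):
    """Per-character entropy estimate from n-gram frequencies.

    This is only the order-n block approximation; Shannon's human-
    guessing experiments, which capture long-range structure that no
    small n-gram model sees, gave the much lower 0.6-1.3 bits/char
    figure quoted above.
    """
    grams = [text[i:i + order] for i in range(len(text) - order + 1)]
    counts = Counter(grams)
    total = len(grams)
    h = -sum(c / total * math.log2(c / total) for c in counts.values())
    return h / order  # bits per character

sample = "the entropy of english text falls as the sample grows longer"
print(entropy_per_char(sample, 1))  # unigram estimate, roughly 4 bits/char
print(entropy_per_char(sample, 2))  # bigram estimate, noticeably lower
```

On real corpora the same trend continues: longer blocks give lower per-character entropy, converging toward the Shannon range as the model captures more structure.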
From: Maaartin on
On Apr 6, 12:56 pm, David Eather <eat...(a)tpg.com.au> wrote:
> I think shannon also pointed out that the amount of entropy also depends
> on the length of text - the entropy dropping as the text get longer.

This should all be no problem, as there's a lot of text available, so
you can hash a couple of kilobytes down to 128 bits. But a publicly known
text can be no source of entropy, can it? Hashing the title page of a
given internet newspaper would be easy enough, but for what purpose could I
use the result? Surely not as a secret key, since the attacker knows it too.
Maybe as a nonce, but for a nonce a counter could be better (e.g.,
Salsa20 needs just a unique nonce, and for CBC an encrypted counter works,
right?). Please give me an example where using an internet text as an
entropy source is advantageous.
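The counter alternative mentioned above needs no entropy at all, only persistence. A minimal sketch (the class name `NonceCounter` is mine; the 8-byte size matches Salsa20's nonce, named in the post):

```python
import struct

class NonceCounter:
    """Unique 8-byte nonces from a counter; no entropy required.

    Salsa20 only requires that a (key, nonce) pair never repeat, so a
    counter that survives restarts is sufficient, and simpler than
    deriving nonces from public text, which an attacker can compute
    just as easily. The counter value must be stored durably between
    uses; that bookkeeping is omitted here.
    """
    def __init__(self, start=0):
        self.counter = start

    def next_nonce(self):
        nonce = struct.pack("<Q", self.counter)  # 8-byte little-endian
        self.counter += 1
        return nonce

gen = NonceCounter()
n0 = gen.next_nonce()  # b'\x00' * 8
n1 = gen.next_nonce()  # distinct from n0
```

This illustrates the asymmetry in the question: nonces only need uniqueness, which a counter provides for free, while keys need secrecy, which no public text can supply.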