From: Andrew Fabbro on
I'm trying to devise a programmatic method to identify plaintext. One
approach I'd like to try is to check candidate plaintext against
tetragraphs that are extremely rare. For example, if XBMQ appears in
the plaintext, then I will consider it non-English and move on to the
next possible key.

This method may not be perfect but I suspect it will work for my
purposes.

The question is...where/how to get such a list of rare tetragraphs? I
have not been able to google anything. There are 456,976 possible
tetragraphs.

I built one from the Moby word lists, but it misses some important
things...for example, the plaintext ATTACKATDAWN (I often don't know
where the word boundaries are) contains "KATD", which does not appear
in any of Moby's mwords or any dictionary word. Apparently, I'll need
to process tetragraphs that cross word boundaries...I'm not sure if
that invalidates the approach.

Hmm. My next thought was to download a hundred plain text books from
Project Gutenberg, string all the letters together, and process the
resultant 4-character substrings...?
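
Roughly what I have in mind (a quick, untested sketch; the file names
below are just placeholders for whatever books I end up downloading):

import re
from collections import Counter
from pathlib import Path

# Strip everything but letters, concatenate, and count every
# 4-character window -- this way tetragraphs that cross word
# boundaries get counted automatically.
def tetragraph_counts(paths):
    counts = Counter()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        letters = re.sub(r"[^A-Za-z]", "", text).upper()
        counts.update(letters[i:i + 4] for i in range(len(letters) - 3))
    return counts

counts = tetragraph_counts(["book1.txt", "book2.txt"])  # placeholder file names
print(counts.most_common(10))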

From: Maaartin on
On Jan 12, 6:34 am, Andrew Fabbro <andrew.fab...(a)gmail.com> wrote:
> I'm trying to devise a programmatic method to identify plaintext.  One
> approach I'd like to try is to check candidate plaintext against
> tetragraphs that are extremely rare.  For example, if XBMQ appears in
> the plaintext, then I will consider it non-English and move on to the
> next possible key.

Sure, but IMHO you should try harder. The time spent on plaintext
recognition should be about the same order of magnitude as the time
spent on decryption. Without having tried it myself, I'd say you could
look at all letters, di-, tri- and tetragrams in a decrypted piece of
text in a shorter time than the decryption takes, thus minimizing the
risk of failure. For example, the text "if XBMQ appears in the
plaintext" is valid plaintext, isn't it? If you reject it outright,
you risk a false negative because of an acronym you don't know.

Something like giving positive points to probable n-grams and
negative points to improbable ones should work better.
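
For instance, something like this rough sketch in Python (untested; it
assumes a table of tetragram counts is already built, and the penalty
for unseen tetragrams would need tuning):

import math

# Score a candidate decryption by summing log-frequencies of its
# tetragrams. Unseen tetragrams get a fixed penalty instead of an
# outright rejection, so one odd acronym can't sink the whole text.
# `counts` maps tetragram -> count, `total` is the sum of all counts.
def score(candidate, counts, total):
    floor = math.log10(0.01 / total)  # penalty for unseen tetragrams; tune as needed
    s = 0.0
    for i in range(len(candidate) - 3):
        c = counts.get(candidate[i:i + 4], 0)
        s += math.log10(c / total) if c else floor
    return s

# Keep the key whose decryption scores highest, or accept a candidate
# only if its score per tetragram is above some tuned threshold.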

> This method may not be perfect but I suspect it will work for my
> purposes.
>
> The question is...where/how to get such a list of rare tetragraphs?  I
> have not been able to google anything.  There are 456,976 possible
> tetragraphs.
>
> I built one from the Moby word lists, but it misses some important
> things...for example, the plaintext ATTACKATDAWN (I often don't know
> where the word boundaries are) contains "KATD", which does not appear
> in any of Moby's mwords or any dictionary word.  Apparently, I'll need
> to process tetragraphs that cross word boundaries...I'm not sure if
> that invalidates the approach.

You need to get a table of all trigrams at the end of a word and
combine it with all single letters at the beginning of the next word,
etc. This is not perfect, as it ignores the frequency distribution of
whole words, but IMHO it's good enough.
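
A rough sketch of that combination from a plain word list (untested;
word frequencies are ignored, as said, and the word list itself is
whatever you already have, e.g. Moby):

from collections import Counter

# Approximate cross-boundary tetragrams by splitting the boundary
# 3+1, 2+2 and 1+3: word-final suffixes paired with word-initial
# prefixes of the next word. `words` is any list of uppercase words.
def boundary_tetragrams(words):
    words = [w for w in words if w]
    combined = Counter()
    for k in (3, 2, 1):  # letters taken from the end of the first word
        suffixes = Counter(w[-k:] for w in words if len(w) >= k)
        prefixes = Counter(w[:4 - k] for w in words if len(w) >= 4 - k)
        for suf, sc in suffixes.items():
            for pre, pc in prefixes.items():
                combined[suf + pre] += sc * pc
    return combined

# e.g. boundary_tetragrams(["ATTACK", "DAWN"]) contains ACKD, CKDA and KDAW.

It still misses tetragrams that span a very short word, like KATD in
ATTACKATDAWN, which straddles three words.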

> Hmm.  My next thought was to download a hundred plain text books from
> Project Gutenberg, string all the letters together, and process the
> resultant 4-character substrings...?

IMHO even in such a large corpus many possible tetragrams will still
be missing.

From: tms on
On Jan 12, 12:34 am, Andrew Fabbro <andrew.fab...(a)gmail.com> wrote:
> I'm trying to devise a programmatic method to identify plaintext.

There is published work on this subject. For instance, Ravi Ganesan
and Alan T. Sherman, "Statistical Techniques for Language Recognition:
An Introduction and Guide for Cryptanalysts", Cryptologia 17(4),
321-366. Try Google Scholar.

> One
> approach I'd like to try is to check candidate plaintext against
> tetragraphs that are extremely rare.  For example, if XBMQ appears in
> the plaintext, then I will consider it non-English and move on to the
> next possible key.

Suppose XBMQ is an acronym, or a foreign word, or nulls added to
confuse analysis?

From: David Eather on
tms wrote:
> On Jan 12, 12:34 am, Andrew Fabbro <andrew.fab...(a)gmail.com> wrote:
>> I'm trying to devise a programmatic method to identify plaintext.
>
> There is published work on this subject. For instance, Ravi Ganesan
> and Alan T. Sherman, "Statistical Techniques for Language Recognition:
> An Introduction and Guide for Cryptanalysts", Cryptologia 17(4),
> 321-366. Try Google Scholar.
>
>> One
>> approach I'd like to try is to check candidate plaintext against
>> tetragraphs that are extremely rare. For example, if XBMQ appears in
>> the plaintext, then I will consider it non-English and move on to the
>> next possible key.
>
> Suppose XBMQ is an acronym, or a foreign word, or nulls added to
> confuse analysis?
>
Sinkov, "elementary cryptanalysis" is built entirely on the concept
and application of statistical techniques for language recognition