Winword-DOC - Word Version Detection [Win32 API]

From: Franz Bachler on 20 Jul 2010 17:01

Hello,

does anyone have a code sample that shows how to detect the Word version
with which a document was created (in C if possible)?

Something like this:

worddetect testfile.doc

"testfile.doc" has the format of Winword 6 / 95 / 97 / 2000 / 2002 / 2003
or 2007

Greetings,
Franz
--
Franz Bachler, A-3250 Wieselburg
E-Mail: fraba (at) gmx.at
Homepage: http://members.aon.at/fraba
or http://home.pages.at/fraba

From: Jongware on 21 Jul 2010 04:25

On 20-Jul-10 23:01 PM, Franz Bachler wrote:
> Hello,
>
> has anyone a code sample how to detect the word version with which the
> document was created?
> (in C if possible)
>
> worddetect testfile.doc
>
> "testfile.doc" has the format of winword 6 / 95 / 97 / 2000 / 2002 / 2003
> or 2007

Check http://en.wikipedia.org/wiki/DOC_(computing) -- its 3rd reference
points to Microsoft's binary documentation of the Word file format, as a
PDF.

The main data structure is called the FIB ("File Information Block"), and
its structure member 'nFib' holds a code for Word versions from 1.0 to
Word 2007 (listed on p. 133 in that PDF).
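
Once nFib is known, turning it into a version name is a plain table lookup. The sketch below is only an illustration: the helper name word_version_name is made up, and the codes are the ones used by the program posted later in this thread, not copied from the PDF.

```c
#include <stddef.h>

/* nFib codes as used by the checking program later in this thread. */
static const struct {
    unsigned short nFib;
    const char *name;
} kWordVersions[] = {
    { 0x0065, "Word 6.0"  },
    { 0x0068, "Word 95"   },
    { 0x00C1, "Word 97"   },
    { 0x00D9, "Word 2000" },
    { 0x0101, "Word 2002" },
    { 0x010C, "Word 2003" },
    { 0x0112, "Word 2007" }
};

/* Hypothetical helper: returns the version name, or NULL if unknown. */
const char *word_version_name(unsigned short nFib)
{
    size_t i;
    for (i = 0; i < sizeof kWordVersions / sizeof kWordVersions[0]; i++)
        if (kWordVersions[i].nFib == nFib)
            return kWordVersions[i].name;
    return NULL;
}
```

A caller would print the name when the result is not NULL and fall back to showing the raw code otherwise.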

[Jw]

From: Franz Bachler on 21 Jul 2010 16:17

> Check http://en.wikipedia.org/wiki/DOC_(computing) -- its 3rd reference
> points to Microsoft's binary documentation of the Word file format, as a
> PDF.
>
> The main data structure is called the DOP ("Document Properties"), and its
> structure member 'nFib' holds a code for Word versions from 1.0 to Word
> 2007 (listed on p. 133 in that PDF).

Okay, nFib is the value I am looking for. But I don't understand how to
find where nFib is located in the Word file. Is it always in the same place?

An example: Word 2003; nFib = decimal 268 = hex 010C; it should be stored as
0C 01 (little endian).

The first hit is at offset 0007BC (decimal 1980), i.e. the 1981st byte,
because the hex dump starts counting at 000000:

0007B0 00 00 FF FF FF FF 00 00 00 00 02 00 0C 01 00 00 *................*
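
For reference, reading such a two-byte little-endian value in C could look like the sketch below (read_u16le is a made-up helper name). Note that simply searching the file for the byte pair 0C 01 can produce false hits, because those two bytes may occur anywhere in the data.

```c
/* Combine two bytes stored low byte first into one 16-bit value. */
unsigned read_u16le(const unsigned char *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}
```

With the bytes 0C 01 this gives 0x010C, i.e. decimal 268.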

Greetings,
Franz

From: Jongware on 22 Jul 2010 05:45

On 21-Jul-10 22:17 PM, Franz Bachler wrote:
>> Check http://en.wikipedia.org/wiki/DOC_(computing) -- its 3rd reference
>> points to Microsoft's binary documentation of the Word file format, as a
>> PDF.
>>
>> The main data structure is called the DOP ("Document Properties"), and its
>> structure member 'nFib' holds a code for Word versions from 1.0 to Word
>> 2007 (listed on p. 133 in that PDF).
>
> Okay, the nFib is the searched value. But I don't understand exactly how to
> detect where the nFib is in the Word File. Is it always on the same place?
>
> An Example: Word 2003; nFib = decimal 268 = Hex 010C; should be stored as 0C
> 01 (little endian)
>
> The first hit is 0007BC + 1 (because hex dump starts with 000000) = decimal
> 1981
>
> 0007B0 00 00 FF FF FF FF 00 00 00 00 02 00 0C 01 00 00
> *................*

Yes -- Microsoft's documentation is not too clear (in spite of their
"Open Specification Promise" ;-)).

Word files start with (hex) D0 CF, but so do a lot of other files: they
are all OLE compound files. MS does provide a number of APIs to open OLE
compound files and read any stream inside them, but perhaps
you don't need that *just to check the version*.
At least the first 512 bytes belong to the OLE container itself (and when
there are more blocks, their size is always a multiple of 512). According to
MS, "The FIB starts at the beginning of the file" -- well, it's the
first thing in the "WordDocument" stream of a Word file, so that's /almost/
true ...
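
Checking for the OLE container can be done on the first eight bytes of the file. A minimal sketch, assuming the file has already been read into memory; is_compound_file is a made-up name, and the eight bytes are the compound file signature as I know it (D0 CF 11 E0 A1 B1 1A E1). A match only says "OLE container", not "Word document".

```c
#include <string.h>

/* Returns 1 if the buffer starts with the compound file signature. */
int is_compound_file(const unsigned char *buf, size_t len)
{
    static const unsigned char sig[8] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    return len >= sizeof sig && memcmp(buf, sig, sizeof sig) == 0;
}
```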

You can check whether you have found a FIB by testing its magic number
'wIdent', the first ushort; its value is not given (... oh well ...) but
in a Word sample file I see that it is 0xA5EC. An additional check is
at +0x22: "wMagicCreated / Unique number identifying the file's creator.
0x6A62 is the creator ID for Word".

The nFib ushort is right after wIdent -- for my file, I find a value of
0x00C1, or 193, indicating it's from Word 97.
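
Put together, the check could be sketched like this. It assumes the file is already in memory and that the caller supplies a candidate offset (for example a multiple of 512); fib_version_at and u16le are made-up names. As far as I know, later Microsoft documentation says newer Word versions may leave nFib at 0x00C1 and record the real version in a separate field, so treat the result as a hint only.

```c
#include <stddef.h>

/* Read a 16-bit little-endian value. */
static unsigned u16le(const unsigned char *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

/* Returns nFib if a FIB (wIdent == 0xA5EC) starts at buf + off,
   or -1 if there is no FIB there or the buffer is too short. */
long fib_version_at(const unsigned char *buf, size_t len, size_t off)
{
    if (off > len || len - off < 4)
        return -1;                     /* need wIdent and nFib: 4 bytes */
    if (u16le(buf + off) != 0xA5EC)
        return -1;                     /* magic number does not match */
    return (long)u16le(buf + off + 2); /* nFib follows wIdent */
}
```

For my sample file the call would return 193 (0x00C1).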

(More on OLE streams can be found on
http://download.microsoft.com/.../WindowsCompoundBinaryFileFormatSpecification.pdf)

[Jw]

From: Franz Bachler on 26 Jul 2010 08:21

Hello,

here is the DOC checking program.

Greetings,
Franz

#include <stdio.h>
#include <stdlib.h>   /* calloc, free, exit */
#include <string.h>

/* Look for a FIB at each 512-byte boundary after the OLE header and
   report the Word version encoded in nFib (stored little endian). */
void nfibsearch(char *szPuffer, int ds)
{
    int i, s;
    char szId[16];    /* nFib bytes in file order */
    char szVer[16];   /* matching Word version */

    for (i = 512; i < 4100; i += 512)
    {
        if (ds > i + 3)   /* wIdent and nFib need 4 bytes */
        {
            if (szPuffer[i] == (char) 0xEC && szPuffer[i+1] == (char) 0xA5)
            {
                printf("\n\n FIB magic (EC A5) found at %d ", i);

                s = 0;
                if (szPuffer[i+3] == (char) 0x00)
                {
                    if (szPuffer[i+2] == (char) 0x65)
                        { s = 1; strcpy(szId, "65 00"); strcpy(szVer, "6.0"); }

                    if (szPuffer[i+2] == (char) 0x68)
                        { s = 1; strcpy(szId, "68 00"); strcpy(szVer, "95"); }

                    if (szPuffer[i+2] == (char) 0xC1)
                        { s = 1; strcpy(szId, "C1 00"); strcpy(szVer, "97"); }

                    if (szPuffer[i+2] == (char) 0xD9)
                        { s = 1; strcpy(szId, "D9 00"); strcpy(szVer, "2000"); }
                }

                if (szPuffer[i+3] == (char) 0x01)
                {
                    if (szPuffer[i+2] == (char) 0x01)
                        { s = 1; strcpy(szId, "01 01"); strcpy(szVer, "2002"); }

                    if (szPuffer[i+2] == (char) 0x0C)
                        { s = 1; strcpy(szId, "0C 01"); strcpy(szVer, "2003"); }

                    if (szPuffer[i+2] == (char) 0x12)
                        { s = 1; strcpy(szId, "12 01"); strcpy(szVer, "2007"); }
                }

                if (s)
                    printf("\n\n Word %s identifier (%s) found at %d ",
                           szVer, szId, i+2);
            }
        }
    }
}

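/* Print every occurrence of szText in the buffer, together with the
   three bytes that follow it. */
void stringsearch(char *szPuffer, char *szText, int ds)
{
    int i, j, l;
    char szInfo[128];

    l = (int) strlen(szText);
    if (l == 0 || l > 120) return;

    /* stop early enough that the text plus 3 trailing bytes stay inside
       the data that was actually read */
    for (i = 0; i + l + 3 <= ds; i++)
    {
        if (memcmp(szPuffer + i, szText, (size_t) l) == 0)
        {
            strcpy(szInfo, szText);
            for (j = 0; j < 3; j++)
                szInfo[l+j] = szPuffer[i+j+l];
            szInfo[l+3] = '\0';
            printf("\n %s found at %d ", szInfo, i);
        }
    }
}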

int main(int argc, char **argv)
{
    long iSize;       /* ftell() returns long */
    int ds;           /* number of bytes actually read */
    char *szPuffer;
    FILE *dz;

    if (argc < 2)
    {
        printf("\n Word Document Evaluation - call with ");
        printf("\n\n %s filename \n ", argv[0]);
        exit(1);
    }

    if ((dz = fopen(argv[1], "rb")) == NULL)
    {
        printf("\n Cannot open file %s! ", argv[1]);
        printf("\n (Possibly file not found?) \n ");
        exit(2);
    }

    /* determine the file size */
    fseek(dz, 0, SEEK_END);
    iSize = ftell(dz);
    fseek(dz, 0, SEEK_SET);

    if (iSize <= 0)
    {
        printf("\n Problem with file %s \n ", argv[1]);
        fclose(dz);
        exit(3);
    }

    /* zero-filled buffer with a little padding behind the file data */
    szPuffer = (char *) calloc((size_t) iSize + 64, sizeof(char));
    if (szPuffer == NULL)
    {
        printf("\n Unable to allocate buffer memory! ");
        printf("\n (Out of memory?) \n ");
        fclose(dz);
        exit(4);
    }

    /* read the whole file in one call instead of byte by byte */
    ds = (int) fread(szPuffer, 1, (size_t) iSize, dz);
    fclose(dz);

    printf("\n File %s - %d Bytes read \n", argv[1], ds);

    /* D0 CF 11 marks an OLE compound file, not only Word documents */
    if (ds >= 3 && szPuffer[0] == (char) 0xD0 &&
        szPuffer[1] == (char) 0xCF && szPuffer[2] == (char) 0x11)
        printf("\n OLE header found (D0 CF 11) \n");
    else
        printf("\n OLE header not found (D0 CF 11) \n");

    // search for "Word.Document."

    stringsearch(szPuffer, "Word.Document.", ds);

    // search for "Microsoft Word "

    stringsearch(szPuffer, "Microsoft Word ", ds);

    // nFib search

    nfibsearch(szPuffer, ds);

    printf("\n");
    free(szPuffer);
    return 0;
}
