From: Rahul on
I end up downloading duplicate (or more!) copies of journal papers (pdf)
since sometimes one forgets that one already has a copy. Annoying. I was
trying to think of a way to prevent this. My existing bibliographic s/w
(Endnote) is not too good at helping me out.

Would MD5 checksums be a good workaround? Every time I do a download:

create a MD5 checksum
if absent in log:
    download pdf
    add MD5 checksum to log
else:
    blow up.
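
The steps above can be sketched in Python roughly as follows. This is only a minimal sketch: the function names `md5_of_file` and `register_pdf`, and the plain-text log format (checksum, two spaces, file name), are assumptions made for illustration. A checksum can only be computed once the file is on disk, so the check here runs on an already-downloaded PDF.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_pdf(pdf_path, log_path):
    """Check pdf_path against the checksum log (hypothetical helper).

    Returns True if the checksum was new and has been appended to the log,
    False if the log already held it, i.e. the file is a duplicate.
    """
    checksum = md5_of_file(pdf_path)
    log = Path(log_path)
    known = set()
    if log.exists():
        for line in log.read_text().splitlines():
            if line.strip():
                # First field on each line is the checksum.
                known.add(line.split()[0])
    if checksum in known:
        return False
    with log.open("a") as handle:
        handle.write(f"{checksum}  {Path(pdf_path).name}\n")
    return True
```

Calling `register_pdf("paper.pdf", "checksums.log")` after each download would then return `False` for a byte-identical repeat, whatever name the file arrived under.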

A lot of this work is on a WinXP machine but since I have the drive
Samba-mapped to my RHEL box I guess I can generate / check the MD5-sums
from the much-better Linux command line.

Any opinions / caveats? ....or "stupid idea"?


--
Rahul
From: mop2 on
That way you need to download the file before you can get its md5sum.
I think logging the URL of the source, and the date if needed,
is a better solution. Later this can be checked without any download.
I don't know if this is applicable in your case.


From: Rahul on
mop2 <mop2bky4mz5tyjwa8ersp7hrg5u9qn(a)gmail.com> wrote in
news:3a629a82-1e4e-4a69-8a31-f26ea188420b(a)a70g2000hsh.googlegroups.com:

> That way you need to download the file before you can get its md5sum.
> I think logging the URL of the source, and the date if needed,
> is a better solution. Later this can be checked without any download.
> I don't know if this is applicable in your case.

Thanks mop2! Problem is: sometimes an article pdf is available from
multiple providers (ScienceDirect, Elsevier etc.)

Other problem: Providers have a tendency to keep changing their URLs.

From your comment I got the idea of perhaps using the DOI
(http://en.wikipedia.org/wiki/Digital_object_identifier). But I feel this
is not always available, especially for older articles.
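
One way to combine the two ideas is to key each article by its DOI when one is known and fall back to a checksum otherwise. The sketch below is only an illustration: the names `normalize_doi` and `dedup_key` are made up, and the regular expression is a loose guess at what a DOI looks like, not an official grammar.

```python
import hashlib
import re

# Loose pattern: "10.", a numeric registrant code, "/", then a suffix.
# This is an assumption for illustration, not an official DOI grammar.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(text):
    """Return a lowercased DOI found in text, or None if there is none."""
    match = DOI_PATTERN.search(text or "")
    if match is None:
        return None
    # Drop trailing punctuation picked up from surrounding prose.
    return match.group(0).rstrip(".,;)").lower()


def dedup_key(doi_text, pdf_bytes):
    """Prefer the DOI as the log key; fall back to the MD5 of the content."""
    doi = normalize_doi(doi_text)
    if doi is not None:
        return "doi:" + doi
    return "md5:" + hashlib.md5(pdf_bytes).hexdigest()
```

With this, the same article fetched from two providers gets the same key whenever the DOI is known, and older articles without one still get a content-based key.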

-Rahul
From: Rob Simpson on
On Mon, 21 Apr 2008 01:53:28 +0000, Rahul propped his eyelids open with
toothpicks and wrote:

> I end up downloading duplicate (or more!) copies of journal papers (pdf)
> since sometimes one forgets that one already has a copy. Annoying. I was
> trying to think of a way to prevent this. My existing bibliographic s/w
> (Endnote) is not too good at helping me out.
>
> Would MD5 checksums be a good workaround? Every time I do a download:
>
> create a MD5 checksum
> if absent in log:
> download pdf
> add MD5 checksum to log
> else:
> blow up.
>
> A lot of this work is on a WinXP machine but since I have the drive
> Samba-mapped to my RHEL box I guess I can generate / check the MD5-sums
> from the much-better Linux command line.
>
> Any opinions / caveats? ....or "stupid idea"?

Unless you expect to d/l 2 pdfs with the same name, you're just creating
extra work by creating an md5sum. If you can't parse your bibliographic
files for the pdf name, create a log file with just the pdf names, then
parse that before downloading.



--
Rob - Linux user number 467898 Ubuntu User number 17166
Linux 2.6.22-14-generic
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
End User (V): An act the Customer Support Staff wishes to carry out.
(Known variant is "Disastrously End User" depending upon the magnitude of
the stupidity.) - Anis Shiekh
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
From: Rahul on
Rob Simpson <here(a)my.pc> wrote in news:480c15e0$1(a)clear.net.nz:
> Unless you expect to d/l 2 pdfs with the same name, you're just creating
> extra work by creating an md5sum. If you can't parse your bibliographic
> files for the pdf name, create a log file with just the pdf names, then
> parse that before downloading.
>


Thanks Rob! But most of the online providers push every download with a
standard name (scif.pdf etc.), or with a seemingly random ID generated at
download time. The "same" file at two dates might be pushed with
different names, and different files might be pushed with the same name.


Hence working with the filename did not seem an option to me.

--
Rahul
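
Since the file names carry no information here, a content-based check over the files already on disk is one possibility. The following is a rough sketch only: `group_by_content` and `duplicates` are hypothetical names, and it assumes that repeat downloads of an article are byte-for-byte identical, which has not been verified for any provider.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_content(folder):
    """Map each MD5 digest to the names of the PDFs in folder that share it."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.pdf")):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    return dict(groups)


def duplicates(folder):
    """Return only the groups of names that contain more than one file."""
    return [names for names in group_by_content(folder).values() if len(names) > 1]
```

Running `duplicates()` on the Samba-mapped folder would list files with identical content under different names. Two different articles that both arrived as scif.pdf would have different checksums, so a log keyed on checksums would not confuse them.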