From: Ian Rawlings on
On 2008-04-12, Will Kemp <Will(a)xxxx.Swaggie.net> wrote:

> The keywords in question can't be defined in advance, so it would
> probably just be a question of the script or program identifying nouns
> and listing all of them that are, say, 4 or more characters long.

Heh, it sounds like you need natural language parsing, which is a meaty
research project. Thankfully there's lots about, so googling for
"natural language parsing perl" or something like that will probably
help, e.g.:

http://opennlp.sourceforge.net/projects.html

> A next best solution would probably be to extract all words of 4
> characters or more and sort them by word length. That's probably
> reasonably easy, but i can't work out how to sort by word length.

Easiest way would be to use an awk script to cycle through all the
words on a line by splitting the line up using split(), then use
length() on each word, printing out those of 4 or more chars
with the letter count first, e.g.

4 this
2 is (but drop this one)
1 a (and this one)
4 test

... then running it through sort -n, then uniq to drop the duplicates,
then cut to strip the counts off again.

I'll leave the details to you ;-)
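
For anyone who would rather not wire up the pipeline, here is a rough
sketch of the same idea in Python rather than awk. It is only an
illustration: the function name words_by_length and the sample sentence
are made up, and the tokenizer is deliberately crude (letters only, no
attempt to pick out nouns). The set does the job of uniq and the sort
key does the job of sort -n.

```python
import re


def words_by_length(text, min_len=4):
    """Return the unique words of at least min_len letters, shortest first."""
    # Crude tokenizer: lowercase everything and keep runs of letters only.
    words = re.findall(r"[a-z]+", text.lower())
    # The set removes duplicates (the uniq step).
    unique = {w for w in words if len(w) >= min_len}
    # Sort by length, then alphabetically (the sort -n step).
    return sorted(unique, key=lambda w: (len(w), w))


if __name__ == "__main__":
    sample = "this is a test of keyword extraction, this is only a test"
    for word in words_by_length(sample):
        # Print the letter count first, as in the example above.
        print(len(word), word)
```

Dropping the len(word) from the print gives the bare word list, which is
what the final cut would have produced.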

It might be worth just biting the bullet though and using perl, one
advantage being that you could then move on to putting words into a
database and using that to grab your keywords from and counting their
frequencies, adding weights etc.
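
As a rough idea of what the database step might look like, here is a
sketch in Python rather than perl, using the sqlite3 module from the
standard library with an in-memory database. The table name "words", the
function names and the sample text are all invented for the example, and
"rarest first" is just one guess at how you might rank keywords.

```python
import re
import sqlite3


def count_words(text, min_len=4):
    """Store word frequencies in an in-memory SQLite table; return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE words (word TEXT PRIMARY KEY, freq INTEGER NOT NULL)"
    )
    for word in re.findall(r"[a-z]+", text.lower()):
        if len(word) >= min_len:
            # Create the row on first sight, then bump its counter.
            conn.execute(
                "INSERT OR IGNORE INTO words (word, freq) VALUES (?, 0)", (word,)
            )
            conn.execute(
                "UPDATE words SET freq = freq + 1 WHERE word = ?", (word,)
            )
    return conn


def rarest(conn, limit=5):
    """Return up to limit (word, freq) rows, least frequent first."""
    rows = conn.execute(
        "SELECT word, freq FROM words ORDER BY freq ASC, word ASC LIMIT ?",
        (limit,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = count_words("test test this keyword extraction test this")
    for word, freq in rarest(db):
        print(freq, word)
```

Swapping ":memory:" for a filename would keep the counts between runs, and
a weight column could be added to the same table later.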

--
Blast off and strike the evil Bydo empire!
http://youtube.com/user/tarcus69
http://www.flickr.com/photos/tarcus/sets/
From: Mike Scott on
Ian Rawlings wrote:
....
>
> It might be worth just biting the bullet though and using perl, one
> advantage being that you could then move on to putting words into a
> database and using that to grab your keywords from and counting their
> frequencies, adding weights etc.
>
If you hunt around, there's a perl Bayes filter that has the code for
much of this. Dr Dobbs, iirc. Might be able to get ideas or hack the code.



--
Mike Scott (unet <at> scottsonline.org.uk)
Harlow Essex England
From: Nix on
On 12 Apr 2008, Will Kemp said:
> I've also tried to install the Lingua::EN::Tagger perl module - but the
> dependency mesh is horrific. I got part of the way through it, but gave
> up because it would take forever to find all the dependencies and get
> them to install.

What? Can't CPAN find them for you, or are these non-Perl dependencies?

--
`The rest is a tale of post and counter-post.' --- Ian Rawlings
describes USENET
From: Will Kemp on
On Sat, 12 Apr 2008 17:18:38 +0100, Nix wrote:

> On 12 Apr 2008, Will Kemp said:
>> I've also tried to install the Lingua::EN::Tagger perl module - but the
>> dependency mesh is horrific. I got part of the way through it, but gave
>> up because it would take forever to find all the dependencies and get
>> them to install.
>
> What? Can't CPAN find them for you, or are these non-Perl dependencies?

Um... Ah, yeah. It probably could have! I had some vague recollection of
an app called cpan (like i said, i haven't done any perl programming for
years), but it didn't exist on my system and i thought it would have been
part of the core perl installation, so i didn't look any further. I've
installed it now though! ;-)

Thanks!

--
http://SnapAndScribble.com/will/blog

From: Will Kemp on
On Sat, 12 Apr 2008 11:45:04 +0000, Gordon Henderson wrote:

> In article <VUZLj.37949$4f4.35146(a)newsfe6-win.ntli.net>, Will Kemp
> <Will(a)xxxx.Swaggie.net> wrote:
>>I'm trying to find some way of parsing text files and extracting
>>"keywords" from them. "Keywords" is hard to define, of course, but i
>>guess it means nouns that are used relatively infrequently.
>>
>>I've tried to install TextMine - but didn't have any luck with it, and
>>couldn't be bothered going any further down that particular track. To be
>>honest, although it was only a couple of days ago, i can't remember what
>>the specific problems were, but i decided it wasn't worth the effort.
>>
>>I've also tried to install the Lingua::EN::Tagger perl module - but the
>>dependency mesh is horrific. I got part of the way through it, but gave
>>up because it would take forever to find all the dependencies and get
>>them to install.
>>
>>Has anyone got any better suggestions than either of those two?
>
> Swish-e ?
>
> http://swish-e.org/
>
> I use it to index mailing lists, but I don't know how easy it might be
> to extract the index it builds for other purposes...

That looks like exactly what i want - if i can extract the relevant
information from it, anyway. But i'm sure i will be able to.

Thanks!


--
http://SnapAndScribble.com/will/blog
