From: Ian Rawlings on 12 Apr 2008 07:59

On 2008-04-12, Will Kemp <Will(a)xxxx.Swaggie.net> wrote:
> The keywords in question can't be defined in advance, so it would
> probably just be a question of the script or program identifying nouns
> and listing all of them that are, say, 4 or more characters long.

Heh, it sounds like you need natural language parsing, a meaty research project. Thankfully there's lots about, so googling for "natural language parsing perl" or something like that will probably help, e.g.

http://opennlp.sourceforge.net/projects.html

> A next best solution would probably be to extract all words of 4
> characters or more and sort them by word length. That's probably
> reasonably easy, but i can't work out how to sort by word length.

Easiest way would be to use an awk script to cycle through all the words on a line by splitting the line up using split(), then use length() on each word, printing each word with its letter count first, e.g.

4 this
2 is (but drop this one)
1 a (and this one)
4 test
...

keeping only those of 4 or more chars, then running it through sort -n then uniq, then cut. I'll leave the details to you ;-)

It might be worth just biting the bullet though and using perl, one advantage being that you could then move on to putting words into a database and using that to grab your keywords from and counting their frequencies, adding weights etc.

-- 
Blast off and strike the evil Bydo empire!
http://youtube.com/user/tarcus69
http://www.flickr.com/photos/tarcus/sets/
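As a rough illustration of the split / length / sort / uniq idea described in the post above, here is a minimal sketch in Python rather than awk or perl. The function names `words_by_length` and `word_frequencies` are made up for this example, and treating any run of ASCII letters as a "word" is a simplifying assumption; this does no noun detection at all.

```python
import re
from collections import Counter


def words_by_length(text, min_len=4):
    """Return the unique words of at least min_len letters, shortest first."""
    # Assumption: a "word" is any run of ASCII letters, compared case-insensitively.
    words = re.findall(r"[A-Za-z]+", text.lower())
    # The set plays the role of uniq; the filter drops the short words.
    unique = {w for w in words if len(w) >= min_len}
    # Sort by length first (like sort -n on the count), then alphabetically.
    return sorted(unique, key=lambda w: (len(w), w))


def word_frequencies(text, min_len=4):
    """Count how often each word of at least min_len letters occurs."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(w for w in words if len(w) >= min_len)


if __name__ == "__main__":
    sample = "this is a test of keyword extraction"
    print(words_by_length(sample))
    print(word_frequencies(sample).most_common(3))
```

The frequency counter is only a starting point for the "counting their frequencies, adding weights" step mentioned in the post; picking out relatively rare nouns would still need a part-of-speech tagger or an index.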
From: Mike Scott on 12 Apr 2008 12:53

Ian Rawlings wrote:
....
> It might be worth just biting the bullet though and using perl, one
> advantage being that you could then move on to putting words into a
> database and using that to grab your keywords from and counting their
> frequencies, adding weights etc.

If you hunt around, there's a perl Bayes filter that has the code for much of this. Dr Dobbs, iirc. Might be able to get ideas or hack the code.

-- 
Mike Scott (unet <at> scottsonline.org.uk)
Harlow Essex England
From: Nix on 12 Apr 2008 12:18

On 12 Apr 2008, Will Kemp said:
> I've also tried to install the Lingua::EN::Tagger perl module - but the
> dependency mesh is horrific. I got part of the way through it, but gave
> up because it would take forever to find all the dependencies and get
> them to install.

What? Can't CPAN find them for you, or are these non-Perl dependencies?

-- 
`The rest is a tale of post and counter-post.' --- Ian Rawlings describes USENET
From: Will Kemp on 12 Apr 2008 14:33

On Sat, 12 Apr 2008 17:18:38 +0100, Nix wrote:
> On 12 Apr 2008, Will Kemp said:
>> I've also tried to install the Lingua::EN::Tagger perl module - but the
>> dependency mesh is horrific. I got part of the way through it, but gave
>> up because it would take forever to find all the dependencies and get
>> them to install.
>
> What? Can't CPAN find them for you, or are these non-Perl dependencies?

Um... Ah, yeah. It probably could have! I had some vague recollection of an app called cpan (like i said, i haven't done any perl programming for years), but it didn't exist on my system and i thought it would have been part of the core perl installation, so i didn't look any further.

I've installed it now though! ;-)

Thanks!

-- 
http://SnapAndScribble.com/will/blog
From: Will Kemp on 12 Apr 2008 14:34

On Sat, 12 Apr 2008 11:45:04 +0000, Gordon Henderson wrote:
> In article <VUZLj.37949$4f4.35146(a)newsfe6-win.ntli.net>, Will Kemp
> <Will(a)xxxx.Swaggie.net> wrote:
>>I'm trying to find some way of parsing text files and extracting
>>"keywords" from them. "Keywords" is hard to define, of course, but i
>>guess it means nouns that are used relatively infrequently.
>>
>>I've tried to install TextMine - but didn't have any luck with it, and
>>couldn't be bothered going any further down that particular track. To be
>>honest, although it was only a couple of days ago, i can't remember what
>>the specific problems were, but i decided it wasn't worth the effort.
>>
>>I've also tried to install the Lingua::EN::Tagger perl module - but the
>>dependency mesh is horrific. I got part of the way through it, but gave
>>up because it would take forever to find all the dependencies and get
>>them to install.
>>
>>Has anyone got any better suggestions than either of those two?
>
> Swish-e ?
>
> http://swish-e.org/
>
> I use it to index mailing lists, but I don't know how easy it might be
> to extract the index it builds for other purposes...

That looks like exactly what i want - if i can extract the relevant information from it, anyway. But i'm sure i will be able to.

Thanks!

-- 
http://SnapAndScribble.com/will/blog