From: Will Kemp on 12 Apr 2008 03:54

I'm trying to find some way of parsing text files and extracting
"keywords" from them. "Keywords" is hard to define, of course, but i
guess it means nouns that are used relatively infrequently.

I've tried to install TextMine - but didn't have any luck with it, and
couldn't be bothered going any further down that particular track. To be
honest, although it was only a couple of days ago, i can't remember what
the specific problems were, but i decided it wasn't worth the effort.

I've also tried to install the Lingua::EN::Tagger perl module - but the
dependency mesh is horrific. I got part of the way through it, but gave
up because it would take forever to find all the dependencies and get
them to install.

Has anyone got any better suggestions than either of those two?

-- 
http://SnapAndScribble.com/will/blog

From: Ian Rawlings on 12 Apr 2008 04:18

On 2008-04-12, Will Kemp <Will(a)xxxx.Swaggie.net> wrote:
> I'm trying to find some way of parsing text files and extracting
> "keywords" from them. "Keywords" is hard to define, of course, but i
> guess it means nouns that are used relatively infrequently.

Well, without any context I'd suggest grep being fed a file of words to
look for and set to search case-insensitive, but that's without knowing
details like regular expression requirements, the size of the files to
search, the size of the keyword list, what you're going to do with them
and so on, all of which will change what is basically a very generic
problem description.

-- 
Blast off and strike the evil Bydo empire!
http://youtube.com/user/tarcus69
http://www.flickr.com/photos/tarcus/sets/

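For anyone who would rather script the keyword-file idea than call grep, here is a minimal Python sketch of it. It is only an illustration: the function name `find_keywords` and the sample text are made up, keywords are matched literally rather than as regular expressions, and matching whole words only (roughly what `grep -w` does) is an assumption, not something the post asks for.

```python
import re


def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    found = []
    for kw in keywords:
        # re.escape makes the keyword a literal string, and \b anchors the
        # match at word boundaries so "glider" does not match "gliders".
        if re.search(r"\b" + re.escape(kw) + r"\b", text, re.IGNORECASE):
            found.append(kw)
    return found


if __name__ == "__main__":
    sample = "Gliders and sailplanes rely on thermals to stay aloft."
    print(find_keywords(sample, ["sailplanes", "Thermals", "engine"]))
```

This only helps when the keyword list is known in advance, which is the open question raised in the post above.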
From: Will Kemp on 12 Apr 2008 05:39

On Sat, 12 Apr 2008 09:18:13 +0100, Ian Rawlings wrote:
> On 2008-04-12, Will Kemp <Will(a)xxxx.Swaggie.net> wrote:
>
>> I'm trying to find some way of parsing text files and extracting
>> "keywords" from them. "Keywords" is hard to define, of course, but i
>> guess it means nouns that are used relatively infrequently.
>
> Well, without any context I'd suggest grep being fed a file of words to
> look for and set to search case-insensitive, but that's without knowing
> details like regular expression requirements, the size of the files to
> search, the size of the keyword list, what you're going to do with them
> and so on, all of which will change what is basically a very generic
> problem description.

Yeah, you're right.

I need to extract "keywords" from html files with an average of about
1000 words in each one. The keywords are to be used in the <meta
keywords> tag - as part of a search engine optimisation process.

It's going to require some manual intervention, because a word a script
might pick as a "keyword" won't necessarily be useful for inclusion in
the tag. But it's better if it extracts words that aren't useful, rather
than *not* extract words that are. There are a couple of hundred such
pages to work on, so i want to find a way of reducing the manual part of
it and speeding up the process.

The keywords in question can't be defined in advance, so it would
probably just be a question of the script or program identifying nouns
and listing all of them that are, say, 4 or more characters long.

Identifying a word as a noun is the tricky part.

A next best solution would probably be to extract all words of 4
characters or more and sort them by word length. That's probably
reasonably easy, but i can't work out how to sort by word length.

I guess i could write a perl script to do it, but i haven't written perl
for so long it would take almost as long to work that out as it would to
do the job manually! It's one of those things that i'd expect a pipe of
two or three basic unix commands should be able to do, but i can't find
anything that will do the word-length sort.

-- 
http://SnapAndScribble.com/will/blog

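The "next best solution" described above (pull the text out of an html page, keep words of 4 or more characters, sort by word length) can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not a tested tool: the names `TextExtractor` and `candidate_words` are hypothetical, words are taken to be runs of ASCII letters, duplicates are dropped, and longest-first ordering is a guess, since the post doesn't say which direction is wanted.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def candidate_words(html, min_len=4):
    """Return unique words of at least min_len letters, longest first."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    words = set(re.findall(r"[a-z]+", text))  # set() drops duplicates
    long_words = [w for w in words if len(w) >= min_len]
    # Sort by descending length, then alphabetically within each length.
    return sorted(long_words, key=lambda w: (-len(w), w))


if __name__ == "__main__":
    page = ("<html><head><style>p {color: red}</style></head>"
            "<body><p>The glider pilot found a thermal.</p></body></html>")
    for word in candidate_words(page):
        print(word)
```

The output is a candidate list for manual weeding; nothing here attempts to identify nouns.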
From: Martin Gregorie on 12 Apr 2008 06:24

On Sat, 12 Apr 2008 09:39:08 +0000, Will Kemp wrote:
> On Sat, 12 Apr 2008 09:18:13 +0100, Ian Rawlings wrote:
>
>> On 2008-04-12, Will Kemp <Will(a)xxxx.Swaggie.net> wrote:
>>
>>> I'm trying to find some way of parsing text files and extracting
>>> "keywords" from them. "Keywords" is hard to define, of course, but i
>>> guess it means nouns that are used relatively infrequently.
>>
>> Well, without any context I'd suggest grep being fed a file of words to
>> look for and set to search case-insensitive, but that's without knowing
>> details like regular expression requirements, the size of the files to
>> search, the size of the keyword list, what you're going to do with them
>> and so on, all of which will change what is basically a very generic
>> problem description.
>
> Yeah, you're right.
>
> I need to extract "keywords" from html files with an average of about
> 1000 words in each one. The keywords are to be used in the <meta
> keywords> tag - as part of a search engine optimisation process.
>
> It's going to require some manual intervention, because a word a script
> might pick as a "keyword" won't necessarily be useful for inclusion in
> the tag. But it's better if it extracts words that aren't useful, rather
> than *not* extract words that are. There are a couple of hundred such
> pages to work on, so i want to find a way of reducing the manual part of
> it and speeding up the process.
>
> The keywords in question can't be defined in advance, so it would
> probably just be a question of the script or program identifying nouns
> and listing all of them that are, say, 4 or more characters long.
>
> Identifying a word as a noun is the tricky part.
>
> A next best solution would probably be to extract all words of 4
> characters or more and sort them by word length. That's probably
> reasonably easy, but i can't work out how to sort by word length.
>
> I guess i could write a perl script to do it, but i haven't written perl
> for so long it would take almost as long to work that out as it would to
> do the job manually! It's one of those things that i'd expect a pipe of
> two or three basic unix commands should be able to do, but i can't find
> anything that will do the word-length sort.

Could you solve the noun recognition problem by using a list of
non-verbs with a matcher (such as column) to remove all the non-nouns
found in a page? I admit I don't know where you'd find such a list, but
you may get one by parsing one (or all) of the pages into a sorted word
list and manually weeding it.

OTOH you could easily make your length-ordered list with awk. Build an
array indexed by the string "nn word", where nn is the word's length,
omitting words of less than 4 characters. As a bonus, this will discard
duplicates. Then use the END action to read the array in sorted string
order, outputting the word from each index string.

-- 
martin@   | Martin Gregorie
gregorie. |
org       | Zappa fan & glider pilot

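The "nn word" index trick translates almost directly into other languages. Below is a rough Python rendering rather than the awk script itself; the function name `length_ordered`, the optional `stop_words` list (standing in for the hand-weeded list of unwanted words) and the two-digit zero padding are all assumptions made for illustration. Zero padding is what makes plain string order agree with numeric length, and it assumes no word is longer than 99 characters.

```python
import re


def length_ordered(text, stop_words=(), min_len=4):
    """Return unique words ordered by length, using "nn word" keys."""
    stop = {w.lower() for w in stop_words}
    keyed = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        if len(word) < min_len or word in stop:
            continue
        # The key is "nn word" with nn zero-padded, so sorting the keys as
        # strings orders by length first, then alphabetically. Reusing a
        # key simply overwrites it, which discards duplicates.
        keyed[f"{len(word):02d} {word}"] = word
    # Equivalent of walking the array in sorted index order at the end.
    return [keyed[key] for key in sorted(keyed)]


if __name__ == "__main__":
    sample = "The pilot found a thermal, then the pilot found another thermal."
    print(length_ordered(sample, stop_words=["then", "found"]))
```

The result runs shortest to longest; reversing the sorted keys gives the opposite order.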
From: Gordon Henderson on 12 Apr 2008 07:45

In article <VUZLj.37949$4f4.35146(a)newsfe6-win.ntli.net>,
Will Kemp <Will(a)xxxx.Swaggie.net> wrote:
>I'm trying to find some way of parsing text files and extracting
>"keywords" from them. "Keywords" is hard to define, of course, but i
>guess it means nouns that are used relatively infrequently.
>
>I've tried to install TextMine - but didn't have any luck with it, and
>couldn't be bothered going any further down that particular track. To be
>honest, although it was only a couple of days ago, i can't remember what
>the specific problems were, but i decided it wasn't worth the effort.
>
>I've also tried to install the Lingua::EN::Tagger perl module - but the
>dependency mesh is horrific. I got part of the way through it, but gave
>up because it would take forever to find all the dependencies and get
>them to install.
>
>Has anyone got any better suggestions than either of those two?

Swish-e?

http://swish-e.org/

I use it to index mailing lists, but I don't know how easy it might be
to extract the index it builds for other purposes...

Gordon