From: Will Kemp on
I'm trying to find some way of parsing text files and extracting
"keywords" from them. "Keywords" is hard to define, of course, but i
guess it means nouns that are used relatively infrequently.

I've tried to install TextMine - but didn't have any luck with it, and
couldn't be bothered going any further down that particular track. To be
honest, although it was only a couple of days ago, i can't remember what
the specific problems were, but i decided it wasn't worth the effort.

I've also tried to install the Lingua::EN::Tagger perl module - but the
dependency mesh is horrific. I got part of the way through it, but gave
up because it would take forever to find all the dependencies and get
them to install.

Has anyone got any better suggestions than either of those two?


--
http://SnapAndScribble.com/will/blog
From: Ian Rawlings on
On 2008-04-12, Will Kemp <Will(a)xxxx.Swaggie.net> wrote:

> I'm trying to find some way of parsing text files and extracting
> "keywords" from them. "Keywords" is hard to define, of course, but i
> guess it means nouns that are used relatively infrequently.

Well, without any context I'd suggest feeding grep a file of words to
look for and setting it to search case-insensitively, but that's without
knowing details like regular expression requirements, the size of the
files to search, the size of the keyword list, what you're going to do
with them and so on, all of which will change what is basically a very
generic problem description.
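
As a rough sketch of the grep suggestion above, something like the
following might do. The function name and the two file arguments are made
up for illustration, and it assumes a grep that supports -o and -w (GNU
and BSD grep both do):

```shell
# Hypothetical helper: list which words from a keyword file occur in a text file.
# $1 = file of keywords, one per line; $2 = text file to search.
find_keywords() {
    # -F  treat patterns as fixed strings, not regular expressions
    # -f  read the patterns from the file named in $1
    # -i  ignore case
    # -w  match whole words only
    # -o  print only the matched text, one match per line
    grep -o -i -w -F -f "$1" "$2" |
        tr '[:upper:]' '[:lower:]' |   # fold case so "Glider" and "glider" merge
        sort -u                        # one line per keyword found
}
```

Called as `find_keywords words.txt page.txt`, it prints each listed word
that appears at least once in the page. Dropping -F would let the word
file hold regular expressions instead.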

--
Blast off and strike the evil Bydo empire!
http://youtube.com/user/tarcus69
http://www.flickr.com/photos/tarcus/sets/
From: Will Kemp on
On Sat, 12 Apr 2008 09:18:13 +0100, Ian Rawlings wrote:

> On 2008-04-12, Will Kemp <Will(a)xxxx.Swaggie.net> wrote:
>
>> I'm trying to find some way of parsing text files and extracting
>> "keywords" from them. "Keywords" is hard to define, of course, but i
>> guess it means nouns that are used relatively infrequently.
>
> Well, without any context I'd suggest grep being fed a file of words to
> look for and set to search case-insensitive, but that's without knowing
> details like regular expression requirements, the size of the files to
> search, the size of the keyword list, what you're going to do with them
> and so on, all of which will change what is basically a very generic
> problem description.

Yeah, you're right.

I need to extract "keywords" from html files with an average of about
1000 words in each one. The keywords are to be used in the <meta
keywords> tag - as part of a search engine optimisation process.

It's going to require some manual intervention, because a word a script
might pick as a "keyword" won't necessarily be useful for inclusion in
the tag. But it's better if it extracts words that aren't useful, rather
than *not* extract words that are. There are a couple of hundred such
pages to work on, so i want to find a way of reducing the manual part of
it and speeding up the process.

The keywords in question can't be defined in advance, so it would
probably just be a question of the script or program identifying nouns
and listing all of them that are, say, 4 or more characters long.

Identifying a word as a noun is the tricky part.

A next best solution would probably be to extract all words of 4
characters or more and sort them by word length. That's probably
reasonably easy, but i can't work out how to sort by word length.

I guess i could write a perl script to do it, but i haven't written perl
for so long it would take almost as long to work that out as it would to
do the job manually! It's one of those things that i'd expect a pipe of
two or three basic unix commands should be able to do, but i can't find
anything that will do the word-length sort.

--
http://SnapAndScribble.com/will/blog

From: Martin Gregorie on
On Sat, 12 Apr 2008 09:39:08 +0000, Will Kemp wrote:

> On Sat, 12 Apr 2008 09:18:13 +0100, Ian Rawlings wrote:
>
>> On 2008-04-12, Will Kemp <Will(a)xxxx.Swaggie.net> wrote:
>>
>>> I'm trying to find some way of parsing text files and extracting
>>> "keywords" from them. "Keywords" is hard to define, of course, but i
>>> guess it means nouns that are used relatively infrequently.
>>
>> Well, without any context I'd suggest grep being fed a file of words to
>> look for and set to search case-insensitive, but that's without knowing
>> details like regular expression requirements, the size of the files to
>> search, the size of the keyword list, what you're going to do with them
>> and so on, all of which will change what is basically a very generic
>> problem description.
>
> Yeah, you're right.
>
> I need to extract "keywords" from html files with an average of about
> 1000 words in each one. The keywords are to be used in the <meta
> keywords> tag - as part of a search engine optimisation process.
>
> It's going to require some manual intervention, because a word a script
> might pick as a "keyword" won't necessarily be useful for inclusion in
> the tag. But it's better if it extracts words that aren't useful, rather
> than *not* extract words that are. There are a couple of hundred such
> pages to work on, so i want to find a way of reducing the manual part of
> it and speeding up the process.
>
> The keywords in question can't be defined in advance, so it would
> probably just be a question of the script or program identifying nouns
> and listing all of them that are, say, 4 or more characters long.
>
> Identifying a word as a noun is the tricky part.
>
> A next best solution would probably be to extract all words of 4
> characters or more and sort them by word length. That's probably
> reasonably easy, but i can't work out how to sort by word length.
>
> I guess i could write a perl script to do it, but i haven't written perl
> for so long it would take almost as long to work that out as it would to
> do the job manually! It's one of those things that i'd expect a pipe of
> two or three basic unix commands should be able to do, but i can't find
> anything that will do the word-length sort.

Could you solve the noun recognition problem by using a list of non-nouns
with a matcher (such as column) to remove all the non-nouns found in a
page? I admit I don't know where you'd find such a list, but you may get
one by parsing one (or all) of the pages into a sorted word list and
manually weeding it.
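
A minimal sketch of that weeding step, assuming the non-noun list is a
plain file with one lowercase word per line. The function name is made up,
and it uses grep -v as the matcher purely for illustration:

```shell
# Hypothetical helper: split the text on stdin into unique lowercase words,
# then drop every word that appears in the stop list file given as $1.
drop_non_nouns() {
    tr -cs '[:alpha:]' '\n' |          # turn each run of non-letters into a newline
        tr '[:upper:]' '[:lower:]' |   # fold case
        sort -u |                      # unique words, sorted
        grep -v -x -F -f "$1"          # remove lines that exactly match a stop-list entry
}
```

Something like `drop_non_nouns non-nouns.txt < page.txt` would then print
the words that survive. Running it once with an empty list gives the
sorted word list to weed by hand.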

OTOH you could easily make your length-ordered list with awk. Build an
array indexed by the string "nn word", where nn is the word's length,
omitting words of fewer than 4 characters. As a bonus, this will discard
duplicates. Then use the END action to read the array in sorted string
order, outputting the word from each index string.
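
Roughly, that could look like the sketch below. Portable awk has no
built-in sorted array traversal (gawk offers asorti), so this version
prints the keys and lets sort and cut do the ordering rather than doing it
inside END. The function name is made up, and the length is zero-padded to
three digits so that string order matches numeric order:

```shell
# Hypothetical helper: read text on stdin and print the unique words of
# 4 or more letters, shortest first, alphabetical within each length.
words_by_length() {
    tr -cs '[:alpha:]' '\n' |          # one word per line
        tr '[:upper:]' '[:lower:]' |   # fold case
        awk '
            # key is "nnn word"; storing it as an index collapses duplicates
            length($0) >= 4 { seen[sprintf("%03d %s", length($0), $0)] = 1 }
            END { for (k in seen) print k }
        ' |
        LC_ALL=C sort |                # zero-padded length sorts as a string
        cut -d " " -f 2                # drop the length prefix
}
```

Usage would be `words_by_length < page.txt`. Adding -r to the sort should
put the longest words first instead.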


--
martin@ | Martin Gregorie
gregorie. |
org | Zappa fan & glider pilot


From: Gordon Henderson on
In article <VUZLj.37949$4f4.35146(a)newsfe6-win.ntli.net>,
Will Kemp <Will(a)xxxx.Swaggie.net> wrote:
>I'm trying to find some way of parsing text files and extracting
>"keywords" from them. "Keywords" is hard to define, of course, but i
>guess it means nouns that are used relatively infrequently.
>
>I've tried to install TextMine - but didn't have any luck with it, and
>couldn't be bothered going any further down that particular track. To be
>honest, although it was only a couple of days ago, i can't remember what
>the specific problems were, but i decided it wasn't worth the effort.
>
>I've also tried to install the Lingua::EN::Tagger perl module - but the
>dependency mesh is horrific. I got part of the way through it, but gave
>up because it would take forever to find all the dependencies and get
>them to install.
>
>Has anyone got any better suggestions than either of those two?

Swish-e ?

http://swish-e.org/

I use it to index mailing lists, but I don't know how easy it might be
to extract the index it builds for other purposes...

Gordon