getting content excerpts from the database [General]

From: Phpster on 26 Apr 2010 07:58

On Apr 26, 2010, at 7:23 AM, Ashley Sheridan
<ash(a)ashleysheridan.co.uk> wrote:

> On Mon, 2010-04-26 at 13:20 +0200, Peter Lind wrote:
>
>> On 26 April 2010 12:52, Ashley Sheridan <ash(a)ashleysheridan.co.uk>
>> wrote:
>>> I've been thinking about this problem for a little while, and the thing
>>> is, I can think of ways of doing it, but they're not very nice, and I
>>> don't think they're going to be fast.
>>>
>>> Basically, I have a load of HTML-formatted content in a database that
>>> gets displayed on the site. It's part of a rudimentary CMS.
>>>
>>> Currently, the titles for each article are displayed on a page, and each
>>> title links to the full article. However, that leaves me with a page
>>> which is essentially a list of links, and that's not ideal for SEO. What
>>> I wanted to do to enhance the page is to have a short excerpt of x
>>> number of words/characters beneath each article title. The idea being
>>> that search engines will find the page as more than a link farm, and
>>> visitors won't have to just rely on the title alone for the content.
>>>
>>> Here's the rub though. As the content is in HTML form, I can't just grab
>>> the first 100 characters and display them, as that could leave an open
>>> tag without a closing one, potentially breaking the page. I could use
>>> strip_tags on the 100-character excerpt, but what if the excerpt itself
>>> broke a tag in half (e.g. <acronym title="something"> could become
>>> <acron)?
>>>
>>> The only solutions I can see are:
>>>
>>> * retrieve the entire article, perform a strip_tags and then take
>>>   the excerpt
>>> * use a regex inside of mysql to pull out only the text
>>>
>>> The thing is, neither of these seems particularly pretty, and I am sure
>>> there's a better way, but it's too early in the week for my brain to be
>>> fully functional I think!
>>>
>>> Does anyone have any ideas about what I could do, or do you think I'm
>>> seeing problems where there are none?
>>
>> Use htmltidy or htmlpurifier to clean up things. I.e. grab the amount
>> of content you want, then use one of the tools to repair and clean the
>> html.
>>
>> Regards
>> Peter
>>
>> --
>> <hype>
>> WWW: http://plphp.dk / http://plind.dk
>> LinkedIn: http://www.linkedin.com/in/plind
>> Flickr: http://www.flickr.com/photos/fake51
>> BeWelcome: Fake51
>> Couchsurfing: Fake51
>> </hype>
>>
>
>
> Would that work on content that stopped mid-tag? Assuming the original
> copy is:
>
> This is some sentence, with an <abbr title="Abbreviation">abbr</abbr>
> in the middle of it.
>
> If I was asking for only the first 50 characters, I'd get this:
>
> This is some sentence, with an <abbr title="Abb
>
> Would either htmltidy or htmlpurifier be able to handle that? I don't
> mind whether it tries to repair the tag or remove it completely, as long
> as it does something to it.
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk
>
>
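
To make the mid-tag question above concrete, here is a minimal PHP sketch comparing the two orders of operation. The sample string is the one from the quoted message; the variable names are only illustrative. It suggests that stripping tags from the whole text before cutting avoids the broken-tag case entirely. Cutting first leaves a partial tag that strip_tags() has to guess about, and the PHP manual warns that partial or broken tags can cause more text to be removed than expected.

```php
<?php
// Sample content taken from the example in the quoted message.
$html = 'This is some sentence, with an <abbr title="Abbreviation">abbr</abbr> in the middle of it.';

// Order 1: cut first, strip second. The cut lands inside the <abbr ...> tag,
// so strip_tags() is handed a broken tag and discards the dangling remainder.
$cutFirst = strip_tags(substr($html, 0, 50));

// Order 2: strip first, cut second. No tags are left when the cut is made,
// so all 50 characters of the result are visible text.
$stripFirst = substr(strip_tags($html), 0, 50);

echo strlen($cutFirst), " chars of text when cutting first\n";
echo strlen($stripFirst), " chars of text when stripping first\n";
```

The cost of the second order is that the whole article has to be fetched and stripped, which is the performance worry raised in the original question.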

When looking at the performance side of things, couldn't you add
another column to the table and do the tidy / strip-tags work
during the insert going forward?

Existing data would need a one-time script to clean / tidy it. You
could run this on a nightly cron (depending on how much data there is)
until the new column is filled with clean data.
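
The extra-column idea could look roughly like the sketch below. Everything named here is an assumption for illustration: a hypothetical table articles(id, title, body, excerpt), the helper names makeExcerpt(), saveArticle() and backfillExcerpts(), and a PDO connection supplied by the caller. It is not code from the CMS being discussed.

```php
<?php
// Hypothetical schema: articles(id, title, body, excerpt).

/**
 * Build a plain-text excerpt. Tags are stripped from the whole body first,
 * so the cut can never land inside a tag. Lengths are counted in bytes;
 * multibyte content would need the mb_* equivalents.
 */
function makeExcerpt($html, $limit = 150)
{
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    $text = trim(preg_replace('/\s+/', ' ', $text)); // collapse whitespace
    if (strlen($text) <= $limit) {
        return $text;
    }
    $cut = substr($text, 0, $limit);
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = substr($cut, 0, $lastSpace); // back up to a word boundary
    }
    return $cut . '...';
}

/** Store the excerpt alongside the article at insert time. */
function saveArticle(PDO $db, $title, $body)
{
    $stmt = $db->prepare(
        'INSERT INTO articles (title, body, excerpt) VALUES (?, ?, ?)'
    );
    $stmt->execute(array($title, $body, makeExcerpt($body)));
}

/** One-time / nightly backfill: fill one batch of rows that lack an excerpt. */
function backfillExcerpts(PDO $db, $batchSize = 500)
{
    $select = $db->prepare(
        'SELECT id, body FROM articles WHERE excerpt IS NULL LIMIT ' . (int) $batchSize
    );
    $select->execute();
    $update = $db->prepare('UPDATE articles SET excerpt = ? WHERE id = ?');
    $done = 0;
    foreach ($select->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $update->execute(array(makeExcerpt($row['body']), $row['id']));
        $done++;
    }
    return $done; // number of rows processed in this run
}
```

Because the stored excerpt is decoded plain text, it should be passed through htmlspecialchars() when printed on the listing page. Note also that strip_tags() joins adjacent block elements without a space, so "<p>One</p><p>Two</p>" becomes "OneTwo".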

Bastien

Sent from my iPod

From: Ashley Sheridan on 26 Apr 2010 07:54

On Mon, 2010-04-26 at 07:58 -0400, Phpster wrote:

> When looking at the performance side of things, couldn't you add
> another column to the table and do the tidy / strip-tags work
> during the insert going forward?
>
> Existing data would need a one-time script to clean / tidy it. You
> could run this on a nightly cron (depending on how much data there is)
> until the new column is filled with clean data.
>
> Bastien
>
> Sent from my iPod
>

That's not a bad idea actually, I hadn't thought of it! I'm kicking
myself now, because it's such an obvious solution!

Thanks,
Ash
http://www.ashleysheridan.co.uk

From: Phpster on 26 Apr 2010 09:17

On Apr 26, 2010, at 7:54 AM, Ashley Sheridan
<ash(a)ashleysheridan.co.uk> wrote:

> That's not a bad idea actually, I hadn't thought of it! I'm kicking
> myself now, because it's such an obvious solution!
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk
>
>

I always prefer simple solutions! It keeps things easy!

Bastien

Sent from my iPod

From: tedd on 26 Apr 2010 09:26

At 11:52 AM +0100 4/26/10, Ashley Sheridan wrote:
>-snip- SEO concerns
>
>Does anyone have any ideas about what I could do, or do you think I'm
>seeing problems where there are none?
>
>Thanks,
>Ash

Ash:

Not only do you have to consider SEO for content, but what about
content for an internal Site Search?

I was confronted with the same problem (links to lots of PDF files)
and created a brief description of each article (PDF) that would be
provided to search engines and to internal searches. Sure, it's another
field, but it works.

Not that it's bad, but I do everything I can to keep html out of my
database. In my view, the database is there to deliver content not
code. I have entire sites that spring from a single index.php page
that is loaded with different content depending upon what the user
wants -- the site looks big, but consists of a single page.

Cheers,

tedd

--
-------
http://sperling.com http://ancientstones.com http://earthstones.com

From: Nathan Rixham on 26 Apr 2010 14:38

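/**
 * Creates an abstract from any string: a nice one that stops at a full
 * stop (once past 120 chars) or at the end of a word (once past 140
 * chars), and never runs beyond 180 chars.
 */
function createAbstract( $string )
{
    // If the text has several lines and the first is long enough, use it alone.
    $lines = explode( "\n" , $string );
    if( count($lines) > 1 && strlen($lines[0]) > 140 ) {
        $string = $lines[0];
    }
    // Short enough already: return it unchanged.
    if( strlen($string) < 180 ) return $string;
    $string = substr( $string , 0 , 180);
    $chars = str_split( $string );
    // First pass: stop at the first full stop found after 120 chars.
    $string = '';
    foreach( $chars as $char ) {
        $string .= $char;
        if( $char == '.' && strlen($string) > 120 ) {
            return $string;
        }
    }
    // Second pass: stop at the first space found after 140 chars.
    $string = '';
    foreach( $chars as $char ) {
        $string .= $char;
        if( $char == ' ' && strlen($string) > 140 ) {
            return trim( $string ) . '...';
        }
    }
    // No suitable break found: return the 180-char cut as it is.
    return $string;
}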

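/**
 * Given HTML (or a fragment), tidy it into usable HTML
 * and strip it back to text, new lines intact.
 *
 * Requires the PHP tidy extension.
 */
function htmlToText( $html )
{
    // The entity arguments appear decoded in the archived post; '&amp;' is
    // the presumed original. Normalise ampersands so tidy sees '&amp;'
    // everywhere, without double-encoding existing '&amp;'.
    $html = str_replace( '&' , '&amp;' , str_replace( '&amp;' , '&' , $html ) );
    $config = array(
        'clean' => true,
        'drop-proprietary-attributes' => true,
        'output-xhtml' => true,
        'show-body-only' => true,
        'word-2000' => true,
        'wrap' => '0'
    );
    $tidy = new tidy();
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();
    $html = tidy_get_output($tidy);
    // Undo the ampersand escaping applied before tidy ran (again presumed).
    $text = str_replace( '&amp;' , '&' , $html );
    return strip_tags($text);
}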

Using those two together should do it; they're pretty basic and could do
with a tidy, but they get the job done (you'll probably want to change
the 140 chars to something different).
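
One possible way to wire the two helpers together, as a sketch only. excerptFor() is a hypothetical wrapper name, and the fallbacks are assumptions added so the snippet also runs where htmlToText(), createAbstract() or the tidy extension are not loaded; in that case it degrades to a plain strip_tags() with no shortening.

```php
<?php
/**
 * Hypothetical wrapper: HTML in, short plain-text excerpt out.
 * Uses htmlToText() and createAbstract() from this post when they are
 * defined, and falls back to strip_tags() otherwise.
 */
function excerptFor($html)
{
    $text = (function_exists('htmlToText') && class_exists('tidy'))
        ? htmlToText($html)
        : strip_tags($html);

    // Decode entities so the excerpt is genuinely plain text.
    $text = trim(html_entity_decode($text, ENT_QUOTES, 'UTF-8'));

    return function_exists('createAbstract') ? createAbstract($text) : $text;
}

// Escape again on output, because the excerpt is plain text at this point.
echo htmlspecialchars(excerptFor('<p>Hello <em>world</em>.</p>'), ENT_QUOTES, 'UTF-8'), "\n";
```

Running the tag removal before createAbstract() matters: the abstract is then cut from text only, so it cannot end in the middle of a tag.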

Best,

Nathan
