From: tedd on
Hi gang:

Considering all the recent parsing, here's another problem to
consider -- given any text, parse the domain-names out of it.

You may limit the parsing to the most popular TLDs, such as .com,
.net, and .org, but the finished result should be an array containing
all the domain-names found in a text file.

Cheers,

tedd
--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
From: Ashley Sheridan on
On Mon, 2010-06-14 at 09:14 -0400, tedd wrote:

> Hi gang:
>
> Considering all the recent parsing, here's another problem to
> consider -- given any text, parse the domain-names out of it.
>
> You may limit the parsing to the most popular TLDs, such as .com,
> .net, and .org, but the finished result should be an array containing
> all the domain-names found in a text file.
>
> Cheers,
>
> tedd
> --
> -------
> http://sperling.com http://ancientstones.com http://earthstones.com
>


I'm assuming it won't be anything as simple as assuming all the domains
begin with the http:// prefix? :p

Thanks,
Ash
http://www.ashleysheridan.co.uk


From: tedd on
At 2:18 PM +0100 6/14/10, Ashley Sheridan wrote:
>On Mon, 2010-06-14 at 09:14 -0400, tedd wrote:
>
>>
>>Hi gang:
>>
>>Considering all the recent parsing, here's another problem to
>>consider -- given any text, parse the domain-names out of it.
>>
>>You may limit the parsing to the most popular TLDs, such as .com,
>>.net, and .org, but the finished result should be an array containing
>>all the domain-names found in a text file.
>>
>>Cheers,
>>
>>tedd
>>--
>>-------
>>http://sperling.com http://ancientstones.com http://earthstones.com
>>
>
>I'm assuming it won't be anything as simple as assuming all the
>domains begin with the http:// prefix? :p
>
>Thanks,
>Ash

Ash:

Nope, just a text file containing whatever and domain-names. The only
domain-name indicator would be the period followed by an approved
TLD, such as .com, .net, or .org.

Cheers,

tedd

--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
From: Robert Cummings on
tedd wrote:
> At 2:18 PM +0100 6/14/10, Ashley Sheridan wrote:
>> On Mon, 2010-06-14 at 09:14 -0400, tedd wrote:
>>
>>> Hi gang:
>>>
>>> Considering all the recent parsing, here's another problem to
>>> consider -- given any text, parse the domain-names out of it.
>>>
>>> You may limit the parsing to the most popular TLDs, such as .com,
>>> .net, and .org, but the finished result should be an array containing
>>> all the domain-names found in a text file.
>>>
>>> Cheers,
>>>
>>> tedd
>>> --
>>> -------
>>> http://sperling.com http://ancientstones.com http://earthstones.com
>>>
>> I'm assuming it won't be anything as simple as assuming all the
>> domains begin with the http:// prefix? :p
>>
>> Thanks,
>> Ash
>
> Ash:
>
> Nope, just a text file containing whatever and domain-names. The only
> domain-name indicator would be the period followed by an approved
> TLD, such as .com, .net, or .org.

<?php

function rip_domains( $text )
{
    $domains = false;

    $pattern =
         '[^-[:alnum:]]*'
        .'('
        .    '[-[:alnum:]][-.[:alnum:]]*'
        .    '\.(com|net|org)'
        .')'
        .'[^-_[:alnum:]]*';

    if( preg_match_all( "#$pattern#", $text, $matches ) )
    {
        $domains = array();
        foreach( $matches[1] as $domain )
        {
            $domains[$domain] = true;
        }
        $domains = array_keys( $domains );
    }

    return $domains;
}

?>

Naive implementation. I'm sure I've missed edge cases someplace.
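
Two edge cases that seem worth noting, as a hedged sketch rather than a definitive fix: because the leading and trailing character classes in the pattern above are optional, text such as foo.community would yield foo.com, and an upper-case TLD such as Example.COM is skipped. The variant below assumes a lookbehind, a negative lookahead, and the i modifier are acceptable. The name rip_domains_strict, the lowercasing step, and returning an empty array instead of false are assumptions for illustration, not part of the original post.

```php
<?php

// Hypothetical variant of rip_domains(); the name, the lowercasing and the
// empty-array return value are assumptions made for this illustration.
function rip_domains_strict( $text )
{
    $pattern =
         '(?<![-.[:alnum:]])'              // must not continue an earlier label
        .'('
        .    '[[:alnum:]][-.[:alnum:]]*'   // first character is a letter or digit
        .    '\.(?:com|net|org)'
        .')'
        .'(?![-_[:alnum:]])';              // TLD must not run on into more characters

    $domains = array();
    if( preg_match_all( "#$pattern#i", $text, $matches ) )
    {
        foreach( $matches[1] as $domain )
        {
            // Array keys de-duplicate; lowercasing merges Example.COM and example.com.
            $domains[strtolower( $domain )] = true;
        }
    }

    return array_keys( $domains );
}

var_dump( rip_domains_strict( 'See Example.COM, example.com and foo.community.' ) );
```

With the sample string, only example.com should come back: the two spellings collapse into one entry and foo.community is rejected by the lookahead.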

Cheers,
Rob.
From: "Daniel P. Brown" on
On Mon, Jun 14, 2010 at 09:14, tedd <tedd(a)sperling.com> wrote:
> Hi gang:
>
> Considering all the recent parsing, here's another problem to consider --
> given any text, parse the domain-names out of it.
>
> You may limit the parsing to the most popular TLDs, such as .com, .net, and
> .org, but the finished result should be an array containing all the
> domain-names found in a text file.

<?php
$text =<<<TXT
To test example.com and www.php.net and other domain names
such as january.pilotpig.net and ca2.php.parasane.net, we need a
reliable method of checking. We don't want to match on regular
periods, nor on the 2.2million or 2.2 million or just 2,200,000
other potential matches. And not when we are double-spacing or
single-spacing, just when oidk.net and similar domains are found.
We'll match hyphen domains like l-i-e.com, but not fake_underscored_domain.net.
We also want to match http://-fronted domains like http://php1.net/,
which also contains a number. If we wanted to match domains plus
paths, but there was no leading http:// to indicate that it should
be a URL, we could extend this to grab things like www.facebook.com/parasane,
so long as we don't ignore the rare one-character SLDs like x.com,
as well as the domains in email addresses like danbrown(a)php.net
So if everything works as expected, we should see eleven domains
matched here, because ccTLDs like guthr.ie should be matched as well.

TXT;

/**
* $fromText can be defined via a file_get_contents() or
* similar function, while $fullLink should be anything
* but false to enable link-matching, which will return
* only link-like domains with paths attached.
*/
function extract_domains($fromText,$fullLink=false) {

    // If we only want to match the domain names.
    if ($fullLink === false) {
        preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5})\b/',$fromText,$matches);
        return $matches[1];
    }

    // If we want to match just domain names with trailing paths.
    preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5}\/.+?)\b/',$fromText,$matches);
    return $matches[1];
}

// Demo
echo "<pre>".PHP_EOL;

echo "Just domains:".PHP_EOL;
var_dump(extract_domains($text));

echo PHP_EOL;

echo "Full links:".PHP_EOL;
var_dump(extract_domains($text,true));

echo "</pre>".PHP_EOL;
?>
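
tedd's original request limited matches to an approved list of TLDs and asked for one array of the domain names found. The pattern above accepts any two-to-five-letter TLD and keeps repeats, so here is a minimal sketch of one way to filter and de-duplicate the results. The function name extract_approved_domains, the $approvedTlds parameter, the added i modifier, and the lowercasing are all assumptions for illustration and not part of the posted code.

```php
<?php

// Hypothetical helper: the name, parameters and defaults are assumptions.
function extract_approved_domains($fromText, array $approvedTlds = array('com', 'net', 'org')) {

    // Same domain pattern as extract_domains(), with the i modifier added.
    preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5})\b/i', $fromText, $matches);

    $domains = array();
    foreach ($matches[1] as $domain) {
        $domain = strtolower($domain);

        // The TLD is whatever follows the last period.
        $tld = substr(strrchr($domain, '.'), 1);

        if (in_array($tld, $approvedTlds, true)) {
            $domains[$domain] = true; // array keys de-duplicate
        }
    }

    return array_keys($domains);
}

$sample = 'Mail danbrown@php.net, see www.php.net, php.net and guthr.ie.';
var_dump(extract_approved_domains($sample));
```

For the sample string this should return php.net and www.php.net once each, while guthr.ie is dropped unless 'ie' is passed in the approved list.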


--
</Daniel P. Brown>
daniel.brown(a)parasane.net || danbrown(a)php.net
http://www.parasane.net/ || http://www.pilotpig.net/