From: John Morley on
I didn't have any luck posting this over in the 'Controls' group, so I'm
reposting it here:

Hi All,

I'm sure this has been covered before, but I haven't found anything that
seems to help.

I'm trying to read the raw html data from a web page, so that I can
parse it and extract the information I need. The data is primarily text
that I wish to capture.

When I try to grab a page using the Winsock control, I get a 'page not
found' error. I know that the URL and page location are correct because
I can see a test file using a web browser.

Some (possibly) relevant code:

Private Sub cmdconnect_Click()
On Error Resume Next

TxtWebPage.Text = "" ' clear the text window
Winsock1.RemoteHost = ksBaseURL
Winsock1.RemotePort = 80
Winsock1.Connect

End Sub

Private Sub Winsock1_Connect()
On Error Resume Next
Dim strCommand As String
Dim strWebPage As String

strWebPage = TxtFileLocation.Text
strCommand = "GET " + strWebPage + " HTTP/1.0" + vbCrLf
strCommand = strCommand + "Accept: */*" + vbCrLf
strCommand = strCommand + "Accept: text/html" + vbCrLf
strCommand = strCommand + vbCrLf

Debug.Print strCommand
Winsock1.SendData strCommand

End Sub

I *think* my problem is with the GET request, but I haven't found out
what it is yet!

Ideas?

Thanks,

John
From: Nobody on

You need to include:

strCommand = strCommand + "Host: www.somesite.com" + vbCrLf

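That line is needed if multiple hosts share the same IP. It has to go
before the final blank vbCrLf that ends the header. Also, you need to run
the page location through a URLEncode function (VB6 has none built in;
search the web).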
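
A rough sketch of how the two points above might fit together. The names
BuildGetRequest and UrlEncodeValue are made up for illustration, and the
host and path are placeholders, not values from the original post. The
encoder only handles single-byte text, so treat it as a starting point
rather than a complete implementation:

```vb
' Sketch only. BuildGetRequest and UrlEncodeValue are hypothetical helper
' names; the host and path are placeholders supplied by the caller.

' Percent-encodes one path segment or query value (single-byte text only).
Private Function UrlEncodeValue(ByVal sText As String) As String
    Dim i As Long
    Dim sChar As String
    Dim sOut As String

    For i = 1 To Len(sText)
        sChar = Mid$(sText, i, 1)
        Select Case AscW(sChar)
            ' Unreserved characters pass through: 0-9 A-Z a-z - . _ ~
            Case 48 To 57, 65 To 90, 97 To 122, 45, 46, 95, 126
                sOut = sOut & sChar
            Case Else
                ' Anything else becomes %XX using its ANSI byte value
                sOut = sOut & "%" & Right$("0" & Hex$(Asc(sChar)), 2)
        End Select
    Next i

    UrlEncodeValue = sOut
End Function

' Builds an HTTP/1.0 GET request that includes the Host header.
Private Function BuildGetRequest(ByVal sHost As String, _
                                 ByVal sPath As String) As String
    Dim sReq As String

    ' The request line needs a path that starts with "/"
    If Left$(sPath, 1) <> "/" Then sPath = "/" & sPath

    sReq = "GET " & sPath & " HTTP/1.0" & vbCrLf
    sReq = sReq & "Host: " & sHost & vbCrLf
    sReq = sReq & "Accept: */*" & vbCrLf
    sReq = sReq & vbCrLf            ' blank line ends the header block

    BuildGetRequest = sReq
End Function
```

It could then be called from Winsock1_Connect with something like
Winsock1.SendData BuildGetRequest("www.somesite.com", "/search?q=" &
UrlEncodeValue("two words")). The host must be the bare name with no
"http://" prefix. Only encode individual values, not the whole path,
because the function also encodes "/" and "?".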

Also, why not use WinInet? It's easier to use. See this sample:

SAMPLE: Vbhttp.exe Demonstrates How to Use HTTP WinInet APIs in Visual Basic
http://support.microsoft.com/kb/259100
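
For a feel of what the WinInet route involves, here is a minimal sketch.
It is not the code from the KB sample. GetPageText and the "VB6 sample"
agent string are names invented here, and the buffer handling assumes the
page is single-byte (ANSI) text:

```vb
' Sketch only: download a page as text through the WinInet API.
' GetPageText is a hypothetical name; put this in a standard module.
Private Declare Function InternetOpen Lib "wininet.dll" _
    Alias "InternetOpenA" (ByVal sAgent As String, _
    ByVal lAccessType As Long, ByVal sProxyName As String, _
    ByVal sProxyBypass As String, ByVal lFlags As Long) As Long
Private Declare Function InternetOpenUrl Lib "wininet.dll" _
    Alias "InternetOpenUrlA" (ByVal hInternet As Long, _
    ByVal sUrl As String, ByVal sHeaders As String, _
    ByVal lHeadersLength As Long, ByVal lFlags As Long, _
    ByVal lContext As Long) As Long
Private Declare Function InternetReadFile Lib "wininet.dll" _
    (ByVal hFile As Long, ByVal sBuffer As String, _
    ByVal lNumBytesToRead As Long, lNumberOfBytesRead As Long) As Long
Private Declare Function InternetCloseHandle Lib "wininet.dll" _
    (ByVal hInet As Long) As Long

Private Const INTERNET_OPEN_TYPE_PRECONFIG As Long = 0
Private Const INTERNET_FLAG_RELOAD As Long = &H80000000

Public Function GetPageText(ByVal sUrl As String) As String
    Dim hOpen As Long
    Dim hUrl As Long
    Dim sBuffer As String
    Dim lRead As Long
    Dim sResult As String

    ' Open a session using the proxy settings configured for IE
    hOpen = InternetOpen("VB6 sample", INTERNET_OPEN_TYPE_PRECONFIG, _
                         vbNullString, vbNullString, 0)
    If hOpen = 0 Then Exit Function

    ' WinInet builds and sends the request, Host header included
    hUrl = InternetOpenUrl(hOpen, sUrl, vbNullString, 0, _
                           INTERNET_FLAG_RELOAD, 0)
    If hUrl <> 0 Then
        sBuffer = Space$(4096)
        Do
            ' Stop on failure or when no more bytes are available
            If InternetReadFile(hUrl, sBuffer, Len(sBuffer), lRead) = 0 Then Exit Do
            If lRead = 0 Then Exit Do
            sResult = sResult & Left$(sBuffer, lRead)
        Loop
        InternetCloseHandle hUrl
    End If

    InternetCloseHandle hOpen
    GetPageText = sResult       ' empty string if anything failed
End Function
```

Usage would be along the lines of TxtWebPage.Text =
GetPageText("http://www.somesite.com/page.html"), where the URL is again
a placeholder.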

As for parsing HTML, try using "Microsoft HTML Object Library", which is
part of IE. See the sample in this post which prints a list of links in a
web page. It can be adapted to parse various aspects of HTML tags easily.

http://groups.google.com/group/microsoft.public.vb.general.discussion/msg/ce903530d703561c




From: mayayana on
"A client MUST include a Host header field in all HTTP/1.1 request
messages."

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

I don't know about HTTP 1.0, but I'd guess that's the
problem. Also note: If you don't include an Accept-Encoding
line in the request header you *should* get back plain text, but
it's a good idea to be prepared for gzip compression.

There's a userControl here that might be useful:

http://www.jsware.net/jsware/vbcode.php5#htp

It encapsulates the process of downloading files
via HTTP, using Windows sockets directly so that
the Winsock control isn't necessary. You could use
that instead of what you've got, or just use it as
example code for the call.
There are a lot of possible details in terms of
the HTTP header, but most of it's not necessary.
Once you've downloaded some files you can see
what a typical header looks like.
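
If you stay with the Winsock control, something like the following would
let you look at the returned header and check for gzip. This is only a
sketch: msResponse is a form-level variable added here, Winsock1 and
TxtWebPage are the control names from the original post, and it assumes
the server closes the connection when the page has been sent (normal for
an HTTP/1.0 request):

```vb
' Sketch only: collect the reply, then split header from body.
Private msResponse As String    ' hypothetical form-level buffer

Private Sub Winsock1_DataArrival(ByVal bytesTotal As Long)
    Dim sChunk As String

    ' Append each chunk as it arrives
    Winsock1.GetData sChunk, vbString
    msResponse = msResponse & sChunk
End Sub

Private Sub Winsock1_Close()
    Dim lPos As Long
    Dim sHeaders As String
    Dim sBody As String

    Winsock1.Close

    ' A blank line (CRLF CRLF) separates the header from the body
    lPos = InStr(msResponse, vbCrLf & vbCrLf)
    If lPos > 0 Then
        sHeaders = Left$(msResponse, lPos - 1)
        sBody = Mid$(msResponse, lPos + 4)
    End If

    Debug.Print sHeaders        ' status line plus response header fields

    If InStr(1, sHeaders, "Content-Encoding: gzip", vbTextCompare) > 0 Then
        ' Compressed data read as a string is not usable as-is
        Debug.Print "Body is gzip-compressed; decompress before parsing."
    Else
        TxtWebPage.Text = sBody
    End If

    msResponse = ""             ' ready for the next request
End Sub
```

The first line printed is the status line, so a "404 Not Found" there
confirms the server did not recognise the host and path it was sent.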




From: C. Kevin Provance on
"John Morley" <jmorley(a)nospamanalysistech.com> wrote in message
news:e7bJe7WuKHA.3504(a)TK2MSFTNGP06.phx.gbl...
| Hi All,
|
| I'm sure this has been covered before, but I haven't found anything that
| seems to help.
|
| I'm trying to read the raw html data from a web page, so that I can
| parse it and extract the information I need. The data is primarily text
| that I wish to capture.
|
| When I try to grab a page using the Winsock control, I get a 'page not
| found error'. I know that the URL and page location are correct because
| I can see a test file using a web browser.

If you are attempting to screen scrape, ensure the page is not generated
dynamically when it loads.