From: SteveB on 2 Jul 2008 15:01

I have also posted this question in the Visual Basic 2005 and Visual Basic .Net 2005 discussion groups.

Hi. I am developing an application/web page with VB.Net that will populate a SQL database from text extracted from PDF documents. However, I am having a difficult time finding or developing the appropriate code to convert the PDF streams into text strings. Has anyone developed code to convert PDFs to text?

I was able to write a Perl script that calls a PDF-to-text conversion application, but I am having difficulty writing a similar shell command in VB. Any ideas?

Once I have the text strings, I can parse the data easily into the SQL database tables.
From: Gillard on 2 Jul 2008 18:05

1. Get this to convert PDF to text: ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl2-win32.zip
2. Use this sub:

    Sub Pdf2Txt(ByVal options As String, ByVal pdfFile As String, ByVal txtFile As String)
        ' Make sure to provide the full path for both pdfFile and txtFile.
        ' The double quotes keep paths that contain spaces intact.
        Dim arguments As String = options & " """ & pdfFile & """ """ & txtFile & """"
        System.Diagnostics.Process.Start("pdftotext.exe", arguments)
    End Sub
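Since the goal is to get the text back into VB for parsing, note that Process.Start returns as soon as the process is launched, so the text file may not exist yet when the next line runs. The following is a minimal sketch, not a tested implementation: the module and function names (PdfTextReader, ReadPdfText) are made up for illustration, and it assumes pdftotext.exe can be found on the PATH or in the working directory.

```vbnet
Imports System.Diagnostics
Imports System.IO

Module PdfTextReader
    ' Hypothetical helper: runs pdftotext, waits for it, and returns the text.
    ' Assumes pdftotext.exe is on the PATH or in the working directory.
    Function ReadPdfText(ByVal pdfFile As String, ByVal txtFile As String) As String
        ' Double quotes keep paths that contain spaces together as one argument.
        Dim arguments As String = "-layout """ & pdfFile & """ """ & txtFile & """"

        Using proc As Process = Process.Start("pdftotext.exe", arguments)
            ' Process.Start returns immediately; block until the file is written.
            proc.WaitForExit()
        End Using

        ' Read the converted text so it can be parsed and sent to the database.
        Return File.ReadAllText(txtFile)
    End Function
End Module
```

A call such as ReadPdfText("C:\data\report.pdf", "C:\data\report.txt") (example paths only) would then hand back one string per document, ready for parsing.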
From: Kevin S Gallagher on 7 Jul 2008 11:15

I have tried many free libraries and had mixed results. The only reliable avenue was the Aspose library:
http://www.aspose.com/categories/file-format-components/aspose.pdf.kit-for-.net-and-java/default.aspx

My steps using Aspose:
1. Initialize the library and open the file.
2. Loop through each page.
3. Collect the page data, massage it, and post it to the database.

A typical PDF file for me has 1,500 pages with no forms and 20+ elements per page to extract. The average time per document to extract the data, massage it, and pass it to the database is 10-15 seconds; Aspose can read each document in 3 seconds total.

The downside is that it costs money, yet it has been a great investment because it has served me well on multiple projects.
From: SteveB on 11 Jul 2008 12:43

On Jul 2, 5:05 pm, "Gillard" <gillard_geor...(a)hotmail.com> wrote:
> 1 get this to convert pdf2text ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl2-win32.zip
> 2 use this sub

I tried your suggestion and this app works great from a command line.
However, when I try to call pdftotext as you suggested, I keep getting this exception:

    System.ComponentModel.Win32Exception was unhandled by user code
      ErrorCode=-2147467259
      Message="The system cannot find the file specified"
      Source="System"
      StackTrace:
        at System.Diagnostics.Process.StartWithShellExecuteEx(ProcessStartInfo startInfo)
        at System.Diagnostics.Process.Start()
        at System.Diagnostics.Process.Start(ProcessStartInfo startInfo)
        at System.Diagnostics.Process.Start(String fileName)
        at _Default.Pdf2Txt(String options, String pdffile, String textfile) in D:\documents and settings\srbray\My Documents\Visual Studio 2005\Websites\RegCC\FRB.aspx.vb:line 48
        at _Default.Submit1_Click(Object sender, EventArgs e) in D:\documents and settings\srbray\My Documents\Visual Studio 2005\Websites\RegCC\FRB.aspx.vb:line 27
        at System.Web.UI.WebControls.Button.OnClick(EventArgs e)
        at System.Web.UI.WebControls.Button.RaisePostBackEvent(String eventArgument)
        at System.Web.UI.WebControls.Button.System.Web.UI.IPostBackEventHandler.RaisePostBackEvent(String eventArgument)
        at System.Web.UI.Page.RaisePostBackEvent(IPostBackEventHandler sourceControl, String eventArgument)
        at System.Web.UI.Page.RaisePostBackEvent(NameValueCollection postData)
        at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)

This is my code:

    Protected Sub Submit1_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Submit1.Click
        Dim Path As String = System.IO.Path.GetDirectoryName(File1.PostedFile.FileName)
        Dim FileName As String
        Dim MyText() As String
        Dim NewFileName As String
        Dim DataPath As String = "D:\Documents and Settings\srbray\My Documents\Visual Studio 2005\WebSites\RegCC\Data\"
        Dim ArchivePath As String = "D:\Documents and Settings\srbray\My Documents\Visual Studio 2005\WebSites\RegCC\Archive\"
        Dim MMM As String = MonthName(Month(Now()), True)
        Dim YYYY As String = Year(Now())

        'Create new archive directory.
        My.Computer.FileSystem.CreateDirectory(ArchivePath & YYYY & "\" & MMM)
        ArchivePath = ArchivePath & YYYY & "\" & MMM & "\"
        System.IO.Directory.SetCurrentDirectory(DataPath)

        If Not File1.PostedFile Is Nothing And File1.PostedFile.ContentLength > 0 Then
            For Each oneFile As String In My.Computer.FileSystem.GetFiles(Path, FileIO.SearchOption.SearchTopLevelOnly, "*.pdf")
                FileName = System.IO.Path.GetFileName(oneFile)
                MyText = Split(FileName, ".")
                NewFileName = MyText(0) & ".txt"
                movepdffile(oneFile, DataPath & FileName)
                Pdf2Txt("-layout", DataPath & FileName, DataPath & NewFileName)
            Next oneFile
        Else
            MsgBox("Please select the file(s) to upload.")
        End If

        'Insert code here to:
        'Convert .pdf documents into .txt documents with additional code to
        'import data into the Float Reg CC database.
        'Move .pdf files from working directory to archive directory and delete .txt files.
        'My.Computer.FileSystem.MoveFile(DataPath & FileName, ArchivePath & FileName, True)
        'My.Computer.FileSystem.DeleteFile(DataPath & NewFileName)
    End Sub

    Sub Pdf2Txt(ByVal options As String, ByVal pdffile As String, ByVal textfile As String)
        Dim exe As String = "D:\xpdf-win32\pdftotext.exe"
        Dim cmd As String = ("'" & exe & "' " & options & " '" & pdffile & "' '" & textfile & "'")
        MsgBox(cmd)
        System.Diagnostics.Process.Start(cmd)
    End Sub

    Sub movepdffile(ByVal origin As String, ByVal destination As String)
        Try
            My.Computer.FileSystem.MoveFile(origin, destination, False)
        Catch Exc As Exception
            MsgBox("Error: " & Exc.Message)
        End Try
        MsgBox("Move is successful.")
    End Sub

I believe I can make this work, but I am missing something minor....
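A likely cause of the Win32Exception, judging from the stack trace: the Process.Start(String fileName) overload treats its whole argument as the name of the file to run, so the combined string (executable, options, and both paths) is looked up as a single file and not found. Single quotes are also generally not treated as quoting characters on the Windows command line; double quotes are. Below is a minimal sketch of a corrected Pdf2Txt under those assumptions. It keeps the D:\xpdf-win32\pdftotext.exe location from the post; the Quote helper and the PdfConversion module name are made up for illustration, and in the page itself the sub would live in the code-behind class.

```vbnet
Imports System
Imports System.Diagnostics

Module PdfConversion
    ' Hypothetical helper: wraps a path in double quotes so embedded spaces survive.
    Function Quote(ByVal path As String) As String
        Return """" & path & """"
    End Function

    Sub Pdf2Txt(ByVal options As String, ByVal pdffile As String, ByVal textfile As String)
        Dim psi As New ProcessStartInfo()
        ' The executable goes in FileName on its own, without any quotes.
        psi.FileName = "D:\xpdf-win32\pdftotext.exe"
        ' Options and paths go in Arguments, with each path double-quoted.
        psi.Arguments = options & " " & Quote(pdffile) & " " & Quote(textfile)
        ' Start the executable directly instead of going through the shell.
        psi.UseShellExecute = False
        psi.CreateNoWindow = True

        Using proc As Process = Process.Start(psi)
            ' Wait so the text file is complete before anything reads it.
            proc.WaitForExit()
            ' pdftotext reports failure through a non-zero exit code.
            If proc.ExitCode <> 0 Then
                Throw New ApplicationException("pdftotext exited with code " & proc.ExitCode.ToString())
            End If
        End Using
    End Sub
End Module
```

The existing call, Pdf2Txt("-layout", DataPath & FileName, DataPath & NewFileName), would work unchanged with this version, since only the way the command is handed to Process.Start differs.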