From: Tobias Burnus on
Dan Nagle wrote:
> If you can find a compiler that supports the f08
> i/o encoding= specifier, you might be able to tinker
> with the character set (just as another knob to twist).

To my knowledge ENCODING= is not new in Fortran 2008 but was already in
Fortran 2003 - albeit there it was optional whether UTF-8/ISO 10646 is
supported.

For what it is worth, gfortran 4.4.x (released in April 2009) and 4.5
have mostly working support for UTF-8 files and the corresponding
CHARACTER(kind=4) type.

http://gcc.gnu.org/gcc-4.4/changes.html
"Fortran 2003 support has been extended:
* Wide characters (ISO 10646, UCS-4, kind=4) and UTF-8 I/O is now
supported (except internal reads from/writes to wide strings).
-fbackslash now supports also \unnnn and \Unnnnnnnn to enter Unicode
characters."
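
For the curious, a minimal F2003-style sketch of what that looks like in
practice (the file name and code points are made up for illustration;
selected_char_kind returns -1 on processors that don't support ISO 10646):

```fortran
program utf8_demo
  implicit none
  integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
  character(kind=ucs4, len=2) :: wide

  ! Build the string from code points rather than \unnnn escapes,
  ! which would need gfortran's -fbackslash:
  wide = char(int(z'43F'), ucs4) // char(int(z'440'), ucs4)

  open(10, file='out.txt', encoding='UTF-8')
  write(10,*) wide
  close(10)
end program utf8_demo
```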

Tobias
From: Ron Shepard on
In article <pdgAm.46773$ze1.38057(a)news-server.bigpond.net.au>,
"robin" <robin_v(a)bigpond.com> wrote:

> "glen herrmannsfeldt" <gah(a)ugcs.caltech.edu> wrote in message
> news:haofna$of4$1(a)naig.caltech.edu...
> | analyst41(a)hotmail.com wrote:
> | (snip, someone wrote)
> |
> | <> > ?read (cscr,*,end=100,err=100)
> | onvar,(var1(j),j=1,6),pulsevar,(var2(j),j=1,1000)
> |
> | (snip)
> |
> | < why is it so hard to retain the value of j when the first of the two
> | < following events occurs
> |
> | < (1) The read list is satisfied (j = number items asked for plus 1)
> |
> | < (2) EOR, EOF, ERR occurs (j = number of items read
> | upto that time plus 1)
> |
> | It isn't that it is so hard, but that the standard doesn't require it.
> |
> | It often happens deep in the I/O library.
>
> The control variable, namely, "I" is in the program, not in the library.
> That's the value that's updated on each iteration of the loop.

The above read statement is equivalent to

read (cscr,*,end=100,err=100)onvar,var1(1:6),pulsevar,var2(1:1000)

where there is no loop index that is updated because there is no
loop. That is the essential difference between an "implied" and an
"explicit" loop. I have used compilers in the past that attempted
to treat the two read statements differently (using full arrays in
f77, not f90 slices), and there is a very large difference in
efficiency between the two cases. In one approach there is a
library i/o call for each item within the implied do loop; in the
other there is a single library call for the entire array,
obviously a large difference in overhead between the two approaches.

If the standard had required the information to be available, then
it would be possible to return that value from the i/o library to
the calling program (or, given that the i/o library probably works
in terms of memory addresses and strides, at least enough
information could be returned for the calling program to determine
the effective loop value). But that is not the situation: the
language does not require it to be returned, and it prohibits the
programmer from attempting to use the value. In some cases
involving implied do loops (e.g. in data statements; I forget now
whether i/o lists are included), the loop variables are actually a
different entity from other integers that have the same name --
effectively, in the same scoping unit there are different
"integers" that all share the same name.
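
If one does need the count, the standard-conforming workaround is an
explicit loop with IOSTAT=, one item per READ -- a sketch (the names are
borrowed from the example above; note that each list-directed READ starts
a new record, so this only matches the implied-do behavior for
one-item-per-record input):

```fortran
program count_items
  implicit none
  real :: var2(1000)
  integer :: j, ios, nread

  nread = 0
  do j = 1, 1000
     ! one item per statement; IOSTAT tells us when input runs out
     read (*, *, iostat=ios) var2(j)
     if (ios /= 0) exit
     nread = nread + 1
  end do
  print *, 'items read:', nread
end program count_items
```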

$.02 -Ron Shepard
From: dpb on
analyst41(a)hotmail.com wrote:
> On Oct 11, 10:35 am, dpb <n...(a)non.net> wrote:
....
>> 2) Clean up after their mess (which is obviously what you're trying to do)
>>
>> I don't think 2) is possible unequivocally unless there is some way to
>> tell what a record and field length should be a priori and know there
>> are a fixed number of fields per record. Or, iff the field separators
>> are reliable, then you should be able to count fields.
>
> The problems are caused by large varchar columns - so I don't think
> the notion of record length makes sense here.

OK, then you'll have to have a fixed number of fields/record in the
export file to have any chance at all.

>> If that is the case, then it would seem that only way would be to open
>> the file first as "binary" (sorry for the vernacular usage, Richard :) )
>> stream and count field delimiters and simply toss out EOR characters
>> that don't belong and rewrite the file before processing it as csv.
>
> That sounds interesting: We know that "true" EORs can only occur
> after the last column in the database. So if one sees them "in
> between" one can throw them out.
>
> Is there a Windows/DOS tool that will let me see the EOR characters?

Any hex editor...many programming editors have a facility.

The DOS DEBUG command will, in a most rudimentary way. I use the
JPSoftware command processor, which includes a very flexible LIST
utility with a hex mode, so I don't know about other utilities
specifically, but surely there are a zillion shareware/freeware ones
around, I'd think.

> I haven't used binary files in ages - any pointers as to how I can do
> that would be appreciated and I suppose I can look for "EOR" (real or
> spurious) with the IACHAR value of the EOR marker (it is control M on
> unix but I don't exactly know what the csv downloader in the database
> client puts at the end of records.)
....

character*1 :: c
character*1, parameter :: bumchar = char(13) ! or whatever
character*1, parameter :: okchar = ' '       ! or whatever

open(11, file=filename1, form='binary', action='read')
! "binary" is the F95 CVF extension, access='stream' in F03
open(12, file=filename2, form='binary', action='write')

do while (.not. eof(11)) ! eof() is a CVF extension, salt to suit...
   read (11) c
   if (c == bumchar) then
      write(12) okchar
   else
      write(12) c
   end if
end do
close(11)
close(12)

What's missing in the above: unless bumchar is always a bad character,
you'll have to separate the good occurrences from the bad ones when it
can legitimately appear inside a string.

But, on thinking about what you wrote, maybe it isn't... In that case
you could simply slurp up large chunks of the file at a time, do a
global substitution, and write it back out instead of the
character-at-a-time approach above. That would be much faster if the
files are sizable.
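
In F2003 terms that whole-file approach might look something like this
(a sketch, not CVF-specific; the file names are placeholders, and
INQUIRE(SIZE=) is an F2003 feature):

```fortran
program scrub
  implicit none
  character(len=:), allocatable :: buf
  integer :: i, n

  open(11, file='dirty.csv', access='stream', form='unformatted', &
       action='read')
  inquire(11, size=n)                 ! file size in file storage units
  allocate(character(len=n) :: buf)
  read(11) buf
  close(11)

  ! global substitution: blank out every CR (char(13))
  do i = 1, n
     if (buf(i:i) == char(13)) buf(i:i) = ' '
  end do

  open(12, file='clean.csv', access='stream', form='unformatted', &
       action='write')
  write(12) buf
  close(12)
end program scrub
```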

--

From: glen herrmannsfeldt on
Ron Shepard <ron-shepard(a)nospam.comcast.net> wrote:
(snip)

< The above read statement is equivalent to

< read (cscr,*,end=100,err=100)onvar,var1(1:6),pulsevar,var2(1:1000)

< where there is no loop index that is updated because there is no
< loop. That is the essential difference in an "implied" and an
< "explicit" loop. I have used compilers in the past that attempted
< to treat the two read statements differently (using full arrays in
< f77, not with f90 slices), and there is a very large difference in
< the efficiency in the two cases. In one approach, there is a
< library i/o call for each item within the implied do loops, in the
< other approach there is a single library call for the entire array,
< obviously a large difference in overhead for the two approaches.

In either case, valid implementations may use an actual loop in
the generated code, or in the library routine. The more complicated
implied DO structures are more likely to be loops in generated code.
Loops inside the library routine are likely faster.

< If the standard had required the information to be available, then
< it would be possible to return that value from the i/o library to
< the calling program (or, given that the i/o library probably works
< in terms of memory addresses and strides, at least enough
< information could be returned for the calling program to determine
< the effective loop value) but that is not the situation, the
< language does not require it to be returned and it prohibits the
< programmer from attempting to use the value.

This actually happened to me in the VAX/VMS Fortran days. I had
never seen a copy of the standard, but I was sure it was wrong and
was considering sending a bug report to DEC. I never did, though.

< In some cases
< involving implied do loops (e.g. in data statements, I forget now if
< i/o lists are included), the loop variables are actually a different
< entity from other integers that have the same name -- effectively,
< in the same scoping unit there are different "integers" that all
< share the same name.

DATA statements, last I knew, were not executable statements.

I believe the dummy variables in statement functions are also
different (and not necessarily integers).
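
For illustration (statement functions are obsolescent, but):

```fortran
      program sfdemo
      implicit none
      real f, y
c     The dummy y in the statement function below is a separate
c     entity from the variable y, though they share name and type.
      f(y) = 2.0*y + 1.0
      y = 10.0
      print *, f(3.0)   ! 7.0
      print *, y        ! still 10.0
      end
```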

-- glen
From: dpb on
analyst41(a)hotmail.com wrote:
....

> That sounds interesting: We know that "true" EORs can only occur
> after the last column in the database. So if one sees them "in
> between" one can throw them out.
....

One last thought that might help the parsing... is there the possibility
of exporting as tab-delimited instead of csv? _IF_ (the proverbial "big
if" :) ) you could, and there's not a tab in the text fields, that could
make the field identification simpler...
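
The field count per record is then just tabs + 1, e.g. (a trivial
sketch, with a made-up record):

```fortran
program count_tabs
  implicit none
  character(len=256) :: line
  integer :: i, nfields

  line = 'one' // char(9) // 'two' // char(9) // 'three'
  nfields = 1
  do i = 1, len_trim(line)
     if (line(i:i) == char(9)) nfields = nfields + 1
  end do
  print *, 'fields:', nfields   ! 3
end program count_tabs
```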

--