From: Ed Morton on
On 5/7/2008 11:04 AM, pk wrote:
> On Wednesday 7 May 2008 17:37, Ed Morton wrote:
>
>
>>>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>>>
>>>
>>>Is there a way to explicitly print out that information (or, better, the
>>>entire collating sequence in use)? I've been looking for a method to do
>>>that for long time, but I have found no complete answer.
>>>
>>
>>I expect you could use the ord() and chr() functions described here:
>>
>>http://www.gnu.org/software/gawk/manual/gawk.html#Ordinal-Functions
>>
>>to do something like:
>>
>>for (i=ord("a");i<=ord("z");i++) {
>>print chr(i)
>>}
>
>
> Take this scenario:
>
> $ cat file
> 100e3
> $ echo $LC_ALL
> en_GB
> $ awk '/[A-Z]/' file
> 100e3
> $ LC_ALL=C awk '/[A-Z]/' file
> $
>
> (or, perhaps more elegant,
> $ awk '/[[:upper:]]/' file
> $ )
>
> It seems that the functions you point out use only the numeric character
> values and don't take the locale into account. Using the proposed code for the
> ord() and chr() functions, a loop to print the sequence from "A" to "Z"
> always yields
>
> A
> B
> C
> ...
> Z
>
> under many different locales, even en_GB which, as seen above, clearly
> expands [A-Z] differently.

Good point. In that case, you could do something like this:

range="[a-z]"
for (i=low;i<=high;i++)
    if (chr(i) ~ range)
        print chr(i)

where "low" and "high" are set by the _ord_init() function in the above link. It
won't necessarily tell you the order in which the characters appear in the
given range, but that shouldn't matter.

Ed.

From: pk on
On Wednesday 7 May 2008 18:25, Ed Morton wrote:

> Good point. In that case, you could do something like this:
>
> range="[a-z]"
> for (i=low;i<=high;i++)
> if (chr(i) ~ range)
> print chr(i)
>
> where "low" and "high" are set by the _ord_init() function in the above
> link. It won't necessarily tell you the actual order each character
> appears in in the given range, but that shouldn't matter.

This is actually somewhat better. With LC_ALL=en_GB, low=0, high=127, both
a...z and A...Z are printed, while with LC_ALL=C only a...z (as expected)
are printed. But it seems that those functions are ASCII-oriented anyway,
so something more general, which works with the actual charset in use (with
accented and special characters, etc.), would be great.
Moreover, I'm looking for some method that works in the reverse way, i.e.,
given the expression, print the expansion.

However, many thanks for your help and suggestions!

--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.

From: Ed Morton on


On 5/7/2008 11:46 AM, pk wrote:
> On Wednesday 7 May 2008 18:25, Ed Morton wrote:
>
>
>>Good point. In that case, you could do something like this:
>>
>>range="[a-z]"
>>for (i=low;i<=high;i++)
>>if (chr(i) ~ range)
>>print chr(i)
>>
>>where "low" and "high" are set by the _ord_init() function in the above
>>link. It won't necessarily tell you the actual order each character
>>appears in in the given range, but that shouldn't matter.
>
>
> This is actually somewhat better. With LC_ALL=en_GB, low=0, high=127, both
> a...z and A...Z are printed, while with LC_ALL=C only a...z (as expected)
> are printed. But it seems that those functions are ascii-oriented anyway,
> so something more general, which works with the actual charset in use (with
> accented and special characters, etc), would be great.

I think only ord() is ASCII/EBCDIC-oriented and you don't use that. All you
really need is a max value for all character sets (or pick some ridiculously
high value if you don't know) then this:

$ cat showrange.awk
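BEGIN{
    # collect candidate characters: one per numeric value 0..1000
    for (i=0;i<=1000;i++)
        chars[sprintf("%c",i)]
    # append every candidate that matches the RE passed in via -v range=...
    for (c in chars)
        if (c ~ range)
            s=s c
    print range":"s
}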
$ awk -v range="[a-z]" -f showrange.awk
[a-z]:abcdefghijklmnopqrstuvwxyz
$ awk -v range="[A-Z0-9]" -f showrange.awk
[A-Z0-9]:0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ

should work.

> And moreover, I'm looking for some method that works in the reverse way, ie,
> given the expression, print the expansion.

That is what it does - given the expression contained in the variable "range",
it'll print every character that is in that expression, i.e. the expansion.

> However, many thanks for your help and suggestions!
>

You're welcome.

Ed.

From: Ed Morton on


On 5/7/2008 12:16 PM, Ed Morton wrote:
>
> On 5/7/2008 11:46 AM, pk wrote:
>
>>On Wednesday 7 May 2008 18:25, Ed Morton wrote:
>>
>>
>>
>>>Good point. In that case, you could do something like this:
>>>
>>>range="[a-z]"
>>>for (i=low;i<=high;i++)
>>>if (chr(i) ~ range)
>>>print chr(i)
>>>
>>>where "low" and "high" are set by the _ord_init() function in the above
>>>link. It won't necessarily tell you the actual order each character
>>>appears in in the given range, but that shouldn't matter.
>>
>>
>>This is actually somewhat better. With LC_ALL=en_GB, low=0, high=127, both
>>a...z and A...Z are printed, while with LC_ALL=C only a...z (as expected)
>>are printed. But it seems that those functions are ascii-oriented anyway,
>>so something more general, which works with the actual charset in use (with
>>accented and special characters, etc), would be great.
>
>
> I think only ord() is ASCII/EBCDIC-oriented and you don't use that. All you
> really need is a max value for all character sets (or pick some ridiculously
> high value if you don't know) then this:
>
> $ cat showrange.awk
> BEGIN{
> for (i=0;i<=1000;i++)
> chars[sprintf("%c",i)]
> for (c in chars)
> if (c ~ range)
> s=s c
> print range":"s
> }
> $ awk -v range="[a-z]" -f showrange.awk
> [a-z]:abcdefghijklmnopqrstuvwxyz
> $ awk -v range="[A-Z0-9]" -f showrange.awk
> [A-Z0-9]:0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
>
> should work.
>
>
>>And moreover, I'm looking for some method that works in the reverse way, ie,
>>given the expression, print the expansion.
>
>
> That is what it does - given the expression contained in the variable "range",
> it'll print every character that is in that expression, i.e. the expansion.
>
>
>>However, many thanks for your help and suggestions!
>>
>
>
> You're welcome.
>
> Ed.
>

Of course, I should've used "re" instead of "range" in the above since the
script outputs every character that matches an RE, not just those in a specific
range. "set" or "list" would also have been more appropriate for the original
intent.
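
As a rough sketch of the same brute-force idea with the variable renamed to
"re": the shell function below (the name expand_re is made up for this
example) walks the single-byte values 1..255 in numeric order instead of using
"for (c in chars)", whose traversal order awk leaves unspecified. It assumes a
POSIX-style awk and, like the script above, only covers single-byte
characters. The output is in numeric order, not in collating order.

```shell
# expand_re RE: print every single-byte character (values 1..255) that
# matches RE under the locale currently in effect, in numeric order.
# "expand_re" is a hypothetical helper name, not part of any tool.
expand_re() {
    awk -v re="$1" 'BEGIN {
        for (i = 1; i <= 255; i++) {
            c = sprintf("%c", i)      # character with numeric value i
            if (c ~ re)
                s = s c
        }
        print re ":" s
    }'
}

# Run in a subshell so the locale change does not leak into the session.
(export LC_ALL=C; expand_re '[a-z]')          # [a-z]:abcdefghijklmnopqrstuvwxyz
(export LC_ALL=C; expand_re '[aeiou]|[0-9]')  # [aeiou]|[0-9]:0123456789aeiou
```

The second call shows that any RE works, not only a bracketed range. Dropping
the "export LC_ALL=C" part runs the same test under whatever locale the shell
is using.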

Ed.

From: pk on
On Wednesday 7 May 2008 19:16, Ed Morton wrote:

> I think only ord() is ASCII/EBCDIC-oriented and you don't use that. All
> you really need is a max value for all character sets (or pick some
> ridiculously high value if you don't know) then this:
>
> $ cat showrange.awk
> BEGIN{
> for (i=0;i<=1000;i++)
> chars[sprintf("%c",i)]
> for (c in chars)
> if (c ~ range)
> s=s c
> print range":"s
> }
> $ awk -v range="[a-z]" -f showrange.awk
> [a-z]:abcdefghijklmnopqrstuvwxyz
> $ awk -v range="[A-Z0-9]" -f showrange.awk
> [A-Z0-9]:0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
>
> should work.

Well...sort of. Currently, I'm using the en_GB.utf8 locale, and I get this:

$ echo "aa" | wc -c
3
$ echo "àà" | wc -c
5
$ echo "€€" | wc -c
7

So it seems I'm really using UTF-8. The [a-z] RE does match accented
characters:

$ echo 'è' | grep '[a-z]'
è
$ echo 'ò' | awk '/[a-z]/'
ò

etc.

However, running your script gives:

$ awk -v range='[a-z]' -f sr.awk
[a-z]:ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxyz

I modified the script slightly, like this:

for (i=0;i<=1000;i++) {
chars[sprintf("%c",i)]
printf "%c\n", i
}

to see what goes into chars[], and I saw this:

^@
^A
^B
^C
^D
........
^Z

^\
^]
^^
^_

!
"
#
$
%
........
u
v
w
x
y
z
{
|
}
~
^?
<80>
<81>
<82>
........
<FC>
<FD>
<FE>
<FF>
^@
^A
^B
^C
^D
......etc.

So it seems awk (or perhaps the "%c" specifier) only produces single bytes,
wrapping around after <FF>, rather than multibyte characters. I'm not sure
where the problem is here (maybe PEBKAC as well, of course). I know I'm
throwing many elements into the picture at once here, but where can I look
to check whether awk is behaving properly, or whether some other problem
exists?
I must say that, while I understand the general ideas behind locales and
UTF-8, I've never dug terribly deep into those concepts, so I may very well
be missing something here.
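
One possible way around the single-byte limit, sketched here under the
assumption that the locale of interest uses UTF-8: build each candidate's
UTF-8 byte sequence by hand under LC_ALL=C (where "%c" is expected to emit one
byte per value), one character per line, and then filter those lines with awk
running under the locale being examined. The function names utf8_candidates
and expand_re_utf8 are invented for this example and the code-point limits are
arbitrary. Whether the result mirrors what grep or another awk would match
depends on the implementation.

```shell
# utf8_candidates FIRST LAST: print code points FIRST..LAST (decimal, at most
# 65535), one per line, hand-encoded as UTF-8. Runs under LC_ALL=C so that
# "%c" emits exactly one byte per numeric value.
utf8_candidates() {
    LC_ALL=C awk -v first="$1" -v last="$2" 'BEGIN {
        for (cp = first + 0; cp <= last + 0; cp++) {
            if (cp >= 55296 && cp <= 57343)   # skip UTF-16 surrogates
                continue
            if (cp < 128)                     # 1 byte:  0xxxxxxx
                printf "%c\n", cp
            else if (cp < 2048)               # 2 bytes: 110xxxxx 10xxxxxx
                printf "%c%c\n", 192 + int(cp / 64), 128 + cp % 64
            else                              # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                printf "%c%c%c\n", 224 + int(cp / 4096), 128 + int(cp / 64) % 64, 128 + cp % 64
        }
    }'
}

# expand_re_utf8 RE: keep the candidates (code points 32..1000 here) whose
# line matches RE in the caller's locale, and print them on one line.
expand_re_utf8() {
    utf8_candidates 32 1000 |
        awk -v re="$1" '$0 ~ re { s = s $0 } END { print re ":" s }'
}

expand_re_utf8 '[a-z]'
```

Generation and matching are split on purpose: the first awk only has to
produce bytes, so its locale is pinned to C, while the second awk does the
matching and inherits the locale under test.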

>> And moreover, I'm looking for some method that works in the reverse way,
>> ie, given the expression, print the expansion.
>
> That is what it does - given the expression contained in the variable
> "range", it'll print every character that is in that expression, i.e. the
> expansion.

Well yes, but using a kind of "brute force" method. I was thinking about a
program that, by reading "something" "somewhere" (for some values
of "something" and "somewhere", perhaps some locale-definition file or the
like), automagically produces the requested collating sequence.
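
Short of parsing the locale definition itself, one rough way to see the order
(rather than just the membership) is to hand a list of candidate characters,
one per line, to sort, which compares lines using the current locale's
collation rules. This is still brute force, the helper name collation_order is
hypothetical, and there is no guarantee that a given tool expands RE ranges in
exactly the order sort reports.

```shell
# collation_order: read one character per line on stdin and print them
# on a single line, in the collating order of the current locale.
collation_order() {
    sort | tr -d '\n'
    echo
}

# Under the C locale the order is plain byte order, so upper case comes first.
printf '%s\n' b A a B | (export LC_ALL=C; collation_order)   # ABab
```

Feeding it the candidate list for the locale of interest, without the
"export LC_ALL=C" part, shows how that locale interleaves upper case, lower
case and accented letters.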

Thanks for any help.
