From: Ed Morton on 7 May 2008 12:25

On 5/7/2008 11:04 AM, pk wrote:
> On Wednesday 7 May 2008 17:37, Ed Morton wrote:
>
>>>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>>>
>>>Is there a way to explicitly print out that information (or, better, the
>>>entire collating sequence in use)? I've been looking for a method to do
>>>that for a long time, but I have found no complete answer.
>>
>>I expect you could use the ord() and chr() functions described here:
>>
>>http://www.gnu.org/software/gawk/manual/gawk.html#Ordinal-Functions
>>
>>to do something like:
>>
>>for (i=ord("a");i<=ord("z");i++) {
>>    print chr(i)
>>}
>
> Take this scenario:
>
> $ cat file
> 100e3
> $ echo $LC_ALL
> en_GB
> $ awk '/[A-Z]/' file
> 100e3
> $ LC_ALL=C awk '/[A-Z]/' file
> $
>
> (or, perhaps more elegant,
> $ awk '/[[:upper:]]/' file
> $ )
>
> It seems that the functions you point out use the mere numeric character
> values and don't take locale into account. Using the proposed code for the
> ord() and chr() functions, a loop to print the sequence from "A" to "Z"
> always yields
>
> A
> B
> C
> ...
> Z
>
> under many different locales, even en_GB which, as seen above, clearly
> expands [A-Z] differently.

Good point. In that case, you could do something like this:

range="[a-z]"
for (i=low;i<=high;i++)
    if (chr(i) ~ range)
        print chr(i)

where "low" and "high" are set by the _ord_init() function in the above link. It won't necessarily tell you the actual order in which each character appears in the given range, but that shouldn't matter.

Ed.

From: pk on 7 May 2008 12:46

On Wednesday 7 May 2008 18:25, Ed Morton wrote:

> Good point. In that case, you could do something like this:
>
> range="[a-z]"
> for (i=low;i<=high;i++)
>     if (chr(i) ~ range)
>         print chr(i)
>
> where "low" and "high" are set by the _ord_init() function in the above
> link. It won't necessarily tell you the actual order in which each
> character appears in the given range, but that shouldn't matter.

This is actually somewhat better. With LC_ALL=en_GB, low=0, high=127, both a...z and A...Z are printed, while with LC_ALL=C only a...z (as expected) are printed. But it seems that those functions are ASCII-oriented anyway, so something more general, which works with the actual charset in use (with accented and special characters, etc.), would be great.

And moreover, I'm looking for some method that works in the reverse way, i.e., given the expression, print the expansion.

However, many thanks for your help and suggestions!

--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.

From: Ed Morton on 7 May 2008 13:16

On 5/7/2008 11:46 AM, pk wrote:
> On Wednesday 7 May 2008 18:25, Ed Morton wrote:
>
>>Good point. In that case, you could do something like this:
>>
>>range="[a-z]"
>>for (i=low;i<=high;i++)
>>    if (chr(i) ~ range)
>>        print chr(i)
>>
>>where "low" and "high" are set by the _ord_init() function in the above
>>link. It won't necessarily tell you the actual order in which each
>>character appears in the given range, but that shouldn't matter.
>
> This is actually somewhat better. With LC_ALL=en_GB, low=0, high=127, both
> a...z and A...Z are printed, while with LC_ALL=C only a...z (as expected)
> are printed. But it seems that those functions are ASCII-oriented anyway,
> so something more general, which works with the actual charset in use (with
> accented and special characters, etc.), would be great.

I think only ord() is ASCII/EBCDIC-oriented and you don't use that. All you really need is a max value for all character sets (or pick some ridiculously high value if you don't know), then this:

$ cat showrange.awk
BEGIN{
    for (i=0;i<=1000;i++)
        chars[sprintf("%c",i)]
    for (c in chars)
        if (c ~ range)
            s=s c
    print range":"s
}
$ awk -v range="[a-z]" -f showrange.awk
[a-z]:abcdefghijklmnopqrstuvwxyz
$ awk -v range="[A-Z0-9]" -f showrange.awk
[A-Z0-9]:0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ

should work.

> And moreover, I'm looking for some method that works in the reverse way,
> i.e., given the expression, print the expansion.

That is what it does: given the expression contained in the variable "range", it'll print every character that is in that expression, i.e. the expansion.

> However, many thanks for your help and suggestions!

You're welcome.

Ed.

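For a single-byte character set, the same brute-force idea can probably be written without the intermediate array, which also keeps the output in code order rather than in whatever order "for (c in chars)" happens to produce. What follows is only a sketch under assumptions that are not in the thread: the helper name show_set is made up for illustration, the loop stops at 255 (so multibyte characters are not covered), and the locale is passed in explicitly so the result does not depend on the caller's environment.

```shell
# Hypothetical helper (not from the thread): list the single-byte characters
# that match a regular expression under a given locale.
show_set() {
    # $1 = locale name (e.g. C or en_GB), $2 = regular expression to test
    LC_ALL=$1 awk -v re="$2" 'BEGIN {
        # Start at 1: a NUL byte inside a string is not handled portably.
        for (i = 1; i <= 255; i++) {
            c = sprintf("%c", i)      # character with numeric code i
            if (c ~ re)               # locale-aware match against the RE
                s = s c
        }
        print re ":" s
    }'
}

show_set C '[a-z]'
```

With the C locale this should print [a-z]:abcdefghijklmnopqrstuvwxyz, the same as the showrange.awk run above; what other locales print depends on the awk implementation and the locale data installed.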
From: Ed Morton on 7 May 2008 13:31

On 5/7/2008 12:16 PM, Ed Morton wrote:
[...]
> $ cat showrange.awk
> BEGIN{
>     for (i=0;i<=1000;i++)
>         chars[sprintf("%c",i)]
>     for (c in chars)
>         if (c ~ range)
>             s=s c
>     print range":"s
> }
[...]

Of course, I should've used "re" instead of "range" in the above since the script outputs every character that matches an RE, not just those in a specific range. "set" or "list" would also have been more appropriate for the original intent.

Ed.

From: pk on 7 May 2008 14:01

On Wednesday 7 May 2008 19:16, Ed Morton wrote:

> I think only ord() is ASCII/EBCDIC-oriented and you don't use that. All
> you really need is a max value for all character sets (or pick some
> ridiculously high value if you don't know) then this:
>
> $ cat showrange.awk
> BEGIN{
>     for (i=0;i<=1000;i++)
>         chars[sprintf("%c",i)]
>     for (c in chars)
>         if (c ~ range)
>             s=s c
>     print range":"s
> }
> $ awk -v range="[a-z]" -f showrange.awk
> [a-z]:abcdefghijklmnopqrstuvwxyz
> $ awk -v range="[A-Z0-9]" -f showrange.awk
> [A-Z0-9]:0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
>
> should work.

Well...sort of. Currently, I'm using the en_GB.utf8 locale, and I get this:

$ echo "aa" | wc -c
3
$ echo "àà" | wc -c
5
$ echo "€€" | wc -c
7

so, it seems I'm really using UTF-8. The [a-z] RE does match accented characters:

$ echo 'è' | grep '[a-z]'
è
$ echo 'ò' | awk '/[a-z]/'
ò

etc. However, running your script gives:

$ awk -v range='[a-z]' -f sr.awk
[a-z]:ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxyz

I modified the script slightly, like this:

for (i=0;i<=1000;i++) {
    chars[sprintf("%c",i)]
    printf "%c\n", i
}

to see what goes into chars[], and I saw this:

^@
^A
^B
^C
^D
........
^Z
^\
^]
^^
^_
!
"
#
$
%
........
u
v
w
x
y
z
{
|
}
~
^?
<80>
<81>
<82>
........
<FC>
<FD>
<FE>
<FF>
^@
^A
^B
^C
^D
......etc.

So, it seems awk (or perhaps the "%c" specifier) only uses ASCII characters. I'm not sure where the problem is here (maybe PEBKAC as well, of course).

I know I'm throwing many elements into the picture at once here, but...where can I look to check whether awk is behaving properly, or whether some other problem exists? I must say that, while I understand the general ideas behind locales and UTF-8, I've never dug terribly deep into those concepts, so I may very well be missing something here.

>> And moreover, I'm looking for some method that works in the reverse way,
>> i.e., given the expression, print the expansion.
>
> That is what it does - given the expression contained in the variable
> "range", it'll print every character that is in that expression, i.e. the
> expansion.

Well yes, but using a kind of "brute force" method. I was thinking about a program that, by reading "something" "somewhere" (for some values of "something" and "somewhere", perhaps some locale-definition file or the like), automagically produces the requested collating sequence.

Thanks for any help.

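One low-tech way to look at the collating order itself, still empirical rather than read from a locale-definition file, is to let sort(1) order a handful of characters under the locale in question. A sketch, with two caveats: the function name collate_order is made up here for illustration, and the way a regex range expands is not guaranteed to follow exactly the order that sort produces.

```shell
# Hypothetical helper (not from the thread): print the given characters,
# one per line, in the order sort(1) gives them under the named locale.
collate_order() {
    # $1 = locale name; remaining arguments = characters to order
    loc=$1
    shift
    printf '%s\n' "$@" | LC_ALL=$loc sort
}

collate_order C b A a B
```

In the C locale this prints A, B, a, b, one per line (plain byte order); under a locale such as en_GB the upper- and lower-case letters would be expected to interleave instead, which would be consistent with the [a-z] results shown earlier in the thread.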