From: gmagklaras on
This question is with reference to RFC 4648
(http://tools.ietf.org/html/rfc4648#section-3.4), which addresses the
canonical encoding format.
What's the common practice for sorting base64 numbers? One could in
theory construct a comparator function as part of a standard sort
procedure, according to the values of the base64 alphabet which could
briefly have the valid symbols in order:

A to Z, a to z, 0 to 9, + /

However, if one wanted to implement alphabetical (asciibetical) order,
ASCII assigns a different order value to the above symbols:

+ /, 0 to 9, A to Z, a to z

Is there any preference or reason to stick to one or the other sorting
method based on the priority order when dealing with base64 encoded
values? References would be greatly appreciated.

Thanks.

GM
From: [Jongware] on
gmagklaras(a)gmail.com wrote:
> This question is with reference to RFC 4648 (http://tools.ietf.org/
> html/rfc4648#section-3.4) addressing the canonical encoding format.
> What's the common practice for sorting base64 numbers? One could in
> theory construct a comparator function as part of a standard sort
> procedure, according to the values of the base64 alphabet which could
> briefly have the valid symbols in order:
>
> A to Z, a to z, 0 to 9, + /
>
> However, if one wanted to implement alphabetical (asciibetical) order,
> ASCII assigns a different order value to the above symbols:
>
> +-,0 to 9, A to Z, a to z
>
> Is there any preference or reason to stick to one or the other sorting
> method based on the priority order when dealing with base64 encoded
> values?

Why would one want to sort numbers in /any/ base alphabetically?
Consider the list (5, 6, 40, 41, 201, 202). Sorted by ASCII codes it
would come out as (201, 202, 40, 41, 5, 6), which has no numeric meaning
at all.
To sort numbers (or, more generally, to compare two numbers in any
base), you have to compare numbers with the same number of digits.
Besides that, for any non-digit character, you'll have to consider its
assigned numerical value, i.e., in your case that string "A to Z, a to
z, 0 to 9, + /".
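
The difference between the two orderings can be made concrete with a small C sketch. It assumes unsigned decimal strings without leading zeros; the names cmp_ascii and cmp_numeric are invented for this illustration. The numeric comparator applies the "same number of digits" rule by comparing lengths first.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: plain ASCII (strcmp) order. */
int cmp_ascii(const void *pa, const void *pb) {
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcmp(a, b);
}

/* qsort comparator: numeric order for unsigned decimal strings
 * without leading zeros. A shorter string is a smaller number;
 * strings of equal length are compared digit by digit. */
int cmp_numeric(const void *pa, const void *pb) {
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    size_t la = strlen(a), lb = strlen(b);
    if (la != lb)
        return (la < lb) ? -1 : 1;
    return strcmp(a, b);
}
```

Passing cmp_ascii to qsort over the strings "5", "6", "40", "41", "201", "202" puts "201" first, while cmp_numeric keeps them in numeric order.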

I'm a bit surprised about your ordering string -- I'd guess it should be
"0 to 9, A to Z, a to z, + /", so digits lower than a real value of 10
are displayed in their familiar form "-1, 0, 1, 2, 3", as opposed to
what you state -- "-A, ..?, A, B, C".
[Exactly one wiki later:] Aha,

"[..] indices into the string:
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/""

-- without regarding any numerical value. From reading the wiki, it's
not correct to consider this a base64 *number*; it's at best a base64
*string* (as the resulting value has no numerical meaning).
Still, for any sorting you should stick to the definition: the index
order in the predefined string.
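
The "index order in the predefined string" rule can be written almost literally in C. This is a minimal sketch, assuming the standard RFC 4648 alphabet quoted above; the names B64_ALPHABET and b64_index are made up for illustration.

```c
#include <string.h>

static const char B64_ALPHABET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Returns the position of c in the base64 alphabet (0..63),
 * or -1 if c is not a base64 symbol. */
int b64_index(char c) {
    const char *p;
    if (c == '\0')
        return -1; /* strchr would otherwise match the terminator */
    p = strchr(B64_ALPHABET, c);
    return p ? (int)(p - B64_ALPHABET) : -1;
}
```

Two symbols are then ordered by comparing their b64_index values rather than their ASCII codes.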

>References would be greatly appreciated.

http://en.wikipedia.org/wiki/Base64 mentions a number of useful links.

[Jongware]
From: Greg Herlihy on
On Jul 1, 4:03 am, "gmagkla...(a)gmail.com" <gmagkla...(a)gmail.com>
wrote:
> This question is with reference to RFC 4648 (http://tools.ietf.org/
> html/rfc4648#section-3.4) addressing the canonical encoding format.
> What's the common practice for sorting base64 numbers?

Presumably one sorts base-64 numbers as one would sort numbers of any
other base - lowest to highest. This question seems to have nothing to
do with RFC 4648 (which describes a Base64 data -encoding- protocol).

> One could in theory construct a comparator function as part of a standard sort
> procedure, according to the values of the base64 alphabet which could
> briefly have the valid symbols in order:
>
> A to Z, a to z, 0 to 9, + /
>
> However, if one wanted to implement alphabetical (asciibetical) order,
> ASCII assigns a different order value to the above symbols:
>
> +-,0 to 9, A to Z, a to z
>
> Is there any preference or reason to stick to one or the other sorting
> method based on the priority order when dealing with base64 encoded
> values? References would be greatly appreciated.

Let's try answering these questions by sorting some actual, base64-
encoded data.

For the sample data, I have created (and base64-encoded) a list with
the names of ten common fruits. So, the task here is to sort the
fruits on my list, alphabetically by (base64-encoded) name. Here is
the (base64-encoded) list to sort:

begin-base64 644 fruits.txt
b3JhbmdlCmFwcGxlCnBlYWNoCmdyYXBlZnJ1aXQKcGVhcgpncmFwZQphcHJpY290CmxlbW9uCm5l
Y3RhcmluZQp0YW5nZXJpbmU=
====

Now, it strikes me that none of the proposed base64 sorting schemes
are likely to be at all effective at sorting the fruit names on my
list. In fact, this list is not even recognizable as such until it has
been decoded - at which point the names in the list can be sorted
quite easily.

So the answer is that there is no convention for sorting base64-encoded
data, because there is no way to sort the data without first decoding
it. After all, data encoding is simply a protocol of mapping a set of
data values to another set of corresponding, encoded values (and back
again) solely for the purpose of transporting the data safely. And as
long as the data is specified by the set of encoded data values - the
data itself is not accessible. Therefore, it makes little sense to
discuss ways of sorting or searching or otherwise doing anything with
the encoded data - other than decoding it.
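
The "decode first" step can be sketched in C as below. This is only an illustration under stated assumptions: the function name b64_decode is invented here, characters outside the alphabet (newlines, '=' padding) are simply skipped, and no validation of malformed input is attempted.

```c
#include <stddef.h>
#include <string.h>

/* Decodes base64 text `in` into `out`, which must have room for at
 * least 3 * strlen(in) / 4 bytes. Characters outside the alphabet
 * (newlines, '=' padding) are skipped. Returns the byte count. */
size_t b64_decode(const char *in, unsigned char *out) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    unsigned long acc = 0; /* holds the bits not yet written out */
    int bits = 0;          /* number of valid bits in acc */
    size_t n = 0;

    for (; *in != '\0'; ++in) {
        const char *p = strchr(alphabet, *in);
        if (p == NULL)
            continue;
        acc = (acc << 6) | (unsigned long)(p - alphabet);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out[n++] = (unsigned char)((acc >> bits) & 0xFF);
            acc &= (1UL << bits) - 1; /* keep only the leftover bits */
        }
    }
    return n;
}
```

Applied to the two data lines of fruits.txt, this yields the newline-separated fruit names, which can then be split and sorted with any ordinary string sort.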

Greg

From: gmagklaras on
On 2 Jul, 01:07, Greg Herlihy <gre...(a)mac.com> wrote:
> On Jul 1, 4:03 am, "gmagkla...(a)gmail.com" <gmagkla...(a)gmail.com>
> wrote:
>
> > This question is with reference to RFC 4648 (http://tools.ietf.org/
> > html/rfc4648#section-3.4) addressing the canonical encoding format.
> > What's the common practice for sorting base64 numbers?
>
> Presumably one sorts base-64 numbers as one would sort numbers of any
> other base - lowest to highest. This question seems to have nothing to
> do with RFC 4648 (which describes a Base64 data -encoding- protocol).
>
> > One could in theory construct a comparator function as part of a standard sort
> > procedure, according to the values of the base64 alphabet which could
> > briefly have the valid symbols in order:
>
> > A to Z, a to z, 0 to 9, + /
>
> > However, if one wanted to implement alphabetical (asciibetical) order,
> > ASCII assigns a different order value to the above symbols:
>
> > +-,0 to 9, A to Z, a to z
>
> > Is there any preference or reason to stick to one or the other sorting
> > method based on the priority order when dealing with base64 encoded
> > values? References would be greatly appreciated.
>
> Let's try answering these questions by sorting some actual, base64-
> encoded data.
>
> For the sample data, I have created (and base64-encoded) a list with
> the names of ten, common fruits. So, the task here is to sort the
> fruits on my list, alphabetically by (base64-encoded) name. Here is
> the (base64-encoded) list to sort:
>
> begin-base64 644 fruits.txt
> b3JhbmdlCmFwcGxlCnBlYWNoCmdyYXBlZnJ1aXQKcGVhcgpncmFwZQphcHJpY290CmxlbW9uCm5l
> Y3RhcmluZQp0YW5nZXJpbmU=
> ====
>
> Now, it strikes me that none of the proposed base64 sorting schemes
> are likely to be at all effective at sorting the fruit names on my
> list. In fact, this list is not even recognizable as such until it has
> been decoded - at which point the names in the list can be sorted
> quite easily.
>
Greg, thanks for your answer. You assume that I encode/encapsulate data
in Base64. I should have said that this is not the case. In fact, what
we have are SHA-1 digest values produced to uniquely identify protein
sequences, as part of a bioinformatics project. The digests are
27 characters long (without the padding), as specified here:

http://bioinformatics.anl.gov/seguid/overview.aspx

As part of an index generation process, we need to sort a list/array
of these values and produce a new identifier. Thus, my question about
the practice of sorting base64 values. I should of course have been
more specific.

GM
From: Sigmund Lappegård Lahn on
gmagklaras(a)gmail.com wrote:

> On 2 Jul, 01:07, Greg Herlihy <gre...(a)mac.com> wrote:
>> On Jul 1, 4:03 am, "gmagkla...(a)gmail.com" <gmagkla...(a)gmail.com>
>> wrote:
>>
(snip------------------)
>>
> Greg thanks for your answer. You assume that I encode/encapsulate data
> in Base64. I should have said that this is not the case. In fact, what
> we have is SHA-1 digest values produced to identify uniquely protein
> sequences, as part of a bioinformatics project. The digest values are
> 27 character long digests (without the padding) as specified here:
>
> http://bioinformatics.anl.gov/seguid/overview.aspx
>
> As part of an index generation process, we need to sort a list/array
> of these values and produce a new identifier. Thus, my question about
> the practice of sorting base64 values. I should of course had been
> more specific.
>
> GM

Here is a sketch of a compare function for two base64 strings of length 27.
Haven't actually tried it, but I think you get my drift.

int base64_charvalue(const char c) {
    if (c >= 'A' && c <= 'Z')
        return c - 'A';            /* 0..25 */
    else if (c >= 'a' && c <= 'z')
        return 26 + (c - 'a');     /* 26..51 */
    else if (c >= '0' && c <= '9')
        return 52 + (c - '0');     /* 52..61 */
    else /* must be '+' or '/' now */
        return (c == '+') ? 62 : 63;
}

int base64_cmp(const char* stra, const char* strb) {
    int i = 0, a, b;
    /* skip the common prefix */
    while (i < 27 && stra[i] == strb[i]) {
        ++i;
    }
    if (i == 27) return 0; /* strings are equal */

    /* the first differing character decides the order */
    a = base64_charvalue(stra[i]);
    b = base64_charvalue(strb[i]);
    return a - b;
}
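
For the index-generation use case, a comparator like this would normally be handed to qsort. The following is a hedged sketch, assuming the digests are held as an array of pointers to 27-character strings; SEGUID_LEN, b64_rank, seguid_cmp and sort_seguids are names invented here, not part of any library.

```c
#include <stdlib.h>
#include <string.h>

#define SEGUID_LEN 27 /* assumed: unpadded base64 SHA-1 digest length */

/* Position of c in the RFC 4648 base64 alphabet (0..63), -1 if absent. */
static int b64_rank(char c) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const char *p = (c != '\0') ? strchr(alphabet, c) : NULL;
    return p ? (int)(p - alphabet) : -1;
}

/* qsort comparator over an array of `const char *` digests. */
int seguid_cmp(const void *pa, const void *pb) {
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    int i;
    for (i = 0; i < SEGUID_LEN; ++i) {
        if (a[i] != b[i]) /* first differing symbol decides */
            return b64_rank(a[i]) - b64_rank(b[i]);
    }
    return 0; /* all 27 symbols match */
}

/* Sorts n digest strings in place by base64 alphabet order. */
void sort_seguids(const char **ids, size_t n) {
    qsort(ids, n, sizeof ids[0], seguid_cmp);
}
```

Because all digests have the same length, ordering by alphabet index is the same as ordering the underlying 20-byte SHA-1 values bytewise. Plain strcmp (ASCII order) would give a different but equally consistent order, so what matters for an index is that everyone producing and consuming it uses the same one.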



- Sigmund