From: Dotan Cohen on
On Tue, Aug 3, 2010 at 18:41, Dave Angel <davea(a)ieee.org> wrote:
> I don't understand your wording. Certainly the server launches the python
> script, and captures stdout. It then sends that stream of bytes out over
> tcp/ip to the waiting browser. You ask when does it become html ? I don't
> think the question has meaning.
>

HTML is just plain text. So the answer to the question is that,
ideally, the plain text that is sent to stdout would already be HTML.

print ( "<title>My Greek Page</title>\n" )
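
A minimal sketch of that idea, written for Python 3 (the thread itself is about Python 2): the script writes a header, a blank line, and then the HTML to stdout. The build_response helper and the Content-Type line with its charset are assumptions added for illustration, not part of the original message.

```python
def build_response(title, charset="utf-8"):
    """Return the text a CGI script would write to stdout (hypothetical helper)."""
    # Header line, then a blank line that separates the headers from the body.
    header = "Content-Type: text/html; charset=%s\n\n" % charset
    # The body is already HTML by the time it reaches stdout.
    body = "<html><head><title>%s</title></head><body></body></html>\n" % title
    return header + body

response = build_response("My Greek Page")
print(response, end="")
```

The server only relays these bytes, so anything that should be HTML has to be HTML when it is printed.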

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
From: MRAB on
Dave Angel wrote:
> Νίκος wrote:
>>> On 3 Aug, 18:41, Dave Angel <da...(a)ieee.org> wrote:
>>>
>>>> Different encodings equal different ways of storing the data to the
>>>> media, correct?
>>>>
>>> Exactly. The file is a stream of bytes, and Unicode has more than 256
>>> possible characters. Further, even the subset of characters that *do*
>>> take one byte are different for different encodings. So you need to tell
>>> the editor what encoding you want to use.
>>>
>>
>> For example an 'a' char in iso-8859-1 is stored different than an 'a'
>> char in iso-8859-7 and an 'a' char of utf-8 ?
>>
>>
>>
> Nope, the ASCII subset is identical. It's the ones between 80 and ff
> that differ, and of course not all of those. Further, some of the codes
> that are one byte in 8859 are two bytes in utf-8.
>
> You *could* just decide that you're going to hardwire the assumption
> that you'll be dealing with a single character set that does fit in 8
> bits, and most of this complexity goes away. But if you do that, do
> *NOT* use utf-8.
>
> But if you do want to be able to handle more than 256 characters, or
> more than one encoding, read on.
>
> Many people confuse encoding and decoding. A unicode character is an
> abstraction which represents a raw character. For convenience, the first
> 128 code points map directly onto the 7 bit encoding called ASCII. But
> before Unicode there were several other extensions to 256, which were
> incompatible with each other. For example, a byte which might be a
> European character in one such encoding might be a kata-kana character
> in another one. Each encoding was 8 bits, but it was difficult for a
> single program to handle more than one such encoding.
>
One encoding might be ASCII + accented Latin, another ASCII + Greek,
another ASCII + Cyrillic, etc. If you wanted ASCII + accented Latin +
Greek then you'd need more than 1 byte per character.

If you're working with multiple alphabets it gets very messy, which is
where Unicode comes in. It contains all those characters, and UTF-8 can
encode all of them in a straightforward manner.

> So along comes unicode, which is typically implemented in 16 or 32 bit
> cells. And it has an 8 bit encoding called utf-8 which uses one byte for
> the first 192 characters (I think), and two bytes for some more, and
> three bytes beyond that.
>
[snip]
In UTF-8 the first 128 codepoints are encoded to 1 byte.
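
A short Python 3 sketch of the points above, assuming nothing beyond the standard codecs: code points up to 7f take one byte in UTF-8, the ASCII subset is the same byte in all three encodings, and a byte in the 80 to ff range means different characters in different ISO-8859 encodings. The sample characters are chosen only for illustration.

```python
# Code points on both sides of the 0x7f cutoff, plus a few larger ones.
samples = ["a", "\u007f", "\u0080", "\u03b1", "\u20ac", "\U0001f600"]
utf8_lengths = {hex(ord(ch)): len(ch.encode("utf-8")) for ch in samples}
for code_point, length in utf8_lengths.items():
    print(code_point, length)

# An ASCII character is stored as the same single byte in all three encodings.
ascii_same = (
    "a".encode("iso-8859-1") == "a".encode("iso-8859-7") == "a".encode("utf-8") == b"a"
)

# The same byte above 0x7f decodes to different characters.
latin = b"\xe1".decode("iso-8859-1")  # LATIN SMALL LETTER A WITH ACUTE
greek = b"\xe1".decode("iso-8859-7")  # GREEK SMALL LETTER ALPHA
```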
From: Dave Angel on


MRAB wrote:
> Dave Angel wrote:
>> Νίκος wrote:
>>>> On 3 Aug, 18:41, Dave Angel <da...(a)ieee.org> wrote:
>>>>> Different encodings equal different ways of storing the data to the
>>>>> media, correct?
>>>> Exactly. The file is a stream of bytes, and Unicode has more than 256
>>>> possible characters. Further, even the subset of characters that *do*
>>>> take one byte are different for different encodings. So you need to tell
>>>> the editor what encoding you want to use.
>>>
>>> For example an 'a' char in iso-8859-1 is stored different than an 'a'
>>> char in iso-8859-7 and an 'a' char of utf-8 ?
>>>
>>>
>> Nope, the ASCII subset is identical. It's the ones between 80 and ff
>> that differ, and of course not all of those. Further, some of the
>> codes that are one byte in 8859 are two bytes in utf-8.
>>
>> You *could* just decide that you're going to hardwire the assumption
>> that you'll be dealing with a single character set that does fit in 8
>> bits, and most of this complexity goes away. But if you do that, do
>> *NOT* use utf-8.
>>
>> But if you do want to be able to handle more than 256 characters, or
>> more than one encoding, read on.
>>
>> Many people confuse encoding and decoding. A unicode character is an
>> abstraction which represents a raw character. For convenience, the
>> first 128 code points map directly onto the 7 bit encoding called
>> ASCII. But before Unicode there were several other extensions to 256,
>> which were incompatible with each other. For example, a byte which
>> might be a European character in one such encoding might be a
>> kata-kana character in another one. Each encoding was 8 bits, but it
>> was difficult for a single program to handle more than one such
>> encoding.
>>
> One encoding might be ASCII + accented Latin, another ASCII + Greek,
> another ASCII + Cyrillic, etc. If you wanted ASCII + accented Latin +
> Greek then you'd need more than 1 byte per character.
>
> If you're working with multiple alphabets it gets very messy, which is
> where Unicode comes in. It contains all those characters, and UTF-8 can
> encode all of them in a straightforward manner.
>
>> So along comes unicode, which is typically implemented in 16 or 32
>> bit cells. And it has an 8 bit encoding called utf-8 which uses one
>> byte for the first 192 characters (I think), and two bytes for some
>> more, and three bytes beyond that.
>>
> [snip]
> In UTF-8 the first 128 codepoints are encoded to 1 byte.
>
>
Thanks for the correction. As I said, I wasn't sure. I wrote a utf-8
encoder and decoder about a dozen years ago, and I remember that parts
of it use the top two bits specially. But I've checked now, and you're
right, the cutoff is 7f.
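
For reference, a small Python 3 sketch of how the top bits of each UTF-8 byte are used. The classify helper is hypothetical and only distinguishes the three broad cases; it does not validate sequences.

```python
def classify(byte):
    """Classify one UTF-8 byte by its top bits (hypothetical helper)."""
    if (byte & 0x80) == 0x00:
        return "single"        # 0xxxxxxx: a complete one-byte character (00 to 7f)
    if (byte & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: follows a lead byte
    return "lead"              # 11xxxxxx: starts a multi-byte sequence

# 'a' (1 byte), Greek alpha (2 bytes), euro sign (3 bytes)
encoded = "a\u03b1\u20ac".encode("utf-8")
kinds = [classify(b) for b in encoded]
print(kinds)
```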

DaveA

From: Νίκος on
> On 3 Aug, 21:00, Dave Angel <da...(a)ieee.org> wrote:

> A string is an object containing characters. A string literal is one of
> the ways you create such an object. When you create it that way, you
> need to make sure the compiler knows the correct encoding, by using the
> encoding: line at beginning of file.


mymessage = "καλημέρα" <==== string
mymessage = u"καλημέρα" <==== string literal?

So, a string literal is one of the encodings I use to create a string
object?

Can the encoding of a Python script file be iso-8859-7, meaning the
file contents are saved to disk as Greek ISO, while the value of this
variable, mymessage = u"καλημέρα", is saved as utf-8, or the
opposite?

That is, have the file saved as utf-8 but one variable's value in a
Greek encoding?

Encodings still give me headaches. I try to understand them as
different ways to store data on a medium.

Tell me something. What encoding should I pick for my scripts, knowing
that they contain only English and Greek characters?
iso-8859-7 or utf-8, and why?

Can I save the string, let's say "Νίκος", in different encodings and still
have it print correctly in the browser?

ascii = the standard English character set only, right?

> The web server wraps a few characters before and after your html stream,
> but it shouldn't touch the stream itself.

So the Python compiler, using the cgi module, is the one that
produces the html output, which is then immediately sent to the web
server, right?


> > For example, if I say mymessage = "καλημέρα" and then I say mymessage = u"καλημέρα", then the 1st one is a Greek-encoded variable while the
> > 2nd is a utf-8 one?
>
> No, the first is an 8 bit copy of whatever bytes your editor happened to
> save.

But since mymessage = "καλημέρα" is a string containing Greek
characters, why doesn't the editor save it as such?

It reminds me of variables and values, where if you say

a = 5 , a instantly becomes an integer variable
while
a = 'hello' , a instantly becomes a string variable


> mymessage = u"καλημέρα"
>
> creates an object that is *not* encoded.

Because it hasn't been saved by the editor yet? In what state is this
object before it gets encoded?
And does it get encoded the minute I tell the editor to save the file?

> Encoding is taking the unicode
> stream and representing it as a stream of bytes, which may or may not have
> more bytes than the original has characters.


So what this line, mymessage = u"καλημέρα", does is tell the browser
that, when it is time to save the whole file, it should save this string
as utf-8?

If yes, then if I were to save the above string in a Greek encoding, how
was I supposed to write it?

Also, if you use the 'coding line' at the beginning of the file, is there
a need for using the u literal?

> I personally haven't done any cookie code. If I were debugging this, I'd
> factor out the multiple parts of that if statement, and find out which
> one isn't true. From here I can't guess.

I did what you said and found out that both parts of the if condition
were always false; that's why the if code block never got executed.

And it is always false because the cookie never gets set.

So can you please tell me why this line

cookie['visitor'] = ( 'nikos', time() + 60*60*24*365 )  # this cookie will expire in a year

never created a cookie?
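
A sketch of how a cookie with a lifetime is usually built with the standard library, assuming cookie is a SimpleCookie (the thread does not show how it was created, so this may not match the original script). It uses Python 3's http.cookies; in the Python 2 of this thread the module is named Cookie. The value is assigned as a plain string and the lifetime is set as a separate attribute, rather than as a tuple.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["visitor"] = "nikos"                        # the value is a plain string
cookie["visitor"]["max-age"] = 60 * 60 * 24 * 365  # lifetime in seconds (one year)

# The cookie only reaches the browser if this line is printed with the other
# HTTP headers, before the blank line that ends the header section.
header_line = cookie.output()
print(header_line)
```

If the Set-Cookie line is never printed, or is printed after the blank line that ends the headers, the browser will not store anything, which would be consistent with the cookie never appearing.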
From: Benjamin Kaplan on
2010/8/3 Νίκος <nikos.the.gr33k(a)gmail.com>:
>> On 3 Aug, 21:00, Dave Angel <da...(a)ieee.org> wrote:
>
>> A string is an object containing characters. A string literal is one of
>> the ways you create such an object. When you create it that way, you
>> need to make sure the compiler knows the correct encoding, by using the
>> encoding: line at beginning of file.
>
>
> mymessage = "καλημέρα"   <==== string
> mymessage = u"καλημέρα"  <==== string literal?

Not quite. A literal is the actual string in the file, those letters
between the quotes:
"καλημέρα" <=== String literal (a literal value of the string/str type)
u"καλημέρα" <=== Unicode literal (a literal value of the Unicode
type. The bytes on the page will be converted to unicode using the
file's encoding)
mymessage <==== String (not literal, because it's a value)
>
> So, a string literal is one of the encodings I use to create a string
> object?
>
> Can the encoding of a Python script file be iso-8859-7, meaning the
> file contents are saved to disk as Greek ISO, while the value of this
> variable, mymessage = u"καλημέρα", is saved as utf-8, or the
> opposite?
>

The compiler does not see u"καλημέρα" on the page. All it sees is the
bytes ['0x75', '0x22', '0xea', '0xe1', '0xeb', '0xe7', '0xec', '0xdd',
'0xf1', '0xe1', '0x22']

Now the compiler knows that the sequence 0x75 0x22 (Stuff) 0x22 means
to create a Unicode literal. So it takes those bytes ('0xea', '0xe1',
'0xeb', '0xe7', '0xec', '0xdd', '0xf1', '0xe1') and decodes them using
the page's encoding, in your case ISO-8859-7. At this point, they don't
have an encoding. They aren't bytes as far as you are concerned, they
are code points. Internally, they're stored as either UTF-16 or UTF-32
depending on how Python was compiled, but that doesn't matter. You can
treat them as if they are characters.
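
The decoding step described above can be sketched in Python 3 as follows. The byte values are the ones listed in this message; the variable names are only illustrative.

```python
# The bytes between the quotes, as saved by an editor using ISO-8859-7.
raw = bytes([0xea, 0xe1, 0xeb, 0xe7, 0xec, 0xdd, 0xf1, 0xe1])

# Decoding turns the bytes into code points, which no longer have an encoding.
text = raw.decode("iso-8859-7")
code_points = ["U+%04X" % ord(ch) for ch in text]
print(code_points)
```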

> That is, have the file saved as utf-8 but one variable's value in a
> Greek encoding?
>

Sure you can. A unicode literal will always be decoded using the
encoding of the file. But a string is just a sequence of bytes (forget
about the characters that show up on the page for now). If you do
"\xce\xba\xce\xb1\xce\xbb\xce\xb7\xce\xbc\xce\xad\xcf\x81\xce\xb1".decode('UTF-8')
then Python will take that sequence of bytes and interpret them as
UTF-8. That will give you the same Unicode string you started out
with: u"καλημέρα"

> Encodings still give me headaches. I try to understand them as
> different ways to store data on a medium.
>
> Tell me something. What encoding should I pick for my scripts, knowing
> that they contain only English and Greek characters?
> iso-8859-7 or utf-8, and why?
>
> Can I save the string, let's say "Νίκος", in different encodings and still
> have it print correctly in the browser?
>
> ascii = the standard English character set only, right?
>

Yes.

>> The web server wraps a few characters before and after your html stream,
>> but it shouldn't touch the stream itself.
>
> So the Python compiler, using the cgi module, is the one that
> produces the html output, which is then immediately sent to the web
> server, right?
>
>
>> > For example, if I say mymessage = "καλημέρα" and then I say mymessage = u"καλημέρα", then the 1st one is a Greek-encoded variable while the
>> > 2nd is a utf-8 one?

No. They both are in whatever encoding your file is using. But the
first one will be interpreted as a sequence of bytes. The second one
will be interpreted as a sequence of characters. For a single-byte
encoding like ISO-8859-7, it doesn't make a difference to the length:
both have a length of 8. But if the file were saved as UTF-8, the first
one would have a length of 16 (because the Greek characters are all
2 bytes) and the 2nd one would still have a length of 8.

>>
>> No, the first is an 8 bit copy of whatever bytes your editor happened to
>> save.
>
> But since mymessage = "καλημέρα" is a string containing Greek
> characters, why doesn't the editor save it as such?
>

Because you don't save characters, you save bytes.

\xce\xba\xce\xb1\xce\xbb\xce\xb7\xce\xbc\xce\xad\xcf\x81\xce\xb1 is
your String in UTF-8
\xea\xe1\xeb\xe7\xec\xdd\xf1\xe1 is that exact same string in ISO-8859-7

They are two different ways of representing the same characters
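
A short Python 3 check of the two byte sequences above. The word is written with escapes so the snippet does not depend on how the file itself is saved; the variable names are only illustrative.

```python
# The Greek word from this thread, as code points (no encoding yet).
text = "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"

as_utf8 = text.encode("utf-8")        # two bytes per Greek letter
as_greek = text.encode("iso-8859-7")  # one byte per Greek letter

# 8 characters, 16 bytes in UTF-8, 8 bytes in ISO-8859-7.
print(len(text), len(as_utf8), len(as_greek))

# Decoding each byte sequence with its own encoding gives back the same text.
round_trip_ok = as_utf8.decode("utf-8") == as_greek.decode("iso-8859-7") == text
```

Decoding either sequence with the wrong encoding is what produces unreadable characters in place of the Greek text.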


> It reminds me of variables and values, where if you say
>
> a = 5 , a instantly becomes an integer variable
> while
> a = 'hello' , a instantly becomes a string variable
>
>
>> mymessage = u"καλημέρα"
>>
>> creates an object that is *not* encoded.
>
> Because it hasn't been saved by the editor yet? In what state is this
> object before it gets encoded?
> And does it get encoded the minute I tell the editor to save the file?
>
>> Encoding is taking the unicode
>> stream and representing it as a stream of bytes, which may or may not have
>> more bytes than the original has characters.
>
>
> So what this line, mymessage = u"καλημέρα", does is tell the browser
> that, when it is time to save the whole file, it should save this string
> as utf-8?
>
> If yes, then if I were to save the above string in a Greek encoding, how
> was I supposed to write it?
>
> Also, if you use the 'coding line' at the beginning of the file, is there
> a need for using the u literal?
>
>> I personally haven't done any cookie code. If I were debugging this, I'd
>> factor out the multiple parts of that if statement, and find out which
>> one isn't true. From here I can't guess.
>
> I did what you said and found out that both parts of the if condition
> were always false; that's why the if code block never got executed.
>
> And it is always false because the cookie never gets set.
>
> So can you please tell me why this line
>
> cookie['visitor'] = ( 'nikos', time() + 60*60*24*365 )  # this cookie will expire in a year
>
> never created a cookie?