From: Nik Gr on
On 3/8/2010 10:39 AM, Chris Rebert wrote:
>> Please tell me the difference between 3 things.
>>
>> a) Asking Notepad++(my editor) to save all my python scripts as UTF-8
>> without BOM.
> That affects what encoding the text file comprising the source code
> itself is in.

What does this mean in practice? Do you mean that it affects the way
the file is stored on the hard disk?

For example, is saving it from Notepad++ as 'ASCII' different from
saving it as 'UTF-8 without BOM'?

Which should I use? My script contains only Python code (English) and
Greek characters inside print statements.

>> b) Using this line '# -*- coding: utf-8 -*-'. Isn't this line supposed
>> to tell the browser that the contents of this python script are in
>> UTF-8 and to handle it as such?
> This tells Python what encoding the text file comprising the source
> code itself is in.
>
What does this mean in practice?

How does it differ from (a)?

>> c) print '''Content-Type: text/html; charset=UTF-8\n'''
> This tells the web browser what encoding the HTML you're sending it is
> in. Said HTML is output by your Python script and must match the
> encoding you specify in (c).
When a Python script runs, does it produce the HTML output itself, or is
the HTML produced only after Python's output reaches the web server?
From: Νίκος on
> On 3 Aug, 11:10, Dave Angel <da...(a)ieee.org> wrote:

> a) a text editor takes keystrokes and cut/paste info and other data, and
> produces a stream of (unicode) characters.  It then encodes each of
> those characters into one or more bytes and saves it to a file.  You have
> to tell Notepad++ how to do that encoding.  Note that unless it's saving
> a BOM, there's no clue in the file what encoding it used.

So when I select an encoding in Notepad++'s options, I am basically
telling the editor how it is supposed to store those streams of
characters on the hard disk.

Different encodings mean different ways of storing the data on the
media, correct?


> b) The python compiler has to interpret the bytes it finds (spec. within
> string literals and comments), and decode them into unicode for its own
> work.  It uses the 'coding:' comment to decide how to do this.  But once
> the file has been compiled, that comment is totally irrelevant, and ignored.

What is a "string literal"?

If I understood you right, this line of code tells Python the opposite
thing from (a): (a) told the editor how to store the data on the media,
while (b) tells the Python compiler how to retrieve that data from the
media (how to read it, that is). Right?


> c1) Your python code has to decide how to encode its information when
> writing to stdout.  There are several ways to accomplish that.

What other ways are there besides print '''Content-Type blah blah...'''?


> c2) The browser sees only what was sent to stdout, starting with the
> "Content-Type..." line.  It uses that line to decide how to decode the
> rest of the stream.  Let me reemphasize, the browser does not see any of
> the python code, or comments.

I was under the impression that the stdout of a CGI Python script was
the web server itself, since that is the application waiting for the
data to be displayed.

When a Python script runs, does it produce the HTML output at that
moment, or is the HTML produced only after Python's output reaches the
web server?


And something else, please.
My CGI Python scripts contain English and Greek letters, so that is a
reason to tell the editor to save the file to disk as UTF-8, right?

Well, I told Notepad++ to save it as ASCII and also removed the '# -*-
coding: utf-8 -*-' line.

and only used print '''Content-Type: text/html; charset=UTF-8\n'''

So how did the editor manage to save the file as ASCII, although my file
contains characters beyond the usual 7-bit ASCII set?

And how could the Python compiler read and execute them?

I should have saved it as UTF-8 and kept the coding line inside the
script so the compiler knew to open it as UTF-8. How come it worked as
ASCII, both in storing and in retrieving?



From: Dave Angel on
Νίκος wrote:
>> On 3 Aug, 11:10, Dave Angel <da...(a)ieee.org> wrote:
>>
>
>
>> a) a text editor takes keystrokes and cut/paste info and other data, and
>> produces a stream of (unicode) characters. It then encodes each of
>> those characters into one or more bytes and saves it to a file. You have
>> to tell Notepad++ how to do that encoding. Note that unless it's saving
>> a BOM, there's no clue in the file what encoding it used.
>>
>
> So when I select an encoding in Notepad++'s options, I am basically
> telling the editor how it is supposed to store those streams of
> characters on the hard disk.
>
> Different encodings mean different ways of storing the data on the
> media, correct?
>
>
Exactly. The file is a stream of bytes, and Unicode has more than 256
possible characters. Further, even the subset of characters that *do*
take one byte are different for different encodings. So you need to tell
the editor what encoding you want to use.
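
To make that concrete, here is a minimal sketch of the same single byte meaning two different characters depending on the encoding used to read it. It is written for Python 3 so it can be run today (the thread itself is about 2.x), and the byte value 0xE1 is just an illustrative pick.

```python
# One byte on disk, two possible meanings, depending on the decoder.
raw = b"\xe1"

as_latin1 = raw.decode("iso-8859-1")   # LATIN SMALL LETTER A WITH ACUTE
as_greek = raw.decode("iso-8859-7")    # GREEK SMALL LETTER ALPHA

# The same Greek letter needs two bytes once it is stored as UTF-8.
greek_utf8 = as_greek.encode("utf-8")

# ascii() keeps the output printable on any console.
print(ascii(as_latin1), ascii(as_greek), greek_utf8)
```

Nothing in the byte itself says which reading is right, which is why the editor, Python, and the browser each have to be told.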
>
>> b) The python compiler has to interpret the bytes it finds (spec. within
>> string literals and comments), and decode them into unicode for its own
>> work. It uses the 'coding:' comment to decide how to do this. But once
>> the file has been compiled, that comment is totally irrelevant, and ignored.
>>
>
> What is a "string literal"?
>
>
In Python, a string literal is text enclosed in single quotes, double
quotes, or triple quotes.
myvar = u"tell me more"
myvar = u'hello world'
The u prefix in Python 2.x makes the literal a unicode object; it's not
needed in 3.x, and I forget which one you're using.

These literals are affected by the coding comment, but
myvar = myfile.readline()
is not.
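
A small sketch of that last point, in Python 3 for runnability: data read from a file arrives as bytes and has to be decoded explicitly, whatever the coding comment says. The io.BytesIO object is only a stand-in for a real file opened in binary mode, and iso-8859-7 is an assumed encoding for the file's contents.

```python
import io

# Pretend this is a file on disk that was saved as ISO-8859-7 (assumption).
greek_bytes = "καλημέρα\n".encode("iso-8859-7")
myfile = io.BytesIO(greek_bytes)

line = myfile.readline()            # raw bytes; the coding comment plays no part
text = line.decode("iso-8859-7")    # the program must name the encoding itself

print(len(line), len(text))
```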

> If I understood you right, this line of code tells Python the opposite
> thing from (a): (a) told the editor how to store the data on the media,
> while (b) tells the Python compiler how to retrieve that data from the
> media (how to read it, that is). Right?
>
>
>
>> c1) Your python code has to decide how to encode its information when
>> writing to stdout. There are several ways to accomplish that.
>>
>
> What other ways are there besides print '''Content-Type blah blah...'''?
>
>
>
You can use the write() method of sys.stdout, or various equivalents,
such as the file object returned by io.open(). You can probably also use
os.fdopen(1, "w")

But probably the easiest is to do something like:
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

and then print to stdout will use the utf8 encoding for its output.
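
Here is a rough sketch of what that wrapper does, written for Python 3 so it can be run today. An in-memory io.BytesIO stands in for stdout so the resulting bytes can be inspected; in a real Python 3 CGI script the stream to wrap would be sys.stdout.buffer rather than sys.stdout.

```python
import codecs
import io

# Stand-in for the byte stream behind stdout (assumption for the demo).
raw_stdout = io.BytesIO()

# The writer encodes every unicode string to UTF-8 before passing it on.
out = codecs.getwriter("utf-8")(raw_stdout)
out.write("Content-Type: text/html; charset=UTF-8\r\n\r\n")
out.write("<p>καλημέρα</p>\n")

sent = raw_stdout.getvalue()
print(len(sent), "bytes written")
```

The charset named in the header and the codec passed to getwriter() have to agree, otherwise the browser decodes the body with the wrong table.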
>> c2) The browser sees only what was sent to stdout, starting with the
>> "Content-Type..." line. It uses that line to decide how to decode the
>> rest of the stream. Let me reemphasize, the browser does not see any of
>> the python code, or comments.
>>
>
> I was under the impression that the stdout of a CGI Python script was
> the web server itself, since that is the application waiting for the
> data to be displayed.
>
> When a Python script runs, does it produce the HTML output at that
> moment, or is the HTML produced only after Python's output reaches the
> web server?
>
>
>
I don't understand your wording. Certainly the server launches the
python script, and captures stdout. It then sends that stream of bytes
out over tcp/ip to the waiting browser. You ask when it becomes html?
I don't think the question has meaning: the bytes your script writes to
stdout already are the html.
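
As an illustration only, the sketch below plays the role of the web server: it launches a tiny script as a child process and captures its stdout, which is roughly what a CGI-capable server does. It is Python 3, the embedded script and the helper name run_like_a_server are made up for the demo, and a real server would also add its own status line and headers.

```python
import subprocess
import sys

# A made-up CGI script; it writes the header, a blank line, then the HTML.
CGI_SCRIPT = r'''
import sys
body = u"<html><body><p>\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1</p></body></html>\n"
sys.stdout.buffer.write(b"Content-Type: text/html; charset=UTF-8\r\n\r\n")
sys.stdout.buffer.write(body.encode("utf-8"))
'''

def run_like_a_server(script_source):
    """Run the script in a child interpreter and return the bytes it printed."""
    result = subprocess.run(
        [sys.executable, "-c", script_source],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout

captured = run_like_a_server(CGI_SCRIPT)
headers, _, html = captured.partition(b"\r\n\r\n")
print(headers.decode("ascii"))
```

The HTML is already complete in what the script printed; the server only relays it.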

> And something else, please.
> My CGI Python scripts contain English and Greek letters, so that is a
> reason to tell the editor to save the file to disk as UTF-8, right?
>
> Well, I told Notepad++ to save it as ASCII and also removed the '# -*-
> coding: utf-8 -*-' line.
>
> and only used print '''Content-Type: text/html; charset=UTF-8\n'''
>
> So how did the editor manage to save the file as ASCII, although my file
> contains characters beyond the usual 7-bit ASCII set?
>
>
I don't know Notepad++, so I don't know how it handles a character
outside the legal ASCII range. So I'd only be guessing. But I'm guessing
it ignored the ASCII restriction, and just wrote the bottom 8 bits of
each character. That'll work for some of the non-ASCII characters.
> And how could the Python compiler read and execute them?
>
>
I'd only be speculating, since I've seen only a few lines of your
source. Perhaps you're using Python 2.x, and not specifying u"" for
those literals, which is unreasonable, but does tend to work for *some*
of the second 128 characters.
> I should have saved it as UTF-8 and kept the coding line inside the
> script so the compiler knew to open it as UTF-8. How come it worked as
> ASCII, both in storing and in retrieving?
>
>
>

Since you have the setup that shows this effect, why not take a look at
the file, and see whether there are any non-ASCII characters (codes
above hex 7f) in it? And whether there's a BOM. Then you can examine
the unicode characters produced by changing your source code.
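
One way to take that look, sketched in Python 3: scan the raw bytes for a UTF-8 BOM and for anything above hex 7f. The helper name inspect_bytes and the one-line sample source are made up for the demo; on a real file the bytes would come from open(path, "rb").read().

```python
import codecs

def inspect_bytes(data):
    """Return (has_utf8_bom, [(offset, hex_value), ...]) for bytes above 0x7f."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    non_ascii = [(offset, hex(byte)) for offset, byte in enumerate(body) if byte > 0x7F]
    return has_bom, non_ascii

sample = "print 'α'"
as_utf8 = sample.encode("utf-8")
as_greek = sample.encode("iso-8859-7")
with_bom = codecs.BOM_UTF8 + as_utf8

for label, data in (("utf-8", as_utf8), ("iso-8859-7", as_greek), ("utf-8 + BOM", with_bom)):
    print(label, inspect_bytes(data))
```

Two high bytes per Greek letter suggests UTF-8; one suggests a single-byte code page.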

The more I think about it, the more I suspect your confusion comes
because maybe you're not using the u-prefix on your literals. That can
lead to some very subtle bugs, and code that works for a while, then
fails in inexplicable ways.

DaveA

From: Νίκος on
>On 3 Aug, 18:41, Dave Angel <da...(a)ieee.org> wrote:
> > Different encodings equal different ways of storing the data to the
> > media, correct?
>
> Exactly. The file is a stream of bytes, and Unicode has more than 256
> possible characters. Further, even the subset of characters that *do*
> take one byte are different for different encodings. So you need to tell
> the editor what encoding you want to use.

Is an 'a' char in iso-8859-1, for example, stored differently from an
'a' char in iso-8859-7 or an 'a' char in utf-8?


> > What is a "String Literal" ?
>
> In python, a string literal is enclosed by single quotes, double quotes,
> or triples.
> myvar = u"tell me more"
> myvar = u'hello world'
> The u prefix is used in python 2.x to convert to Unicode; it's not
> needed in 3.x and I forget which one you're using.

I use Python 2.4 and have never used the u prefix.

I still don't understand the difference between a 'string' and a
'string literal'.

If I save a file as iso-8859-1 but use Greek characters in some of my
variables, then instead of telling the browser to change encoding and
saving the file as utf-8, can I just use the u prefix as in your
examples to save the variables as iso-8859-1?

> I don't understand your wording. Certainly the server launches the
> python script, and captures stdout. It then sends that stream of bytes
> out over tcp/ip to the waiting browser. You ask when does it become html
> ? I don't think the question has meaning.

The HTTP client sends a request to the HTTP server (Apache), Apache calls
the Python interpreter, and Python calls MySQL to handle SQL queries,
right?

My question is: what is the difference between the Python script's
output and the web server's output to the HTTP client?

Who produces the HTML code: the Python script, or the Apache web server
after it receives Python's output?


> The more I think about it, the more I suspect your confusion comes
> because maybe you're not using the u-prefix on your literals. That can
> lead to some very subtle bugs, and code that works for a while, then
> fails in inexplicable ways.

I'm not sure what exactly they do just yet.

For example, if I say mymessage = "καλημέρα" and then I say mymessage =
u"καλημέρα", is the first one a Greek-encoded variable while the second
is a utf-8 one?

So one script can be in some encoding, and some parts of the script,
like the second variable, can be in another?

==============================
Also, can you please help me with my cookie problem: why does only the
else block execute each time, and never the if?

here is the code:

[code]
if os.environ.get('HTTP_COOKIE') and cookie.has_key('visitor') == 'nikos':  # if visitor cookie exists
    print "ΑΠΟ ΤΗΝ ΕΠΟΜΕΝΗ ΕΠΙΣΚΕΨΗ ΣΟΥ ΘΑ ΣΕ ΥΠΟΛΟΓΙΖΩ ΩΣ ΕΠΙΣΚΕΠΤΗ ΑΥΞΑΝΟΝΤΑΣ ΤΟΝ ΜΕΤΡΗΤΗ!"
    cookie['visitor'] = ( 'nikos', time() - 1 )  # this cookie will expire now
else:
    print "ΑΠΟ ΔΩ ΚΑΙ ΣΤΟ ΕΞΗΣ ΔΕΝ ΣΕ ΕΙΔΑ, ΔΕΝ ΣΕ ΞΕΡΩ, ΔΕΝ ΣΕ ΑΚΟΥΣΑ! ΘΑ ΕΙΣΑΙ ΠΛΕΟΝ Ο ΑΟΡΑΤΟΣ ΕΠΙΣΚΕΠΤΗΣ!!"
    cookie['visitor'] = ( 'nikos', time() + 60*60*24*365 )  # this cookie will expire in a year
[/code]

How do I check whether the cookie is set, and why does it never get
unset once it is set?
From: Dave Angel on
Νίκος wrote:
>> On 3 Aug, 18:41, Dave Angel <da...(a)ieee.org> wrote:
>>
>>> Different encodings equal different ways of storing the data to the
>>> media, correct?
>>>
>> Exactly. The file is a stream of bytes, and Unicode has more than 256
>> possible characters. Further, even the subset of characters that *do*
>> take one byte are different for different encodings. So you need to tell
>> the editor what encoding you want to use.
>>
>
> Is an 'a' char in iso-8859-1, for example, stored differently from an
> 'a' char in iso-8859-7 or an 'a' char in utf-8?
>
>
>
Nope, the ASCII subset is identical. It's the codes between hex 80 and ff
that differ, and of course not all of those. Further, every character
above 7f, which takes one byte in 8859, takes two or more bytes in utf-8.
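
A quick check of this, as a Python 3 sketch: 'a' comes out as the same single byte in all three encodings, while the Greek alpha differs, and cannot be stored in iso-8859-1 at all.

```python
ENCODINGS = ("iso-8859-1", "iso-8859-7", "utf-8")

def encode_or_none(char, encoding):
    """Return the encoded bytes, or None if the encoding has no such character."""
    try:
        return char.encode(encoding)
    except UnicodeEncodeError:
        return None

ascii_bytes = {name: encode_or_none("a", name) for name in ENCODINGS}
alpha_bytes = {name: encode_or_none("α", name) for name in ENCODINGS}

for name in ENCODINGS:
    print(name, ascii_bytes[name], alpha_bytes[name])
```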

You *could* just decide that you're going to hardwire the assumption
that you'll be dealing with a single character set that does fit in 8
bits, and most of this complexity goes away. But if you do that, do
*NOT* use utf-8.

But if you do want to be able to handle more than 256 characters, or
more than one encoding, read on.

Many people confuse encoding and decoding. A unicode character is an
abstraction which represents a raw character. For convenience, the first
128 code points map directly onto the 7 bit encoding called ASCII. But
before Unicode there were several other extensions to 256, which were
incompatible with each other. For example, a byte which might be a
European character in one such encoding might be a katakana character
in another one. Each encoding was 8 bits, but it was difficult for a
single program to handle more than one such encoding.

So along comes unicode, which is typically implemented in 16 or 32 bit
cells. And it has an 8 bit encoding called utf-8 which uses one byte for
the first 128 characters (the ASCII range), two bytes for the next
block (which includes the basic Greek alphabet), and three or four bytes
beyond that.
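
The utf-8 byte counts can be checked directly. A short Python 3 sketch, with the sample code points chosen arbitrarily on either side of each length boundary:

```python
# (code point, description) pairs picked around the UTF-8 length boundaries.
SAMPLES = [
    (0x61, "ASCII letter a"),
    (0x7F, "last one-byte code point"),
    (0x80, "first two-byte code point"),
    (0x3B1, "Greek small alpha"),
    (0x800, "first three-byte code point"),
    (0x20AC, "euro sign"),
    (0x10000, "first four-byte code point"),
]

utf8_lengths = [len(chr(code_point).encode("utf-8")) for code_point, _ in SAMPLES]

for (code_point, description), length in zip(SAMPLES, utf8_lengths):
    print("U+%04X %-28s %d byte(s)" % (code_point, description, length))
```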

You encode unicode to utf-8, or to 8859, or to ...
You decode utf-8 or 8859, or cp1252 , or ... to unicode
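
The direction of each operation, as a minimal Python 3 sketch (there, str is the unicode type and bytes is the encoded form): encode goes from unicode to bytes, decode goes from bytes back to unicode, and the round trip only works when both sides use the same encoding.

```python
text = "καλημέρα"                       # unicode, not tied to any encoding

as_utf8 = text.encode("utf-8")          # unicode -> bytes
as_greek = text.encode("iso-8859-7")    # unicode -> different bytes

round_trip = as_utf8.decode("utf-8")    # bytes -> unicode, same codec both ways

# Decoding with the wrong codec does not fail here, it just yields other text.
garbled = as_utf8.decode("iso-8859-7")

print(len(as_utf8), len(as_greek), round_trip == text, garbled == text)
```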

>>> What is a "String Literal" ?
>>>
>> In python, a string literal is enclosed by single quotes, double quotes,
>> or triples.
>> myvar = u"tell me more"
>> myvar = u'hello world'
>> The u prefix is used in python 2.x to convert to Unicode; it's not
>> needed in 3.x and I forget which one you're using.
>>
>
> I use Python 2.4 and never used the u prefix.
>
>
Then you'd better hope you never manipulate those literals. For example,
a byte string holding utf8 is counted and sliced in bytes, not
characters, so len() is wrong for Greek text and a slice can cut a
character in half. Mixing it with a unicode object raises
UnicodeDecodeError.
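
One concrete way that manipulating undecoded literals goes wrong, sketched in Python 3, where the bytes type behaves like a Python 2 literal without the u prefix: lengths and slices work on bytes, so a slice can land in the middle of a character.

```python
text = "καλημέρα"              # 8 characters
raw = text.encode("utf-8")     # what an undecoded utf-8 literal holds

char_count = len(text)
byte_count = len(raw)

# Slicing by bytes can split a two-byte character down the middle.
try:
    raw[:3].decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

print(char_count, byte_count, slice_is_valid)
```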
> I still don't understand the difference between a 'string' and a
> 'string literal'.
>
>
A string is an object containing characters. A string literal is one of
the ways you create such an object. When you create it that way, you
need to make sure the compiler knows the correct encoding, by using the
coding: line at the beginning of the file.
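
To see the coding line at work, here is a small Python 3 sketch that compiles source held as raw bytes, the way the compiler meets it on disk. The three tiny sources are made up for the demo; the first two name the encoding their bytes were really saved in, the third names the wrong one.

```python
def run_source(source_bytes):
    """Compile and run source bytes, returning the value bound to 'message'."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace["message"]

# The same literal, saved by the "editor" in two different encodings.
saved_as_utf8 = b"# -*- coding: utf-8 -*-\nmessage = u'\xce\xb1'\n"
saved_as_greek = b"# -*- coding: iso-8859-7 -*-\nmessage = u'\xe1'\n"

# UTF-8 bytes, but the coding line claims ISO-8859-7.
mislabeled = b"# -*- coding: iso-8859-7 -*-\nmessage = u'\xce\xb1'\n"

from_utf8 = run_source(saved_as_utf8)
from_greek = run_source(saved_as_greek)
from_mislabeled = run_source(mislabeled)

print(ascii(from_utf8), ascii(from_greek), ascii(from_mislabeled))
```

When the coding line matches what the editor wrote, both files produce the same one-character string; when it does not, the literal silently becomes different text.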
> If I save a file as iso-8859-1 but use Greek characters in some of my
> variables, then instead of telling the browser to change encoding and
> saving the file as utf-8, can I just use the u prefix as in your
> examples to save the variables as iso-8859-1?
>
>
>> I don't understand your wording. Certainly the server launches the
>> python script, and captures stdout. It then sends that stream of bytes
>> out over tcp/ip to the waiting browser. You ask when does it become html
>> ? I don't think the question has meaning.
>>
>
> The HTTP client sends a request to the HTTP server (Apache), Apache calls
> the Python interpreter, and Python calls MySQL to handle SQL queries,
> right?
>
> My question is: what is the difference between the Python script's
> output and the web server's output to the HTTP client?
>
>
The web server wraps some protocol data (the HTTP status line and its
own headers) around your html stream, but it shouldn't touch the stream
itself.

> Who produces the HTML code: the Python script, or the Apache web server
> after it receives Python's output?
>
>
>
see above.
>> The more I think about it, the more I suspect your confusion comes
>> because maybe you're not using the u-prefix on your literals. That can
>> lead to some very subtle bugs, and code that works for a while, then
>> fails in inexplicable ways.
>>
>
> I'm not sure what exactly they do just yet.
>
> For example, if I say mymessage = "καλημέρα" and then I say mymessage =
> u"καλημέρα", is the first one a Greek-encoded variable while the second
> is a utf-8 one?
>
>
No, the first is an 8 bit copy of whatever bytes your editor happened to
save, so it is utf-8 only if the file was saved as utf-8. The second is
unicode, which may be either 16 or 32 bits per character, depending on
how Python was built for your platform. It is not utf-8.
> So one script can be in some encoding, and some parts of the script,
> like the second variable, can be in another?
>
>
mymessage = u"καλημέρα"

creates an object that is *not* encoded. Encoding is taking the unicode
stream and representing it as a stream of bytes, which may or may not
have more bytes than the original has characters.

> ============================
> Also, can you please help me with my cookie problem: why does only the
> else block execute each time, and never the if?
>
> here is the code:
>
> [code]
> if os.environ.get('HTTP_COOKIE') and cookie.has_key('visitor') == 'nikos':  # if visitor cookie exists
>     print "ΑΠΟ ΤΗΝ ΕΠΟΜΕΝΗ ΕΠΙΣΚΕΨΗ ΣΟΥ ΘΑ ΣΕ ΥΠΟΛΟΓΙΖΩ ΩΣ ΕΠΙΣΚΕΠΤΗ ΑΥΞΑΝΟΝΤΑΣ ΤΟΝ ΜΕΤΡΗΤΗ!"
>     cookie['visitor'] = ( 'nikos', time() - 1 )  # this cookie will expire now
> else:
>     print "ΑΠΟ ΔΩ ΚΑΙ ΣΤΟ ΕΞΗΣ ΔΕΝ ΣΕ ΕΙΔΑ, ΔΕΝ ΣΕ ΞΕΡΩ, ΔΕΝ ΣΕ ΑΚΟΥΣΑ! ΘΑ ΕΙΣΑΙ ΠΛΕΟΝ Ο ΑΟΡΑΤΟΣ ΕΠΙΣΚΕΠΤΗΣ!!"
>     cookie['visitor'] = ( 'nikos', time() + 60*60*24*365 )  # this cookie will expire in a year
> [/code]
>
> How do I check whether the cookie is set, and why does it never get
> unset once it is set?
>
>
I personally haven't done any cookie code. If I were debugging this, I'd
factor out the multiple parts of that if statement, and find out which
one isn't true. From here I can't guess.
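
A sketch of that factoring-out, in Python 3 with http.cookies (the module was called Cookie in 2.x). The helper name check_visitor is made up, and the sample header strings are assumptions rather than anything taken from the real site. One thing it may bring out: has_key() returns True or False, and a boolean compared with == 'nikos' is never equal, which would send every request to the else branch.

```python
from http.cookies import SimpleCookie

def check_visitor(http_cookie_header):
    """Evaluate each part of the if-condition separately (hypothetical helper)."""
    cookie = SimpleCookie()
    if http_cookie_header:
        cookie.load(http_cookie_header)
    header_present = bool(http_cookie_header)
    key_present = "visitor" in cookie   # same yes/no answer has_key() gives in 2.x
    value_is_nikos = key_present and cookie["visitor"].value == "nikos"
    return header_present, key_present, value_is_nikos

# A boolean never equals the string 'nikos', so this is always False.
always_false = (True == "nikos")

status = check_visitor("visitor=nikos")
print(status, always_false)

# Setting the cookie: the value is a plain string, and the lifetime goes
# into a separate attribute instead of a (value, time) tuple.
outgoing = SimpleCookie()
outgoing["visitor"] = "nikos"
outgoing["visitor"]["max-age"] = 60 * 60 * 24 * 365
set_cookie_line = outgoing.output()
print(set_cookie_line)
```

The Set-Cookie line has to be printed together with the other headers, before the blank line that ends them.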

DaveA