public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] python-2.3.2 testing required
@ 2003-11-12 18:46 Alastair Tse
  2003-11-13  8:07 ` Nick Jones
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Alastair Tse @ 2003-11-12 18:46 UTC
  To: gentoo-dev

Hi All,

I need some more testing for python-2.3.2 before it is released into
~x86. This is a call to Gentoo developers and people interested in
helping to test. 

If you test, make sure you know what you are doing. Portage requires
python. It is best to have a binary package of your current portage and
python, just in case something majorly bad happens.

I would greatly appreciate any feedback on bugs and niggles with the
upgrade process. Report bugs on bugs.gentoo.org please. I'll list some
of the changes, points of interest and errata.

How to Upgrade:
===============
1. Unmask _both_ =dev-lang/python-2.3* and >=sys-apps/portage-2.0.49-r16
in package.mask.
2. run: emerge -u portage python 
(note that -r16 should work with both 2.2 and 2.3)
3. run: /usr/portage/dev-lang/python/python-updater

Major Changes:
==============
Unicode support has a choice of UCS2 or UCS4. 
---------------------------------------------

UCS4 uses significantly more memory than UCS2. For those who are not
familiar with unicode, UCS2 means representing unicode characters with
16 bit words and UCS4 means representing unicode characters in 32 bit
words. Note that both Redhat (>9) and Debian (unstable) have python-2.3
compiled with UCS4 by default.
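
For testers who want to confirm which variant an interpreter was built
with, sys.maxunicode gives it away. The following is only a minimal
sketch: the helper name unicode_build is made up for illustration, and
interpreters from Python 3.3 onwards no longer have separate UCS2/UCS4
builds, so they always report the full range.

```python
import sys


def unicode_build(maxunicode=sys.maxunicode):
    """Classify an interpreter by its largest supported code point.

    unicode_build is a hypothetical helper name, not part of Python.
    """
    # A UCS2 ("narrow") build tops out at U+FFFF; anything larger means
    # the interpreter can hold code points beyond the 16 bit range.
    if maxunicode <= 0xFFFF:
        return "UCS2"
    return "UCS4"


if __name__ == "__main__":
    print(unicode_build())
```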

Currently, you get UCS4 if you have "cjk" in your USE flags because the
only languages to use the extra planes in UCS4 are CJK languages. I've been
using UCS4 for a while now and it works without any problems. 

The reason why I'm not making this default is because UCS4 python uses
more memory. An example is supybot (Python IRC bot) that uses 8M for
UCS2 and 13M for UCS4. But note that this example is not scientific
because the machines were different in kernel version, compiler and
compiler optimisations.

I'm willing to listen to any ideas about whether to support UCS4 by
default or not.

The only problem that may occur is if you change between UCS2 and UCS4,
you will need to recompile any external python codec packages like
japanesecodec, koreancodec, cjkcodecs, iconvcodec.

For those who want to know more about UCS and Unicode, you may be
interested in articles such as:
http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/characters
http://www.python.org/cgi-bin/faqw.py?req=show&file=faq04.107.htp
http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html#s-unicodes-ccs

Handling *.pyc *.pyo
---------------------

The only other major change in the handling of python is with
byte-compiled modules. With previous versions of python, portage had a
problem tracking *.pyc and *.pyo files. The reason is that these files
were registered with portage at install time, but may be modified at
runtime if either the timestamp on the python executable has changed or
the version has been upgraded.

This traditionally created two problems. Firstly, *.pyc and *.pyo files
were left behind on unmerge because their timestamps and md5sums no longer
matched the values recorded when they were installed[1]. The second
problem is that sometimes python gets run inside sandbox and suddenly
requests certain .pyc and .pyo files to be re-compiled, creating sandbox
violations. You will notice special workarounds in libsandbox to avoid
this problem.

The solution I am now employing is to not let portage track the .pyc
and .pyo files, and to generate them in pkg_postinst() instead. On package
removal, pkg_postrm() removes orphaned *.pyc and *.pyo files (orphaned
means a .pyc or .pyo that has no corresponding .py). The other problem,
python recompiling modules at runtime, is now solved by setting an
environment variable (PY_DONTCOMPILE) during merging that prevents python
from attempting to recompile modules.
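
To illustrate what "orphaned" means here, a rough sketch of the check
follows. This is not the code in python.eclass; the function name
find_orphaned_bytecode is hypothetical, and it assumes the layout where
byte-compiled files sit next to their .py source.

```python
import os


def find_orphaned_bytecode(root):
    """Return .pyc/.pyo paths under root that have no matching .py file."""
    orphans = []
    for dirpath, _dirnames, filenames in os.walk(root):
        present = set(filenames)
        for name in filenames:
            base, ext = os.path.splitext(name)
            # foo.pyc is orphaned when foo.py is absent from the same directory.
            if ext in (".pyc", ".pyo") and base + ".py" not in present:
                orphans.append(os.path.join(dirpath, name))
    return sorted(orphans)
```

A cleanup step would then remove each returned path; the sketch only
reports them so it can be run safely on a live system.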

This is now implemented in python.eclass and distutils.eclass. If you
maintain anything that creates python modules, you should either read
the eclass to find out what needs to change for python-2.3 or email me
the packages that need to be looked at. Most packages that use distutils
should work automatically with this new scheme.

[1] .. http://bugs.gentoo.org/show_bug.cgi?id=8804

Python Updater
--------------
A necessary evil for the upgrade is to recompile all the modules that
depend on Python. I use a very simple rule to determine that. The rule
is if a package has a file in /usr/lib/python2.2, then it will need to
be re-emerged.
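
The rule can be sketched as follows. This is only an illustration, not
the python-updater script itself: packages_to_rebuild is a made-up name,
and the mapping from package to installed files is assumed to have been
collected already (reading it from the installed-package database is
left out).

```python
def packages_to_rebuild(installed_files, old_prefix="/usr/lib/python2.2"):
    """Pick the packages that own at least one file under old_prefix.

    installed_files maps a package name to the list of paths it installed.
    """
    # Append a slash so that e.g. /usr/lib/python2.20 does not match.
    prefix = old_prefix.rstrip("/") + "/"
    return sorted(
        pkg
        for pkg, paths in installed_files.items()
        if any(path.startswith(prefix) for path in paths)
    )


if __name__ == "__main__":
    # Illustrative data only.
    example = {
        "dev-python/pygtk": ["/usr/lib/python2.2/site-packages/gtk.py"],
        "app-editors/vim": ["/usr/bin/vim"],
    }
    print(packages_to_rebuild(example))
```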

There is some special magic that does that in
/usr/portage/dev-lang/python/files/python-updater which is able to get
all the packages, sort the dependencies to avoid merging problems and
output a log file to /tmp/python-updater.log.

I'd appreciate as much feedback on this as possible. Especially on
whether I should run it in pkg_postinst() to make sure people who don't
read the trailing einfo messages actually upgrade their python modules. It
might be technically unsound for me to run it in pkg_postinst() (nested
emerges sound like a bad idea).

Other Python Improvements
--------------------------
* New version, supposedly faster by 30%.
* bsddb support for 4.x
* Supports generators (very spiffy for implementing weightless threads)
* A new idle (python gui console)

Known Problems:
===============
* I haven't tried this on an install from scratch. Maybe someone can help
me to create a stage with only 2.3 and see how things work.

* Even though python-2.3.2 requires portage-2.0.49-r16, I can't put a
dep in otherwise it may create a circular dependency. Probably the best
solution would be to have a pkg_setup check before it goes ahead.

* On some machines, portageq doesn't work after python upgrade. Make
sure you are using the latest python version.
(http://bugs.gentoo.org/show_bug.cgi?id=32374)
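
The pkg_setup check mentioned in the second item boils down to a version
comparison. A minimal sketch of that comparison, assuming versions of the
simple form "2.0.49-r16" only (suffixes such as _pre or _rc are not
handled) and using hypothetical helper names:

```python
import re


def parse_version(version):
    """Split '2.0.49-r16' into ((2, 0, 49), 16); a missing -rN counts as 0."""
    match = re.match(r"^(\d+(?:\.\d+)*)(?:-r(\d+))?$", version)
    if match is None:
        raise ValueError("unsupported version string: %r" % (version,))
    numbers = tuple(int(part) for part in match.group(1).split("."))
    revision = int(match.group(2) or 0)
    return numbers, revision


def portage_is_new_enough(installed, required="2.0.49-r16"):
    """True if the installed portage version is at least the required one."""
    return parse_version(installed) >= parse_version(required)
```

An ebuild would do this in bash rather than Python; the sketch only shows
the ordering the check has to implement.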

Plan for Python's Future
========================
* I'm aiming to push Python-2.3.2 to ~x86 in the next week or so. There
are most likely packages that don't work under 2.3 that need to be
upgraded. If you find one, please report it.

* Maybe we should think about using a non version specific directory for
Python modules to ease upgrade.

That's all folks!

Thanks,
-- 
Alastair 'liquidx' Tse
 >> Gentoo Developer
 >> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/


* Re: [gentoo-dev] python-2.3.2 testing required
From: Nick Jones @ 2003-11-13  8:07 UTC
  To: Alastair Tse; +Cc: gentoo-dev

> * Even though python-2.3.2 requires portage-2.0.49-r16, I can't put a
> dep in otherwise it may create a circular dependency. Probably the best
> solution would be to have a pkg_setup check before it goes ahead.

That might be acceptable. python and portage will always be present
on a Gentoo system. It's a little weird, but may benefit everyone in
this situation. I'll get around to testing this soon. I'd prefer
that portage be ~arch before you make python ~arch, as it'll cause
headaches otherwise. The only reason I haven't done so yet is that it
can break the GUIs for portage. I'll be notifying the authors to be
sure that they are up to date, as soon as I have internet access
again. It's down as I'm writing this email.

> * Maybe we should think about using a non version specific directory for
> Python modules to ease upgrade.

I agree.

--NJ

--
gentoo-dev@gentoo.org mailing list


* Re: [gentoo-dev] python-2.3.2 testing required
From: Alastair Tse @ 2003-11-13  8:07 UTC
  To: gentoo-dev

On Wed, 2003-11-12 at 18:46, Alastair Tse wrote:

> 3. run: /usr/portage/dev-lang/python/python-updater
> 

Oops, that should be:

/usr/portage/dev-lang/python/files/python-updater

Thanks to g2boojum for pointing that out.

Cheers,
-- 
Alastair 'liquidx' Tse
 >> Gentoo Developer
 >> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/


* Re: [gentoo-dev] python-2.3.2 testing required
From: Paul de Vrieze @ 2003-11-13  9:05 UTC
  To: gentoo-dev

On Wednesday 12 November 2003 19:46, Alastair Tse wrote:
> UCS4 uses significantly more memory than UCS2. For those who are not
> familiar with unicode, UCS2 means representing unicode characters with
> 16 bit words and UCS4 means representing unicode characters in 32 bit
> words. Note that both Redhat (>9) and Debian (unstable) have python-2.3
> compiled with UCS4 by default.
>
> Currently, you get UCS4 if you have "cjk" in your USE flags because
> the only languages to use the extra planes in UCS4 are CJK languages.
> I've been using UCS4 for a while now and it works without any
> problems.
>

Isn't it true that it is possible to encode the few 4-byte characters
as a sequence of 2-byte units? I think that is more than enough for
most cases (who needs to read/write cjk anyway ;-) )

Paul

-- 
Paul de Vrieze
Gentoo Developer
Mail: pauldv@gentoo.org
Homepage: http://www.devrieze.net


* Re: [gentoo-dev] python-2.3.2 testing required
From: Toby Dickenson @ 2003-11-13  9:10 UTC
  To: Alastair Tse; +Cc: gentoo-dev

On Wednesday 12 November 2003 18:46, Alastair Tse wrote:

That all looks good. I'm keen to give it a spin...

> The reason why I'm not making this default is because UCS4 python uses
> more memory. An example is supybot (Python IRC bot) that uses 8M for
> UCS2 and 13M for UCS4. 

I've not used ucs4 python yet, but it is one of the things I was looking
forward to in version 2.3. It would be much nicer to leave ucs2 behind.

If ucs4 strings were the only cause of that difference, supybot would need to
be storing 2.5 million unicode characters. I guess that isn't likely.
Excluding bugs, I don't see any reason why a program that doesn't use any
unicode objects would use more memory when running on a ucs4 python
interpreter.
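
The 2.5 million figure follows from the 5M gap and the 2 extra bytes
that UCS4 spends on each character. A quick sketch of that arithmetic,
taking 1M as 10^6 bytes (an assumption; the helper name is made up):

```python
def implied_unicode_chars(ucs2_mb, ucs4_mb, extra_bytes_per_char=2):
    """Number of characters needed to explain the gap by string storage alone."""
    extra_bytes = (ucs4_mb - ucs2_mb) * 1000 * 1000
    return extra_bytes // extra_bytes_per_char


print(implied_unicode_chars(8, 13))  # 2500000
```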

> But note that this example is not scientific
> because the machines were different in kernel version, compiler and
> compiler optimisations.

Those reasons sound much more plausible to me. Does anyone have a more
scientific comparison of the effect of the ucs4 option on python?


-- 
Toby Dickenson


* Re: [gentoo-dev] python-2.3.2 testing required
From: Alastair Tse @ 2003-11-13  9:38 UTC
  To: gentoo-dev

On Thu, 2003-11-13 at 09:05, Paul de Vrieze wrote:

> Isn't it true that it is possible to encode the few 4-byte characters
> as a sequence of 2-byte units? I think that is more than enough for
> most cases (who needs to read/write cjk anyway ;-) )

According to my understanding of UCS, it doesn't seem to be the case. I
believe that UCS is the internal representation of a unicode character,
whereas UTF is the encoding of the character into octets for
representation on a computer. 

As an example, for a UCS4 character on one of the higher planes (the ones
where extra CJK characters are placed), UTF-8 would require 4 octets
(the original UTF-8 design allowed up to 6 octets to cover the full
31 bit UCS4 space). UTF-8 (or -16, -32) are all able to represent every
Unicode character. UCS2 does not do any "chaining", so it can only
have, at most, 16 bit characters (i.e. code points up to 65535). UCS2 is
a subset of UCS4[1].
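
For readers who want to check the octet counts themselves, here is a
small sketch. It uses Python 3 syntax, which was not available when this
thread was written, and the helper name encoded_lengths is made up.
U+20000 is the first code point of the CJK Extension B block on plane 2;
U+4E2D is an ordinary CJK character from the 16 bit range.

```python
def encoded_lengths(ch):
    """Number of octets needed for ch in each UTF encoding (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


print(encoded_lengths("\U00020000"))  # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
print(encoded_lengths("\u4e2d"))      # {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
```

The 4 octets that UTF-16 needs for U+20000 are two 16 bit units, a
so-called surrogate pair.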

Of course, I've left out a lot of details, like the fact that UCS2
doesn't actually have 64K chars, etc.

With that said, most Linux machines that have wchar support have wchar
defined as 4 bytes (int). So anything with wchar support probably
already uses 4 bytes. Maybe someone who has used wchar support can
comment on this.

Cheers,

[1]
http://www.gnuenterprise.org/doc/console-tools-libs/html/lct-4.html#sec-unicode

-- 
Alastair 'liquidx' Tse
 >> Gentoo Developer
 >> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/


* Re: [gentoo-dev] python-2.3.2 testing required
From: Alastair Tse @ 2003-11-13  9:51 UTC
  To: gentoo-dev

On Thu, 2003-11-13 at 09:10, Toby Dickenson wrote:
> I've not used ucs4 python yet, but it is one of the things I was looking
> forward to in version 2.3. It would be much nicer to leave ucs2 behind.

I would like to move away from UCS2 as well, but I'd like some arguments
to say why this is a good thing apart from "it's more compatible.".

> If ucs4 strings were the only cause of that difference, supybot would need to
> be storing 2.5 million unicode characters. I guess that isn't likely.
> Excluding bugs, I don't see any reason why a program that doesn't use any
> unicode objects would use more memory when running on a ucs4 python
> interpreter.

All unicode string objects would have been stored in UCS4 instead of
UCS2. Things like XML parsers all use unicode string objects to store
their representations because UTF-8 is the default encoding for XML.
Those sorts of applications may see more significant growth in their
memory footprint.

> > But note that this example is not scientific
> > because the machines were different in kernel version, compiler and
> > compiler optimisations.
> 
> Those reasons sound much more plausible to me. Does anyone have a more
> scientific comparison of the effect of the ucs4 option on python?

I'd like to do that some time. Otherwise, someone with a faster machine
than mine may want to try it. It would be interesting to see what the
real impact is. If the memory footprint doesn't grow as much as I claim
it does, then that is a powerful argument for moving to UCS4 as default.

The reason why UCS2 is still default in the masked python-2.3.2 is
because (a) not many people use anything at the moment that requires
anything above UCS2 and (b) UCS4 does take up more memory compared to
the UCS2. How much more, I'm not certain.

For instance, how much more memory would portage take if it doesn't use
unicode strings at all?

Cheers,
-- 
Alastair 'liquidx' Tse
 >> Gentoo Developer
 >> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/


* Re: [gentoo-dev] python-2.3.2 testing required
From: Alastair Tse @ 2003-11-13  9:57 UTC
  To: gentoo-dev

On Thu, 2003-11-13 at 08:07, Nick Jones wrote:
> this situation. I'll get around to testing this soon. I'd prefer
> that portage be ~arch before you make python ~arch, as it'll cause
> headaches otherwise. 

Yes, that is something I forgot to mention. I won't be putting
python-2.3.2 into ~ until the new portage is unmasked. 

> > * Maybe we should think about using a non version specific directory for
> > Python modules to ease upgrade.
> 
> I agree.

To expand on that a bit: basically, only pure python modules
can go in /usr/lib/site-python (this is supported in python), while
c-modules (bindings) still have to go into
/usr/lib/python${version}/site-packages.

There are a number of issues to address before this works, namely, how
.pth files will be installed and how we get distutils to distinguish the
two. I haven't looked into this in depth, but probably distutils has
something similar to prefix and exec_prefix like autoconf.
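
For reference, distutils does draw this line: it has separate "purelib"
and "platlib" install locations. In later Python versions (2.7 and 3.2
onwards, so not in 2.3) the sysconfig module exposes both. A minimal
sketch, with module_dirs being a made-up helper name:

```python
import sysconfig


def module_dirs():
    """Return the install directories for pure and compiled modules."""
    paths = sysconfig.get_paths()
    return {
        "pure": paths["purelib"],    # pure python modules
        "binary": paths["platlib"],  # platform-specific (C) modules
    }


if __name__ == "__main__":
    for kind, path in sorted(module_dirs().items()):
        print(kind, path)
```

On many installations both entries point at the same site-packages
directory; the split only shows up when the two are configured apart.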

Cheers,
-- 
Alastair 'liquidx' Tse
 >> Gentoo Developer
 >> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/


* Re: [gentoo-dev] python-2.3.2 testing required
From: Toby Dickenson @ 2003-11-13 23:34 UTC
  To: Alastair Tse, gentoo-dev

On Wednesday 12 November 2003 18:46, Alastair Tse wrote:

> I need some more testing for python-2.3.2 before it is released into
> ~x86. 

This update is looking good here. Fine work. 8)

> 2. run: emerge -u portage python

Note that -u will update a lot of things with ~x86 that you might not want.

> UCS4 uses significantly more memory than UCS2. 

I have compared ucs2 (with the ebuild in portage) and ucs4 (by hacking that
ebuild to include the --enable-unicode=ucs4 configure switch). I compared:
1. The sizes of a newly started interpreter using the RES column on 'top'
2. The size of a "btdownloadheadless" process seeding a knoppix cd image
3. The size of the .tbz2 binary package.

The empty interpreters are of identical size. The bittorrent process is
considerably *smaller* with ucs4, and the tbz2 files are largely unchanged in
size. Numbers below. IMHO that's a clear but surprising win for ucs4. Let's use
it always.

> Plan for Python's Future
> ========================

Portage has a number of packages that still need 2.2 but reference
#!/usr/bin/python. For example, rdiff-backup. The distutils eclass has support
for forcing /usr/bin/python2.1 (thanks to zope), but I can't see any way to
force 2.2. Am I overlooking something?




                      ucs2        ucs4
empty interpreter     2928        2928          (RES from top, KB)
btdownloadheadless    6076        5464, 5420    (RES from top, KB)
tbz2 size             5954517     5951807       (bytes)

-- 
Toby Dickenson


* Re: [gentoo-dev] python-2.3.2 testing required
From: Alastair Tse @ 2003-11-14  9:37 UTC
  To: gentoo-dev

On Thu, 2003-11-13 at 23:34, Toby Dickenson wrote:
> > 2. run: emerge -u portage python
> 
> Note that -u will update a lot of things with ~x86 that you might not want.
> 

True, although in theory, it works also if you mark them as stable. I've
been running python-2.3 on a stable box without any problems. YMMV.

> > UCS4 uses significantly more memory than UCS2. 
> 
> I have compared ucs2 (with the ebuild in portage) and ucs4 (by hacking that
> ebuild to include the --enable-unicode=ucs4 configure switch). I compared:
> 1. The sizes of a newly started interpreter using the RES column on 'top'
> 2. The size of a "btdownloadheadless" process seeding a knoppix cd image
> 3. The size of the .tbz2 binary package.
>
> The empty interpreters are of identical size. The bittorrent process is
> considerably *smaller* with ucs4, and the tbz2 files are largely unchanged in
> size. Numbers below. IMHO that's a clear but surprising win for ucs4. Let's use
> it always.

That is rather surprising. I could only explain that by assuming that
btdownloadheadless doesn't use any unicode objects at all. As I said
before, I think the real test case would be XML parsing.

But hopefully this weekend I can formulate some tests to run to get
something more "scientific". It might be that my initial
observations were off.

> 
> > Plan for Python's Future
> > ========================
> 
> Portage has a number of packages that still need 2.2 but reference
> #!/usr/bin/python. For example, rdiff-backup. The distutils eclass has support
> for forcing /usr/bin/python2.1 (thanks to zope), but I can't see any way to
> force 2.2. Am I overlooking something?

Zope is a special case because it is not supported if you run it on
anything above 2.1. I'm not too familiar with Zope, so maybe some others
can chime in. As for packages that don't work with 2.3, that needs to be
fixed. There is a bug open now where people can report apps that don't
work with 2.3, either because 2.2 is hardcoded or because they just need
to be patched.

http://bugs.gentoo.org/show_bug.cgi?id=33372

Cheers,
-- 
Alastair 'liquidx' Tse
 >> Gentoo Developer
 >> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/


* Re: [gentoo-dev] python-2.3.2 testing required
From: Simon Watson @ 2003-11-15  8:09 UTC
  To: Alastair Tse; +Cc: gentoo-dev

On Wed, Nov 12, 2003 at 06:46:44PM +0000, Alastair Tse wrote:
> Hi All,
> 
> I need some more testing for python-2.3.2 before it is released into
> ~x86. This is a call to Gentoo developers and people interested in
> helping to test. 

Been running python 2.3.2 since Thursday, along with the updated portage. I
have had no obvious problems! :D


-- 
Simon Watson <simon@swat.me.uk>

* Re: [gentoo-dev] python-2.3.2 testing required
From: Alastair Tse @ 2003-11-17  0:29 UTC
  To: gentoo-dev

On Wed, 2003-11-12 at 18:46, Alastair Tse wrote:
> The reason why I'm not making this default is because UCS4 python uses
> more memory. An example is supybot (Python IRC bot) that uses 8M for
> UCS2 and 13M for UCS4. But note that this example is not scientific
> because the machines were different in kernel version, compiler and
> compiler optimisations.

I've found a little spare time this weekend to do a little bit of memory
benchmarking to prove/disprove my point about UCS4 using more memory
than UCS2.

I wrote and conducted 2 simple tests that I thought were relevant to
Python on Gentoo. The two tests I conducted were:

1. Generating a large number of Python Unicode Strings and recording the
memory usage.
2. Running "emerge" with various options and recording the
memory usage.
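
A rough sketch of what the first kind of test can look like is given
here. It is not the script used for the results below (those scripts are
at the URL listed under Other Details); the function names are made up,
it uses Python 3 syntax and the Unix-only resource module, and on Python
3.3 or newer the per-character width depends on string content, so it
will not reproduce the UCS2/UCS4 split.

```python
import resource


def make_unicode_strings(count, length=256, char="\u4e2d"):
    """Build `count` separate unicode strings of `length` characters each."""
    return [char * length for _ in range(count)]


def peak_rss():
    """Peak resident set size of this process as reported by getrusage().

    The unit is platform dependent (kilobytes on Linux).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


if __name__ == "__main__":
    # The peak only ever grows, so the counts are run in ascending order.
    for count in (1, 10, 100, 1000, 10000):
        strings = make_unicode_strings(count)
        print(count, peak_rss())
```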

The results demonstrate that UCS4 is more memory hungry _only_ if a
script/module/application uses unicode strings. This means any bindings
that use PyUnicode_* objects (for example, pygtk) or any script that
uses unicode strings. If a script/module/application does not use
unicode objects, it suffers from no noticeable memory impact.

The numbers reported are averages from 3 or more runs. In nearly all
cases, the memory usage was constant.

Results:
========
1: Generating Unicode Multi-Byte Strings (1 to 10000 strings)
(String size of 256 mbchars, stored in a regular python list)
-------------------------------------------------------------------
            UCS2                    UCS4
Strings     Mem    RSS    Shared    Mem    RSS    Shared    %+
1           1839   710    1535      1839   711    1535      0
10          1871   712    1535      1871   717    1535      0
100         1904   765    1535      1971   830    1535      3.5
1000        2465   1336   1535      3102   1960   1535      25.84
10000       8213   7052   1535      14445  13309  1535      75.80

2: Generating Unicode ASCII Strings (1 to 10000 strings)
(String size of 256 chars, stored in a regular python list)
-------------------------------------------------------------------
            UCS2                    UCS4
Strings     Mem    RSS    Shared    Mem    RSS    Shared    %+
1           1839   710    1535      1839   711    1535      0
10          1871   712    1535      1871   717    1535      0
100         1904   765    1535      1971   830    1535      3.5
1000        2465   1336   1535      3102   1960   1535      25.84
10000       8213   7053   1535      14445  13309  1535      75.80

3: Max Memory Usage under "emerge -p kde"
-------------------------------------------------------------------
      Mem  RSS  Shared
UCS2: 3222 1893 1955
UCS4: 3123 1769 1955

4: Max Memory Usage under "emerge search kde"
-------------------------------------------------------------------
      Mem  RSS  Shared
UCS2: 3221 1898 1955
UCS4: 3160 1803 1955

Discussion
==========

There are two immediate observations. One is that UCS4 does use more
memory compared to UCS2 when unicode strings are involved. From Tests 1
and 2, the VM has an overhead of 1.8M and, as more strings are created,
the difference in memory usage steadily increases to about 75%.
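
As a rough cross-check, the growth can be compared against what 2 extra
bytes per character would predict. The sketch below assumes the Mem
column is in kilobytes (which matches the 1.8M overhead quoted above);
the helper name is made up.

```python
def expected_extra_kb(strings, chars_per_string=256, extra_bytes_per_char=2):
    """Extra kilobytes UCS4 should need if only character storage grows."""
    return strings * chars_per_string * extra_bytes_per_char / 1024.0


# (UCS2 Mem, UCS4 Mem) taken from the result tables.
observed = {1000: (2465, 3102), 10000: (8213, 14445)}

for count, (ucs2, ucs4) in sorted(observed.items()):
    print(count, expected_extra_kb(count), ucs4 - ucs2)
# 1000 500.0 637
# 10000 5000.0 6232
```

The measured difference is somewhat larger than the raw character
storage alone would explain, but it is of the same order.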

The other observation is that if there is no unicode usage in an
application, like "emerge", there is virtually no impact. Actually, in
this case, you'll find that UCS4 uses about 60K to 100K less memory than
UCS2. I don't have an explanation for that behaviour.

Another observation, which does not relate to the UCS2/UCS4 benchmark,
is that it doesn't matter whether you are primarily dealing with
ASCII or Multi-Byte (eg, CJK characters) strings. As soon as they are
cast as unicode objects, they use more memory. Note that the two runs
have identical memory usage; that is not a mistake.

Another one is that 'emerge' uses the same amount of memory regardless
of what is being run. I had an informal test running just "emerge info"
and it still used approximately the same memory as running more
complicated things like merging packages or searching the package
database.

Other Details
=============
The above results were run with dev-lang/python-2.3.2-r1 with:
Kernel 2.6.0-test9-mm1
Glibc-2.3.2-r8 (w/ nptl)
GCC-3.3.2
Portage 2.0.49-r16

The raw logs for the tests and the scripts used can be found at:
http://dev.gentoo.org/~liquidx/python-test/

Remarks
=======

After running these tests, I am still divided about whether UCS4 should be
enabled by default. I'm not seeing the added benefits of UCS4 in
contrast with the memory usage increase it brings. Yet, it also seems
like the "right" thing to do for m17n support.

Cheers,
-- 
Alastair 'liquidx' Tse
 >> Gentoo Developer
 >> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/


* Re: [gentoo-dev] python-2.3.2 testing required
From: Toby Dickenson @ 2003-11-17 10:28 UTC
  To: Alastair Tse, gentoo-dev

On Monday 17 November 2003 00:29, Alastair Tse wrote:

> After running these tests, I am still divided about whether UCS4 should be
> enabled by default. I'm not seeing the added benefits of UCS4 in
> contrast with the memory usage increase it brings. Yet, it also seems
> like the "right" thing to do for m17n support.

"I'm still divided about whether 4 digit years should be enabled by default. I'm
not seeing any benefits of 4 digit years over 2 digit years, in contrast with
the memory usage increase it brings"

-- 
Toby Dickenson


* Re: [gentoo-dev] python-2.3.2 testing required
From: Alastair Tse @ 2003-11-17 10:48 UTC
  To: tdickenson; +Cc: gentoo-dev

On Mon, 2003-11-17 at 10:28, Toby Dickenson wrote:
> "I'm still divided about whether 4 digit years should be enabled by default. I'm
> not seeing any benefits of 4 digit years over 2 digit years, in contrast with
> the memory usage increase it brings"

True. Point taken. In contrast to your support of UCS4 being turned on,
I haven't heard any cries of outrage against the increased memory usage
if UCS4 was enabled by default.

Cheers,
-- 
Alastair 'liquidx' Tse
 >> Gentoo Developer
 >> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/


Thread overview: 14+ messages
2003-11-12 18:46 [gentoo-dev] python-2.3.2 testing required Alastair Tse
2003-11-13  8:07 ` Nick Jones
2003-11-13  9:57   ` Alastair Tse
2003-11-13  8:07 ` Alastair Tse
2003-11-13  9:05 ` Paul de Vrieze
2003-11-13  9:38   ` Alastair Tse
2003-11-13  9:10 ` Toby Dickenson
2003-11-13  9:51   ` Alastair Tse
2003-11-13 23:34 ` Toby Dickenson
2003-11-14  9:37   ` Alastair Tse
2003-11-15  8:09 ` Simon Watson
2003-11-17  0:29 ` Alastair Tse
2003-11-17 10:28   ` Toby Dickenson
2003-11-17 10:48     ` Alastair Tse
