From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 29 Jul 2010 19:36:22 -0700
From: Brian Harring
To: Arfrever Frehtes Taifersar Arahesis
Cc: gentoo-dev@lists.gentoo.org, qa@gentoo.org
Subject: Re: [gentoo-dev] Locale check in python_pkg_setup()
Message-ID: <20100730023622.GB15031@hrair>
References: <201007300116.43653.Arfrever@gentoo.org>
List-Id: Gentoo Linux mail
Reply-to: gentoo-dev@lists.gentoo.org
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="RnlQjJ0d97Da+TV1"
Content-Disposition: inline
In-Reply-To: <201007300116.43653.Arfrever@gentoo.org>
User-Agent: Mutt/1.5.20 (2009-06-14)

--RnlQjJ0d97Da+TV1
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Fri, Jul 30, 2010 at 01:16:42AM +0200, Arfrever Frehtes Taifersar Arahesis wrote:
> --- python.eclass
> +++ python.eclass
> @@ -355,6 +355,8 @@
>   # Check if phase is pkg_setup().
>   [[ "${EBUILD_PHASE}" != "setup" ]] && die "${FUNCNAME}() can be used only in pkg_setup() phase"
>
> + local locale
> +
>   if [[ "$#" -ne 0 ]]; then
>       die "${FUNCNAME}() does not accept arguments"
>   fi
> @@ -407,6 +409,16 @@
>       unset -f python_pkg_setup_check_USE_flags
>   fi
>
> + locale="$(python -c 'import os; print(os.environ.get("LC_ALL", os.environ.get("LC_CTYPE", os.environ.get("LANG", "POSIX"))))')"

You're using python to get the exported env. Don't. Use bash (you're
invoking python from freaking bash after all)...

> + if [[ "${locale}" != *.UTF-8 ]]; then
> +     eerror
> +     eerror "Currently used locale '${locale}' is unsupported and can cause build-time or run-time"
> +     eerror "problems (usually UnicodeDecodeErrors or UnicodeEncodeErrors). Bugs caused by this locale"
> +     eerror "will be closed as invalid. It is recommended to use a UTF-8 locale to avoid problems."
> +     eerror "See http://www.gentoo.org/doc/en/utf-8.xml for information on how to fix locale."
> +     eerror

For cases such as this, ewarn, not eerror. It's not an actual error,
it's a potential source of problems people may see.

The more I look into this issue, the more I'm convinced it's not user
settings that are the problem- the problem is in the code, not in user
env. You've stated in a couple of places that "C/Posix locales are
not supported", which frankly is very whacked- that's not really a
proclamation you can make on your own for python, and you're actually
ignoring that this problem would just as easily rear its head with a
latin-1 encoded file.

Take a look at bug 302425; the traceback in that is a classic example
of where they *should* be using bytes mode (they don't need to
interpret the data, just write the script across, thus bytes).

bug 328047 is induced by a patch we add (it's not in upstream python).
The code in question also is invoking fricking ldd a few steps prior
which is questionable in multiple ways: either way, relevant chunk is

+ os.system("ldd %s > %s" % (do_readline, tmpfile))
+ fp = open(tmpfile)
+ for ln in fp:

So... roughly, it invokes os.system, which will pass the environment
straight through to it, meaning locale gets passed down.

Then it opens the file. Note it specifies *NO ENCODING* nor is there
actually an enforced locale best I can tell, thus ascii being the
default. The screwup here is in our patches- said patches should be
forcing posix locale for the ldd call (resulting in ascii). If you
think through this bug, we've seen this multiple times in grep/sed
calls- this is literally no different.

bug 287439 is a screw up in the program's source... should've been
using bytes (non arguable). Matter of fact, while generally I think
Tarek knows what the hell he's doing, the skip they added to the
tests ignored an actual valid bug in setuptools/distribute- shebangs
from the standpoint of the kernel need to be consistent. Thus reading
the shebang line itself should be done in bytes, then converted to
ascii and interpreted- they tried opening the file (in whole) in
bytes, meaning they tried enforcing ascii across the whole buffer-
not just the first line. Program bug.

These bugs I got via searching for 'ALL python locale', and
identifying the ones that were actually locale related. I've at this
point looked into the source of 3 bugs- meaning literally, 3 bugs
checked into, 3 instances where the code was wrong.

I'll leave it as an exercise for others to keep digging, but the point
here is that the programs themselves screw up their locale handling-
trying to force all systems to use a utf-8 locale for the env is just
a hack instead of fixing the actual issue.
A pretty bad hack
considering I've spent all of 30 minutes digging into this and rooting
out the actual flaws in the src I might add.

For shits and giggles, let's add one more bug in- one that has the
potential of rearing its head in random consuming pkgs, bug 322425
(docutils's build_html being flawed), their encoding handling is
intrinsically flawed. The encoding of a file they're
installing/parsing should be determined by the file itself- not
attempting to arbitrarily force it to whatever locale the user happens
to be running (which is exactly the first thing buildhtml.py attempts,
literally `locale.setlocale(locale.LC_ALL, '')` at line 20). The
issue is not people using ascii locales, the issue is that these tools
do not handle encoding correctly.

Recall, one of the purposes of py3k going bytes vs text (aka unicode)
was to make clear that textual data's encoding need be known. All of
this code isn't actually forcing/handling the encoding for the data
they deal in- meaning these are literal bugs, exposed purely due to
py3k actually enforcing encoding in normal file opens.

So... this is a big -1 on adding such a warning (especially
considering it doesn't actually resolve the raw issues, it just
sidesteps a couple of cases). Fix the actual problem instead...

Finally, cc'ing QA since this is a class of bugs they should be aware
of with py3k. This is a bit of a sign that a lot of source isn't
really py3k ready yet either imo, but so it goes...

~harring

--RnlQjJ0d97Da+TV1
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)

iEYEARECAAYFAkxSOqYACgkQsiLx3HvNzge/IQCePA6TJ1DUIc1bs6Z/EMwqUWyS
0kcAn2i7UjrkYl7v0kovkBsk1kD+Ht/l
=TWrU
-----END PGP SIGNATURE-----

--RnlQjJ0d97Da+TV1--
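
Editorial illustration (not part of the signed message): a rough Python
sketch of the two fixes argued for above, namely forcing the POSIX
locale on an external command and decoding its output explicitly, and
reading a shebang in bytes mode so only the first line is ever treated
as text. The helper names run_in_c_locale and read_shebang are
hypothetical; they do not come from the Gentoo python patches,
setuptools/distribute, or docutils, and the real fixes may well look
different.

```python
import os
import subprocess
import sys


def run_in_c_locale(cmd):
    """Run cmd with LC_ALL=C and return its stdout decoded as ASCII.

    Hypothetical helper: stands in for the patched ldd call, which
    inherits whatever locale the user happens to have exported.
    """
    env = dict(os.environ)
    env["LC_ALL"] = "C"  # takes precedence over LANG and other LC_* vars
    raw = subprocess.check_output(cmd, env=env)
    # Decode explicitly instead of relying on the locale default;
    # stray non-ASCII bytes are replaced rather than raising.
    return raw.decode("ascii", "replace")


def read_shebang(path):
    """Return the interpreter from a script's first line, or None.

    Hypothetical helper: the file is opened in bytes mode, so the rest
    of the file is never decoded, whatever its encoding is.
    """
    with open(path, "rb") as fp:
        first = fp.readline()
    if not first.startswith(b"#!"):
        return None
    # Only the shebang line itself is interpreted as ASCII text.
    return first[2:].strip().decode("ascii")
```

For the ldd case this would be called as something like
run_in_c_locale(["ldd", path]) and iterated with splitlines(), which
also avoids the temporary file; that call shape is an assumption, not
what the patch does today.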