From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from pigeon.gentoo.org ([208.92.234.80] helo=lists.gentoo.org)
	by finch.gentoo.org with esmtp (Exim 4.60)
	(envelope-from <gentoo-science+bounces-1148-garchives=archives.gentoo.org@lists.gentoo.org>)
	id 1OYPme-0006ba-P5
	for garchives@archives.gentoo.org; Mon, 12 Jul 2010 20:39:25 +0000
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
	by pigeon.gentoo.org (Postfix) with SMTP id 597F9E0AB7;
	Mon, 12 Jul 2010 20:38:56 +0000 (UTC)
Received: from smtp.webfaction.com (mail6.webfaction.com [74.55.86.74])
	by pigeon.gentoo.org (Postfix) with ESMTP id 2AFEEE0AB5;
	Mon, 12 Jul 2010 20:38:56 +0000 (UTC)
Received: from mail-ww0-f53.google.com (mail-ww0-f53.google.com [74.125.82.53])
	by smtp.webfaction.com (Postfix) with ESMTP id 1BE0F390AF2;
	Mon, 12 Jul 2010 15:38:54 -0500 (CDT)
Received: by wwb24 with SMTP id 24so415099wwb.10
        for <multiple recipients>; Mon, 12 Jul 2010 13:38:53 -0700 (PDT)
Precedence: bulk
List-Post: <mailto:gentoo-science@lists.gentoo.org>
List-Help: <mailto:gentoo-science+help@lists.gentoo.org>
List-Unsubscribe: <mailto:gentoo-science+unsubscribe@lists.gentoo.org>
List-Subscribe: <mailto:gentoo-science+subscribe@lists.gentoo.org>
List-Id: Gentoo Linux mail <gentoo-science.gentoo.org>
X-BeenThere: gentoo-science@lists.gentoo.org
Reply-to: gentoo-science@lists.gentoo.org
MIME-Version: 1.0
Received: by 10.216.63.147 with SMTP id a19mr125987wed.35.1278967133417; Mon, 
	12 Jul 2010 13:38:53 -0700 (PDT)
Received: by 10.216.182.72 with HTTP; Mon, 12 Jul 2010 13:38:53 -0700 (PDT)
Date: Mon, 12 Jul 2010 22:38:53 +0200
Message-ID: <AANLkTikupO3Ei5CdsJIJQcsTUPbJeCvb16mplqHqEdPc@mail.gmail.com>
Subject: [gentoo-science] G-CRAN weekly report #7 (warning: big read)
From: Auke Booij <auke@tulcod.com>
To: gentoo-soc@lists.gentoo.org, gentoo-science@lists.gentoo.org
Content-Type: text/plain; charset=ISO-8859-1
X-Archives-Salt: fb5aa6cb-7074-4ca8-b3e7-b731c38c1b63
X-Archives-Hash: 2462615c9a54675c12a1c7282521af5c

As the subject says, this report is pretty long. It's intended for
those who haven't closely followed my work up until now and would like
to catch up, so go grab a cup of coffee if you really want to read
this to the end.

Subjects in this report (in order):
-intro of the project
-what I was up to last week
-instructions on installing packages from bioconductor and CRAN
-g-common, the interface (or actually lack of interface) this project will have
-plans for the coming week and next week

Perhaps an introduction to the circumstances is in order. R is a
language for statisticians. With statistics being such a wide topic,
there are thousands of additional packages you can install to further
analyze data, and the Bioconductor project adds another field to R by
introducing genomics. My job is to cleanly enable Gentoo users to
install the latest versions of these packages systemwide, as opposed
to directly calling R's package installers and ending up with dangling
files. Last week, I was up to the point where some packages installed
correctly, but there were some rough edges too. For packages not
relying on external (non-R) libraries, this should all be smoothed out
now.

I spent a lot of time communicating with several parties last week.
There was a minor issue with the Bioconductor repositories, I've
spoken to some people about g-common, talked a bit with the CRAN
maintainers and had some technical discussions with rafaelmartins,
who's a gsoc student working on g-octave, as you may know.

Then there are some helpful dependency resolution changes.
Dependencies on R packages now work perfectly fine, and external
dependencies are going to be tackled soon (but it won't be pretty).

So why is this helpful? It means you can install most Bioconductor
packages flawlessly.

As promised in an earlier email to the gentoo-science ML, some
instructions. Please note that this will of course not be the way
you'll eventually use g-cran, but I'm still working on the interface
(more on that later).

First, create two overlays. I'm simply calling them bioconductor_1 and
bioconductor_2. One of them primarily contains code, the other
consists primarily of gene databases.
# mkdir -p /usr/local/portage/bioconductor_1/profiles
# mkdir -p /usr/local/portage/bioconductor_2/profiles
Now we need to set the repo_name and categories of these overlays, too.
# echo "bioconductor_1" > /usr/local/portage/bioconductor_1/profiles/repo_name
# echo "bioconductor_2" > /usr/local/portage/bioconductor_2/profiles/repo_name
# echo "dev-R" >> /usr/local/portage/bioconductor_1/profiles/categories
# echo "dev-R" >> /usr/local/portage/bioconductor_2/profiles/categories
It's time to actually get the tree. Make sure you've installed g-cran
(it's in the science overlay), sync the repositories and then generate
the tree:
# g-cran /usr/local/portage/bioconductor_1 sync http://www.bioconductor.org/packages/devel/bioc
# g-cran /usr/local/portage/bioconductor_2 sync http://www.bioconductor.org/packages/devel/data/annotation
# g-cran /usr/local/portage/bioconductor_1 generate-tree
# g-cran /usr/local/portage/bioconductor_2 generate-tree

You can now add the overlays to your favorite package manager and
start emerging (*ahem* - installing) packages. If all is well, you
should be able to install, for example, dev-R/zebrafishdb (this is a
bioconductor_2 database package that pulls in several bioconductor_1
packages). I have absolutely no clue as to what you can do with these
packages, but I suppose some biology fans out there can clarify that.
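
For Portage, "adding the overlays" usually means listing them in
PORTDIR_OVERLAY in /etc/make.conf. Here is a minimal sketch of that step,
assuming Portage and that file location. The helper name add_overlay is
made up for illustration and is not part of g-cran; other package
managers have their own configuration.

```shell
# Hypothetical helper (not part of g-cran): append an overlay path to
# PORTDIR_OVERLAY in a make.conf-style file.
add_overlay() {
    conf="$1"      # e.g. /etc/make.conf (assumed location)
    overlay="$2"   # e.g. /usr/local/portage/bioconductor_1
    # Skip the append if this helper already wrote a line for the overlay.
    # This only recognizes lines in the exact form written below.
    if grep -qsF " $overlay\"" "$conf"; then
        return 0
    fi
    # ${PORTDIR_OVERLAY} is written literally so that Portage expands it
    # when it reads the file, keeping any overlays that were set earlier.
    printf 'PORTDIR_OVERLAY="${PORTDIR_OVERLAY} %s"\n' "$overlay" >> "$conf"
}
```

Called as "add_overlay /etc/make.conf /usr/local/portage/bioconductor_1"
(and once more for bioconductor_2), after which something like
"emerge --ask dev-R/zebrafishdb" should see both trees.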

Now, it may be that portage complains about missing Manifest files. If
that's the case, then also run:
# for x in /usr/local/portage/bioconductor_{1,2}/dev-R/*; do touch "${x}/Manifest"; done
I hope that does the trick; please tell me whether it works, and
whether it's needed at all. Once you've done this and the trick actually
works, you should be able to install dev-R/zebrafishdb.
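
To find out whether the workaround is needed at all, it may help to first
list the package directories that have no Manifest. A small sketch of
that check; the function name missing_manifests is made up for
illustration, and it assumes the dev-R category layout used above.

```shell
# Hypothetical helper: print every package directory under <overlay>/dev-R
# that has no Manifest file.
missing_manifests() {
    for pkg in "$1"/dev-R/*/; do
        [ -d "$pkg" ] || continue    # the glob matched nothing
        # $pkg ends in a slash, so ${pkg}Manifest is <dir>/Manifest
        [ -e "${pkg}Manifest" ] || printf '%s\n' "${pkg%/}"
    done
}
```

If "missing_manifests /usr/local/portage/bioconductor_1" prints nothing,
the touch loop should not be necessary for that overlay.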

If you don't need no stinkin' databases of deoxyribonucleic acid, but
are interested in CRAN, just create a cran overlay as we did for
bioconductor_1 and bioconductor_2, but use http://cran.r-project.org
as the source repository, and 'cran' for the overlay name. Better yet,
find a mirror close to you at http://cran.r-project.org/mirrors.html

Okay, so that was quite a journey to get a simple sqlite database of
gene data. g-common is what will be making all this easier.
Unfortunately I haven't heard much lately from the other two students I
was cooperating with, so I'm going to invent something of my own. The
plan has remained roughly the same, but I keep struggling to explain
it, so please bear with me as you read this.

[start explanation of g-common]
Current projects that install non-ebuild packages generate ebuild files
on request, put them in an overlay and tell portage to install them.
The problem with this approach is that the ebuilds are only generated
once you know what you want to install, i.e. the overlay doesn't get
fully populated upfront. This approach implies you cannot search for
packages in such repositories, you cannot depend on packages in such
repositories, and you can't trivially update packages in such
repositories. I'd like to generate a full package tree at sync time,
whether you want to use it or not. Further, this syncing should
work like any other overlay: ideally, support for non-ebuild
repositories is transparent to the users. I'm going to do this via an
abstraction layer called g-common, for which support needs to be
written for all package managers. But once that support is written,
and the non-ebuild repository reading code is adjusted to work with
g-common, there is nothing stopping you from using a non-ebuild
repository like a regular ebuild overlay.
How this works is not exactly trivial to explain, but the important
part is that even though tools like g-cran are doing the actual work, the
package manager thinks it's dealing with a regular PMS-worthy tree.
At sync time, the package manager simply calls the g-common method for
syncing a tree, which in turn calls the appropriate repository driver
to fetch the new package listing from the true remote repository. To
integrate this well, some patching is needed. At install time, all the
various src_unpack, src_install, etc. phases result in calls to
g-common, which in turn calls the appropriate repository driver to
execute the phase, and all of this stays more or less PMS-compliant.
Call it over-engineering, but it'll feel like magic and I'm going to
prove it.
[end explanation of g-common]
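
As a rough picture of the indirection described above, here is a toy
dispatcher. It is purely illustrative: none of these function names
exist in g-common or g-cran, and the real interface is still undecided.
It only shows the shape of "package manager calls g-common, g-common
calls the repository driver".

```shell
# A made-up "repository driver" for CRAN-style repositories.
driver_cran_sync()        { echo "cran driver: fetch package listing into $1"; }
driver_cran_src_install() { echo "cran driver: install $2 from $1"; }

# The abstraction layer: the package manager only ever calls this, and it
# forwards the action (sync or an ebuild phase) to the matching driver.
g_common() {
    driver="$1"; action="$2"; shift 2
    handler="driver_${driver}_${action}"
    if ! command -v "$handler" >/dev/null 2>&1; then
        echo "g-common: driver '$driver' does not implement '$action'" >&2
        return 1
    fi
    "$handler" "$@"
}
```

In this toy version, "g_common cran sync /usr/local/portage/cran" ends up
in driver_cran_sync, and a driver for another repository type would only
need to provide its own driver_<name>_<action> functions.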

The plan for this week is to /finally/ get some work done on g-common
and perhaps prepare the code for external dependency resolution. On
Saturday, I'm unfortunately leaving for vacation, so you won't see me
doing much. After that vacation, first of all there's GUADEC 2010
which I'm going to attend, but of course I'm also going to continue
developing g-common and finish external dependency resolution.

Now, if you've come to this point in my email, I'd really like to
thank you, because I know how easy it is to simply mark an email as
read and move on. You are why I'm developing this, thanks a lot!

The next weekly report will be in two weeks,
Auke Booij / tulcod.