* [gentoo-cluster] examples of (large) Gentoo clusters
From: Bryan Green @ 2006-12-01 23:16 UTC
To: gentoo-cluster

Hello all,

I am looking for something of a survey of examples of Gentoo-driven clusters
out there. If such a survey has been done, perhaps someone could point me to
it. But I would like to hear from others on the list about their clusters.

I am in the process of advocating for using Gentoo on a new cluster that we
will be building. The cluster will be a "hyperwall", meaning that each node
will have graphics, forming a grid of displays for multi-parameter,
multi-dimensional scientific visualization. There will also be several disk
servers which will run Suse in order to get Lustre support (Lustre support
on the client side will be OS-neutral when the current beta is officially
released). In addition to graphics, the nodes will also be used for compute
jobs (scientific), and may serve as a testbed for a production scientific
computing environment.

In the process of making my case, I've been asked what other examples there
are of large Gentoo clusters. This cluster will be 128 nodes (dual socket,
dual or quad core). Of particular interest are production and/or scientific
environments - not so much database clusters, though all examples are of
interest. Use of MPI is particularly relevant. Graphics clusters are also of
interest of course.

I'd be grateful for any feedback I get from others on the list about the
clusters they maintain or use, and perhaps some comments about the efficacy
of Gentoo in an environment where stability is very important, and how
system administration compares to administration of a Suse or Redhat cluster.

Thanks,
-bryan
--
gentoo-cluster@gentoo.org mailing list

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Donnie Berkholz @ 2006-12-02 3:48 UTC
To: gentoo-cluster

Bryan Green wrote:
> Hello all,
>
> I am looking for something of a survey of examples of Gentoo-driven clusters
> out there. If such a survey has been done, perhaps someone could point me to it.

http://www.gentoo.org/proj/en/cluster/#doc_chap2

> But I would like to hear from others on the list about their clusters.
>
> I am in the process of advocating for using Gentoo on a new cluster that we
> will be building. The cluster will be a "hyperwall", meaning that each node
> will have graphics, forming a grid of displays for multi-parameter,
> multi-dimensional scientific visualization. There will also be several disk
> servers which will run Suse in order to get Lustre support (Lustre support
> on the client side will be OS-neutral when the current beta is officially
> released). In addition to graphics, the nodes will also be used for compute
> jobs (scientific), and may serve as a testbed for a production scientific
> computing environment.

Joel Martin has previously posted Lustre ebuilds to the list (for both
client and server, I think). You may be interested. We'll want to get
them into portage at some point, so there's no requirement that you use
Suse server-side.

> I'd be grateful for any feedback I get from others on the list about the
> clusters they maintain or use, and perhaps some comments about the efficacy
> of Gentoo in an environment where stability is very important, and how
> system administration compares to administration of a Suse or Redhat cluster.

The main difference is that, since we're "live," you need to consider
how you want to deal with upgrades. You may wish to pick a static
portage tree, import it into some sort of version control, and
selectively import changes you want (probably just security bumps, which
you can find using the wonderful glsa-check tool from gentoolkit).

I've got a glsa-check wrapper that I use to make things a little easier,
which shows and optionally applies applicable updates. I attached it.

Thanks,
Donnie

[-- Attachment #2: glsa-apply.sh --]
[-- Type: application/x-shellscript, Size: 1033 bytes --]
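
The attached glsa-apply.sh is not reproduced in this archive. As a rough
illustration of the kind of wrapper described above (this is not Donnie's
script), here is a minimal Python sketch. It assumes gentoolkit's glsa-check
is on the PATH and relies only on its -t (test), -p (pretend) and -f (fix)
options; the function names and the ID pattern are assumptions made for the
example.

```python
import re
import subprocess

# GLSA identifiers look like 200612-01 (year+month, then a sequence number).
GLSA_ID = re.compile(r"\b\d{6}-\d{2}\b")


def parse_glsa_ids(output: str) -> list[str]:
    """Return the unique GLSA IDs found in glsa-check output, in order."""
    ids: list[str] = []
    for gid in GLSA_ID.findall(output):
        if gid not in ids:
            ids.append(gid)
    return ids


def fix_command(glsa_id: str, apply: bool = False) -> list[str]:
    """Build the glsa-check call that shows (-p) or applies (-f) one advisory."""
    return ["glsa-check", "-f" if apply else "-p", glsa_id]


def show_or_apply(apply: bool = False) -> list[str]:
    """List the advisories affecting this host and show or apply each fix."""
    # `glsa-check -t all` tests the installed system against every known GLSA.
    # It may exit non-zero when advisories apply, so the status is not checked.
    tested = subprocess.run(
        ["glsa-check", "-t", "all"],
        capture_output=True, text=True, check=False,
    )
    ids = parse_glsa_ids(tested.stdout)
    for gid in ids:
        subprocess.run(fix_command(gid, apply), check=False)
    return ids
```

Calling show_or_apply() only prints what would be done; show_or_apply(apply=True)
would attempt the fixes. On a frozen tree, the listed advisories are the
candidates for selective import into the version-controlled snapshot.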

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Hanni Ali @ 2006-12-02 13:54 UTC
To: gentoo-cluster

Hi Bryan,

I run a startup which provides Gentoo-based clusters for a wide variety of
applications. I find it far simpler to maintain and run Gentoo; portage
simplifies maintenance so much.

> The cluster will be a "hyperwall", meaning that each node will have
> graphics, forming a grid of displays for multi-parameter,
> multi-dimensional scientific visualization.

Sounds fascinating. I do hope you will report how it goes and what method
you use to achieve this.

> the nodes will also be used for compute jobs (scientific), and may serve
> as a testbed for a production scientific computing environment.

In my experience MPI works very well on Gentoo.

> I'd be grateful for any feedback I get from others on the list about the
> clusters they maintain or use, and perhaps some comments about the efficacy
> of Gentoo in an environment where stability is very important, and how
> system administration compares to administration of a Suse or Redhat
> cluster.

I always find sys admin far easier with Gentoo, but wrt clusters I think the
architecture of the cluster is as important as the OS. I always recommend
diskless nodes, although local disks for replication etc. are fine. By
keeping the important parts central and providing an image for the nodes to
boot, the chance of stray mistakes is reduced. This also allows you to
improve stability by testing a new image on one node before deploying it
across the cluster.

KlustOS (our OS), which is Gentoo based, has been designed to scale to
hundreds if not thousands of nodes, and I believe Gentoo is more than
capable of running large production clusters stably.

Donnie's advice about security updates in combination with a testing image
is useful as well.

Hanni

--
E-mail: hanni.ali@gmail.com
Mobile: 07985580147
Website: www.ainkaboot.co.uk
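
The "test a new image on one node before deploying across the cluster" idea
can be sketched in a few lines of Python. This is only an illustration under
assumptions: deploy and healthy are hypothetical callables standing in for
whatever a site uses to push an image and to check a node; nothing here is
taken from KlustOS.

```python
from typing import Callable, Iterable


def staged_rollout(
    nodes: Iterable[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy to the first node, verify it, then deploy to the rest.

    Returns the nodes that received the new image. Raises RuntimeError,
    leaving the remaining nodes untouched, if the test node fails its check.
    """
    node_list = list(nodes)
    if not node_list:
        return []
    test_node, rest = node_list[0], node_list[1:]
    deploy(test_node)
    if not healthy(test_node):
        raise RuntimeError(f"test node {test_node} failed its health check")
    for node in rest:
        deploy(node)
    return node_list
```

With diskless nodes, deploy could be as small as pointing a node's boot entry
at the new image and rebooting it; the rest of the cluster keeps booting the
old image until the test node passes.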

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Bryan Green @ 2006-12-02 18:12 UTC
To: gentoo-cluster

"Hanni Ali" writes:
> > The cluster will be a "hyperwall", meaning that each node will have
> > graphics, forming a grid of displays for multi-parameter,
> > multi-dimensional scientific visualization.
>
> Sounds fascinating. I do hope you will report how it goes and what method
> you use to achieve this.

We already have a 3x3 hyperwall running Gentoo, but that's a lot simpler than
what we will be building. If you are interested, our mini got a little
publicity a year ago in the Gentoo Weekly Newsletter:

http://www.gentoo.org/news/en/gwn/20051205-newsletter.xml#doc_chap2

We also just had a paper published in IEEE Transactions on Visualization and
Computer Graphics. The graphics cluster described is our 7x7, which is
rather old and still running Fedora Core 2.

http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=4015457

The new cluster will be designed to do more of what is described in that
paper.

-bryan

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Bryan Green @ 2006-12-02 17:57 UTC
To: gentoo-cluster

Donnie Berkholz writes:
> Bryan Green wrote:
> > I am looking for something of a survey of examples of Gentoo-driven
> > clusters out there. If such a survey has been done, perhaps someone
> > could point me to it.
>
> http://www.gentoo.org/proj/en/cluster/#doc_chap2

Whoa, I don't know how I missed that page. Thanks!

> Joel Martin has previously posted Lustre ebuilds to the list (for both
> client and server, I think). You may be interested. We'll want to get
> them into portage at some point, so there's no requirement that you use
> Suse server-side.

Yes, I actually used those ebuilds to test Lustre on our "mini" 3x3
hyperwall which runs Gentoo. I was able to get it working, but over here
they want the supported, released version, whereas those ebuilds are for the
beta. I tried to install the released version, but eventually ran into
problems. Also, since getting support from CFS is a requirement, that
restricts the OS choice to specific versions of Suse or Redhat.

The beta was supposed to become stable and supported in January, but that
has now apparently been pushed out to March. :(

The only chance of putting Gentoo on the nodes of this cluster is if we can
decide to go with the version that is still currently in beta. This is
because the beta, version 1.6, has a "patchless client", and so CFS is
agnostic about OS on the client side. For the server side, support=Suse as
far as anyone I've talked to is concerned.

> > I'd be grateful for any feedback I get from others on the list about the
> > clusters they maintain or use, and perhaps some comments about the
> > efficacy of Gentoo in an environment where stability is very important,
> > and how system administration compares to administration of a Suse or
> > Redhat cluster.
>
> The main difference is that, since we're "live," you need to consider
> how you want to deal with upgrades. You may wish to pick a static
> portage tree, import it into some sort of version control, and
> selectively import changes you want (probably just security bumps, which
> you can find using the wonderful glsa-check tool from gentoolkit).
>
> I've got a glsa-check wrapper that I use to make things a little easier,
> which shows and optionally applies applicable updates. I attached it.

I'd be very interested in the different approaches here. I had thought about
a static portage tree, but that left the problem of getting needed updates,
especially GLSAs. Your suggested approach sounds very interesting.
How big of an extra administrative burden does that create? Maintaining our
own version-controlled portage tree might be a hard sell. Thanks for the
script - I'll take a look at it. Is there any documentation out there about
a static portage tree?

-bryan

P.S.: I checked out SiCortex at SC06, and talked to one of the guys there.
It's definitely Gentoo. It sounds like they are a bunch of Gentoo
enthusiasts, actually.

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Philipp Riegger @ 2006-12-02 19:51 UTC
To: gentoo-cluster

On Dec 2, 2006, at 7:57 PM, Bryan Green wrote:
> I'd be very interested in the different approaches here. I had thought
> about a static portage tree, but that left the problem of getting needed
> updates, especially GLSAs. Your suggested approach sounds very interesting.
> How big of an extra administrative burden does that create? Maintaining our
> own version-controlled portage tree might be a hard sell. Thanks for the
> script - I'll take a look at it. Is there any documentation out there about
> a static portage tree?

On gentoo-dev there is a discussion going on about a sort of Gentoo stable
tree. Chris Gianelloni (if I remember correctly) stated that he wanted to
create a 2007.1 tree with the 2007.1 release and only put security fixes,
and the packages those fixes require, into it.

To be clearer: he wants to take a snapshot of the tree when 2007.1 is
released and then proceed as above.

_But_ there are something like 50 more unread messages of the thread in my
mailbox, so this may no longer be true. Look at the gentoo-dev archives.

Philipp

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Donnie Berkholz @ 2006-12-03 3:49 UTC
To: gentoo-cluster

Bryan Green wrote:
> Yes, I actually used those ebuilds to test Lustre on our "mini" 3x3
> hyperwall which runs Gentoo. I was able to get it working, but over here
> they want the supported, released version, whereas those ebuilds are for the
> beta. I tried to install the released version, but eventually ran into
> problems. Also, since getting support from CFS is a requirement, that
> restricts the OS choice to specific versions of Suse or Redhat.

I guess that means we should get in touch with them to get on the
supported systems list. =)

> I'd be very interested in the different approaches here. I had thought
> about a static portage tree, but that left the problem of getting needed
> updates, especially GLSAs. Your suggested approach sounds very interesting.
> How big of an extra administrative burden does that create? Maintaining our
> own version-controlled portage tree might be a hard sell. Thanks for the
> script - I'll take a look at it. Is there any documentation out there about
> a static portage tree?

The OSL (Open Source Lab), which hosts much of the Gentoo infrastructure
and runs a lot of other projects on Gentoo boxes, takes a similar
approach to what I mentioned above. I think you already know Corey
Shields, so you could ask him about it.

You may also want to take a look at
http://article.gmane.org/gmane.linux.gentoo.devel/43984 -- it's from one
of our developers who's deployed fairly decent-sized clusters.

Thanks,
Donnie

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Bryan Green @ 2006-12-04 18:15 UTC
To: gentoo-cluster

Donnie Berkholz writes:
> Bryan Green wrote:
> > Yes, I actually used those ebuilds to test Lustre on our "mini" 3x3
> > hyperwall which runs Gentoo. I was able to get it working, but over here
> > they want the supported, released version, whereas those ebuilds are for
> > the beta. I tried to install the released version, but eventually ran
> > into problems. Also, since getting support from CFS is a requirement,
> > that restricts the OS choice to specific versions of Suse or Redhat.
>
> I guess that means we should get in touch with them to get on the
> supported systems list. =)

Sounds like a fine idea to me. :)
I talked to someone there at SC06, but they did not sound terribly open to
the idea of directing precious resources in that direction. But it would be
great to convince them otherwise.

-bryan

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Donnie Berkholz @ 2006-12-04 20:53 UTC
To: gentoo-cluster

Bryan Green wrote:
> Donnie Berkholz writes:
>> I guess that means we should get in touch with them to get on the
>> supported systems list. =)
>
> Sounds like a fine idea to me. :)
> I talked to someone there at SC06, but they did not sound terribly open to
> the idea of directing precious resources in that direction. But it would be
> great to convince them otherwise.

We would probably need to show that there are a decent number of places
that would like to (or already do) use Lustre on Gentoo, and would be
interested in paid support. We've now got at least you and SiCortex
using it, but not sure about interest in support from SiCortex. Anyone else?

Thanks,
Donnie

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Hanni Ali @ 2006-12-04 22:35 UTC
To: gentoo-cluster

Ainkaboot and our customers would probably be interested in Lustre and paid
support.

Hanni

On 04/12/06, Donnie Berkholz <dberkholz@gentoo.org> wrote:
> Bryan Green wrote:
> > Donnie Berkholz writes:
> >> I guess that means we should get in touch with them to get on the
> >> supported systems list. =)
> >
> > Sounds like a fine idea to me. :)
> > I talked to someone there at SC06, but they did not sound terribly open
> > to the idea of directing precious resources in that direction. But it
> > would be great to convince them otherwise.
>
> We would probably need to show that there are a decent number of places
> that would like to (or already do) use Lustre on Gentoo, and would be
> interested in paid support. We've now got at least you and SiCortex
> using it, but not sure about interest in support from SiCortex. Anyone
> else?
>
> Thanks,
> Donnie

--
E-mail: hanni.ali@gmail.com
Mobile: 07985580147
Website: www.ainkaboot.co.uk

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Bryan Green @ 2006-12-04 23:00 UTC
To: gentoo-cluster

Donnie Berkholz writes:
> Bryan Green wrote:
> > Donnie Berkholz writes:
> >> I guess that means we should get in touch with them to get on the
> >> supported systems list. =)
> >
> > Sounds like a fine idea to me. :)
> > I talked to someone there at SC06, but they did not sound terribly open
> > to the idea of directing precious resources in that direction. But it
> > would be great to convince them otherwise.
>
> We would probably need to show that there are a decent number of places
> that would like to (or already do) use Lustre on Gentoo, and would be
> interested in paid support. We've now got at least you and SiCortex
> using it, but not sure about interest in support from SiCortex. Anyone else?

I wonder... They are going to be OS agnostic on the client side when 1.6
comes out, because of the "patchless client", i.e. the kernel on the client
side does not need to be patched.
On the server side, what is missing is a patched gentoo-sources or
vanilla-sources kernel. But we know that there is a lustre-kernel ebuild
out there. Depending on the issues involved, getting them to support Gentoo
may just be a matter of getting them to support the lustre-kernel package.

-bryan

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Daniel van Ham Colchete @ 2006-12-04 23:55 UTC
To: gentoo-cluster

I studied Lustre a little bit last week and, talking about MDSs and
OSSs, I came up with one reason for them not to support Lustre on
Gentoo: Lustre uses a lot of kernel features that, if not enabled, will
cause the kernel to crash.

I didn't find any documentation explaining those features, but I could
make a list of the obvious ones: LVM, DM, ext3, ...

I think that even they can't make a list of all those features, and that
is why they have to make Lustre available mainly on pre-compiled /
pre-configured kernels. And, thank God, Gentoo doesn't have a
predefined kernel, although that would make it easy for them to change
and distribute it.

What do you think about my idea?

But that leads to a more generic question: if Linux is always Linux
(the kernel), and the distro is only a way to organize packages, files
and init scripts, why would anyone need to restrict open source
software to a distro? If my first assumption is right, the quick (but
not necessarily well thought out) answer would be: lack of knowledge.

Best,
Daniel Colchete

On 12/4/06, Bryan Green <bgreen@nas.nasa.gov> wrote:
> I wonder... They are going to be OS agnostic on the client side when 1.6
> comes out, because of the "patchless client", i.e. the kernel on the client
> side does not need to be patched.
> On the server side, what is missing is a patched gentoo-sources or
> vanilla-sources kernel. But we know that there is a lustre-kernel ebuild
> out there. Depending on the issues involved, getting them to support Gentoo
> may just be a matter of getting them to support the lustre-kernel package.
>
> -bryan

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Donnie Berkholz @ 2006-12-05 3:55 UTC
To: gentoo-cluster

Daniel van Ham Colchete wrote:
> I studied Lustre a little bit last week and, talking about MDSs and
> OSSs, I came up with one reason for them not to support Lustre on
> Gentoo: Lustre uses a lot of kernel features that, if not enabled, will
> cause the kernel to crash.
>
> I didn't find any documentation explaining those features, but I could
> make a list of the obvious ones: LVM, DM, ext3, ...
>
> I think that even they can't make a list of all those features, and that
> is why they have to make Lustre available mainly on pre-compiled /
> pre-configured kernels. And, thank God, Gentoo doesn't have a
> predefined kernel, although that would make it easy for them to change
> and distribute it.
>
> What do you think about my idea?

It really shouldn't be that difficult to add in features until it stops
crashing, then specify those features as dependencies in the kernel
build system.

> But that leads to a more generic question: if Linux is always Linux
> (the kernel), and the distro is only a way to organize packages, files
> and init scripts, why would anyone need to restrict open source
> software to a distro? If my first assumption is right, the quick (but
> not necessarily well thought out) answer would be: lack of knowledge.

Sure. If they offer to support Lustre on a distribution, they need to be
able to fix problems on that distribution. That means being aware of
possible distribution-specific interactions that could cause issues and
also knowing how to deal with them as well as reproduce them locally.

Thanks,
Donnie
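
Once a list of required features has been worked out, checking a kernel
.config against it is mechanical. The Python sketch below shows one way to do
that, under stated assumptions: the option names are ordinary kernel config
symbols chosen to match the "DM, ext3" examples in the thread, and they are
illustrative only, not a list published by CFS or required by Lustre.

```python
# Illustrative only: NOT an official list of what Lustre needs.
REQUIRED_OPTIONS = ("CONFIG_EXT3_FS", "CONFIG_BLK_DEV_DM", "CONFIG_MD")


def parse_kernel_config(text: str) -> dict[str, str]:
    """Map option names to values from the text of a kernel .config file."""
    options: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def missing_options(config_text: str, required=REQUIRED_OPTIONS) -> list[str]:
    """Return the required options that are neither built in (y) nor modular (m)."""
    options = parse_kernel_config(config_text)
    return [name for name in required if options.get(name) not in ("y", "m")]
```

Running missing_options() over the contents of /usr/src/linux/.config (the
usual Gentoo location, assumed here) before building would flag a kernel
that lacks a listed feature instead of leaving it to be discovered as a crash.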

* [gentoo-cluster] examples of (large) Gentoo clusters
From: John R. Dunning @ 2006-12-05 13:18 UTC
To: gentoo-cluster

    From: "Daniel van Ham Colchete" <daniel.colchete@gmail.com>
    Date: Mon, 4 Dec 2006 21:55:12 -0200
    [...]
    Lustre uses a lot of kernel features that if not enabled will
    cause the kernel to crash.
    [...]

I don't think that's true.

I've been running lustre on assorted kernels, mostly under gentoo dists, for
some months, and found that once you get past the installation issues, it's
pretty trouble free.

Now, note the caveats there: The installation issues are non-trivial, mostly
because lustre is very intrusive into the vfs layer. This causes no end of
headaches integrating with various other people's kernel patches, due to
collisions with the other people's patches to vfs. That statement is as true
of gentoo as it is of any kernel other than the few for which they supply
canned patchsets. But I've never seen anything in there that constitutes
using a kernel feature which causes the kernel to crash if not enabled. The
closest thing I've seen to that is if you muff the patch merging and end up
with an inconsistent patchset, that generally leads to a crash :-}

Lustre 1.6 (at least the client end) doesn't even really *require* all those
kernel patches, i.e. they do support the idea of a patchless client. The
issue is that lustre changes the logic involved in various kinds of fs
operations, including anything related to lookups, so as to short-circuit
much of the work involved when it figures out that it can do so. Running the
client without the patches will work, but it won't give you the performance
that you'd get with the patches. So odds are anybody who's interested in
running lustre in the first place probably wants the patches too.

Lustre also is not restricted to precompiled kernels; their build script
contains patchsets for things other than their recommended redhat and suse
ones. We routinely compile it up for all kinds of experimental kernels with
no trouble. Again, once you've gotten over the hurdle of getting the kernel
patches integrated, the rest of it behaves reasonably well. The reason cfs
advocates the small number of kernels they do is because they know what a
pain it is to draw outside the lines, and they try to steer people away from
that.

We at sicortex are planning on rolling out a gentoo-based cluster that
depends heavily on lustre, so we've spent a fair bit of time banging on it.
I'm pretty sure we understand it at this point. We'll know for sure soon :-}

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Bryan Green @ 2006-12-05 16:25 UTC
To: gentoo-cluster

"John R. Dunning" writes:
> Lustre 1.6 (at least the client end) doesn't even really *require* all those
> kernel patches, i.e. they do support the idea of a patchless client. The
> issue is that lustre changes the logic involved in various kinds of fs
> operations, including anything related to lookups, so as to short-circuit
> much of the work involved when it figures out that it can do so. Running
> the client without the patches will work, but it won't give you the
> performance that you'd get with the patches. So odds are anybody who's
> interested in running lustre in the first place probably wants the patches
> too.

I hadn't realized that the patchless client was potentially lower-performance
than a patched client. Are you sure about that? How much of a difference do
you think it is? Are you using version 1.6 or 1.4?

> We at sicortex are planning on rolling out a gentoo-based cluster that
> depends heavily on lustre, so we've spent a fair bit of time banging on it.
> I'm pretty sure we understand it at this point. We'll know for sure soon :-}

Do you get support from CFS? It seems pretty clear that you do not.
What kernel versions do you use?

-bryan

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Daniel van Ham Colchete @ 2006-12-05 21:15 UTC
To: gentoo-cluster

On 12/5/06, John R. Dunning <jrd@sicortex.com> wrote:
>     From: "Daniel van Ham Colchete" <daniel.colchete@gmail.com>
>     Date: Mon, 4 Dec 2006 21:55:12 -0200
>     [...]
>     Lustre uses a lot of kernel features that if not enabled will
>     cause the kernel to crash.
>     [...]
>
> I don't think that's true.
>
> I've been running lustre on assorted kernels, mostly under gentoo dists, for
> some months, and found that once you get past the installation issues, it's
> pretty trouble free.
>
> Now, note the caveats there: The installation issues are non-trivial, mostly
> because lustre is very intrusive into the vfs layer. This causes no end of
> headaches integrating with various other people's kernel patches, due to
> collisions with the other people's patches to vfs. That statement is as true
> of gentoo as it is of any kernel other than the few for which they supply
> canned patchsets. But I've never seen anything in there that constitutes
> using a kernel feature which causes the kernel to crash if not enabled. The
> closest thing I've seen to that is if you muff the patch merging and end up
> with an inconsistent patchset, that generally leads to a crash :-}

Well, my first Lustre test was crashing on every 'write' operation.
Then I enabled LVM and it worked. I'm using only the vanilla 2.6.12.6
kernel with the latest 1.4 release.

I have another machine with the same kernel that crashes every time I
try to use Lustre over the network, either as a client or as a server.
Locally it works perfectly.

But I'm still trying to learn it and I think I still have to spend
plenty of time studying it :-).

> Lustre 1.6 (at least the client end) doesn't even really *require* all those
> kernel patches, i.e. they do support the idea of a patchless client.

That's a very good point.

> We at sicortex are planning on rolling out a gentoo-based cluster that
> depends heavily on lustre, so we've spent a fair bit of time banging on it.
> I'm pretty sure we understand it at this point. We'll know for sure soon :-}

Question: would you use Lustre 1.6 now, or would you wait until the
official version is out?

Question: do you expect an upgrade incompatibility between the current
1.6 beta and the next betas or the official version?

Best regards,
Daniel Colchete

* Re: [gentoo-cluster] examples of (large) Gentoo clusters
From: Bryan Green @ 2006-12-05 21:22 UTC
To: gentoo-cluster

"Daniel van Ham Colchete" writes:
> Question: do you expect an upgrade incompatibility between the current
> 1.6 beta and the next betas or the official version?

According to the person that I talked to at CFS, the beta is pretty much
finished, and they are waiting for tester feedback before releasing it as
1.6. However, I just learned that they pushed the release back from January
to March, so perhaps that means another beta release in between. I don't
know.

-bryan
* [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-05 21:15 ` Daniel van Ham Colchete 2006-12-05 21:22 ` Bryan Green @ 2006-12-05 21:28 ` John R. Dunning 2006-12-07 0:33 ` Bryan Green 1 sibling, 1 reply; 29+ messages in thread
From: John R. Dunning @ 2006-12-05 21:28 UTC
To: gentoo-cluster

    From: "Daniel van Ham Colchete" <daniel.colchete@gmail.com>
    Date: Tue, 5 Dec 2006 19:15:49 -0200

    Well, my first Lustre test was crashing on every 'write' operation. Then I
    enabled LVM and it worked. I'm using only the vanilla 2.6.12.6 kernel with
    the latest 1.4 release.

I'd say something's mangled in your kernel/patches. Perhaps due to 1.4; I went to 1.6 as soon as I was able to, and have no experience with the latest and greatest 1.4.

    Question: would you use Lustre 1.6 now, or would you wait until the
    official version is out?

If I had to ship today, I'd probably ship the 1.6b5 code. I find lustre 1.4 much more of a headache to configure and manage. Thankfully, I don't have to ship today; I expect by the time I do, cfs will have released the real 1.6 code.

    Question: do you expect an upgrade incompatibility between the current
    1.6 beta and the next betas or the official version?

What variety of incompatibility? On-disk format? On-the-wire format? Something else?

The short answer is no, in general the cfs guys seem to do a pretty good job at making that stuff backward compatible. Having said that, there was some kind of an incompatibility between 1.6b4 and 1.6b5. So I guess they don't get it right all the time :-}

The slightly longer answer is "ask cfs". I believe the answer you'll get is that they claim compatibility for one prior release, and that they make no claims about compatibility of beta code with anything else.

--
gentoo-cluster@gentoo.org mailing list
* Re: [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-05 21:28 ` John R. Dunning @ 2006-12-07 0:33 ` Bryan Green 2006-12-07 13:12 ` John R. Dunning 0 siblings, 1 reply; 29+ messages in thread
From: Bryan Green @ 2006-12-07 0:33 UTC
To: gentoo-cluster

"John R. Dunning" writes:
>     From: "Daniel van Ham Colchete" <daniel.colchete@gmail.com>
>     Date: Tue, 5 Dec 2006 19:15:49 -0200
>
>     Question: would you use Lustre 1.6 now or you would wait until the
>     official version is out?
>
> If I had to ship today, I'd probably ship the 1.6b5 code. I find lustre 1.4
> much more of a headache to configure and manage. Thankfully, I don't have to
> ship today; I expect by the time I do, cfs will have released the real 1.6
> code.

It is encouraging to hear that you are willing to base a product on Lustre 1.6.

Are you by any chance willing to share some of your knowledge about installing Lustre on Gentoo with others? :) Perhaps I could make self-support an option, if it looked like it would be reliable.

-bryan

--
gentoo-cluster@gentoo.org mailing list
* [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-07 0:33 ` Bryan Green @ 2006-12-07 13:12 ` John R. Dunning 2006-12-08 3:56 ` Bryan Green 0 siblings, 1 reply; 29+ messages in thread From: John R. Dunning @ 2006-12-07 13:12 UTC (permalink / raw To: gentoo-cluster From: Bryan Green <bgreen@nas.nasa.gov> Date: Wed, 06 Dec 2006 16:33:12 -0800 "John R. Dunning" writes: > From: "Daniel van Ham Colchete" <daniel.colchete@gmail.com> > Date: Tue, 5 Dec 2006 19:15:49 -0200 > > Question: would you use Lustre 1.6 now or you would wait until the > official version is out? > > If I had to ship today, I'd probably ship the 1.6b5 code. I find lustre 1.4 > much more of a headache to configure and manage. Thankfully, I don't have to > ship today; I expect by the time I do, cfs will have released the real 1.6 > code. It is encouraging to hear that you are willing to base a product on Lustre 1.6. There are problems either way, but based on my experience, I believe 1.6 is a better choice, at least for the kind of situation I'm expecting to see. That's based partly on the fact that in my testing I've seen a pretty small quotient of out-and-out bugs (though there are a couple which are pretty annoying) and partly on the fact that configuration and management-wise, 1.6 is way easier to deal with. Part of what I expect will be happening in deployments is to be building lustrefs's on the fly, under control of some kind of configurator thingie. For that kind of task, 1.4 would be much more difficult to deal with. We have a test gentoo cluster system which runs with lustre as its rootfs. It essentially "just works". I've run numerous benchmarks and tests on it, including bonnie, iozone, ltp, and assorted bits of application code; for the most part it's been trouble-free, and the performance is generally pretty good. There are a few areas where, due to the properties of lustre, things run unexpectedly slow, but for my purposes, they're all things that can be lived with. 
What I conclude from all that is that it's good enough for me to consider shipping it as part of a product while still being able to sleep at night :-} Are you by any chance willing to share some of your knowledge about installing Lustre on Gentoo with others? :) Sure. Are you worrying about the kernel patching and other software installation issues, or about how to set up the fs itself once you've got the software together? Very briefly, the kernel-patching issue is an ongoing headache. Lustre patches vfs in non-trivial ways. Unfortunately, everybody else does too. It becomes a fairly ugly patch-merging problem. If you want, I can detail the process I've settled on for coming up with a kernel patchset, but you won't like it. There are similar issues around ldiskfs and other bits, but they're simpler, at least by comparison. Once the software is installed, configuring the fs goes pretty much by the book. mkfs.lustre, mount -t lustre, lfs, and lctl are your friends. You'll have some work to do deciding what your architecture is, in terms of how many OSTs of what type, what's the interconnect topology which will get you the best throughput etc, but there aren't really any landmines in there. I've only worked with the failover stuff a small amount, so can't really say a lot about that, but the time I did play with it, it seemed to work as advertised. If you are looking for more detail on something specific, I'm happy to say what little I know about it. Perhaps I could make self-support an option, if it looked like it would be reliable. Well, obviously, you should test the bejeezus out of your configuration before you declare open season on it. So far I haven't found reason to believe lustre is substantially worse than any of the other open-source software packages which are used in production situations. I think that constitutes a qualified "yes" :-} -- gentoo-cluster@gentoo.org mailing list ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-07 13:12 ` John R. Dunning @ 2006-12-08 3:56 ` Bryan Green 2006-12-08 14:18 ` John R. Dunning 0 siblings, 1 reply; 29+ messages in thread
From: Bryan Green @ 2006-12-08 3:56 UTC
To: gentoo-cluster

"John R. Dunning" writes:
>     From: Bryan Green <bgreen@nas.nasa.gov>
>
>     It is encouraging to hear that you are willing to base a product on Lustre
>     1.6.
>
> There are problems either way, but based on my experience, I believe 1.6 is a
> better choice, at least for the kind of situation I'm expecting to see.
> That's based partly on the fact that in my testing I've seen a pretty small
> quotient of out-and-out bugs (though there are a couple which are pretty
> annoying) and partly on the fact that configuration and management-wise, 1.6
> is way easier to deal with. Part of what I expect will be happening in
> deployments is to be building lustrefs's on the fly, under control of some
> kind of configurator thingie. For that kind of task, 1.4 would be much more
> difficult to deal with.
>

From my limited experience with 1.6, and even more limited experience with 1.4, I wholeheartedly agree with your assessment. Version 1.4 looks like a real headache to configure. By comparison, 'mount -t lustre' pretty much characterizes the simplicity of 1.6.

>
>     Are you by any chance willing to share some of your knowledge about
>     installing Lustre on Gentoo with others? :)
>
> Sure.
>
> Are you worrying about the kernel patching and other software installation
> issues, or about how to set up the fs itself once you've got the software
> together?

Kernel patching.

For software installation, the lustre ebuild that was put on this list recently seemed to do the trick for me, and setup was pretty easy. I was able to patch the kernel, but the server was somewhat unstable.

Actually, my memory is hazy. I used the 'lustre-sources' ebuild, which effectively packaged up the patches. It was a 2.6.15 kernel. I also tried to make a custom kernel for lustre 1.4, but ultimately hit too many roadblocks. I did learn a bit about how to use 'quilt' though.

>
> Very briefly, the kernel-patching issue is an ongoing headache. Lustre
> patches vfs in non-trivial ways. Unfortunately, everybody else does too. It
> becomes a fairly ugly patch-merging problem. If you want, I can detail the
> process I've settled on for coming up with a kernel patchset, but you won't
> like it. There are similar issues around ldiskfs and other bits, but they're
> simpler, at least by comparison.

I'd be interested in some of the details - off-list if that is more appropriate, though it might be of interest to others on the list as well. Once you download a 1.6 beta, how do you produce a kernel for Gentoo? Do you patch a gentoo-sources kernel, a vanilla-sources kernel, or something else? The ideal would perhaps be to have a 'lustre-sources' ebuild in the gentoo-science overlay. :)

>
>     Perhaps I could make
>     self-support an option, if it looked like it would be reliable.
>
> Well, obviously, you should test the bejeezus out of your configuration before
> you declare open season on it. So far I haven't found reason to believe
> lustre is substantially worse than any of the other open-source software
> packages which are used in production situations. I think that constitutes a
> qualified "yes" :-}

Are you considering getting support from CFS at some point? Sorry, you don't have to answer if that is a sensitive question. But part of this thread has been the topic of encouraging CFS to support Gentoo.

Interestingly, my colleague, who is in charge of installing Lustre (1.4) on our test system, is talking to CFS about supporting a vanilla kernel configuration. The reason? We can't make the system stable with a SLES kernel. It was stable for a long time with Gentoo. Now they seem to have gotten it stable with SLES plus a vanilla 2.6.19 kernel (which of course does not have the Lustre patches). So they want Suse to provide a newer SLES kernel with the Lustre patches, and CFS to support that configuration.

-bryan

--
gentoo-cluster@gentoo.org mailing list
* [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-08 3:56 ` Bryan Green @ 2006-12-08 14:18 ` John R. Dunning 2006-12-08 17:15 ` Bryan Green ` (2 more replies) 0 siblings, 3 replies; 29+ messages in thread From: John R. Dunning @ 2006-12-08 14:18 UTC (permalink / raw To: gentoo-cluster From: Bryan Green <bgreen@nas.nasa.gov> Date: Thu, 07 Dec 2006 19:56:46 -0800 [...] By comparison, 'mount -t lustre' pretty much characterizes the simplicity of 1.6. Agreed. > Are you worrying about the kernel patching and other software installation > issues, or about how to set up the fs itself once you've got the software > together? Kernel patching. For software installation, the lustre ebuild that was put on this list recently seemed to do the trick for me, and setup was pretty easy. Yeah, I think that ebuild came from us. I was able to patch the kernel, but the server was somewhat unstable. Do you remember how it was unstable? That's the kind of thing I'd very much like to understand, as we're proposing to depend heavily on it. If there are issues, whether specifically tied to our patches or not, I'd love to know about them. Actually, my memory is hazy. I used the 'lustre-sources' ebuild, which effectively packaged up the patches. It was a 2.6.15 kernel. I also tried to make a custom kernel for lustre 1.4, but ultimately hit too many roadblocks. I did learn a bit about how to use 'quilt' though. Hmmm. Maybe not. Our stuff ditches quilt. > > Very briefly, the kernel-patching issue is an ongoing headache. Lustre > patches vfs in non-trivial ways. Unfortunately, everybody else does too. It > becomes a fairly ugly patch-merging problem. If you want, I can detail the > process I've settled on for coming up with a kernel patchset, but you won't > like it. There are similar issues around ldiskfs and other bits, but they're > simpler, at least by comparison. 
I'd be interested in some of the details - off-list if that is more appropriate, though it might be of interest to others on the list as well. Once you download a 1.6 beta, how do you produce a kernel for Gentoo? Do you patch a gentoo-sources kernel, a vanilla-sources kernel, or something else? The ideal would perhaps be to have a 'lustre-sources' ebuild in the gentoo-science overlay. :) We can start here and if people get sick of hearing about it, take it someplace else. The approach taken by most of the patches in lustre/kernel_patches/patches is, for any particular base kernel, go through and add the datastructures and logic to implement the lustre-specific functionality, which involves changes to core vfs datastructures, sometimes changes in locking strategy, changes to arglists etc. They generally start with RHEL or SLES kernels. There are a couple of problems for the rest of us with that; (a) the RHEL and SLES kernels tend to be a bit antiquated, and (b) the vendors also tend to make quite free with the patches to core datastructures. Some of the latter is actually due to the former; because they're using antique kernels, but they want some bits of the latest and greatest fixes, they selectively import more modern stuff as their own vendor patches. The result of all this is the layer of patches to implement lustre functionality, when viewed from the point of view of an unpatched kernel, makes no sense at all. If you try to install such a patchset on a vanilla-ish kernel, even if you get the right base version, you'll get tons of rejects, and when you look at them, it's obvious that they depend on stuff that's not there. The way I settled on getting to a patchset which doesn't depend on all kinds of RHEL or SLES was to essentially build a RHEL (fc5, if I recall) kernel, then "subtract out" the RHEL-ness, then take the resultant kernel and diff it against a virgin one. That description covers a multitude of sins. 
Subtracting out the RHEL-ness (by essentially doing patch -R, then cleaning up the mess) has the inverse of many of the same problems that you get trying to patch lustre on top of a vanilla kernel; arglists don't match etc. The only piece of good news is that you at that point have three datapoints to work with; vanilla, RHEL, and RHEL+lustre, so it's rather easier (though not exactly easy) to divine what the intention of the lustre patches is, and work out how to do the analogous thing without RHEL. Even at that, I had to be wary of some bits of code which disappeared in the RHEL transition, but came back when I backed out RHEL, which needed to be given the same treatment as other analogous bits of code which were still there. The bottom line is that you have to understand enough about what the lustre patches are accomplishing that you can come up with analogous patches for the kernel of your choice, which happens not to be one of the ones cfs ships patchsets for. The first time I did all that stuff it took something like 3-4 weeks, with numerous false starts. The most recent time I did it, it was something like a couple of days, though that's misleading, because it was very close to the previous version I was upgrading from. If I had to do it today, starting from scratch, I'd estimate 3-5 days. You'll note that nowhere in that set of stuff did I utter the word "gentoo". The kernel we're using is not really a gentoo kernel. We're mips-based, so we're starting from something that's perilously close to the mainline linux-mips kernel, then building it up from there. Thankfully, the linux-mips guys don't go in for heavy-duty patching of non-platform-related stuff, so from the point of view of adding lustre to it, it's virtually identical to a vanilla kernel.org kernel. I believe we may have pulled in a small number of the gentoo kernel patches, but I'm not the kernel wizard, so don't know off the top of my head. From my point of view, it looks vanilla. 
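The three-datapoint comparison described above (vanilla, RHEL, RHEL+lustre) can be illustrated with a toy sketch. It is only an illustration of the idea, not the process or scripts actually used: it assumes each tree is reduced to a list of source lines for one file, and the function name `classify_lustre_hunks` and the sample struct fields are hypothetical. Lines that lustre added next to code that also exists in vanilla are likely to carry over directly; lines added next to vendor-only code are the ones that need an analogous change worked out by hand.

```python
import difflib


def classify_lustre_hunks(vanilla, rhel, rhel_lustre):
    """Split lustre-added lines by whether their surroundings exist in vanilla.

    All three arguments are lists of source lines for the same file.
    Returns (portable, needs_porting), each a list of added lines.
    """
    vanilla_lines = set(vanilla)
    portable, needs_porting = [], []
    matcher = difflib.SequenceMatcher(a=rhel, b=rhel_lustre, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        added = rhel_lustre[j1:j2]
        if tag == "equal" or not added:
            continue  # nothing was added by lustre in this region
        # Context is the RHEL line before the change, any RHEL lines that
        # were replaced, and the RHEL line after the change.
        context = rhel[max(i1 - 1, 0):i2 + 1]
        if all(line in vanilla_lines for line in context):
            portable.extend(added)
        else:
            needs_porting.extend(added)
    return portable, needs_porting


# Hypothetical miniature "trees": one struct, three variants.
vanilla = ["struct inode {", "    int i_mode;", "    int i_size;", "};"]
rhel = ["struct inode {", "    int i_mode;", "    int i_vendor_flag;",
        "    int i_size;", "};"]
rhel_lustre = ["struct inode {", "    int i_lustre_a;", "    int i_mode;",
               "    int i_vendor_flag;", "    int i_lustre_b;",
               "    int i_size;", "};"]

portable, needs_porting = classify_lustre_hunks(vanilla, rhel, rhel_lustre)
print("portable:", portable)
print("needs porting:", needs_porting)
```

In this miniature case the first added field sits between lines that vanilla also has, while the second sits next to a vendor-only field, which is the kind of reject that "depends on stuff that's not there".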
It's not clear to me how you'd go about making lustre installation work in a more gentoo-ish kind of way, at least not without a very large amount of work. I guess I think the most likely path forward would be to work with cfs to try to get them to support more vanilla kernels, then try to work on the rest of the gentoo kernel patches to make them fit better. Unfortunately, I suspect that that still isn't going to be easy, as you've got the classic patch-collision problem happening all over the place. I suspect that following that approach would end up with two parallel streams of patches, one for lustre kernels and one for non-lustre kernels. Unless you can get the gentoo community to roll lustre in as a standard part of the gentoo patchset. That probably requires that somebody do a lustre patchset for every kernel version. Unlikely. You could, of course, invert the problem and layer lustre on top, but until such time as gentoo is much more prevalent, I doubt you'll get cfs to do that, which means that somebody in the gentoo community gets signed up for the task of re-doing the process I outlined above, for every gentoo kernel which comes down the pike. I'm not holding my breath for that one either. A longer term solution is to do some combination of remodularizing vfs and recasting the lustre stuff so as to depend less on getting its fingers into the guts. I once spent some time looking into that, and I do believe it's possible, but it would take some work, and would really need to be done in concert with the rest of the core kernel guys, and I ran out of time to pursue it. In the meantime, the more the gentoo community can resist the temptation to patch the kernel (at least the vfs parts of it), the easier it will be to add lustre. Separate from the core kernel patching issues (Hah! you thought I was done, didn't you?) there's stuff around ldiskfs. 
The strategy used by lustre is to grab a copy of ext3, cart it off to the side, change all the names, insert a few other strategically placed patches, and call it ldiskfs. That then becomes the basic facility by which actual bits are stored on block devices. The issue there is roughly similar to the core kernel, but not as severe, ie any given patchset depends heavily on which specific version of ext3 you started from. Update the kernel, and if it contained fixes to ext3, you've got a problem. In practice, this issue tends to be swamped by the core kernel one, ie getting lustre going on a specific kernel binds you so tightly to that kernel that you don't have to worry too much about changing ext3. But at such time as the kernel integration issue becomes easier to deal with, this one will have to be addressed as well. My preferred solution would be to simply snag a copy of ext3 that works, do the foozling once, then make that code be a permanent part of the lustre distribution, rather than relying on constructing it on the fly. But that's up to cfs. So anyhow, the short answer is that there's no real rocket science involved in getting lustre to work on a gentoo system, but it does take some work, and if you do it the way I did it, you end up with a system which is more constrained than a normal gentoo system, because you're no longer free to update the kernel using the stock tools. For us it's not a huge deal, but I suspect that some of the gentoo community will balk at that. [...] Are you considering getting support from CFS at some point? We are working with cfs. That doesn't mean they're doing all our work for us :-} Honestly, a big part of it is just plain old market sensitivity. Cfs is paying attention to where their bread and butter is. So far, that's not gentoo. Perhaps if sicortex is wildly successful we'll be able to change that equation :-} Sorry, you don't have to answer if that is a sensitive question. 
But part of this thread has been the topic of encouraging CFS to support Gentoo. Interestingly, my colleague, who is in charge of installing Lustre (1.4) on our test system, is talking to CFS about supporting a vanilla kernel configuration. The reason? We can't make the system stable with a SLES kernel. It was stable for a long time with Gentoo. I have not observed stability problems; it pretty much just works. If you can say any more about what issues you ran across, I'd love to hear it. Now they seem to have gotten it stable with SLES plus a vanilla 2.6.19 kernel (which of course does not have the Lustre patches). So they want Suse to provide a newer SLES kernel with the Lustre patches, and CFS to support that configuration. Well, ok, I dunno what to tell you about working with the vendors on that one. We actually did consider running RHEL or SLES kernels, but remember we're mips, and looking at the state of the mips support in those kernels, it was not a pretty picture. We also didn't really want to be in the game of having that much of a frankenstein system. So our approach has boiled down to 1. Stick close to vanilla 2. Make mips work 3. Do whatever we need to do to make lustre layer on top of that Based on what you've said, I wouldn't fool around with SLES, I'd just figure out what close-to-vanilla kernel you want to start from (picking one you think you can live with for a while) and do some part of what I described above. You might have a somewhat easier time of it if you started with 2.6.18, as I believe there's a cfs-supplied patchset for that one. If you want to start from a gentoo 2.6.18 one, I suspect your task will be to start with vanilla, make that work, then work out how to re-apply the gentoo patches. 
Re getting cfs to help, my bet would be that you'll have an easier time getting the gentoo community to create patches that are amenable to going on top of a lustre-ized vanilla kernel (and relying on cfs to support vanilla kernels) than you will getting cfs to generate patches to go on top of gentoo. If you watch the lustre lists, you'll see more people asking for vanilla than are asking for gentoo. Under no circumstances would I advocate getting a kernel working at some level, then trying to use the kernel.org patches, or anybody else's, to move it forward. I tried that a few times, and while I actually did find a couple of combinations that worked, most of the ones I tried blew up in my face. It's the same problem; there's all kinds of activity going on in vfs. I hope that situation doesn't continue indefinitely, but that's the way it seems to be right now. I've gone on long enough for now. Feel free to dig deeper if you dare :-} -- gentoo-cluster@gentoo.org mailing list ^ permalink raw reply [flat|nested] 29+ messages in thread
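The ldiskfs construction mentioned earlier in this message (take a copy of ext3, change all the names, then patch) can be sketched in a few lines. This is a hedged illustration of the renaming step only, under the assumption that a plain case-preserving text substitution is enough; the function name `rename_ext3_to_ldiskfs` and the sample source lines are hypothetical, and the real lustre build also applies its own patches on top.

```python
import re


def rename_ext3_to_ldiskfs(source: str) -> str:
    """Rewrite ext3 names to ldiskfs names, keeping upper or lower case."""
    def swap(match):
        word = match.group(0)
        return "LDISKFS" if word.isupper() else "ldiskfs"

    return re.sub(r"ext3|EXT3", swap, source)


# Hypothetical fragment of an ext3 source file.
sample = (
    "static int ext3_readdir(struct file *f);\n"
    "#define EXT3_NAME_LEN 255\n"
)
renamed = rename_ext3_to_ldiskfs(sample)
print(renamed)
```

Because the result is derived from whatever ext3 the kernel tree happens to contain, any later patch against the renamed copy is tied to that exact ext3 version, which is the dependency described above.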
* Re: [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-08 14:18 ` John R. Dunning @ 2006-12-08 17:15 ` Bryan Green 2006-12-11 13:38 ` John R. Dunning 2006-12-08 18:29 ` Donnie Berkholz 2006-12-11 19:44 ` Bryan Green 2 siblings, 1 reply; 29+ messages in thread
From: Bryan Green @ 2006-12-08 17:15 UTC
To: gentoo-cluster

"John R. Dunning" writes:
>     From: Bryan Green <bgreen@nas.nasa.gov>
>
>     I was able to patch the kernel, but the server was somewhat unstable.
>
> Do you remember how it was unstable? That's the kind of thing I'd very much
> like to understand, as we're proposing to depend heavily on it. If there are
> issues, whether specifically tied to our patches or not, I'd love to know
> about them.

I remember the system was stable until we tried to shut it down. It would lock up while shutting down, possibly while unmounting filesystems. I also did not get to do extensive testing of the system, so I don't know if it would have been stable under real use of the Lustre filesystem.

>     I also tried to make a custom kernel for
>     lustre 1.4, but ultimately hit too many roadblocks. I did learn a bit about how
>     to use 'quilt' though.
>
> Hmmm. Maybe not. Our stuff ditches quilt.

I just used quilt when working with 1.4. I did not have an ebuild for that.

-bryan

--
gentoo-cluster@gentoo.org mailing list
* [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-08 17:15 ` Bryan Green @ 2006-12-11 13:38 ` John R. Dunning 0 siblings, 0 replies; 29+ messages in thread From: John R. Dunning @ 2006-12-11 13:38 UTC (permalink / raw To: gentoo-cluster From: Bryan Green <bgreen@nas.nasa.gov> Date: Fri, 08 Dec 2006 09:15:54 -0800 "John R. Dunning" writes: > From: Bryan Green <bgreen@nas.nasa.gov> > > I was able to patch the kernel, but the server was somewhat unstable. > > Do you remember how it was unstable? That's the kind of thing I'd very much > like to understand, as we're proposing to depend heavily on it. If there are > issues, whether specifically tied to our patches or not, I'd love to know > about them. I remember the system was stable until we tried to shut it down. Ah. I have observed lustre to get cranky when you try to boot the system out from under it. In particular, if you go to all the servers and /sbin/shutdown without first shutting down lustre, I've seen it hang. Given the nature of lustre, that didn't surprise me a lot :-} The times I've shut down lustre in the correct order (shut down clients, then oss's, then mds, then mgs) it's always behaved itself. It would lock up while shutting down, possibly while unmounting filesystems. I also did not get to do extensive testing of the system, so I don't know if it would have been stable under real use of the Lustre filesystem. Ok. Like I said, I've found a few bugs, but I've never seen it act unstable in real use. > I also tried to make a custom kernel for > lustre 1.4, but ultimately hit too many roadblocks. I did learn a bit about how > to use 'quilt' though. > > Hmmm. Maybe not. Our stuff ditches quilt. I just used quilt when working with 1.4. I did not have an ebuild for that. I could never get quilt to work so just ditched it. You don't need it anyhow, if you ./configure blah-blah --disable-quilt, it works fine. 
I imagine if you were doing core development on lustre, in particular trying to actually build the large collection of patches they ship with it, quilt would be handy, but for just trying to get the kernel patched, my scripts skip it. -- gentoo-cluster@gentoo.org mailing list ^ permalink raw reply [flat|nested] 29+ messages in thread
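The shutdown order described in this message (clients first, then the OSSs, then the MDS, then the MGS) can be captured in a small planning helper. This is a minimal sketch under stated assumptions: the hostnames, the role labels, and the function name `shutdown_plan` are all hypothetical, and the code only computes the order in which to stop lustre; it does not unmount or stop anything itself.

```python
# Order taken from the message above: clients, then oss's, then mds, then mgs.
SHUTDOWN_ORDER = ["client", "oss", "mds", "mgs"]


def shutdown_plan(nodes):
    """Return hostnames in the order lustre should be stopped on them.

    `nodes` maps hostname -> role, where role is one of SHUTDOWN_ORDER.
    Hosts with the same role are ordered by name so the plan is stable.
    """
    rank = {role: index for index, role in enumerate(SHUTDOWN_ORDER)}
    unknown = sorted(host for host, role in nodes.items() if role not in rank)
    if unknown:
        raise ValueError("unknown role for: " + ", ".join(unknown))
    return sorted(nodes, key=lambda host: (rank[nodes[host]], host))


# Hypothetical cluster layout.
nodes = {
    "mgs0": "mgs",
    "oss1": "oss",
    "node07": "client",
    "mds0": "mds",
    "oss0": "oss",
    "node03": "client",
}
plan = shutdown_plan(nodes)
print(plan)
```

Rejecting unknown roles up front is a deliberate choice here: a host left out of the plan would be exactly the kind of server that gets shut down out from under lustre.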
* Re: [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-08 14:18 ` John R. Dunning 2006-12-08 17:15 ` Bryan Green @ 2006-12-08 18:29 ` Donnie Berkholz 2006-12-08 18:52 ` Bryan Green 2006-12-08 19:10 ` John R. Dunning 2006-12-11 19:44 ` Bryan Green 2 siblings, 2 replies; 29+ messages in thread From: Donnie Berkholz @ 2006-12-08 18:29 UTC (permalink / raw To: gentoo-cluster [-- Attachment #1: Type: text/plain, Size: 2216 bytes --] John R. Dunning wrote: > From: Bryan Green <bgreen@nas.nasa.gov> > A longer term solution is to do some combination of remodularizing vfs and > recasting the lustre stuff so as to depend less on getting its fingers into > the guts. I once spent some time looking into that, and I do believe it's > possible, but it would take some work, and would really need to be done in > concert with the rest of the core kernel guys, and I ran out of time to pursue > it. In the meantime, the more the gentoo community can resist the temptation > to patch the kernel (at least the vfs parts of it), the easier it will be to > add lustre. > Based on what you've said, I wouldn't fool around with SLES, I'd just figure > out what close-to-vanilla kernel you want to start from (picking one you think > you can live with for a while) and do some part of what I described above. > You might have a somewhat easier time of it if you started with 2.6.18, as I > believe there's a cfs-supplied patchset for that one. If you want to start > from a gentoo 2.6.18 one, I suspect your task will be to start with vanilla, > make that work, then work out how to re-apply the gentoo patches. Re getting > cfs to help, my bet would be that you'll have an easier time getting the > gentoo community to create patches that are amenable to going on top of a > lustre-ized vanilla kernel (and relying on cfs to support vanilla kernels) > than you will getting cfs to generate patches to go on top of gentoo. 
If you > watch the lustre lists, you'll see more people asking for vanilla than are > asking for gentoo. I've just got a couple comments on this. * The gentoo-sources patches are almost all upstream and based on the W.X.Y.Z "stable release" patchsets, except for 2-3 cases that probably aren't relevant to clusters. As a result, Gentoo folks would be reasonably well off just running vanilla-sources, which groups them in with everyone else wanting Lustre on a vanilla kernel. * Normally I would recommend hardened-sources for anything resembling a server, but you should have all your nodes and file servers blocked off from the Internet anyway so that's a non-issue. Thanks, Donnie [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 249 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-08 18:29 ` Donnie Berkholz @ 2006-12-08 18:52 ` Bryan Green 2006-12-08 19:10 ` John R. Dunning 1 sibling, 0 replies; 29+ messages in thread
From: Bryan Green @ 2006-12-08 18:52 UTC
To: gentoo-cluster

Donnie Berkholz writes:
>
> * The gentoo-sources patches are almost all upstream and based on the
> W.X.Y.Z "stable release" patchsets, except for 2-3 cases that probably
> aren't relevant to clusters. As a result, Gentoo folks would be
> reasonably well off just running vanilla-sources, which groups them in
> with everyone else wanting Lustre on a vanilla kernel.

Yes, using vanilla-sources definitely sounds like the way forward.

(More comments to come... I'm still digesting)

-bryan

--
gentoo-cluster@gentoo.org mailing list
* [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-08 18:29 ` Donnie Berkholz 2006-12-08 18:52 ` Bryan Green @ 2006-12-08 19:10 ` John R. Dunning 1 sibling, 0 replies; 29+ messages in thread From: John R. Dunning @ 2006-12-08 19:10 UTC (permalink / raw To: gentoo-cluster From: Donnie Berkholz <dberkholz@gentoo.org> Date: Fri, 08 Dec 2006 10:29:00 -0800 [...] I've just got a couple comments on this. * The gentoo-sources patches are almost all upstream and based on the W.X.Y.Z "stable release" patchsets, except for 2-3 cases that probably aren't relevant to clusters. As a result, Gentoo folks would be reasonably well off just running vanilla-sources, which groups them in with everyone else wanting Lustre on a vanilla kernel. Fair enough. If the solution to how to run lustre on a gentoo system starts with "run a vanilla kernel, not a gentoo kernel", that makes things considerably easier. It doesn't make all the problems go away, but at least you've removed one of the uglier dimensions from the task :-} * Normally I would recommend hardened-sources for anything resembling a server, but you should have all your nodes and file servers blocked off from the Internet anyway so that's a non-issue. Yes. I expect that most of our machines will not be having ports open on the big-I internet, at least in the early days. Later, that may change, and those network-security patches will be of more immediate interest to us. Of somewhat more interest, even early on, are patches for file-system security. Last I paid attention to it, the plan was to pull in that class of patch on an ad-hoc basis as we decide they're justified. -- gentoo-cluster@gentoo.org mailing list ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-08 14:18 ` John R. Dunning 2006-12-08 17:15 ` Bryan Green 2006-12-08 18:29 ` Donnie Berkholz @ 2006-12-11 19:44 ` Bryan Green 2 siblings, 0 replies; 29+ messages in thread From: Bryan Green @ 2006-12-11 19:44 UTC (permalink / raw To: gentoo-cluster "John R. Dunning" writes: > > We are working with cfs. That doesn't mean they're doing all our work for us > :-} > > Honestly, a big part of it is just plain old market sensitivity. Cfs is > paying attention to where their bread and butter is. So far, that's not > gentoo. Perhaps if sicortex is wildly successful we'll be able to change that > equation :-} > > ... > > So our approach has boiled down to > > 1. Stick close to vanilla > 2. Make mips work > 3. Do whatever we need to do to make lustre layer on top of that > > Based on what you've said, I wouldn't fool around with SLES, I'd just figure > out what close-to-vanilla kernel you want to start from (picking one you think > you can live with for a while) and do some part of what I described above. > You might have a somewhat easier time of it if you started with 2.6.18, as I > believe there's a cfs-supplied patchset for that one. Thank you for your extensive feedback. :) I think you've done a pretty good job of showing the complications involved in putting together your own lustre kernel. It sounds gnarly. What I take away from this is the need for a vanilla kernel patchset from CFS, preferably for 2.6.18 or higher. If there is already at least the basis for a vanilla 2.6.18 patchset in the current beta, that could be the starting point for an ebuild that would be of use for a while. It's been a while since I last tried running Lustre. Perhaps it is time I tried building a 2.6.18 lustre-ized kernel. It depends on how good the patchset provided by CFS is. I don't have the bandwidth to go through the extensive process that you did to get a working kernel.
-bryan -- gentoo-cluster@gentoo.org mailing list ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [gentoo-cluster] examples of (large) Gentoo clusters 2006-12-01 23:16 [gentoo-cluster] examples of (large) Gentoo clusters Bryan Green 2006-12-02 3:48 ` Donnie Berkholz @ 2006-12-02 14:40 ` Nick Anderson 1 sibling, 0 replies; 29+ messages in thread From: Nick Anderson @ 2006-12-02 14:40 UTC (permalink / raw To: gentoo-cluster Unfortunately we haven't done a gentoo cluster, though with portage, maintenance would be easy. I did want to comment on your processor choice. If you choose quad core, your only option currently is Intel Clovertown. These machines are power hungry. I would recommend waiting for the opteron quad core chips that will be coming out in 07. If you're going dual core I would recommend opterons. They don't require the fully buffered dimms, which from our testing seem to draw about 15W per dimm. The other thing to consider is the characteristics of your code. What we have seen from the Intel cpus is that jobs run quite well if they are serial jobs; however, when running parallel jobs the opterons still win out. Just some food for thought .... On Friday 01 December 2006 17:16, Bryan Green wrote: > Hello all, > > I am looking for something of a survey of examples of Gentoo-driven > clusters out there. If such a survey has been done, perhaps someone point > me to it. But I would like to hear from others on the list about their > clusters. > > I am in the process of advocating for using Gentoo on a new cluster that we > will be building. The cluster will be a "hyperwall", meaning that each > node will have graphics, forming a grid of displays for multi-parameter, > multi-dimensional scientific visualization. There will also be several > disk servers which will run Suse in order to get Lustre support (Lustre > support on the client side will be OS-neutral when the current beta is > officially released).
In addition to graphics, the nodes will also be used > for compute jobs (scientific), and may serve as a testbed for a production > scientific computing environment. > > In the process of making my case, I've been asked what other examples there > are of large Gentoo clusters. This cluster will be 128 nodes (dual socket, > dual or quad core). Of particular interest are production and/or > scientific environments - not so much database clusters, though all > examples are of interest. Use of MPI is particularly relevant. Graphics > clusters are also of interest of course. > > I'd be grateful for any feedback I get from others on the list about the > clusters they maintain or use, and perhaps some comments about the efficacy > of Gentoo in an environment where stability is very important, and how > system administration compares to administration of a Suse or Redhat > cluster. > > Thanks, > -bryan -- gentoo-cluster@gentoo.org mailing list ^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2006-12-11 19:46 UTC | newest] Thread overview: 29+ messages 2006-12-01 23:16 [gentoo-cluster] examples of (large) Gentoo clusters Bryan Green 2006-12-02 3:48 ` Donnie Berkholz 2006-12-02 13:54 ` Hanni Ali 2006-12-02 18:12 ` Bryan Green 2006-12-02 17:57 ` Bryan Green 2006-12-02 19:51 ` Philipp Riegger 2006-12-03 3:49 ` Donnie Berkholz 2006-12-04 18:15 ` Bryan Green 2006-12-04 20:53 ` Donnie Berkholz 2006-12-04 22:35 ` Hanni Ali 2006-12-04 23:00 ` Bryan Green 2006-12-04 23:55 ` Daniel van Ham Colchete 2006-12-05 3:55 ` Donnie Berkholz 2006-12-05 13:18 ` John R. Dunning 2006-12-05 16:25 ` Bryan Green 2006-12-05 21:15 ` Daniel van Ham Colchete 2006-12-05 21:22 ` Bryan Green 2006-12-05 21:28 ` John R. Dunning 2006-12-07 0:33 ` Bryan Green 2006-12-07 13:12 ` John R. Dunning 2006-12-08 3:56 ` Bryan Green 2006-12-08 14:18 ` John R. Dunning 2006-12-08 17:15 ` Bryan Green 2006-12-11 13:38 ` John R. Dunning 2006-12-08 18:29 ` Donnie Berkholz 2006-12-08 18:52 ` Bryan Green 2006-12-08 19:10 ` John R. Dunning 2006-12-11 19:44 ` Bryan Green 2006-12-02 14:40 ` Nick Anderson