public inbox for gentoo-user@lists.gentoo.org
From: "J. Roeleveld" <joost@antarean.org>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] Clusters on Gentoo ?
Date: Mon, 18 Aug 2014 16:31:16 +0200
Message-ID: <1855316.WFR9YJczUb@andromeda>
In-Reply-To: <53F106B2.4090307@thegeezer.net>

On Sunday, August 17, 2014 08:46:58 PM thegeezer wrote:
> there are many way to do clustering and one thing that i would consider
> a "holy grail" would be something like pvm [1]
> because nothing else seems to have similar horizontal scaling of cpu at
> the kernel level

PVM, judging from the web page, looks more like a pre-built VM than a kernel
module that distributes existing code to different nodes.
This kind of clustering also has no benefit for most uses. You really need to
design your tasks for this kind of environment.

> i would love to know the mechanism behind dell's equallogic san as it
> really is clustered lvm on steroids.
> GFS / orangefs / ocfs are not the easiest things to setup (ocfs is) and
> i've not found performance to be so great for writes.

I have seen weird issues when using Oracle's filesystems for anything not 
Oracle. How important is reliability?

> DRBD is only 2 devices as far as i understand, so not really super scalable
> i'm still not convinced over the likes of hadoop for storage, maybe i
> just don't have the scale to "get" it?

I wouldn't use Hadoop for storage of files. It's only useful if you have a lot 
(and I do mean a LOT) of data where a query only returns a very small amount.
Performance of a Hadoop cluster is high because the same query is sent to all 
nodes at once and the answers get merged into a single answer along the way 
back to the requestor. I don't see it as a valid system to actually store 
important data you do not want to risk losing.

> the thing with clusters is that you want to be able to spin an extra
> node up and join it to the group and then you increase cpu / storage by
> n+1   but also you want to be able to spin nodes down dynamically and go
> down by n-1.  i guess this is where hadoop is of benefit because that is
> not a happy thing for a typical file system.

Not necessarily. That is only one way to use a cluster.
It's also an "easy" and "cheap" method of increasing the available processing
power. This only works properly if the tasks can be distributed over multiple
nodes easily. Having the option to quickly add and remove nodes makes it
difficult to keep the data consistent. Hadoop especially prefers the nodes to
stay available, as there is no single node containing all the data. There is
some redundancy, but remove a few nodes and you can easily lose data.
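
A minimal sketch of why removing a few nodes can lose data even with
redundancy. It assumes three replicas per block and a simple round-robin
placement; the function names and the placement rule are made up for
illustration and are not how HDFS actually chooses nodes.

```python
def place_blocks(num_blocks, num_nodes, replicas):
    """Assumed placement: block b lives on nodes b, b+1, ... (mod num_nodes)."""
    return {
        b: {(b + r) % num_nodes for r in range(replicas)}
        for b in range(num_blocks)
    }


def lost_blocks(placement, removed_nodes):
    """Blocks whose every replica sat on a removed node are gone."""
    removed = set(removed_nodes)
    return sorted(b for b, nodes in placement.items() if nodes <= removed)


placement = place_blocks(num_blocks=12, num_nodes=6, replicas=3)

# One node down: every block still has two replicas elsewhere.
lost_one = lost_blocks(placement, {0})

# Three nodes down at once: blocks stored only on those nodes are lost.
lost_three = lost_blocks(placement, {0, 1, 2})

print(lost_one, lost_three)
```

With this toy layout, losing a single node costs nothing, while losing
nodes 0, 1 and 2 together takes every block that was stored only on those
three.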

> network load balancing is super easy, all info required is in each
> packet -- application load balancing requires more thought.
> this is where the likes of memcached can help but also why a good design
> of the cluster is better. localised data and tiered access etc...  kind
> of why i would like to see a pvm kind of solution -- so that a page
> fault is triggered like swap memory which then fetches the relevant
> memory from the network:

That is going to kill performance...
Have a look at NUMA. It's always best to have the data where it is being
processed, either by moving the data to the processing unit or by using a
processing unit local to the data.
Moving data is always expensive with regard to performance.
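
A rough back-of-the-envelope sketch of that cost, for a single memory page
fetched over gigabit Ethernet. The 4 KiB page size and the 100 ns figure for
a local memory access are assumptions chosen for illustration, and the
calculation ignores network latency and protocol overhead, so the real
penalty would be larger.

```python
PAGE_BYTES = 4096                  # assumed page size (4 KiB)
LINK_BITS_PER_SEC = 1_000_000_000  # gigabit Ethernet, raw line rate
LOCAL_ACCESS_SEC = 100e-9          # assumed cost of one local memory access


def wire_time(page_bytes, link_bits_per_sec):
    """Time just to push one page through the link, ignoring latency."""
    return page_bytes * 8 / link_bits_per_sec


page_time = wire_time(PAGE_BYTES, LINK_BITS_PER_SEC)

# How many local accesses fit in the time one remote page takes to arrive.
slowdown = page_time / LOCAL_ACCESS_SEC

# Upper bound on remote page fetches per second over a saturated link.
max_fetches_per_sec = 1 / page_time

print(f"{page_time * 1e6:.3f} us per page, "
      f"{slowdown:.0f}x a local access, "
      f"at most {max_fetches_per_sec:.0f} pages/s")
```

Even under these generous assumptions, a process that faults on remote pages
thousands of times per second spends a large share of its time waiting on
the network.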

This is how Hadoop clusters work: the data is processed on the node that
actually holds it. The result (which is often less than 1% of the source data)
is then sent over the network to another node, which, at this stage, merges
the results and passes them to another node. This continues until all the
results are merged into a single result set, which is then returned to the
requesting application.
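
The flow described above can be sketched in a few lines. This is a toy,
single-process illustration and not Hadoop's API: the shard contents, node
names and function names are all made up, and the "nodes" are just entries in
a dictionary.

```python
from collections import Counter

# Hypothetical shards: each node holds only its own part of the data.
SHARDS = {
    "node1": ["error", "ok", "ok", "error"],
    "node2": ["ok", "ok", "warn"],
    "node3": ["error", "warn", "warn", "ok"],
}


def process_locally(records):
    """Runs where the data lives and returns a small summary of it."""
    return Counter(records)


def merge(left, right):
    """Combine two partial results into one."""
    return left + right


def run_query(shards):
    # Step 1: every node processes its own shard; only summaries move on.
    partials = [process_locally(records) for records in shards.values()]
    if not partials:
        return Counter()
    # Step 2: merge the partial results pairwise until one remains.
    while len(partials) > 1:
        merged = [
            merge(partials[i], partials[i + 1])
            for i in range(0, len(partials) - 1, 2)
        ]
        if len(partials) % 2:
            merged.append(partials[-1])  # odd one out waits for next round
        partials = merged
    return partials[0]


result = run_query(SHARDS)
print(result)
```

Only the small Counter summaries cross the "network" here; the raw records
never leave the node that holds them.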

> bearing in mind that a computer can typically
> trigger thousands of page faults a second and that memory access is very
> very many times faster than gigabit networking!
> 
> [1] http://www.csm.ornl.gov/pvm/pvm_home.html

Looks nice, but is not going to help with performance if the application is 
not designed for distributed processing.

--
Joost



