[gentoo-user] Clusters on Gentoo ?


* [gentoo-user] Clusters on Gentoo ?
@ 2014-08-06 16:50 James
  2014-08-07  7:38 ` J. Roeleveld
  2014-08-17 19:46 ` [gentoo-user] " thegeezer
  0 siblings, 2 replies; 17+ messages in thread
From: James @ 2014-08-06 16:50 UTC
  To: gentoo-user

Howdy one and all,

Many see a world where clusters abound even for the small business and
resource-capable enthusiast [1]. Clusters of old PCs are the norm, but a slew
of new extremely low-powered 64-bit embedded systems, running embedded Linux,
with ample RAM (DDR4 even) and up to eight SATA-3 ports will undoubtedly
be the targets of acquisition by hobbyists around the world. Others with more
salient goals are sure to follow!

For example, we (Gentoo) have just had one of the "titans" of the embedded
Linux world return to Gentoo. Linaro is the default industry group that
is leading the charge in new development for Linux-based embedded systems,
sharing most of their work with the larger open source communities.
Thomas Gall, aka tgall, is working for Linaro as the acting director of the
Linaro Mobile Group [8,9].  Clusters will seamlessly integrate CPUs, GPUs,
ARMs, FPGAs, SoCs and many other instantiations of computational resources,
sooner rather than later. The billion-dollar players already run these sorts
of amalgamations for a very wide variety of reasons, so why shouldn't the
bands of linux_commoners have access to such raw power? [10]

In a recent thread (schedulers) it was noted that several folks had interest
in clusters (privately operated clouds) as more than a passing interest.
Companion projects, such as Apache's "Spark" [4], have tremendous potential
as aggressive solutions in such diverse fields as social media relationships,
distributed database techniques and new, massively parallel programming
paradigms for computationally intensive scientific endeavors, just to
mention a few [5,6,7].

So I'm soliciting the readers of this list to post any references to
distributed/cluster/cloud software/filesystems they are aware of, have used
or would like to see; to gauge interest in Mesos, Chronos, Spark (Apache) as
well as all other open source cluster (distributed) systems or tools [2].
My collection of such is sporadic, at best, and serves mostly my
math/science needs. Project Athena is one of the oldest efforts, still
kicking at MIT, the last I heard [3]. Newer/cooler efforts?

Hopefully, we can all share ideas and brainstorm about how Gentoo users
can lead the pack of linux distros into this brave_new world. [Overlays?]

curiously,
James

[1]
http://www.forbes.com/sites/marcochiappetta/2014/07/31/amd-opteron-64-bit-arm-based-seattle-dev-kits-are-shipping/?partner=yahootix

[2] http://hadoop.apache.org/docs/r1.2.1/cluster_setup.html

[3] https://ist.mit.edu/athena

[4] https://spark.apache.org/docs/latest/index.html

[5] https://spark.apache.org/docs/latest/graphx-programming-guide.html#overview

[6] http://en.wikipedia.org/wiki/Apache_Hadoop

[7] http://www.wired.com/2012/04/amazon-takes-genomics-research-to-the-clouds/

[8] http://www.gossamer-threads.com/lists/gentoo/dev/289556

[9] http://www.linaro.org/

[10] http://opencores.org/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-06 16:50 [gentoo-user] Clusters on Gentoo ? James
@ 2014-08-07  7:38 ` J. Roeleveld
  2014-08-07 11:10   ` Alec Ten Harmsel
  2014-08-17 19:46 ` [gentoo-user] " thegeezer
  1 sibling, 1 reply; 17+ messages in thread
From: J. Roeleveld @ 2014-08-07  7:38 UTC
  To: gentoo-user

On Wednesday, August 06, 2014 04:50:22 PM James wrote:
<snipped>

> Hopefully, we can all share ideas and brainstorm about how Gentoo users
> can lead the pack of linux distros into this brave_new world. [Overlays?]

A good place to start would also be:
http://www.yolinux.com/TUTORIALS/LinuxClustersAndFileSystems.html

--
Joost




* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-07  7:38 ` J. Roeleveld
@ 2014-08-07 11:10   ` Alec Ten Harmsel
  2014-08-07 22:16     ` [gentoo-user] " James
  0 siblings, 1 reply; 17+ messages in thread
From: Alec Ten Harmsel @ 2014-08-07 11:10 UTC
  To: gentoo-user

I'm a Hadoop and related software sysadmin at the University of 
Michigan. I'm a student still, so it's only a part-time position. I 
have some documentation at http://caen.github.io/hadoop - if something 
is not clear, I will gladly take feedback and make appropriate changes.

> In a recent thread (schedulers) it was noted that several folks had interest
> in clusters (privately operated clouds) as more than a passing interest.

 I'll try to chime in on any questions about scheduling/clusters in the 
future as we have a pretty large installation (~20,000 cores) running a 
traditional HPC stack, and a small Hadoop cluster.

> [2] http://hadoop.apache.org/docs/r1.2.1/cluster_setup.html

Hadoop is currently "stable" on 2.x (specifically 2.4), so relevant 
documentation is at http://hadoop.apache.org/docs/stable.

Alec


* [gentoo-user] Re: Clusters on Gentoo ?
  2014-08-07 11:10   ` Alec Ten Harmsel
@ 2014-08-07 22:16     ` James
  2014-08-08  2:36       ` Alec Ten Harmsel
  0 siblings, 1 reply; 17+ messages in thread
From: James @ 2014-08-07 22:16 UTC
  To: gentoo-user

Alec Ten Harmsel <alec <at> alectenharmsel.com> writes:


> I'm a Hadoop and related software sysadmin at the University of 
> Michigan. I'm a student still, so it's only a part-time position. I 
> have some documentation at http://caen.github.io/hadoop - if something 
> is not clear, I will gladly take feedback and make appropriate changes.

>  I'll try to chime in on any questions about scheduling/clusters in the 
> future as we have a pretty large installation (~20,000 cores) running a 
> traditional HPC stack, and a small Hadoop cluster.

> > [2] http://hadoop.apache.org/docs/r1.2.1/cluster_setup.html
 
> Hadoop is currently "stable" on 2.x (specifically 2.4), so relevant 
> documentation is at http://hadoop.apache.org/docs/stable.


Which operating systems do your Hadoop systems run on top of?


Whilst researching, I ran across a FAT PAYCHECK at a startup for Hadoop
expertise (California):

"Extremely competitive pay: 150-175K base salary, large bonus plan, and a
generous equity stake"


http://www.cybercoders.com/principal-software-engineer-backend-technology-agnostic-job-157829?jobId=EON-1157039&ad=recruiticsindeed&rx_job=19411798&rx_source=Indeed&rx_campaign=Indeed15&rx_medium=CPC


enjoy!
James




* Re: [gentoo-user] Re: Clusters on Gentoo ?
  2014-08-07 22:16     ` [gentoo-user] " James
@ 2014-08-08  2:36       ` Alec Ten Harmsel
  2014-08-08  6:29         ` J. Roeleveld
  0 siblings, 1 reply; 17+ messages in thread
From: Alec Ten Harmsel @ 2014-08-08  2:36 UTC
  To: gentoo-user

> Which operating systems do your Hadoop systems run on top of?

We use RedHat, although we make a fair amount of custom RPMs. It's just
too much having to deal with Gentoo while maintaining high performance
filesystems and schedulers, which are apparently a real pain.

> Whilst researching, I ran across a FAT PAYCHECK at a startup for Hadoop
> expertise (California): "Extremely competitive pay: 150-175K base
> salary, large bonus plan, and a generous equity stake"

Lolz, a few of the full time staff regularly get headhunted for similar
positions. I'm still a student and would like to finish the degree,
though, so no high paying jobs for me yet.

Alec


* Re: [gentoo-user] Re: Clusters on Gentoo ?
  2014-08-08  2:36       ` Alec Ten Harmsel
@ 2014-08-08  6:29         ` J. Roeleveld
  2014-08-08 10:17           ` Alec Ten Harmsel
  0 siblings, 1 reply; 17+ messages in thread
From: J. Roeleveld @ 2014-08-08  6:29 UTC
  To: gentoo-user

On Thursday, August 07, 2014 10:36:35 PM Alec Ten Harmsel wrote:
> > Which operating systems do your Hadoop systems run on top of?
> 
> We use RedHat, although we make a fair amount of custom RPMs. It's just
> too much having to deal with Gentoo while maintaining high performance
> filesystems and schedulers, which are apparently a real pain.

If you already make custom RPMs, why not build binary packages for a Gentoo 
based cluster?

> > Whilst researching, I ran across a FAT PAYCHECK at a startup for Hadoop
> > expertise (California): "Extremely competitive pay: 150-175K base
> > salary, large bonus plan, and a generous equity stake"

I'm wondering what "Unlimited vacation policy" actually means.

> Lolz, a few of the full time staff regularly get headhunted for similar
> positions. I'm still a student and would like to finish the degree,
> though, so no high paying jobs for me yet.

Good decision.

--
Joost



* Re: [gentoo-user] Re: Clusters on Gentoo ?
  2014-08-08  6:29         ` J. Roeleveld
@ 2014-08-08 10:17           ` Alec Ten Harmsel
  0 siblings, 0 replies; 17+ messages in thread
From: Alec Ten Harmsel @ 2014-08-08 10:17 UTC
  To: gentoo-user

On Fri 08 Aug 2014 02:29:55 AM EDT, J. Roeleveld wrote:

> If you already make custom RPMs, why not build binary packages for a Gentoo 
> based cluster?

I was mistaken last night (probably a little tired, been driving all
day) - we use RedHat for the support and because the software we run
usually only officially supports RHEL and maybe Debian.

My bad.

Alec



* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-06 16:50 [gentoo-user] Clusters on Gentoo ? James
  2014-08-07  7:38 ` J. Roeleveld
@ 2014-08-17 19:46 ` thegeezer
  2014-08-18 14:31   ` J. Roeleveld
  1 sibling, 1 reply; 17+ messages in thread
From: thegeezer @ 2014-08-17 19:46 UTC
  To: gentoo-user

there are many ways to do clustering and one thing that i would consider
a "holy grail" would be something like pvm [1]
because nothing else seems to have similar horizontal scaling of cpu at
the kernel level

i would love to know the mechanism behind dell's equallogic san as it
really is clustered lvm on steroids.
GFS / orangefs / ocfs are not the easiest things to setup (ocfs is) and
i've not found performance to be so great for writes.
DRBD is only 2 devices as far as i understand, so not really super scalable
i'm still not convinced over the likes of hadoop for storage, maybe i
just don't have the scale to "get" it?

the thing with clusters is that you want to be able to spin an extra
node up and join it to the group and then you increase cpu / storage by
n+1   but also you want to be able to spin nodes down dynamically and go
down by n-1.  i guess this is where hadoop is of benefit because that is
not a happy thing for a typical file system.

network load balancing is super easy, all info required is in each
packet -- application load balancing requires more thought.
this is where the likes of memcached can help but also why a good design
of the cluster is better. localised data and tiered access etc...  kind
of why i would like to see a pvm kind of solution -- so that a page
fault is triggered like swap memory which then fetches the relevant
memory from the network: bearing in mind that a computer can typically
trigger thousands of page faults a second and that memory access is very
very many times faster than gigabit networking!
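
As a rough illustration of why packet-level balancing is the easy case:
everything needed to pick a backend is already in the packet's addressing
fields, so a stateless hash is enough. This is a toy sketch, not how any
particular load balancer works; `pick_backend` and the backend names are
made up for the example.

```python
import hashlib


def pick_backend(src_ip, src_port, dst_ip, dst_port, protocol, backends):
    """Map a connection's addressing fields to one backend, with no shared state.

    The same five values always hash to the same backend, so every packet
    of one connection lands on the same node.
    """
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}/{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical backend names
    print(pick_backend("10.0.0.5", 40000, "10.0.0.1", 80, "tcp", nodes))
```

Application-level balancing has no such shortcut, because the state that
matters (sessions, cached data) lives outside the packet.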

[1] http://www.csm.ornl.gov/pvm/pvm_home.html


* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-17 19:46 ` [gentoo-user] " thegeezer
@ 2014-08-18 14:31   ` J. Roeleveld
  2014-08-18 14:50     ` Rich Freeman
  2014-08-18 19:09     ` thegeezer
  0 siblings, 2 replies; 17+ messages in thread
From: J. Roeleveld @ 2014-08-18 14:31 UTC
  To: gentoo-user

On Sunday, August 17, 2014 08:46:58 PM thegeezer wrote:
> there are many ways to do clustering and one thing that i would consider
> a "holy grail" would be something like pvm [1]
> because nothing else seems to have similar horizontal scaling of cpu at
> the kernel level

PVM, from the webpage, looks more like a pre-built VM. Not some kernel module 
that distributes existing code to different nodes.
This kind of clustering also has no benefit for most uses. You really need to 
design your tasks for this kind of environment.

> i would love to know the mechanism behind dell's equallogic san as it
> really is clustered lvm on steroids.
> GFS / orangefs / ocfs are not the easiest things to setup (ocfs is) and
> i've not found performance to be so great for writes.

I have seen weird issues when using Oracle's filesystems for anything not 
Oracle. How important is reliability?

> DRBD is only 2 devices as far as i understand, so not really super scalable
> i'm still not convinced over the likes of hadoop for storage, maybe i
> just don't have the scale to "get" it?

I wouldn't use Hadoop for storage of files. It's only useful if you have a lot 
(and I do mean a LOT) of data where a query only returns a very small amount.
Performance of a Hadoop cluster is high because the same query is sent to all 
nodes at once and the answers get merged into a single answer along the way 
back to the requestor. I don't see it as a valid system to actually store 
important data you do not want to risk losing.

> the thing with clusters is that you want to be able to spin an extra
> node up and join it to the group and then you increase cpu / storage by
> n+1   but also you want to be able to spin nodes down dynamically and go
> down by n-1.  i guess this is where hadoop is of benefit because that is
> not a happy thing for a typical file system.

Not necessarily. That is only one way to use a cluster.
It's also an "easy" and "cheap" method of increasing the available processing 
power. This only works properly if the tasks can be distributed over multiple 
nodes easily. Having the option to quickly add and remove nodes makes it 
difficult to keep the data consistent. Hadoop especially prefers the nodes to 
stay available as there is no single node containing all the data. There is 
some redundancy, but remove a few nodes and you can easily lose data.
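
To put a rough number on "remove a few nodes and you can easily lose data",
here is a back-of-the-envelope sketch. It assumes each block has a fixed
number of replicas placed on distinct nodes chosen uniformly at random, and
that the nodes vanish at once before anything can be re-replicated. Both are
simplifications; the function names and the example figures (20 nodes, 3
replicas, 100,000 blocks) are illustrative only.

```python
from math import comb


def block_loss_probability(nodes, replicas, removed):
    """Chance that one block loses every replica when `removed` nodes vanish.

    Assumes replicas sit on distinct, uniformly random nodes; rack awareness
    and re-replication are ignored.
    """
    if removed < replicas:
        return 0.0  # at least one replica must survive
    return comb(removed, replicas) / comb(nodes, replicas)


def expected_lost_blocks(blocks, nodes, replicas, removed):
    """Expected number of blocks with no surviving replica."""
    return blocks * block_loss_probability(nodes, replicas, removed)


if __name__ == "__main__":
    for removed in (1, 2, 3, 5):
        lost = expected_lost_blocks(100_000, nodes=20, replicas=3, removed=removed)
        print(f"remove {removed} of 20 nodes: about {lost:.0f} blocks lost")
```

Under these assumptions nothing is lost until the number of removed nodes
reaches the replica count, and the expected loss climbs quickly after that.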

> network load balancing is super easy, all info required is in each
> packet -- application load balancing requires more thought.
> this is where the likes of memcached can help but also why a good design
> of the cluster is better. localised data and tiered access etc...  kind
> of why i would like to see a pvm kind of solution -- so that a page
> fault is triggered like swap memory which then fetches the relevant
> memory from the network:

That is going to kill performance...
Have a look into NUMA. It's always best to have the data where it is being 
processed. Either by moving the data to the processing unit, or by using a 
processing unit local to the data.
Moving data is always expensive with regards to performance.

This is how Hadoop clusters work: the data is processed on the node actually 
having the data. The result (which is often less than 1% of the source-data) 
is then sent over the network to another node, which, at this stage, merges 
the result and passes it to another node. This then continues until all the 
results are merged into a single result-set which is then returned to the 
requesting application.
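
The flow described above can be sketched in a few lines. This is a toy model
of the description only, not Hadoop's actual implementation: each inner list
stands in for the records held by one node, and `process_locally`, `merge`
and `run_query` are hypothetical names.

```python
from collections import Counter


def process_locally(records):
    """Runs where the data lives: boil many records down to a small summary."""
    return Counter(word for line in records for word in line.split())


def merge(left, right):
    """Combine two partial results; only these summaries cross the network."""
    return left + right


def run_query(node_data):
    """Process every node's data locally, then merge pairwise until one result remains."""
    partials = [process_locally(records) for records in node_data]
    while len(partials) > 1:
        merged = [merge(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            merged.append(partials[-1])  # odd one out waits for the next round
        partials = merged
    return partials[0] if partials else Counter()
```

The point of the shape is that the bulky input never moves; only the small
per-node summaries do.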

> bearing in mind that a computer can typically
> trigger thousands of page faults a second and that memory access is very
> very many times faster than gigabit networking!
> 
> [1] http://www.csm.ornl.gov/pvm/pvm_home.html

Looks nice, but is not going to help with performance if the application is 
not designed for distributed processing.

--
Joost


* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-18 14:31   ` J. Roeleveld
@ 2014-08-18 14:50     ` Rich Freeman
  2014-08-18 14:53       ` Alec Ten Harmsel
  2014-08-18 19:09     ` thegeezer
  1 sibling, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2014-08-18 14:50 UTC
  To: gentoo-user

On Mon, Aug 18, 2014 at 10:31 AM, J. Roeleveld <joost@antarean.org> wrote:
>
> I wouldn't use Hadoop for storage of files. It's only useful if you have a lot
> (and I do mean a LOT) of data where a query only returns a very small amount.

Not to mention a lot of data in a small number of files.  I think the
minimum allocation size for Hadoop is measured in megabytes.  I tried
using it to process gentoo-x86 and the number of files just clobbered
the thing.  Since in my job the files were really just static data and
not the actual subject of the map/reduce I instead just replicated the
data to all the nodes and had them retrieve the data from the local
filesystem.
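
A rough way to see why a tree full of tiny files hurts: if the framework
schedules roughly one map task per block per file (a simplification, and an
assumption about the default input handling), the task count follows the
file count rather than the data volume. The 128 MiB block size below is an
assumed value, and `estimated_map_tasks` is a made-up helper.

```python
from math import ceil

BLOCK_SIZE = 128 * 1024 * 1024  # ASSUMED block size; the real value is a cluster setting


def estimated_map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Estimate map tasks as one per block per file; files never share a task."""
    return sum(max(1, ceil(size / block_size)) for size in file_sizes)


if __name__ == "__main__":
    small_files = [2048] * 100_000       # 100,000 files of 2 KiB each
    one_big_file = [2048 * 100_000]      # the same bytes packed into one file
    print(estimated_map_tasks(small_files), "tasks for the small files")
    print(estimated_map_tasks(one_big_file), "tasks for the packed file")
```

Packing small inputs into a few large files, or shipping static data to the
nodes out of band as described above, sidesteps the per-file overhead.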

Hadoop is a very specialized tool.  It does what it does very well,
but if you want to use it for something other than map/reduce then
consider carefully whether it is the right tool for the job.

--
Rich


* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-18 14:50     ` Rich Freeman
@ 2014-08-18 14:53       ` Alec Ten Harmsel
  2014-08-19  9:34         ` J. Roeleveld
  0 siblings, 1 reply; 17+ messages in thread
From: Alec Ten Harmsel @ 2014-08-18 14:53 UTC
  To: gentoo-user

On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:

> Hadoop is a very specialized tool.  It does what it does very well,
> but if you want to use it for something other than map/reduce then
> consider carefully whether it is the right tool for the job.

Agreed; unless you have decent hardware and can comfortably measure 
your data in TB, it'll be quicker to use something else once you factor 
in the administration time and learning curve.

Alec


* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-18 14:31   ` J. Roeleveld
  2014-08-18 14:50     ` Rich Freeman
@ 2014-08-18 19:09     ` thegeezer
  2014-08-19  9:18       ` J. Roeleveld
  1 sibling, 1 reply; 17+ messages in thread
From: thegeezer @ 2014-08-18 19:09 UTC
  To: gentoo-user

On 18/08/14 15:31, J. Roeleveld wrote:
> <snip>
valid points, and interesting to see the corrections of my
understanding, always welcome :)
> Looks nice, but is not going to help with performance if the application is 
> not designed for distributed processing.
>
> --
> Joost
>
this is the key point i would raise about clusters really -- it would be
nice to not need for example distcc configured and just have portage run
across all connected nodes without any further work, or to use a tablet
computer which is "borrowing" cycles from a GFX card across the network
without having to configure nvidia grid: specifically these two use
cases have wildly different characteristics and are a great example of
why clustering has to be designed first to fit the application and
vice versa.

/me continues to wonder if 10GigE is fast enough to page fault across
the network ... ;)



* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-18 19:09     ` thegeezer
@ 2014-08-19  9:18       ` J. Roeleveld
  0 siblings, 0 replies; 17+ messages in thread
From: J. Roeleveld @ 2014-08-19  9:18 UTC
  To: gentoo-user

On Monday, August 18, 2014 08:09:00 PM thegeezer wrote:
> On 18/08/14 15:31, J. Roeleveld wrote:
> > <snip>
> 
> valid points, and interesting to see the corrections of my
> understanding, always welcome :)

You're welcome :)

> > Looks nice, but is not going to help with performance if the application
> > is
> > not designed for distributed processing.
> > 
> > --
> > Joost
> 
> this is the key point i would raise about clusters really -- it would be
> nice to not need for example distcc configured and just have portage run
> across all connected nodes without any further work, or to use a tablet
> computer which is "borrowing" cycles from a GFX card across the network
> without having to configure nvidia grid: specifically these two use
> cases have wildly different characteristics and are a great example of
> why clustering has to be designed first to fit the application and
> vice versa.

I had a better look at that site you linked to. It won't be as "hidden" as 
you'd like. The software you run on it needs to be designed to actually use 
the infrastructure.
This means that for your ideal to work, the "industry" needs to decide on a 
single clustering technology for this. I wish you good luck on that venture. 
:)

> /me continues to wonder if 10GigE is fast enough to page fault across
> the network ... ;)

Depends on how fast you want the environment to be.
For old i386-era speeds, probably.
If you expect performance equivalent to a modern system, no.

Check the bus-speeds between the CPU and memory that is being employed these 
days. That is the minimum speed you need in the network link to be fast enough 
to actually work. And that is expecting a perfect link with no errors 
occurring in the wiring.
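
The comparison can be put into rough numbers. The sketch below uses the
10GigE line rate and the theoretical peak of a single DDR3-1600 memory
channel (12.8 GB/s); the 50 microsecond round trip is purely an assumed
figure and should be replaced with a measured one. Protocol overhead and
errors are ignored, so this is the optimistic case.

```python
PAGE_BYTES = 4096          # typical x86 page size
LINK_BITS_PER_S = 10e9     # 10GigE line rate, protocol overhead ignored
ROUND_TRIP_S = 50e-6       # ASSUMED round trip incl. software stack; measure yours
RAM_BYTES_PER_S = 12.8e9   # theoretical peak of one DDR3-1600 channel


def remote_fault_seconds(round_trip_s=ROUND_TRIP_S,
                         page_bytes=PAGE_BYTES,
                         link_bits_per_s=LINK_BITS_PER_S):
    """Time to fetch one page over the network: latency plus wire time."""
    return round_trip_s + page_bytes * 8 / link_bits_per_s


def bandwidth_ratio(ram_bytes_per_s=RAM_BYTES_PER_S,
                    link_bits_per_s=LINK_BITS_PER_S):
    """How many times faster local memory is than the link, on bandwidth alone."""
    return ram_bytes_per_s / (link_bits_per_s / 8)


if __name__ == "__main__":
    t = remote_fault_seconds()
    print(f"one remote page fault: {t * 1e6:.1f} us")
    print(f"serial faults per second: {1 / t:,.0f}")
    print(f"RAM vs link bandwidth: {bandwidth_ratio():.1f}x")
```

Even with zero latency, moving one 4 KiB page at line rate takes about 3.3
microseconds, and a single memory channel already has roughly ten times the
bandwidth of the link; any realistic round trip makes the gap much wider.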

--
Joost


* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-18 14:53       ` Alec Ten Harmsel
@ 2014-08-19  9:34         ` J. Roeleveld
  2014-08-19 10:33           ` Rich Freeman
  2014-08-19 10:52           ` Alec Ten Harmsel
  0 siblings, 2 replies; 17+ messages in thread
From: J. Roeleveld @ 2014-08-19  9:34 UTC
  To: gentoo-user

On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
> > Hadoop is a very specialized tool.  It does what it does very well,
> > but if you want to use it for something other than map/reduce then
> > consider carefully whether it is the right tool for the job.
> 
> Agreed; unless you have decent hardware and can comfortably measure
> your data in TB, it'll be quicker to use something else once you factor
> in the administration time and learning curve.

The benefit of clustering technologies is that you don't need high-end 
hardware to start with. You can use the old hardware you found collecting dust 
in the basement.

The learning curve isn't as steep as it used to be. There are plenty of tools 
to make it easier to start using Hadoop.

--
Joost



* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-19  9:34         ` J. Roeleveld
@ 2014-08-19 10:33           ` Rich Freeman
  2014-08-19 10:45             ` J. Roeleveld
  2014-08-19 10:52           ` Alec Ten Harmsel
  1 sibling, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2014-08-19 10:33 UTC
  To: gentoo-user

On Tue, Aug 19, 2014 at 5:34 AM, J. Roeleveld <joost@antarean.org> wrote:
> On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
>> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
>> > Hadoop is a very specialized tool.  It does what it does very well,
>> > but if you want to use it for something other than map/reduce then
>> > consider carefully whether it is the right tool for the job.
>>
>> Agreed; unless you have decent hardware and can comfortably measure
>> your data in TB, it'll be quicker to use something else once you factor
>> in the administration time and learning curve.
>
> The benefit of clustering technologies is that you don't need high-end
> hardware to start with. You can use the old hardware you found collecting dust
> in the basement.
>
> The learning curve isn't as steep as it used to be. There are plenty of tools
> to make it easier to start using Hadoop.
>

As long as you're counting words and don't mind coding everything in Java.  :)

I found that if you want to avoid using Java, then the available
documentation plummets, and I'm pretty sure the version I was
attempting to use was buggy - it was losing records in the sort/reduce
phase I believe.  Or perhaps I was just using it incorrectly, but the
same exact code worked just fine when I ran it on a single host with a
smaller dataset and just piped map | sort | reduce without using
Hadoop.  The documentation was pretty sparse on how to get Hadoop to
work via stdin/out with non-Java code and it is quite possible I
wasn't quite doing things right.  In the end my problem wasn't big
enough to necessitate using Hadoop and I used GNU parallel instead.
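
For anyone wanting to try the map | sort | reduce approach on a single host,
a minimal sketch follows. It is a word count, not the job described above,
and the file name `wc.py` used in the usage line is hypothetical. Records
are `key<TAB>value` lines on stdin/stdout, which is also the general shape
that non-Java jobs use with Hadoop's streaming interface, though no Hadoop
command line is shown here.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Run as "map" or "reduce"; does nothing when no stage is named.
if __name__ == "__main__" and len(sys.argv) == 2:
    stage = {"map": map_lines, "reduce": reduce_lines}[sys.argv[1]]
    for record in stage(sys.stdin):
        print(record)
```

Usage on one machine: `python wc.py map < input.txt | sort | python wc.py reduce`.
The reducer relies on the sort step to bring identical keys together, which
is the one guarantee the pipeline has to preserve when it is distributed.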

--
Rich


* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-19 10:33           ` Rich Freeman
@ 2014-08-19 10:45             ` J. Roeleveld
  0 siblings, 0 replies; 17+ messages in thread
From: J. Roeleveld @ 2014-08-19 10:45 UTC
  To: gentoo-user

On Tuesday, August 19, 2014 06:33:29 AM Rich Freeman wrote:
> On Tue, Aug 19, 2014 at 5:34 AM, J. Roeleveld <joost@antarean.org> wrote:
> > On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
> >> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
> >> > Hadoop is a very specialized tool.  It does what it does very well,
> >> > but if you want to use it for something other than map/reduce then
> >> > consider carefully whether it is the right tool for the job.
> >> 
> >> Agreed; unless you have decent hardware and can comfortably measure
> >> your data in TB, it'll be quicker to use something else once you factor
> >> in the administration time and learning curve.
> > 
> > The benefit of clustering technologies is that you don't need high-end
> > hardware to start with. You can use the old hardware you found collecting
> > dust in the basement.
> > 
> > The learning curve isn't as steep as it used to be. There are plenty of
> > tools to make it easier to start using Hadoop.
> 
> As long as you're counting words and don't mind coding everything in Java. 
> :)
> 
> I found that if you want to avoid using Java, then the available
> documentation plummets, and I'm pretty sure the version I was
> attempting to use was buggy - it was losing records in the sort/reduce
> phase I believe.  Or perhaps I was just using it incorrectly, but the
> same exact code worked just fine when I ran it on a single host with a
> smaller dataset and just piped map | sort | reduce without using
> Hadoop.  The documentation was pretty sparse on how to get Hadoop to
> work via stdin/out with non-Java code and it is quite possible I
> wasn't quite doing things right.  In the end my problem wasn't big
> enough to necessitate using Hadoop and I used GNU parallel instead.

No need for Java knowledge to develop against Hadoop.
A commercial product:
http://www.informatica.com/Images/01603_powerexchange-for-hadoop_ds_en-US.pdf
Nice and easy graphical interface. The same "code" that works against a 
relational database also works with Hadoop. The tool does the translation.

I would be surprised if there are no other tools that can make it easier to 
develop code to work with Hadoop. I just haven't had the reason to search for 
those yet.

--
Joost



* Re: [gentoo-user] Clusters on Gentoo ?
  2014-08-19  9:34         ` J. Roeleveld
  2014-08-19 10:33           ` Rich Freeman
@ 2014-08-19 10:52           ` Alec Ten Harmsel
  1 sibling, 0 replies; 17+ messages in thread
From: Alec Ten Harmsel @ 2014-08-19 10:52 UTC
  To: gentoo-user

On Tue 19 Aug 2014 05:34:40 AM EDT, J. Roeleveld wrote:
> On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
>> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
>>> Hadoop is a very specialized tool.  It does what it does very well,
>>> but if you want to use it for something other than map/reduce then
>>> consider carefully whether it is the right tool for the job.
>>
>> Agreed; unless you have decent hardware and can comfortably measure
>> your data in TB, it'll be quicker to use something else once you factor
>> in the administration time and learning curve.
>
> The benefit of clustering technologies is that you don't need high-end
> hardware to start with. You can use the old hardware you found collecting dust
> in the basement.

Yes, but... if you are doing anything that *needs* to be fast (i.e. if 
you're not a hobbyist), you don't need some super fancy database 
machine but you still need some decent hardware (gotta have enough RAM 
for that JVM ;) ). If you'd like to take a look at our hardware, you 
can check out http://caen.github.io/hadoop/hardware.html.

> The learning curve isn't as steep as it used to be. There are plenty of tools
> to make it easier to start using Hadoop.

There are plenty of great tools (Pig, Sqoop, Hive, RHadoop, etc.) that 
you can use so you're not writing Java. This is all client-side; it 
doesn't make the administration easier.

I agree that it's easy to start using it (It's possible to configure a 
small cluster from scratch in half an hour), but it takes a lot more 
time to tune your installation so it actually performs well. Just like 
any other piece of server software; serving a website with httpd is 
easy, but serving it well and adding security takes a lot more time.

Rich Freeman wrote:
> As long as you're counting words and don't mind coding everything in Java. :)

We discourage researchers from writing in Java and encourage them to use any of 
the tools I listed above instead, unless they really like Java.

> I found that if you want to avoid using Java, then the
> available documentation plummets

Yeah, this is still a pretty big problem. Documentation is pretty 
sparse.

Alec


Thread overview: 17+ messages
2014-08-06 16:50 [gentoo-user] Clusters on Gentoo ? James
2014-08-07  7:38 ` J. Roeleveld
2014-08-07 11:10   ` Alec Ten Harmsel
2014-08-07 22:16     ` [gentoo-user] " James
2014-08-08  2:36       ` Alec Ten Harmsel
2014-08-08  6:29         ` J. Roeleveld
2014-08-08 10:17           ` Alec Ten Harmsel
2014-08-17 19:46 ` [gentoo-user] " thegeezer
2014-08-18 14:31   ` J. Roeleveld
2014-08-18 14:50     ` Rich Freeman
2014-08-18 14:53       ` Alec Ten Harmsel
2014-08-19  9:34         ` J. Roeleveld
2014-08-19 10:33           ` Rich Freeman
2014-08-19 10:45             ` J. Roeleveld
2014-08-19 10:52           ` Alec Ten Harmsel
2014-08-18 19:09     ` thegeezer
2014-08-19  9:18       ` J. Roeleveld
