On Wednesday, September 17, 2014 08:56:28 PM James wrote:
> Alec Ten Harmsel alectenharmsel.com> writes:
> > As far as HDFS goes, I would only set that up if you will use it for
> > Hadoop or related tools. It's highly specific, and the performance is
> > not good unless you're doing a massively parallel read (what it was
> > designed for). I can elaborate why if anyone is actually interested.
>
> Actually, from my research and my goal (one really big scientific
> simulation running constantly).

Out of curiosity, what do you want to simulate?

> Many folks are recommending to skip Hadoop/HDFS altogether

I agree, Hadoop/HDFS is for data analysis, like building a profile about
people based on the information companies like Facebook, Google, NSA,
Walmart, governments, banks, .... collect about their
customers/users/citizens/slaves/....

> and go straight to mesos/spark. RDD (in-memory) cluster calculations
> are at the heart of my needs. The opposite end of the spectrum, loads
> of small files and small apps, I dunno about, but I'm all ears.
> In the end, my (3) node scientific cluster will morph and support the
> typical myriad of networked applications, but I can take a few years
> to figure that out, or just copy what smart guys like you and joost
> do.....

Nope, I'm simply following what you do and providing suggestions where I
can. Most of the cluster and distributed computing work I do is based on
adding machines to distribute the load, but the mechanisms for that are
implemented in the applications I work with, not in what I design
underneath.

The filesystems I am interested in are different from the ones you want.
I need to provide access to software installation files to a VM server,
and access to documentation which is created by the users. The VM server
is physically next to what I already mentioned as server A. Access to
the VM from the remote site will use remote desktop connections.
But to allow faster and easier access to the documentation, I need a
server B at the remote site which functions as described. AFS might be
suitable, but I need to be able to layer Samba on top of it to allow
seamless operation. I don't want the laptops to have their own cache and
then have to figure out how to resolve multiple conflicting changes to
documents containing layouts (MS Word and OpenDocument files).

> > We use Lustre for our high performance general storage. I don't have
> > any numbers, but I'm pretty sure it is *really* fast (10Gbit/s over
> > IB sounds familiar, but don't quote me on that).
>
> At UMich, you guys should test the FhGFS/btrfs combo. The folks at UCI
> swear by it, although they are only publishing a wee bit. (you know,
> water cooler gossip)...... Surely the Wolverines do not want those
> Californians getting up on them?
>
> Are you guys planning a mesos/spark test?
>
> > > Personally, I would read up on these and see how they work. Then,
> > > based on that, decide if they are likely to assist in the specific
> > > situation you are interested in.
>
> It's a ton of reading. It's not apples-to-apple_cider type of reading.
> My head hurts.....

Take a walk outside. Clear air should help you with the headaches :P

> I'm leaning to DFS/LFS
>
> (2) Lustre/btrfs and FhGFS/btrfs
>
> Thoughts/comments?

I have insufficient knowledge to advise on either of these.

One question: why BTRFS instead of ZFS? My current understanding is:
- ZFS is production ready, but due to licensing issues it is not
  included in the kernel
- BTRFS is included in the kernel, but not yet production ready with all
  planned features

For me, RAID6-like functionality is an absolute requirement, and the
latest I know is that it isn't implemented in BTRFS yet. Does anyone
know when that will be implemented and reliable? E.g. what time-frame
are we talking about?

--
Joost