On Wednesday, September 17, 2014 08:56:28 PM James wrote:
> Alec Ten Harmsel alectenharmsel.com> writes:
> > As far as HDFS goes, I would only set that up if you will use it for
> > Hadoop or related tools. It's highly specific, and the performance is
> > not good unless you're doing a massively parallel read (what it was
> > designed for). I can elaborate why if anyone is actually interested.
>
> Actually, from my research and my goal (one really big scientific
> simulation running constantly).

Out of curiosity, what do you want to simulate?

> Many folks are recommending to skip Hadoop/HDFS altogether

I agree, Hadoop/HDFS is for data analysis, like building a profile about
people based on the information companies like Facebook, Google, NSA,
Walmart, governments, banks, .... collect about their
customers/users/citizens/slaves/....

> and go straight to mesos/spark. RDD (in-memory) cluster calculations
> are at the heart of my needs. The opposite end of the spectrum, loads
> of small files and small apps, I dunno about, but I'm all ears.
> In the end, my (3) node scientific cluster will morph and support the
> typical myriad of networked applications, but I can take a few years
> to figure that out, or just copy what smart guys like you and joost
> do.....

Nope, I'm simply following what you do and providing suggestions where I
can. Most of the cluster and distributed computing work I do is based on
adding machines to distribute the load, but the mechanisms for that are
implemented in the applications I work with, not in what I design
underneath.

The filesystems I am interested in are different from the ones you want.
I need to provide access to software installation files to a VM server,
and access to documentation which is created by the users. The VM server
is physically next to what I already mentioned as server A. Access to
the VM from the remote site will use remote desktop connections.
But to allow faster and easier access to the documentation, I need a
server B at the remote site which functions as described. AFS might be
suitable, but I need to be able to layer Samba on top of it to allow
seamless operation. I don't want the laptops to have their own cache and
then have to figure out how to resolve multiple conflicting changes to
documents containing layouts (MS Word and OpenDocument files).

> > We use Lustre for our high performance general storage. I don't have
> > any numbers, but I'm pretty sure it is *really* fast (10Gbit/s over
> > IB sounds familiar, but don't quote me on that).
>
> At UMich, you guys should test the FhGFS/btrfs combo. The folks at UCI
> swear by it, although they are only publishing a wee bit. (you know,
> water cooler gossip)...... Surely the Wolverines do not want those
> Californians getting up on them?
>
> Are you guys planning a mesos/spark test?
>
> > > Personally, I would read up on these and see how they work. Then,
> > > based on that, decide if they are likely to assist in the specific
> > > situation you are interested in.
>
> It's a ton of reading. It's not apples-to-apple_cider type of reading.
> My head hurts.....

Take a walk outside. Clear air should help you with the headaches :P

> I'm leaning to DFS/LFS
>
> (2) Lustre/btrfs and FhGFS/btrfs
>
> Thoughts/comments?

I have insufficient knowledge to advise on either of these.

One question: why BTRFS instead of ZFS? My current understanding is:
- ZFS is production ready, but due to licensing issues it is not
  included in the kernel
- BTRFS is included in the kernel, but not yet production ready with all
  planned features

For me, RAID6-like functionality is an absolute requirement, and the
latest I know is that it isn't implemented in BTRFS yet. Does anyone
know when that will be implemented and reliable? E.g. what time-frame
are we talking about?

--
Joost