Message-ID: <541C4A01.10505@fastmail.co.uk>
Date: Fri, 19 Sep 2014 16:21:37 +0100
From: Kerin Millar
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] Re: File system testing
In-Reply-To: <541ADA46.4010307@alectenharmsel.com>

On 18/09/2014 14:12, Alec Ten Harmsel wrote:
>
> On 09/18/2014 05:17 AM, Kerin Millar wrote:
>> On 17/09/2014 21:20, Alec Ten Harmsel wrote:
>>> As far as HDFS goes, I would only set that up if you will use it for
>>> Hadoop or related tools. It's highly specific, and the performance is
>>> not good unless you're doing a massively parallel read (what it was
>>> designed for). I can elaborate why if anyone is actually interested.
>>
>> I, for one, am very interested.
>>
>> --Kerin
>>
>
> Alright, here goes:
>
> Rich Freeman wrote:
>
>> FYI - one very big limitation of hdfs is its minimum filesize is
>> something huge like 1MB or something like that.
>> Hadoop was designed to take a REALLY big input file and chunk it up.
>> If you use hdfs to store something like /usr/portage it will turn
>> into the sort of monstrosity that you'd actually need a cluster to
>> store.
>
> This is exactly correct, except that we run with a block size of
> 128MB, and a large cluster will typically have a block size of 256MB
> or even 512MB.
>
> HDFS has two main components: a NameNode, which keeps track of which
> blocks belong to which file (in memory), and the DataNodes, which
> actually store the blocks. No data ever flows through the NameNode;
> it only negotiates transfers between clients and DataNodes, and
> negotiates transfers for jobs. Since the NameNode stores its metadata
> in memory, small files are bad because they waste RAM.
>
> What exactly is Hadoop/HDFS used for? The most common uses are
> generating search indices on data, doing non-realtime processing of
> log and/or data streams, and allowing a large number of analysts to
> run disparate queries on the same large dataset - all batch jobs.
> Batch processing - processing the entire dataset - is really where
> Hadoop shines.
>
> When you put a file into HDFS, it gets split based on the block size.
> This is done so that a parallel read will be really fast: each map
> task reads in a single block and processes it. Ergo, if you put in a
> 1GB file with a 128MB block size and run a MapReduce job, 8 map tasks
> will be launched; put in a 1TB file and 8192 tasks would be launched.
> Tuning the block size is a trade-off between the overhead of
> launching many tasks and potentially under-utilizing the cluster.
> Typically, a cluster with a lot of data uses a bigger block size.
>
> The downsides of HDFS:
>
> * Seeked reads are not supported, afaik, because no one needs them
>   for batch processing.
> * Seeked writes into an existing file are not supported, because
>   either blocks would have to be added in the middle of a file and
>   would not be 128MB, or existing blocks would have to be edited,
>   resulting in blocks larger than 128MB. Both of these scenarios are
>   bad.
>
> Since HDFS users typically need neither seeked reads nor seeked
> writes, these downsides aren't really a big deal.
>
> If something's not clear, let me know.

Thank you for taking the time to explain.

--Kerin
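
P.S. For anyone who wants to poke at the block layout Alec describes,
here is a minimal, untested sketch using the stock Hadoop FileSystem
client API. It asks the NameNode (from its in-memory metadata) which
blocks make up a file and which DataNodes hold them; the NameNode
address and file path below are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; substitute your own.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input/big.log");  // hypothetical path

        // Block metadata comes back from the NameNode; no file data
        // is read here.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        System.out.println(file + " is stored as " + blocks.length + " block(s)");
        for (BlockLocation b : blocks) {
            System.out.println("  offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}

On a default configuration, a 1GB file should show up as 8 such
blocks of 128MB each, matching the map-task count mentioned above.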