From: Alec Ten Harmsel
Date: Thu, 18 Sep 2014 09:12:38 -0400
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] Re: File system testing
In-Reply-To: <541AA31B.3060105@fastmail.co.uk>

On 09/18/2014 05:17 AM, Kerin Millar wrote:
> On 17/09/2014 21:20, Alec Ten Harmsel wrote:
>> As far as HDFS goes, I would only set that up if you will use it for
>> Hadoop or related tools. It's highly specific, and the performance is
>> not good unless you're doing a massively parallel read (what it was
>> designed for). I can elaborate why if anyone is actually interested.
>
> I, for one, am very interested.
>
> --Kerin

Alright, here goes:

Rich Freeman wrote:
> FYI - one very big limitation of hdfs is its minimum filesize is
> something huge like 1MB or something like that. Hadoop was designed
> to take a REALLY big input file and chunk it up. If you use hdfs to
> store something like /usr/portage it will turn into the sort of
> monstrosity that you'd actually need a cluster to store.

This is exactly correct, except we run with a block size of 128MB, and
a large cluster will typically have a block size of 256MB or even
512MB.

HDFS has two main components: a NameNode, which keeps track of which
blocks are part of which file (in memory), and the DataNodes that
actually store the blocks. No data ever flows through the NameNode; it
only negotiates transfers between clients and DataNodes, and does the
same for jobs. Since the NameNode keeps all of that metadata in
memory, lots of small files are bad because they waste NameNode RAM.
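To make the metadata/data split concrete, here is roughly what a
client does through the standard Hadoop FileSystem Java API. This is
an untested sketch; the NameNode address and the file path are made
up, so substitute your own:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS points at the NameNode; this address is
            // made up.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; use any file already in HDFS.
            Path file = new Path("/data/logs/big.log");
            FileStatus status = fs.getFileStatus(file);

            // A pure metadata call, answered by the NameNode from RAM.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            System.out.println("length: " + status.getLen()
                + " bytes, block size: " + status.getBlockSize());
            for (BlockLocation b : blocks) {
                // Each block lists the DataNodes holding a replica; a
                // real read streams straight from one of those hosts,
                // never through the NameNode.
                System.out.println("offset " + b.getOffset()
                    + " len " + b.getLength()
                    + " on " + Arrays.toString(b.getHosts()));
            }
            fs.close();
        }
    }

That block-to-host listing is also what the scheduler looks at when it
tries to place map tasks near the data.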
What exactly is Hadoop/HDFS used for? The most common uses are
generating search indices on data (a batch job), doing non-realtime
processing of log and/or data streams (another batch job), and letting
a large number of analysts run disparate queries over the same large
dataset (yet another batch job). Batch processing - processing the
entire dataset - is really where Hadoop shines.

When you put a file into HDFS, it gets split based on the block size.
This is done so that a parallel read will be really fast - each map
task reads in a single block and processes it. Ergo, if you put in a
1GB file with a 128MB block size and run a MapReduce job, 8 map tasks
will be launched; if you put in a 1TB file, 8192 tasks would be
launched (the P.S. at the bottom sketches the arithmetic). Tuning the
block size is a trade-off between the overhead of launching lots of
tasks and potentially under-utilizing the cluster. Typically, a
cluster with a lot of data has a bigger block size.

The downsides of HDFS:

* Seeked reads are not supported, afaik, because no one needs them for
  batch processing.
* Seeked writes into an existing file are not supported, because
  either blocks would be added in the middle of a file and wouldn't be
  128MB, or existing blocks would be edited, resulting in blocks
  larger than 128MB. Both of those scenarios are bad.

Since HDFS users typically do not need seeked reads or seeked writes,
these downsides aren't really a big deal.

If something's not clear, let me know.

Alec
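P.S. The map-task numbers above are just file size divided by block
size, rounded up. Here is a throwaway sketch of that arithmetic with
the same 128MB block size from my example; a real job's split count
also depends on the InputFormat, so treat it as a rough estimate:

    public class SplitCount {
        // Back-of-the-envelope split count: ceil(fileSize / blockSize).
        static long splits(long fileSize, long blockSize) {
            return (fileSize + blockSize - 1) / blockSize;
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024;   // 128MB block size
            long oneGB = 1024L * 1024 * 1024;      // the 1GB file
            long oneTB = 1024L * oneGB;            // the 1TB file

            // Prints 8 and 8192, matching the figures above.
            System.out.println("1GB -> " + splits(oneGB, blockSize));
            System.out.println("1TB -> " + splits(oneTB, blockSize));
        }
    }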