From: Rich Freeman
Date: Sun, 3 May 2020 20:46:09 -0400
Subject: Re: [gentoo-user] which linux RAID setup to choose?
To: gentoo-user@lists.gentoo.org

On Sun, May 3, 2020 at 6:50 PM hitachi303 wrote:
>
> The only person I know who is running a really huge raid (I guess 2000+
> drives) is comfortable with some spare drives. His raid did fail and can
> fail. Data will be lost. Everything important has to be stored at a
> secondary location. But they are using the raid to store data for some
> days or weeks when a server is calculating stuff. If the raid fails they
> have to restart the program for the calculation.

So, if you have thousands of drives, you really shouldn't be using a
conventional RAID solution. Now, if you're just using "RAID" to refer to
any technology that stores data redundantly, that is one thing. However,
if you wanted to stick 2000 drives into a single host using something
like mdadm/zfs, or heaven forbid a bazillion LSI HBAs with some kind of
hacked-up solution for PCIe port replication plus SATA port
multipliers/etc, you're probably doing it wrong. (Really, even with
mdadm/zfs you'd still need some kind of terribly non-optimal solution
for attaching all those drives to a single host.)
At that scale you really should be using a distributed filesystem, or
some application-level solution that accomplishes the same thing on top
of a bunch of more modest hosts running zfs/etc (which is what Backblaze
did, at least in the past).

The most mainstream FOSS solution at this scale is Ceph. It achieves
redundancy at the host level. That is, if you have it set up to tolerate
two failures, you can take two random hosts in the cluster and smash
their motherboards with a hammer in the middle of operation, and the
cluster will keep on working and quickly restore its redundancy. Each
host can have multiple drives, and losing any or all of the drives
within a single host counts as a single failure. You can even do clever
stuff like tell it which hosts are attached to which circuit breakers,
and then you could lose all the hosts on a single power circuit at once
and it would still be fine.

This also covers you when one of your flaky drives causes weird bus
issues that affect other drives, or when one host crashes, and so on.
Since the redundancy is entirely at the host level, you're protected
against a much larger set of failure modes. This sort of solution also
performs much faster, because data requests aren't limited by the
CPU/NIC/HBA of any particular host.

The software is obviously more complex, but the hardware can be simpler:
if you want to expand storage you just buy more servers and plug them
into the LAN, versus trying to figure out how to cram an extra dozen
hard drives into a single host with all kinds of port multiplier games.
You can also do maintenance and reboot an entire host while the cluster
stays online, as long as you aren't messing with them all at once.

I've gone in this general direction because I was tired of having to
deal with massive cases, being limited to motherboards with 6 SATA
ports, adding LSI HBAs that require an 8x slot and often conflict with
using an NVMe, and so on.

--
Rich
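P.S. If anyone is curious what the circuit-breaker trick looks like in
practice, here is a rough sketch of the CRUSH side of it. The host,
bucket, and pool names below are made up purely for illustration; the
"pdu" bucket type is one of Ceph's stock CRUSH bucket types and exists
for exactly this power-circuit case.

  # Create per-circuit buckets and hang them off the default root.
  ceph osd crush add-bucket circuit-a pdu
  ceph osd crush add-bucket circuit-b pdu
  ceph osd crush add-bucket circuit-c pdu
  ceph osd crush move circuit-a root=default
  ceph osd crush move circuit-b root=default
  ceph osd crush move circuit-c root=default

  # Move each host under the circuit that feeds it; a host's OSDs
  # move along with it.
  ceph osd crush move host1 pdu=circuit-a
  ceph osd crush move host2 pdu=circuit-b
  ceph osd crush move host3 pdu=circuit-c

  # Make a replicated rule whose failure domain is the power circuit,
  # then point a pool at it.
  ceph osd crush rule create-replicated by-circuit default pdu
  ceph osd pool set mypool crush_rule by-circuit

With a rule like that, each replica of an object lands under a different
circuit bucket, so losing every host on one breaker never takes out more
than one copy of your data.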