* [gentoo-user] Understanding fstrim...
@ 2020-04-13  5:32 tuxic
  2020-04-13  9:22 ` Andrea Conti
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: tuxic @ 2020-04-13 5:32 UTC (permalink / raw)
To: Gentoo

Hi,

From the list I have already learned that most of my concerns regarding
the lifetime of the SSD, and the maintenance needed to prolong it, are
unfounded. Nonetheless I am interested in the technique as such.

My SSD (NVMe/M.2) is ext4-formatted, and I found articles on the
internet saying that it is not a good idea to activate the "discard"
option at mount time, nor to do a fstrim at each file deletion not
triggered by a cron job.

Since there seems to be a "not so good point in time" to do a fstrim,
I think there must also be a point in time when it is quite right to
fstrim my SSD.

fstrim clears blocks which are currently not in use and whose contents
are != 0.

The more unused blocks there are with contents != 0, the fewer blocks
the wear-leveling algorithm can use for its purpose.

That leads to the conclusion: fstrim as often as possible, to keep the
count of empty blocks as high as possible.

BUT: Clearing blocks is an action which involves writes to the cells of
the SSD.

Which is not that nice.

Or do a fstrim only at the moment when there is no usable block left.
But then the wear-leveling algorithm is already at its limits.

Which is not that nice either.

The truth - as so often - is somewhere in between.

Is it possible to get information from the SSD about how many blocks
are in the state "has contents" and "is unused", and how many blocks
are in the state "has *no* contents" and "is unused"?

Assuming this information is available: Is it possible to find the
sweet spot for when to fstrim the SSD?

Cheers!
Meino

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 5:32 [gentoo-user] Understanding fstrim tuxic @ 2020-04-13 9:22 ` Andrea Conti 2020-04-13 9:49 ` Neil Bothwick 2020-04-13 10:06 ` Michael ` (2 subsequent siblings) 3 siblings, 1 reply; 22+ messages in thread From: Andrea Conti @ 2020-04-13 9:22 UTC (permalink / raw To: gentoo-user > My SSD (NVme/M2) is ext4 formatted and I found articles on the > internet, that it is neither a good idea to activate the "discard" > option at mount time nor to do a fstrim either at each file deletion > no triggered by a cron job. I have no desire to enter the whole performance/lifetime debate; I'd just like to point out that one very real consequence of using fstrim (or mounting with the discard option) that I haven't seen mentioned often is that it makes the contents of any removed files Truly Gone(tm). No more extundelete to save your back when you mistakenly rm something that you haven't backed up for a while... andrea ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 9:22 ` Andrea Conti @ 2020-04-13 9:49 ` Neil Bothwick 2020-04-13 11:01 ` Andrea Conti 0 siblings, 1 reply; 22+ messages in thread From: Neil Bothwick @ 2020-04-13 9:49 UTC (permalink / raw To: gentoo-user [-- Attachment #1: Type: text/plain, Size: 617 bytes --] On Mon, 13 Apr 2020 11:22:47 +0200, Andrea Conti wrote: > I have no desire to enter the whole performance/lifetime debate; I'd > just like to point out that one very real consequence of using fstrim > (or mounting with the discard option) that I haven't seen mentioned > often is that it makes the contents of any removed files Truly Gone(tm). > > No more extundelete to save your back when you mistakenly rm something > that you haven't backed up for a while... Have your backup cron job call fstrim once everything is safely backed up? -- Neil Bothwick Life's a cache, and then you flush... [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
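A minimal sketch of that ordering, assuming a cron-driven rsync backup
(the paths and rsync options are placeholders, not anything from the
thread): back up first, and only trim if the backup finished cleanly, so
that anything you might still want to undelete has already been copied
off the SSD.

#!/bin/bash
# Hypothetical nightly job: trim only after a successful backup.
rsync -aH --delete /home/ /mnt/backup/home/ \
  && fstrim -v /home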
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 9:49 ` Neil Bothwick @ 2020-04-13 11:01 ` Andrea Conti 0 siblings, 0 replies; 22+ messages in thread From: Andrea Conti @ 2020-04-13 11:01 UTC (permalink / raw To: gentoo-user > Have your backup cron job call fstrim once everything is safely backed up? Well, yes, but that's beside the point. What I really wanted to stress was that mounting an SSD-backed filesystem with "discard" has effects on the ability to recover deleted data. Normally it's not a problem, but shit happens -- and when it happens on such a filesystem don't waste time with recovery tools, as all you'll get back are files full of 0xFFs. andrea ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim...
  2020-04-13  5:32 [gentoo-user] Understanding fstrim tuxic
  2020-04-13  9:22 ` Andrea Conti
@ 2020-04-13 10:06 ` Michael
  2020-04-13 11:00   ` tuxic
  2020-04-13 10:11 ` Peter Humphrey
  2020-04-13 11:39 ` Rich Freeman
  3 siblings, 1 reply; 22+ messages in thread
From: Michael @ 2020-04-13 10:06 UTC (permalink / raw)
To: Gentoo

[-- Attachment #1: Type: text/plain, Size: 3124 bytes --]

On Monday, 13 April 2020 06:32:37 BST tuxic@posteo.de wrote:
> Hi,
>
> From the list I have already learned that most of my concerns regarding
> the lifetime of the SSD, and the maintenance needed to prolong it, are
> unfounded.

Your concerns about SSD longevity are probably unfounded, but keep
up-to-date backups just in case. ;-)

> Nonetheless I am interested in the technique as such.
>
> My SSD (NVMe/M.2) is ext4-formatted, and I found articles on the
> internet saying that it is not a good idea to activate the "discard"
> option at mount time, nor to do a fstrim at each file deletion not
> triggered by a cron job.

Besides what the interwebs say about fstrim, the man page provides good
advice. It recommends running a cron job once a week for most desktop
and server implementations.

> Since there seems to be a "not so good point in time" to do a fstrim,
> I think there must also be a point in time when it is quite right to
> fstrim my SSD.
>
> fstrim clears blocks which are currently not in use and whose contents
> are != 0.
>
> The more unused blocks there are with contents != 0, the fewer blocks
> the wear-leveling algorithm can use for its purpose.

The wear-levelling mechanism uses the HPA as far as I know, although you
can always overprovision it.[1]

> That leads to the conclusion: fstrim as often as possible, to keep the
> count of empty blocks as high as possible.

Not really. Why would you need the count of empty blocks to be as high
as possible, unless you are about to write some mammoth file and *need*
every bit of available space on this disk/partition?

> BUT: Clearing blocks is an action which involves writes to the cells of
> the SSD.
>
> Which is not that nice.

It's OK, as long as you are not over-writing cells which do not need to
be overwritten. Cells with deleted data will be overwritten at some
point.

> Or do a fstrim only at the moment when there is no usable block left.

Why leave it to the last moment and incur a performance penalty while
waiting for fstrim to complete?

> But then the wear-leveling algorithm is already at its limits.
>
> Which is not that nice either.
>
> The truth - as so often - is somewhere in between.
>
> Is it possible to get information from the SSD about how many blocks
> are in the state "has contents" and "is unused", and how many blocks
> are in the state "has *no* contents" and "is unused"?
>
> Assuming this information is available: Is it possible to find the
> sweet spot for when to fstrim the SSD?

I humbly suggest you may be over-thinking something that a cron job
running fstrim once a week, or once a month, or twice a month would take
care of without you knowing or worrying about.
Nevertheless, if the usage of your disk/partitions is variable and one week you may fill it up with deleted data, while for the rest of the month you won't even touch it, there's SSDcronTRIM, a script I've been using for a while.[2] [1] https://www.thomas-krenn.com/en/wiki/SSD_Over-provisioning_using_hdparm [2] https://github.com/chmatse/SSDcronTRIM [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
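Before scheduling anything, it is also worth confirming that the kernel
actually sees discard support on the device; if lsblk reports zero for
DISC-GRAN and DISC-MAX, TRIM requests never reach the drive. The device
name below is only an example:

# lsblk --discard /dev/nvme0n1      # non-zero DISC-GRAN/DISC-MAX => TRIM supported
# cat /sys/block/nvme0n1/queue/discard_max_bytes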
* Re: [gentoo-user] Understanding fstrim...
  2020-04-13 10:06 ` Michael
@ 2020-04-13 11:00   ` tuxic
  2020-04-13 14:00     ` David Haller
  0 siblings, 1 reply; 22+ messages in thread
From: tuxic @ 2020-04-13 11:00 UTC (permalink / raw)
To: gentoo-user

Hi Michael,

thank you for replying to my questions! :)

On 04/13 11:06, Michael wrote:
> On Monday, 13 April 2020 06:32:37 BST tuxic@posteo.de wrote:
> > Hi,
> >
> > From the list I have already learned that most of my concerns regarding
> > the lifetime of the SSD, and the maintenance needed to prolong it, are
> > unfounded.
>
> Your concerns about SSD longevity are probably unfounded, but keep
> up-to-date backups just in case. ;-)

...of course! :)

My questions are driven more by curiosity than by anxiety...

> > Nonetheless I am interested in the technique as such.
> >
> > My SSD (NVMe/M.2) is ext4-formatted, and I found articles on the
> > internet saying that it is not a good idea to activate the "discard"
> > option at mount time, nor to do a fstrim at each file deletion not
> > triggered by a cron job.
>
> Besides what the interwebs say about fstrim, the man page provides good
> advice. It recommends running a cron job once a week for most desktop
> and server implementations.

...but it neither explains why to do so nor does it explain the
technical background.

For example it says:
"For most desktop and server systems a sufficient trimming frequency is
once a week."

...but why is it OK to do so? Are all PCs made equal? Are all use cases
equal? It does not even distinguish between SSD/SATA and SSD/NVMe (M.2
in my case).

These are the points where my curiosity kicks in and I start to ask
questions. :)

> > Since there seems to be a "not so good point in time" to do a fstrim,
> > I think there must also be a point in time when it is quite right to
> > fstrim my SSD.
> >
> > fstrim clears blocks which are currently not in use and whose contents
> > are != 0.
> >
> > The more unused blocks there are with contents != 0, the fewer blocks
> > the wear-leveling algorithm can use for its purpose.
>
> The wear-levelling mechanism uses the HPA as far as I know, although you
> can always overprovision it.[1]

For example: Take an SSD with 300 GB of user-usable space. To
overprovision the device, the user decides to partition only half of the
disk and format it. The rest is left untouched in "nowhere land".

Now the controller has a lot of space to shuffle data around.

Fstrim only works on the mounted part of the SSD. So do the used blocks
in "nowhere land" remain...untrimmed?

Not using all the available space for partitions is a hint I found
online...and that is what led me to the question above...

If what I read online is wrong, my assumptions are wrong too...which
isn't reassuring either.

> > That leads to the conclusion: fstrim as often as possible, to keep the
> > count of empty blocks as high as possible.
>
> Not really. Why would you need the count of empty blocks to be as high
> as possible,

Unused blocks with data cannot be used for wear leveling.

Suppose you have a total of 100 blocks: 50 blocks are used, 25 are
unused and empty, and 25 are unused but filled with former data. In this
case only 25 blocks are available to spread the next write operation
over. After fstrim, 50 blocks would be available again, and the same
amount of writes could now be spread over 50 blocks.

At least that is what I read online...
> unless you are about to write some mammoth file and *need* every bit of
> available space on this disk/partition?
>
> > BUT: Clearing blocks is an action which involves writes to the cells of
> > the SSD.
> >
> > Which is not that nice.
>
> It's OK, as long as you are not over-writing cells which do not need to
> be overwritten. Cells with deleted data will be overwritten at some
> point.
>
> > Or do a fstrim only at the moment when there is no usable block left.
>
> Why leave it to the last moment and incur a performance penalty while
> waiting for fstrim to complete?

Performance is not my concern (at least at the moment ;) ). I am trying
to fully understand the mechanisms here, since what I read online is not
free of contradictions...

> > But then the wear-leveling algorithm is already at its limits.
> >
> > Which is not that nice either.
> >
> > The truth - as so often - is somewhere in between.
> >
> > Is it possible to get information from the SSD about how many blocks
> > are in the state "has contents" and "is unused", and how many blocks
> > are in the state "has *no* contents" and "is unused"?
> >
> > Assuming this information is available: Is it possible to find the
> > sweet spot for when to fstrim the SSD?
>
> I humbly suggest you may be over-thinking something that a cron job
> running fstrim once a week, or once a month, or twice a month would take
> care of without you knowing or worrying about.

Overthinking technical problems is a vital part of my profession and
exactly what I am asked to do. I cannot put that behaviour down so
easily. :)

In my experience there aren't too many questions, Michael; there is often
only a lack of related answers.

> Nevertheless, if the usage of your disk/partitions is variable and one
> week you may fill it up with deleted data, while for the rest of the
> month you won't even touch it, there's SSDcronTRIM, a script I've been
> using for a while.[2]
>
> [1] https://www.thomas-krenn.com/en/wiki/SSD_Over-provisioning_using_hdparm
> [2] https://github.com/chmatse/SSDcronTRIM

Cheers!
Meino

^ permalink raw reply	[flat|nested] 22+ messages in thread
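On the "nowhere land" question: fstrim(8) only operates on a mounted
filesystem, so it never touches unpartitioned space. If that area was
never written to, the drive already treats it as free and there is
nothing to trim. If it was written to in a previous life (say it used to
hold a partition), blkdiscard from util-linux can discard a raw byte
range directly. The offsets below are made-up placeholders, and the
command is destructive if pointed at live data, which is why it is shown
commented out:

# blkdiscard discards whatever range you give it, with no filesystem safety net.
#blkdiscard --offset $((160*2**30)) --length $((80*2**30)) /dev/nvme0n1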
* Re: [gentoo-user] Understanding fstrim...
  2020-04-13 11:00 ` tuxic
@ 2020-04-13 14:00   ` David Haller
  0 siblings, 0 replies; 22+ messages in thread
From: David Haller @ 2020-04-13 14:00 UTC (permalink / raw)
To: gentoo-user

Hello,

On Mon, 13 Apr 2020, tuxic@posteo.de wrote:
>On 04/13 11:06, Michael wrote:
>> On Monday, 13 April 2020 06:32:37 BST tuxic@posteo.de wrote:
[..]
>My questions are driven more by curiosity than by anxiety...
[..]
>For example [the fstrim manpage] says:
>"For most desktop and server systems a sufficient trimming frequency is
>once a week."
>
>...but why is it OK to do so? Are all PCs made equal? Are all use cases
>equal? It does not even distinguish between SSD/SATA and SSD/NVMe (M.2
>in my case).

Observe your use pattern a bit and use 'fstrim -v' when you think it's
worth it, as it basically boils down to how much you delete *when*.
If you e.g.:

- constantly use the drive as a fast cache for video-editing etc.,
  writing large files to the drive and later deleting them again
  -> run fstrim daily or even mount with the 'discard' option

- write/delete somewhat regularly, e.g. as a system drive, running
  updates or emerge @world (esp. if you build on the SSD) e.g. weekly --
  these are effectively a write operation and a bunch of deletions. Or
  if you do whatever other deletions somewhat regularly
  -> run fstrim after each one or three such deletions, e.g. via a
  weekly cronjob

- mostly write (if anything), rarely delete anything
  -> run fstrim manually a few days after $some deletions have
  accumulated, or at any other convenient time you can remember and are
  sure all deleted files can be gone, be it bi-weekly, monthly,
  tri-monthly, yearly, completely irregularly, whenever ;)

Choose anything in the range that fits _your_ use-pattern best,
considering capacity, free space (no matter if on a partition or
unallocated) and what size was trimmed when running 'fstrim -v'...

Running that weekly (I'd suggest bi-weekly) 'fstrim' cronjob is not a
bad suggestion as a default I guess, but observe your use and choose to
deviate or not :)

My gut says to run fstrim if:

- it'd trim more than 5% (-ish) of capacity
- it'd trim more than 20% (-ish) of the remaining "free" space
  (including unallocated)
- it'd trim more than $n GiB (where $n may be anything ;)

whichever comes first (and the two latter can only be determined by
observation). No need to run fstrim after deleting just 1 kiB. Or 1 MiB.
Not that me lazybag adheres to that, read on if you will... ;)

FWIW: I run fstrim a few times a year when I think of it and guesstimate
I did delete quite a bit in the meantime (much like I run fsck ;) ...
This usually trims a few GiB on my 128G drive:

# fdisk -u -l /dev/sda
Disk /dev/sda: 119.2 GiB, 128035676160 bytes, 250069680 sectors
Disk model: SAMSUNG SSD 830
[..]
Device     Boot     Start       End   Sectors Size Id Type
/dev/sda1            2048 109053951 109051904  52G 83 Linux
/dev/sda2       109053952 218105855 109051904  52G 83 Linux

(I did leave ~15GiB unpartitioned, and was too lazy to rectify that yet;
at the time I partitioned, in 2012, overprovisioning was still a good
thing for many (cheaper?) SSDs, and 'duh intarweb' was quite a bit worse
than today regarding the problem)...

So while I'm about it, I guess it's time to run fstrim (for the first
time this year IIRC) ...
# fstrim -v /sda1 ; fstrim -v /sda2   ## mountpoints mangled
/sda1: 7563407360 bytes were trimmed
/sda2: 6842478592 bytes were trimmed
# calc 'x=config("display",1); 7563407360/2^30; 6842478592/2^30'
        ~7.0
        ~6.4

So, my typical few GiB, or about 12.8% of disk capacity (summed), were
trimmed (oddly enough, it's always been in this 4-8 GiB/partition
range). I probably should run fstrim a bit more often though, but then
again I've still got those unallocated 15G, so I guess I'm fine. And
that's with quite a large Gentoo system on /dev/sda2 and all its at
times large (like libreoffice, firefox, seamonkey, icedtea, etc.)
updates:

# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        52G   45G  3.8G  93% /

PORTAGE_TMPDIR, PORTDIR (and distfiles and packages) are on other HDDs
though, so building stuff does not affect the SSD, only the actual
install (merge) and whatever else. But I've got /var/log/ on the SSD on
both systems (sda1/sda2).

While I'm at it:

# smartctl -A /dev/sda
[..]
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH [..] RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    [..] 0
  9 Power_On_Hours          0x0032   091   091   000    [..] 43261
 12 Power_Cycle_Count       0x0032   097   097   000    [..] 2617
177 Wear_Leveling_Count     0x0013   093   093   000    [..] 247
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    [..] 0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    [..] 0
182 Erase_Fail_Count_Total  0x0032   100   100   010    [..] 0
183 Runtime_Bad_Block       0x0013   100   100   010    [..] 0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    [..] 0
190 Airflow_Temperature_Cel 0x0032   067   050   000    [..] 33
195 ECC_Error_Rate          0x001a   200   200   000    [..] 0
199 CRC_Error_Count         0x003e   253   253   000    [..] 0
235 POR_Recovery_Count      0x0012   099   099   000    [..] 16
241 Total_LBAs_Written      0x0032   099   099   000    [..] 12916765928

[ '[..]': none show WHEN_FAILED other than '-' and the rest is standard]

Wow, I have almost exactly 6 TBW, or ~52 "Drive writes" or "Capacity
writes", on this puny 128GB SSD:

# calc 'printf("%0.2d TiB written\n%0.1d drive writes\n", 12916765928/2/2^30, 12916765928/250069680);'
~6.01 TiB written
~51.7 drive writes

And I forgot I've been running this drive for this long already (not
that I've been running it 24/7, by quite a bit, but since July 2012, or
about 15/7-ish):

$ dateduration 43261h   ### [1]
4 years 11 months 7 days 13 hours 0 minutes 0 seconds

HTH,
-dnh

[1]
==== ~/bin/dateduration ====
#!/bin/bash
F='%Y years %m months %d days %H hours %M minutes %S seconds'
now=$(date +%s)
datediff -f "$F" $(dateadd -i '%s' "$now" +0s) $(dateadd -i '%s' "$now" $1)
====

If anyone knows a better way... ;)

-- 
printk("; crashing the system because you wanted it\n");
	linux-2.6.6/fs/hpfs/super.c

^ permalink raw reply	[flat|nested] 22+ messages in thread
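A dateutils-free way to split a power-on-hours figure is plain shell
arithmetic; it uses fixed 365-day years, so it drifts a little from the
calendar-exact output above (a sketch, not a drop-in replacement):

#!/bin/bash
# usage: poh 43261h   ->  "4 years 342 days 13 hours"
h=${1%h}
printf '%d years %d days %d hours\n' \
  $(( h / 8760 )) $(( h % 8760 / 24 )) $(( h % 24 ))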
* Re: [gentoo-user] Understanding fstrim...
  2020-04-13  5:32 [gentoo-user] Understanding fstrim tuxic
  2020-04-13  9:22 ` Andrea Conti
  2020-04-13 10:06 ` Michael
@ 2020-04-13 10:11 ` Peter Humphrey
  2020-04-13 11:39 ` Rich Freeman
  3 siblings, 0 replies; 22+ messages in thread
From: Peter Humphrey @ 2020-04-13 10:11 UTC (permalink / raw)
To: Gentoo

On Monday, 13 April 2020 06:32:37 BST tuxic@posteo.de wrote:

> Assuming this information is available: Is it possible to find the
> sweet spot for when to fstrim the SSD?

This crontab entry is my compromise:

15 3 */2 * * /sbin/fstrim -a

It does assume I'll be elsewhere at 03:15, of course.

-- 
Regards,
Peter.

^ permalink raw reply	[flat|nested] 22+ messages in thread
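On systemd machines much the same job comes prepackaged: util-linux
ships an fstrim.service/fstrim.timer pair that runs a periodic fstrim
over the mounted filesystems weekly, assuming your distro installs the
units:

# systemctl enable --now fstrim.timer
# systemctl list-timers fstrim.timer    # shows when the next run is due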
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 5:32 [gentoo-user] Understanding fstrim tuxic ` (2 preceding siblings ...) 2020-04-13 10:11 ` Peter Humphrey @ 2020-04-13 11:39 ` Rich Freeman 2020-04-13 11:55 ` Michael 3 siblings, 1 reply; 22+ messages in thread From: Rich Freeman @ 2020-04-13 11:39 UTC (permalink / raw To: gentoo-user On Mon, Apr 13, 2020 at 1:32 AM <tuxic@posteo.de> wrote: > > fstrim clears blocks, which currently are not in use and which > contents is != 0. >... > BUT: Clearing blocks is an action, which includes writes to the cells of > the SSD. I see a whole bunch of discussion, but it seems like many here don't actually understand what fstrim actually does. It doesn't "clear" anything, and it doesn't care what the contents of a block are. It doesn't write to the cells of the SSD per se. It issues the TRIM command to the drive for any unused blocks (or a subset of them if you use non-default options). It doesn't care what the contents of the blocks are when it does so - it shouldn't even try to read the blocks to know what their content is. Trimming a block won't clear it at all. It doesn't write to the cells of the SSD either - at least not the ones being trimmed. It just tells the drive controller that the blocks are no longer in use. Now, the drive controller needs to keep track of which blocks are in use (which it does whether you use fstrim or not), and that data is probably stored in some kind of flash so that it is persistent, but presumably that is managed in such a manner that it is unlikely to fail before the rest of the drive fails. On a well-implemented drive trimming actually REDUCES writes. When you trim a block the drive controller will stop trying to preserve its contents. If you don't trim it, then the controller will preserve its contents. Preserving the content of unused blocks necessarily involves more writing to the drive than just treating as zero-fillled/etc. Now, where you're probably getting at the concept of clearing and zeroing data is that if you try to read a trimmed block the drive controller probably won't even bother to read the block from the ssd and will just return zeros. Those zeros were never written to flash - they're just like a zero-filled file in a filesystem. If you write a bazillion zeros to a file on ext4 it will just record in the filesystem data that you have a bunch of blocks of zero and it won't allocate any actual space on disk - reading that file requires no reading of the actual disk beyond the metadata because they're not stored in actual extents. Indeed blocks are more of a virtual thing on an SSD (or even hard drive these days), so if a logical block isn't mapped to a physical storage area there isn't anything to read in the first place. However, when you trimmed the file the drive didn't go find some area of flash and fill it with zeros. It just marked it as unused or removed its logical mapping to physical storage. In theory you should be able to use discard or trim the filesystem every 5 minutes with no negative effects at all. In theory. However, many controllers (especially old ones) aren't well-implemented and may not handle this efficiently. A trim operation is still an operation the controller has to deal with, and so deferring it to a time when the drive is idle could improve performance, especially for drives that don't do a lot of writes. If a drive has a really lousy controller then trims might cause its stupid firmware to do stupid things. 
However, this isn't really anything intrinsic to the concept of trimming. Fundamentally trimming is just giving the drive more information about the importance of the data it is storing. Just about any filesystem benefits from having more information about what it is storing if it is well-implemented. In a perfect world we'd just enable discard on our mounts and be done with it. I'd probably just look up the recommendations for your particular drive around trimming and follow those. Somebody may have benchmarked it to determine how brain-dead it is. If you bought a more name-brand SSD you're probably more likely to benefit from more frequent trimming. I'm personally using zfs which didn't support trim/discard until very recently, and I'm not on 0.8 yet, so for me it is a bit of a moot point. I plan to enable it once I can do so. -- Rich ^ permalink raw reply [flat|nested] 22+ messages in thread
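For completeness, the "perfect world" variant is just a mount option. A
hypothetical ext4 line, where the device, mount point and the other
options are placeholders for whatever your system actually uses:

# /etc/fstab -- "discard" turns on continuous/online discard for ext4
/dev/nvme0n1p2   /   ext4   defaults,noatime,discard   0 1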
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 11:39 ` Rich Freeman @ 2020-04-13 11:55 ` Michael 2020-04-13 12:18 ` Rich Freeman 2020-04-13 15:48 ` [gentoo-user] " Holger Hoffstätte 0 siblings, 2 replies; 22+ messages in thread From: Michael @ 2020-04-13 11:55 UTC (permalink / raw To: gentoo-user [-- Attachment #1: Type: text/plain, Size: 4715 bytes --] On Monday, 13 April 2020 12:39:11 BST Rich Freeman wrote: > On Mon, Apr 13, 2020 at 1:32 AM <tuxic@posteo.de> wrote: > > fstrim clears blocks, which currently are not in use and which > > contents is != 0. > > > >... > > > > BUT: Clearing blocks is an action, which includes writes to the cells of > > the SSD. > > I see a whole bunch of discussion, but it seems like many here don't > actually understand what fstrim actually does. > > It doesn't "clear" anything, and it doesn't care what the contents of > a block are. It doesn't write to the cells of the SSD per se. > > It issues the TRIM command to the drive for any unused blocks (or a > subset of them if you use non-default options). It doesn't care what > the contents of the blocks are when it does so - it shouldn't even try > to read the blocks to know what their content is. > > Trimming a block won't clear it at all. It doesn't write to the cells > of the SSD either - at least not the ones being trimmed. It just > tells the drive controller that the blocks are no longer in use. > > Now, the drive controller needs to keep track of which blocks are in > use (which it does whether you use fstrim or not), and that data is > probably stored in some kind of flash so that it is persistent, but > presumably that is managed in such a manner that it is unlikely to > fail before the rest of the drive fails. > > On a well-implemented drive trimming actually REDUCES writes. When > you trim a block the drive controller will stop trying to preserve its > contents. If you don't trim it, then the controller will preserve its > contents. Preserving the content of unused blocks necessarily > involves more writing to the drive than just treating as > zero-fillled/etc. > > Now, where you're probably getting at the concept of clearing and > zeroing data is that if you try to read a trimmed block the drive > controller probably won't even bother to read the block from the ssd > and will just return zeros. Those zeros were never written to flash - > they're just like a zero-filled file in a filesystem. If you write a > bazillion zeros to a file on ext4 it will just record in the > filesystem data that you have a bunch of blocks of zero and it won't > allocate any actual space on disk - reading that file requires no > reading of the actual disk beyond the metadata because they're not > stored in actual extents. Indeed blocks are more of a virtual thing > on an SSD (or even hard drive these days), so if a logical block isn't > mapped to a physical storage area there isn't anything to read in the > first place. > > However, when you trimmed the file the drive didn't go find some area > of flash and fill it with zeros. It just marked it as unused or > removed its logical mapping to physical storage. > > In theory you should be able to use discard or trim the filesystem > every 5 minutes with no negative effects at all. In theory. However, > many controllers (especially old ones) aren't well-implemented and may > not handle this efficiently. 
A trim operation is still an operation > the controller has to deal with, and so deferring it to a time when > the drive is idle could improve performance, especially for drives > that don't do a lot of writes. If a drive has a really lousy > controller then trims might cause its stupid firmware to do stupid > things. However, this isn't really anything intrinsic to the concept > of trimming. > > Fundamentally trimming is just giving the drive more information about > the importance of the data it is storing. Just about any filesystem > benefits from having more information about what it is storing if it > is well-implemented. In a perfect world we'd just enable discard on > our mounts and be done with it. > > I'd probably just look up the recommendations for your particular > drive around trimming and follow those. Somebody may have benchmarked > it to determine how brain-dead it is. If you bought a more name-brand > SSD you're probably more likely to benefit from more frequent > trimming. > > I'm personally using zfs which didn't support trim/discard until very > recently, and I'm not on 0.8 yet, so for me it is a bit of a moot > point. I plan to enable it once I can do so. What Rich said, plus: I have noticed when prolonged fstrim takes place on an old SSD drive of mine it becomes unresponsive. As Rich said this is not because data is being physically deleted, only a flag is switched from 1 to 0 to indicate its availability for further writes. As I understand the firmware performs wear-leveling when it needs to in the HPA allocated blocks, rather than waiting for the user/OS to run fstrim to obtain some more 'free' space. [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 11:55 ` Michael @ 2020-04-13 12:18 ` Rich Freeman 2020-04-13 13:18 ` tuxic 2020-04-13 15:48 ` [gentoo-user] " Holger Hoffstätte 1 sibling, 1 reply; 22+ messages in thread From: Rich Freeman @ 2020-04-13 12:18 UTC (permalink / raw To: gentoo-user On Mon, Apr 13, 2020 at 7:55 AM Michael <confabulate@kintzios.com> wrote: > > I have noticed when prolonged fstrim takes place on an old SSD drive of mine > it becomes unresponsive. As Rich said this is not because data is being > physically deleted, only a flag is switched from 1 to 0 to indicate its > availability for further writes. That, and it is going about it in a brain-dead manner. A modern drive is basically a filesystem of sorts. It just uses block numbers instead of filenames, but it can be just as complex underneath. And just as with filesystems there are designs that are really lousy and designs that are really good. And since nobody sees the source code or pays much attention to the hidden implementation details, it is often not designed with your requirements in mind. I suspect a lot of SSDs are using the SSD-equivalent of FAT32 to manage their block remapping. Some simple algorithm that gets the job done but which doesn't perform well/etc. This makes me wonder if there would be a benefit from coming up with a flash block layer of some sort that re-implements this stuff properly. We have stuff like f2fs which does this at the filesystem level. However, this might be going too far as this inevitably competes with the filesystem layer on features/etc. Maybe what we need is something more like lvm for flash. It doesn't try to be a filesystem. It just implements block-level storage mapping one block device to a new block device. It might very well implement a log-based storage layer. It would accept TRIM commands and any other related features. It would then have a physical device translation layer. Maybe it would be aware of different drive models and their idiosyncrasies, so on some drives it might just be a NOOP passthrough and on other drives it implements its own log-based storage with batched trims on large contiguous regions, and so on. Since it isn't a full POSIX filesystem it could be much simpler and just focus on the problem it needs to solve - dealing with brain-dead SSD controllers. -- Rich ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 12:18 ` Rich Freeman @ 2020-04-13 13:18 ` tuxic 2020-04-13 14:27 ` Rich Freeman 0 siblings, 1 reply; 22+ messages in thread From: tuxic @ 2020-04-13 13:18 UTC (permalink / raw To: gentoo-user On 04/13 08:18, Rich Freeman wrote: > On Mon, Apr 13, 2020 at 7:55 AM Michael <confabulate@kintzios.com> wrote: > > > > I have noticed when prolonged fstrim takes place on an old SSD drive of mine > > it becomes unresponsive. As Rich said this is not because data is being > > physically deleted, only a flag is switched from 1 to 0 to indicate its > > availability for further writes. > > That, and it is going about it in a brain-dead manner. > > A modern drive is basically a filesystem of sorts. It just uses block > numbers instead of filenames, but it can be just as complex > underneath. > > And just as with filesystems there are designs that are really lousy > and designs that are really good. And since nobody sees the source > code or pays much attention to the hidden implementation details, it > is often not designed with your requirements in mind. > > I suspect a lot of SSDs are using the SSD-equivalent of FAT32 to > manage their block remapping. Some simple algorithm that gets the job > done but which doesn't perform well/etc. > > This makes me wonder if there would be a benefit from coming up with a > flash block layer of some sort that re-implements this stuff properly. > We have stuff like f2fs which does this at the filesystem level. > However, this might be going too far as this inevitably competes with > the filesystem layer on features/etc. > > Maybe what we need is something more like lvm for flash. It doesn't > try to be a filesystem. It just implements block-level storage > mapping one block device to a new block device. It might very well > implement a log-based storage layer. It would accept TRIM commands > and any other related features. It would then have a physical device > translation layer. Maybe it would be aware of different drive models > and their idiosyncrasies, so on some drives it might just be a NOOP > passthrough and on other drives it implements its own log-based > storage with batched trims on large contiguous regions, and so on. > Since it isn't a full POSIX filesystem it could be much simpler and > just focus on the problem it needs to solve - dealing with brain-dead > SSD controllers. > > -- > Rich > Hi Rich, hi Michael, THAT is information I like...now I start(!) to understand the "inner mechanics" of fstrim...thank you very much!!! :::))) One quesion -- not to express any doubt of what you wrote Rich, but onlu to check, whether I understand that detail or not: Fstrim "allows" the drive to trim ittself. The actual "trimming" is done by the drive ittself without any interaction from the outside of the SSD. You wrote: > > Now, the drive controller needs to keep track of which blocks are in > > use (which it does whether you use fstrim or not), and that data is > > probably stored in some kind of flash so that it is persistent, but > > presumably that is managed in such a manner that it is unlikely to > > fail before the rest of the drive fails. and: > > Fundamentally trimming is just giving the drive more information about > > the importance of the data it is storing. Just about any filesystem For me (due my beginners level of knowing the "behind the scene" things) this is kinda contradictionous. On the one hand, the SSD drive keeps track of the information, what blocks are used and unused. 
And trimming is done by the drive itself. On the other hand, trimming is
"just giving the drive more information about....".

What kind of information does the command-line tool fstrim transfer to
the SSD, besides the command "fstrim yourself" (an ioctl, I think?)
which is needed to trim the blocks, and what kind of information does
the SSD collect itself for this purpose?

Cheers!
Meino

^ permalink raw reply	[flat|nested] 22+ messages in thread
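One way to see exactly what the userspace tool hands over: fstrim issues
the FITRIM ioctl on the mounted filesystem, passing a range and a
minimum extent length, and the filesystem driver then emits discard
requests for its own free extents. The mount point below is a
placeholder:

# strace -e trace=ioctl fstrim -v /mnt/ssd

The drive never sees that ioctl; it only receives the resulting
TRIM/deallocate commands for concrete block ranges, which is the "more
information" being talked about here.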
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 13:18 ` tuxic @ 2020-04-13 14:27 ` Rich Freeman 2020-04-13 15:41 ` David Haller 0 siblings, 1 reply; 22+ messages in thread From: Rich Freeman @ 2020-04-13 14:27 UTC (permalink / raw To: gentoo-user On Mon, Apr 13, 2020 at 9:18 AM <tuxic@posteo.de> wrote: > > One quesion -- not to express any doubt of what you wrote Rich, but onlu > to check, whether I understand that detail or not: > > Fstrim "allows" the drive to trim ittself. The actual "trimming" is > done by the drive ittself without any interaction from the outside > of the SSD. > ... > On the one hand, the SSD drive keeps track of the information, what > blocks are used and unused. And trimming is done by the drive in > itsself. On the other hand trimming is "just giving the drive more > information about...." > > What kind of information does the commandline tool fstrim transfers to > the SSD beside the command "fstrim yourself" (an ioctl, I think?), > which is needed to fstrim the blocks and what kind of information is > the SDD collecting itsself for this purpose? > So, "trimming" isn't something a drive does really. It is a logical command issued to the drive. The fundamental operations the drive does at the physical layer are: 1. Read a block 2. Write a block that is empty 3. Erase a large group of blocks to make them empty The key attribute of flash is that the physical unit of blocks that can be erased is much larger than the individual blocks that can be read/written. This is what leads to the need for TRIM. If SSDs worked like traditional hard drives where an individual block could be rewritten freely then there would be no need for trimming. Actually, drives with 4k sectors is a bit of an analogy though the cost for partial rewrites on a hard drive is purely a matter of performance and not longevity and the 4k sector issue can be solved via alignment. It is an analogous problem though, and SMR hard drives is a much closer analogy (really no different from 4k sectors except now the magnitude of the problem is MUCH bigger). Whether you use TRIM or not an SSD has to translate any logical operation into the three physical operations above. It also has to balance wear into this. So, if you don't use trim a logical instruction like "write this data to block 1" turns into: 1. Look at the mapping table to determine that logical block 1 is at physical block 123001. 2. Write the new contents of block 1 to physical block 823125 which is unused-but-clean. 3. Update the mapping table so that block 1 is at physical block 823125, and mark 123001 as unused-but-dirty. Then the controller would have two background tasks that it would run periodically: Task 1 - look for contiguous regions marked as unused-but-dirty. When one of the right size/alignment is identified erase it physically and mark it as unused-but-clean. Task 2 - if the amount of unused-but-clean space gets too low, then find areas that have fragmented unused-but-dirty space. The used space in those regions would get copied to new unused-but-clean blocks and remapped, and then task 1 will erase them and mark them as clean. This deals with fragmented unused space. Every SSD will do some variation on this whether you ever use the trim command as it is necessary for wear-leveling and dealing with the physical erase limitations. Now, in this hypothetical case here is how the drive handles a TRIM command. If it gets the logical instruction "TRIM block 1" what it does is: 1. 
Look at the mapping table to determine that logical block 1 is at physical block 123001. 2. Mark physical block 123001 as unused-but-dirty in the mapping table. That's all it does. There are four ways that a drive can get marked as unused on an SSD: 1. Out of the box all blocks are unused-but-clean. (costs no operations that you care about) 2. The trim command marks a block as unused-but-dirty. (costs no operations) 3. Block overwrites mark the old block as unused-but-dirty. (costs a write operation, but you were writing data anyway) 4. Task 2 can mark blocks as unused-but-dirty. (costs a bunch of reads and writes) Basically the goal of TRIM is to do more of #2 and less of #4 above, which is an expensive read-write defragmentation process. Plus #4 also increases drive wear since it involves copying data. Now this is all a bit oversimplified but I believe it is accurate as far as it illustrates the concept. A real drive probably groups logical blocks a bit more so that it doesn't need to maintain a 1:1 block mapping which seems like it would use a lot of space. Again, it is a bit like a filesystem so all the optimizations filesystems use like extents/etc would apply. At the physical level the principle is that the drive has to deal with the issue that reads/writes are more granular than erases, and everything else flows from that. The same issue applies to SMR hard drives, which were discussed on this list a while ago. -- Rich ^ permalink raw reply [flat|nested] 22+ messages in thread
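A toy model of that bookkeeping, in shell just to make the states
concrete; the block counts and numbers are made up, and a real FTL does
this in firmware over much larger units:

#!/bin/bash
# Logical blocks map to physical blocks. An overwrite goes to a clean
# physical block and the old copy is merely marked dirty; a TRIM only
# flips that mark -- nothing is written to the trimmed block itself.
declare -A map     # logical block  -> physical block
declare -A state   # physical block -> clean | used | dirty
for p in 0 1 2 3 4 5 6 7; do state[$p]=clean; done

next_clean() {
    local p
    for p in 0 1 2 3 4 5 6 7; do
        [[ ${state[$p]} == clean ]] && { echo "$p"; return 0; }
    done
    return 1    # nothing clean left: the background erase (Task 1/2) must run
}

write_block() {  # write_block <logical>
    local l=$1 old=${map[$l]-} p
    p=$(next_clean) || { echo "no clean blocks left"; return 1; }
    state[$p]=used; map[$l]=$p
    if [[ -n $old ]]; then
        state[$old]=dirty
        echo "write L$l -> P$p (old P$old marked dirty)"
    else
        echo "write L$l -> P$p"
    fi
}

trim_block() {   # trim_block <logical>
    local l=$1 p=${map[$l]-}
    [[ -n $p ]] || return 0
    state[$p]=dirty; unset "map[$l]"    # bookkeeping only, no flash write
    echo "trim  L$l: P$p marked dirty, nothing written"
}

write_block 1    # first write of logical block 1
write_block 1    # overwrite: new physical block, old one becomes dirty
trim_block 1     # delete: only the mapping table changes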
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 14:27 ` Rich Freeman @ 2020-04-13 15:41 ` David Haller 2020-04-13 16:05 ` Rich Freeman 0 siblings, 1 reply; 22+ messages in thread From: David Haller @ 2020-04-13 15:41 UTC (permalink / raw To: gentoo-user Hello, On Mon, 13 Apr 2020, Rich Freeman wrote: >So, "trimming" isn't something a drive does really. It is a logical >command issued to the drive. > >The fundamental operations the drive does at the physical layer are: >1. Read a block >2. Write a block that is empty >3. Erase a large group of blocks to make them empty [..] >Now, in this hypothetical case here is how the drive handles a TRIM >command. If it gets the logical instruction "TRIM block 1" what it >does is: > >1. Look at the mapping table to determine that logical block 1 is at >physical block 123001. >2. Mark physical block 123001 as unused-but-dirty in the mapping table. > >That's all it does. There are four ways that a drive can get marked >as unused on an SSD: >1. Out of the box all blocks are unused-but-clean. (costs no >operations that you care about) >2. The trim command marks a block as unused-but-dirty. (costs no operations) >3. Block overwrites mark the old block as unused-but-dirty. (costs a >write operation, but you were writing data anyway) >4. Task 2 can mark blocks as unused-but-dirty. (costs a bunch of reads >and writes) > >Basically the goal of TRIM is to do more of #2 and less of #4 above, >which is an expensive read-write defragmentation process. Plus #4 >also increases drive wear since it involves copying data. Beautifully summarized Rich! But I'd like to add two little aspects: First of all: "physical write blocks" in the physical flash are 128kB or something in that size range, not 4kB or even 512B ... Haven't read, but looking enticing neither https://en.wikipedia.org/wiki/Write_amplification nor https://en.wikipedia.org/wiki/Trim_(computing) I hope they cover it ;) Anyway, a write to a single (used) logical 512B block involves: 1. read existing data of the phy-block-group (e.g. 128KB) 2. write data of logical block to the right spot of in-mem block-group 3. write in-mem block-group to (a different, unused) phy-block-group 4. update all logical block pointers to new phy-block-group as needed 5. mark old phy-block-group as unused And whatnot. And second: fstrim just makes the OS (via the Filesystem driver via the SATA/NVME/SCSI driver through some hoops), or the Filesystem when mounted with 'discard' via the drivers, tell the SSD one simple thing about logical blocks that a deleted file used to use (in the TRIM ATA/SCSI/SATA/NVME command, wikipedite for where TRIM is specced ;): "Hey, SSD, here's a list of LBAs (logical blocks) I no longer need. You may hencewith treat them as empty/unused." Without it, the SSD has no idea about those blocks being unneeded and treats blocks, once written to, as used blocks, doing the _tedious_ Copy-on-Write when a write hits one of those logical blocks, even if those were deleted on the filesystem level years ago... see above WP articles. Without TRIM, the SSD only gets to know the fact, when the driver (the FS) writes again to the same logical block ... With TRIM, the SSD-Controller knows what logical blocks it can treat as unused, and do much better wear-leveling. So, it's sort of a "trickle down 'unlink()' to the SSD"-feature. On the logical-block level, mind you. But for the SSD, that can be quite a "relief" regarding space for wear-leveling. 
And what takes time when doing a "large" TRIM is transmitting a _large_ list of blocks to the SSD via the TRIM command. That's why e.g. those ~6-7GiB trims I did just before (see my other mail) took a couple of seconds for 13GiB ~ 25M LBAs ~ a whole effin bunch of TRIM commands (no idea... wait, 1-4kB per TRIM and 4B/LBA is max. 1k LBAs/TRIM and for 25M LBAs you'll need minimum 25-100k TRIM commands... go figure ;) no wonder it takes a second or few ;) Oh, and yes, on rotating rust, all that does not matter. You'd just let the data rot and write at 512B (or now 4kB) granularity. Well, those 4k-but-512Bemulated drives (which is about all new ones by now I think) have to do something like SSDs. But only on the 4kB level. Plus the SMR shingling stuff of course. When will those implement TRIM? HTH, -dnh -- All Hardware Sucks and I do not consider myself to actually have any data until there's an offsite backup of it. -- Anthony de Boer ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 15:41 ` David Haller @ 2020-04-13 16:05 ` Rich Freeman 2020-04-13 20:34 ` antlists 0 siblings, 1 reply; 22+ messages in thread From: Rich Freeman @ 2020-04-13 16:05 UTC (permalink / raw To: gentoo-user On Mon, Apr 13, 2020 at 11:41 AM David Haller <gentoo@dhaller.de> wrote: > > First of all: "physical write blocks" in the physical flash are 128kB > or something in that size range, not 4kB or even 512B Yup, though I never claimed otherwise. I just made the generic statement that the erase blocks are much larger than the write blocks, even moreso than on a 4k hard drive. (The only time I mentioned 4k was in the context of hard drives, not SSDs.) > Anyway, a write to a single (used) logical 512B block > involves: > > 1. read existing data of the phy-block-group (e.g. 128KB) > 2. write data of logical block to the right spot of in-mem block-group > 3. write in-mem block-group to (a different, unused) phy-block-group > 4. update all logical block pointers to new phy-block-group as needed > 5. mark old phy-block-group as unused Yup. Hence my statement that my description was a simplification and that a real implementation would probably use extents to save memory. You're describing 128kB extents. However, there is no reason that the drive has to keep all the blocks in an erase group together, other than to save memory in the mapping layer. If it doesn't then it can modify a logical block without having to re-read adjacent logical blocks. > And what takes time when doing a "large" TRIM is transmitting a > _large_ list of blocks to the SSD via the TRIM command. That's why > e.g. those ~6-7GiB trims I did just before (see my other mail) took a > couple of seconds for 13GiB ~ 25M LBAs ~ a whole effin bunch of TRIM > commands (no idea... wait, 1-4kB per TRIM and 4B/LBA is max. 1k > LBAs/TRIM and for 25M LBAs you'll need minimum 25-100k TRIM > commands... go figure ;) no wonder it takes a second or few ;) There is no reason that 100k TRIM commands need to take much time. Transmitting the commands is happening at SATA speeds at least. I'm not sure what the length of the data in a trim instruction is, but even if it were 10-20 bytes you could send 100k of those in 1MB, which takes <10ms to transfer depending on the SATA generation. Now, the problem is the implementation on the drive. If the drive takes a long time to retire each command then that is what backs up the queue, and hence that is why the behavior depends a lot on firmware/etc. The drive mapping is like a filesystem and as we all know some filesystems are faster than others for various operations. Also as we know hardware designers often aren't optimizing for performance in these matters. > Oh, and yes, on rotating rust, all that does not matter. You'd just > let the data rot and write at 512B (or now 4kB) granularity. Well, > those 4k-but-512Bemulated drives (which is about all new ones by now I > think) have to do something like SSDs. But only on the 4kB level. Plus > the SMR shingling stuff of course. When will those implement TRIM? And that would be why I used 4k hard drives and SMR drives as an analogy. 4k hard drives do not support TRIM but as you (and I) pointed out, they're only dealing with 4k at a time. SMR drives sometimes do support TRIM. -- Rich ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 16:05 ` Rich Freeman @ 2020-04-13 20:34 ` antlists 2020-04-13 20:58 ` Rich Freeman 0 siblings, 1 reply; 22+ messages in thread From: antlists @ 2020-04-13 20:34 UTC (permalink / raw To: gentoo-user On 13/04/2020 17:05, Rich Freeman wrote: >> And what takes time when doing a "large" TRIM is transmitting a >> _large_ list of blocks to the SSD via the TRIM command. That's why >> e.g. those ~6-7GiB trims I did just before (see my other mail) took a >> couple of seconds for 13GiB ~ 25M LBAs ~ a whole effin bunch of TRIM >> commands (no idea... wait, 1-4kB per TRIM and 4B/LBA is max. 1k >> LBAs/TRIM and for 25M LBAs you'll need minimum 25-100k TRIM >> commands... go figure;) no wonder it takes a second or few;) > There is no reason that 100k TRIM commands need to take much time. > Transmitting the commands is happening at SATA speeds at least. I'm > not sure what the length of the data in a trim instruction is, but > even if it were 10-20 bytes you could send 100k of those in 1MB, which > takes <10ms to transfer depending on the SATA generation. Dare I say it ... buffer bloat? poor implementation? aiui, the spec says you can send a command "trim 1GB starting at block X". Snag is, the linux block size of 4KB means that it gets split into loads of trim commands, which then clogs up all the buffers ... Plus all too often the trim command is synchronous, so although it is pretty quick, the drive won't accept the next command until the previous one has completed. Cheers, Wol ^ permalink raw reply [flat|nested] 22+ messages in thread
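How finely the kernel splits a large discard before it reaches the drive
is visible in sysfs: discard_max_bytes is the per-request ceiling the
block layer uses when splitting, and discard_granularity the smallest
unit worth sending. The device name is an example:

# cat /sys/block/sda/queue/discard_granularity
# cat /sys/block/sda/queue/discard_max_bytes
# cat /sys/block/sda/queue/discard_max_hw_bytes   # on newer kernels: the device's own limit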
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 20:34 ` antlists @ 2020-04-13 20:58 ` Rich Freeman 2020-04-14 3:32 ` tuxic 0 siblings, 1 reply; 22+ messages in thread From: Rich Freeman @ 2020-04-13 20:58 UTC (permalink / raw To: gentoo-user On Mon, Apr 13, 2020 at 4:34 PM antlists <antlists@youngman.org.uk> wrote: > > aiui, the spec says you can send a command "trim 1GB starting at block > X". Snag is, the linux block size of 4KB means that it gets split into > loads of trim commands, which then clogs up all the buffers ... > Hmm, found the ATA spec at: http://nevar.pl/pliki/ATA8-ACS-3.pdf In particular page 76 which outlines the LBA addressing for TRIM. It looks like up to 64k ranges can be sent per trim, and each range can cover up to 64k blocks. So that is 2^32 blocks per trim command, or 2TiB of data per TRIM if the blocks are 512 bytes (which I'm guessing is the case for ATA but I didn't check). The command itself would be half a megabyte since each range is 64 bits. But if the kernel chops them up as you say that will certainly add overhead. The drive controller itself is probably the bigger bottleneck unless it is designed to do fast TRIMs. -- Rich ^ permalink raw reply [flat|nested] 22+ messages in thread
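Running that arithmetic through calc, as used earlier in the thread --
65536 ranges per command, 65536 sectors per range, 512-byte sectors, and
8 bytes of range data per entry:

# calc '2^16 * 2^16 * 512 / 2^40'   # TiB addressable by a single TRIM command
        2
# calc '2^16 * 8 / 2^10'            # KiB of range data in that command
        512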
* Re: [gentoo-user] Understanding fstrim...
  2020-04-13 20:58 ` Rich Freeman
@ 2020-04-14  3:32   ` tuxic
  2020-04-14 12:51     ` Rich Freeman
  0 siblings, 1 reply; 22+ messages in thread
From: tuxic @ 2020-04-14 3:32 UTC (permalink / raw)
To: gentoo-user

On 04/13 04:58, Rich Freeman wrote:
> On Mon, Apr 13, 2020 at 4:34 PM antlists <antlists@youngman.org.uk> wrote:
> >
> > aiui, the spec says you can send a command "trim 1GB starting at block
> > X". Snag is, the linux block size of 4KB means that it gets split into
> > loads of trim commands, which then clogs up all the buffers ...
>
> Hmm, found the ATA spec at:
> http://nevar.pl/pliki/ATA8-ACS-3.pdf
>
> In particular page 76 which outlines the LBA addressing for TRIM. It
> looks like up to 64k ranges can be sent per trim, and each range can
> cover up to 64k blocks. So that is 2^32 blocks per trim command, or
> 2TiB of data per TRIM if the blocks are 512 bytes (which I'm guessing
> is the case for ATA but I didn't check). The command itself would be
> half a megabyte since each range is 64 bits.
>
> But if the kernel chops them up as you say that will certainly add
> overhead. The drive controller itself is probably the bigger
> bottleneck unless it is designed to do fast TRIMs.
>
> --
> Rich

Hi all,

thanks **a lot** for all this great information! :)

Since I have an NVMe drive in an M.2 socket, I would be interested to
know at what level/stage the data take a different path than with the
classical SATA SSDs.

Is this just "protocol", or is there something different?

On the internet I read that the io-scheduler is chosen differently by
the kernel if an NVMe drive is detected, for example. (I think the
io-scheduler has nothing to do with the fstrim operation itself -- it is
only meant as an example...)

I think, due to the corona lockdown, I will have to fstrim my hair
myself.... :) 8)

Cheers!
Meino

^ permalink raw reply	[flat|nested] 22+ messages in thread
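The scheduler part at least is easy to check from userspace: the active
scheduler is the bracketed entry in sysfs, and blk-mq NVMe devices
typically default to "none" while SATA disks get something like
mq-deadline or bfq. Device names are examples:

# cat /sys/block/nvme0n1/queue/scheduler
# cat /sys/block/sda/queue/scheduler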
* Re: [gentoo-user] Understanding fstrim...
  2020-04-14 3:32 ` tuxic
@ 2020-04-14 12:51 ` Rich Freeman
  2020-04-14 14:26 ` Wols Lists
  0 siblings, 1 reply; 22+ messages in thread
From: Rich Freeman @ 2020-04-14 12:51 UTC (permalink / raw)
To: gentoo-user

On Mon, Apr 13, 2020 at 11:32 PM <tuxic@posteo.de> wrote:
>
> Since I have an NVMe drive on an M.2 socket, I would be interested to
> know at what level/stage the data take a different path than with
> classical SATA SSDs.
>
> Is this just "protocol", or is there something different?

NVMe involves hardware, protocol, and of course software changes driven
by these.

First, a disclaimer: I am by no means an expert in storage transport
protocols/etc, and obviously there are a ton of standards so that any
random drive works with any random motherboard/etc. If I missed
something or have any details wrong please let me know.

From the days of IDE to pre-NVMe on PC the basic model was that the CPU
would talk to a host bus adapter (HBA), which would in turn talk to the
drive controller. The HBA was usually on the motherboard, but of course
it could be in an expansion slot. The CPU talked to the HBA using the
bus standards of the day (ISA/PCI/PCIe/etc), and this was of course a
high-speed bus designed to work with all kinds of stuff. AHCI is the
latest generation of protocols for communication between the CPU and
the HBA, so that any HBA can work with any OS/etc using generic
drivers. The protocol was designed with spinning hard drives in mind
and has some limitations with SSDs.

The HBA would talk to the drive over SCSI/SAS/SATA/PATA, and there are
a bunch of protocols designed for this. Again, they were designed in
the era of hard drives and have some limitations.

The concept of NVMe is to ditch the HBA and stick the drive directly on
the PCIe bus, which is much faster, and to streamline the protocols.
This is a little analogous to the shift to IDE from the old days of
separate drive controllers - cutting out a layer of interface hardware.
Early NVMe drives just had their own protocols, but a standard was
created so that, just as with AHCI, the OS can use a single driver for
any drive vendor.

At the hardware level the big change is that NVMe just uses the PCIe
bus. That M.2 adapter has a different form factor than a regular PCIe
slot, and I didn't look at the whole pinout, but you probably could
just make a dumb adapter to plug a drive right into a PCIe slot, as
electrically I think they're just PCIe cards. I believe they have to be
PCIe v3+ and typically have 4 lanes, which is a lot of bandwidth. And
of course it is the same interface used for NICs/graphics/etc, so it is
pretty low latency and can support hardware interrupts and all that
stuff. It is pretty common for motherboards to share an M.2 slot with a
PCIe slot so that you can use one or the other but not both for this
reason - the same lanes are in use for both.

The wikipedia article has a good comparison of the protocol-level
changes. For the most part it mainly involves making things far more
parallel. With ATA/AHCI you'd have one queue of commands that was only
a few instructions deep. With NVMe you can have thousands of queues
with thousands of commands in each, and 2048 different hardware
interrupts for the drive to signal back when one command vs another has
completed (although I'm really curious whether anybody takes any real
advantage of that - unless you have different drivers trying to use the
same drive in parallel it seems like MSI-X isn't saving much here -
maybe if you had a really well-trusted VM or something, or the command
set has some way to segment the drive virtually...). This basically
allows billions of operations to be in progress at any given time, so
there is much less of a bottleneck in the protocol/interface itself.
NVMe is all about IOPS.

I don't know all the gory details, but I wouldn't be surprised if, once
you get past this, many of the commands themselves are much the same,
just to keep things simple. Or maybe they just rewrote it all from
scratch - I didn't look into it and would be curious to hear from
somebody who has. Obviously the concepts of read/write/trim/etc are
going to apply regardless of interface.

--
Rich

^ permalink raw reply	[flat|nested] 22+ messages in thread
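Some of this is visible from a running system. A small sketch - the PCI
address 01:00.0 and the device name nvme0n1 are examples, not fixed
values, and the lspci capability dump may need root:

    # find the NVMe controller on the PCIe bus, then show its negotiated
    # link (e.g. "Speed 8GT/s, Width x4" for a PCIe 3.0 x4 drive)
    lspci | grep -i 'non-volatile'
    lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'

    # count the nvme interrupt vectors and the hardware queues the
    # block layer exposes for the drive
    grep -c nvme /proc/interrupts
    ls /sys/block/nvme0n1/mq | wc -l

Typically you'll see roughly one queue/interrupt pair per CPU - far
fewer than the protocol's maximum, but still much more parallelism than
a single AHCI command queue.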
* Re: [gentoo-user] Understanding fstrim...
  2020-04-14 12:51 ` Rich Freeman
@ 2020-04-14 14:26 ` Wols Lists
  2020-04-14 14:51 ` Rich Freeman
  0 siblings, 1 reply; 22+ messages in thread
From: Wols Lists @ 2020-04-14 14:26 UTC (permalink / raw)
To: gentoo-user

On 14/04/20 13:51, Rich Freeman wrote:
> I believe they have to be PCIe v3+ and typically have 4 lanes, which
> is a lot of bandwidth.

My new mobo - the manual says if I put an nvme drive in - I think it's
the 2nd nvme slot - it disables the 2nd graphics card slot :-(

Seeing as I need two graphics cards to double-head my system, that
means I can't use two nvmes :-(

But using the 1st nvme slot disables a sata slot, which buggers my raid
up ... :-(

Oh well. That's life :-(

Cheers,
Wol

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim...
  2020-04-14 14:26 ` Wols Lists
@ 2020-04-14 14:51 ` Rich Freeman
  0 siblings, 0 replies; 22+ messages in thread
From: Rich Freeman @ 2020-04-14 14:51 UTC (permalink / raw)
To: gentoo-user

On Tue, Apr 14, 2020 at 10:26 AM Wols Lists <antlists@youngman.org.uk> wrote:
>
> On 14/04/20 13:51, Rich Freeman wrote:
> > I believe they have to be PCIe v3+ and typically have 4 lanes, which
> > is a lot of bandwidth.
>
> My new mobo - the manual says if I put an nvme drive in - I think it's
> the 2nd nvme slot - it disables the 2nd graphics card slot :-(
>

First, there is no such thing as an "nvme slot". You're probably
describing an M.2 slot. This matters, as I'll get to later...

As I mentioned, many motherboards share a PCIe slot with an M.2 slot.
The CPU + chipset only has so many PCIe lanes. So unless they aren't
already using them all for expansion slots, they have to double up. By
doubling up they can basically stick more x2/4/8 PCIe slots into the
motherboard than they could if they completely dedicated them. Or they
could let that second GPU talk directly to the CPU vs having to go
through the chipset (I think - I'm not really an expert on PCIe), and
let the NVMe talk directly to the CPU if you aren't using that second
GPU.

> But using the 1st nvme slot disables a sata slot, which buggers my
> raid up ... :-(
>

While that might be an M.2 slot, it probably isn't an "nvme slot". M.2
can be used for either SATA or PCIe. Some motherboards have one, the
other, or both. And M.2 drives can be either, so you need to be sure
you're using the right one. If you get the wrong kind of drive it might
not work, or it might end up being a SATA drive when you intended to
use an NVMe drive. A SATA drive will have none of the benefits of NVMe
and will be functionally no different from a regular 2.5" SSD that
plugs into a SATA cable - it is just a different form factor.

It sounds like they doubled up a PCIe port on the one M.2 connector,
and they doubled up a SATA port on the other M.2 connector.

It isn't necessarily a bad thing, but obviously you need to make
tradeoffs. If you want a motherboard with a dozen x16 PCIe's, 5 M.2's,
14 SATA ports, and 10 USB3's on it, there is no reason that shouldn't
be possible, but don't expect to find it in the $60 bargain bin, and
don't expect all those lanes to talk directly to the CPU unless you're
using EPYC or something else high-end. :)

--
Rich

^ permalink raw reply	[flat|nested] 22+ messages in thread
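If you're unsure which kind of M.2 drive a system actually ended up
with, lsblk can show the transport; a quick check:

    # TRAN shows how each disk is attached: an M.2 SATA drive reports
    # "sata", a true NVMe drive reports "nvme"
    lsblk -d -o NAME,TRAN,MODEL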
* [gentoo-user] Re: Understanding fstrim...
  2020-04-13 11:55 ` Michael
  2020-04-13 12:18 ` Rich Freeman
@ 2020-04-13 15:48 ` Holger Hoffstätte
  1 sibling, 0 replies; 22+ messages in thread
From: Holger Hoffstätte @ 2020-04-13 15:48 UTC (permalink / raw)
To: gentoo-user, Michael

On 4/13/20 1:55 PM, Michael wrote:
> I have noticed when prolonged fstrim takes place on an old SSD drive
> of mine it becomes unresponsive. As Rich said this is not because data
> is being physically deleted, only a flag is switched from 1 to 0 to
> indicate its availability for further writes.

This is all true, and while the exact behaviour depends on the drive
model, a common problem is that the drive's request queue is flooded
during a whole-drive fstrim, and that can lead to unresponsiveness -
especially when you're using the deadline (or mq-deadline these days)
scheduler, which by nature has a tendency to starve readers when
continuous long chains of writes happen. One can tune the read/write
ratio to be more balanced.

A better way to help the block layer out is to switch to bfq on these
(presumably lower-endish) devices, either permanently or - my approach
- just during scheduled fstrim. I've written a script to do just that
and have been using it on all my machines (server, workstation, laptop)
for a long time now. Switching to a fair I/O scheduler during scheduled
fstrim *completely* fixed the system-wide lag for me.

I suggest you try bfq for fstrim; if you'd like the script just let me
know.

cheers
Holger

^ permalink raw reply	[flat|nested] 22+ messages in thread
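Holger's script itself isn't posted in the thread, but a rough sketch
of the approach he describes could look like this (not his actual
script; the device name is an example, and it assumes bfq is available
in your kernel):

    #!/bin/sh
    # temporarily switch one disk to the bfq scheduler, trim all mounted
    # filesystems that support discard, then restore the previous
    # scheduler (run as root)
    dev=sda                                   # example device
    sched=/sys/block/$dev/queue/scheduler

    old=$(sed -n 's/.*\[\(.*\)\].*/\1/p' "$sched")   # currently active scheduler
    echo bfq > "$sched"
    fstrim --all --verbose
    echo "$old" > "$sched"

Calling something like this from the weekly cron job or systemd timer
instead of plain fstrim keeps the rest of the setup unchanged.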
Thread overview: 22+ messages (newest: 2020-04-14 14:51 UTC)
2020-04-13 5:32 [gentoo-user] Understanding fstrim tuxic
2020-04-13 9:22 ` Andrea Conti
2020-04-13 9:49 ` Neil Bothwick
2020-04-13 11:01 ` Andrea Conti
2020-04-13 10:06 ` Michael
2020-04-13 11:00 ` tuxic
2020-04-13 14:00 ` David Haller
2020-04-13 10:11 ` Peter Humphrey
2020-04-13 11:39 ` Rich Freeman
2020-04-13 11:55 ` Michael
2020-04-13 12:18 ` Rich Freeman
2020-04-13 13:18 ` tuxic
2020-04-13 14:27 ` Rich Freeman
2020-04-13 15:41 ` David Haller
2020-04-13 16:05 ` Rich Freeman
2020-04-13 20:34 ` antlists
2020-04-13 20:58 ` Rich Freeman
2020-04-14 3:32 ` tuxic
2020-04-14 12:51 ` Rich Freeman
2020-04-14 14:26 ` Wols Lists
2020-04-14 14:51 ` Rich Freeman
2020-04-13 15:48 ` [gentoo-user] " Holger Hoffstätte