* [gentoo-user] Understanding fstrim...
@ 2020-04-13  5:32 tuxic
  2020-04-13  9:22 ` Andrea Conti
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: tuxic @ 2020-04-13 5:32 UTC (permalink / raw)
To: Gentoo

Hi,

From the list I have already learned that most of my concerns regarding
the lifetime of the SSD, and the maintenance needed to prolong it, are
unfounded. Nonetheless I am interested in the technique as such.

My SSD (NVMe/M.2) is ext4-formatted, and I found articles on the
internet saying that it is not a good idea to activate the "discard"
option at mount time, nor to do a fstrim at each file deletion not
triggered by a cron job.

Since there seems to be a "not so good point in time" to do a fstrim,
I think there must also be a point in time when it is quite right to
fstrim my SSD.

fstrim clears blocks which are currently not in use and whose contents
are != 0.

The more unused blocks there are with contents != 0, the fewer blocks
the wear-leveling algorithm can use for its purpose.

That leads to the conclusion: fstrim as often as possible, to keep the
count of empty blocks as high as possible.

BUT: Clearing blocks is an action which involves writes to the cells of
the SSD.

Which is not that nice.

Or do a fstrim only at the moment when there is no usable block left.
But then the wear-leveling algorithm is already at its limits.

Which is not that nice either.

The truth - as so often - is somewhere in between.

Is it possible to get information from the SSD about how many blocks
are in the state "has contents" and "is unused", and how many blocks
are in the state "has *no* contents" and "is unused"?

Assuming this information is available: Is it possible to find the
sweet spot for when to fstrim the SSD?

Cheers!
Meino

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 5:32 [gentoo-user] Understanding fstrim tuxic @ 2020-04-13 9:22 ` Andrea Conti 2020-04-13 9:49 ` Neil Bothwick 2020-04-13 10:06 ` Michael ` (2 subsequent siblings) 3 siblings, 1 reply; 22+ messages in thread From: Andrea Conti @ 2020-04-13 9:22 UTC (permalink / raw To: gentoo-user > My SSD (NVme/M2) is ext4 formatted and I found articles on the > internet, that it is neither a good idea to activate the "discard" > option at mount time nor to do a fstrim either at each file deletion > no triggered by a cron job. I have no desire to enter the whole performance/lifetime debate; I'd just like to point out that one very real consequence of using fstrim (or mounting with the discard option) that I haven't seen mentioned often is that it makes the contents of any removed files Truly Gone(tm). No more extundelete to save your back when you mistakenly rm something that you haven't backed up for a while... andrea ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 9:22 ` Andrea Conti @ 2020-04-13 9:49 ` Neil Bothwick 2020-04-13 11:01 ` Andrea Conti 0 siblings, 1 reply; 22+ messages in thread From: Neil Bothwick @ 2020-04-13 9:49 UTC (permalink / raw To: gentoo-user [-- Attachment #1: Type: text/plain, Size: 617 bytes --] On Mon, 13 Apr 2020 11:22:47 +0200, Andrea Conti wrote: > I have no desire to enter the whole performance/lifetime debate; I'd > just like to point out that one very real consequence of using fstrim > (or mounting with the discard option) that I haven't seen mentioned > often is that it makes the contents of any removed files Truly Gone(tm). > > No more extundelete to save your back when you mistakenly rm something > that you haven't backed up for a while... Have your backup cron job call fstrim once everything is safely backed up? -- Neil Bothwick Life's a cache, and then you flush... [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
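A minimal sketch of that ordering, assuming a cron-driven rsync backup
(the paths and rsync options are placeholders, not anything from the
thread): back up first, and only trim if the backup finished cleanly, so
that anything you might still want to undelete has already been copied
off the SSD.

#!/bin/bash
# Hypothetical nightly job: trim only after a successful backup.
rsync -aH --delete /home/ /mnt/backup/home/ \
  && fstrim -v /home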
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 9:49 ` Neil Bothwick @ 2020-04-13 11:01 ` Andrea Conti 0 siblings, 0 replies; 22+ messages in thread From: Andrea Conti @ 2020-04-13 11:01 UTC (permalink / raw To: gentoo-user > Have your backup cron job call fstrim once everything is safely backed up? Well, yes, but that's beside the point. What I really wanted to stress was that mounting an SSD-backed filesystem with "discard" has effects on the ability to recover deleted data. Normally it's not a problem, but shit happens -- and when it happens on such a filesystem don't waste time with recovery tools, as all you'll get back are files full of 0xFFs. andrea ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim...
  2020-04-13  5:32 [gentoo-user] Understanding fstrim tuxic
  2020-04-13  9:22 ` Andrea Conti
@ 2020-04-13 10:06 ` Michael
  2020-04-13 11:00   ` tuxic
  2020-04-13 10:11 ` Peter Humphrey
  2020-04-13 11:39 ` Rich Freeman
  3 siblings, 1 reply; 22+ messages in thread
From: Michael @ 2020-04-13 10:06 UTC (permalink / raw)
To: Gentoo

[-- Attachment #1: Type: text/plain, Size: 3124 bytes --]

On Monday, 13 April 2020 06:32:37 BST tuxic@posteo.de wrote:
> Hi,
>
> From the list I have already learned that most of my concerns regarding
> the lifetime of the SSD, and the maintenance needed to prolong it, are
> unfounded.

Your concerns about SSD longevity are probably unfounded, but keep
up-to-date backups just in case. ;-)

> Nonetheless I am interested in the technique as such.
>
> My SSD (NVMe/M.2) is ext4-formatted, and I found articles on the
> internet saying that it is not a good idea to activate the "discard"
> option at mount time, nor to do a fstrim at each file deletion not
> triggered by a cron job.

Besides what the interwebs say about fstrim, the man page provides good
advice. It recommends running a cron job once a week for most desktop
and server implementations.

> Since there seems to be a "not so good point in time" to do a fstrim,
> I think there must also be a point in time when it is quite right to
> fstrim my SSD.
>
> fstrim clears blocks which are currently not in use and whose contents
> are != 0.
>
> The more unused blocks there are with contents != 0, the fewer blocks
> the wear-leveling algorithm can use for its purpose.

The wear-levelling mechanism uses the HPA as far as I know, although you
can always overprovision it.[1]

> That leads to the conclusion: fstrim as often as possible, to keep the
> count of empty blocks as high as possible.

Not really. Why would you need the count of empty blocks to be as high
as possible, unless you are about to write some mammoth file and *need*
every bit of available space on this disk/partition?

> BUT: Clearing blocks is an action which involves writes to the cells of
> the SSD.
>
> Which is not that nice.

It's OK, as long as you are not over-writing cells which do not need to
be overwritten. Cells with deleted data will be overwritten at some
point.

> Or do a fstrim only at the moment when there is no usable block left.

Why leave it to the last moment and incur a performance penalty while
waiting for fstrim to complete?

> But then the wear-leveling algorithm is already at its limits.
>
> Which is not that nice either.
>
> The truth - as so often - is somewhere in between.
>
> Is it possible to get information from the SSD about how many blocks
> are in the state "has contents" and "is unused", and how many blocks
> are in the state "has *no* contents" and "is unused"?
>
> Assuming this information is available: Is it possible to find the
> sweet spot for when to fstrim the SSD?

I humbly suggest you may be over-thinking something that a cron job
running fstrim once a week, or once a month, or twice a month would take
care of without you knowing or worrying about.
Nevertheless, if the usage of your disk/partitions is variable and one week you may fill it up with deleted data, while for the rest of the month you won't even touch it, there's SSDcronTRIM, a script I've been using for a while.[2] [1] https://www.thomas-krenn.com/en/wiki/SSD_Over-provisioning_using_hdparm [2] https://github.com/chmatse/SSDcronTRIM [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
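Before scheduling anything, it is also worth confirming that the kernel
actually sees discard support on the device; if lsblk reports zero for
DISC-GRAN and DISC-MAX, TRIM requests never reach the drive. The device
name below is only an example:

# lsblk --discard /dev/nvme0n1      # non-zero DISC-GRAN/DISC-MAX => TRIM supported
# cat /sys/block/nvme0n1/queue/discard_max_bytes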
* Re: [gentoo-user] Understanding fstrim...
  2020-04-13 10:06 ` Michael
@ 2020-04-13 11:00   ` tuxic
  2020-04-13 14:00     ` David Haller
  0 siblings, 1 reply; 22+ messages in thread
From: tuxic @ 2020-04-13 11:00 UTC (permalink / raw)
To: gentoo-user

Hi Michael,

thank you for replying to my questions! :)

On 04/13 11:06, Michael wrote:
> On Monday, 13 April 2020 06:32:37 BST tuxic@posteo.de wrote:
> > Hi,
> >
> > From the list I have already learned that most of my concerns regarding
> > the lifetime of the SSD, and the maintenance needed to prolong it, are
> > unfounded.
>
> Your concerns about SSD longevity are probably unfounded, but keep
> up-to-date backups just in case. ;-)

...of course! :)

My questions are driven more by curiosity than by anxiety...

> > Nonetheless I am interested in the technique as such.
> >
> > My SSD (NVMe/M.2) is ext4-formatted, and I found articles on the
> > internet saying that it is not a good idea to activate the "discard"
> > option at mount time, nor to do a fstrim at each file deletion not
> > triggered by a cron job.
>
> Besides what the interwebs say about fstrim, the man page provides good
> advice. It recommends running a cron job once a week for most desktop
> and server implementations.

...but it neither explains why to do so nor does it explain the
technical background.

For example it says:
"For most desktop and server systems a sufficient trimming frequency is
once a week."

...but why is it OK to do so? Are all PCs made equal? Are all use cases
equal? It does not even distinguish between SSD/SATA and SSD/NVMe (M.2
in my case).

These are the points where my curiosity kicks in and I start to ask
questions. :)

> > Since there seems to be a "not so good point in time" to do a fstrim,
> > I think there must also be a point in time when it is quite right to
> > fstrim my SSD.
> >
> > fstrim clears blocks which are currently not in use and whose contents
> > are != 0.
> >
> > The more unused blocks there are with contents != 0, the fewer blocks
> > the wear-leveling algorithm can use for its purpose.
>
> The wear-levelling mechanism uses the HPA as far as I know, although you
> can always overprovision it.[1]

For example: Take an SSD with 300 GB of user-usable space. To
overprovision the device, the user decides to partition only half of the
disk and format it. The rest is left untouched in "nowhere land".

Now the controller has a lot of space to shuffle data around.

Fstrim only works on the mounted part of the SSD. So do the used blocks
in "nowhere land" remain...untrimmed?

Not using all the available space for partitions is a hint I found
online...and that is what led me to the question above...

If what I read online is wrong, my assumptions are wrong too...which
isn't reassuring either.

> > That leads to the conclusion: fstrim as often as possible, to keep the
> > count of empty blocks as high as possible.
>
> Not really. Why would you need the count of empty blocks to be as high
> as possible,

Unused blocks with data cannot be used for wear leveling.

Suppose you have a total of 100 blocks: 50 blocks are used, 25 are
unused and empty, and 25 are unused but filled with former data. In this
case only 25 blocks are available to spread the next write operation
over. After fstrim, 50 blocks would be available again, and the same
amount of writes could now be spread over 50 blocks.

At least that is what I read online...
> unless you are about to write some mammoth file and *need* every bit of
> available space on this disk/partition?
>
> > BUT: Clearing blocks is an action which involves writes to the cells of
> > the SSD.
> >
> > Which is not that nice.
>
> It's OK, as long as you are not over-writing cells which do not need to
> be overwritten. Cells with deleted data will be overwritten at some
> point.
>
> > Or do a fstrim only at the moment when there is no usable block left.
>
> Why leave it to the last moment and incur a performance penalty while
> waiting for fstrim to complete?

Performance is not my concern (at least at the moment ;) ). I am trying
to fully understand the mechanisms here, since what I read online is not
free of contradictions...

> > But then the wear-leveling algorithm is already at its limits.
> >
> > Which is not that nice either.
> >
> > The truth - as so often - is somewhere in between.
> >
> > Is it possible to get information from the SSD about how many blocks
> > are in the state "has contents" and "is unused", and how many blocks
> > are in the state "has *no* contents" and "is unused"?
> >
> > Assuming this information is available: Is it possible to find the
> > sweet spot for when to fstrim the SSD?
>
> I humbly suggest you may be over-thinking something that a cron job
> running fstrim once a week, or once a month, or twice a month would take
> care of without you knowing or worrying about.

Overthinking technical problems is a vital part of my profession and
exactly what I am asked to do. I cannot put that behaviour down so
easily. :)

In my experience there aren't too many questions, Michael; there is often
only a lack of related answers.

> Nevertheless, if the usage of your disk/partitions is variable and one
> week you may fill it up with deleted data, while for the rest of the
> month you won't even touch it, there's SSDcronTRIM, a script I've been
> using for a while.[2]
>
> [1] https://www.thomas-krenn.com/en/wiki/SSD_Over-provisioning_using_hdparm
> [2] https://github.com/chmatse/SSDcronTRIM

Cheers!
Meino

^ permalink raw reply	[flat|nested] 22+ messages in thread
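On the "nowhere land" question: fstrim(8) only operates on a mounted
filesystem, so it never touches unpartitioned space. If that area was
never written to, the drive already treats it as free and there is
nothing to trim. If it was written to in a previous life (say it used to
hold a partition), blkdiscard from util-linux can discard a raw byte
range directly. The offsets below are made-up placeholders, and the
command is destructive if pointed at live data, which is why it is shown
commented out:

# blkdiscard discards whatever range you give it, with no filesystem safety net.
#blkdiscard --offset $((160*2**30)) --length $((80*2**30)) /dev/nvme0n1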
* Re: [gentoo-user] Understanding fstrim...
  2020-04-13 11:00 ` tuxic
@ 2020-04-13 14:00   ` David Haller
  0 siblings, 0 replies; 22+ messages in thread
From: David Haller @ 2020-04-13 14:00 UTC (permalink / raw)
To: gentoo-user

Hello,

On Mon, 13 Apr 2020, tuxic@posteo.de wrote:
>On 04/13 11:06, Michael wrote:
>> On Monday, 13 April 2020 06:32:37 BST tuxic@posteo.de wrote:
[..]
>My questions are driven more by curiosity than by anxiety...
[..]
>For example [the fstrim manpage] says:
>"For most desktop and server systems a sufficient trimming frequency is
>once a week."
>
>...but why is it OK to do so? Are all PCs made equal? Are all use cases
>equal? It does not even distinguish between SSD/SATA and SSD/NVMe (M.2
>in my case).

Observe your use pattern a bit and use 'fstrim -v' when you think it's
worth it, as it basically boils down to how much you delete *when*.
If you e.g.:

- constantly use the drive as a fast cache for video-editing etc.,
  writing large files to the drive and later deleting them again
  -> run fstrim daily or even mount with the 'discard' option

- write/delete somewhat regularly, e.g. as a system drive, running
  updates or emerge @world (esp. if you build on the SSD) e.g. weekly --
  these are effectively a write operation and a bunch of deletions. Or
  if you do whatever other deletions somewhat regularly
  -> run fstrim after each one or three such deletions, e.g. via a
  weekly cronjob

- mostly write (if anything), rarely delete anything
  -> run fstrim manually a few days after $some deletions have
  accumulated, or at any other convenient time you can remember and are
  sure all deleted files can be gone, be it bi-weekly, monthly,
  tri-monthly, yearly, completely irregularly, whenever ;)

Choose anything in the range that fits _your_ use-pattern best,
considering capacity, free space (no matter if on a partition or
unallocated) and what size was trimmed when running 'fstrim -v'...

Running that weekly (I'd suggest bi-weekly) 'fstrim' cronjob is not a
bad suggestion as a default I guess, but observe your use and choose to
deviate or not :)

My gut says to run fstrim if:

- it'd trim more than 5% (-ish) of capacity
- it'd trim more than 20% (-ish) of the remaining "free" space
  (including unallocated)
- it'd trim more than $n GiB (where $n may be anything ;)

whichever comes first (and the two latter can only be determined by
observation). No need to run fstrim after deleting just 1 kiB. Or 1 MiB.
Not that me lazybag adheres to that, read on if you will... ;)

FWIW: I run fstrim a few times a year when I think of it and guesstimate
I did delete quite a bit in the meantime (much like I run fsck ;) ...
This usually trims a few GiB on my 128G drive:

# fdisk -u -l /dev/sda
Disk /dev/sda: 119.2 GiB, 128035676160 bytes, 250069680 sectors
Disk model: SAMSUNG SSD 830
[..]
Device     Boot     Start       End   Sectors Size Id Type
/dev/sda1            2048 109053951 109051904  52G 83 Linux
/dev/sda2       109053952 218105855 109051904  52G 83 Linux

(I did leave ~15GiB unpartitioned, and was too lazy to rectify that yet;
at the time I partitioned, in 2012, overprovisioning was still a good
thing for many (cheaper?) SSDs, and 'duh intarweb' was quite a bit worse
than today regarding the problem)...

So while I'm about it, I guess it's time to run fstrim (for the first
time this year IIRC) ...
# fstrim -v /sda1 ; fstrim -v /sda2   ## mountpoints mangled
/sda1: 7563407360 bytes were trimmed
/sda2: 6842478592 bytes were trimmed
# calc 'x=config("display",1); 7563407360/2^30; 6842478592/2^30'
        ~7.0
        ~6.4

So, my typical few GiB, or about 12.8% of disk capacity (summed), were
trimmed (oddly enough, it's always been in this 4-8 GiB/partition
range). I probably should run fstrim a bit more often though, but then
again I've still got those unallocated 15G, so I guess I'm fine. And
that's with quite a large Gentoo system on /dev/sda2 and all its at
times large (like libreoffice, firefox, seamonkey, icedtea, etc.)
updates:

# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        52G   45G  3.8G  93% /

PORTAGE_TMPDIR, PORTDIR (and distfiles and packages) are on other HDDs
though, so building stuff does not affect the SSD, only the actual
install (merge) and whatever else. But I've got /var/log/ on the SSD on
both systems (sda1/sda2).

While I'm at it:

# smartctl -A /dev/sda
[..]
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH [..] RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    [..] 0
  9 Power_On_Hours          0x0032   091   091   000    [..] 43261
 12 Power_Cycle_Count       0x0032   097   097   000    [..] 2617
177 Wear_Leveling_Count     0x0013   093   093   000    [..] 247
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    [..] 0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    [..] 0
182 Erase_Fail_Count_Total  0x0032   100   100   010    [..] 0
183 Runtime_Bad_Block       0x0013   100   100   010    [..] 0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    [..] 0
190 Airflow_Temperature_Cel 0x0032   067   050   000    [..] 33
195 ECC_Error_Rate          0x001a   200   200   000    [..] 0
199 CRC_Error_Count         0x003e   253   253   000    [..] 0
235 POR_Recovery_Count      0x0012   099   099   000    [..] 16
241 Total_LBAs_Written      0x0032   099   099   000    [..] 12916765928

[ '[..]': none show WHEN_FAILED other than '-' and the rest is standard]

Wow, I have almost exactly 6 TBW, or ~52 "Drive writes" or "Capacity
writes", on this puny 128GB SSD:

# calc 'printf("%0.2d TiB written\n%0.1d drive writes\n", 12916765928/2/2^30, 12916765928/250069680);'
~6.01 TiB written
~51.7 drive writes

And I forgot I've been running this drive for this long already (not
that I've been running it 24/7, by quite a bit, but since July 2012, or
about 15/7-ish):

$ dateduration 43261h   ### [1]
4 years 11 months 7 days 13 hours 0 minutes 0 seconds

HTH,
-dnh

[1]
==== ~/bin/dateduration ====
#!/bin/bash
F='%Y years %m months %d days %H hours %M minutes %S seconds'
now=$(date +%s)
datediff -f "$F" $(dateadd -i '%s' "$now" +0s) $(dateadd -i '%s' "$now" $1)
====

If anyone knows a better way... ;)

-- 
printk("; crashing the system because you wanted it\n");
	linux-2.6.6/fs/hpfs/super.c

^ permalink raw reply	[flat|nested] 22+ messages in thread
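A dateutils-free way to split a power-on-hours figure is plain shell
arithmetic; it uses fixed 365-day years, so it drifts a little from the
calendar-exact output above (a sketch, not a drop-in replacement):

#!/bin/bash
# usage: poh 43261h   ->  "4 years 342 days 13 hours"
h=${1%h}
printf '%d years %d days %d hours\n' \
  $(( h / 8760 )) $(( h % 8760 / 24 )) $(( h % 24 ))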
* Re: [gentoo-user] Understanding fstrim...
  2020-04-13  5:32 [gentoo-user] Understanding fstrim tuxic
  2020-04-13  9:22 ` Andrea Conti
  2020-04-13 10:06 ` Michael
@ 2020-04-13 10:11 ` Peter Humphrey
  2020-04-13 11:39 ` Rich Freeman
  3 siblings, 0 replies; 22+ messages in thread
From: Peter Humphrey @ 2020-04-13 10:11 UTC (permalink / raw)
To: Gentoo

On Monday, 13 April 2020 06:32:37 BST tuxic@posteo.de wrote:

> Assuming this information is available: Is it possible to find the
> sweet spot for when to fstrim the SSD?

This crontab entry is my compromise:

15 3 */2 * * /sbin/fstrim -a

It does assume I'll be elsewhere at 03:15, of course.

-- 
Regards,
Peter.

^ permalink raw reply	[flat|nested] 22+ messages in thread
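On systemd machines much the same job comes prepackaged: util-linux
ships an fstrim.service/fstrim.timer pair that runs a periodic fstrim
over the mounted filesystems weekly, assuming your distro installs the
units:

# systemctl enable --now fstrim.timer
# systemctl list-timers fstrim.timer    # shows when the next run is due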
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 5:32 [gentoo-user] Understanding fstrim tuxic ` (2 preceding siblings ...) 2020-04-13 10:11 ` Peter Humphrey @ 2020-04-13 11:39 ` Rich Freeman 2020-04-13 11:55 ` Michael 3 siblings, 1 reply; 22+ messages in thread From: Rich Freeman @ 2020-04-13 11:39 UTC (permalink / raw To: gentoo-user On Mon, Apr 13, 2020 at 1:32 AM <tuxic@posteo.de> wrote: > > fstrim clears blocks, which currently are not in use and which > contents is != 0. >... > BUT: Clearing blocks is an action, which includes writes to the cells of > the SSD. I see a whole bunch of discussion, but it seems like many here don't actually understand what fstrim actually does. It doesn't "clear" anything, and it doesn't care what the contents of a block are. It doesn't write to the cells of the SSD per se. It issues the TRIM command to the drive for any unused blocks (or a subset of them if you use non-default options). It doesn't care what the contents of the blocks are when it does so - it shouldn't even try to read the blocks to know what their content is. Trimming a block won't clear it at all. It doesn't write to the cells of the SSD either - at least not the ones being trimmed. It just tells the drive controller that the blocks are no longer in use. Now, the drive controller needs to keep track of which blocks are in use (which it does whether you use fstrim or not), and that data is probably stored in some kind of flash so that it is persistent, but presumably that is managed in such a manner that it is unlikely to fail before the rest of the drive fails. On a well-implemented drive trimming actually REDUCES writes. When you trim a block the drive controller will stop trying to preserve its contents. If you don't trim it, then the controller will preserve its contents. Preserving the content of unused blocks necessarily involves more writing to the drive than just treating as zero-fillled/etc. Now, where you're probably getting at the concept of clearing and zeroing data is that if you try to read a trimmed block the drive controller probably won't even bother to read the block from the ssd and will just return zeros. Those zeros were never written to flash - they're just like a zero-filled file in a filesystem. If you write a bazillion zeros to a file on ext4 it will just record in the filesystem data that you have a bunch of blocks of zero and it won't allocate any actual space on disk - reading that file requires no reading of the actual disk beyond the metadata because they're not stored in actual extents. Indeed blocks are more of a virtual thing on an SSD (or even hard drive these days), so if a logical block isn't mapped to a physical storage area there isn't anything to read in the first place. However, when you trimmed the file the drive didn't go find some area of flash and fill it with zeros. It just marked it as unused or removed its logical mapping to physical storage. In theory you should be able to use discard or trim the filesystem every 5 minutes with no negative effects at all. In theory. However, many controllers (especially old ones) aren't well-implemented and may not handle this efficiently. A trim operation is still an operation the controller has to deal with, and so deferring it to a time when the drive is idle could improve performance, especially for drives that don't do a lot of writes. If a drive has a really lousy controller then trims might cause its stupid firmware to do stupid things. 
However, this isn't really anything intrinsic to the concept of trimming. Fundamentally trimming is just giving the drive more information about the importance of the data it is storing. Just about any filesystem benefits from having more information about what it is storing if it is well-implemented. In a perfect world we'd just enable discard on our mounts and be done with it. I'd probably just look up the recommendations for your particular drive around trimming and follow those. Somebody may have benchmarked it to determine how brain-dead it is. If you bought a more name-brand SSD you're probably more likely to benefit from more frequent trimming. I'm personally using zfs which didn't support trim/discard until very recently, and I'm not on 0.8 yet, so for me it is a bit of a moot point. I plan to enable it once I can do so. -- Rich ^ permalink raw reply [flat|nested] 22+ messages in thread
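For completeness, the "perfect world" variant is just a mount option. A
hypothetical ext4 line, where the device, mount point and the other
options are placeholders for whatever your system actually uses:

# /etc/fstab -- "discard" turns on continuous/online discard for ext4
/dev/nvme0n1p2   /   ext4   defaults,noatime,discard   0 1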
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 11:39 ` Rich Freeman @ 2020-04-13 11:55 ` Michael 2020-04-13 12:18 ` Rich Freeman 2020-04-13 15:48 ` [gentoo-user] " Holger Hoffstätte 0 siblings, 2 replies; 22+ messages in thread From: Michael @ 2020-04-13 11:55 UTC (permalink / raw To: gentoo-user [-- Attachment #1: Type: text/plain, Size: 4715 bytes --] On Monday, 13 April 2020 12:39:11 BST Rich Freeman wrote: > On Mon, Apr 13, 2020 at 1:32 AM <tuxic@posteo.de> wrote: > > fstrim clears blocks, which currently are not in use and which > > contents is != 0. > > > >... > > > > BUT: Clearing blocks is an action, which includes writes to the cells of > > the SSD. > > I see a whole bunch of discussion, but it seems like many here don't > actually understand what fstrim actually does. > > It doesn't "clear" anything, and it doesn't care what the contents of > a block are. It doesn't write to the cells of the SSD per se. > > It issues the TRIM command to the drive for any unused blocks (or a > subset of them if you use non-default options). It doesn't care what > the contents of the blocks are when it does so - it shouldn't even try > to read the blocks to know what their content is. > > Trimming a block won't clear it at all. It doesn't write to the cells > of the SSD either - at least not the ones being trimmed. It just > tells the drive controller that the blocks are no longer in use. > > Now, the drive controller needs to keep track of which blocks are in > use (which it does whether you use fstrim or not), and that data is > probably stored in some kind of flash so that it is persistent, but > presumably that is managed in such a manner that it is unlikely to > fail before the rest of the drive fails. > > On a well-implemented drive trimming actually REDUCES writes. When > you trim a block the drive controller will stop trying to preserve its > contents. If you don't trim it, then the controller will preserve its > contents. Preserving the content of unused blocks necessarily > involves more writing to the drive than just treating as > zero-fillled/etc. > > Now, where you're probably getting at the concept of clearing and > zeroing data is that if you try to read a trimmed block the drive > controller probably won't even bother to read the block from the ssd > and will just return zeros. Those zeros were never written to flash - > they're just like a zero-filled file in a filesystem. If you write a > bazillion zeros to a file on ext4 it will just record in the > filesystem data that you have a bunch of blocks of zero and it won't > allocate any actual space on disk - reading that file requires no > reading of the actual disk beyond the metadata because they're not > stored in actual extents. Indeed blocks are more of a virtual thing > on an SSD (or even hard drive these days), so if a logical block isn't > mapped to a physical storage area there isn't anything to read in the > first place. > > However, when you trimmed the file the drive didn't go find some area > of flash and fill it with zeros. It just marked it as unused or > removed its logical mapping to physical storage. > > In theory you should be able to use discard or trim the filesystem > every 5 minutes with no negative effects at all. In theory. However, > many controllers (especially old ones) aren't well-implemented and may > not handle this efficiently. 
A trim operation is still an operation > the controller has to deal with, and so deferring it to a time when > the drive is idle could improve performance, especially for drives > that don't do a lot of writes. If a drive has a really lousy > controller then trims might cause its stupid firmware to do stupid > things. However, this isn't really anything intrinsic to the concept > of trimming. > > Fundamentally trimming is just giving the drive more information about > the importance of the data it is storing. Just about any filesystem > benefits from having more information about what it is storing if it > is well-implemented. In a perfect world we'd just enable discard on > our mounts and be done with it. > > I'd probably just look up the recommendations for your particular > drive around trimming and follow those. Somebody may have benchmarked > it to determine how brain-dead it is. If you bought a more name-brand > SSD you're probably more likely to benefit from more frequent > trimming. > > I'm personally using zfs which didn't support trim/discard until very > recently, and I'm not on 0.8 yet, so for me it is a bit of a moot > point. I plan to enable it once I can do so. What Rich said, plus: I have noticed when prolonged fstrim takes place on an old SSD drive of mine it becomes unresponsive. As Rich said this is not because data is being physically deleted, only a flag is switched from 1 to 0 to indicate its availability for further writes. As I understand the firmware performs wear-leveling when it needs to in the HPA allocated blocks, rather than waiting for the user/OS to run fstrim to obtain some more 'free' space. [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 11:55 ` Michael @ 2020-04-13 12:18 ` Rich Freeman 2020-04-13 13:18 ` tuxic 2020-04-13 15:48 ` [gentoo-user] " Holger Hoffstätte 1 sibling, 1 reply; 22+ messages in thread From: Rich Freeman @ 2020-04-13 12:18 UTC (permalink / raw To: gentoo-user On Mon, Apr 13, 2020 at 7:55 AM Michael <confabulate@kintzios.com> wrote: > > I have noticed when prolonged fstrim takes place on an old SSD drive of mine > it becomes unresponsive. As Rich said this is not because data is being > physically deleted, only a flag is switched from 1 to 0 to indicate its > availability for further writes. That, and it is going about it in a brain-dead manner. A modern drive is basically a filesystem of sorts. It just uses block numbers instead of filenames, but it can be just as complex underneath. And just as with filesystems there are designs that are really lousy and designs that are really good. And since nobody sees the source code or pays much attention to the hidden implementation details, it is often not designed with your requirements in mind. I suspect a lot of SSDs are using the SSD-equivalent of FAT32 to manage their block remapping. Some simple algorithm that gets the job done but which doesn't perform well/etc. This makes me wonder if there would be a benefit from coming up with a flash block layer of some sort that re-implements this stuff properly. We have stuff like f2fs which does this at the filesystem level. However, this might be going too far as this inevitably competes with the filesystem layer on features/etc. Maybe what we need is something more like lvm for flash. It doesn't try to be a filesystem. It just implements block-level storage mapping one block device to a new block device. It might very well implement a log-based storage layer. It would accept TRIM commands and any other related features. It would then have a physical device translation layer. Maybe it would be aware of different drive models and their idiosyncrasies, so on some drives it might just be a NOOP passthrough and on other drives it implements its own log-based storage with batched trims on large contiguous regions, and so on. Since it isn't a full POSIX filesystem it could be much simpler and just focus on the problem it needs to solve - dealing with brain-dead SSD controllers. -- Rich ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 12:18 ` Rich Freeman @ 2020-04-13 13:18 ` tuxic 2020-04-13 14:27 ` Rich Freeman 0 siblings, 1 reply; 22+ messages in thread From: tuxic @ 2020-04-13 13:18 UTC (permalink / raw To: gentoo-user On 04/13 08:18, Rich Freeman wrote: > On Mon, Apr 13, 2020 at 7:55 AM Michael <confabulate@kintzios.com> wrote: > > > > I have noticed when prolonged fstrim takes place on an old SSD drive of mine > > it becomes unresponsive. As Rich said this is not because data is being > > physically deleted, only a flag is switched from 1 to 0 to indicate its > > availability for further writes. > > That, and it is going about it in a brain-dead manner. > > A modern drive is basically a filesystem of sorts. It just uses block > numbers instead of filenames, but it can be just as complex > underneath. > > And just as with filesystems there are designs that are really lousy > and designs that are really good. And since nobody sees the source > code or pays much attention to the hidden implementation details, it > is often not designed with your requirements in mind. > > I suspect a lot of SSDs are using the SSD-equivalent of FAT32 to > manage their block remapping. Some simple algorithm that gets the job > done but which doesn't perform well/etc. > > This makes me wonder if there would be a benefit from coming up with a > flash block layer of some sort that re-implements this stuff properly. > We have stuff like f2fs which does this at the filesystem level. > However, this might be going too far as this inevitably competes with > the filesystem layer on features/etc. > > Maybe what we need is something more like lvm for flash. It doesn't > try to be a filesystem. It just implements block-level storage > mapping one block device to a new block device. It might very well > implement a log-based storage layer. It would accept TRIM commands > and any other related features. It would then have a physical device > translation layer. Maybe it would be aware of different drive models > and their idiosyncrasies, so on some drives it might just be a NOOP > passthrough and on other drives it implements its own log-based > storage with batched trims on large contiguous regions, and so on. > Since it isn't a full POSIX filesystem it could be much simpler and > just focus on the problem it needs to solve - dealing with brain-dead > SSD controllers. > > -- > Rich > Hi Rich, hi Michael, THAT is information I like...now I start(!) to understand the "inner mechanics" of fstrim...thank you very much!!! :::))) One quesion -- not to express any doubt of what you wrote Rich, but onlu to check, whether I understand that detail or not: Fstrim "allows" the drive to trim ittself. The actual "trimming" is done by the drive ittself without any interaction from the outside of the SSD. You wrote: > > Now, the drive controller needs to keep track of which blocks are in > > use (which it does whether you use fstrim or not), and that data is > > probably stored in some kind of flash so that it is persistent, but > > presumably that is managed in such a manner that it is unlikely to > > fail before the rest of the drive fails. and: > > Fundamentally trimming is just giving the drive more information about > > the importance of the data it is storing. Just about any filesystem For me (due my beginners level of knowing the "behind the scene" things) this is kinda contradictionous. On the one hand, the SSD drive keeps track of the information, what blocks are used and unused. 
And trimming is done by the drive itself. On the other hand, trimming is
"just giving the drive more information about....".

What kind of information does the command-line tool fstrim transfer to
the SSD, besides the command "fstrim yourself" (an ioctl, I think?)
which is needed to trim the blocks, and what kind of information does
the SSD collect itself for this purpose?

Cheers!
Meino

^ permalink raw reply	[flat|nested] 22+ messages in thread
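One way to see exactly what the userspace tool hands over: fstrim issues
the FITRIM ioctl on the mounted filesystem, passing a range and a
minimum extent length, and the filesystem driver then emits discard
requests for its own free extents. The mount point below is a
placeholder:

# strace -e trace=ioctl fstrim -v /mnt/ssd

The drive never sees that ioctl; it only receives the resulting
TRIM/deallocate commands for concrete block ranges, which is the "more
information" being talked about here.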
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 13:18 ` tuxic @ 2020-04-13 14:27 ` Rich Freeman 2020-04-13 15:41 ` David Haller 0 siblings, 1 reply; 22+ messages in thread From: Rich Freeman @ 2020-04-13 14:27 UTC (permalink / raw To: gentoo-user On Mon, Apr 13, 2020 at 9:18 AM <tuxic@posteo.de> wrote: > > One quesion -- not to express any doubt of what you wrote Rich, but onlu > to check, whether I understand that detail or not: > > Fstrim "allows" the drive to trim ittself. The actual "trimming" is > done by the drive ittself without any interaction from the outside > of the SSD. > ... > On the one hand, the SSD drive keeps track of the information, what > blocks are used and unused. And trimming is done by the drive in > itsself. On the other hand trimming is "just giving the drive more > information about...." > > What kind of information does the commandline tool fstrim transfers to > the SSD beside the command "fstrim yourself" (an ioctl, I think?), > which is needed to fstrim the blocks and what kind of information is > the SDD collecting itsself for this purpose? > So, "trimming" isn't something a drive does really. It is a logical command issued to the drive. The fundamental operations the drive does at the physical layer are: 1. Read a block 2. Write a block that is empty 3. Erase a large group of blocks to make them empty The key attribute of flash is that the physical unit of blocks that can be erased is much larger than the individual blocks that can be read/written. This is what leads to the need for TRIM. If SSDs worked like traditional hard drives where an individual block could be rewritten freely then there would be no need for trimming. Actually, drives with 4k sectors is a bit of an analogy though the cost for partial rewrites on a hard drive is purely a matter of performance and not longevity and the 4k sector issue can be solved via alignment. It is an analogous problem though, and SMR hard drives is a much closer analogy (really no different from 4k sectors except now the magnitude of the problem is MUCH bigger). Whether you use TRIM or not an SSD has to translate any logical operation into the three physical operations above. It also has to balance wear into this. So, if you don't use trim a logical instruction like "write this data to block 1" turns into: 1. Look at the mapping table to determine that logical block 1 is at physical block 123001. 2. Write the new contents of block 1 to physical block 823125 which is unused-but-clean. 3. Update the mapping table so that block 1 is at physical block 823125, and mark 123001 as unused-but-dirty. Then the controller would have two background tasks that it would run periodically: Task 1 - look for contiguous regions marked as unused-but-dirty. When one of the right size/alignment is identified erase it physically and mark it as unused-but-clean. Task 2 - if the amount of unused-but-clean space gets too low, then find areas that have fragmented unused-but-dirty space. The used space in those regions would get copied to new unused-but-clean blocks and remapped, and then task 1 will erase them and mark them as clean. This deals with fragmented unused space. Every SSD will do some variation on this whether you ever use the trim command as it is necessary for wear-leveling and dealing with the physical erase limitations. Now, in this hypothetical case here is how the drive handles a TRIM command. If it gets the logical instruction "TRIM block 1" what it does is: 1. 
Look at the mapping table to determine that logical block 1 is at physical block 123001. 2. Mark physical block 123001 as unused-but-dirty in the mapping table. That's all it does. There are four ways that a drive can get marked as unused on an SSD: 1. Out of the box all blocks are unused-but-clean. (costs no operations that you care about) 2. The trim command marks a block as unused-but-dirty. (costs no operations) 3. Block overwrites mark the old block as unused-but-dirty. (costs a write operation, but you were writing data anyway) 4. Task 2 can mark blocks as unused-but-dirty. (costs a bunch of reads and writes) Basically the goal of TRIM is to do more of #2 and less of #4 above, which is an expensive read-write defragmentation process. Plus #4 also increases drive wear since it involves copying data. Now this is all a bit oversimplified but I believe it is accurate as far as it illustrates the concept. A real drive probably groups logical blocks a bit more so that it doesn't need to maintain a 1:1 block mapping which seems like it would use a lot of space. Again, it is a bit like a filesystem so all the optimizations filesystems use like extents/etc would apply. At the physical level the principle is that the drive has to deal with the issue that reads/writes are more granular than erases, and everything else flows from that. The same issue applies to SMR hard drives, which were discussed on this list a while ago. -- Rich ^ permalink raw reply [flat|nested] 22+ messages in thread
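A toy model of that bookkeeping, in shell just to make the states
concrete; the block counts and numbers are made up, and a real FTL does
this in firmware over much larger units:

#!/bin/bash
# Logical blocks map to physical blocks. An overwrite goes to a clean
# physical block and the old copy is merely marked dirty; a TRIM only
# flips that mark -- nothing is written to the trimmed block itself.
declare -A map     # logical block  -> physical block
declare -A state   # physical block -> clean | used | dirty
for p in 0 1 2 3 4 5 6 7; do state[$p]=clean; done

next_clean() {
    local p
    for p in 0 1 2 3 4 5 6 7; do
        [[ ${state[$p]} == clean ]] && { echo "$p"; return 0; }
    done
    return 1    # nothing clean left: the background erase (Task 1/2) must run
}

write_block() {  # write_block <logical>
    local l=$1 old=${map[$l]-} p
    p=$(next_clean) || { echo "no clean blocks left"; return 1; }
    state[$p]=used; map[$l]=$p
    if [[ -n $old ]]; then
        state[$old]=dirty
        echo "write L$l -> P$p (old P$old marked dirty)"
    else
        echo "write L$l -> P$p"
    fi
}

trim_block() {   # trim_block <logical>
    local l=$1 p=${map[$l]-}
    [[ -n $p ]] || return 0
    state[$p]=dirty; unset "map[$l]"    # bookkeeping only, no flash write
    echo "trim  L$l: P$p marked dirty, nothing written"
}

write_block 1    # first write of logical block 1
write_block 1    # overwrite: new physical block, old one becomes dirty
trim_block 1     # delete: only the mapping table changes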
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 14:27 ` Rich Freeman @ 2020-04-13 15:41 ` David Haller 2020-04-13 16:05 ` Rich Freeman 0 siblings, 1 reply; 22+ messages in thread From: David Haller @ 2020-04-13 15:41 UTC (permalink / raw To: gentoo-user Hello, On Mon, 13 Apr 2020, Rich Freeman wrote: >So, "trimming" isn't something a drive does really. It is a logical >command issued to the drive. > >The fundamental operations the drive does at the physical layer are: >1. Read a block >2. Write a block that is empty >3. Erase a large group of blocks to make them empty [..] >Now, in this hypothetical case here is how the drive handles a TRIM >command. If it gets the logical instruction "TRIM block 1" what it >does is: > >1. Look at the mapping table to determine that logical block 1 is at >physical block 123001. >2. Mark physical block 123001 as unused-but-dirty in the mapping table. > >That's all it does. There are four ways that a drive can get marked >as unused on an SSD: >1. Out of the box all blocks are unused-but-clean. (costs no >operations that you care about) >2. The trim command marks a block as unused-but-dirty. (costs no operations) >3. Block overwrites mark the old block as unused-but-dirty. (costs a >write operation, but you were writing data anyway) >4. Task 2 can mark blocks as unused-but-dirty. (costs a bunch of reads >and writes) > >Basically the goal of TRIM is to do more of #2 and less of #4 above, >which is an expensive read-write defragmentation process. Plus #4 >also increases drive wear since it involves copying data. Beautifully summarized Rich! But I'd like to add two little aspects: First of all: "physical write blocks" in the physical flash are 128kB or something in that size range, not 4kB or even 512B ... Haven't read, but looking enticing neither https://en.wikipedia.org/wiki/Write_amplification nor https://en.wikipedia.org/wiki/Trim_(computing) I hope they cover it ;) Anyway, a write to a single (used) logical 512B block involves: 1. read existing data of the phy-block-group (e.g. 128KB) 2. write data of logical block to the right spot of in-mem block-group 3. write in-mem block-group to (a different, unused) phy-block-group 4. update all logical block pointers to new phy-block-group as needed 5. mark old phy-block-group as unused And whatnot. And second: fstrim just makes the OS (via the Filesystem driver via the SATA/NVME/SCSI driver through some hoops), or the Filesystem when mounted with 'discard' via the drivers, tell the SSD one simple thing about logical blocks that a deleted file used to use (in the TRIM ATA/SCSI/SATA/NVME command, wikipedite for where TRIM is specced ;): "Hey, SSD, here's a list of LBAs (logical blocks) I no longer need. You may hencewith treat them as empty/unused." Without it, the SSD has no idea about those blocks being unneeded and treats blocks, once written to, as used blocks, doing the _tedious_ Copy-on-Write when a write hits one of those logical blocks, even if those were deleted on the filesystem level years ago... see above WP articles. Without TRIM, the SSD only gets to know the fact, when the driver (the FS) writes again to the same logical block ... With TRIM, the SSD-Controller knows what logical blocks it can treat as unused, and do much better wear-leveling. So, it's sort of a "trickle down 'unlink()' to the SSD"-feature. On the logical-block level, mind you. But for the SSD, that can be quite a "relief" regarding space for wear-leveling. 
And what takes time when doing a "large" TRIM is transmitting a _large_ list of blocks to the SSD via the TRIM command. That's why e.g. those ~6-7GiB trims I did just before (see my other mail) took a couple of seconds for 13GiB ~ 25M LBAs ~ a whole effin bunch of TRIM commands (no idea... wait, 1-4kB per TRIM and 4B/LBA is max. 1k LBAs/TRIM and for 25M LBAs you'll need minimum 25-100k TRIM commands... go figure ;) no wonder it takes a second or few ;) Oh, and yes, on rotating rust, all that does not matter. You'd just let the data rot and write at 512B (or now 4kB) granularity. Well, those 4k-but-512Bemulated drives (which is about all new ones by now I think) have to do something like SSDs. But only on the 4kB level. Plus the SMR shingling stuff of course. When will those implement TRIM? HTH, -dnh -- All Hardware Sucks and I do not consider myself to actually have any data until there's an offsite backup of it. -- Anthony de Boer ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 15:41 ` David Haller @ 2020-04-13 16:05 ` Rich Freeman 2020-04-13 20:34 ` antlists 0 siblings, 1 reply; 22+ messages in thread From: Rich Freeman @ 2020-04-13 16:05 UTC (permalink / raw To: gentoo-user On Mon, Apr 13, 2020 at 11:41 AM David Haller <gentoo@dhaller.de> wrote: > > First of all: "physical write blocks" in the physical flash are 128kB > or something in that size range, not 4kB or even 512B Yup, though I never claimed otherwise. I just made the generic statement that the erase blocks are much larger than the write blocks, even moreso than on a 4k hard drive. (The only time I mentioned 4k was in the context of hard drives, not SSDs.) > Anyway, a write to a single (used) logical 512B block > involves: > > 1. read existing data of the phy-block-group (e.g. 128KB) > 2. write data of logical block to the right spot of in-mem block-group > 3. write in-mem block-group to (a different, unused) phy-block-group > 4. update all logical block pointers to new phy-block-group as needed > 5. mark old phy-block-group as unused Yup. Hence my statement that my description was a simplification and that a real implementation would probably use extents to save memory. You're describing 128kB extents. However, there is no reason that the drive has to keep all the blocks in an erase group together, other than to save memory in the mapping layer. If it doesn't then it can modify a logical block without having to re-read adjacent logical blocks. > And what takes time when doing a "large" TRIM is transmitting a > _large_ list of blocks to the SSD via the TRIM command. That's why > e.g. those ~6-7GiB trims I did just before (see my other mail) took a > couple of seconds for 13GiB ~ 25M LBAs ~ a whole effin bunch of TRIM > commands (no idea... wait, 1-4kB per TRIM and 4B/LBA is max. 1k > LBAs/TRIM and for 25M LBAs you'll need minimum 25-100k TRIM > commands... go figure ;) no wonder it takes a second or few ;) There is no reason that 100k TRIM commands need to take much time. Transmitting the commands is happening at SATA speeds at least. I'm not sure what the length of the data in a trim instruction is, but even if it were 10-20 bytes you could send 100k of those in 1MB, which takes <10ms to transfer depending on the SATA generation. Now, the problem is the implementation on the drive. If the drive takes a long time to retire each command then that is what backs up the queue, and hence that is why the behavior depends a lot on firmware/etc. The drive mapping is like a filesystem and as we all know some filesystems are faster than others for various operations. Also as we know hardware designers often aren't optimizing for performance in these matters. > Oh, and yes, on rotating rust, all that does not matter. You'd just > let the data rot and write at 512B (or now 4kB) granularity. Well, > those 4k-but-512Bemulated drives (which is about all new ones by now I > think) have to do something like SSDs. But only on the 4kB level. Plus > the SMR shingling stuff of course. When will those implement TRIM? And that would be why I used 4k hard drives and SMR drives as an analogy. 4k hard drives do not support TRIM but as you (and I) pointed out, they're only dealing with 4k at a time. SMR drives sometimes do support TRIM. -- Rich ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 16:05 ` Rich Freeman @ 2020-04-13 20:34 ` antlists 2020-04-13 20:58 ` Rich Freeman 0 siblings, 1 reply; 22+ messages in thread From: antlists @ 2020-04-13 20:34 UTC (permalink / raw To: gentoo-user On 13/04/2020 17:05, Rich Freeman wrote: >> And what takes time when doing a "large" TRIM is transmitting a >> _large_ list of blocks to the SSD via the TRIM command. That's why >> e.g. those ~6-7GiB trims I did just before (see my other mail) took a >> couple of seconds for 13GiB ~ 25M LBAs ~ a whole effin bunch of TRIM >> commands (no idea... wait, 1-4kB per TRIM and 4B/LBA is max. 1k >> LBAs/TRIM and for 25M LBAs you'll need minimum 25-100k TRIM >> commands... go figure;) no wonder it takes a second or few;) > There is no reason that 100k TRIM commands need to take much time. > Transmitting the commands is happening at SATA speeds at least. I'm > not sure what the length of the data in a trim instruction is, but > even if it were 10-20 bytes you could send 100k of those in 1MB, which > takes <10ms to transfer depending on the SATA generation. Dare I say it ... buffer bloat? poor implementation? aiui, the spec says you can send a command "trim 1GB starting at block X". Snag is, the linux block size of 4KB means that it gets split into loads of trim commands, which then clogs up all the buffers ... Plus all too often the trim command is synchronous, so although it is pretty quick, the drive won't accept the next command until the previous one has completed. Cheers, Wol ^ permalink raw reply [flat|nested] 22+ messages in thread
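How finely the kernel splits a large discard before it reaches the drive
is visible in sysfs: discard_max_bytes is the per-request ceiling the
block layer uses when splitting, and discard_granularity the smallest
unit worth sending. The device name is an example:

# cat /sys/block/sda/queue/discard_granularity
# cat /sys/block/sda/queue/discard_max_bytes
# cat /sys/block/sda/queue/discard_max_hw_bytes   # on newer kernels: the device's own limit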
* Re: [gentoo-user] Understanding fstrim... 2020-04-13 20:34 ` antlists @ 2020-04-13 20:58 ` Rich Freeman 2020-04-14 3:32 ` tuxic 0 siblings, 1 reply; 22+ messages in thread From: Rich Freeman @ 2020-04-13 20:58 UTC (permalink / raw To: gentoo-user On Mon, Apr 13, 2020 at 4:34 PM antlists <antlists@youngman.org.uk> wrote: > > aiui, the spec says you can send a command "trim 1GB starting at block > X". Snag is, the linux block size of 4KB means that it gets split into > loads of trim commands, which then clogs up all the buffers ... > Hmm, found the ATA spec at: http://nevar.pl/pliki/ATA8-ACS-3.pdf In particular page 76 which outlines the LBA addressing for TRIM. It looks like up to 64k ranges can be sent per trim, and each range can cover up to 64k blocks. So that is 2^32 blocks per trim command, or 2TiB of data per TRIM if the blocks are 512 bytes (which I'm guessing is the case for ATA but I didn't check). The command itself would be half a megabyte since each range is 64 bits. But if the kernel chops them up as you say that will certainly add overhead. The drive controller itself is probably the bigger bottleneck unless it is designed to do fast TRIMs. -- Rich ^ permalink raw reply [flat|nested] 22+ messages in thread
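Running that arithmetic through calc, as used earlier in the thread --
65536 ranges per command, 65536 sectors per range, 512-byte sectors, and
8 bytes of range data per entry:

# calc '2^16 * 2^16 * 512 / 2^40'   # TiB addressable by a single TRIM command
        2
# calc '2^16 * 8 / 2^10'            # KiB of range data in that command
        512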
* Re: [gentoo-user] Understanding fstrim...
  2020-04-13 20:58 ` Rich Freeman
@ 2020-04-14  3:32   ` tuxic
  2020-04-14 12:51     ` Rich Freeman
  0 siblings, 1 reply; 22+ messages in thread
From: tuxic @ 2020-04-14 3:32 UTC (permalink / raw)
To: gentoo-user

On 04/13 04:58, Rich Freeman wrote:
> On Mon, Apr 13, 2020 at 4:34 PM antlists <antlists@youngman.org.uk> wrote:
> >
> > aiui, the spec says you can send a command "trim 1GB starting at block
> > X". Snag is, the linux block size of 4KB means that it gets split into
> > loads of trim commands, which then clogs up all the buffers ...
>
> Hmm, found the ATA spec at:
> http://nevar.pl/pliki/ATA8-ACS-3.pdf
>
> In particular page 76 which outlines the LBA addressing for TRIM. It
> looks like up to 64k ranges can be sent per trim, and each range can
> cover up to 64k blocks. So that is 2^32 blocks per trim command, or
> 2TiB of data per TRIM if the blocks are 512 bytes (which I'm guessing
> is the case for ATA but I didn't check). The command itself would be
> half a megabyte since each range is 64 bits.
>
> But if the kernel chops them up as you say that will certainly add
> overhead. The drive controller itself is probably the bigger
> bottleneck unless it is designed to do fast TRIMs.
>
> --
> Rich

Hi all,

thanks **a lot** for all this great information! :)

Since I have an NVMe drive in an M.2 socket, I would be interested to
know at what level/stage the data take a different path than with the
classical SATA SSDs.

Is this just "protocol", or is there something different?

On the internet I read that the io-scheduler is chosen differently by
the kernel if an NVMe drive is detected, for example. (I think the
io-scheduler has nothing to do with the fstrim operation itself -- it is
only meant as an example...)

I think, due to the corona lockdown, I will have to fstrim my hair
myself.... :) 8)

Cheers!
Meino

^ permalink raw reply	[flat|nested] 22+ messages in thread
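The scheduler part at least is easy to check from userspace: the active
scheduler is the bracketed entry in sysfs, and blk-mq NVMe devices
typically default to "none" while SATA disks get something like
mq-deadline or bfq. Device names are examples:

# cat /sys/block/nvme0n1/queue/scheduler
# cat /sys/block/sda/queue/scheduler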
* Re: [gentoo-user] Understanding fstrim...
  2020-04-14 3:32 ` tuxic
@ 2020-04-14 12:51 ` Rich Freeman
  2020-04-14 14:26 ` Wols Lists
  0 siblings, 1 reply; 22+ messages in thread
From: Rich Freeman @ 2020-04-14 12:51 UTC (permalink / raw)
To: gentoo-user

On Mon, Apr 13, 2020 at 11:32 PM <tuxic@posteo.de> wrote:
>
> Since I have an NVMe drive on an M.2 socket, I would be interested to
> know at what level/stage the data take a different path than with
> classical SATA SSDs.
>
> Is this just "protocol", or is there something different?

NVMe involves hardware, protocol, and of course software changes driven
by these.

First, a disclaimer: I am by no means an expert in storage transport
protocols/etc, and obviously there are a ton of standards so that any
random drive works with any random motherboard/etc. If I missed
something or have any details wrong please let me know.

From the days of IDE to pre-NVMe on PC the basic model was that the CPU
would talk to a host bus adapter (HBA), which would in turn talk to the
drive controller. The HBA was usually on the motherboard, but of course
it could be in an expansion slot. The CPU talked to the HBA using the
bus standards of the day (ISA/PCI/PCIe/etc), and this was of course a
high-speed bus designed to work with all kinds of stuff. AHCI is the
latest generation of protocols for communication between the CPU and
the HBA, so that any HBA can work with any OS/etc using generic
drivers. The protocol was designed with spinning hard drives in mind
and has some limitations with SSDs.

The HBA would talk to the drive over SCSI/SAS/SATA/PATA, and there are
a bunch of protocols designed for this. Again, they were designed in
the era of hard drives and have some limitations.

The concept of NVMe is to ditch the HBA and stick the drive directly on
the PCIe bus, which is much faster, and to streamline the protocols.
This is a little analogous to the shift to IDE from the old days of
separate drive controllers - cutting out a layer of interface hardware.
Early NVMe drives just had their own protocols, but a standard was
created so that, just as with AHCI, the OS can use a single driver for
any drive vendor.

At the hardware level the big change is that NVMe just uses the PCIe
bus. That M.2 adapter has a different form factor than a regular PCIe
slot, and I didn't look at the whole pinout, but you probably could
just make a dumb adapter to plug a drive right into a PCIe slot, as
electrically I think they're just PCIe cards. I believe they have to be
PCIe v3+ and typically have 4 lanes, which is a lot of bandwidth. And
of course it is the same interface used for NICs/graphics/etc, so it is
pretty low latency and can support hardware interrupts and all that
stuff. It is pretty common for motherboards to share an M.2 slot with a
PCIe slot so that you can use one or the other but not both for this
reason - the same lanes are in use for both.

The wikipedia article has a good comparison of the protocol-level
changes. For the most part it mainly involves making things far more
parallel. With ATA/AHCI you'd have one queue of commands that was only
a few instructions deep. With NVMe you can have thousands of queues
with thousands of commands in each, and 2048 different hardware
interrupts for the drive to signal back when one command vs another has
completed (although I'm really curious whether anybody takes any real
advantage of that - unless you have different drivers trying to use the
same drive in parallel it seems like MSI-X isn't saving much here -
maybe if you had a really well-trusted VM or something, or the command
set has some way to segment the drive virtually...). This basically
allows billions of operations to be in progress at any given time, so
there is much less of a bottleneck in the protocol/interface itself.
NVMe is all about IOPS.

I don't know all the gory details, but I wouldn't be surprised if, once
you get past this, many of the commands themselves are much the same,
just to keep things simple. Or maybe they just rewrote it all from
scratch - I didn't look into it and would be curious to hear from
somebody who has. Obviously the concepts of read/write/trim/etc are
going to apply regardless of interface.

--
Rich

^ permalink raw reply	[flat|nested] 22+ messages in thread
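Some of this is visible from a running system. A small sketch - the PCI
address 01:00.0 and the device name nvme0n1 are examples, not fixed
values, and the lspci capability dump may need root:

    # find the NVMe controller on the PCIe bus, then show its negotiated
    # link (e.g. "Speed 8GT/s, Width x4" for a PCIe 3.0 x4 drive)
    lspci | grep -i 'non-volatile'
    lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'

    # count the nvme interrupt vectors and the hardware queues the
    # block layer exposes for the drive
    grep -c nvme /proc/interrupts
    ls /sys/block/nvme0n1/mq | wc -l

Typically you'll see roughly one queue/interrupt pair per CPU - far
fewer than the protocol's maximum, but still much more parallelism than
a single AHCI command queue.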
* Re: [gentoo-user] Understanding fstrim...
  2020-04-14 12:51 ` Rich Freeman
@ 2020-04-14 14:26 ` Wols Lists
  2020-04-14 14:51 ` Rich Freeman
  0 siblings, 1 reply; 22+ messages in thread
From: Wols Lists @ 2020-04-14 14:26 UTC (permalink / raw)
To: gentoo-user

On 14/04/20 13:51, Rich Freeman wrote:
> I believe they have to be PCIe v3+ and typically have 4 lanes, which
> is a lot of bandwidth.

My new mobo - the manual says if I put an nvme drive in - I think it's
the 2nd nvme slot - it disables the 2nd graphics card slot :-(

Seeing as I need two graphics cards to double-head my system, that
means I can't use two nvmes :-(

But using the 1st nvme slot disables a sata slot, which buggers my raid
up ... :-(

Oh well. That's life :-(

Cheers,
Wol

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [gentoo-user] Understanding fstrim...
  2020-04-14 14:26 ` Wols Lists
@ 2020-04-14 14:51 ` Rich Freeman
  0 siblings, 0 replies; 22+ messages in thread
From: Rich Freeman @ 2020-04-14 14:51 UTC (permalink / raw)
To: gentoo-user

On Tue, Apr 14, 2020 at 10:26 AM Wols Lists <antlists@youngman.org.uk> wrote:
>
> On 14/04/20 13:51, Rich Freeman wrote:
> > I believe they have to be PCIe v3+ and typically have 4 lanes, which
> > is a lot of bandwidth.
>
> My new mobo - the manual says if I put an nvme drive in - I think it's
> the 2nd nvme slot - it disables the 2nd graphics card slot :-(
>

First, there is no such thing as an "nvme slot". You're probably
describing an M.2 slot. This matters, as I'll get to later...

As I mentioned, many motherboards share a PCIe slot with an M.2 slot.
The CPU + chipset only has so many PCIe lanes. So unless they aren't
already using them all for expansion slots, they have to double up. By
doubling up they can basically stick more x2/4/8 PCIe slots into the
motherboard than they could if they completely dedicated them. Or they
could let that second GPU talk directly to the CPU vs having to go
through the chipset (I think - I'm not really an expert on PCIe), and
let the NVMe talk directly to the CPU if you aren't using that second
GPU.

> But using the 1st nvme slot disables a sata slot, which buggers my
> raid up ... :-(
>

While that might be an M.2 slot, it probably isn't an "nvme slot". M.2
can be used for either SATA or PCIe. Some motherboards have one, the
other, or both. And M.2 drives can be either, so you need to be sure
you're using the right one. If you get the wrong kind of drive it might
not work, or it might end up being a SATA drive when you intended to
use an NVMe drive. A SATA drive will have none of the benefits of NVMe
and will be functionally no different from a regular 2.5" SSD that
plugs into a SATA cable - it is just a different form factor.

It sounds like they doubled up a PCIe port on the one M.2 connector,
and they doubled up a SATA port on the other M.2 connector.

It isn't necessarily a bad thing, but obviously you need to make
tradeoffs. If you want a motherboard with a dozen x16 PCIe's, 5 M.2's,
14 SATA ports, and 10 USB3's on it, there is no reason that shouldn't
be possible, but don't expect to find it in the $60 bargain bin, and
don't expect all those lanes to talk directly to the CPU unless you're
using EPYC or something else high-end. :)

--
Rich

^ permalink raw reply	[flat|nested] 22+ messages in thread
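If you're unsure which kind of M.2 drive a system actually ended up
with, lsblk can show the transport; a quick check:

    # TRAN shows how each disk is attached: an M.2 SATA drive reports
    # "sata", a true NVMe drive reports "nvme"
    lsblk -d -o NAME,TRAN,MODEL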
* [gentoo-user] Re: Understanding fstrim...
  2020-04-13 11:55 ` Michael
  2020-04-13 12:18 ` Rich Freeman
@ 2020-04-13 15:48 ` Holger Hoffstätte
  1 sibling, 0 replies; 22+ messages in thread
From: Holger Hoffstätte @ 2020-04-13 15:48 UTC (permalink / raw)
To: gentoo-user, Michael

On 4/13/20 1:55 PM, Michael wrote:
> I have noticed when prolonged fstrim takes place on an old SSD drive
> of mine it becomes unresponsive. As Rich said this is not because data
> is being physically deleted, only a flag is switched from 1 to 0 to
> indicate its availability for further writes.

This is all true, and while the exact behaviour depends on the drive
model, a common problem is that the drive's request queue is flooded
during a whole-drive fstrim, and that can lead to unresponsiveness -
especially when you're using the deadline (or mq-deadline these days)
scheduler, which by nature has a tendency to starve readers when
continuous long chains of writes happen. One can tune the read/write
ratio to be more balanced.

A better way to help the block layer out is to switch to bfq on these
(presumably lower-endish) devices, either permanently or - my approach
- just during scheduled fstrim. I've written a script to do just that
and have been using it on all my machines (server, workstation, laptop)
for a long time now. Switching to a fair I/O scheduler during scheduled
fstrim *completely* fixed the system-wide lag for me.

I suggest you try bfq for fstrim; if you'd like the script just let me
know.

cheers
Holger

^ permalink raw reply	[flat|nested] 22+ messages in thread
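Holger's script itself isn't posted in the thread, but a rough sketch
of the approach he describes could look like this (not his actual
script; the device name is an example, and it assumes bfq is available
in your kernel):

    #!/bin/sh
    # temporarily switch one disk to the bfq scheduler, trim all mounted
    # filesystems that support discard, then restore the previous
    # scheduler (run as root)
    dev=sda                                   # example device
    sched=/sys/block/$dev/queue/scheduler

    old=$(sed -n 's/.*\[\(.*\)\].*/\1/p' "$sched")   # currently active scheduler
    echo bfq > "$sched"
    fstrim --all --verbose
    echo "$old" > "$sched"

Calling something like this from the weekly cron job or systemd timer
instead of plain fstrim keeps the rest of the setup unchanged.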
Thread overview: 22+ messages (newest: 2020-04-14 14:51 UTC)
2020-04-13 5:32 [gentoo-user] Understanding fstrim tuxic
2020-04-13 9:22 ` Andrea Conti
2020-04-13 9:49 ` Neil Bothwick
2020-04-13 11:01 ` Andrea Conti
2020-04-13 10:06 ` Michael
2020-04-13 11:00 ` tuxic
2020-04-13 14:00 ` David Haller
2020-04-13 10:11 ` Peter Humphrey
2020-04-13 11:39 ` Rich Freeman
2020-04-13 11:55 ` Michael
2020-04-13 12:18 ` Rich Freeman
2020-04-13 13:18 ` tuxic
2020-04-13 14:27 ` Rich Freeman
2020-04-13 15:41 ` David Haller
2020-04-13 16:05 ` Rich Freeman
2020-04-13 20:34 ` antlists
2020-04-13 20:58 ` Rich Freeman
2020-04-14 3:32 ` tuxic
2020-04-14 12:51 ` Rich Freeman
2020-04-14 14:26 ` Wols Lists
2020-04-14 14:51 ` Rich Freeman
2020-04-13 15:48 ` [gentoo-user] " Holger Hoffstätte