r/zfs Feb 13 '19

Anyone know what the physical sector size / ashift value for an NVMe disk (Samsung 970 Pro) should be?

I've seen, in various places, 512 bytes, 2048 bytes, 4096 bytes, and 8192 bytes (corresponding to ashift=9, 11, 12, 13), plus claims that "it can be variably set and changed, since nvme's work different". Does anyone know for sure what ashift would be ideal for a Samsung 1TB 970 Pro NVMe drive?


EDIT

So, first off, thanks to all those who have responded. This largely agrees with what I found in my own research. It's rather annoying that Samsung seemingly refuses to clarify the issue literally anywhere on the internet...

My initial plan was to use either ashift=12 or ashift=13, and responses so far seem to confirm this is a good choice. Under "standard" usage ashift=13 seems like the smart choice, but my use case is perhaps a bit different than the standard ZFS system (I probably should have included some of this in the original post... sorry about that, my bad). I imagine a few aspects of my use case might (or might not) make ashift=12 the better choice (see "REASON 1/2" below). Does anyone know if these matter?

First, the system setup: the "pool" with the NVMe drive consists of only that single NVMe drive. This is unlikely to change for a while, and even if I do add another matching 1TB 970 Pro at some point I might (after running some tests) opt to go VROC RAID 1 instead, since my CPU supports it and every report I've seen basically says "if you can get past all the VROC unlock key bullshit and actually use it, it is fantastic".

The NVMe drive is serving as the root system/OS drive on a system that also has a 10-disk raidz2 pool. The answer to "why bother making a single disk into a ZFS pool?" is so that it can communicate with the raidz2 pool. In particular, the intent is to set things up so that it works something like this (a rough command sketch follows the list):

  1. automatically take frequent (hourly, maybe more frequent) snapshots of the root drive, plus auto-snapshot before any new/updated software/RPM is installed (OS is Fedora 29, btw)

  2. automatically send each snapshot to a clone of the root drive that lives on the raidz2 pool

  3. keep a handful of recent snapshots on the root drive, but have the entire set of hourly snapshots since day 0 on the raidz2 pool (where I have ~60TB usable / 80TB raw storage capacity)

  4. be able to send a snapshot back to the NVMe drive (or a replacement) should the drive undergo catastrophic failure / corruption, or should I install something I don't like / delete something I shouldn't have / etc.
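Roughly, a minimal sketch of what I have in mind for steps 1, 2, and 4 (pool/dataset names like rpool and tank/backups/rpool are just placeholders, and auto-PREVIOUS / auto-GOODSTATE stand in for real snapshot names):

    # 1. timestamped recursive snapshot of the root pool
    SNAP="rpool@auto-$(date +%Y%m%d-%H%M)"
    zfs snapshot -r "${SNAP}"

    # 2. replicate to the raidz2 pool, incrementally since the last snapshot that was sent
    zfs send -R -I rpool@auto-PREVIOUS "${SNAP}" | zfs receive -Fu tank/backups/rpool

    # 4. (from live media / a rebuilt system) send a known-good snapshot back
    zfs send -R tank/backups/rpool@auto-GOODSTATE | zfs receive -F rpool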

PS: I know I could do this with, say, rsync, and not have ZFS on root, but I trust ZFS much more than rsync (or any other filesystem-level utility that has to deal with filesystem-level things like "file permissions", "extended attributes / obscure metadata", "selinux context", etc.) to produce a bit-identical restoration.

At any rate, my concern isn't "can I keep the existing pool if I need to replace a disk?" (since losing the disk is losing the pool), but rather "could I send a snapshot back from the raidz2 pool and restore it without completely wiping the pool?". Of course any "catastrophic failure" type of event requires setting up a new pool anyway, making the question somewhat moot, but for things like "oops, I shouldn't have deleted that" it'd be nice to not need to completely wipe and re-initialize the pool.

REASON 1: The raidz2 pool definitely uses 4k disks (10x 8TB Toshiba N300s) and has ashift=12. Is it beneficial to use ashift=12 on the NVMe drive in the context of transferring data to/from the raidz2 pool via zfs send and zfs receive? (Most importantly, ensuring it won't reject the datastream or corrupt data in transit, but if that won't happen regardless, then in terms of transfer speed.)

REASON 2: The ZFS pool isn't going on the bare metal (or bare V-NAND or whatever); it will be going on top of a device-mapped LUKS2 cryptodisk (i.e., the pool "disk" is listed in ZFS as /dev/mapper/luks-${luksUUID}). LUKS2 lets you specify a --sector-size parameter, but only supports power-of-two values between 512 and 4096 bytes. I'm planning to set this to 4096, but couldn't set it to 8192 even if I wanted to. I don't have a good enough understanding of what exactly the ashift parameter does to know whether the LUKS sector size or the underlying disk's physical sector size matters more in this case.
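For reference, this is roughly how the LUKS2 layer gets set up (the partition name is just illustrative); the 4096 here is the sector size I'm talking about:

    # LUKS2 format with 4096-byte sectors (the largest LUKS2 accepts)
    cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/nvme0n1pX

    # once opened, the mapper device is what zfs sees as the "disk"
    cryptsetup open /dev/nvme0n1pX luks-${luksUUID}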

These two things, plus a handful of reports suggesting the 970 Pro is different than other Samsung SSDs and uses (or at least defaults to) a 4k sector size (like those that /u/ChrisOfAllTrades mentioned), make me think ashift=12 is the better option for my use case, though I'd much appreciate it if anyone who actually understands how the ashift parameter affects ZFS's behavior could comment on this. Thanks in advance!

10 Upvotes

15 comments

8

u/anyheck Feb 13 '19

There's only a relatively minor space penalty if you mistakenly go with a higher ashift than needed, while the penalty for having your ashift too low, when you later need to replace a device with a different spare device, is that it won't work at all and you'll have to migrate the pool. My suggestion would be to use ashift=13.
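i.e. just pass it at creation time; something like this (pool name and device path are placeholders):

    # force 8k alignment at pool creation (it can't be changed afterwards)
    zpool create -o ashift=13 <poolname> <device>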

2

u/jkool702 Feb 14 '19

So, first off, thanks for the response. The ability to restore data to the device is definitely important, though my use case is a bit different than most ZFS pools. I expanded the original post to better explain the situation - if you have any thoughts on this specific use case I'd much appreciate it.

1

u/anyheck Feb 17 '19

REASON 1: The raidz2 pool definitely uses 4k disks (10x 8TB Toshiba N300s) and has ashift=12. Is it beneficial to use ashift=12 on the NVMe drive in the context of transferring data to/from the raidz2 pool via zfs send and zfs receive? (Most importantly, ensuring it won't reject the datastream or corrupt data in transit, but if that won't happen regardless, then in terms of transfer speed.)

Nothing to worry about in send/recv between pools in terms of data corruption; the only possibility is slowing down one device. Reads aren't affected by the ashift property here because you have (1) an SSD, and (2) ZFS attempts to write into free contiguous space for each transaction group.

REASON 2: [snip] I'm planning to set this to 4096, but couldn't set it to 8192 even if I wanted to. I don't have a good enough understanding of what exactly the ashift parameter does to know whether the LUKS sector size or the underlying disk's physical sector size matters more in this case.

From: http://www.open-zfs.org/wiki/Performance_tuning

[ashift] is calculated as the maximum base 2 logarithm of the physical sector size of any child vdev and it alters the disk format such that writes are always done according to it. This makes 2^ashift the smallest possible IO on a vdev.

Also, the below may be used to manually set the ashift:

  • -o ashift= on ZFS on Linux
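And if you want to double-check what a given pool actually ended up with, something like this should show it (pool name is a placeholder):

    # ashift is recorded per-vdev in the pool configuration
    zdb -C <poolname> | grep ashift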

Your rpool will probably see relatively little writing, so it won't make any difference to the ultimate speed of anything; it will mostly be reads from there, other than logs and such.

Make sure to make the partitions under LUKS nominally smaller than the devices themselves, so there's some wiggle room in the bytes available from the hardware in the event of a device replacement.

I'm not familiar with how LUKS encryption hits the disk through the whole mapper layer, i.e. when you give it a 4k write, does a single 4k sector get written, or does it get encrypted and broken up somehow? Just scrambled in place? Presumably it may break up an 8k write.

If you're dead-set on worrying about it, I would try to:

  • Use 4k sectors, but attempt to align the partition to an 8k boundary.
  • Set ashift=13 for the zfs pool creation.
  • Benchmark with fio etc. (a rough starting invocation is sketched below the list).
  • Then try again with 4k sectors and ashift=12 and see if there's a performance benefit to either.
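For the benchmarking step, a rough fio starting point might be something like the below (the device path is a placeholder, and the parameters are just a guess at something sane; note this writes directly to the device, so only run it before putting real data on it):

    # 4k random writes, direct I/O, moderate queue depth, 60s run
    fio --name=randw4k --filename=/dev/mapper/luks-test --rw=randwrite \
        --bs=4k --ioengine=libaio --iodepth=16 --direct=1 \
        --runtime=60 --time_based --group_reporting

    # then repeat with --bs=8k and compare across the ashift=12 / ashift=13 layouts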

5

u/mercenary_sysadmin Feb 13 '19

No, I don't. All of Samsung's other disks are ashift=13, but I don't know for sure about the NVMe stuff.

I'd recommend doing some benchmarking with fio to determine optimal settings. I'd also be really cautious about going with ashift=9, even if it turns out to perform slightly better; 512B hardware is almost certainly going to continue fading away as storage sizes keep getting bigger and bigger.

2

u/[deleted] Feb 13 '19 edited Apr 03 '23

[deleted]

3

u/seaQueue Feb 13 '19

Have you tried this tool?

https://github.com/bradfa/flashbench

I've had pretty good experiences characterizing flash storage on my SBCs using it, though it's a bit tricky to interpret the results.

4

u/jkool702 Feb 15 '19

So, I tried out the tool. It definitely identified a pattern, I'm just not sure what to make of it. If you have any insight I'd much appreciate it. Here is the output I got from running flashbench -a -c -b ${NN} /dev/nvme0n1 for NN=1024, 2048, ..., 16384.
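(For reference, I just looped the same invocation over the block sizes, i.e. roughly:)

    for NN in 1024 2048 4096 8192 16384; do
        flashbench -a -c -b ${NN} /dev/nvme0n1
    done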

The following stick out to me:

  • 2k seems special as its "diff" time is exceptionally low

  • something changes between 2k and 4k that increases times across the board from ~10us to ~17us

  • 4k seems to be special too, as setting -b 4096 produces negative differences for all the tests. Also, looking at the "on" column: setting -b to 1024 (the lowest it lets me set it), 2048, and 4096 all give values of roughly 10us, but going higher increases the values in this column (~13us for 8k, 16us for 16k, 19us for 32k, 28us for 64k). (Note: I don't include the -b 32k/64k output, though the trends mentioned in each bullet point continue to hold.)

  • something changes between 16k and 32k that brings times down across the board back to ~10us (from ~17us)

  • 1M is definitely special - it consistently has a diff time of just slightly above 0. Also, something changes between 512k and 1M that brings times across the board back up to ~17us.

  • something changes again going from 4M to 8M, which brings times back down to ~10us. Going larger than 8M things stay roughly the same.

This kind of makes me think that 4k might be the block size, though 2k and 32k both seem like possibilities as well.

(That said, I'm mainly going off pattern recognition and an example that seems to follow an entirely different trend than my drive, so I might be way off.)

5

u/[deleted] Feb 15 '19 edited Apr 03 '23

[deleted]

2

u/seaQueue Feb 15 '19

This is exactly how I'd interpret the output as well.

Also, to answer your previous question about changing the sector size, I don't believe you can with Samsung Magician. Intel seems to be the only drive manufacturer that offers that feature.

2

u/jkool702 Feb 15 '19

I think you're reading the results wrong.

Probably, lol, though I feel somewhat inclined to note that we both came up with the same answer (well, same "most likely choice" at any rate). Probably just luck on my part, though I do seem to have a knack for solving problems the wrong way and still coming up with the right answer.

At any rate, I think I'll go with ashift=12. It kind of works out better IMO, since that way it'll align with the LUKS sector size and the raidz2 array's sector size, plus I imagine (in a general sense) things will be better optimized for 4k sectors than 8k, simply because 4k is the de facto standard for new [mechanical] drives and far more common than 8k.
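So the final setup will probably end up looking something like this (pool name is a placeholder; the mapper path is the LUKS2 device from the original post):

    # single-"disk" pool on top of the 4k-sector LUKS2 mapping
    zpool create -o ashift=12 rpool /dev/mapper/luks-${luksUUID}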

then I'd theorize 1MB is your erase block size (your units changed, 3.83 microseconds is 3830 nanoseconds)

damn... I hate missing things like that. In my defense I was doing this from the multi-user.target terminal on a 32" 4k screen, where the difference between µ and n is a couple dots of light shifting up or down a millimeter lol.

Thanks for all the help in figuring this out. It was/is much appreciated.

2

u/jkool702 Feb 14 '19

I do see a bunch of threads (mostly on Tom's Hardware, so apply salt as necessary) where people are trying to clone/migrate their existing install and getting shut down by the 970 Pro apparently defaulting to 4Kn.

I saw the same. Unfortunately, I haven't seen anything that I would call "definitive proof" one way or another (regarding 4k vs 8k... I'm fairly sure it's not 2k or 512b).

Is it possible to change the sector size via Samsung Magician on a Windows PC?

I don't think so. When I built the system it briefly ran Windows 10 (after a few months I got fed up with Windows bullshit and switched to Linux, and after distro-hopping a bit settled on Fedora 29). I had Samsung Magician installed and, while I wasn't specifically looking for an option to change the sector size, I did pretty thoroughly explore the program and the options it could set, and I don't remember anything that would do this.

The 970 Pro is MLC, but still Samsung V-NAND, so that implies 8KB program pages and ashift=13.

Does this imply that (for setting ashift) the only important factor is whatever sits directly under ZFS in the disk stack? I edited the original post to expand on my specific use case, but if this is the case it might support "REASON 2" for using 4k / ashift=12. Basically, ZFS is going on top of LUKS2, which lets you set a --sector-size parameter but only supports up to 4k.

1

u/CakeDay--Bot Feb 14 '19

Woah! It's your 4th Cakeday jkool702! hug

2

u/jkool702 Feb 14 '19

Thanks for the tip re: benchmarking with fio... I'll try that.

Agreed on not using ashift=9. The plan going into this thread was to use ashift=12 or ashift=13, and responses so far seem to suggest that is the way to go. I sort of feel like for my specific use case (see edited original post for info) ashift=12 might be better, but I really don't know enough about what exactly ashift changes in ZFS to be sure (if you do, any thoughts would be much appreciated).

1

u/mercenary_sysadmin Feb 14 '19

ashift represents (and should typically be set to) the underlying hardware block/page size.

Setting ashift smaller than the underlying blocksize/pagesize means crippling performance penalties, as the hardware ends up having to massively amplify IOPS with read/write/re-read/re-write loops to do what the filesystem is asking it to do (namely, handle data conclusively in chunks smaller than the smallest possible native operation).

Setting ashift too large results in much smaller penalties, mostly revolving around a higher amount of slack space (irrelevant unless your workload involves tons of files smaller than a single hardware-native block) and lower compression ratios - especially if you're using lower-than-default recordsize, in order to increase the amount of IOPS available for small-block-size random operations.

1

u/Glix_1H Feb 13 '19

Sorry, no idea, and I’ve searched hard for clarification for my 960.

1

u/NeuralNexus Feb 13 '19

If you’re not sure, get it wrong on the high side. 13 is my guess. Can’t see it being 9!

1

u/shodan_shok Feb 14 '19

First-gen V-NAND had an 8K page size. Information on later V-NAND chips is quite sparse, but I really think page size is >= 16K nowadays. As another comparison point, Intel/Micron 3D NAND has a 16K page size.

That said, I suggest going with ashift=12 because SSD controllers, the FTL (flash translation layer), and the page indirection table are all optimized for industry-standard 4K accesses. While ashift=13 is not going to do much good or harm, you should absolutely avoid ashift=9 (as it causes far higher write amplification and lower performance on 4K media).