Skip to content

zfs-2.3.2 patchset #17214

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 78 commits into
base: zfs-2.3-release
Choose a base branch
from

Conversation

tonyhutter
Copy link
Contributor

Motivation and Context

Proposed zfs-2.3.2 patchset

Description

Bump META file to support 6.14 kernel, support STATX_DIOALIGN, misc fixes.

How Has This Been Tested?

Runners will test

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

robn and others added 19 commits March 24, 2025 13:49
…zfs#17137)

Adding fields to zinject_record_t unexpectedly extended zfs_cmd_t,
preventing some things working properly with 2.3.1 userspace tools
against 2.3.0 kernel module.

This reverts commit fabdd50.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Update the META file to reflect compatibility with the 6.14
kernel.

Signed-off-by: Tony Hutter <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: @ImAwsumm
The operator can override TEST_BASE_DIR by setting its source var
FILEDIR through zfs-tests.sh -d. There were a handful of cases where
this was not honoured.

By default FILEDIR (and so TEST_BASE_DIR) is /var/tmp, so there should
be no functional change if the operator does nothing.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
The default file vdevs, constrained binpath and temporary runfiles were
all explicitly places in /var/tmp. Instead, put them under FILEDIR,
which is set from -d and defaults to /var/tmp. TEST_BASE_DIR is also
initialised from FILEDIR, which means all data for the run will now end
up under the operator-specified data dir.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
The config file value overrides any set by the operator, making it quite
difficult to put the test output elsewhere. The default is
/var/tmp/test_results (via BASEDIR in test-runner) so this shouldn't
change anything for the default case.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
The default outputdir had a timestamp appended in TestRun.__init__, and
then the timestamp was unconditionally applied again after the runfile
had been loaded, assuming that an outputdir would be set in the runfile
too. If the runfile didn't have an outputdir, then the outputdir would
get a second timestamp appended.

Further, if test groups or individual tests themselves specificed an
outputdir, those would be set on their config, but would not get a
timestamp appended. It's not entirely clear if that's wrong or not, but
it is certainly not consistent with the rest.

To clean all this up, change things to append a timestamp to a received
outputdir (from arg or runfile) before setting it in any TestRun,
TestGroup or Test object.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Many tests use mktemp to create temporary files and dirs, which will
usually put them in /tmp unless instructed otherwise. This had led to
many tests trying to give mktemp a useful temp path in ad-hoc ways, and
others just using it directly without knowing they're potentially
leaving stuff lying around.

So we set TMPDIR to FILEDIR, which makes the simplest uses of mktemp put
things in the wanted work dir.

Included here is a hack to get TMPDIR into the test. If a test has to be
run as a different user (most of them), it is run through sudo. ld.so
from glibc will not pass TMPDIR to a setuid program, so instead we
re-set TMPDIR after sudo before running the target command.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
In all cases, rely on mktemp itself to make the best decision about
where to place the file or directory. In all cases, that decision will
be $TMPDIR, which we have set globally.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Most of these are trying to use TMPDIR to put their work files somewhere
sensible. Now that we've set up correctly, they can all just use mktemp
to do the job.

In a couple of places cleaning up temp files wasn't being done
correctly, which has been fixed.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
The new Fast Dedup feature has a lot of moving parts, and only some of
them have tests. We have some tests for prefetch and quota, and a
generic ZAP shrinking test, but we don't have anything for the pruning
command or specific to DDT zap shrinking. Here we add a couple small new
tests for zpool ddtprune and DDT-specific ZAP shrinking.

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17049
PR openzfs#14161 made spa_do_crypt_objset_mac_abd() to ignore MAC errors
if local MAC can not be calculated at the time.  But it does not
mean we should also ignore portable MAC errors there.

Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17122
We had a case where we were autoreplacing a disk and
zpool_prepare_disk failed for some reason, and ZED
didn't log the return code.  This commit logs the code.

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes openzfs#17124
since 4.10, bio->bi_opf needs to be checked to determine all kinds of
flush requests. this was the case prior to the commit referenced below,
but the order of ifdefs was not the usual one (newest up top), which
might have caused this to slip through.

this fixes a regression when using zvols as Qemu block devices, but
might have broken other use cases as well. the symptoms are that all
sync writes from within a VM configured to use such a virtual block
devices are ignored and treated as async writes by the host ZFS layer.

this can be verified using fio in sync mode inside the VM, for example
with

 fio \
 --filename=/dev/sda --ioengine=libaio --loops=1 --size=10G \
 --time_based --runtime=60 --group_reporting --stonewall --name=cc1 \
 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 \
 --numjobs=1 --sync=1

which shows an IOPS number way above what the physical device underneath
supports, with "zpool iostat -r 1" on the hypervisor side showing no
sync IO occuring during the benchmark.

with the regression fixed, both fio inside the VM and the IO stats on
the host show the expected numbers.

Fixes: 846b598
"config: remove HAVE_REQ_OP_* and HAVE_REQ_*"

Signed-off-by: Fabian-Gruenbichler <[email protected]>
Co-authored-by: Alexander Motin <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
The reward for good work is more work.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Closed openzfs#17141
Now instead of crashing when attempting to read the corrupt block
pointer, ZFS will return ECKSUM, in a stack that looks like this:

```
none:set-error
zfs.ko`arc_read+0x1d82
zfs.ko`dbuf_read+0xa8c
zfs.ko`dmu_buf_hold_array_by_dnode+0x292
zfs.ko`dmu_read_uio_dnode+0x47
zfs.ko`zfs_read+0x2d5
zfs.ko`zfs_freebsd_read+0x7b
kernel`VOP_READ_APV+0xd0
kernel`vn_read+0x20e
kernel`vn_io_fault_doio+0x45
kernel`vn_io_fault1+0x15e
kernel`vn_io_fault+0x150
kernel`dofileread+0x80
kernel`sys_read+0xb7
kernel`amd64_syscall+0x424
kernel`0xffffffff810633cb
```

This patch should hopefully also prevent such corrupt block pointers
from being written to disk in the first place.

And in zdb, don't crash when printing a block pointer with no valid
DVAs.  If a block pointer isn't embedded yet doesn't have any valid
DVAs, that's a data corruption bug.  zdb should be able to handle the
situation gracefully.

Finally, remove an extra check for gang blocks in SNPRINTF_BLKPTR.  This
check, which compares the asizes of two different DVAs within the same
BP, was added by illumos-gate commit b24ab67[^1], and I can't understand
why.  It doesn't appear to do anything useful, so remove it.

[^1]: illumos/illumos-gate@b24ab67

Fixes		openzfs#17077
Sponsored by:	ConnectWise
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Alek Pinchuk <[email protected]>
Signed-off-by:	Alan Somers <[email protected]>
Closes openzfs#17078
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.

We're expected to return both memory and IO alignment. For memory, it's
always PAGE_SIZE. For IO, we return the current block size for the file,
which is the required alignment for an arbitrary block, and for the
first block we'll fall back to the ARC when necessary, so it should
always work.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16972
Implementation of DDT pruning introduced verification of DVAs in
a block pointer during ddt_lookup() to not by mistake free previous
pruned incarnation of the entry.  But when writing a new block in
zio_ddt_write() we might have the DVAs only from override pointer,
which may never have "D" flag to be confused with pruned DDT entry,
and we'll abandon those DVAs if we find a matching entry in DDT.

This fixes deduplication for blocks written via dmu_sync() for
purposes of indirect ZIL write records, that I have tested.  And
I suspect it might actually allow deduplication for Direct I/O,
even though in an odd way -- first write block directly and then
delete it later during TXG commit if found duplicate, which part
I haven't tested.

Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17120
* FreeBSD 12 is EoL.  Drop it.
* Use the latest FreeBSD 13 and 14 versions.
* Add FreeBSD 15.0-CURRENT.
* Use the current python version.

Sponsored by:	ConnectWise
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by:	Alan Somers <[email protected]>
Closes openzfs#17139
Force receive (zfs receive -F) can rollback or destroy snapshots and
file systems that do not exist on the sending side (see zfs-receive man
page). This means an user having the receive permission can effectively
delete data on receiving side, even if such user does not have explicit
rollback or destroy permissions.

This patch adds the receive:append permission, which only permits
limited, non-forced receive. Behavior for users with full receive
permission is not changed in any way.

Fixes openzfs#16943
Reviewed-by: Ameer Hamza <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Gionatan Danti <[email protected]>
Closes openzfs#17015
@robn
Copy link
Member

robn commented Apr 4, 2025

Looking pretty good. Still hewing very closely to master, which I like for now :)

Please include:

  • 3838a7b (SPDX tag for uconv.c, which is gone on master so never got one)

Consider the following:

  • FreeBSD build tweaks 83fa051 11ca12d
  • FreeBSD FPU enable for PPC: c6b8138
  • Bugfix: dspace underflow: c9198ca
  • Bugfix: device removal cancel lock inversion: 395db27
  • Bugfix: nonrot property not unset properly: e72e39c

I've no skin in any of these; they just look small and important in their contexts. Maybe the last one has unintended performance consequences, which maybe isn't good for a stable series, but I think it's defensible.

If you want the lot, here's a branch ready to go: robn/zfs-2.3.2-staging-robn.

@mmatuska
Copy link
Contributor

mmatuska commented Apr 6, 2025

@tonyhutter please include #17199 (87f8bf6) and #17219 (88e3885)

@tonyhutter
Copy link
Contributor Author

@robn @mmatuska let me see if I can pull those in

@jumbi77
Copy link
Contributor

jumbi77 commented Apr 8, 2025

Just checking https://github.com/openzfs/zfs/commits/master/ . Between Apr 3, 2025 and today are some commits regarding bugfixing, manpage, quotas.... if they are not "too new", maybe pull them in if applicable ?

@tonyhutter
Copy link
Contributor Author

@jumbi77 yea there's a few low-risk bugfixes from that list that I will probably pull in.

Some other things 2.3.2 is waiting on:
Kernel compatibility fix: #17217
Get --enable-linux-builtin working again: #17212

@jumbi77
Copy link
Contributor

jumbi77 commented Apr 12, 2025

The trivial fix for ZED would be great too 😀: 189dc26

AttilaFueloep and others added 19 commits April 16, 2025 09:59
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: SHENGYI HONG <[email protected]>
Closes openzfs#17101
FreeBSD nowadays supports FPU in the kernel on powerpc*, so enable it.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Piotr Kubaj <[email protected]>
Closes openzfs#17191
Since spa_dspace accounts only normal allocation class space,
spa_nonallocating_dspace should do the same.  Otherwise we may get
negative overflow or respective assertion spa_update_dspace() if
removed special/dedup vdev is bigger than all normal class space.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17183
FreeBSD kernel's WITNESS code detected lock ordering violation in
spa_vdev_remove_cancel_sync().  It took svr_lock while holding
ms_lock, which is opposite to other places.  I was thinking to
resolve it similar to openzfs#17145, but looking closer I don't think
we even need svr_lock at that point, since we already asserted
svr_allocd_segs is empty, and we don't need to add there segments
we are going to call free_mapped_segment_cb for.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17164
cmd/zinject/zinject.c:
 - use PRIu64 when printing uint64_t

tests/zfs-tests/cmd/clonefile.c:
 - use an unsigned long long to store result from strtoull()
 - use %jd for printing off_t, %zu for size_t, %zd for ssize_t

tests/zfs-tests/tests/functional/vdev_disk/page_alignment.c:
 - use %zx to print size_t

Discovered when compiling on FreeBSD i386.

Signed-off-by: Martin Matuska <[email protected]>

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: @ImAwsumm
When opening a vdev and setting the nonrot property, we used to wait for
each child to be opened before examining its nonrot property. When the
change was made to open vdevs asynchronously, we didn't move the nonrot
check out of the main loop. As a result, the nonrot property is almost
always set to false, regardless of the actual type of the underlying
disks. The fix is simply to move the nonrot check to a separate loop
after the taskq has been waited for.

Sponsored-by: Klara, Inc.
Sponsored-by: Eshtek, Inc.

Signed-off-by: Paul Dagnelie <[email protected]>
Co-authored-by: Paul Dagnelie <[email protected]>

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
- don't include foreign machine assembly files
- reduce diff to FreeBSD module Makefile

Discovered in FreeBSD port filesystems/openzfs-kmod

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Martin Matuska <[email protected]>
Closes openzfs#17219
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
When a dedup write fails, we try to roll the DDT entry back to a known
good state. However, this also rolls the refcounts and the last-update
time back to the state they were at when we started this write. This
doesn't appear to be able to cause any refcount leaks (after the fix in
17123). This PR prevents that from happening by only rolling back the
parts of the DDT entry that have been updated by the write so far.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Klara, Inc.

Signed-off-by: Paul Dagnelie <[email protected]>
Co-authored-by: Paul Dagnelie <[email protected]>

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
…fs#15972). (openzfs#17213)

The problem was identified in handling of the zpool get state command
line arguments. A pointer vdev was used to point to the argv[1], and
its address set to cb.cb_vdevs.cb_names(pointer to array of strings)
so any increment to cb_names resulted in a segfault. Fix covers a
special case of root parameter at argv[1] and remaining cases are
handled by passing in the argv + 1, which allows cb_names iteration
of next command line arguments (vdevs).

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Attila Fülöp <[email protected]>

Signed-off-by: Syed Shahrukh Hussain <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Richard Kojedzinszky <[email protected]>
Closes openzfs#17208
Debian requires libtirpc-dev.  Update our debian/control file to
match Debian's upstream one.

Closes: openzfs#17197

Signed-off-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: @manfromafar
The 6.0 kernel removes the 'migratepage' VFS op. Check for
migratepage.

Signed-off-by: Tony Hutter <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]
The tiniest typo in dd2a46b (openzfs#17106) broke it, by setting the wrong
var with the test var, resulting in it always producing "no".

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <[email protected]>
Reviewed by: Attila Fülöp <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17236
zfs_notify_email will now include an empty line separating the header
from the body of the email in case the subject is not provided via a
command line argument. This is necessary for programs like sendmail to
function correctly (everything up to the first empty line is interpreted
as header, which previously resulted in either missing message parts or
unsent emails)

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Felix Schmidt <[email protected]>
Closed openzfs#17238
IVs != 96 bits get hashed with GHASH to bring them to 96 bits. Any call
to GHASH will mix the ghash state in gcm_ghash. This is expected to be
zero at first use in an encrypt or decrypt operation, so it needs to be
zeroed after using GHASH in setup.

gcm_init() does this, but gcm_avx_init() zeroed it before setup, not
after, resulting in incorrect encrypt/decrypt results when using AVX GCM
with an IV != 96 bits.

OpenZFS _always_ uses a 96 bit IV (ZIO_DATA_IV_LEN) so this will never
have been hit in any real-world use, which is extremely fortunate, as we
would have incorrectly-encrypted data on-disk. Still, as long as we have
this code here we should make sure it's correct.

Thanks-to: Joel Low <[email protected]>
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Attila Fülöp <[email protected]>
Fix build errors on Fedora 42 like:

  module/zcommon/zfs_valstr.c:193:16: error: initializer-string for
  array of 'char' truncates NUL terminator but destination lacks
  'nonstring' attribute (3 chars into 2 available)

The arrays in zpool_vdev_os.c and zfs_valstr.c don't need to be
NULL terminated, but we do so to make GCC happy.

Closes: openzfs#17242

Signed-off-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
@shodanshok
Copy link
Contributor

@tonyhutter maybe #17255 can be considered for inclusion? Thanks.

@devsk
Copy link

devsk commented Apr 19, 2025

@tonyhutter, Quick question: At what point do we say "we have collected enough bugfixes for 2.3.2" and go into doing a release candidate for 2.3.2?

Typically, I wait till .2 patchset release to upgrade a big project like ZFS. So, just curious as to how far along is it. You don't have to answer if its too much bother.

Signed-off-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
META file and changelog updated.

Signed-off-by: Tony Hutter <[email protected]>
@tonyhutter
Copy link
Contributor Author

@devsk there's no hard and fast rule. Typically these patchset PRs are out for about 3 weeks to get feedback and gather additional last-minute patches. I don't have anything else in mind to pull in for this one. We will need to build the new zfs-release RPMs for Fedora 42 though, which is on my TODO list.

@tonyhutter
Copy link
Contributor Author

@shodanshok #17255 has not been merged yet, so no plans for inclusion at this point.

@no-usernames-left
Copy link

@tonyhutter How come the betas are not used to verify compatibility, instead waiting until release to take action?

@ciphorg
Copy link

ciphorg commented Apr 24, 2025

I admit to some naivety here so please bare with. I am mostly interested in your thoughts, so that I may better understand the processes here. If this isn't the forum for it (it likely isn't), just let me know. ( <3 )

Would it be possible to split the patch sets into more normalized sets? One for general ZFS patches, one for Linux current release, and another prepping for Linux next (or a variation thereof)? Would this effect your current branching, tagging, and build strategy? I think it may lead to a higher release cadence, but one where linux compat PR's could be merged and released in closer cadence with linux current release.

I am one of the foolish who runs root-on-enc-zfs on a rolling distro (arch), with a systemd initramfs... as my daily driver. I don't really mind being behind for long, I just keep a handy lts kernel ready incase I botch an update. I'm comfortable playing with DKMS and getting zfs-current to compile for myself, but I rather enjoy knowing the ZFS releases are well tested and version tracked with my package manager.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.