Why ZIL replay hole blkptr will zero data? #17899

gnlwlb-cmyk · 2025-11-05T07:27:06Z

I ran into a case where the block pointer in a ZIL record is a hole.
The dataset was set to sync=always and logbias=throughput, with a separate log (SLOG) device on a dedicated disk. While running a mixed read/write test with vdbench, the machine lost power and the following sequence occurred:
Read request #1 failed and triggered vdev_probe, which set: vd->vdev_cant_write |= !vps->vps_writeable;
Write request #2 failed in metaslab_alloc because vd->vdev_cant_write was set, resulting in the blkptr within the ZIL record becoming a hole.
The parent ZIO of the ZIL write (associated with request #2) completed successfully.
After the machine recovered, ZIL replay started during zpool import and hit the ZIL record with the hole. The code below then wrote zeros in place of the data, causing data loss:
static int
zil_read_log_data(zilog_t *zilog, const lr_write_t *lr, void *wbuf)
{
    zio_flag_t zio_flags = ZIO_FLAG_CANFAIL;
    const blkptr_t *bp = &lr->lr_blkptr;
    arc_flags_t aflags = ARC_FLAG_WAIT;
    arc_buf_t *abuf = NULL;
    zbookmark_phys_t zb;
    int error;

    if (BP_IS_HOLE(bp)) {
        if (wbuf != NULL)
            memset(wbuf, 0, MAX(BP_GET_LSIZE(bp), lr->lr_length));
        return (0);
    }
    // ... (other code)

}
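
To make the concern concrete, here is a minimal standalone sketch of that hole branch. It is not the OpenZFS code: `toy_bp_t`, `toy_lr_write_t`, `toy_bp_is_hole` and `toy_read_log_data` are hypothetical stand-ins, and the sketch assumes a hole is recognised purely by an all-zero first DVA. Under that assumption, a record whose block pointer was zeroed and never filled in looks exactly like a genuine hole, so replay hands back zeros.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-ins for blkptr_t and lr_write_t. */
typedef struct toy_bp {
    uint64_t dva_word[2];   /* first DVA only; all zero is treated as a hole */
    uint64_t lsize;         /* logical size in bytes */
} toy_bp_t;

typedef struct toy_lr_write {
    uint64_t lr_length;     /* bytes covered by the write record */
    toy_bp_t lr_blkptr;     /* block pointer embedded in the record */
} toy_lr_write_t;

/* Assumption: a hole is a block pointer whose first DVA is all zero. */
static int
toy_bp_is_hole(const toy_bp_t *bp)
{
    return (bp->dva_word[0] == 0 && bp->dva_word[1] == 0);
}

/*
 * Models only the hole branch quoted above.
 * Returns 0 when the record is handled as a hole (wbuf is zero-filled),
 * or -1 when a real read would be needed, which this sketch does not model.
 */
static int
toy_read_log_data(const toy_lr_write_t *lr, void *wbuf)
{
    const toy_bp_t *bp = &lr->lr_blkptr;

    if (toy_bp_is_hole(bp)) {
        if (wbuf != NULL) {
            uint64_t n = bp->lsize > lr->lr_length ?
                bp->lsize : lr->lr_length;
            memset(wbuf, 0, n);
        }
        return (0);
    }
    return (-1);
}
```

Nothing in the record itself tells the toy replay function whether the zeros were intended, which is the ambiguity the question is about.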

Is this a bug?

amotin (Collaborator) · 2025-11-05T19:56:28Z

I guess a hole block pointer might be the result of a zero-compressed block, in which case I'd also replay it by writing zeroes. Though I don't understand the part about "resulting in the blkptr within the ZIL record becoming a hole".

gnlwlb-cmyk Nov 6, 2025
Author

Well, it may not be a zero-compressed block. Let me describe the process again so you can check it for flaws.
In the dmu_sync path, zio->io_bp points at the lr_blkptr field inside the lr_write_t structure. The code is as follows:

int
zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
    struct lwb *lwb, zio_t *zio)
{
    ...
    blkptr_t *bp = &lr->lr_blkptr;
    zgd->zgd_bp = bp;

    ASSERT3U(dbp->db_offset, ==, offset);
    ASSERT3U(dbp->db_size, ==, size);

    error = dmu_sync(zio, lr->lr_common.lrc_txg,
        zfs_get_done, zgd);
    ...
}

Inside zio_write_compress, the contents of zio->io_bp (which points to lr->lr_blkptr) are zeroed by BP_ZERO. The code is as follows:

static zio_t *
zio_write_compress(zio_t *zio)
{
    ...
    if (!BP_IS_HOLE(bp) && BP_GET_LOGICAL_BIRTH(bp) == zio->io_txg &&
        BP_GET_PSIZE(bp) == psize &&
        pass >= zfs_sync_pass_rewrite) {
        VERIFY3U(psize, !=, 0);
        enum zio_stage gang_stages = zio->io_pipeline & ZIO_GANG_STAGES;

        zio->io_pipeline = ZIO_REWRITE_PIPELINE | gang_stages;
        zio->io_flags |= ZIO_FLAG_IO_REWRITE;
    } else {
        BP_ZERO(bp); // This sets the blkptr pointed to by zio->io_bp to zero
        zio->io_pipeline = ZIO_WRITE_PIPELINE;
    }
    ...
}
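
The aliasing described in these two steps can be shown in isolation. This is only a sketch with hypothetical types and helpers (`toy_zio_t`, `toy_attach_bp`, `toy_write_compress`), not the real zio pipeline; it assumes nothing beyond what the excerpts above show, namely that the write's io_bp is the address of the block pointer embedded in the log record and that the non-rewrite path zeroes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplified types; field layouts are not the real ones. */
typedef struct toy_bp {
    uint64_t word[4];
} toy_bp_t;

typedef struct toy_lr_write {
    uint64_t lr_length;
    toy_bp_t lr_blkptr;     /* lives inside the log record itself */
} toy_lr_write_t;

typedef struct toy_zio {
    toy_bp_t *io_bp;        /* pointer, not a private copy */
} toy_zio_t;

/* Stand-in for BP_ZERO: clear the whole block pointer. */
static void
toy_bp_zero(toy_bp_t *bp)
{
    memset(bp, 0, sizeof (*bp));
}

static int
toy_bp_is_zero(const toy_bp_t *bp)
{
    for (size_t i = 0; i < sizeof (bp->word) / sizeof (bp->word[0]); i++) {
        if (bp->word[i] != 0)
            return (0);
    }
    return (1);
}

/* Mirrors "bp = &lr->lr_blkptr; zgd->zgd_bp = bp;" from zfs_get_data. */
static void
toy_attach_bp(toy_zio_t *zio, toy_lr_write_t *lr)
{
    zio->io_bp = &lr->lr_blkptr;
}

/* Mirrors only the else branch of zio_write_compress shown above. */
static void
toy_write_compress(toy_zio_t *zio)
{
    toy_bp_zero(zio->io_bp);
}
```

Because io_bp is a pointer into the record, zeroing it changes the bytes that the log write will later carry to disk, unless something fills the block pointer in again first.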

In metaslab_alloc_dva, vdev_allocatable(vd) returns false, so the allocation fails and returns ENOSPC. At this point the block pointer that zio->io_bp points to (that is, lr_blkptr in the lr_write_t) is all zeros. The code is as follows:

int
metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
    dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags,
    zio_alloc_list_t *zal, int allocator)
{
    ...
    if (try_hard) {
        spa_config_enter(spa, SCL_ZIO, FTAG, RW_READER);
        allocatable = vdev_allocatable(vd);
        spa_config_exit(spa, SCL_ZIO, FTAG);
    } else {
        allocatable = vdev_allocatable(vd);
    }
    ...
    if (!allocatable) {
        metaslab_trace_add(zal, mg, NULL, psize, d,
            TRACE_NOT_ALLOCATABLE, allocator);
        goto next;
    }
    ...
    return (SET_ERROR(ENOSPC));
}

The zio issued by dmu_sync fails, but its parent I/O, the ZIL's lwb_write_zio (which carries the ZIL record whose lr_blkptr is zero), is written to disk successfully.

In this scenario, a subsequent zpool import replays the record whose lr_blkptr is zero, and the data is zeroed out.

Finally, why does vdev_allocatable(vd) return false in metaslab_alloc? This is a power-loss scenario: a read I/O failed, which called vdev_probe, which in turn set vd->vdev_cant_write to true.
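
A small model of that chain, for illustration only. The types and functions here (`toy_vdev_t`, `toy_probe_done`, `toy_vdev_allocatable`, `toy_alloc`) are hypothetical, and the sketch assumes vdev_cant_write is the only input to allocatability, which is a simplification of the real vdev_allocatable check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified probe result and vdev state. */
typedef struct toy_probe_stats {
    bool vps_writeable;
} toy_probe_stats_t;

typedef struct toy_vdev {
    bool vdev_cant_write;
} toy_vdev_t;

/* Same update as quoted in the original post; note that |= never clears. */
static void
toy_probe_done(toy_vdev_t *vd, const toy_probe_stats_t *vps)
{
    vd->vdev_cant_write |= !vps->vps_writeable;
}

/* Assumption: allocatable if and only if the vdev can be written. */
static bool
toy_vdev_allocatable(const toy_vdev_t *vd)
{
    return (!vd->vdev_cant_write);
}

/*
 * Walk the candidate vdevs and skip any that are not allocatable.
 * Returns 0 if one could be used, ENOSPC if every candidate was skipped.
 */
static int
toy_alloc(const toy_vdev_t *vds, size_t nvdevs)
{
    for (size_t i = 0; i < nvdevs; i++) {
        if (toy_vdev_allocatable(&vds[i]))
            return (0);
    }
    return (ENOSPC);
}
```

In this model a single failed probe is enough to turn every later allocation attempt on that vdev into ENOSPC, even though the device is not actually full.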

I can reproduce this scenario reliably in testing. Could this be a design flaw?
