
ompio produces adverse, conflicting access patterns for some collective writes #13376


Description

@carns

We have observed some highly unusual access patterns being generated by collective write operations in ompio. Many thanks to @wkliao for helping to narrow down how to reproduce the behavior.

The following test program can be used to help illustrate the problem:
https://gist.github.com/carns/b8242d706ad9a9b410016c99e170c696

The test program should be executed with 4 processes. Each process writes a different (but adjacent) 100-byte region of a file using MPI_File_write_at_all(). Command-line arguments specify the file to write and the overall starting offset. cb_nodes=4 is set using an MPI_Info hint to force ompio to use multiple aggregators even when this program is executed on a single workstation.
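For reference, the following is a minimal sketch of what the reproducer does (the linked gist is the authoritative version; error handling and buffer contents are simplified here):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK 100

int main(int argc, char **argv)
{
    int rank;
    char buf[BLOCK];
    MPI_File fh;
    MPI_Info info;
    MPI_Offset base;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    base = (MPI_Offset)strtoll(argv[2], NULL, 10);   /* overall starting offset */
    memset(buf, 'a' + rank, BLOCK);

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "4");             /* force 4 aggregators */

    MPI_File_open(MPI_COMM_WORLD, argv[1],
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* each rank collectively writes its own adjacent 100-byte region */
    MPI_File_write_at_all(fh, base + (MPI_Offset)rank * BLOCK,
                          buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}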

The following scenarios were executed on an Ubuntu 25.04 laptop with an ext4 file system. OpenMPI was installed from git origin/main (6ec63e2) with the --enable-debug option. Extra printfs were added to the code as described below in some cases to better understand behavior.

Scenario 1 (collective write produces a suboptimal but understandable access pattern)

If the example program is executed with a 0-byte offset (mpiexec -n 4 ./ompio-access-pattern foo.dat 0) then it produces the following access pattern to local disk, as captured using the Darshan profiling tool with DXT tracing mode:

 # Module    Rank  Wt/Rd  Segment          Offset       Length    Start(s)      End(s)
 X_POSIX       0  write        0               0             100      0.0071      0.0072   
 X_POSIX       1  write        0             100             100      0.0071      0.0071   
 X_POSIX       2  write        0             200             100      0.0071      0.0071   
 X_POSIX       3  write        0             300             100      0.0071      0.0072

It is surprising that ompio elected to use all 4 available aggregators to write the file when there were only 400 total bytes of data, but the pattern is understandable. Each rank wrote exactly its portion of the file, as if the writes had been issued as independent operations.

Scenario 2 (collective write produces an unusual conflicting access pattern)

If the example program is executed with a 1-byte offset (mpiexec -n 4 ./ompio-access-pattern foo.dat 1) then the access pattern reported by Darshan DXT begins to look strange:

 # Module    Rank  Wt/Rd  Segment          Offset       Length    Start(s)      End(s)
 X_POSIX       1  write        0             100             100      0.0066      0.0070   
 X_POSIX       2  write        0             200             100      0.0066      0.0071   
 X_POSIX       3  write        0             300             100      0.0066      0.0075 
 X_POSIX       0   read        0               1             399      0.0076      0.0076   
 X_POSIX       0   read        1               1             399      0.0076      0.0077   
 X_POSIX       0  write        0               1             400      0.0077      0.0077   

The resulting output file is technically correct (the contents are intact), but it was written using a combination of reads and writes: rank 0 apparently performed a read/modify/write over segments of the file that were also written by other ranks. In other words, ompio introduced conflicting writes where none were present at the application level.

I modified the OpenMPI source code as follows to get a better understanding of what it was doing at the fbtl level:

--- a/ompi/mca/fbtl/posix/fbtl_posix_pwritev.c
+++ b/ompi/mca/fbtl/posix/fbtl_posix_pwritev.c
@@ -46,7 +46,15 @@ ssize_t  mca_fbtl_posix_pwritev(ompio_file_t *fh )
     ssize_t bytes_written=0;
     struct flock lock;
     int lock_counter=0;
-    
+
+    printf("mca_fbtl_posix_pwritev() on rank %d writing %d extents in \"%s\":\n", fh->f_rank, fh->f_num_of_io_entries, fh->f_filename);
+    int x;
+    for(x=0; x<fh->f_num_of_io_entries; x++) {
+        printf("\t[%lu - %lu]\n", (long unsigned)fh->f_io_array[x].offset,
+            (long unsigned)(fh->f_io_array[x].offset + fh->f_io_array[x].length  - 1));
+    }
+
+
     if (NULL == fh->f_io_array) {
         return OMPI_ERROR;
     }
@@ -79,6 +87,7 @@ ssize_t  mca_fbtl_posix_pwritev(ompio_file_t *fh )
             do_data_sieving = false;
         }
                 
+        printf("\t(do_data_sieving = %d)\n", do_data_sieving);
         if ( do_data_sieving) {
             bytes_written = mca_fbtl_posix_pwritev_datasieving (fh, &lock, &lock_counter);
         }

This produced the following output for the 1-byte offset example:

mca_fbtl_posix_pwritev() on rank 0 writing 2 extents in "foo.dat":
        [1 - 99]
        [400 - 400]
        (do_data_sieving = 1)
mca_fbtl_posix_pwritev() on rank 1 writing 1 extents in "foo.dat":
        [100 - 199]
mca_fbtl_posix_pwritev() on rank 2 writing 1 extents in "foo.dat":
        [200 - 299]
mca_fbtl_posix_pwritev() on rank 3 writing 1 extents in "foo.dat":
        [300 - 399]

This reveals that the data was distributed such that each process was responsible for 100 bytes, but rank 0 was responsible for two distinct segments: 99 bytes starting at offset 1 and 1 byte starting at offset 400. mca_fbtl_posix_pwritev() interpreted this as an access pattern that could be optimized via data sieving and translated it into a read/modify/write. That explains the access pattern observed by Darshan DXT.
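For clarity, here is a rough sketch (not the actual ompio implementation) of what the data-sieving path effectively does with rank 0's two extents: read a covering byte range, patch in the new data, and write the whole range back, which is the read/modify/write sequence that DXT recorded:

/* Hedged sketch of the data-sieving shape for rank 0's extents
 * [1-99] and [400-400]; the real code lives in fbtl_posix_pwritev.c. */
#include <string.h>
#include <unistd.h>

static void sieved_write_sketch(int fd,
                                const char *seg1, /* 99 bytes for [1-99]   */
                                const char *seg2) /*  1 byte  for [400-400] */
{
    char cover[400];                         /* covering extent: bytes 1..400 */

    pread(fd, cover, sizeof(cover), 1);      /* read the covering range       */
    memcpy(cover + 0,   seg1, 99);           /* patch in [1-99]               */
    memcpy(cover + 399, seg2, 1);            /* patch in [400-400]            */
    pwrite(fd, cover, sizeof(cover), 1);     /* rewrite bytes that ranks 1-3  */
                                             /* also wrote                    */
}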

This behavior is not specific to the access sizes, offsets, or number of ranks that I am using in this example. This is just a reproducer meant to illustrate the problem with a small, simple scenario.

It should also be noted that both scenarios generate fcntl advisory locks, which can be observed using strace (Darshan does not record these). In scenario 1 these operations are unnecessary, but they do not impact correctness. In scenario 2 they are necessary (because ompio is introducing conflicting writes). In OpenMPI <= 5.0.5 this frequently causes data corruption, presumably because of the problem fixed in #12759.
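For context, the advisory locking visible in strace follows the standard fcntl() byte-range locking pattern, roughly as sketched below (this mirrors the general shape, not the exact ompio call sites):

/* Hedged sketch: take an exclusive advisory lock over the sieved range
 * before the read/modify/write, so conflicting writers are serialized. */
#include <fcntl.h>

static int lock_range(int fd, off_t start, off_t len)
{
    struct flock lock;

    lock.l_type   = F_WRLCK;              /* exclusive write lock            */
    lock.l_whence = SEEK_SET;
    lock.l_start  = start;                /* e.g. offset 1                   */
    lock.l_len    = len;                  /* e.g. 400 bytes covering the RMW */

    return fcntl(fd, F_SETLKW, &lock);    /* block until the lock is granted */
}

The matching unlock simply repeats the call with l_type set to F_UNLCK after the write completes.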

Even though the data is no longer corrupted when using newer versions of OpenMPI (or origin/main as in this walkthrough), this is undoubtedly a poor access pattern. The collective write is split into very small regions, some regions are split into discontiguous extents, and automatic data sieving turns this into conflicting write operations that must be serialized with advisory locks.

Is there any way to avoid or improve this behavior? Is it intentional?
