
ompio produces adverse, conflicting access patterns for some collective writes #13376


Description

@carns

We have observed some highly unusual access patterns being generated by collective write operations in ompio. Many thanks to @wkliao for helping to narrow down how to reproduce the behavior.

The following test program can be used to help illustrate the problem:
https://gist.github.com/carns/b8242d706ad9a9b410016c99e170c696

The test program should be executed with 4 processes. Each process writes a different (but adjacent) 100-byte region of a file using MPI_File_write_at_all(). Command-line arguments specify the file to write and the overall starting offset. cb_nodes=4 is set using an MPI_Info hint to force ompio to use multiple aggregators even when this program is executed on a single workstation.
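For reference, the following is a minimal sketch of what the reproducer does (the linked gist is the authoritative version; error handling and buffer contents are simplified here):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK 100

int main(int argc, char **argv)
{
    int rank;
    char buf[BLOCK];
    MPI_File fh;
    MPI_Info info;
    MPI_Offset base;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    base = (MPI_Offset)strtoll(argv[2], NULL, 10);   /* overall starting offset */
    memset(buf, 'a' + rank, BLOCK);

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "4");             /* force 4 aggregators */

    MPI_File_open(MPI_COMM_WORLD, argv[1],
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* each rank collectively writes its own adjacent 100-byte region */
    MPI_File_write_at_all(fh, base + (MPI_Offset)rank * BLOCK,
                          buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}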

The following scenarios were executed on an Ubuntu 25.04 laptop with an ext4 file system. OpenMPI was installed from git origin/main (6ec63e2) with the --enable-debug option. Extra printfs were added to the code as described below in some cases to better understand behavior.

Scenario 1 (collective write produces a suboptimal but understandable access pattern)

If the example program is executed with a 0-byte offset (mpiexec -n 4 ./ompio-access-pattern foo.dat 0) then it produces the following access pattern to local disk, as captured using the Darshan profiling tool with DXT tracing mode:

 # Module    Rank  Wt/Rd  Segment          Offset       Length    Start(s)      End(s)
 X_POSIX       0  write        0               0             100      0.0071      0.0072   
 X_POSIX       1  write        0             100             100      0.0071      0.0071   
 X_POSIX       2  write        0             200             100      0.0071      0.0071   
 X_POSIX       3  write        0             300             100      0.0071      0.0072

It is surprising that ompio elected to use all 4 available aggregators to write the file when there were only 400 total bytes of data, but the pattern is understandable. Each rank wrote exactly its portion of the file, as if the writes had been issued as independent operations.

Scenario 2 (collective write produces an unusual conflicting access pattern)

If the example program is executed with a 1-byte offset (mpiexec -n 4 ./ompio-access-pattern foo.dat 1) then the access pattern reported by Darshan DXT begins to look strange:

 # Module    Rank  Wt/Rd  Segment          Offset       Length    Start(s)      End(s)
 X_POSIX       1  write        0             100             100      0.0066      0.0070   
 X_POSIX       2  write        0             200             100      0.0066      0.0071   
 X_POSIX       3  write        0             300             100      0.0066      0.0075 
 X_POSIX       0   read        0               1             399      0.0076      0.0076   
 X_POSIX       0   read        1               1             399      0.0076      0.0077   
 X_POSIX       0  write        0               1             400      0.0077      0.0077   

The resulting output file is technically correct (the contents are intact), but it was written using a combination of reads and writes: rank 0 apparently performed a read/modify/write over segments of the file that were also written by other ranks. In other words, ompio introduced conflicting writes where none were present at the application level.

I modified the OpenMPI source code as follows to get a better understanding of what it was doing at the fbtl level:

--- a/ompi/mca/fbtl/posix/fbtl_posix_pwritev.c
+++ b/ompi/mca/fbtl/posix/fbtl_posix_pwritev.c
@@ -46,7 +46,15 @@ ssize_t  mca_fbtl_posix_pwritev(ompio_file_t *fh )
     ssize_t bytes_written=0;
     struct flock lock;
     int lock_counter=0;
-    
+
+    printf("mca_fbtl_posix_pwritev() on rank %d writing %d extents in \"%s\":\n", fh->f_rank, fh->f_num_of_io_entries, fh->f_filename);
+    int x;
+    for(x=0; x<fh->f_num_of_io_entries; x++) {
+        printf("\t[%lu - %lu]\n", (long unsigned)fh->f_io_array[x].offset,
+            (long unsigned)(fh->f_io_array[x].offset + fh->f_io_array[x].length  - 1));
+    }
+
+
     if (NULL == fh->f_io_array) {
         return OMPI_ERROR;
     }
@@ -79,6 +87,7 @@ ssize_t  mca_fbtl_posix_pwritev(ompio_file_t *fh )
             do_data_sieving = false;
         }
                 
+        printf("\t(do_data_sieving = %d)\n", do_data_sieving);
         if ( do_data_sieving) {
             bytes_written = mca_fbtl_posix_pwritev_datasieving (fh, &lock, &lock_counter);
         }

This produced the following output for the 1-byte offset example:

mca_fbtl_posix_pwritev() on rank 0 writing 2 extents in "foo.dat":
        [1 - 99]
        [400 - 400]
        (do_data_sieving = 1)
mca_fbtl_posix_pwritev() on rank 1 writing 1 extents in "foo.dat":
        [100 - 199]
mca_fbtl_posix_pwritev() on rank 2 writing 1 extents in "foo.dat":
        [200 - 299]
mca_fbtl_posix_pwritev() on rank 3 writing 1 extents in "foo.dat":
        [300 - 399]

This reveals that the data was distributed such that each process was responsible for 100 bytes, but rank 0 was responsible for two distinct segments: 99 bytes starting at offset 1 and 1 byte starting at offset 400. mca_fbtl_posix_pwritev() interpreted this as an access pattern that could be optimized via data sieving and translated it into a read/modify/write. That explains the access pattern observed by Darshan DXT.
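For clarity, here is a rough sketch (not the actual ompio implementation) of what the data-sieving path effectively does with rank 0's two extents: read a covering byte range, patch in the new data, and write the whole range back, which is the read/modify/write sequence that DXT recorded:

/* Hedged sketch of the data-sieving shape for rank 0's extents
 * [1-99] and [400-400]; the real code lives in fbtl_posix_pwritev.c. */
#include <string.h>
#include <unistd.h>

static void sieved_write_sketch(int fd,
                                const char *seg1, /* 99 bytes for [1-99]   */
                                const char *seg2) /*  1 byte  for [400-400] */
{
    char cover[400];                         /* covering extent: bytes 1..400 */

    pread(fd, cover, sizeof(cover), 1);      /* read the covering range       */
    memcpy(cover + 0,   seg1, 99);           /* patch in [1-99]               */
    memcpy(cover + 399, seg2, 1);            /* patch in [400-400]            */
    pwrite(fd, cover, sizeof(cover), 1);     /* rewrite bytes that ranks 1-3  */
                                             /* also wrote                    */
}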

This behavior is not specific to the access sizes, offsets, or number of ranks that I am using in this example. This is just a reproducer meant to illustrate the problem with a small, simple scenario.

It should also be noted that both scenarios generate fcntl advisory locks, which can be observed using strace (Darshan does not record these). In scenario 1 these operations are unnecessary, but they do not impact correctness. In scenario 2 they are necessary (because ompio is introducing conflicting writes). In OpenMPI <= 5.0.5 this frequently causes data corruption, presumably because of the problem fixed in #12759.
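For context, the advisory locking visible in strace follows the standard fcntl() byte-range locking pattern, roughly as sketched below (this mirrors the general shape, not the exact ompio call sites):

/* Hedged sketch: take an exclusive advisory lock over the sieved range
 * before the read/modify/write, so conflicting writers are serialized. */
#include <fcntl.h>

static int lock_range(int fd, off_t start, off_t len)
{
    struct flock lock;

    lock.l_type   = F_WRLCK;              /* exclusive write lock            */
    lock.l_whence = SEEK_SET;
    lock.l_start  = start;                /* e.g. offset 1                   */
    lock.l_len    = len;                  /* e.g. 400 bytes covering the RMW */

    return fcntl(fd, F_SETLKW, &lock);    /* block until the lock is granted */
}

The matching unlock simply repeats the call with l_type set to F_UNLCK after the write completes.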

Even though the data is no longer corrupted when using newer versions of OpenMPI (or origin/main as in this walkthrough), this is undoubtedly a poor access pattern. The collective write is split into very small regions, some regions are split into discontiguous extents, and automatic data sieving turns this into conflicting write operations that must be serialized with advisory locks.

Is there any way to avoid or improve this behavior? Is it intentional?
