Async UART flush takes an additional ~15 µs (#4483)

## Bug description

I am experimenting with an async RS485 interface on the ESP32-S3 and have noticed a large timing difference between the sync `Uart` and the async `Uart`. I am using the `esp_rtos` embassy `InterruptExecutor` with software interrupt 2, running on the second core.

I am communicating with a motor that has incredibly tight timings: it replies on the RS485 bus just 2 or 3 µs after the last TX byte is sent.
If the DIR pin isn't set low immediately after the UART has finished sending, the first few bytes of the reply are dropped.

In the oscilloscope screen grab, the yellow trace is RS485 A, the pink trace is DIR, and the blue trace is the UART RX. You can see that the first byte is missed because the DIR pin is not lowered until 15 µs after the UART has finished transmitting.
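
As a rough check of why a 15 µs delay costs the first byte, the sketch below works out the bit time at a given baud rate and how many bit periods of the reply have already passed by the time DIR is released. The function names are hypothetical and the code runs on a host rather than on the ESP32-S3; the 3 µs reply delay is simply the upper figure quoted above.

```rust
/// Duration of a single bit on the wire, in nanoseconds.
fn bit_time_ns(baud: u32) -> u64 {
    1_000_000_000 / baud as u64
}

/// Whole bit periods of the reply that elapse before DIR is released.
///
/// `reply_delay_ns`: time from the end of TX until the motor starts replying.
/// `dir_release_ns`: time from the end of TX until DIR goes low.
fn reply_bits_before_release(baud: u32, reply_delay_ns: u64, dir_release_ns: u64) -> u64 {
    // If DIR is released before the reply starts, nothing is lost.
    dir_release_ns.saturating_sub(reply_delay_ns) / bit_time_ns(baud)
}
```

At 500 000 baud, `bit_time_ns` gives 2000 ns, and `reply_bits_before_release(500_000, 3_000, 15_000)` gives 6. Assuming 8N1 framing (10 bits per byte on the wire), more than half of the first reply frame has gone by before the transceiver starts listening, which matches the missed first byte in the capture.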

I have tested with sync and it works as expected.

![](https://cloudstore.wetaworkshop.co.nz/index.php/s/NWnzTyzfpZmYboZ/download/esp32s3-async-uart.png)

## To Reproduce
```rust
#![no_std]
#![no_main]
#![deny(
    clippy::mem_forget,
    reason = "mem::forget is generally not safe to do with esp_hal types, especially those \
    holding buffers for the duration of a data transfer."
)]

use defmt::*;
use embassy_executor::Spawner;
use embassy_time::{Duration, Timer};
use esp_hal::gpio::AnyPin;
use esp_hal::interrupt::Priority;
use esp_hal::interrupt::software::SoftwareInterruptControl;
use esp_hal::system::{CpuControl, Stack};
use esp_hal::timer::timg::TimerGroup;
use esp_hal::uart;
use esp_hal::{clock::CpuClock, gpio::Output};
use esp_rtos::embassy::InterruptExecutor;
use static_cell::StaticCell;
use {esp_backtrace as _, esp_println as _};

extern crate alloc;

esp_bootloader_esp_idf::esp_app_desc!();

#[embassy_executor::task]
async fn bear_motors(
    u: uart::AnyUart<'static>,
    tx: AnyPin<'static>,
    rx: AnyPin<'static>,
    dir: AnyPin<'static>,
) {
    let config = uart::Config::default().with_baudrate(500_000);
    let mut uart = uart::Uart::new(u, config)
        .expect("uart config failed")
        .with_tx(tx)
        .with_rx(rx)
        .into_async();

    let mut dir = Output::new(dir, esp_hal::gpio::Level::Low, Default::default());

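    const PACKET: &[u8] = &[0xff, 0xff, 1, 2, 1, 0xfb];
    // Receive buffer for the motor's reply. The declaration was missing from the
    // snippet; the 64-byte size is an assumption.
    let mut read_buffer = [0u8; 64];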

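    // `message_transfer_time_embassy` is a helper that is not included in this
    // snippet; it estimates the on-wire time of the packet from the baud rate.
    let wait = message_transfer_time_embassy(PACKET.len() as u32, 500_000);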
    info!("{}", wait.as_micros());
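    loop {
        // Put the RS485 transceiver into transmit mode.
        dir.set_high();
        _ = uart.write_async(PACKET).await.unwrap();
        // Wait for the transmitter to drain before releasing the bus.
        uart.flush_async().await.unwrap();
        // Back to receive mode; this is the edge that arrives ~15 µs late.
        dir.set_low();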
        if let Ok(read) = uart.read_async(&mut read_buffer).await {
            if read_buffer[..read].starts_with(&[0xff, 0xff]) {
                info!("read packet successfully: {:X}", read_buffer[..read]);
            } else {
                warn!("failed to read packet: {:X}", read_buffer[..read]);
            }
        }
        Timer::after(Duration::from_millis(500)).await;
    }
}

#[esp_rtos::main]
async fn main(_spawner: Spawner) -> ! {
    // generator version: 1.0.1

    let config = esp_hal::Config::default().with_cpu_clock(CpuClock::max());
    let peripherals = esp_hal::init(config);

    esp_alloc::heap_allocator!(#[esp_hal::ram(reclaimed)] size: 73744);

    let sw_int = SoftwareInterruptControl::new(peripherals.SW_INTERRUPT);
    let timg0 = TimerGroup::new(peripherals.TIMG0);
    esp_rtos::start(timg0.timer0, sw_int.software_interrupt0);

    let mut cpu_control = CpuControl::new(peripherals.CPU_CTRL);

    static EXECUTOR_CORE_1: StaticCell<InterruptExecutor<2>> = StaticCell::new();

    let executor_core1 = InterruptExecutor::new(sw_int.software_interrupt2);
    let executor_core1 = EXECUTOR_CORE_1.init(executor_core1);

    static APP_CORE_STACK: StaticCell<Stack<8192>> = StaticCell::new();
    let app_core_stack = APP_CORE_STACK.init(Stack::new());

    let u = peripherals.UART1;
    let tx = peripherals.GPIO16;
    let rx = peripherals.GPIO18;
    let dir = peripherals.GPIO15;

    let _guard = cpu_control
        .start_app_core(app_core_stack, move || {
            let spawner = executor_core1.start(Priority::Priority3);
            spawner
                .spawn(bear_motors(u.into(), tx.into(), rx.into(), dir.into()))
                .unwrap();
            loop {}
        })
        .unwrap();
    loop {}
}
```

## Expected behavior

With the sync driver, the DIR pin is lowered as expected, immediately after the last UART byte is sent.
I would expect the async driver to perform similarly when running on a dedicated core with the `InterruptExecutor`.

I have had a look through the async flush function and noticed that `self.flush_last_byte();` is called, which may be unnecessary. If I comment it out, the overshoot drops from ~15.5 µs to ~5.5 µs, which is expected as that function includes a 10 µs delay.

I also tried removing the flush (the pin is then set low >1 µs after it's set high) and using a timer-based approach instead (calculating the time to write the bytes from the baud rate), but the timer interrupt is even more unreliable.
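
For reference, the timer-based estimate boils down to the arithmetic below. This is a minimal host-side sketch, not the helper used in the reproduction (which is not shown): `transfer_time` is a hypothetical name, `std::time::Duration` stands in for `embassy_time::Duration`, and 10 bits per byte assumes 8N1 framing.

```rust
use std::time::Duration;

/// Time needed to shift `bytes` bytes out of the UART at `baud`.
///
/// `bits_per_frame` is the number of bits on the wire per byte,
/// e.g. 10 for 8N1 (1 start + 8 data + 1 stop).
fn transfer_time(bytes: u32, baud: u32, bits_per_frame: u32) -> Duration {
    let bits = bytes as u64 * bits_per_frame as u64;
    // Round up so the estimate never ends before the last bit does.
    let nanos = (bits * 1_000_000_000).div_ceil(baud as u64);
    Duration::from_nanos(nanos)
}
```

For the 6-byte packet at 500 000 baud, `transfer_time(6, 500_000, 10)` comes out at 120 µs. The estimate only covers the time on the wire; any latency in waking the task after the timer fires is added on top of it.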

## Environment

```
Chip type:         esp32s3 (revision v0.2)
Crystal frequency: 40 MHz
Flash size:        16MB
Features:          WiFi, BLE
MAC address:       cc:ba:97:10:50:14
```
Using the main branch of esp_hal and embassy-executor v0.9.1.

