Offline RL? #9

@artbelyaev0

Description

I have read through your paper carefully, but I don't understand how CtRL-Sim was trained; maybe I just missed something :)
I have a question:

Is it correct that you pretrained the CtRL-Sim model and then fine-tuned it with PPO via online RL (sampling trajectories in GPUDrive with the policy as it gets updated)?

If not, what did you use as the old policy in PPO?
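For context on what I mean by "old policy": in standard PPO (independent of how CtRL-Sim specifically handles it), the old policy is just a frozen snapshot of the current weights, captured before each round of gradient steps, that generated the sampled trajectories and supplies the denominator of the importance ratio. A minimal sketch, with illustrative names not taken from the CtRL-Sim code:

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (to be maximized).

    `logp_old` holds log-probs of the taken actions under the frozen
    snapshot ("old policy") -- not a separate model, just the weights
    the policy had when the trajectories were sampled.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Right after the snapshot, new and old policies coincide, so the
# ratio is 1 and the objective is just the mean advantage.
print(ppo_clipped_objective([0.0, 0.0], [0.0, 0.0], [1.0, 2.0]))  # 1.5
```

So my question is whether the pretrained CtRL-Sim weights serve as that initial snapshot, or whether something else plays the role of the old policy.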
