Separate type-information derivation into auto and semiauto #282

@gaelrenoux-datadome

Before I start explaining: I'm willing to work on the PR if you're interested, but I thought it better to discuss it with you first :-)

So, we're using flink-scala-api for type-information (I work with @arnaud-daroussin). One thing we've noticed is that using it "as intended" (by just importing org.apache.flinkx.api.serializers._ everywhere) leads to very high compilation times. With the old Flink API, a full clean compile took around 160 seconds; with flink-scala-api it went up to 200 seconds. However, we managed to cut a lot of that by using semi-auto derivation instead of full-auto derivation: we're now down to 140 seconds, even less than before the migration.

I'm not sure how familiar you are with semi-auto vs full-auto derivation? The idea is that instead of importing the macro everywhere, we declare implicit TypeInformation vals in the companion objects of all classes, and they're found automatically (hence semi-auto: declared manually, but found automatically). In addition to faster compile times, semi-auto also has the advantage of letting us create custom TypeInformations for certain classes where the macro would have worked, but wouldn't have been as optimized for runtime performance. In short, you trade convenience for control.

So for example, instead of:

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flinkx.api.serializers._

final case class Alert(message: String)

final case class Notification(alerts: List[Alert])

object Job {
  val info = implicitly[TypeInformation[Notification]]
}

We have:

import org.apache.flink.api.common.typeinfo.TypeInformation
// Don't import deriveTypeInformation
import org.apache.flinkx.api.serializers.{deriveTypeInformation => _, _}

final case class Alert(message: String)

object Alert {
  implicit val alertInfo: TypeInformation[Alert] = org.apache.flinkx.api.serializers.deriveTypeInformation
}

final case class Notification(alerts: List[Alert])

object Notification {
  implicit val notificationInfo: TypeInformation[Notification] = ??? // placeholder: some custom stuff goes here
}

object Job {
  val info = implicitly[TypeInformation[Notification]]
}

The issue is that flink-scala-api doesn't really support semi-auto derivation natively.

So, we had to jump through some hoops. As you can see, we have to be careful to never import deriveTypeInformation, because as an implicit already in scope it would take priority over the one in the entity's companion object. That's very error-prone: the mistake is easy to miss (we made it a few times), because when you make it, everything still seems to work "mostly" fine. So instead, we just created our own class that copied everything from org.apache.flinkx.api.serializers except deriveTypeInformation.
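To illustrate the priority problem, here is a minimal, self-contained sketch. It does not use the real library: `TypeInfo` is a hypothetical stand-in for `TypeInformation`, and the `serializers` object below only mimics the shape of an implicit `deriveTypeInformation` (no macro involved). It shows that an implicit already in lexical scope is picked before the companion object is even consulted.

```scala
// Hypothetical stand-in for Flink's TypeInformation, so the sketch compiles on its own.
trait TypeInfo[A] { def describe: String }

// Mimics a package exposing an implicit, catch-all derivation (assumption: no real macro here).
object serializers {
  implicit def deriveTypeInformation[A]: TypeInfo[A] =
    new TypeInfo[A] { val describe = "derived" }
}

final case class Alert(message: String)

object Alert {
  // The hand-written instance we would like to be used everywhere.
  implicit val alertInfo: TypeInfo[Alert] =
    new TypeInfo[Alert] { val describe = "custom" }
}

object WithImport {
  import serializers._
  // The imported implicit is eligible in lexical scope, so the companion is never searched.
  val info: String = implicitly[TypeInfo[Alert]].describe
}

object WithoutImport {
  // Nothing eligible in lexical scope, so the compiler falls back to Alert's companion.
  val info: String = implicitly[TypeInfo[Alert]].describe
}
```

Both objects compile without warnings, which is why the mistake is so easy to miss: only the resolved instance differs.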

Another issue is that a missing type-information goes unnoticed, because deriveTypeInformation ends up calling itself if necessary. So for example, this shouldn't compile in semi-auto, but it does:

import org.apache.flink.api.common.typeinfo.TypeInformation
// Don't import deriveTypeInformation
import org.apache.flinkx.api.serializers.{deriveTypeInformation => _, _}

final case class Alert(message: String)

object Alert {
  // No TypeInformation declared
}

final case class Notification(alerts: List[Alert])

object Notification {
  // note that deriveTypeInformation is not in the implicit context, we call it by its full name
  // so it shouldn't find a way to get a TypeInformation[Alert]
  implicit val notificationInfo: TypeInformation[Notification] = org.apache.flinkx.api.serializers.deriveTypeInformation
}

object Job {
  val info = implicitly[TypeInformation[Notification]]
}

OK, that was a wall of text, sorry 😅

So: what do you think about supporting both auto and semi-auto derivation?

That's something projects like Circe are already doing. The idea would be to have two separate packages for the derivation of serializers and type-informations, called auto and semiauto. The generic type-informations (for stuff like Option, List, etc.) would be in a parent trait, inherited by both auto and semiauto, and the macro would be the only difference between the two. Note that with semi-auto derivation the cache is not necessary, because the declared type-information vals already do that job.
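A rough sketch of the layout I have in mind, under heavy assumptions: `TypeInfo`, `CommonInstances`, and the two-argument `deriveTypeInformation` below are all hypothetical stand-ins so that the example compiles without Flink or a macro. The real version would keep the macro and `TypeInformation`; this stand-in only handles single-field case classes.

```scala
// Hypothetical stand-in for Flink's TypeInformation.
trait TypeInfo[A] { def describe: String }

object TypeInfo {
  def instance[A](d: String): TypeInfo[A] = new TypeInfo[A] { val describe: String = d }
}

// Generic instances shared by both modes (only String and List shown here).
trait CommonInstances {
  implicit val stringInfo: TypeInfo[String] = TypeInfo.instance("String")
  implicit def listInfo[A](implicit a: TypeInfo[A]): TypeInfo[List[A]] =
    TypeInfo.instance(s"List[${a.describe}]")
}

// semiauto: derivation is an explicit, non-implicit call.
// An `auto` object would extend the same trait and expose the derivation as an implicit instead.
object semiauto extends CommonInstances {
  // Stand-in for the macro: the field's instance must already be resolvable implicitly.
  def deriveTypeInformation[A, F](name: String)(implicit field: TypeInfo[F]): TypeInfo[A] =
    TypeInfo.instance(s"$name(${field.describe})")
}

final case class Alert(message: String)

object Alert {
  import semiauto._
  implicit val alertInfo: TypeInfo[Alert] = deriveTypeInformation[Alert, String]("Alert")
}

final case class Notification(alerts: List[Alert])

object Notification {
  import semiauto._
  // Resolves listInfo from the import, then Alert.alertInfo from Alert's companion.
  // If Alert.alertInfo were removed, this line would stop compiling.
  implicit val notificationInfo: TypeInfo[Notification] =
    deriveTypeInformation[Notification, List[Alert]]("Notification")
}

object Job {
  val info: TypeInfo[Notification] = implicitly[TypeInfo[Notification]]
}
```

With this shape, forgetting to declare an instance for Alert becomes a compile error rather than a silent fallback to the macro, which is the behaviour we're after in semi-auto mode.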
