LLM Dataset Implementation #1058

@farook-edev

This issue is a general container for dataset-related matters. Discussions specific to TinyMMLU or IFEval should go in this issue's sub-issues.

Current Status

TinyMMLU

  • Dataset is converted from .parquet to .tfrecord via a utility script.
  • Dataset loads .tfrecord and stores data inside samples.
  • Dataset provides samples by id to driver/backend in proper format.*
  • Dataset processes output from driver/backend.*
  • Dataset calculates and provides accuracy using output data on device.
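
For illustration, the last three TinyMMLU steps (provide a sample by id, process the output, compute accuracy) could be sketched roughly as follows. This is not the code in the repository: `Sample`, `MultipleChoiceDataset`, and their methods are hypothetical names, samples are assumed to be already loaded from the .tfrecord file, and the output is assumed to be already detokenized text whose first character is the chosen answer letter.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    prompt: str   # question text with its answer choices
    answer: str   # expected choice letter, e.g. "B" (assumed format)


@dataclass
class MultipleChoiceDataset:
    """Hypothetical stand-in for the dataset object described above."""
    samples: list[Sample]
    results: dict[int, bool] = field(default_factory=dict)

    def get_sample(self, sample_id: int) -> str:
        # Provide a sample by id to the driver/backend.
        return self.samples[sample_id].prompt

    def process_output(self, sample_id: int, output_text: str) -> None:
        # Assumption: the first non-space character is the chosen letter.
        predicted = output_text.strip()[:1].upper()
        self.results[sample_id] = predicted == self.samples[sample_id].answer

    def accuracy(self) -> float:
        # Fraction of processed samples that were answered correctly.
        if not self.results:
            return 0.0
        return sum(self.results.values()) / len(self.results)
```

Keeping per-sample results keyed by id means a sample that is processed twice is only counted once, which may or may not match how the real dataset handles repeated queries.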

IFEval

  • Dataset is converted from .jsonl to .tfrecord via a utility script.
  • Dataset loads .tfrecord and stores data inside samples.
  • Dataset provides samples by id to driver/backend in proper format.*
  • Dataset processes output from driver/backend.*
  • Dataset calculates and provides accuracy using output data on device.

* This includes tokenization/detokenization using common SentencePiece utility code.
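
As a rough sketch of what the IFEval accuracy step involves: each prompt carries one or more verifiable instructions, and accuracy can be reported per prompt (all instructions followed) or per instruction. The checker names below (`lowercase`, `min_words`, `keyword`) and the `(name, kwargs)` layout are simplified, hypothetical stand-ins rather than the actual IFEval instruction set or the repository's implementation, and the response is assumed to be already detokenized text.

```python
from typing import Callable

# Hypothetical, simplified instruction checkers keyed by name.
CHECKERS: dict[str, Callable[[str, dict], bool]] = {
    "lowercase": lambda text, kw: text == text.lower(),
    "min_words": lambda text, kw: len(text.split()) >= kw["n"],
    "keyword": lambda text, kw: kw["word"].lower() in text.lower(),
}


def check_response(text: str, instructions: list[tuple[str, dict]]) -> list[bool]:
    """Return one pass/fail flag per instruction attached to a prompt."""
    return [CHECKERS[name](text, kwargs) for name, kwargs in instructions]


def score(results: list[list[bool]]) -> tuple[float, float]:
    """Return (prompt-level accuracy, instruction-level accuracy)."""
    flat = [ok for per_prompt in results for ok in per_prompt]
    prompt_acc = sum(all(r) for r in results) / len(results)
    instruction_acc = sum(flat) / len(flat)
    return prompt_acc, instruction_acc
```

A prompt only counts toward prompt-level accuracy when every one of its instructions passes, so that figure is never higher than the instruction-level one.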
