Feature List: TOSS4: 2020 Q2

Stephen Herbein edited this page Jun 13, 2019 · 3 revisions

System Instance

  • Limit KVS Content Growth
  • Tolerate Compute Nodes Down
  • Drain Nodes
  • Detect/Monitor Nodes/Resources Up/Down

Execution System

  • I/O to/from files with file per process
  • Multi-prog support (MPMD)
  • pty support
  • affinity/mapping
  • Jobspec + R -> local
  • Environment
  • Debugger support
    • MPIR
    • Distributed Sync
  • Launch OpenMPI 3.1+
  • PMI
  • job completion log
    • simple append interface
    • offline & online query (x-post w/ porcelain)
  • real job shell
  • signal jobs (x-post w/ porcelain)

Job Submission

  • Job Priorities (x-post w/ bank/accounting)
  • Job Dependencies

Resource Management

  • Query available/allocated/down resources
  • Resource configuration language
  • Resource discovery vs config file

Porcelain

  • List jobs in queue order with filtering
  • Run/submit
  • scheduler front-end work
  • alter job priorities
    • hold
    • cancel
    • expedite
  • query completed jobs (x-post w/ execution system)
  • Transition Tools
    • flux srun
  • signal jobs (x-post w/ execution system)

Bank/Accounting

  • Specify bank on submission
  • Tools/storage for EOY analysis
  • User permissions
  • Fair-Share Scheduling
  • Job Priorities (x-post w/ job submission)
  • Slurm Database

Resource Matching Integration w/ Exec System

  • Resource matching interfaces w/ new exec system
  • Scheduler ? support

Sched Optimization and Resiliency

  • Scheduler performance optimization
  • Scheduler resiliency improvements
    • Support unload/load via job manager
  • Scheduler memory optimization
  • Planner optimization

Support for Queues & Partitions

  • Queue Equivalent (e.g., job tags)
    • W/ policy support (e.g., wall time limit)

ATDM L2 Milestones

  • Power Monitoring
    • monitoring support for job-level power/perf data
    • from various databases
  • Tools Interface
  • Storage ???
    • Burst Buffer support w/in simulator
    • Add stage-in/out support in jobspec
    • Data staging flux module
  • GPU

Security

  • IMP + Sontain
  • IMP PAM Support
  • IMP Prolog/Epilog support