Long-Running Simulation Orchestrator

A specialized concurrent task scheduler designed for managing large-scale scientific simulations that run for months across hundreds of compute nodes. This system maximizes research output through efficient scheduling with robust checkpointing, fault tolerance, and scenario prioritization capabilities tailored to long-duration computational science workloads.

Features

Long-Running Job Management System: Protects extended simulations from preemption while efficiently managing resource allocation over periods of months.
Simulation Dependency Tracking: Tracks relationships between simulation stages and automates transitions between phases of complex models.
Equipment Failure Resilience: Minimizes recalculation after hardware failures with intelligent checkpointing and partial result preservation.
Resource Usage Forecasting: Generates accurate forecasts of simulation resource requirements for grant reporting and capacity planning.
Scenario Priority Management: Adjusts resource allocation among different simulation scenarios based on preliminary result promise and research potential.

System Architecture

The system is organized into several key components:

Job Management: Manages long-running simulations with preemption protection and resource scheduling
- scheduler.py: Handles job scheduling and resource allocation
- queue.py: Manages the job queue with prioritization strategies
- reservation.py: Manages resource reservations and conflict resolution
Dependency Tracking: Manages complex workflows and relationships between simulation stages
- graph.py: Implements a directed graph for tracking dependencies
- tracker.py: Tracks dependencies between simulation components
- workflow.py: Manages multi-stage simulation workflows
Failure Resilience: Provides robust recovery from hardware failures
- checkpoint_manager.py: Manages creation and restoration of checkpoints
- failure_detector.py: Detects hardware and software failures
- resilience_coordinator.py: Coordinates recovery from failures
Resource Forecasting: Predicts resource needs for planning
- data_collector.py: Gathers resource utilization data
- forecaster.py: Generates resource usage forecasts
- optimizer.py: Optimizes resource allocation
- reporter.py: Generates reports for resource usage
Scenario Management: Prioritizes the most promising research directions
- comparator.py: Compares different simulation scenarios
- evaluator.py: Evaluates scenario scientific promise
- priority_manager.py: Adjusts resource allocation based on priorities

Installation

# Set up a virtual environment
uv venv
source .venv/bin/activate

# Install the package in development mode
uv pip install -e .

# Install required dependencies
uv pip install networkx scikit-learn pytest pytest-json-report

Usage Examples

# Initialize the job manager
from concurrent_task_scheduler.job_management.scheduler import JobScheduler, LongRunningJobManager

# Create a scheduler with the hybrid strategy
scheduler = JobScheduler()

# Create a long-running job manager
job_manager = LongRunningJobManager(scheduler=scheduler)

# Register compute nodes
from concurrent_task_scheduler.models import ComputeNode, NodeType
node = ComputeNode(
    id="node-1",
    name="Compute Node 1",
    node_type=NodeType.COMPUTE,
    cpu_cores=32,
    memory_gb=128.0,
    gpu_count=4,
    storage_gb=1024.0,
    network_bandwidth_gbps=10.0,
)
job_manager.register_node(node)

# Submit a simulation
from concurrent_task_scheduler.models import Simulation
simulation = Simulation(...)
job_manager.submit_simulation(simulation)

Testing

# Install the test dependencies
uv pip install pytest pytest-json-report

# Run all tests
pytest

# Generate test coverage report
pytest --cov=concurrent_task_scheduler

# Generate JSON report for validation
pytest --json-report --json-report-file=pytest_results.json

Advanced Features

Checkpointing Strategies: Configure different checkpointing strategies based on simulation type and importance
Failure Recovery: Automatically recover from various failure scenarios with minimal data loss
Dynamic Priority Adjustment: Adjust simulation priorities based on preliminary results and system conditions
Resource Forecasting: Generate accurate forecasts for future resource needs based on historical patterns
Maintenance Windows: Handle planned system maintenance with minimal disruption to long-running simulations

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
concurrent_task_scheduler		concurrent_task_scheduler
tests		tests
.gitignore		.gitignore
INSTRUCTIONS.md		INSTRUCTIONS.md
README.md		README.md
pyproject.toml		pyproject.toml
pytest_results.json		pytest_results.json
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Long-Running Simulation Orchestrator

Features

System Architecture

Installation

Usage Examples

Testing

Advanced Features

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

code-refactor/concurrent_task_scheduler_scientific_computing

Folders and files

Latest commit

History

Repository files navigation

Long-Running Simulation Orchestrator

Features

System Architecture

Installation

Usage Examples

Testing

Advanced Features

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages