A specialized concurrent task scheduler designed for managing large-scale scientific simulations that run for months across hundreds of compute nodes. This system maximizes research output through efficient scheduling with robust checkpointing, fault tolerance, and scenario prioritization capabilities tailored to long-duration computational science workloads.
- Long-Running Job Management System: Protects extended simulations from preemption while efficiently managing resource allocation over periods of months.
- Simulation Dependency Tracking: Tracks relationships between simulation stages and automates transitions between phases of complex models.
- Equipment Failure Resilience: Minimizes recalculation after hardware failures with intelligent checkpointing and partial result preservation.
- Resource Usage Forecasting: Generates accurate forecasts of simulation resource requirements for grant reporting and capacity planning.
- Scenario Priority Management: Adjusts resource allocation among different simulation scenarios based on preliminary result promise and research potential.
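The failure-resilience feature above rests on periodic checkpoints that a restarted job can resume from. The following is a minimal standard-library sketch of that idea, not the project's `checkpoint_manager.py`. The function names, the JSON state layout, and the `fail_at` parameter (used only to simulate a node failure) are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # Replace the old checkpoint in a single step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run(path, total_steps, checkpoint_every=10, fail_at=None):
    """Toy simulation (hypothetical): accumulate a running sum, checkpointing periodically."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

After a failure, calling `run` again with the same path resumes from the most recent checkpoint, so at most `checkpoint_every` steps are recomputed. The checkpoint interval trades checkpoint I/O cost against the amount of lost work.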
The system is organized into several key components:
- Job Management: Manages long-running simulations with preemption protection and resource scheduling
  - scheduler.py: Handles job scheduling and resource allocation
  - queue.py: Manages the job queue with prioritization strategies
  - reservation.py: Manages resource reservations and conflict resolution
- Dependency Tracking: Manages complex workflows and relationships between simulation stages
  - graph.py: Implements a directed graph for tracking dependencies
  - tracker.py: Tracks dependencies between simulation components
  - workflow.py: Manages multi-stage simulation workflows
- Failure Resilience: Provides robust recovery from hardware failures
  - checkpoint_manager.py: Manages creation and restoration of checkpoints
  - failure_detector.py: Detects hardware and software failures
  - resilience_coordinator.py: Coordinates recovery from failures
- Resource Forecasting: Predicts resource needs for planning
  - data_collector.py: Gathers resource utilization data
  - forecaster.py: Generates resource usage forecasts
  - optimizer.py: Optimizes resource allocation
  - reporter.py: Generates reports for resource usage
- Scenario Management: Prioritizes the most promising research directions
  - comparator.py: Compares different simulation scenarios
  - evaluator.py: Evaluates scenario scientific promise
  - priority_manager.py: Adjusts resource allocation based on priorities
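To illustrate what the dependency-tracking component has to compute, here is a small sketch that groups simulation stages into "waves" that can run once their prerequisites finish. It uses the standard library's `graphlib` as a stand-in and does not reflect the actual API of `graph.py` or `workflow.py`. The stage names and the `execution_waves` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each stage maps to the set of stages it depends on.
stage_dependencies = {
    "spin_up": {"mesh_generation"},
    "main_run": {"spin_up"},
    "sensitivity_run": {"spin_up"},
    "post_processing": {"main_run", "sensitivity_run"},
}


def execution_waves(dependencies):
    """Group stages into waves; every stage in a wave has all prerequisites done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the stages depend on each other in a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished to unlock the next one
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(stage_dependencies), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Stages within one wave are independent of each other, so a scheduler could run them concurrently. A real tracker would mark stages done as they actually complete, not a whole wave at a time.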
# Set up a virtual environment
uv venv
source .venv/bin/activate
# Install the package in development mode
uv pip install -e .
# Install required dependencies
uv pip install networkx scikit-learn pytest pytest-json-report

# Initialize the job manager
from concurrent_task_scheduler.job_management.scheduler import JobScheduler, LongRunningJobManager
# Create a scheduler with the default strategy (hybrid)
scheduler = JobScheduler()
# Create a long-running job manager
job_manager = LongRunningJobManager(scheduler=scheduler)
# Register compute nodes
from concurrent_task_scheduler.models import ComputeNode, NodeType
node = ComputeNode(
id="node-1",
name="Compute Node 1",
node_type=NodeType.COMPUTE,
cpu_cores=32,
memory_gb=128.0,
gpu_count=4,
storage_gb=1024.0,
network_bandwidth_gbps=10.0,
)
job_manager.register_node(node)
# Submit a simulation
from concurrent_task_scheduler.models import Simulation
simulation = Simulation(...)
job_manager.submit_simulation(simulation)

# Install the test dependencies
uv pip install pytest pytest-json-report pytest-cov
# Run all tests
pytest
# Generate test coverage report
pytest --cov=concurrent_task_scheduler
# Generate JSON report for validation
pytest --json-report --json-report-file=pytest_results.json

- Checkpointing Strategies: Configure different checkpointing strategies based on simulation type and importance
- Failure Recovery: Automatically recover from various failure scenarios with minimal data loss
- Dynamic Priority Adjustment: Adjust simulation priorities based on preliminary results and system conditions
- Resource Forecasting: Generate accurate forecasts for future resource needs based on historical patterns
- Maintenance Windows: Handle planned system maintenance with minimal disruption to long-running simulations
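As a rough illustration of dynamic priority adjustment, the sketch below splits a fixed pool of nodes among scenarios in proportion to a "promise" score, while guaranteeing each scenario a minimum share. This is an assumption about one reasonable policy, not a description of `priority_manager.py`. The function name, the score scale, and the `min_nodes` floor are all hypothetical.

```python
def allocate_nodes(promise_scores, total_nodes, min_nodes=1):
    """Split total_nodes across scenarios in proportion to their promise scores.

    Every scenario keeps at least min_nodes so that no scenario is starved.
    Leftover nodes from rounding go to the largest fractional remainders.
    """
    if not promise_scores:
        return {}
    if any(score < 0 for score in promise_scores.values()):
        raise ValueError("promise scores must be non-negative")
    reserved = min_nodes * len(promise_scores)
    if reserved > total_nodes:
        raise ValueError("not enough nodes to give every scenario its minimum")

    spare = total_nodes - reserved
    total_score = sum(promise_scores.values())
    if total_score == 0:
        # No scenario stands out yet: share the spare nodes equally.
        promise_scores = {name: 1 for name in promise_scores}
        total_score = len(promise_scores)

    exact = {name: spare * score / total_score for name, score in promise_scores.items()}
    allocation = {name: min_nodes + int(share) for name, share in exact.items()}

    leftover = total_nodes - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - int(exact[name]), reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


if __name__ == "__main__":
    # Hypothetical scores from preliminary results; rerun whenever they change.
    print(allocate_nodes({"baseline": 6, "high_res": 3, "exploratory": 1}, total_nodes=13))
```

Calling the function again after scores are updated yields the new target allocation. A scheduler would then move toward it at safe points such as checkpoints, without preempting running work.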