Solve Diagnostics
With your datasets ingested and cataloged, you can now solve and execute diagnostics using the ref solve command.
1. Run all diagnostics (default)
By default, ref solve discovers and schedules all available diagnostics across all providers. The default executor is the local executor, which runs diagnostics in parallel using a process pool:
ref solve
This will:
- Query the catalog of ingested datasets (observations and model-output), completing any missing metadata as needed
- Determine which diagnostics are applicable and how many different executions are needed
- Execute each diagnostic in parallel on your machine
- Use a timeout of 3600 seconds (1 hour) to complete the runs
Note: it is normal for some executions to fail (e.g., due to missing data or configuration). You can re-run or inspect failures as needed.
Tip
To target a specific provider or diagnostic, use the --provider and --diagnostic flags:
# Run only PMP diagnostics
ref solve --provider pmp
# Run only diagnostics containing "enso" in their slug
ref solve --diagnostic enso
Replace pmp or enso with any provider or diagnostic slug listed in your installation.
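If you drive the solver from a script, the filters above can be assembled programmatically. The following is a minimal sketch, not part of the REF itself: build_solve_command is a hypothetical helper, and it uses only the ref solve command and the --provider and --diagnostic flags shown in this tutorial. Passing both flags in a single call is an assumption that this page does not confirm.

```python
import shlex
from typing import List, Optional


def build_solve_command(provider: Optional[str] = None,
                        diagnostic: Optional[str] = None) -> List[str]:
    """Return the argument list for a `ref solve` call (hypothetical helper)."""
    command = ["ref", "solve"]
    if provider is not None:
        # Restrict the solve to one provider, e.g. "pmp"
        command += ["--provider", provider]
    if diagnostic is not None:
        # Restrict the solve to diagnostics whose slug contains this text
        command += ["--diagnostic", diagnostic]
    return command


commands = [
    build_solve_command(),                   # all diagnostics (default)
    build_solve_command(provider="pmp"),     # only PMP diagnostics
    build_solve_command(diagnostic="enso"),  # only slugs containing "enso"
]
for command in commands:
    # Print the shell-quoted command line instead of running it
    print(shlex.join(command))
```

To execute a command rather than print it, pass the list to subprocess.run(command, check=True) in an environment where ref is installed.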
2. Monitor execution status
You can view the status of execution groups with:
ref executions list-groups
Each group corresponds to a set of related executions (e.g., all runs of a diagnostic for one model). To see details for a specific group, use:
This will show the status (pending, running, succeeded, failed) of each execution in the group, along with any error messages. Include this output when you report an issue or ask for help.
3. Re-execution and the dirty flag
Each execution group tracks whether it is dirty, meaning it needs to be rerun. The solver uses this flag, along with a hash of the input datasets, to decide which executions to schedule on each solve.
An execution group is automatically marked dirty when:
- It is first created (no executions have been run yet)
- New data is ingested that changes the set of input datasets
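The scheduling rule described above can be pictured with a small model. This is a conceptual sketch only, not the REF's actual implementation: ExecutionGroup, dataset_hash, and should_schedule are hypothetical names, and hashing a sorted list of dataset identifiers with SHA-256 is an assumption made purely for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional


def dataset_hash(dataset_ids: Iterable[str]) -> str:
    """Order-independent hash of a set of dataset identifiers (assumed scheme)."""
    joined = "\n".join(sorted(set(dataset_ids)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


@dataclass
class ExecutionGroup:
    key: str
    dirty: bool = True               # a newly created group starts dirty
    last_hash: Optional[str] = None  # hash of the inputs used by the last execution


def should_schedule(group: ExecutionGroup, current_ids: Iterable[str]) -> bool:
    """Schedule when the group is dirty or its input datasets have changed."""
    return group.dirty or group.last_hash != dataset_hash(current_ids)


group = ExecutionGroup(key="example-diagnostic/model-a")
print(should_schedule(group, ["tas.nc"]))           # True: never executed

# Pretend an execution completed and cleared the flag
group.dirty = False
group.last_hash = dataset_hash(["tas.nc"])
print(should_schedule(group, ["tas.nc"]))           # False: nothing changed
print(should_schedule(group, ["tas.nc", "pr.nc"]))  # True: new input dataset
```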
System errors vs diagnostic errors
When an execution fails, the system distinguishes between two types of failure:
- System errors (out-of-memory, disk full, worker crash): The dirty flag is left set, so the execution will be automatically retried on the next ref solve. These failures are transient and may succeed when conditions change.
- Diagnostic errors (logic bugs, invalid data handling): The dirty flag is cleared, preventing the same failing diagnostic from being retried indefinitely with the same data.
Successful executions also clear the dirty flag.
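The three outcomes and their effect on the dirty flag can be summarised in a few lines. This is an illustrative sketch of the behaviour described above, not the REF's source code; the Outcome enum and dirty_after function are hypothetical names.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    SYSTEM_ERROR = "system_error"          # e.g. out-of-memory, disk full, worker crash
    DIAGNOSTIC_ERROR = "diagnostic_error"  # e.g. logic bug, invalid data handling


def dirty_after(outcome: Outcome) -> bool:
    """Return the dirty flag's value once an execution has finished."""
    # Only system errors leave the flag set, so only they are retried
    # automatically on the next solve.
    return outcome is Outcome.SYSTEM_ERROR


for outcome in Outcome:
    print(f"{outcome.value}: dirty={dirty_after(outcome)}")
```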
Retrying failed diagnostics
If a diagnostic fails due to a logic error, it will not be retried on subsequent solves unless you take action. There are several ways to retry:
Retry specific groups using flag-dirty:
# Find failed execution groups
ref executions list-groups --not-successful
# Flag a specific group for retry
ref executions flag-dirty <group_id>
# Re-solve to pick up the flagged groups
ref solve
Retry all failed executions using --rerun-failed:
This is useful after fixing a bug in a diagnostic provider or resolving an environment issue that caused widespread failures.
Retry stuck executions using fail-running:
If executions appear stuck (e.g., due to a worker crash or out-of-memory error), you can mark them as failed and flag their groups for retry:
# Fail all stuck executions
ref executions fail-running --force
# Fail only executions stuck for more than 2 hours
ref executions fail-running --older-than 2 --force
# Re-solve to retry the flagged groups
ref solve
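To make the effect of the age filter concrete, here is a rough model of what failing stuck executions amounts to. It is a sketch under stated assumptions, not the CLI's implementation: the Execution record and the fail_running function are hypothetical, and the threshold is read as a number of hours, following the comment in the commands above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Set


@dataclass
class Execution:
    group_id: int
    status: str           # "pending", "running", "succeeded" or "failed"
    started_at: datetime


def fail_running(executions: Iterable[Execution], now: datetime,
                 older_than_hours: Optional[float] = None) -> Set[int]:
    """Mark running executions as failed and return the group IDs to flag dirty."""
    cutoff = None if older_than_hours is None else now - timedelta(hours=older_than_hours)
    flagged = set()
    for execution in executions:
        if execution.status != "running":
            continue
        if cutoff is not None and execution.started_at >= cutoff:
            continue  # started recently, so not considered stuck yet
        execution.status = "failed"
        flagged.add(execution.group_id)
    return flagged


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
executions = [
    Execution(group_id=1, status="running", started_at=now - timedelta(hours=3)),
    Execution(group_id=2, status="running", started_at=now - timedelta(minutes=30)),
    Execution(group_id=3, status="succeeded", started_at=now - timedelta(hours=5)),
]
print(fail_running(executions, now, older_than_hours=2))  # {1}
```

In this model, omitting older_than_hours fails every running execution, which mirrors the first command above.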
Next steps
Once diagnostics have completed, visualise the results in the Visualise tutorial.