Solve Diagnostics

With your datasets ingested and cataloged, you can now solve and execute diagnostics using the ref solve command.

1. Run all diagnostics (default)

By default, ref solve will discover and schedule all available diagnostics across all providers. The default executor is the local executor, which runs diagnostics in parallel using a process pool:

ref solve --timeout 3600

This will:

Query the catalog of ingested datasets (observations and model-output), completing any missing metadata as needed
Determine which diagnostics are applicable and how many different executions are needed
Execute each diagnostic in parallel on your machine
Use a timeout of 3600 seconds (1 hour) to complete the runs (0 = no timeout)
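The steps above can be pictured with a minimal Python sketch of a process pool that waits up to an overall timeout. This is an illustration under stated assumptions, not the REF local executor: `run_one`, `run_all`, and `effective_timeout` are hypothetical names, and the only behaviour taken from the text is that work runs in a process pool and that a timeout of 0 means no timeout.

```python
from concurrent.futures import ProcessPoolExecutor, wait


def effective_timeout(seconds):
    """Translate the CLI convention (0 = no timeout) into Python's None."""
    return None if seconds == 0 else seconds


def run_one(name):
    """Hypothetical stand-in for running a single diagnostic execution."""
    return f"{name}: done"


def run_all(names, timeout_seconds):
    """Run executions in a process pool and wait up to an overall timeout."""
    pool = ProcessPoolExecutor()
    futures = {pool.submit(run_one, name): name for name in names}
    done, not_done = wait(futures, timeout=effective_timeout(timeout_seconds))
    # Drop queued work without blocking. Executions that are already
    # running are not killed by this call.
    pool.shutdown(wait=False, cancel_futures=True)
    finished = sorted(futures[f] for f in done)
    unfinished = sorted(futures[f] for f in not_done)
    return finished, unfinished
```

Calling `run_all(["diag-a", "diag-b"], timeout_seconds=3600)` from a script's `if __name__ == "__main__":` block would return the names that finished within the hour and those that did not.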

Note: it is normal for some executions to fail (e.g., due to missing data or configuration). You can re-run or inspect failures as needed.

Because the local executor runs many diagnostics in parallel, it is worth tuning parallelism and thread limits (including the OpenMP/BLAS environment variables) before a large solve. See How to control memory use and parallelism.

Tip

To target a specific provider or diagnostic, use the --provider and --diagnostic flags:

# Run only PMP diagnostics
ref solve --provider pmp

# Run only diagnostics containing "enso" in their slug
ref solve --diagnostic enso

Replace pmp or enso with any provider or diagnostic slug listed in your installation.
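To make the filtering semantics concrete, here is a minimal Python sketch of how substring filters like these could select diagnostics. It is an illustration only, not the REF implementation: the `Diagnostic` record, the `select` helper, and the example slugs are hypothetical, and both substring matching on the provider and case-insensitive matching are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    provider: str  # provider slug, e.g. "pmp"
    slug: str      # diagnostic slug (the values below are made up)


def select(diagnostics, provider=None, diagnostic=None):
    """Keep diagnostics whose slugs contain the given substrings.

    A filter left as None matches everything. Case-insensitive matching
    is an assumption of this sketch.
    """
    def matches(needle, value):
        return needle is None or needle.lower() in value.lower()

    return [
        d for d in diagnostics
        if matches(provider, d.provider) and matches(diagnostic, d.slug)
    ]


available = [
    Diagnostic("pmp", "enso-tel"),
    Diagnostic("pmp", "annual-cycle"),
    Diagnostic("esmvaltool", "enso-basic-climatology"),
]

pmp_only = select(available, provider="pmp")      # like --provider pmp
enso_only = select(available, diagnostic="enso")  # like --diagnostic enso
```

In this sketch the `enso` filter picks up diagnostics from more than one provider, which is why combining both flags can be useful when you want a single provider's variant.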

2. Monitor execution status

You can view the status of execution groups with:

ref executions list-groups

Each group corresponds to a set of related executions (e.g., all runs of a diagnostic for one model). To see details for a specific group, use:

ref executions inspect <group_id>

This shows the status (pending, running, succeeded, failed) of each execution in the group, along with any error messages. Include this output when you report an issue or ask for help.

3. Re-execution and the dirty flag

Each execution group tracks whether it is dirty, meaning it needs to be rerun. The solver uses this flag, along with a hash of the input datasets, to decide which executions to schedule on each solve.

An execution group is automatically marked dirty when:

It is first created (no executions have been run yet)
New data is ingested that changes the set of input datasets
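The scheduling decision described above can be sketched in a few lines of Python. This is a hedged illustration, not the solver's actual code: `ExecutionGroup`, `dataset_hash`, and `needs_execution` are hypothetical names, the dataset identifiers are invented, and the exact way the flag and the hash are combined is an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def dataset_hash(dataset_ids):
    """Order-independent SHA-1 digest of a collection of dataset identifiers."""
    joined = "\n".join(sorted(dataset_ids))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


@dataclass
class ExecutionGroup:
    dirty: bool = True               # a newly created group starts dirty
    last_hash: Optional[str] = None  # hash used by the most recent execution


def needs_execution(group, dataset_ids):
    """Schedule when the group is dirty or its input datasets have changed."""
    return group.dirty or group.last_hash != dataset_hash(dataset_ids)


group = ExecutionGroup()
first = needs_execution(group, ["tas_model-a", "obs-x"])  # never run yet

# Pretend an execution completed and cleared the flag.
group.dirty = False
group.last_hash = dataset_hash(["tas_model-a", "obs-x"])

unchanged = needs_execution(group, ["obs-x", "tas_model-a"])
changed = needs_execution(group, ["tas_model-a", "tas_model-b", "obs-x"])
```

Sorting the identifiers before hashing keeps the digest stable when the same datasets are listed in a different order, so only a real change in the input set triggers a new execution.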

System errors vs diagnostic errors

When an execution fails, the system distinguishes between two types of failure:

System errors (out-of-memory, disk full, worker crash): The dirty flag is left set, so the execution will be automatically retried on the next ref solve. These failures are usually transient, and a retry may succeed once conditions change.
Diagnostic errors (logic bugs, invalid data handling): The dirty flag is cleared, preventing the same failing diagnostic from being retried indefinitely with the same data.

Successful executions also clear the dirty flag.
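The rules above amount to a small mapping from execution outcome to the resulting dirty flag. The Python sketch below restates them; the `Outcome` enum and `dirty_after` function are hypothetical names chosen for illustration and do not come from the REF code base.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    DIAGNOSTIC_ERROR = "diagnostic_error"  # logic bug, invalid data handling
    SYSTEM_ERROR = "system_error"          # out-of-memory, disk full, worker crash


def dirty_after(outcome: Outcome) -> bool:
    """Return the value of the dirty flag once an execution has finished."""
    # Only system errors leave the group dirty, so only they are retried
    # automatically on the next solve.
    return outcome is Outcome.SYSTEM_ERROR


flags = {outcome.name: dirty_after(outcome) for outcome in Outcome}
```

Here `flags` maps `SYSTEM_ERROR` to `True` and the other two outcomes to `False`, matching the behaviour described in this section.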

Retrying failed diagnostics

If an execution fails with a diagnostic error (for example, a logic bug), it will not be retried on subsequent solves unless you take action. There are several ways to retry:

Retry specific groups using flag-dirty:

# Find failed execution groups
ref executions list-groups --not-successful

# Flag a specific group for retry
ref executions flag-dirty <group_id>

# Re-solve to pick up the flagged groups
ref solve

Retry all failed executions using --rerun-failed:

ref solve --rerun-failed

This is useful after fixing a bug in a diagnostic provider or resolving an environment issue that caused widespread failures.

Retry stuck executions using fail-running:

If executions appear stuck (e.g., due to a worker crash or out-of-memory error), you can mark them as failed and flag their groups for retry:

# Fail all stuck executions
ref executions fail-running --force

# Fail only executions stuck for more than 2 hours
ref executions fail-running --older-than 2 --force

# Re-solve to retry the flagged groups
ref solve
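As a rough picture of what an age threshold like --older-than selects, the Python sketch below filters running executions by how long ago they started. It is only an illustration: `stuck_executions` is a hypothetical helper, the execution IDs are invented, and measuring age from the start time is an assumption, since the text does not say which timestamp the command uses.

```python
from datetime import datetime, timedelta, timezone


def stuck_executions(running, now, older_than_hours=None):
    """Return IDs of running executions started more than N hours ago.

    `running` maps a hypothetical execution ID to its start time. With
    older_than_hours=None every running execution is selected.
    """
    if older_than_hours is None:
        return sorted(running)
    cutoff = now - timedelta(hours=older_than_hours)
    return sorted(eid for eid, started in running.items() if started < cutoff)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
running = {
    101: now - timedelta(hours=3),     # started three hours ago
    102: now - timedelta(minutes=30),  # started recently, probably healthy
}

all_running = stuck_executions(running, now)
older_than_two_hours = stuck_executions(running, now, older_than_hours=2)
```

With a threshold of 2 hours, only execution 101 is selected, which leaves recently started work untouched.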

Next steps

Once diagnostics have completed, visualise the results in the Visualise tutorial.