Merged
41 changes: 19 additions & 22 deletions skills/cuopt-numerical-optimization-api-python/BENCHMARK.md
@@ -7,13 +7,14 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s
## Evaluation Summary

- Skill: `cuopt-numerical-optimization-api-python`
- Evaluation date: 2026-05-29
- Evaluation date: 2026-06-10
- NVSkills-Eval profile: `external`
- Environment: `local`
- Dataset: 1 evaluation tasks
- Attempts per task: 2
- Environment: `astra-sandbox`
- Dataset: 4 evaluation tasks
- Attempts per task: 1
- Pass threshold: 50%
- Overall verdict: FAIL
The skill should be reviewed before NVSkills-Eval publication. **Skill owners should address the applicable findings below and rerun NVSkills-Eval to refresh this benchmark.**

## Agents Used

@@ -42,9 +43,9 @@ Underlying evaluation signals used in this run:

## Test Tasks

The benchmark dataset contained 1 evaluation tasks:
The benchmark dataset contained 4 evaluation tasks:

- Positive tasks: 1 tasks where the skill was expected to activate.
- Positive tasks: 4 tasks where the skill was expected to activate.
- Negative tasks: 0 tasks where no skill was expected.
- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.

@@ -54,39 +55,39 @@ Task composition is derived from the evaluation dataset when possible. Entries w

| Dimension | Num | `claude-code` | `codex` |
|---|---:|---:|---:|
| Security | 2 | 100% (+0%) | 100% (+0%) |
| Correctness | 2 | 100% (+0%) | 82% (+5%) |
| Discoverability | 2 | 100% (+0%) | 84% (+5%) |
| Effectiveness | 2 | 79% (-1%) | 40% (-9%) |
| Efficiency | 2 | 93% (-0%) | 77% (+1%) |
| Security | 4 | 100% (+0%) | 100% (+0%) |
| Correctness | 4 | 65% (+29%) | 64% (+8%) |
| Discoverability | 4 | 50% (+44%) | 44% (+25%) |
| Effectiveness | 4 | 66% (+17%) | 56% (+3%) |
| Efficiency | 4 | 61% (+37%) | 44% (+17%) |

Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
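
As a rough illustration of how to read these cells, the sketch below recovers the implied no-skill baseline from a "score (uplift)" pair. It assumes the uplift is an additive difference in percentage points (score minus baseline); the benchmark text does not say whether uplift is additive or relative, so treat the derived numbers as approximate. The cell strings are copied from the `claude-code` column of the updated table, and the helper name `baseline` is hypothetical.

```python
import re

# Cells copied from the updated `claude-code` column: "score (uplift)".
cells = {
    "Correctness": "65% (+29%)",
    "Discoverability": "50% (+44%)",
    "Effectiveness": "66% (+17%)",
    "Efficiency": "61% (+37%)",
}

# Matches e.g. "65% (+29%)" or "93% (-0%)".
CELL = re.compile(r"(\d+)% \(([+-]\d+)%\)")


def baseline(cell: str) -> int:
    """Return the implied no-skill score, assuming uplift = score - baseline."""
    match = CELL.fullmatch(cell)
    if match is None:
        raise ValueError(f"unrecognized cell format: {cell!r}")
    score, uplift = (int(group) for group in match.groups())
    return score - uplift


baselines = {dimension: baseline(cell) for dimension, cell in cells.items()}

for dimension, value in baselines.items():
    print(f"{dimension}: implied baseline {value}%")
```

Under that assumption, a large uplift next to a modest score (as in the Discoverability row) points to a very low no-skill baseline rather than a strong absolute result.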

## Tier 1: Static Validation Summary

Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 15 total findings.
Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 10 total findings.

Top findings:

- MEDIUM PII/gps_coordinates: GPS coordinates (location information) (`references/qp_examples.md:162`)
- MEDIUM PII/gps_coordinates: GPS coordinates (location information) (`references/qp_examples.md:163`)
- MEDIUM PII/gps_coordinates: GPS coordinates (location information) (`references/qp_examples.md:164`)
- MEDIUM PII/phone_numbers: International phone number (`assets/mps_solver/results.md:48`)
- MEDIUM PII/phone_numbers: International phone number (`assets/mps_solver/results.md:69`)
- MEDIUM SCHEMA/body_recommended_section: Missing recommended section: '## Instructions' (`skills/cuopt-numerical-optimization-api-python/SKILL.md`)
- MEDIUM SCHEMA/body_recommended_section: Missing recommended section: '## Examples' (`skills/cuopt-numerical-optimization-api-python/SKILL.md`)
- LOW QUALITY/quality_discoverability: Description doesn't mention WHEN to use this skill (`skills/cuopt-numerical-optimization-api-python/SKILL.md`)

## Tier 2: Deduplication Summary

Tier 2 validation reported findings. NVSkills-Eval ran 2 checks and found 9 total findings.

Top findings:

- HIGH DUPLICATE/duplicate: Duplicate content found across assets/lp_warmstart/README.md and assets/lp_warmstart/model.py:
"# LP PDLP Warmstart" in assets/lp_warmstart/README.md (lines 1-5)
vs "(module docstring)" in assets/lp_warmstart/model.py (lines 1-4) (`assets/lp_warmstart/README.md:1`)
- HIGH DUPLICATE/duplicate: Duplicate content found across SKILL.md and assets/mps_solver/README.md and references/qp_examples.md:
"# Solve" in SKILL.md (lines 63-67)
vs "# Configure and solve" in assets/mps_solver/README.md (lines 76-80)
vs "# Solve" in references/qp_examples.md (lines 47-51) (`SKILL.md:63`)
- HIGH DUPLICATE/duplicate: Duplicate content found across assets/lp_warmstart/README.md and assets/lp_warmstart/model.py:
"# LP PDLP Warmstart" in assets/lp_warmstart/README.md (lines 1-5)
vs "(module docstring)" in assets/lp_warmstart/model.py (lines 1-4) (`assets/lp_warmstart/README.md:1`)
- HIGH DUPLICATE/duplicate: Duplicate content found across assets/milp_basic/README.md and assets/milp_basic/model.py:
"# Minimal MILP" in assets/milp_basic/README.md (lines 1-10)
vs "(module docstring)" in assets/milp_basic/model.py (lines 1-6) (`assets/milp_basic/README.md:1`)
@@ -97,7 +98,3 @@ Top findings:
"# Check status (CRITICAL: use PascalCase!)" in SKILL.md (lines 68-74)
vs "# ✅ CORRECT" in SKILL.md (lines 148-151)
vs "# Check solution" in assets/mps_solver/README.md (lines 81-85) (`SKILL.md:68`)

## Publication Recommendation

The skill should be reviewed before NVSkills-Eval publication. Skill owners should address the findings above and rerun NVSkills-Eval to refresh this benchmark.
47 changes: 45 additions & 2 deletions skills/cuopt-numerical-optimization-api-python/evals/evals.json
@@ -1,10 +1,10 @@
[
{
"id": "numopt-py-eval-001-lp-api-call-sequence",
"question": "I want to solve a small LP (continuous variables only, maximize a linear objective with linear constraints) using the cuOpt Python API. List the API calls in order name each method, one line per method, no full runnable script.",
"question": "I want to solve a small LP (continuous variables only, maximize a linear objective with linear constraints) using the cuOpt Python API. List the API calls in order \u2014 name each method, one line per method, no full runnable script.",
"expected_skill": "cuopt-numerical-optimization-api-python",
"expected_script": null,
"ground_truth": "The agent produces an ordered list of API calls without a runnable script. The list, in order: (1) Import Problem, CONTINUOUS, and MAXIMIZE from cuopt.linear_programming.problem, and SolverSettings from cuopt.linear_programming.solver_settings. (2) Construct Problem('name'). (3) For each decision variable, call problem.addVariable(lb=..., vtype=CONTINUOUS, name=...). (4) For each constraint, call problem.addConstraint(<linear expression> <= or >= or == <rhs>, name=...). (5) Call problem.setObjective(<linear expression>, sense=MAXIMIZE). (6) Construct SolverSettings(); call set_parameter('time_limit', ...) for time budget. (7) Call problem.solve(settings). (8) Check problem.Status.name in ['Optimal', 'PrimalFeasible'] (PascalCase status names — case-sensitive). (9) Read problem.ObjValue for the objective, and each variable's .getValue() for its optimal value. The agent uses LP (not MILP / QP) because all variables are continuous and the objective is linear. Mentions that status names are PascalCase (Optimal, not OPTIMAL or optimal) — case sensitivity matters.",
"ground_truth": "The agent produces an ordered list of API calls without a runnable script. The list, in order: (1) Import Problem, CONTINUOUS, and MAXIMIZE from cuopt.linear_programming.problem, and SolverSettings from cuopt.linear_programming.solver_settings. (2) Construct Problem('name'). (3) For each decision variable, call problem.addVariable(lb=..., vtype=CONTINUOUS, name=...). (4) For each constraint, call problem.addConstraint(<linear expression> <= or >= or == <rhs>, name=...). (5) Call problem.setObjective(<linear expression>, sense=MAXIMIZE). (6) Construct SolverSettings(); call set_parameter('time_limit', ...) for time budget. (7) Call problem.solve(settings). (8) Check problem.Status.name in ['Optimal', 'PrimalFeasible'] (PascalCase status names \u2014 case-sensitive). (9) Read problem.ObjValue for the objective, and each variable's .getValue() for its optimal value. The agent uses LP (not MILP / QP) because all variables are continuous and the objective is linear. Mentions that status names are PascalCase (Optimal, not OPTIMAL or optimal) \u2014 case sensitivity matters.",
"expected_behavior": [
"Selects LP (not MILP or QP) given continuous variables and a linear objective",
"Lists the API calls in order without producing a full runnable script",
@@ -15,5 +15,48 @@
"Mentions that status names are case-sensitive (PascalCase)",
"Does not invent method names that are not in the skill"
]
},
{
"id": "numopt-py-eval-002-status-case-sensitivity",
"question": "My cuOpt Python LP solve runs without error but the result block never executes. Here is the check I wrote: if problem.Status.name == 'OPTIMAL': print(problem.ObjValue). What is wrong and how do I fix it?",
"expected_skill": "cuopt-numerical-optimization-api-python",
"expected_script": null,
"ground_truth": "The check silently fails because cuOpt status names use PascalCase, not ALL_CAPS. The string 'OPTIMAL' never matches. The correct LP status values to check are 'Optimal' and 'PrimalFeasible'. The fixed check is: if problem.Status.name in ['Optimal', 'PrimalFeasible']: print(problem.ObjValue). For MILP the correct values are 'Optimal' and 'FeasibleFound'. This is a common silent bug \u2014 the solve completes successfully but the code path that reads results is skipped because the string comparison always returns False.",
"expected_behavior": [
"Identifies the bug as a case mismatch \u2014 'OPTIMAL' is wrong, 'Optimal' is correct",
"States that cuOpt status names are PascalCase, not ALL_CAPS",
"Gives the correct LP check: problem.Status.name in ['Optimal', 'PrimalFeasible']",
"Notes that for MILP the passing status is 'FeasibleFound' not 'FEASIBLE_FOUND' or 'FEASIBLEFOUND'",
"Explains why this is a silent failure \u2014 no exception is raised, the block just never executes"
]
},
{
"id": "numopt-py-eval-003-integer-vs-continuous-workers",
"question": "I am modeling a staffing problem where I need to decide how many nurses to assign to each ward. Should the nurse count variables be INTEGER or CONTINUOUS in the cuOpt Python API, and what vtype constant do I use for each?",
"expected_skill": "cuopt-numerical-optimization-api-python",
"expected_script": null,
"ground_truth": "Nurse counts should be INTEGER because nurses are discrete countable entities \u2014 you cannot assign 2.7 nurses to a ward. The vtype constant is INTEGER (imported from cuopt.linear_programming.problem). The addVariable call would be: problem.addVariable(lb=0, vtype=INTEGER, name='ward_a_nurses'). This makes the problem a MILP, not an LP. CONTINUOUS would be wrong here because it allows fractional values, which are meaningless for headcounts. The rule is: 'how many things' (people, vehicles, machines) \u2192 INTEGER; 'how much of something' (hours, tonnes, dollars) \u2192 CONTINUOUS.",
"expected_behavior": [
"States nurse counts must be INTEGER because nurses are discrete countable entities",
"Names the correct vtype constant: INTEGER (imported from cuopt.linear_programming.problem)",
"Shows or describes the addVariable call with vtype=INTEGER",
"States this makes the problem MILP, not LP",
"Explains why CONTINUOUS is wrong \u2014 it allows fractional nurse counts",
"States the rule: countable things \u2192 INTEGER, measurable amounts \u2192 CONTINUOUS"
]
},
{
"id": "numopt-py-eval-004-qp-maximize-workaround",
"question": "I want to maximize a quadratic objective using the cuOpt Python QP API. When I pass sense=MAXIMIZE to setObjective, I get an error. What is the correct approach?",
"expected_skill": "cuopt-numerical-optimization-api-python",
"expected_script": null,
"ground_truth": "The cuOpt QP solver only supports MINIMIZE \u2014 MAXIMIZE is rejected for quadratic objectives. The correct workaround is to negate all coefficients in the objective and minimize the negated expression. For example, to maximize -0.04*x1*x1 - 0.02*x2*x2 (a concave quadratic with NSD Q), minimize 0.04*x1*x1 + 0.02*x2*x2 with sense=MINIMIZE. The resulting problem.ObjValue will be the negated maximum; multiply by -1 to recover the true maximum. All variables must remain CONTINUOUS \u2014 integer QP is not supported. The Q matrix of the original maximization problem must be negative semi-definite (NSD) for the problem to be concave and have a finite maximum; after negation it becomes PSD, which is what the solver expects. Maximizing a convex quadratic (positive coefficients) is unbounded and not a meaningful use case.",
"expected_behavior": [
"States QP only supports MINIMIZE \u2014 MAXIMIZE is rejected",
"Gives the correct workaround: negate all objective coefficients and use sense=MINIMIZE",
"Notes that problem.ObjValue will be negated and must be multiplied by -1 to get the true maximum",
"Reminds that all variables must be CONTINUOUS \u2014 integer QP is not supported",
"Does not suggest a non-existent MAXIMIZE_QP or similar invented API"
]
}
]
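
Two behaviours described in the ground truths above (the silent status-name mismatch in eval 002 and the negate-then-minimize workaround in eval 004) can be illustrated without cuOpt installed. The sketch below is not cuOpt code. `StubStatus` is a hypothetical stand-in enum that only mimics PascalCase member names, and the one-variable objective is an invented example solved in closed form rather than by any solver.

```python
from enum import Enum


class StubStatus(Enum):
    """Hypothetical stand-in for a solver status enum with PascalCase names."""
    Optimal = 1
    PrimalFeasible = 2
    Infeasible = 3


# LP success names as listed in the eval ground truth.
LP_SUCCESS_NAMES = ("Optimal", "PrimalFeasible")


def has_lp_solution(status: StubStatus) -> bool:
    """Case-sensitive membership test against the PascalCase names."""
    return status.name in LP_SUCCESS_NAMES


status = StubStatus.Optimal
buggy_check = status.name == "OPTIMAL"   # never matches, and raises no error
fixed_check = has_lp_solution(status)    # matches 'Optimal'


def maximize_concave_quadratic(a: float, b: float) -> tuple[float, float]:
    """Maximize f(x) = -a*x**2 + b*x (a > 0) by minimizing g(x) = -f(x)."""
    if a <= 0:
        raise ValueError("a must be positive so that f is concave")
    x_star = b / (2 * a)                       # stationary point of g(x) = a*x**2 - b*x
    min_of_negated = a * x_star**2 - b * x_star
    return x_star, -min_of_negated             # flip the sign to recover max f


x_star, f_max = maximize_concave_quadratic(a=0.5, b=4.0)
print(buggy_check, fixed_check, x_star, f_max)
```

The second half mirrors the sign bookkeeping the ground truth describes: the minimizer of the negated objective is also the maximizer of the original, and the reported objective value has to be multiplied by -1.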
28 changes: 14 additions & 14 deletions skills/cuopt-numerical-optimization-api-python/skill-card.md
@@ -7,9 +7,9 @@ This skill is ready for commercial/non-commercial use. <br>
NVIDIA <br>

### License/Terms of Use: <br>
Apache-2.0 <br>
Apache 2.0 <br>
## Use Case: <br>
Developers and engineers use this skill to formulate and solve linear programming (LP), mixed-integer linear programming (MILP), and quadratic programming (QP) optimization problems using NVIDIA cuOpt's GPU-accelerated Python API. <br>
Developers and engineers solving linear, mixed-integer, and quadratic programming problems using NVIDIA cuOpt's GPU-accelerated Python API for scheduling, portfolio optimization, production planning, and least-squares fitting. <br>

### Deployment Geography for Use: <br>
Global <br>
@@ -19,25 +19,25 @@ Risk: Review before execution as proposals could introduce incorrect or misleadi
Mitigation: Review and scan skill before deployment. <br>

## Reference(s): <br>
- [QP Examples (least-squares, maximization workaround, matrix form)](references/qp_examples.md) <br>
- [cuOpt User Guide](https://docs.nvidia.com/cuopt/user-guide/latest/introduction.html) <br>
- [cuOpt Examples](https://github.com/NVIDIA/cuopt-examples) <br>
- [QP Examples Reference](references/qp_examples.md) <br>
- [cuOpt Examples Repository](https://github.com/NVIDIA/cuopt-examples) <br>


## Skill Output: <br>
**Output Type(s):** [Code, API Calls, Analysis] <br>
**Output Format:** [Python code with inline solver output] <br>
**Output Type(s):** [Code, API Calls] <br>
**Output Format:** [Python code with inline solver configuration] <br>
**Output Parameters:** [1D] <br>
**Other Properties Related to Output:** [None] <br>

## Evaluation Agents Used: <br>
- claude-code <br>
- codex <br>
- `claude-code` <br>
- `codex` <br>



## Evaluation Tasks: <br>
Evaluated against 1 task with 2 attempts per agent; pass threshold 50%. NVSkills-Eval profile: external. <br>
Evaluated against 4 evaluation tasks (NVSkills-Eval external profile, astra-sandbox environment, 1 attempt per task). <br>

## Evaluation Metrics Used: <br>
Reported benchmark dimensions: <br>
@@ -61,11 +61,11 @@ Underlying evaluation signals used in this run: <br>
## Evaluation Results: <br>
| Dimension | Num | `claude-code` | `codex` |
|---|---:|---:|---:|
| Security | 2 | 100% (+0%) | 100% (+0%) |
| Correctness | 2 | 100% (+0%) | 82% (+5%) |
| Discoverability | 2 | 100% (+0%) | 84% (+5%) |
| Effectiveness | 2 | 79% (-1%) | 40% (-9%) |
| Efficiency | 2 | 93% (-0%) | 77% (+1%) |
| Security | 4 | 100% (+0%) | 100% (+0%) |
| Correctness | 4 | 65% (+29%) | 64% (+8%) |
| Discoverability | 4 | 50% (+44%) | 44% (+25%) |
| Effectiveness | 4 | 66% (+17%) | 56% (+3%) |
| Efficiency | 4 | 61% (+37%) | 44% (+17%) |

## Skill Version(s): <br>
26.08.00 (source: frontmatter) <br>