Skip to content

MKLLUFactorization corrupts cache.A, breaking DefaultLinearSolver QR safety fallback #887

@ChrisRackauckas-Claude

Description

@ChrisRackauckas-Claude

Description

MKLLUFactorization performs LU factorization in-place on cache.A, overwriting the original matrix with L\U factors. When the factorization fails (singular matrix) and DefaultLinearSolver's safety fallback (line 564 of src/default.jl) triggers QRFactorization(ColumnNorm()), the QR operates on the corrupted L\U factors instead of the original matrix, producing a garbage solution.

The regular LUFactorization does NOT have this bug — it uses lu(A, check=false) which allocates a new object, leaving cache.A intact for the fallback.

Reproducer

using LinearSolve, LinearAlgebra

# Any singular matrix will trigger this
n = 20
J = randn(n, n)
J[19, :] .= J[18, :]  # make rank-deficient

b = randn(n)

# Direct QR gives the correct minimum-norm solution
sol_qr = solve(LinearProblem(copy(J), copy(b)), QRFactorization(ColumnNorm()))
println("QR directly: norm = $(norm(sol_qr.u))")  # correct

# DefaultLinearSolver on a system where MKL is available:
# LU fails → falls back to QR on corrupted A → wrong answer
sol_default = solve(LinearProblem(copy(J), copy(b)))
println("DefaultLinearSolver: norm = $(norm(sol_default.u))")  # wrong if MKL is used

# To verify the corruption directly:
cache = init(LinearProblem(copy(J), copy(b)))
A_before = copy(cache.A)
SciMLBase.solve!(cache, MKLLUFactorization())  # fails, but corrupts cache.A
println("cache.A corrupted: $(!(A_before  cache.A))")  # true
println("max diff: $(maximum(abs.(A_before - cache.A)))")  # large

Note: This reproducer requires MKL_jll to be loadable (i.e., LinearSolve.usemkl[] == true). The bug does not manifest when LUFactorization is used instead of MKLLUFactorization.

Impact

This was discovered while investigating why NonlinearSolve.TrustRegion() converges on Julia 1.12 but fails with MaxIters on Julia 1.11 for the same problem and same package versions (KiteModels.jl PR #261). The root cause:

  • Julia 1.11: MKL_jll loads → defaultalg selects MKLLUFactorization → LU fails on singular Jacobian → QR fallback gets corrupted matrix → garbage Newton step (norm=4889) → Dogleg trust region diverges
  • Julia 1.12: MKL_jll doesn't load → defaultalg selects LUFactorization → LU fails → QR fallback gets correct matrix → correct Newton step (norm=661.6) → solver converges

The Jacobian is rank-deficient by 2 (condition number ~10^33), which is inherent to the physics problem. NLsolve.jl handles this via Moré/MINPACK autoscaling; NonlinearSolve's TrustRegion relies on the linear solve being correct.

Bug Location

src/mkl.jl line ~303:

res = getrf!(A; ipiv = cacheval[1].ipiv, info = cacheval[2])

getrf! overwrites A in-place with L\U factors. Since A is cache.A (or an alias of it), the original matrix is destroyed.

src/default.jl lines 564-570 (the safety fallback):

if sol.retcode === ReturnCode.Failure && alg.safetyfallback
    sol = SciMLBase.solve!(cache, QRFactorization(ColumnNorm()), args...; kwargs...)
end

This reuses cache (with the now-corrupted cache.A) for the QR solve.

Suggested Fix

Either:

  1. Make MKLLUFactorization work on a copy of cache.A (matching LUFactorization's behavior), or
  2. Have the DefaultLinearSolver fallback path save/restore cache.A before calling the QR fallback, or
  3. Set cache.isfresh = true and restore cache.A from the original problem before the fallback

Option 1 is probably cleanest and consistent with how LUFactorization already works.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions