Description
MKLLUFactorization performs LU factorization in-place on cache.A, overwriting the original matrix with L\U factors. When the factorization fails (singular matrix) and DefaultLinearSolver's safety fallback (line 564 of src/default.jl) triggers QRFactorization(ColumnNorm()), the QR operates on the corrupted L\U factors instead of the original matrix, producing a garbage solution.
The regular LUFactorization does NOT have this bug — it uses lu(A, check=false) which allocates a new object, leaving cache.A intact for the fallback.
Reproducer
using LinearSolve, LinearAlgebra
# Any singular matrix will trigger this
n = 20
J = randn(n, n)
J[19, :] .= J[18, :] # make rank-deficient
b = randn(n)
# Direct QR gives the correct minimum-norm solution
sol_qr = solve(LinearProblem(copy(J), copy(b)), QRFactorization(ColumnNorm()))
println("QR directly: norm = $(norm(sol_qr.u))") # correct
# DefaultLinearSolver on a system where MKL is available:
# LU fails → falls back to QR on corrupted A → wrong answer
sol_default = solve(LinearProblem(copy(J), copy(b)))
println("DefaultLinearSolver: norm = $(norm(sol_default.u))") # wrong if MKL is used
# To verify the corruption directly:
cache = init(LinearProblem(copy(J), copy(b)))
A_before = copy(cache.A)
SciMLBase.solve!(cache, MKLLUFactorization()) # fails, but corrupts cache.A
println("cache.A corrupted: $(!(A_before ≈ cache.A))") # true
println("max diff: $(maximum(abs.(A_before - cache.A)))") # large
Note: This reproducer requires MKL_jll to be loadable (i.e., LinearSolve.usemkl[] == true). The bug does not manifest when LUFactorization is used instead of MKLLUFactorization.
Impact
This was discovered while investigating why NonlinearSolve.TrustRegion() converges on Julia 1.12 but fails with MaxIters on Julia 1.11 for the same problem and same package versions (KiteModels.jl PR #261). The root cause:
- Julia 1.11: MKL_jll loads →
defaultalg selects MKLLUFactorization → LU fails on singular Jacobian → QR fallback gets corrupted matrix → garbage Newton step (norm=4889) → Dogleg trust region diverges
- Julia 1.12: MKL_jll doesn't load →
defaultalg selects LUFactorization → LU fails → QR fallback gets correct matrix → correct Newton step (norm=661.6) → solver converges
The Jacobian is rank-deficient by 2 (condition number ~10^33), which is inherent to the physics problem. NLsolve.jl handles this via Moré/MINPACK autoscaling; NonlinearSolve's TrustRegion relies on the linear solve being correct.
Bug Location
src/mkl.jl line ~303:
res = getrf!(A; ipiv = cacheval[1].ipiv, info = cacheval[2])
getrf! overwrites A in-place with L\U factors. Since A is cache.A (or an alias of it), the original matrix is destroyed.
src/default.jl lines 564-570 (the safety fallback):
if sol.retcode === ReturnCode.Failure && alg.safetyfallback
sol = SciMLBase.solve!(cache, QRFactorization(ColumnNorm()), args...; kwargs...)
end
This reuses cache (with the now-corrupted cache.A) for the QR solve.
Suggested Fix
Either:
- Make
MKLLUFactorization work on a copy of cache.A (matching LUFactorization's behavior), or
- Have the DefaultLinearSolver fallback path save/restore
cache.A before calling the QR fallback, or
- Set
cache.isfresh = true and restore cache.A from the original problem before the fallback
Option 1 is probably cleanest and consistent with how LUFactorization already works.
Description
MKLLUFactorizationperforms LU factorization in-place oncache.A, overwriting the original matrix with L\U factors. When the factorization fails (singular matrix) andDefaultLinearSolver's safety fallback (line 564 ofsrc/default.jl) triggersQRFactorization(ColumnNorm()), the QR operates on the corrupted L\U factors instead of the original matrix, producing a garbage solution.The regular
LUFactorizationdoes NOT have this bug — it useslu(A, check=false)which allocates a new object, leavingcache.Aintact for the fallback.Reproducer
Note: This reproducer requires MKL_jll to be loadable (i.e.,
LinearSolve.usemkl[] == true). The bug does not manifest whenLUFactorizationis used instead ofMKLLUFactorization.Impact
This was discovered while investigating why
NonlinearSolve.TrustRegion()converges on Julia 1.12 but fails withMaxIterson Julia 1.11 for the same problem and same package versions (KiteModels.jl PR #261). The root cause:defaultalgselectsMKLLUFactorization→ LU fails on singular Jacobian → QR fallback gets corrupted matrix → garbage Newton step (norm=4889) → Dogleg trust region divergesdefaultalgselectsLUFactorization→ LU fails → QR fallback gets correct matrix → correct Newton step (norm=661.6) → solver convergesThe Jacobian is rank-deficient by 2 (condition number ~10^33), which is inherent to the physics problem. NLsolve.jl handles this via Moré/MINPACK autoscaling; NonlinearSolve's TrustRegion relies on the linear solve being correct.
Bug Location
src/mkl.jlline ~303:getrf!overwritesAin-place with L\U factors. SinceAiscache.A(or an alias of it), the original matrix is destroyed.src/default.jllines 564-570 (the safety fallback):This reuses
cache(with the now-corruptedcache.A) for the QR solve.Suggested Fix
Either:
MKLLUFactorizationwork on a copy ofcache.A(matchingLUFactorization's behavior), orcache.Abefore calling the QR fallback, orcache.isfresh = trueand restorecache.Afrom the original problem before the fallbackOption 1 is probably cleanest and consistent with how
LUFactorizationalready works.