Refactor Sem multi-group support #317
alyst wants to merge 32 commits into StructuralEquationModels:devel from
Conversation
- for SemImplied require spec::SemSpec as positional
- for SemLossFunction require implied argument
@Maximilian-Stefan-Ernst It might be a nice idea to use Copilot for catching typos and incorrect sentences, but also potential bugs.
deduplicate the correction scale methods and move to Sem.jl
remove update_observed!()
to suppress info about inv(obs_cov)
Thank you a lot for those changes, @alyst! I have a few high level points before I review in detail:
Let me know what you think of that!
@Maximilian-Stefan-Ernst Thank you for the review! I think these are very valid points.
Internally, it is easy to implement. I'm not sure about the user-facing API. The current approach is to pass only the types of the objects and construct them using the keyword parameters that are broadcast to all constructed elements of the SEM.

```julia
@SEM(
    # implied definitions
    [:implied1 => RAM(...),
     :implied2 => RAMSymbolic(...)
    ],
    # loss term definitions
    [:loss1 => SemML(:implied1, ...),   # instead of passing the RAM() object directly,
     :loss2 => SemFIML(:implied1, ...), # reusing the same implied object
     :loss3 => SemWLS(:implied2, ...),
     ...
    ],
)
```

It will expand into code that first builds the RAM objects, then substitutes their references in the loss term construction with the actual implied objects, and finally constructs the SEM from the loss objects. I might have overlooked the implied-object sharing, because I am not using this feature myself (I was more focused on multi-group and regularization -- the implied objects share some parameters, but are not identical).
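For illustration, the expansion described above might look roughly like the following sketch; all names and constructor signatures are illustrative, not the actual generated code:

```julia
# Hypothetical expansion of the @SEM block above; not actual macro output.
implied1 = RAM(...)             # shared kwargs would be broadcast into each ctor
implied2 = RAMSymbolic(...)

# the implied ids are substituted with the actual objects
loss1 = SemML(implied1, ...)
loss2 = SemFIML(implied1, ...)  # reuses the same implied object as loss1
loss3 = SemWLS(implied2, ...)

Sem(loss1, loss2, loss3)
```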
This case actually highlights one of the issues that I wanted to address. For the bootstrap, … For the broader updates that change the model structure or the configuration of individual elements, …
Ah, another consideration about sharing the implied term -- as we discussed, RAMSymbolic has to be
calls replace_observed() for the underlying term
the kwarg specifies whether to recalculate weights
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##            devel     #317      +/-   ##
==========================================
+ Coverage   71.83%   72.70%    +0.86%
==========================================
  Files          51       57        +6
  Lines        2223     2469      +246
==========================================
+ Hits         1597     1795      +198
- Misses        626      674       +48
```

View full report in Codecov by Sentry.
@Maximilian-Stefan-Ernst I've added the kwargs mechanism, and specifically I've also fixed the observed initialization for ensemble SEMs, which fixed the CFI failure, so now all the tests pass.
```julia
function replace_observed(loss::SemLoss, data::Union{AbstractMatrix, DataFrame}; kwargs...)
    old_obs = SEM.observed(loss)
    new_observed =
        typeof(old_obs).name.wrapper(data = data, observed_vars = observed_vars(old_obs))
    return replace_observed(loss, new_observed; kwargs...)
end

# non-SEM loss terms are unchanged
replace_observed(loss::AbstractLoss, ::Any; kwargs...) = loss

# LossTerm: delegate to inner loss
replace_observed(term::LossTerm, data; kwargs...) =
    LossTerm(replace_observed(loss(term), data; kwargs...), id(term), weight(term))
```
Maybe we can define these defaults only for losses in the package. If someone implements a new loss, they would have to define a method for replace_observed - this adds some overhead, but would act as a safeguard against someone calling it on their own loss term (that might need updating) and getting wrong results.
Those are nice ideas!
The default replace_observed(loss::SemLoss, ...) (L60) actually just calls the constructor for that SemLoss, which probably should take care of all loss-specific updates in most situations.
That's why I thought it is a good default.
But if you think replace_observed() might require special logic more often, I can update the PR:

- L60 would be renamed to default_replace_observed(loss::SemLoss)
- a fallback replace_observed(loss::SemLoss) will throw a message instructing to implement replace_observed() for the concrete type, and suggesting SEM.default_replace_observed() as the candidate in simple cases
- for SEM.jl losses that don't need special handling (SemML etc.), replace_observed(loss::SemML, ...) = default_replace_observed(loss, ....)
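A minimal sketch of that fallback pattern, assuming the constructor-rebuilding approach of the replace_observed method quoted earlier in this thread (the exact signatures are assumptions, not the actual SEM.jl code):

```julia
# Hypothetical sketch of the proposed split; not actual SEM.jl code.

# Fallback: rebuild the loss via its own constructor, mirroring the
# typeof(old_obs).name.wrapper pattern quoted above. The ctor shape is assumed.
default_replace_observed(loss::SemLoss, new_observed::SemObserved; kwargs...) =
    typeof(loss).name.wrapper(new_observed, implied(loss); kwargs...)

# The generic method refuses to guess and points custom-loss authors to
# the fallback instead of silently producing a possibly wrong result.
replace_observed(loss::SemLoss, new_observed::SemObserved; kwargs...) = error(
    "replace_observed() is not implemented for $(typeof(loss)). " *
    "Implement it for your loss type, or forward to SEM.default_replace_observed() " *
    "if constructor-based rebuilding is sufficient.",
)

# Package losses that need no special handling opt in explicitly:
replace_observed(loss::SemML, new_observed::SemObserved; kwargs...) =
    default_replace_observed(loss, new_observed; kwargs...)
```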
That sounds great!
One thing: I think for SemML we actually can/should have a specialized method, because nothing has to be updated if the observed variables don't change (which is now disallowed), and hessianeval should stay the same for the new model - I'll make a comment.
```diff
@@ -225,22 +235,3 @@ function non_posdef_return(par)
     return typemax(eltype(par))
 end
 end
```
Suggested change:

```julia
    end
end
replace_observed(loss::SemML, ::SemObserved; kwargs...) = loss
```
Actually, all SEM loss terms (<:SemLoss) have to reference observed and implied.
Since replace_observed() replaces the observed object in the loss object, we have to create a new loss object, so the default replace_observed(loss::SemLoss, new_observed::SemObserved) method defined in abstract.jl should be fine here.
The semantics of the replace_observed(model::Sem) is to create a copy of the SEM model.
Since this is a public method, it could be called in different contexts, and allowing different models to share the same subobjects can create problems (e.g. SemML can have its own internal state, preallocated matrices etc., so updating the parameters of the old and the new models may result in unexpected behaviour).
In general replace_observed() should not return the same object.
The idea was to keep it simple at the expense of some potential overhead for the bootstrap use case (many matrices created for the bootstrap copies of the model), but Julia's GC should handle it well, and intermediate matrices are created by evaluate!() in any case.
Ah yes, my code suggestion does not make sense.
Re sharing subobjects
this is something we cannot really avoid at the moment, because after calling replace_observed on a model, the implied types share memory, and therefore it is not safe to fit models created this way in parallel. I think there are two options:

- copying everything when calling replace_observed, including the implied objects
- copying as little as possible and expecting the user to deepcopy models if they want to fit them in parallel.

I prefer option 2, because it gives more fine-grained control and is better for (serial) bootstrap.
Re SemML specifically
In line with the above, I would prefer having SemML not copy its internal matrices, and also keep the hessianeval from the original model.
I think the design issue here is that at the moment there is no separation between storing the SEM model definition and storing the transient state required to calculate covariance matrices, objectives, gradients etc.
I.e. the fix would be to store all intermediate matrices in SemImpliedState subtypes (managed by the specific subtype of SemImplied, e.g. RAMSymbolicState managed by RAMSymbolic etc.), keeping SemImplied objects and their fields truly immutable.
The same for SemLoss subtypes.
There could be pools of these states, and for calculations evaluate!() would acquire the object from the pool as needed, and release it back upon completion.
So semb = replace_observed(sema, newobserved) would create a shallow copy of sema that shares its implied terms (and their state pools) with semb;
it would also create new loss terms that differ from the old ones only by the newobserved, but share the pools of internal states with sema etc.
Otherwise I don't see how to cleanly manage it in the long run.
It was less of a problem before bootstrap became a more prominent feature, but for bootstrap with its potentially parallelized computations managing states becomes important.
Having said that, I have implemented similar approaches in the past -- they reduce the pressure on the GC in terms of the number of allocated objects and the amount of memory.
But it does not result in dramatic (2x) performance improvements, because Julia's GC essentially does the same thing as the proposed design, just at a lower level:
even if replace_observed(sema) deep-copies sema, the copy is short-lived and gets efficiently collected once it is no longer used.
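A minimal sketch of such a state pool; ImpliedStatePool, acquire!, and release! are illustrative names, not SEM.jl API, and a Channel is just one convenient thread-safe container choice:

```julia
# Hypothetical sketch of a pool of transient evaluation states.
# An unbounded Channel provides thread-safe take!/put! semantics.
struct ImpliedStatePool{S}
    pool::Channel{S}
    make::Function   # allocates a fresh state when the pool is empty
end

ImpliedStatePool(make::Function, ::Type{S}) where {S} =
    ImpliedStatePool{S}(Channel{S}(Inf), make)

# Reuse a pooled state if one is available, otherwise allocate a new one.
acquire!(p::ImpliedStatePool) = isready(p.pool) ? take!(p.pool) : p.make()

# Return the state to the pool for later reuse.
release!(p::ImpliedStatePool{S}, s::S) where {S} = (put!(p.pool, s); nothing)
```

evaluate!() could then wrap its computation as `state = acquire!(pool); try ... finally release!(pool, state) end`, which keeps concurrent bootstrap fits from stepping on each other's buffers.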
Thanks a lot again, @alyst! Once the last open comments are resolved, I would be happy to merge. I tried to activate Copilot, but since this repo is part of an organization, it seems to me like I would have to get Copilot for Business, which is quite expensive. Not sure if there is another way.
```julia
_subtype_info(::Sem) = ""
_subtype_info(::SemFiniteDiff) = " : Finite Difference Approximation"

function Base.show(io::IO, sem::AbstractSem)
```
Some general considerations for show(sem::AbstractSem):

- Julia recommends making show() as close to code as possible, with a more verbose human-readable version implemented by show(io::IO, ::MIME"text/plain", ::T). This show() version does not show the actual type info, which I think is important for debugging etc. Given that there are also details() methods, maybe it's better to make show() more "coder-friendly" with more type information etc., and details() more "reader-friendly" or "math-friendly" with more details and more natural language. Also, we can define show(io::IO, ::MIME"text/plain", sem) = details(io, sem).
- in line with that, _subtype_info() could be used for details(), and show() could just show the actual type
- it would be nice to show the total number of parameters (nparams()). I think it is an important aspect of the model that affects computations.
- loss terms are printed as "Loss functions". I am personally more in favor of the "term" terminology, because it matches their role in the scalar SEM objective (elements of a sum), whereas "functions" suggests that there is some vector-based multi-objective.
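The two-method convention described above can be sketched for a toy type (ToySem and its fields are illustrations, not SEM.jl types):

```julia
# Toy illustration of the terse-vs-verbose show() convention.
struct ToySem
    nparams::Int
end

# Terse, code-like form: used by print(), string interpolation, and show(obj).
Base.show(io::IO, sem::ToySem) = print(io, "ToySem(", sem.nparams, ")")

# Verbose, human-readable form: what the REPL display uses.
function Base.show(io::IO, ::MIME"text/plain", sem::ToySem)
    println(io, "Structural Equation Model: ToySem")
    print(io, "  parameters: ", sem.nparams)
end
```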
Okay, as I understand it, the output of show(io::IO, ::MIME"text/plain", ::T) is what gets printed to the REPL. What is the actual use case of the other show method then?

I think I would keep the name Loss functions, since it is a widely used name in numerical optimization. Wikipedia writes:

> In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event.

so I think it should be clear that the outputs of the individual loss functions are real numbers.
> the output of show(io::IO, ::MIME"text/plain", ::T) is getting printed to the REPL. What is the actual use case of the other show method then?
Yes, the REPL uses show(io::IO, ::MIME"text/plain", ::T), but if you call show(obj) explicitly, it outputs the more terse, code-compatible version, and print(obj) uses it as well:

```julia
julia> a = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> show(a)
[1 2; 3 4]
julia> @show a;
a = [1 2; 3 4]

julia> print(a)
[1 2; 3 4]

julia> b = Dict(:y => 2, :x => 1)
Dict{Symbol, Int64} with 2 entries:
  :y => 2
  :x => 1

julia> show(b)
Dict(:y => 2, :x => 1)
```

So whenever the user programmatically prints the SEM object, it will use the default show().
I am not advocating for a full "code-like" output, but most show() implementations, including the text/plain variants, provide the actual type of the object.
It is very helpful, because it allows the user to search the documentation.
> I think I would keep the name Loss functions since it is a widely used name in numerical optimization.
I think we have full alignment that what is being printed are "loss functions" individually.
In fact, when "function" is obvious from the context, it is often dropped (as is also done in the Wikipedia article).
My suggestion to use "terms" is to:

- highlight that these are the components of the Sem objective function, which is a weighted sum of loss functions -- rather than just a collection of loss functions
- align with the API, because it already uses LossTerm, sem_terms(), loss_terms() etc.
* rename get_fields!() into build_sem_terms() for clarity
* move set_field_type!() code into the Sem() ctor since it's not used outside
@Maximilian-Stefan-Ernst Thanks for the thorough review, as always!
I think you are right. Unfortunately, it looks like there is no way to just enable Copilot for the repository to let third-party collaborators with a Copilot subscription use it.
This is the largest remaining part of #193, which changes some interfaces.
Refactoring of the SEM types
- `AbstractLoss` is the base type for all loss functions
- `SemLoss{O,I} <: AbstractLoss` is the base type for all SEM losses; it now requires `observed::O` and `implied::I` fields
- the `SemLoss` ctor should always be given observed and implied (positional); the `meanstructure` keyword is gone -- a loss should always respect the implied specification
- `LossTerm` is a thin wrapper around `AbstractLoss` that adds an optional id of the loss term and an optional weight
- `Sem` is a container of `LossTerm` objects (accessible via `loss_terms(sem)` or `loss_term(sem, id)`), so it can handle multiple SEM terms (accessible via `sem_terms(sem)` -- a subset of `loss_terms(sem)` -- or `sem_term(sem, id)`). It replaces both the old `Sem` and `SemEnsemble`. `AbstractSingleSem`, `AbstractSemCollection` and `SemEnsemble` are gone.

Method changes

- Multi-term SEMs could be created like:
- Or with a weights specification:
- The new `Sem()` and loss-term constructors rely less on keyword arguments and more on positional arguments, but some keyword support is present.
- `update_observed!()` was removed. It was only used by `replace_observed()`, but otherwise in-place model modification with unclear semantics is error-prone.
- `replace_observed(sem, data)` was simplified by removing the support for additional keywords and the requirement to pass a SEM specification. It only creates a copy of the given `Sem` with the observed data replaced, but the implied and loss definitions intact. Changing observed vars is not supported -- that is something use-case specific that the user should implement in their own code.
- `check_single_lossfun()` was renamed into `check_same_semterm_type()`, as that better describes what it does. If the check is successful, it returns the specific subtype of `SemLoss`.
- `bootstrap()` and `se_bootstrap()` use the `bootstrap!(acc::BootstrapAccumulator, ...)` function to reduce code duplication
- `bootstrap()` returns `BootstrapResult{T}` for better type inference
- `fit_measures()` now also accepts a vector of functions, and includes `CFI` by default (the `DEFAULT_FIT_MEASURES` constant)
- `test_fitmeasures()` was tweaked to handle more repetitive code: calculating the subset of fit measures, comparing this subset against lavaan refs, and checking for measures that could not be applied to the given loss types (`SemWLS`).
bootstrap()returnsBootstrapResult{T}for better type inferencefit_measures()now also accepts vector of functions, and includesCFIby default (DEFAULT_FIT_MEASURESconstant)test_fitmeasures()was tweaked to handle more repetitive code: calculating the subset of fit measures, and compairing this subset against lavaan refs, checking for measures that could not be applied to given loss types (SemWLS).