{smcl}
{* *! version 0.0.6 22mar2024}{...}
{vieweralsosee "[R] estat classification" "mansection R estat_classification"}{...}
{vieweralsosee "" "--"}{...}
{viewerjumpto "Syntax" "predictit##syntax"}{...}
{viewerjumpto "Description" "predictit##description"}{...}
{viewerjumpto "Options" "predictit##options"}{...}
{viewerjumpto "Examples" "predictit##examples"}{...}
{viewerjumpto "Additional Information" "predictit##additional"}{...}
{viewerjumpto "Contact" "predictit##contact"}{...}
{title:Generating and Managing Model Predictions for Cross-Validation in Stata}

{marker syntax}{...}
{title:Syntax}

{p 8 32 2}
{cmd:predictit} {it:"estimation command"} {cmd:,} {cmdab:ps:tub(}{it:string asis}{cmd:)}
[{cmdab:spl:it(}{it:varname}{cmd:)} {cmdab:c:lasses(}{it:integer}{cmd:)}
{cmdab:kf:old(}{it:integer}{cmd:)} {cmdab:thr:eshold(}{it:real}{cmd:)}
{cmdab:mod:ifin(}{it:string asis}{cmd:)} {cmdab:kfi:fin(}{it:string asis}{cmd:)}
{cmdab:noall} {cmdab:pm:ethod(}{it:string asis}{cmd:)}
{cmdab:po:pts(}{it:string asis}{cmd:)}]{p_end}

{synoptset 25 tabbed}{...}
{synoptline}
{synopthdr}
{synoptline}
{syntab:Required}
{synopt :{opt ps:tub}}a new variable name for predicted values{p_end}
{syntab:Optional}
{synopt :{opt spl:it}}name of the variable that identifies the training split(s){p_end}
{synopt :{opt c:lasses}}number of classes for classification models; default is {cmd:classes(0)}{p_end}
{synopt :{opt kf:old}}specifies the number of folds in the training set; default is {cmd:kfold(1)}.{p_end}
{synopt :{opt thr:eshold}}positive outcome threshold; default is {cmd:threshold(0.5)}{p_end}
{synopt :{opt mod:ifin}}a modified if expression for predictions on the individual folds{p_end}
{synopt :{opt kfi:fin}}a modified if expression for predictions on the entire training set{p_end}
{synopt :{opt noall}}suppresses prediction on entire training set for K-Fold cases{p_end}
{synopt :{opt pm:ethod}}predicted statistic from {help predict}; default is {cmd:pmethod(xb)} when {cmd:classes(0)} and {cmd:pmethod(pr)} otherwise{p_end}
{synopt :{opt po:pts}}options passed to {help predict} in addition to the method{p_end}
{synoptline}


{marker description}{...}
{title:Description}

INCLUDE help xvphase-predict

{marker options}{...}
{title:Options}

{dlgtab:Required}

{phang}
{opt ps:tub} is used to define a new variable name/stub for the predicted values
from the validation/test set.  When K-Fold cross-validation is used, this
option defines the name of the variable containing the predicted values from
each of the folds and will be used as a variable stub to store the results from
fitting the model to all of the training data.


{dlgtab:Optional}

{phang}
{opt spl:it} must contain the name of the variable that stores the training,
validation, and test splits.  There will only be a single variable if the splits
were created using {help splitit}.  Additionally, if you are passing an
estimation command string as the first argument to this command, you must
pass the split variable name to this option.

{phang}
{opt c:lasses} is used to distinguish between models of non-categorical data
({opt c:lasses(0)}), binary data ({opt c:lasses(2)}), and multinomial/ordinal
data ({opt c:lasses(>= 3)}).  You only need to pass an argument to this
option if you are using some form of classification model.  Internally, it
is used to determine whether to call {help predict} (in the case of
{opt c:lasses(0)}) or {help classify} (in all other cases).

{phang}
{opt kf:old} defines the number of K-Folds used for the training set.  Elsewhere,
we refer to K-Fold cross-validation in its more common form, where the
training set consists of multiple subsets of data.  However, standard
train/test and train/validation/test splits are simply a special case of K-Fold
cross-validation where there is only a single fold.
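
{pmore}
The general K-Fold logic can be sketched with built-in Stata commands.  This is only an illustration of the idea, not the code that {cmd:predictit} runs; the variable names {cmd:fold}, {cmd:cvpred}, and {cmd:tmp} and the seed are made up for this example.  For each fold, the model is fitted to the other folds and predictions are stored for the held-out fold:{p_end}
{pmore2}{cmd:. webuse lbw, clear}{p_end}
{pmore2}{cmd:. set seed 12345}{p_end}
{pmore2}{cmd:. generate byte fold = runiformint(1, 5)}{p_end}
{pmore2}{cmd:. generate double cvpred = .}{p_end}
{pmore2}{cmd:. forvalues k = 1/5 {c -(}}{p_end}
{pmore2}{cmd:.     quietly logit low age smoke if fold != `k'}{p_end}
{pmore2}{cmd:.     quietly predict double tmp if fold == `k', pr}{p_end}
{pmore2}{cmd:.     quietly replace cvpred = tmp if fold == `k'}{p_end}
{pmore2}{cmd:.     drop tmp}{p_end}
{pmore2}{cmd:. {c )-}}{p_end}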

{phang}
{opt thr:eshold} defines the probability cutoff used to determine a positive
classification for binary response models.  This value functions the same way
as it does in the case of {help estat_classification:estat classification}.
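
{pmore}
As a rough illustration of what the cutoff does, the following sketch uses only built-in Stata commands and is not the internals of {cmd:predictit}; the variable names {cmd:phat} and {cmd:yhat} are made up for this example.  An observation is classified as positive when its predicted probability is at least the threshold:{p_end}
{pmore2}{cmd:. webuse lbw, clear}{p_end}
{pmore2}{cmd:. logit low age smoke}{p_end}
{pmore2}{cmd:. predict double phat, pr}{p_end}
{pmore2}{cmd:. generate byte yhat = (phat >= 0.5) if !missing(phat)}{p_end}
{pmore2}{cmd:. estat classification, cutoff(0.5)}{p_end}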

{phang}
{opt mod:ifin} is the if expression to use for the predictions on the individual
K-Folds.  As a reminder, an 80/20 train/test split is a special case of K-Fold
cross-validation with a single K-Fold.  This option can be used in lieu of
passing the estimation command string.  If an estimation command string is also
passed, the value passed to this parameter will be overwritten by the value
generated by the internal call to {help cmdmod}.  If {help fitit} was called
prior to this command, you can pass {cmd:`macval(e(predifin))'} to this
parameter to provide the modified if expression for predictions.

{phang}
{opt kfi:fin} is the if expression used to generate predictions on the entire
training set when using K-Fold cross-validation.  This is typically used at the
conclusion of hyperparameter tuning to provide an assessment of the model fit
when fitted to all of the K-Folds simultaneously, prior to evaluating the
performance on a test set.

{phang}
{opt noall} suppresses predictions from the model fitted to the entire
training set when using K-Fold cross-validation.  If this option is
specified, {opt kfi:fin} has no effect because the relevant predictions are
not generated.

{phang}
{opt pm:ethod} is passed internally to Stata's {help predict} command to
generate the predicted values of the outcome for the out-of-sample data. The
default value used by {cmd:predictit} depends on the value passed to the
{opt c:lasses} option.  When option {opt c:lasses} is set to 0 the prediction
method will default to {opt xb}; in all other instances, the prediction method
will default to {opt pr}.
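
{pmore}
For reference, the two default statistics correspond to the following direct calls to {help predict}.  This is a sketch with built-in commands only; the new variable names {cmd:bwt_xb} and {cmd:low_pr} are made up for this example:{p_end}
{pmore2}{cmd:. webuse lbw, clear}{p_end}
{pmore2}{cmd:. regress bwt age smoke}{p_end}
{pmore2}{cmd:. predict double bwt_xb, xb}{p_end}
{pmore2}{cmd:. logit low age smoke}{p_end}
{pmore2}{cmd:. predict double low_pr, pr}{p_end}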

{phang}
{opt po:pts} is passed internally to Stata's {help predict} command to
generate the predicted values of the outcome for the out-of-sample data. For
multivariate outcome models, like {help sureg}, this option can be used to
specify which of the equations should be used to predict the outcome of interest.
It can also be used to specify the {opt nooff:set} option in single or
multi-equation models.  Consult the {help predict} documentation for the model
you are fitting for more details.
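
{pmore}
As a sketch of the multi-equation case, the direct {help predict} call below selects one equation after {help sureg}.  It uses built-in commands only, and the variable name {cmd:mpg_hat} is made up for this example.  Based on the description above, {cmd:equation(mpg)} is the kind of string that would be supplied through {opt po:pts}, but that usage is an assumption and is not shown in this help file:{p_end}
{pmore2}{cmd:. sysuse auto, clear}{p_end}
{pmore2}{cmd:. sureg (price foreign weight length) (mpg foreign weight)}{p_end}
{pmore2}{cmd:. predict double mpg_hat, equation(mpg) xb}{p_end}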

{marker examples}{...}
{title:Examples}


{p 4 4 2}Load example data{p_end}
{p 8 4 2}{stata webuse lbw, clear}{p_end}
{p 4 4 2}Create a variable to identify the sample used to fit the model{p_end}
{p 8 4 2}{stata g byte touse = runiformint(1, 6)}{p_end}
{p 4 4 2}Fit a model to the data{p_end}
{p 8 4 2}{stata fitit "logit low age smoke", spl(touse) kf(5) res(lmod)}{p_end}
{p 4 4 2}Generate predictions for the five training folds{p_end}
{p 8 4 2}{stata predictit, ps(pred) c(2) kf(5) mod(`macval(r(predifin))')}{p_end}
{p 4 4 2}Generate predicted values for the model fitted to the entire training set and the individual K-Folds{p_end}
{p 8 4 2}{stata predictit, ps(pred) c(2) kf(5) mod(`macval(r(predifin))') kfi(`macval(r(kfpredifin))')}{p_end}
{p 4 4 2}Alternative syntax for the previous two examples{p_end}
{p 8 4 2}{stata predictit "logit low age smoke", ps(pred) c(2) spl(touse) kf(5) noall}{p_end}
{p 8 4 2}{stata predictit "logit low age smoke", ps(pred) c(2) spl(touse) kf(5)}{p_end}

{marker additional}{...}
{title:Additional Information}
{p 4 4 8}If you have questions, comments, or find bugs, please submit an issue in the {browse "https://github.com/wbuchanan/crossvalidate":crossvalidate GitHub repository}.{p_end}

{marker contact}{...}
{title:Contact}
{p 4 4 8}William R. Buchanan, Ph.D.{p_end}
{p 4 4 8}Sr. Research Scientist, SAG Corporation{p_end}
{p 4 4 8}{browse "https://www.sagcorp.com":SAG Corporation}{p_end}
{p 4 4 8}wbuchanan at sagcorp [dot] com{p_end}

{p 4 4 8}Steven D. Brownell, Ph.D.{p_end}
{p 4 4 8}Economist, SAG Corporation{p_end}
{p 4 4 8}{browse "https://www.sagcorp.com":SAG Corporation}{p_end}
{p 4 4 8}sbrownell at sagcorp [dot] com{p_end}