Predictive Modeling Using Logistic Regression: Course Notes

Predictive Modeling Using Logistic Regression Course Notes was developed by William J. E.

What is the cost ratio that optimizes both sensitivity and specificity? One large bin would give a constant logit. The log of the p-value is calculated in order to produce a more visually appealing graph. The output object also has information on the proportion of variation explained by the clusters and the maximum second eigenvalue in a cluster. The logit transformation is used to constrain the posterior probability to be between zero and one. The more the distributions overlap, the weaker the model. The output activation function transforms its inputs so that the predictions are on an appropriate scale.

The model was overly sensitive to peculiarities of the particular training data, in addition to the true features of the joint distribution. The more flexible the underlying model and the less plentiful the data, the more of a problem overfitting is. When a relatively inflexible model like linear logistic regression is fitted to massive amounts of data, overfitting may not be a problem (Hand). However, the chance of overfitting is increased by variable selection methods and by supervised input preparation such as collapsing the levels of nominal variables based on associations with the target.

It is prudent to assume overfitting until proven otherwise. Large differences between the performance on the training and test sets usually indicate overfitting. The model is fit to the remainder (the training data set) and performance is evaluated on the holdout portion (the test data set). Usually from one-fourth to one-half of the development data is used as a test set (Picard and Berk). After assessment, it is common practice to refit the final model on the entire undivided data set.

When the holdout data is used for comparing, selecting, and tuning models, and the chosen model is assessed on the same data set that was used for comparison, then the optimism principle again applies. In this situation, the holdout sample is more correctly called a validation data set, not a test set. The test set is used for a final assessment of a fully specified classifier (Ripley). If model tuning and a final assessment are both needed, then the data should be split three ways into training, validation, and test sets.

When data is scarce, it is inefficient to use only a portion for training. Furthermore, when the test set is small, the performance measures may be unreliable because of high variability. For small and moderate data sets, k-fold cross-validation (Breiman et al.) is an attractive alternative. In 5-fold cross-validation, for instance, the data would be split into five equal sets.

The entire modeling process would be redone on each four-fifths of the data, using the remaining one-fifth for assessment. The five assessments would then be averaged. In this way, all the data is used for both training and assessment. Another approach that is frugal with the data is to assess the model on the same data set that was used for training but to penalize the assessment for optimism (Ripley). The appropriate penalty can be determined theoretically or by using computationally intensive methods such as the bootstrap.
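The resampling loop described above can be sketched in a few lines. This is a generic Python illustration, not the SAS implementation; the `fit` and `assess` callables are placeholders for the entire modeling and assessment process, which must be redone on each fold.

```python
import random

def cross_validate(data, fit, assess, k=5, seed=1):
    """Illustrative k-fold cross-validation: redo the whole modeling
    process on each (k-1)/k of the data, assess on the held-out fold,
    and average the k assessments."""
    random.seed(seed)
    idx = list(range(len(data)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = fit(train)              # entire modeling process on k-1 folds
        scores.append(assess(model, test))  # assessment on the held-out fold
    return sum(scores) / k
```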

The model and all the input preparation steps then need to be redone on the training set. The validation data will be used for assessment. Consequently, it needs to be treated as if it were truly new data where the target is unknown.

The results of the analysis on the training data need to be applied to the validation data, not recalculated. Several input-preparation steps can be done before the data is split.

Creating missing indicators should be done on the full development data because the results will not change. The offset variable is also created before the data is split to get the best estimate of the proportion of events. The variable U is created using the RANUNI function, which generates pseudo-random numbers from a uniform distribution on the interval (0,1). Using a particular seed greater than zero will produce the same split each time the DATA step is run.

If the seed were zero, then the data would be split differently each time the DATA step was run. This example presents a cautious approach and does not involve the validation data at all. Since the slope estimates are not affected by the offset variable, the offset variable is not needed.
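A seeded uniform split along these lines can be sketched in Python. This is illustrative only: `random.Random` stands in for RANUNI, and the seed value is arbitrary.

```python
import random

def split_data(cases, seed=27513, train_frac=0.5):
    """Assign each case to training or validation by drawing a
    pseudo-random uniform number; a fixed seed > 0 reproduces the
    same split on every run (a seed of 0 would not)."""
    rng = random.Random(seed)
    train, valid = [], []
    for case in cases:
        (train if rng.random() < train_frac else valid).append(case)
    return train, valid
```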

This level should be chosen based on the model's performance on the validation data set. This reduction is less than the results of the backward elimination method on weighted data. This illustrates that the significance level for SLSTAY should not be chosen arbitrarily, but rather by trial and error based on the model's performance on the validation data set.

The offset variable is needed to correctly adjust the intercept. Array X contains the variables with missing values and array MED contains the variables with the medians. The DO loop replaces the missing values with the medians. An allocation rule corresponds to a threshold value cutoff of the posterior probability. For example, all cases with probabilities of default greater than.
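The array-and-DO-loop median replacement can be sketched in Python. This is illustrative only, not the SAS DATA step; the column name used in the example is hypothetical.

```python
def impute_medians(rows, cols):
    """Replace missing values (None) in each listed column with that
    column's median, computed from the nonmissing values."""
    meds = {}
    for c in cols:
        vals = sorted(r[c] for r in rows if r[c] is not None)
        n = len(vals)
        meds[c] = (vals[(n - 1) // 2] + vals[n // 2]) / 2  # median
    for r in rows:                    # the DO-loop replacement step
        for c in cols:
            if r[c] is None:
                r[c] = meds[c]
    return meds
```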

For a given cutoff, how well does the classifier perform? The fundamental assessment tool is the confusion matrix. The confusion matrix is a crosstabulation of the actual and predicted classes. It quantifies the confusion of the classifier. The event of interest, whether it is unfavorable (like fraud, churn, or default) or favorable (like response to an offer), is often called a positive, although this convention is arbitrary.

Ideally, one would like large values of all these statistics. The context of the problem determines which of these measures is the primary concern.
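The statistics derived from a confusion matrix can be sketched as follows (a minimal Python illustration):

```python
def confusion_stats(tp, fp, fn, tn):
    """Summary statistics from a confusion matrix (actual x predicted)."""
    return {
        "sensitivity": tp / (tp + fn),  # proportion of actual positives detected
        "specificity": tn / (tn + fp),  # proportion of actual negatives detected
        "ppv": tp / (tp + fp),          # positive predicted value
        "npv": tn / (tn + fn),          # negative predicted value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }
```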

In contrast, a fraud investigator might be most concerned with sensitivity because it gives the proportion of frauds that would be detected. The ROC curve displays the sensitivity and specificity for the entire range of cutoff values. As the cutoff decreases, more and more cases are allocated to class 1; hence, the sensitivity increases and the specificity decreases. As the cutoff increases, more and more cases are allocated to class 0; hence, the sensitivity decreases and the specificity increases.

Consequently, the ROC curve passes through (0,0) and (1,1). If the posterior probabilities were arbitrarily assigned to the cases, then the ratio of false positives to true positives would be the same as the ratio of the total actual negatives to the total actual positives.

Consequently, the baseline random model is a 45-degree line through the origin. As the ROC curve bows above the diagonal, the predictive power increases. A perfect model would reach the (0,1) point, where both sensitivity and specificity equal 1.
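The construction of the ROC curve by sweeping the cutoff from high to low can be sketched as follows (illustrative Python; ties among scores are handled naively here):

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) at every cutoff, sweeping from the
    highest score down: as the cutoff falls, sensitivity rises and
    specificity falls."""
    pairs = sorted(zip(scores, labels), reverse=True)
    p = sum(labels)              # actual positives
    n = len(labels) - p          # actual negatives
    tp = fp = 0
    pts = [(0.0, 0.0)]           # cutoff above every score
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / n, tp / p))
    return pts                   # ends at (1.0, 1.0)
```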

The cumulative gains chart displays the positive predicted value and depth for a range of cutoff values. As the cutoff increases, the depth decreases. If the posterior probabilities were arbitrarily assigned to the cases, then the gains chart would be a horizontal line at the overall proportion of events.

The gains chart is widely used in database marketing to decide how deep in a database to go with a promotion. The simplest way to construct this curve is to sort and bin the predicted posterior probabilities (for example, into deciles). The gains chart is easily augmented with revenue and cost information. A plot of sensitivity versus depth is sometimes called a Lorenz curve, concentration curve, or lift curve (although lift value is not explicitly displayed).
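The sort-and-bin construction can be sketched as follows (illustrative Python; equal-count bins are assumed):

```python
def cumulative_gains(scores, labels, bins=10):
    """Sort cases by predicted probability, cut into equal bins
    (e.g. deciles), and report the cumulative positive predicted
    value at each depth."""
    ranked = [y for _, y in sorted(zip(scores, labels), reverse=True)]
    n = len(ranked)
    gains = []
    for b in range(1, bins + 1):
        depth = round(n * b / bins)            # number of cases solicited
        gains.append((b / bins, sum(ranked[:depth]) / depth))
    return gains                               # list of (depth, cumulative PPV)
```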

This plot and the ROC curve are very similar because depth and 1 − specificity are monotonically related. If the proper adjustments were made when the model was fitted, then the predicted posterior probabilities are correct. However, the confusion matrices would be incorrect with regard to the population because the event cases are over-represented.

Sensitivity and specificity, however, are not affected by separate sampling because they do not depend on the proportion of each class in the sample. For example, if the sample represented the population, then n1 cases are in class 1. The proportion of those that were allocated to class 1 is Se. Thus, there are n1 × Se true positives. Convergence was not attained in 0 iterations for the intercept-only model. Results of fitting the intercept-only model are based on the last maximum likelihood iteration.

The validity of the model fit is questionable. Knowledge of the population priors, sensitivity, and specificity is sufficient to fill in the confusion matrices.
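Filling in the population confusion matrix from the priors, sensitivity, and specificity can be sketched as follows (illustrative Python; `pi1` denotes the population proportion of events):

```python
def population_confusion(pi1, se, sp, n=1.0):
    """Fill in the population confusion matrix from the prior probability
    of the event (pi1) and the sample sensitivity and specificity, which
    are unaffected by separate (over-) sampling."""
    p1, p0 = n * pi1, n * (1 - pi1)    # expected positives and negatives
    return {"tp": p1 * se, "fn": p1 * (1 - se),
            "tn": p0 * sp, "fp": p0 * (1 - sp)}
```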

Several additional statistics can be calculated in a DATA step: The selected cutoffs occur where values of the estimated posterior probability change, provided the posterior probabilities are more than.

The ROC data set and the various plots can be specified using the point-and-click interface. To determine the optimal cutoff, a performance criterion needs to be defined. If the goal were to increase the sensitivity of the classifier, then the optimal classifier would allocate all cases to class 1. If the goal were to increase specificity, then the optimal classifier would allocate all cases to class 0.

For realistic data, there is a tradeoff between sensitivity and specificity. Higher cutoffs decrease sensitivity and increase specificity. Lower cutoffs decrease specificity and increase sensitivity. The decision-theoretic approach starts by assigning misclassification costs (losses) to each type of error (false positives and false negatives).

The optimal decision rule minimizes the total expected cost (risk). The Bayes rule is the decision rule that minimizes the expected cost. In the two-class situation, the Bayes rule can be determined analytically. If you classify a case into class 1, then the expected cost is (FP cost) × (1 − p), where p is the true posterior probability that the case belongs to class 1; if you classify it into class 0, the expected cost is (FN cost) × p. Setting these two expected costs equal and solving for p gives the optimal cutoff probability. Since p must be estimated from the data, the plug-in Bayes rule is used in practice.

Note that the Bayes rule depends only on the ratio of the costs, not on their actual values. If the misclassification costs are equal, then the Bayes rule corresponds to a cutoff of 0.5.
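The plug-in Bayes cutoff implied by the cost ratio can be sketched as follows (illustrative Python):

```python
def bayes_cutoff(fn_cost, fp_cost):
    """Plug-in Bayes rule cutoff: classify into class 1 when the estimated
    posterior probability exceeds this value. It follows from equating
    fp_cost * (1 - p) with fn_cost * p, and depends only on the cost
    ratio, not on the actual cost values."""
    return fp_cost / (fp_cost + fn_cost)
```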

Hand commented that "the use of error rate often suggests insufficiently careful thought about the real objectives." When the target event is rare, the cost of a false negative is usually greater than the cost of a false positive. The cost of not soliciting a responder is greater than the cost of sending a promotion to someone who does not respond.

The cost of accepting an applicant who will default is greater than the cost of rejecting someone who would pay off the loan. The cost of approving a fraudulent transaction is greater than the cost of denying a legitimate one.

Such considerations dictate cutoffs that are less than (often much less than) 0.5. Examining the performance of a classifier over a range of cost ratios can be useful. The central cutoff tends to maximize the mean of sensitivity and specificity. Because increasing sensitivity usually corresponds to decreasing specificity, the central cutoff tends to equalize sensitivity and specificity. Statistics that summarize the performance of a classifier across a range of cutoffs can also be useful for assessing global discriminatory power.

One approach is to measure the separation between the predicted posterior probabilities for each class. The more the distributions overlap, the weaker the model. The simplest statistics are based on the difference between the means of the two distributions.

In credit scoring, the divergence statistic is a scaled difference between the means (Nelson). Hand discusses several summary measures based on the difference between the means. The well-known t-test for comparing two distributions is based on the difference between the means. The t-test has many optimal properties when the two distributions are symmetric with equal variance and have light tails.

However, the distributions of the predicted posterior probabilities are typically asymmetric with unequal variance. Many other two-sample tests have been devised for non-normal distributions (Conover). The Kolmogorov-Smirnov test statistic, D, is the maximum vertical difference between the cumulative distributions.

If D equals zero, the distributions are everywhere identical. The maximum value of the K-S statistic, 1, occurs when the distributions are perfectly separated. Use of the K-S statistic for comparing predictive models is popular in database marketing.
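The two-sample D statistic can be sketched as follows (illustrative Python using empirical CDFs; fine for moderate sample sizes):

```python
def ks_statistic(scores0, scores1):
    """Two-sample Kolmogorov-Smirnov D: maximum vertical distance between
    the empirical CDFs of the predicted probabilities in the two classes."""
    grid = sorted(set(scores0) | set(scores1))

    def ecdf(xs, t):
        return sum(x <= t for x in xs) / len(xs)

    return max(abs(ecdf(scores0, t) - ecdf(scores1, t)) for t in grid)
```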

An oversampled validation data set does not affect D because the empirical distribution function is unchanged if each case represents more than one case in the population. In the predictive modeling context, it could be argued that location differences are paramount.

Because of its generality, the K-S test is not particularly powerful at detecting location differences. The most powerful nonparametric two-sample test is the Wilcoxon-Mann-Whitney test. The Wilcoxon version of this popular two-sample test is based on the ranks of the data.

In the predictive modeling context, the predicted posterior probabilities would be ranked from smallest to largest. The test statistic is based on the sum of the ranks in the classes. A perfect ROC curve would be a horizontal line at one; that is, sensitivity and specificity would both equal one for all cutoffs. In this case, the c statistic would equal one. The c statistic technically ranges from zero to one, but in practice, it should not get much lower than 0.5.

A perfectly random model, where the posterior probabilities were assigned arbitrarily, would give a 45-degree straight ROC curve through the origin; hence, it would give a c statistic of 0.5. Oversampling does not affect the area under the ROC curve because sensitivity and specificity are unaffected. The area under the ROC curve is also equivalent to the Gini coefficient, which is used to summarize the performance of a Lorenz curve (Hand). The results of the Wilcoxon test can be used to compute the c statistic.
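The rank-based computation of c from the Wilcoxon statistic can be sketched as follows (illustrative Python; midranks handle ties):

```python
def c_statistic(scores, labels):
    """c statistic from the Wilcoxon rank sum: rank all predicted
    probabilities, sum the (mid)ranks of the event class, and rescale.
    Equals the area under the ROC curve."""
    ranked = sorted(zip(scores, labels))
    n1 = sum(labels)
    n0 = len(labels) - n1
    rank_sum = 0.0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            j += 1                           # group of tied scores
        midrank = (i + 1 + j) / 2            # average of ranks i+1 .. j
        rank_sum += midrank * sum(y for _, y in ranked[i:j])
        i = j
    # Mann-Whitney U = W1 - n1(n1+1)/2; c = U / (n1 * n0)
    return (rank_sum - n1 * (n1 + 1) / 2) / (n1 * n0)
```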

To correct for the optimistic bias, a common strategy is to hold out a portion of the development data for assessment. Statistics that measure the predictive accuracy of the model include sensitivity and positive predicted value. Graphics such as the ROC curve, the gains chart, and the lift chart can also be used to assess the performance of the model. If the assessment data set was obtained by splitting oversampled data, then the assessment data set needs to be adjusted.

This can be accomplished by using the sensitivity, specificity, and the prior probabilities. In predictive modeling, the ultimate use of logistic regression is to allocate cases to classes. To determine the optimal cutoff probability, the plug-in Bayes rule can be used.

The information you need is the ratio of the cost of false negatives to the cost of false positives. This optimal cutoff will minimize the total expected cost. For example, the cost of not soliciting a responder is greater than the cost of sending a promotion to someone who does not respond. Such considerations dictate cutoffs that are usually much less than 0.5.

A popular statistic that summarizes the performance of a model across a range of cutoffs is the Kolmogorov-Smirnov statistic. However, this statistic is not as powerful in detecting location differences as the Wilcoxon-Mann-Whitney test. Thus, the c statistic should be used to assess the performance of a model across a range of cutoffs. Decision trees (Breiman et al.) recursively partition the input space in order to isolate regions where the class composition is homogeneous.

Trees are flexible, interpretable, and scalable. When the target is binary, these plots are not very enlightening. A useful plot for detecting nonlinear relationships is a plot of the empirical logits. A simple, scalable, and robust smoothing method is to plot empirical logits for quantiles of the input variables. These logits use a minimax estimate of the proportion of events in each bin (Duffy and Santner). This eliminates the problem caused by zero counts and reduces variability.

The number of bins determines the amount of smoothing: the fewer the bins, the more the smoothing. One large bin would give a constant logit. For very large data sets and intervally scaled inputs, bins often works well. If the standard logistic model were true, then the plots should be linear. Sample variability can cause apparent deviations, particularly when the bin size is too small. However, serious nonlinearities, such as nonmonotonicity, are usually easy to detect. The bins will be equal-size quantiles, except when the number of tied values exceeds the bin size, in which case the bin is enlarged to contain all the tied values.

The empirical logits are plotted against the mean of the input variable in each bin. This needs to be computed as well.
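The binning and smoothing can be sketched as follows (illustrative Python; the sqrt(n)/2 adjustment shown is one minimax-style shrinkage choice, assumed here, and the quantile binning is simplified):

```python
import math

def empirical_logits(x, y, bins=10):
    """Empirical logits for quantile bins of an input variable.
    The sqrt(n)/2 adjustment (an assumed minimax-style shrinkage)
    avoids zero counts and reduces variability.
    Returns (mean of x in bin, empirical logit) pairs."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    out = []
    for b in range(bins):
        chunk = pairs[b * n // bins:(b + 1) * n // bins]
        if not chunk:
            continue
        m = len(chunk)
        events = sum(t for _, t in chunk)
        adj = math.sqrt(m) / 2
        logit = math.log((events + adj) / (m - events + adj))
        out.append((sum(v for v, _ in chunk) / m, logit))
    return out
```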

BINS data set.

1. Hand-crafted new input variables
2. Polynomial models
3. Flexible multivariate function estimators
4. Do nothing

1. Skilled and patient data analysts can accommodate nonlinearities in a logistic regression model by transforming or discretizing the input variables.

This can become impractical with high-dimensional data and increases the risk of overfitting (Section 5). Flexible multivariate function estimators include classification trees, generalized additive models, projection pursuit, multivariate adaptive regression splines, radial basis function networks, and multilayer perceptrons (Section 5). Doing nothing is also defensible: standard linear logistic regression can produce powerful and useful classifiers even when the estimates of the posterior probabilities are poor. Often the more flexible approaches do not show enough improvement to warrant the effort.

A linear model (first-degree polynomial) is planar. A quadratic model (second-degree polynomial) is a paraboloid or a saddle surface. Squared terms allow the effect of the inputs to be nonmonotonic.

Moreover, the effect of one variable can change linearly across the levels of the other variables. Cubic models (third-degree polynomials) allow for a minimum and a maximum in each dimension. Moreover, the effect of one variable can change quadratically across the levels of the other variables. Higher-degree polynomials allow even more flexibility. A polynomial model can be specified by adding new terms to the model.

The new terms are products of the original variables. A full polynomial model of degree d includes terms for all possible products involving up to d terms. The squared terms allow the effect of the inputs to be nonmonotonic. The cross product terms allow for interactions among the inputs. Cubic terms allow for even more flexibility.

Polynomial models have disadvantages, however; chief among them is the curse of dimensionality. The number of terms in a d-degree polynomial increases at a rate proportional to the number of dimensions raised to the power d (for example, the number of terms in a full quadratic model increases at a quadratic rate with dimension).
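The growth in the number of terms can be checked directly (illustrative Python):

```python
from math import comb

def polynomial_terms(k, d):
    """Number of terms (including the intercept) in a full polynomial of
    degree d in k input variables: C(k + d, d). For fixed d this grows
    on the order of k**d -- the curse of dimensionality."""
    return comb(k + d, d)
```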

For example, the influence that individual cases have on the predicted value at a particular point (the equivalent kernel) does not necessarily increase the closer they are to that point. Higher-degree polynomials are not reliable smoothers. One suggested approach is to include all the significant input variables and use forward selection to detect any two-factor interactions and squared terms.

This approach will miss interactions and squared terms involving nonsignificant input variables, but the number of all possible two-factor interactions and squared terms is too large for the LOGISTIC procedure to assess.

The c statistic increased from. However, this model should be assessed using the validation data set because the inclusion of many higher-order terms may increase the risk of overfitting. The input layer contains input units, one for each input dimension. In a feed-forward neural network, the input layer can be connected to a hidden layer containing hidden units (neurons).

The hidden layer may be connected to other hidden layers, which are eventually connected to the output layer. The hidden units transform the incoming inputs with an activation function. The output layer represents the expected target. The output activation function transforms its inputs so that the predictions are on an appropriate scale. A neural network with a skip layer has the input layer connected directly to the output layer. Neural networks are universal approximators; that is, they can theoretically fit any nonlinear function (though not necessarily in practice).

The price they pay for flexibility is incomprehensibility: they are a black box with regard to interpretation. A network diagram is a graphical representation of an underlying mathematical model. Many popular neural networks can be viewed as extensions of ordinary logistic regression models.

An MLP (multilayer perceptron) is a feed-forward network where each hidden unit nonlinearly transforms a linear combination of the incoming inputs. The nonlinear transformation (activation function) is usually sigmoidal, such as the hyperbolic tangent function. When the target is binary, the appropriate output activation function is the logistic function (the inverse logit).

A skip layer represents a linear combination of the inputs that is untransformed by hidden units. Consequently, an MLP with a skip layer for a binary target is a standard logistic regression model that includes new variables representing the hidden units. These new variables model the higher-order effects not accounted for by the linear model.

An MLP with no hidden layers is just the standard logistic regression model (a brain with no neurons). The parameters can be estimated in all the usual ways, including Bernoulli maximum likelihood. However, the numerical aspects can be considerably more challenging.

The sigmoidal surfaces are themselves combined with a planar surface (the skip layer). This combination is transformed by the logistic function to give the estimate of the posterior probability.
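The combination of sigmoidal surfaces, skip layer, and logistic output activation can be sketched as follows (illustrative Python; the weights are hypothetical inputs, not fitted values):

```python
import math

def mlp_probability(x, hidden, skip, bias):
    """Posterior probability from an MLP with one hidden layer and a skip
    layer: tanh-transformed linear combinations (the hidden units) are
    added to an untransformed linear combination (the skip layer), then
    pushed through the logistic output activation.
    hidden: list of (weights, unit_bias, output_weight); skip: weights."""
    eta = bias + sum(w * v for w, v in zip(skip, x))      # skip layer
    for w, b, out_w in hidden:                             # hidden units
        eta += out_w * math.tanh(b + sum(wi * v for wi, v in zip(w, x)))
    return 1 / (1 + math.exp(-eta))                        # logistic output
```

With an empty hidden layer, this reduces to the standard logistic regression model, as the notes observe.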

The first step is to drastically reduce the dimension using a decision tree. A decision tree is a flexible multivariate model that effectively defies the curse of dimensionality. Retaining only the variables selected by a decision tree is one way to reduce the dimension without ignoring nonlinearities and interactions. Set the model role of INS to target. Right click in the open area where the prior profiles are activated and select Add.

Select Prior vector and change the prior probabilities to represent the true proportions in the population 0. Right click on Prior vector and select Set to use. Close the target profiler and select Yes to save the changes.

The Data Replacement node does imputation for the missing values. Select the Imputation Methods tab and choose median as the method. A variable selection tree can be fit in the Variable Selection node using the chi-squared method under the Target Associations tab. Run the flow and note that 10 inputs were accepted. In the Neural Network node, select the Basic tab. For the Network architecture, select Multilayer Perceptron.

Set the number of hidden neurons to 3 and Yes to Direct connections. Also select 3 for Preliminary runs, Default as the Training technique, and 2 hours for the Runtime limit.

Run the flow from the Assessment node. After nonlinear relationships are identified, you can transform the input variables, create dummy variables, or fit a polynomial model. The problem with hand-crafted input variables is the time and effort it takes. The problem with higher-degree polynomial models is their non-local behavior. Another solution to dealing with nonlinearities and interactions is to fit a neural network in Enterprise Miner. Neural networks are a class of flexible nonlinear regression models that can theoretically fit any nonlinear function.

However, the price neural networks pay for flexibility is incomprehensibility. Neural networks are useful, though, if the only goal is prediction, and understanding the model is of secondary importance.

Score new data using the final parameter estimates from a weighted logistic regression. Create a weight variable that adjusts for oversampling. The proportion of events in the population is. Calculate the probabilities and print out the first 25 observations. Name the variable with the logit values INS1. Print out the first 25 observations. Compare this scoring method with the method in step d. Replace missing values using group-median imputation.

The VAR statement lists the variables to be grouped. The objective is to cluster the variables and choose a representative one in each cluster. Alternatively, use the results of 2. Use the Output Delivery System to print out the last iteration of the clustering algorithm.

Use the TREE procedure to produce a dendrogram of the variable clustering. Which criterion created the fewest clusters? Which criterion created the most clusters? Submit the following code: Compare variable selection methods. Use all of the numeric variables and the RES categorical variable. Use all of the numeric variables and the dummy variables for RES. Use the Output Delivery System to determine which model is best according to the SBC criterion.

Fit a stepwise linear regression model using the REG procedure. Many statistical methods do not scale well to massive data sets because most methods were developed for small data sets generated from designed experiments. When there are a large number of input variables, there is usually a variety of measurement scales represented.

The input variables may be intervally scaled (amounts), binary, nominally scaled (names), ordinally scaled (grades), or counts. Nominal input variables with a large number of levels, such as ZIP code, are commonplace and present complications for regression analysis. Predictive modelers consider large numbers (hundreds) of input variables.

The number of variables often has a greater effect on computational performance than the number of cases. High dimensionality limits the ability to explore and model the relationships among the variables. This is known as the curse of dimensionality, which was distilled by Breiman et al. The remedy is dimension reduction: ignore irrelevant and redundant dimensions without inadvertently ignoring important ones.

Usually more data leads to better models, but having an ever-larger number of nonevent cases has rapidly diminishing returns and can even have detrimental effects. With rare events, the effective sample size for building a reliable prediction model is closer to 3 times the number of event cases than to the nominal size of the data set (Harrell). A seemingly massive data set might have the predictive potential of one that is much smaller.

One widespread strategy for predicting rare events is to build a model on a sample that disproportionately over-represents the event cases (for example, an equal number of events and nonevents). Such an analysis introduces biases that need to be corrected so that the results are applicable to the population. Each important dimension might affect the target in complicated ways.

Moreover, the effect of each input variable might depend on the values of other input variables. The curse of dimensionality makes this difficult to untangle. Many classical modeling methods, including standard logistic regression, were developed for inputs with effects that have a constant rate of change and do not depend on any other inputs. The candidate models might be different types of models, or different complexities of models of the same type. A common pitfall is to overfit the data; that is, to use too complex a model.

An overly complex model might be too sensitive to peculiarities in the sample data set and not generalize well to new data. Using too simple a model, however, can lead to underfitting, where true features are disregarded. Some of the business applications include target marketing, attrition prediction, credit scoring, and fraud detection. One challenge in building a predictive model is that the data usually was not collected for purposes of data analysis. Therefore, it is usually massive, dynamic, and dirty.

For example, the data usually has a large number of input variables. This limits the ability to explore and model the relationships among the variables. Thus, detecting interactions and nonlinearities becomes a cumbersome problem. When the target is rare, a widespread strategy is to build a model on a sample that disproportionately over-represents the events.

The results will be biased, but they can be easily corrected to represent the population. A common pitfall in building a predictive model is to overfit the data.

An overfitted model will be too sensitive to the nuances in the data and will not generalize well to new data. However, a model that underfits the data will systematically miss the true features in the data.

Chapter 2: Fitting the Model

Each case belongs to one of two classes. A binary indicator variable represents the class label for each case. The parameters β0, …, βk are unknown constants that must be estimated from the data.

Probability must be between zero and one. The logit transformation (the log of the odds) is a device for constraining the posterior probability to be between zero and one. Therefore, modeling the logit with a linear combination gives estimated probabilities that are constrained to be between zero and one. The link function depends on the scale of the target. On the probability scale, the model becomes a sigmoidal surface. Different parameter values give different surfaces with different slopes and different orientations.
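The logit and its inverse can be sketched as follows (illustrative Python):

```python
import math

def logit(p):
    """Log odds: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def logistic(eta):
    """Inverse logit: constrains any linear combination of inputs to a
    posterior probability strictly between zero and one."""
    return 1 / (1 + math.exp(-eta))
```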

The nonlinearity is solely due to the constrained scale of the target. The coefficients are the slopes. Exponentiating each parameter estimate gives the odds ratio, which compares the odds of the event in one group to the odds of the event in another group. The odds ratio represents the multiplicative effect of each input variable. Moreover, the effect of each input variable does not depend on the values of the other inputs (additivity). However, this simple interpretation depends on the model being correctly specified.

In predictive modeling, you should not presume that the true posterior probability has such a simple form. Think of the model as an approximating hyperplane. Consequently, you can determine the extent to which the inputs are important to the approximating plane. This is more correctly termed logistic discrimination (McLachlan). An allocation rule is merely an assignment of a cutoff probability, where cases above the cutoff are allocated to class 1 and cases below the cutoff are allocated to class 0.

The standard logistic discrimination model separates the classes by a linear surface (hyperplane). The decision boundary is always linear. Determining the best cutoff is a fundamental concern in logistic discrimination. The method of maximum likelihood (ML) is usually used to estimate the unknown parameters in the logistic regression model. The likelihood function is the joint probability density function of the data, treated as a function of the parameters.

The maximum likelihood estimates are the values of the parameters that maximize the probability of obtaining the sample data. In ML estimation, the combination of parameter values that maximizes the likelihood (or log-likelihood) is sought. There is, in general, no closed-form analytical solution for the ML estimates as there is for linear regression on a normally distributed response. They must be determined using an iterative optimization algorithm.

Consequently, logistic regression is considerably more computationally expensive than linear regression. Software for ML estimation of the logistic model is commonplace. The seven input variables included in the model were selected arbitrarily. In this example, the parameterization method is reference cell coding and the reference level is S. The STB option displays the standardized estimates for the parameters for the continuous input variables.
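To make the iterative idea concrete, here is a minimal Python sketch that maximizes the logistic log-likelihood on a tiny hypothetical data set by simple gradient ascent (production software such as PROC LOGISTIC uses Newton-type iterations, but the point is the same: there is no closed form, so the estimates are improved step by step):

```python
import math

# Toy data: single input x, binary target y (hypothetical values).
data = [(-2, 0), (-1, 0), (0, 0), (0, 1), (1, 1), (2, 1)]

def loglik(b0, b1):
    # Bernoulli log-likelihood of the sample under the logistic model
    ll = 0.0
    for x, y in data:
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

b0 = b1 = 0.0                 # flat starting values
step = 0.1
prev = loglik(b0, b1)
for _ in range(200):
    # Gradient of the log-likelihood: sum of (y - p) and x*(y - p).
    g0 = sum(y - 1 / (1 + math.exp(-(b0 + b1 * x))) for x, y in data)
    g1 = sum(x * (y - 1 / (1 + math.exp(-(b0 + b1 * x)))) for x, y in data)
    b0 += step * g0
    b1 += step * g1
cur = loglik(b0, b1)
assert cur > prev             # the likelihood improved over the flat start
```

Because the log-likelihood is concave, the iterations converge to the unique ML solution for this (non-separable) toy sample.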

The UNITS statement enables you to obtain an odds ratio estimate for a specified change in an input variable. The Response Profile table shows the target variable values listed according to their ordered values. Four iterations were needed to fit the logistic model.

At each iteration, the −2 log likelihood decreased and the parameter estimates changed. These are goodness-of-fit measures you can use to compare one model to another. The parameter estimates measure the rate of change in the logit (log odds) corresponding to a one-unit change in an input variable, adjusted for the effects of the other inputs. The parameter estimates are difficult to compare because they depend on the units in which the variables are measured.

The standardized estimates convert them to standard deviation units. The absolute value of the standardized estimates can be used to give an approximate ranking of the relative importance of the input variables on the fitted logistic model. The variable RES has no standardized estimate because it is a class variable.

For example, the odds of acquiring an insurance product is. For all pairs of observations with different values of the target variable, a pair is concordant if the observation with the outcome has a higher predicted outcome probability, based on the model, than the observation without the outcome. A pair is discordant if the observation with the outcome has a lower predicted outcome probability than the observation without the outcome. The four rank correlation indexes (Somers' D, Gamma, Tau-a, and c) are computed from the numbers of concordant and discordant pairs of observations.

In general, a model with higher values for these indexes (the maximum value is 1) has better predictive ability than a model with lower values. Consequently, the odds of acquiring the insurance product increases 7. Predictions can be made by simply plugging in the new values of the inputs. The estimates are named corresponding to their input variables. The data set to be scored typically would not have a target variable.

The logistic function (the inverse of the logit) needs to be applied to compute the posterior probability. In joint (mixture) sampling, the input-target pairs are randomly selected from their joint distribution. In separate sampling, the inputs are randomly selected from their distributions within each target class.

Separate sampling is standard practice in supervised classification. When the target event is rare, it is common to oversample the rare event, that is, to take a disproportionately large number of event cases. Oversampling rare events is generally believed to lead to better predictions (Scott and Wild). Separate sampling is also known as case-control sampling, choice-based sampling, stratified sampling on the target (not necessarily taken with proportional allocation), biased sampling, y-conditional sampling, outcome-dependent sampling, and oversampling.

The priors, π0 and π1, represent the population proportions of class 0 and class 1, respectively. The proportions of the target classes in the sample are denoted ρ0 and ρ1. In separate (non-proportional) sampling, ρ0 ≠ π0 and ρ1 ≠ π1. The adjustments for oversampling require that the priors be known a priori. This assumption is appropriate for joint sampling but not for separate sampling. However, the effects of violating this assumption can be easily corrected. In logistic regression, only the estimate of the intercept, β0, is affected by using Bernoulli ML on data from a separate sampling design (Prentice and Pike). Consequently, the effect of oversampling is to shift the logits by a constant amount (the offset).

This vertical shift of the logit affects the posterior probability in a corresponding fashion. Alternatively, the offset could be applied after the standard model is fitted. Both approaches give identical results. For both types of adjustments, the population priors, π0 and π1, need to be known a priori, while the sample priors, ρ0 and ρ1, can be estimated from the data.

Since only the intercept is affected, are the adjustments necessary? No, if the goal is only to understand the effects of the input variables on the target, or if only the rank order of the predicted values (scores) is required. The proportion of the target event in the population was. The probabilities for class 0 in the population and sample are computed as one minus the probabilities in class 1. The only difference between the parameter estimates with and without the offset variable is the intercept term.

The list of parameter estimates contains a new entry for the variable OFF, which has a fixed value of one. The probabilities computed from this model have been adjusted down because the population probability is much lower than the sample probability.

The SCORE procedure uses the final parameter estimates from the logistic model with the offset variable to score new data. A more efficient way of adjusting the posterior probabilities for the offset is to fit the model without the offset and adjust the fitted posterior probabilities afterwards in a DATA step. The two approaches are statistically equivalent. Sampling weights adjust the data so that it better represents the true population.
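The after-the-fact adjustment of a DATA step can be sketched in Python; both routes below use the standard offset formula, and all numeric values are hypothetical (π1 is the population prior, ρ1 the sample proportion of events):

```python
import math

def adjust_posterior(p, pi1, rho1):
    """Convert a posterior from an oversampled model back to the population
    scale, given population event prior pi1 and sample event proportion rho1."""
    pi0, rho0 = 1 - pi1, 1 - rho1
    num = p * pi1 / rho1
    return num / (num + (1 - p) * pi0 / rho0)

def adjust_via_offset(p, pi1, rho1):
    # Equivalent route: shift the logit by the offset ln((rho1*pi0)/(rho0*pi1)).
    pi0, rho0 = 1 - pi1, 1 - rho1
    offset = math.log((rho1 * pi0) / (rho0 * pi1))
    shifted = math.log(p / (1 - p)) - offset
    return 1 / (1 + math.exp(-shifted))

p, pi1, rho1 = 0.40, 0.02, 0.50       # hypothetical posterior and priors
a = adjust_posterior(p, pi1, rho1)
b = adjust_via_offset(p, pi1, rho1)
assert abs(a - b) < 1e-12             # the two adjustments agree
assert a < p                          # adjusted down toward the rare prior
```

With the event oversampled to 50% of the sample but only 2% of the population, a sample posterior of 0.40 drops to roughly 0.013, which matches the remark that the adjusted probabilities move down toward the population probability.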

When a rare target event has been oversampled, class 0 is under-represented in the sample. Consequently, a class-0 case should actually count more in the analysis than a class-1 case.

The predicted values will be properly corrected by using weights that are inversely proportional to the selection probability for each class (the number of cases in the sample divided by the number of cases in the population).
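The weight arithmetic can be sketched in Python with hypothetical class counts; the weights are then normalized so that they sum to the sample size:

```python
# Hypothetical counts: rare event in the population, oversampled 1:1 in training.
pop    = {0: 980_000, 1: 20_000}
sample = {0: 5_000,   1: 5_000}

# Weight = inverse of the selection probability for each class,
# i.e. population count divided by sample count (up to normalization).
raw = {c: pop[c] / sample[c] for c in (0, 1)}

# Normalize so the weights sum to the sample size (less distortion of the
# standard errors and p-values used as tuning parameters).
n = sum(sample.values())
total = sum(raw[c] * sample[c] for c in (0, 1))
w = {c: raw[c] * n / total for c in (0, 1)}

weighted = {c: w[c] * sample[c] for c in (0, 1)}
share1 = weighted[1] / sum(weighted.values())
assert abs(share1 - pop[1] / sum(pop.values())) < 1e-9   # matches population
assert abs(sum(weighted.values()) - n) < 1e-6            # sum of weights = n
```

After weighting, the 50/50 training sample behaves like the 2%-event population, which is exactly the property described in the next paragraph.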

The classes now are in the same proportion in the adjusted sample as they are in the population. The normalization causes less distortion in standard errors and p-values. Although statistical inference is not the goal of the analysis, p-values are used as tuning parameters in variable selection algorithms.

The offset method and the weighted method are not statistically equivalent. The parameter estimates are not exactly the same, but they have the same large-sample statistical properties. When the linear-logistic model is correctly specified, the offset method (unweighted analysis) is considered superior. However, when the logistic model is merely an approximation to some nonlinear model, the weighted analysis has advantages (Scott and Wild).

The weights could have been assigned manually without having to reference macro variables. Consequently, this syntax is a more compact way of expressing a conditional. The figures in the column represent the sample sizes adjusted to the population proportions.

Note that the Sum of the Weights equals the total sample size. The weighted analysis also changed the goodness-of-fit measures, the standardized estimates, and the rank correlation indexes.

The logit transformation is used to constrain the posterior probability to be between zero and one. The parameters are estimated using the method of maximum likelihood, which finds the parameter values that make the observed data most likely. When you exponentiate the slope estimates, you obtain the odds ratio, which compares the odds of the event in one group to the odds of the event in another group. A DATA step can then be used to take the inverse of the logit to compute the posterior probability.

When you oversample rare events, you can use the OFFSET option to adjust the model so that the posterior probabilities reflect the population. The two methods are not statistically equivalent. When the linear-logistic model is correctly specified, the offset method is considered superior. However, when the logistic model is merely an approximation to some nonlinear model, the weighted analysis has advantages.

A value is missing completely at random (MCAR) if the probability that it is missing is independent of the data. MCAR is a particularly easy mechanism to manage but is unrealistic in most predictive modeling applications.

The probability that a value is missing might depend on the unobserved value itself (credit applicants with fewer years at their current job might be less inclined to provide this information). The probability that a value is missing might depend on observed values of other input variables (customers with longer tenures might be less likely to have certain historic transactional data). Missingness might also depend on a combination of values of correlated inputs.

An even more pathological missing-value mechanism occurs when the probability that a value is missing depends on values of unobserved, lurking predictors (transient customers might have missing values on a number of variables). A fundamental concern for predictive modeling is that the missingness is related to the target (perhaps the more transient customers are the best prospects for a new offer).

In complete-case analysis, only those cases without any missing values are used in the analysis. Complete-case analysis has some moderately attractive theoretical properties even when the missingness depends on observed values of other inputs (Donner; Jones). However, complete-case analysis has serious practical shortcomings with regard to predictive modeling. Even a smattering of missing values can cause an enormous loss of data in high dimensions. Another practical consideration of any treatment of missing values is scorability, the practicality of the method when it is deployed.

The purpose of predictive modeling is scoring new cases. How would a model built on the complete cases score a new case if it had a missing value? Declining to score new incomplete cases would only be practical if there were a very small number of missings. Imputation means filling in the missing values with some reasonable value. Many methods have been developed for imputing missing values (Little). The principal consideration for most methods is getting valid statistical inference on the imputed data, not generalization.

Often, subject-matter knowledge can be used to impute missing data. For example, the missing values might be miscoded zeros. Create missing indicators. Use median imputation (fill the missing value of xj with the median of the complete cases for that variable). Create a new level representing missing (unknown) for categorical inputs.

This strategy is somewhat unsophisticated but satisfies two of the most important considerations in predictive modeling. A new case is easily scored: first replace the missing values with the medians from the development data, and then apply the prediction model.
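The strategy (missing indicators plus median imputation, reusing the development medians at scoring time) can be sketched in Python; the rows and values are hypothetical:

```python
import statistics

# Hypothetical development rows; None marks a missing value.
train = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [None, 50.0]]

k = len(train[0])
# Medians of the complete cases for each variable, computed once on
# the development data and stored for scoring.
medians = [statistics.median(r[j] for r in train if r[j] is not None)
           for j in range(k)]

def prepare(row):
    """Missing indicators first (1 if missing), then imputed values."""
    mi = [1 if v is None else 0 for v in row]
    x = [medians[j] if v is None else v for j, v in enumerate(row)]
    return mi + x

# Scoring a new case reuses the development medians, never the new data's.
assert prepare([None, 40.0]) == [1, 0, 2.0, 40.0]
```

Keeping the indicator columns alongside the imputed values lets the model pick up any association between missingness and the target.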

There is statistical literature concerning different missing-value imputation methods, including discussions of the demerits of mean and median imputation and of missing indicators (Donner; Jones). Unfortunately, most of the advice is based on considerations that are peripheral to predictive modeling.

There is very little advice when the functional form of the model is not assumed to be correct, when the goal is to get good predictions that can be practically applied to new cases, when p-values and hypothesis tests are largely irrelevant, and when the missingness may be highly pathological depending on lurking predictors.

Two arrays are created, one called MI, which contains the missing value indicator variables, and one called X, which contains the input variables. It is critical that the order of the variables in the array MI matches the order of the variables in array X. Defining the dimension with an asterisk causes the array elements to be automatically counted. Thus, the DO loop will execute 15 times in this example. The assignment statement inside the DO loop causes the entries of MI to be 1 if the corresponding entry in X is missing, and zero otherwise.

Mean-imputation uses the unconditional mean of the variable. An attractive extension would be to use the mean conditional on the other inputs. This is referred to as regression imputation. Regression imputation would usually give better estimates of the missing values. Specifically, k linear regression models could be built one for each input variable using the other inputs as predictors.

This would presumably give better imputations and be able to accommodate missingness that depends on the values of the other inputs. An added complication is that the other inputs may themselves have missing values. Consequently, the k imputation regressions also need to accommodate missing values. Cluster-mean imputation is a somewhat more practical alternative. This method can accommodate missingness that depends on the other input variables, and it is implemented in Enterprise Miner.

A simple but less effective alternative is to define a priori segments (for example, high, middle, low, and unknown income) and then do mean or median imputation within each segment. You can specify the type of parameterization to use, such as effect coding or reference coding, and the reference level.

The choice of the reference level is immaterial in predictive modeling because different reference levels give the same predictions. Smarter Variables 75 75 HomeVal Local Urbanicity Expanding categorical inputs into dummy variables can greatly increase the dimension of the input space.

A smarter method is to use subject-matter information to create new inputs that represent relevant sources of variation. A categorical input might be best thought of as a link to other data sets. For example, geographic areas are often mapped to several relevant demographic variables. The coefficient of a dummy variable represents the difference in the logits between that level and the reference level.

When quasi-complete separation occurs, one of the logits will be infinite; the likelihood does not have a maximum in at least one dimension, so the ML estimate of that coefficient will be infinite.

If the zero-cell category is the reference level, then all the coefficients for the dummy variables will be infinite. Quasi-complete separation complicates model interpretation. It can also affect the convergence of the estimation algorithm. Furthermore, it might lead to incorrect decisions regarding variable selection. The most common cause of quasi-complete separation in predictive modeling is categorical inputs with rare categories. The best remedy for sparseness is collapsing levels of the categorical variable.

This is not always practical in predictive modeling. A simple data-driven method for collapsing the levels of contingency tables was developed by Greenacre. The levels (rows) are hierarchically clustered based on the reduction in the chi-squared test of association between the categorical variable and the target.

At each step, the two levels that give the least reduction in the chi-squared statistic are merged. This method quickly throws rare categories in with other categories that have similar marginal response rates. While this method is simple and effective, there is a potential loss of information because only univariate associations are considered.
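One greedy step of this collapsing scheme can be sketched in Python; the level counts are hypothetical, and "least reduction" is implemented by keeping the merge that leaves the chi-squared statistic largest:

```python
def chisq(table):
    # Pearson chi-squared for a (levels x 2) table of (non-event, event) counts
    row = [a + b for a, b in table]
    col = [sum(a for a, _ in table), sum(b for _, b in table)]
    n = sum(row)
    x2 = 0.0
    for i, (a, b) in enumerate(table):
        for j, obs in enumerate((a, b)):
            exp = row[i] * col[j] / n
            x2 += (obs - exp) ** 2 / exp
    return x2

def collapse_once(table):
    """Merge the two levels whose merge loses the least chi-squared."""
    best = None
    for i in range(len(table)):
        for j in range(i + 1, len(table)):
            merged = [t for k, t in enumerate(table) if k not in (i, j)]
            merged.append((table[i][0] + table[j][0],
                           table[i][1] + table[j][1]))
            x2 = chisq(merged)
            if best is None or x2 > best[0]:
                best = (x2, merged)
    return best[1]

# Four levels; the first two have similar marginal response rates (10%, 12%).
table = [(90, 10), (88, 12), (50, 50), (20, 80)]
collapsed = collapse_once(table)
assert len(collapsed) == 3
assert (178, 22) in collapsed     # the two similar levels were merged
```

Repeating `collapse_once` until one level remains reproduces the full merge history that the procedure's summary reports.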

Instead of writing to the listing file directly, SAS procedures can now create an output object for each piece of output that is displayed. You can then take the data component of the output object and convert it to a SAS data set.

This means that every number in every table of every procedure can be accessed via a data set. The first step is to create a data set that contains the proportion of the target event INS and number of cases in each level.

The ID statement specifies a variable that identifies observations in the output data set. At each step, the levels that give the smallest decrease in chi-squared are merged. The rows in the summary represent the results after the listed clusters were merged.

The number of clusters is reduced from 18 to 1. When previously collapsed levels are merged, they are denoted using CL as the prefix and the number of resulting clusters as the suffix.

For example, at the sixth step, CL15 represents B1 and B17, which were merged at the fourth step, creating 15 clusters. To calculate the optimum number of clusters, the chi-squared statistic and the associated p-value need to be computed for each collapsed contingency table.

This information can be obtained by multiplying the chi-squared statistic from the 19×2 contingency table by the proportion of chi-squared remaining after the levels are collapsed. The FREQ procedure is used to compute the chi-squared statistic for the 19×2 contingency table. The function LOGSDF computes the log of the probability that an observation from a specified distribution is greater than or equal to a specified value. The arguments for the function are the specified distribution in quotes, the numeric random variable, and the degrees of freedom.

The log of the p-value is calculated in order to produce a more visually appealing graph.

The VPOS option specifies the number of print positions on the vertical axis. In this example, the proportion of the chi-squared statistic is the vertical axis. The horizontal axis is also roughly ordered by the mean proportion of events in each cluster. Choosing a cluster near the center of the tree (for example, B11, B18, B7, and B2) as the reference group may lead to better models if the variable selection method you choose incrementally adds variables to the model (Cohen). Four dummy variables are created with the second cluster designated as the reference level.

Note that the dummy variables are numbered sequentially. Redundancy is an unsupervised concept; it does not involve the target variable. In contrast, irrelevant inputs are not substantially related to the target. In high-dimensional data sets, identifying irrelevant inputs is more difficult than identifying redundant inputs. A good strategy is to first reduce redundancy and then tackle irrelevancy in a lower-dimensional space. A set of k variables can be transformed into a set of k principal components.

The principal components (PCs) are linear combinations of the k variables, constructed to be jointly uncorrelated and to explain the total variability among the original standardized variables. The correlation matrix is the covariance matrix of the standardized variables. Since each standardized variable has a variance equal to one, the total variability among the standardized variables is just the number of variables. The principal components are produced by an eigendecomposition of the correlation matrix.

The eigenvalues are the variances of the PCs; they sum to the number of variables. The first PC corresponds to the first eigenvalue and explains the largest proportion of the variability. Each PC explains a decreasing amount of the total variability.

In practice, dimension reduction is achieved by retaining only the first few PCs provided they explain a sufficient proportion of the total variation. The reduced set of PCs might then be used in place of the original variables in the analysis. The result of clustering k variables is a set of k cluster components. Like PCs, the cluster components are linear combinations of the original variables.
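The eigenvalue arithmetic is easy to verify by hand in the two-variable case: for two standardized variables with correlation r, the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 − r. A short Python sketch, with r chosen arbitrarily:

```python
import math

r = 0.6
# Eigenvalues of [[1, r], [r, 1]] are 1 + r and 1 - r.
eigenvalues = sorted((1 + r, 1 - r), reverse=True)

# The eigenvalues are the PC variances and sum to the number of variables.
assert abs(sum(eigenvalues) - 2) < 1e-12

# Proportion of total variability explained by the first PC.
explained = eigenvalues[0] / 2
assert abs(explained - 0.8) < 1e-12   # 80% when r = 0.6

# Check the eigenvector property for the first PC, v = (1, 1)/sqrt(2):
# multiplying by the correlation matrix scales v by 1 + r.
v = (1 / math.sqrt(2), 1 / math.sqrt(2))
Mv = (v[0] + r * v[1], r * v[0] + v[1])
assert all(abs(Mv[i] - (1 + r) * v[i]) < 1e-12 for i in range(2))
```

The higher the correlation, the more of the total variability the first PC absorbs, which is exactly why retaining only the leading PCs reduces dimension with little loss.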

Unlike the PCs, the cluster components are not uncorrelated and do not explain all the variability in the original variables. The cluster component scores are standardized to have unit variance. The three cluster components in the above example correspond to eigenvalues of 1. Fast backward elimination had the best overall performance, with a linear increase in time as the number of inputs increased. Note that ordinary backward elimination without the FAST option would have been slower than stepwise.

The FAST option uses the full model fit to approximate the remaining slope estimates for each subsequent elimination of a variable from the model. The SLSTAY option specifies the significance level for a variable to stay in the model in a backward elimination step. The significance level was chosen arbitrarily for illustrative purposes. Since the best subsets method does not support class variables, dummy variables for RES are created in a DATA step.

The SBC is essentially the −2 log likelihood plus a penalty term that increases as the model gets bigger. Smaller values of SBC are preferable.

The score test statistic is asymptotically equivalent to the likelihood ratio statistic. The −2 log likelihood is a constant minus the likelihood ratio statistic. Therefore, the Output Delivery System is used to create an output data set with the score statistic and the number of variables.

The first DATA step selects one observation (the number of observations) and creates a macro variable OBS that contains the number of observations. This is a different model from the one selected in the backward method. First, missing values need to be replaced with reasonable values. Missing indicator variables are also needed to capture whether the missingness is related to the target.

If there are nominal input variables with numerous levels, the levels should be collapsed to reduce the likelihood of quasi-complete separation and to reduce the redundancy among the levels.

Furthermore, if there are numerous input variables, variable clustering should be performed to reduce the redundancy among the variables. To assist in identifying nonlinear associations, Hoeffding's D statistic can be used. A variable with a low rank on the Spearman correlation statistic but a high rank on Hoeffding's D may indicate that its association with the target is nonlinear.

For example, the above classifier was fit (or, more properly, overfit) to a 10-case data set. This is called overfitting. The model was overly sensitive to peculiarities of the particular training data, in addition to true features of their joint distribution. The more flexible the underlying model and the less plentiful the data, the more overfitting is a problem. When a relatively inflexible model like linear logistic regression is fitted to massive amounts of data, overfitting may not be a problem (Hand). However, the chance of overfitting is increased by variable selection methods and by supervised input preparation such as collapsing levels of nominal variables based on their associations with the target.

It is prudent to assume overfitting until proven otherwise. Large differences between the performance on the training and test sets usually indicate overfitting. The model is fit to the remainder training data set and performance is evaluated on the holdout portion test data set.

Usually from one-fourth to one-half of the development data is used as a test set (Picard and Berk). After assessment, it is common practice to refit the final model on the entire undivided data set.

When the holdout data is used for comparing, selecting, and tuning models and the chosen model is assessed on the same data set that was used for comparison, then the optimism principle again applies. In this situation, the holdout sample is more correctly called a validation data set, not a test set.

The test set is used for a final assessment of a fully specified classifier (Ripley). If model tuning and a final assessment are both needed, then the data should be split three ways into training, validation, and test sets. When data is scarce, it is inefficient to use only a portion for training. Furthermore, when the test set is small, the performance measures may be unreliable because of high variability.

For small and moderate data sets, k-fold cross-validation (Breiman et al.) is an attractive alternative. In 5-fold cross-validation, for instance, the data would be split into five equal sets. The entire modeling process would be redone on each four-fifths of the data, using the remaining one-fifth for assessment. The five assessments would then be averaged. In this way, all the data is used for both training and assessment.
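The mechanics of 5-fold cross-validation can be sketched in Python; the data and the deliberately trivial "model" (predict the training mean) are hypothetical stand-ins for the real modeling process:

```python
import random

random.seed(1)
# Hypothetical data: (x, y) pairs with a noisy linear relationship.
data = [(i, 2 * i + random.gauss(0, 1)) for i in range(50)]
random.shuffle(data)

k = 5
folds = [data[i::k] for i in range(k)]   # five roughly equal sets

errors = []
for held in range(k):
    train = [p for i, f in enumerate(folds) if i != held for p in f]
    test = folds[held]
    # "Fit" a trivial model on the training four-fifths ...
    mean_y = sum(y for _, y in train) / len(train)
    # ... and assess it on the held-out fifth.
    mse = sum((y - mean_y) ** 2 for _, y in test) / len(test)
    errors.append(mse)

cv_estimate = sum(errors) / k   # average of the five assessments
assert len(errors) == k
assert cv_estimate > 0
```

In a real application the whole pipeline (imputation, level collapsing, variable selection, model fitting) would be repeated inside the loop, so that no held-out fifth ever influences its own assessment.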

Another approach that is frugal with the data is to assess the model on the same data set that was used for training but to penalize the assessment for optimism (Ripley). The appropriate penalty can be determined theoretically or by using computationally intensive methods such as the bootstrap.

The model and all the input preparation steps then need to be redone on the training set. The validation data will be used for assessment. Consequently, it needs to be treated as if it were truly new data where the target is unknown.

The results of the analysis on the training data need to be applied to the validation data, not recalculated. Several input-preparation steps can be done before the data is split. Creating missing indicators should be done on the full development data because the results will not change.

The offset variable is also created before the data is split to get the best estimate of the proportion of events. The variable U is created using the RANUNI function, which generates pseudo-random numbers from a uniform distribution on the interval (0,1). Using a particular seed, greater than zero, will produce the same split each time the DATA step is run. If the seed were zero, then the data would be split differently each time the DATA step was run.

This example presents a cautious approach and does not involve the validation data at all. Since the slope estimates are not affected by the offset variable, the offset variable is not needed. This level should be chosen based on the model's performance on the validation data set. This reduction is less than the results of the backward elimination method on the weighted data.

This illustrates that the significance level for SLSTAY should not be chosen arbitrarily, but rather by trial and error based on the model's performance on the validation data set.

The offset variable is needed to correctly adjust the intercept. Array X contains the variables with missing values and array MED contains the variables with the medians. The DO loop replaces the missing values with the medians. An allocation rule corresponds to a threshold value cutoff of the posterior probability. For example, all cases with probabilities of default greater than.

For a given cutoff, how well does the classifier perform? The fundamental assessment tool is the confusion matrix. The confusion matrix is a crosstabulation of the actual and predicted classes. It quantifies the confusion of the classifier.

The event of interest, whether it is unfavorable (like fraud, churn, or default) or favorable (like response to an offer), is often called a positive, although this convention is arbitrary. Ideally, one would like large values of all these statistics. The context of the problem determines which of these measures is the primary concern.

In contrast, a fraud investigator might be most concerned with sensitivity because it gives the proportion of frauds that would be detected. The ROC curve displays the sensitivity and specificity for the entire range of cutoff values. As the cutoff decreases, more and more cases are allocated to class 1; hence the sensitivity increases and the specificity decreases. As the cutoff increases, more and more cases are allocated to class 0; hence the sensitivity decreases and the specificity increases.
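The sensitivity/specificity tradeoff across cutoffs can be sketched in Python with hypothetical scored validation cases:

```python
# Hypothetical scored cases: (posterior probability, actual class).
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1),
          (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0)]

def se_sp(cutoff):
    # Confusion-matrix counts at one cutoff, then sensitivity and specificity.
    tp = sum(1 for p, y in scored if p >= cutoff and y == 1)
    fn = sum(1 for p, y in scored if p < cutoff and y == 1)
    tn = sum(1 for p, y in scored if p < cutoff and y == 0)
    fp = sum(1 for p, y in scored if p >= cutoff and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Lowering the cutoff allocates more cases to class 1:
# sensitivity rises while specificity falls.
se_hi, sp_hi = se_sp(0.65)
se_lo, sp_lo = se_sp(0.25)
assert se_lo >= se_hi and sp_lo <= sp_hi

# The extreme cutoffs anchor the ROC curve at (0,0) and (1,1).
assert se_sp(1.01) == (0.0, 1.0)
assert se_sp(0.0) == (1.0, 0.0)
```

Sweeping the cutoff over all observed posterior values and plotting sensitivity against 1 − specificity traces out the ROC curve described above.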

Consequently, the ROC curve intersects (0,0) and (1,1). If the posterior probabilities were arbitrarily assigned to the cases, then the ratio of false positives to true positives would be the same as the ratio of the total actual negatives to the total actual positives. Consequently, the baseline (random) model is a 45-degree line through the origin. As the ROC curve bows above the diagonal, the predictive power increases.

A perfect model would reach the (0,1) point, where both sensitivity and specificity equal 1. The cumulative gains chart displays the positive predicted value and depth for a range of cutoff values. As the cutoff increases, the depth decreases. If the posterior probabilities were arbitrarily assigned to the cases, then the gains chart would be a horizontal line at the overall event proportion.

The gains chart is widely used in database marketing to decide how deep into a database to go with a promotion. The simplest way to construct this curve is to sort and bin the predicted posterior probabilities (for example, into deciles). The gains chart is easily augmented with revenue and cost information. A plot of sensitivity versus depth is sometimes called a Lorenz curve, concentration curve, or lift curve (although lift value is not explicitly displayed).

This plot and the ROC curve are very similar because depth and 1 − specificity are monotonically related. If the proper adjustments were made when the model was fitted, then the predicted posterior probabilities are correct. However, the confusion matrices would be incorrect with regard to the population because the event cases are over-represented.

Sensitivity and specificity, however, are not affected by separate sampling because they do not depend on the proportion of each class in the sample.

For example, if the sample represented the population, then n1 cases are in class 1. The proportion of those that were allocated to class 1 is Se. Thus, there are n1 × Se true positives. Convergence was not attained for the intercept-only model.

Results of fitting the intercept-only model are based on the last maximum likelihood iteration, so the validity of the model fit is questionable. Knowledge of the population priors and of the sensitivity and specificity is sufficient to fill in the confusion matrices.
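That arithmetic can be sketched in Python; the population size, prior, sensitivity, and specificity below are hypothetical:

```python
# Population priors plus Se and Sp determine the population confusion matrix.
N, pi1 = 100_000, 0.02        # hypothetical population size and event prior
se, sp = 0.70, 0.80           # hypothetical sensitivity and specificity

n1 = N * pi1                  # actual positives in the population
n0 = N - n1                   # actual negatives

tp = n1 * se                  # true positives: n1 x Se
fn = n1 - tp                  # false negatives
tn = n0 * sp                  # true negatives: n0 x Sp
fp = n0 - tn                  # false positives

ppv = tp / (tp + fp)          # positive predicted value in the population
assert abs(tp - 1400) < 1e-6
assert abs(fp - 19600) < 1e-6
assert abs(ppv - 1400 / 21000) < 1e-9
```

Note how a rare event drags the positive predicted value down (here to about 6.7%) even with respectable sensitivity and specificity, which is why the unadjusted sample confusion matrix is so misleading.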

Several additional statistics can be calculated in a DATA step. The selected cutoffs occur where values of the estimated posterior probability change, provided the posterior probabilities are more than. The ROC data set and the various plots can be specified using the point-and-click interface. To determine the optimal cutoff, a performance criterion needs to be defined.

If the goal were to increase the sensitivity of the classifier, then the optimal classifier would allocate all cases to class 1. If the goal were to increase specificity, then the optimal classifier would be to allocate all cases to class 0.

For realistic data, there is a tradeoff between sensitivity and specificity. Higher cutoffs decrease sensitivity and increase specificity. Lower cutoffs decrease specificity and increase sensitivity. The decision-theoretic approach starts by assigning misclassification costs (losses) to each type of error (false positives and false negatives). The optimal decision rule minimizes the total expected cost (risk). The Bayes rule is the decision rule that minimizes the expected cost.

In the two-class situation, the Bayes rule can be determined analytically. If you classify a case into class 1, then the expected cost is (FP cost) × (1 − p), where p is the true posterior probability that the case belongs to class 1.

Solving for p gives the optimal cutoff probability. Since p must be estimated from the data, the plug-in Bayes rule is used in practice. Note that the Bayes rule depends only on the ratio of the costs, not on their actual values. If the misclassification costs are equal, then the Bayes rule corresponds to a cutoff of 0.5. Hand commented that "the use of error rate often suggests insufficiently careful thought about the real objectives." When the target event is rare, the cost of a false negative is usually greater than the cost of a false positive.
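Setting the two expected costs equal and solving for p gives the cutoff, which can be sketched directly:

```python
def bayes_cutoff(fp_cost, fn_cost):
    """Plug-in Bayes rule: allocate a case to class 1 when its estimated
    posterior p exceeds FP / (FP + FN). This equates the two expected
    costs: fp_cost * (1 - p) for classifying into class 1, and
    fn_cost * p for classifying into class 0."""
    return fp_cost / (fp_cost + fn_cost)

# Equal costs reduce to the familiar 0.5 cutoff.
assert bayes_cutoff(1, 1) == 0.5
# Only the cost ratio matters, not the actual values.
assert bayes_cutoff(1, 9) == bayes_cutoff(10, 90) == 0.1
# A costlier false negative (the rare-event situation) pushes the cutoff down.
assert bayes_cutoff(1, 9) < 0.5
```

The last assertion mirrors the point made in the surrounding text: when missing an event costs far more than a false alarm, the optimal cutoff falls well below one-half.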

The cost of not soliciting a responder is greater than the cost of sending a promotion to someone who does not respond. The cost of accepting an applicant who will default is greater than the cost of rejecting someone who would pay-off the loan.

The cost of approving a fraudulent transaction is greater than the cost of denying a legitimate one. Such considerations dictate cutoffs that are less, often much less, than 0.5. Examining the performance of a classifier over a range of cost ratios can be useful.

The central cutoff, 0.5, tends to maximize the mean of sensitivity and specificity. Because increasing sensitivity usually corresponds to decreasing specificity, the central cutoff also tends to equalize sensitivity and specificity. Statistics that summarize the performance of a classifier across a range of cutoffs can also be useful for assessing global discriminatory power. One approach is to measure the separation between the predicted posterior probabilities for the two classes.

The more the distributions overlap, the weaker the model. The simplest statistics are based on the difference between the means of the two distributions.

In credit scoring, the divergence statistic is a scaled difference between the means (Nelson). Hand discusses several summary measures based on the difference between the means. The well-known t-test for comparing two distributions is also based on the difference between the means. The t-test has many optimal properties when the two distributions are symmetric with equal variance and light tails. However, the distributions of the predicted posterior probabilities are typically asymmetric with unequal variance.

Many other two-sample tests have been devised for non-normal distributions (Conover). The Kolmogorov-Smirnov (K-S) test statistic, D, is the maximum vertical difference between the two empirical cumulative distribution functions. If D equals zero, the distributions are everywhere identical. The maximum value of the K-S statistic, 1, occurs when the distributions are perfectly separated.

Use of the K-S statistic for comparing predictive models is popular in database marketing. An oversampled validation data set does not affect D because the empirical distribution function is unchanged if each case represents more than one case in the population.
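A minimal Python sketch of the K-S statistic (illustrative; the course computes this in SAS, and the score values below are made up):

```python
# Kolmogorov-Smirnov statistic: the maximum vertical gap between the
# empirical CDFs of predicted probabilities for events and non-events.
# Illustrative sketch with made-up scores.

def ks_statistic(scores_event, scores_nonevent):
    cutpoints = sorted(set(scores_event) | set(scores_nonevent))

    def ecdf(sample, x):
        return sum(1 for s in sample if s <= x) / len(sample)

    return max(abs(ecdf(scores_event, x) - ecdf(scores_nonevent, x))
               for x in cutpoints)

events    = [0.9, 0.8, 0.6, 0.4]
nonevents = [0.7, 0.3, 0.2, 0.1]
print(ks_statistic(events, nonevents))  # -> 0.75
```

Consistent with the text, identical samples give D = 0 and perfectly separated samples give D = 1.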

In the predictive modeling context, it could be argued that location differences are paramount. Because of its generality, the K-S test is not particularly powerful at detecting location differences. The most powerful nonparametric two-sample test is the Wilcoxon-Mann-Whitney test.

The Wilcoxon version of this popular two-sample test is based on the ranks of the data. In the predictive modeling context, the predicted posterior probabilities would be ranked from smallest to largest.

The test statistic is based on the sum of the ranks in one of the classes. A perfect ROC curve would be a horizontal line at one; that is, sensitivity and specificity would both equal one for all cutoffs. In this case, the c statistic would equal one. The c statistic technically ranges from zero to one, but in practice it should not get much lower than 0.5. A perfectly random model, where the posterior probabilities were assigned arbitrarily, would give a straight 45° ROC curve through the origin; hence, it would give a c statistic of 0.5.

Oversampling does not affect the area under the ROC curve because sensitivity and specificity are unaffected. The area under the ROC curve is also equivalent to the Gini coefficient, which is used to summarize the performance of a Lorenz curve (Hand). The results of the Wilcoxon test can be used to compute the c statistic. To correct for optimistic bias, a common strategy is to hold out a portion of the development data for assessment. Statistics that measure the predictive accuracy of the model include sensitivity and positive predicted value.
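The c statistic can be computed directly as the proportion of concordant event/non-event pairs, which is equivalent to the rank-based Wilcoxon calculation. An illustrative Python sketch (the scores are made up):

```python
# c statistic (area under the ROC curve) by pairwise concordance:
# the proportion of event/non-event pairs in which the event case
# receives the higher predicted probability, counting ties as 1/2.
# Illustrative sketch; for large data a rank-based formula is faster.

def c_statistic(p_events, p_nonevents):
    pairs = concordant = tied = 0
    for pe in p_events:
        for pn in p_nonevents:
            pairs += 1
            if pe > pn:
                concordant += 1
            elif pe == pn:
                tied += 1
    return (concordant + 0.5 * tied) / pairs

events    = [0.9, 0.8, 0.6, 0.4]
nonevents = [0.7, 0.3, 0.2, 0.1]
print(c_statistic(events, nonevents))  # -> 0.875
```

A perfect model scores 1.0 and a model whose scores carry no information scores 0.5, matching the ROC description above.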

Graphics such as the ROC curve, the gains chart, and the lift chart can also be used to assess the performance of the model. If the assessment data set was obtained by splitting oversampled data, then the assessment statistics need to be adjusted. This can be accomplished using the sensitivity, the specificity, and the prior probabilities.
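For example, sensitivity and specificity carry over unchanged from an oversampled split, but the positive predicted value must be recomputed with the population priors. A Python sketch (illustrative; the numbers are made up, and the formula is the standard Bayes-theorem adjustment):

```python
# Adjusting an assessment statistic for oversampling: recompute the
# positive predicted value with the population priors,
#   PV+ = Se*pi1 / (Se*pi1 + (1 - Sp)*pi0).
# Illustrative sketch with made-up numbers.

def positive_predicted_value(sensitivity, specificity, pi1):
    """pi1 = population prior probability of the event."""
    pi0 = 1.0 - pi1
    return (sensitivity * pi1) / (sensitivity * pi1 + (1.0 - specificity) * pi0)

# A model that looks strong on a 50/50 oversampled split can have a
# modest PV+ when the event is rare in the population.
print(positive_predicted_value(0.80, 0.90, 0.02))
```

With 80% sensitivity and 90% specificity but a 2% event rate, only about 14% of predicted positives are true events.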

In predictive modeling, the ultimate use of logistic regression is to allocate cases to classes. To determine the optimal cutoff probability, the plug-in Bayes rule can be used. The only information you need is the ratio of the cost of a false negative to the cost of a false positive. This optimal cutoff minimizes the total expected cost. For example, the cost of not soliciting a responder is greater than the cost of sending a promotion to someone who does not respond. Such considerations dictate cutoffs that are usually much less than 0.5.

A popular statistic that summarizes the performance of a model across a range of cutoffs is the Kolmogorov-Smirnov statistic. However, this statistic is not as powerful at detecting location differences as the Wilcoxon-Mann-Whitney test. Thus, the c statistic should be used to assess the performance of a model across a range of cutoffs. Decision trees (Breiman et al. 1984) offer another approach.

Trees recursively partition the input space in order to isolate regions where the class composition is homogeneous. Trees are flexible, interpretable, and scalable. When the target is binary, ordinary scatter plots of the target against an input are not very enlightening.

A useful plot for detecting nonlinear relationships is a plot of the empirical logits. A simple, scalable, and robust smoothing method is to plot empirical logits for quantiles of the input variables. These logits use a minimax estimate of the proportion of events in each bin (Duffy and Santner). This eliminates the problem caused by zero counts and reduces variability.

The number of bins determines the amount of smoothing: the fewer the bins, the more the smoothing. One large bin would give a constant logit. For very large data sets and intervally scaled inputs, a large number of bins often works well.

If the standard logistic model were true, then the plots would be linear. Sample variability can cause apparent deviations, particularly when the bin size is too small. However, serious nonlinearities, such as nonmonotonicity, are usually easy to detect. The bins will be equal-size quantiles, except when the number of tied values exceeds the bin size, in which case the bin is enlarged to contain all the tied values.
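The binning and the minimax-adjusted logit can be sketched in Python (illustrative; the course builds these quantities in a SAS DATA step, and the adjustment shown, p_hat = (events + sqrt(n)/2) / (n + sqrt(n)), is the standard minimax estimate of a binomial proportion):

```python
# Empirical logits for quantile bins of an input, using a minimax
# estimate of the event proportion so that bins with zero events
# still yield a finite logit. Illustrative sketch; the bin count
# and the data are made up.

import math

def empirical_logit(events, n):
    """Minimax-adjusted logit: p_hat = (events + sqrt(n)/2) / (n + sqrt(n))."""
    p_hat = (events + math.sqrt(n) / 2.0) / (n + math.sqrt(n))
    return math.log(p_hat / (1.0 - p_hat))

def binned_logits(x, y, n_bins):
    """Sort cases by x, cut into n_bins quantile bins, and return
    (mean of x in bin, empirical logit of bin) pairs for plotting."""
    pairs = sorted(zip(x, y))
    size = len(pairs) // n_bins
    out = []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size] if b < n_bins - 1 else pairs[b * size:]
        xs = [p[0] for p in chunk]
        events = sum(p[1] for p in chunk)
        out.append((sum(xs) / len(xs), empirical_logit(events, len(chunk))))
    return out

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
print(binned_logits(x, y, 2))
```

Plotting the resulting pairs against the bin means is the plot described above; a roughly linear pattern supports the standard logistic model.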

The empirical logits are plotted against the mean of the input variable in each bin, which needs to be computed as well (here, in the BINS data set). There are four general ways to handle nonlinearity: (1) hand-craft new input variables, (2) fit polynomial models, (3) use flexible multivariate function estimators, or (4) do nothing. Skilled and patient data analysts can accommodate nonlinearities in a logistic regression model by transforming or discretizing the input variables.

This can become impractical with high-dimensional data and increases the risk of overfitting (Section 5). Methods such as classification trees, generalized additive models, projection pursuit, multivariate adaptive regression splines, radial basis function networks, and multilayer perceptrons (Section 5) are examples of flexible multivariate function estimators.

Standard linear logistic regression can produce powerful and useful classifiers even when the estimates of the posterior probabilities are poor.

Often more flexible approaches do not show enough improvement to warrant the effort. A linear model (first-degree polynomial) is planar. A quadratic model (second-degree polynomial) is a paraboloid or a saddle surface.

Squared terms allow the effect of the inputs to be nonmonotonic. Moreover, the effect of one variable can change linearly across the levels of the other variables. Cubic models (third-degree polynomials) allow for a minimum and a maximum in each dimension.

Moreover, the effect of one variable can change quadratically across the levels of the other variables. Higher-degree polynomials allow even more flexibility. A polynomial model can be specified by adding new terms to the model; the new terms are products of the original variables. A full polynomial model of degree d includes terms for all possible products involving up to d factors.

The squared terms allow the effect of the inputs to be nonmonotonic. The cross product terms allow for interactions among the inputs.
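A quick way to see how the number of terms grows is to enumerate the degree-2 terms. An illustrative Python sketch (the variable names are made up):

```python
# Enumerating the full set of degree-2 terms: squares allow
# nonmonotone effects, cross products allow interactions.
# Illustrative sketch with made-up variable names.

from itertools import combinations_with_replacement

def quadratic_terms(names):
    """Return the names of all degree-2 terms for the given inputs."""
    return [a + "*" + b for a, b in combinations_with_replacement(names, 2)]

print(quadratic_terms(["x1", "x2", "x3"]))
# squares x1*x1, x2*x2, x3*x3 plus cross products x1*x2, x1*x3, x2*x3
```

With 3 inputs there are 6 degree-2 terms; with 100 inputs there are 5050, which previews the curse of dimensionality discussed below.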

Cubic terms allow for even more flexibility. However, polynomial models have drawbacks; chief among them is the curse of dimensionality. The number of terms in a d-degree polynomial increases at a rate proportional to the number of dimensions raised to the d power (for example, the number of terms in a full quadratic model increases at a quadratic rate with dimension). Polynomials also behave non-locally: the influence that individual cases have on the predicted value at a particular point (the equivalent kernel) does not necessarily increase the closer they are to that point.

Higher-degree polynomials are not reliable smoothers. One suggested approach is to include all the significant input variables and use forward selection to detect any two-factor interactions and squared terms. This approach will miss interactions and squared terms involving non-significant input variables, but the number of all possible two-factor interactions and squared terms is too large for the LOGISTIC procedure to assess.

The c statistic increased. However, this model should be assessed using the validation data set because the inclusion of many higher-order terms may increase the risk of overfitting. The input layer contains input units, one for each input dimension. In a feed-forward neural network, the input layer can be connected to a hidden layer containing hidden units (neurons).

The hidden layer may be connected to other hidden layers, which are eventually connected to the output layer. The hidden units transform the incoming inputs with an activation function. The output layer represents the expected target. The output activation function transforms its inputs so that the predictions are on an appropriate scale. A neural network with a skip layer has the input layer connected directly to the output layer.

Neural networks are universal approximators; that is, they can theoretically fit any nonlinear function (though not necessarily in practice). The price they pay for flexibility is incomprehensibility: they are a black box with regard to interpretation. A network diagram is a graphical representation of an underlying mathematical model. Many popular neural networks can be viewed as extensions of ordinary logistic regression models.

An MLP (multilayer perceptron) is a feed-forward network where each hidden unit nonlinearly transforms a linear combination of the incoming inputs. The nonlinear transformation (activation function) is usually sigmoidal, such as the hyperbolic tangent function.

When the target is binary, the appropriate output activation function is the logistic function (the inverse logit). A skip layer represents a linear combination of the inputs that is untransformed by hidden units. Consequently, an MLP with a skip layer for a binary target is a standard logistic regression model that includes new variables representing the hidden units. These new variables model the higher-order effects not accounted for by the linear model.

An MLP with no hidden layers is just the standard logistic regression model (a brain with no neurons). The parameters can be estimated in all the usual ways, including Bernoulli maximum likelihood. However, the numerical aspects can be considerably more challenging. The sigmoidal surfaces are combined with a planar surface (the skip layer).
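The forward pass just described can be written out directly. An illustrative Python sketch (all weights are arbitrary made-up numbers, not a fitted model):

```python
# Forward pass of an MLP with one tanh hidden layer, a skip layer,
# and a logistic output: a logistic regression on the original inputs
# plus the hidden-unit outputs. Illustrative sketch; the weights are
# arbitrary made-up numbers.

import math

def mlp_posterior(x, hidden_weights, skip_weights, output_weights, bias):
    # hidden units: tanh of a linear combination of the inputs
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x))) for ws in hidden_weights]
    # output unit: bias + hidden contributions + untransformed skip-layer inputs
    linear = bias
    linear += sum(w * h for w, h in zip(output_weights, hidden))
    linear += sum(w * xi for w, xi in zip(skip_weights, x))
    # logistic (inverse logit) output activation for a binary target
    return 1.0 / (1.0 + math.exp(-linear))

p = mlp_posterior([0.5, -1.2],
                  hidden_weights=[[1.0, -0.5], [0.3, 0.8]],
                  skip_weights=[0.2, 0.1],
                  output_weights=[0.7, -0.4],
                  bias=-0.1)
print(p)  # a posterior probability between 0 and 1
```

Setting all weights to zero collapses the network to a constant prediction of 0.5, and dropping the hidden layer leaves a standard logistic regression.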

This combination is transformed by the logistic function to give the estimate of the posterior probability. The first step is to drastically reduce the dimension using a decision tree.

A decision tree is a flexible multivariate model that effectively defies the curse of dimensionality. Retaining only the variables selected by a decision tree is one way to reduce the dimension without ignoring nonlinearities and interactions.
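As a crude stand-in for tree-based variable selection, the following Python sketch scores each input by the Gini impurity reduction of its best single split, a one-level tree, and ranks the inputs (illustrative only; Enterprise Miner's Variable Selection node uses its own chi-squared method, and the data here are made up):

```python
# One-level decision-tree screen: score each input by the Gini
# impurity reduction of its best binary split, then keep the top
# inputs. A crude stand-in for tree-based variable selection;
# the data are made up.

def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split_gain(x, y):
    parent = gini(y)
    best = 0.0
    for cut in sorted(set(x))[:-1]:
        left  = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        best = max(best, parent - child)
    return best

y  = [0, 0, 0, 1, 1, 1]
x1 = [1, 2, 3, 4, 5, 6]      # separates the classes perfectly
x2 = [1, 4, 2, 5, 3, 6]      # partially informative
x3 = [5, 1, 4, 6, 2, 3]      # little to do with y

scores = {name: best_split_gain(x, y) for name, x in [("x1", x1), ("x2", x2), ("x3", x3)]}
print(sorted(scores, key=scores.get, reverse=True))
```

Retaining only the highest-scoring inputs mirrors the dimension-reduction role the tree plays ahead of the neural network.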

Set the model role of INS to target. Right-click in the open area where the prior profiles are activated and select Add. Select Prior vector and change the prior probabilities to represent the true proportions in the population.

Right-click on Prior vector and select Set to use. Close the target profiler and select Yes to save the changes. The Data Replacement node does imputation for the missing values. Select the Imputation Methods tab and choose median as the method. A variable selection tree can be fit in the Variable Selection node using the chi-squared method under the Target Associations tab. Run the flow and note that 10 inputs were accepted. In the Neural Network node, select the Basic tab.

For the Network architecture, select Multilayer Perceptron. Set the number of hidden neurons to 3 and select Yes for Direct connections. Also select 3 for Preliminary runs, Default as the Training technique, and 2 hours for the Runtime limit. Run the flow from the Assessment node. After nonlinear relationships are identified, you can transform the input variables, create dummy variables, or fit a polynomial model.

The problem with hand-crafted input variables is the time and effort they take. The problem with higher-degree polynomial models is their non-local behavior. Another way of dealing with nonlinearities and interactions is to fit a neural network in Enterprise Miner.

Neural networks are a class of flexible nonlinear regression models that can theoretically fit any nonlinear function. However, the price neural networks pay for flexibility is incomprehensibility. Neural networks are useful, though, if the only goal is prediction, and understanding the model is of secondary importance.