R: Categorical risk factor binning

cat.bin

R Documentation

Categorical risk factor binning

Description

cat.bin implements a three-stage binning procedure for categorical risk factors. The first stage optionally corrects for the minimum percentage of observations per bin. The second stage optionally corrects for the minimum target rate (default rate), and the third stage optionally corrects for the maximum number of bins. The last stage implements a procedure known as the adjacent pooling algorithm (APA), which aims to minimize information loss while iteratively merging bins.
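The first stage can be pictured with a minimal base R sketch. It is an illustration of forward merging under a minimum-share constraint, not the PDtoolkit implementation: the helper name `merge_small_bins` and the rule "merge an undersized bin with the next one, or with the previous one if it is last" are assumptions made for this example.

```r
# Illustrative only: NOT the code used inside cat.bin.
# Merge each undersized bin with its neighbour in table order until
# every bin holds at least min.pct.obs of the observations.
merge_small_bins <- function(x, min.pct.obs = 0.05) {
  counts <- table(x)                 # modalities in alphabetic order
  groups <- as.list(names(counts))   # each bin starts as a single modality
  n <- as.numeric(counts)
  total <- sum(n)
  while (length(n) > 1 && any(n / total < min.pct.obs)) {
    i <- which(n / total < min.pct.obs)[1]    # first undersized bin
    j <- if (i < length(n)) i + 1 else i - 1  # next bin, or previous if last
    lo <- min(i, j)
    hi <- max(i, j)
    groups[[lo]] <- c(groups[[lo]], groups[[hi]])  # pool the modalities
    n[lo] <- n[lo] + n[hi]                         # pool the counts
    groups[[hi]] <- NULL
    n <- n[-hi]
  }
  names(n) <- vapply(groups, paste, character(1), collapse = "+")
  n
}

# Hypothetical risk factor: modality "b" holds only 2% of observations
x <- rep(c("a", "b", "c", "d"), times = c(50, 2, 40, 8))
merged <- merge_small_bins(x, min.pct.obs = 0.05)
print(merged)
```

In this toy data only "b" falls below the 5% threshold, so it is pooled with the following modality "c"; the other bins are left as they are.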

Usage

cat.bin(
  x,
  y,
  sc = NA,
  sc.merge = "none",
  min.pct.obs = 0.05,
  min.avg.rate = 0.01,
  max.groups = NA,
  force.trend = "modalities"
)

Arguments

`x`	Categorical risk factor.
`y`	Numeric target vector (binary).
`sc`	Special case elements. Default value is `NA`.
`sc.merge`	Defines how special cases are treated. Available options are `"none"`, `"first"`, `"last"`, and `"closest"`. If `"none"` is selected, the special cases are kept in a separate bin. If `"first"` or `"last"` is selected, the special cases are merged with the first or last bin. Which bin counts as first or last depends on the sorting option `force.trend`: alphabetic order of modalities (if `force.trend` is `"modalities"`) or minimum or maximum default rate (if `force.trend` is `"dr"`). If `"closest"` is selected, the special cases are merged with the bin whose default rate is closest. Special cases are merged with other bins at the beginning, i.e. before any of the three stages is run. Default value is `"none"`.
`min.pct.obs`	Minimum percentage of observations per bin. Default is 0.05 (or a minimum of 30 observations).
`min.avg.rate`	Minimum default rate per bin. Default is 0.01 (or a minimum of 1 bad case for a 0/1 `y`).
`max.groups`	Maximum number of bins (groups) allowed for the analyzed risk factor. If the number of bins after the first two stages is less than or equal to `max.groups`, or if `max.groups` is left at its default value (`NA`), no adjustment is performed. Otherwise, the APA algorithm is applied, which minimizes information loss while iteratively merging bins.
`force.trend`	Defines how the initial summary table is ordered. Possible options are `"modalities"` and `"dr"`. If `"modalities"` is selected, merging is performed forward based on the alphabetic order of the risk factor modalities. If `"dr"` is selected, merging is performed forward based on increasing default rate per modality. This direction of merging is applied in all three stages.
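The two `force.trend` orderings and the `"closest"` option of `sc.merge` can be illustrated with a short base R sketch. It uses made-up toy data and hypothetical column names (`bin`, `no`, `dr`); it only mimics the ordering logic described above and does not reproduce the summary table that cat.bin returns.

```r
# Toy data (hypothetical); y = 1 marks a default
x <- c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C")
y <- c(1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0)

# Per-modality summary: number of observations and default rate
tbl <- data.frame(
  bin = sort(unique(x)),
  no  = as.vector(table(x)),
  dr  = as.vector(tapply(y, x, mean)),
  stringsAsFactors = FALSE
)

# force.trend = "modalities": alphabetic order of modalities
by_modalities <- tbl[order(tbl$bin), ]
# force.trend = "dr": increasing default rate
by_dr <- tbl[order(tbl$dr), ]

# sc.merge = "closest": pick the bin whose default rate is nearest to
# the default rate of the special cases (0.45 is an assumed value)
sc_dr <- 0.45
closest <- tbl$bin[which.min(abs(tbl$dr - sc_dr))]

print(by_modalities)
print(by_dr)
print(closest)
```

Under `"modalities"` the first and last bins are "A" and "C", while under `"dr"` they are "B" (lowest default rate) and "A" (highest), which is why `sc.merge = "first"` or `"last"` can give different results depending on `force.trend`.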

Value

The command cat.bin returns a list of two objects. The first object, the data frame summary.tbl, presents a summary table of the final binning, while the second, x.trans, is a vector of the new grouping values.

References

Anderson, R. (2007). The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management and Decision Automation. Oxford University Press.

Examples

suppressMessages(library(PDtoolkit))
data(loans)
#prepare risk factor Purpose for the analysis
loans$Purpose <- ifelse(nchar(loans$Purpose) == 2, loans$Purpose, paste0("0", loans$Purpose))
#artificially add missing values in order to show the function's features
loans$Purpose[1:6] <- NA
#run binning procedure
res <- cat.bin(x = loans$Purpose, 
	   y = loans$Creditability, 
	   sc = NA,
	   sc.merge = "none",
	   min.pct.obs = 0.05, 
	   min.avg.rate = 0.05,
	   max.groups = NA, 
	   force.trend = "modalities")
res[[1]]
#check new risk factor against the original 
table(loans$Purpose, res[[2]], useNA = "always")
#repeat the same process with max.groups set to 4 and force.trend set to "dr"
res <- cat.bin(x = loans$Purpose, 
	   y = loans$Creditability, 
	   sc = NA,
	   sc.merge = "none",
	   min.pct.obs = 0.05, 
	   min.avg.rate = 0.05,
	   max.groups = 4, 
	   force.trend = "dr")
res[[1]]
#check new risk factor against the original 
table(loans$Purpose, res[[2]], useNA = "always")
#example of shrinking the number of groups for a numeric risk factor
#copy the existing numeric risk factor to a new one called maturity
loans$maturity <- loans$"Duration of Credit (month)"
#artificially add missing values in order to show the function's features
loans$maturity[1:10] <- NA
#categorize maturity with the MAPA algorithm from the monobin package
loans$maturity.bin <- cum.bin(x = loans$maturity, 
				y = loans$Creditability, g = 50)[[2]]
table(loans$maturity.bin)
#run binning procedure to decrease the number of bins from the previous step
res <- cat.bin(x = loans$maturity.bin, 
	   y = loans$Creditability, 
	   sc = "SC",
	   sc.merge = "closest",
	   min.pct.obs = 0.05, 
	   min.avg.rate = 0.01,
	   max.groups = 5, 
	   force.trend = "modalities")
res[[1]]
#check new risk factor against the original 
table(loans$maturity.bin, res[[2]], useNA = "always")