xgboost and many other modeling functions expect numeric matrix input, with each factor level one-hot encoded as its own column.

cat_spread one-hot encodes every factor or character variable in data and returns the result as a tibble. cat_gather applies the inverse operation, converting one-hot encoded columns back into factors.
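To make the round trip concrete, here is a minimal base R sketch of the same idea. It is not how cat_spread or cat_gather are implemented; it uses model.matrix and max.col as stand-ins, and the names df, one_hot, mat, levels_x, and x_back are illustrative only.

```r
# Toy data; all object names here are illustrative, not part of the package.
df <- data.frame(x = factor(c("a", "b", "a", "b")), y = 1:4)

# Spread: one indicator column per level of x.
# Removing the intercept (- 1) keeps a column for every level.
one_hot <- model.matrix(~ x - 1, data = df)
colnames(one_hot)  # "xa" "xb"

# Bind the indicators to the numeric column to get an all-numeric matrix,
# which is the kind of input matrix-based modeling functions expect.
mat <- cbind(one_hot, y = df$y)

# Gather: recover the factor by finding which indicator is set in each row.
levels_x <- levels(df$x)
x_back <- factor(levels_x[max.col(one_hot, ties.method = "first")],
                 levels = levels_x)
identical(x_back, df$x)
```

The gather step needs the original level names, which is why cat_gather takes a factor_levels argument.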

cat_spread(data, ...)

cat_gather(data, factor_levels)

Arguments

data

A data frame with categorical variables (factors or character columns) that need to be spread or gathered.

...

Arguments passed on to mltools::one_hot

sparsifyNAs

Should NAs be converted to 0s?

naCols

Should columns be generated to indicate the presence of NAs? Will only apply to factor columns with at least one NA

dropCols

Should the resulting data.table exclude the original columns which are one-hot-encoded?

factor_levels

This parameter is only relevant for cat_gather. A named list of factor levels, with each name corresponding to the column in the data that the factor levels describe.
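The factor_levels list can be written by hand, but it can also be built from the original data before spreading. The sketch below is one way to do that in base R; it assumes that both factor and character columns count as categorical, and the helper name is_cat is hypothetical.

```r
# Original (un-encoded) data, as in the Examples section.
df <- data.frame(x = rep(letters[1:2], 50), y = 1:100)

# Flag categorical columns: factors and character vectors.
is_cat <- vapply(df,
                 function(col) is.factor(col) || is.character(col),
                 logical(1))

# Named list with one character vector of levels per categorical column.
factor_levels <- lapply(df[is_cat], function(col) levels(factor(col)))
str(factor_levels)
```

The resulting list has the shape described above (here, a single element named x holding the levels of x), so it could be supplied as the factor_levels argument of cat_gather.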

Value

a tibble with categorical variables herded as you like.

Examples

df <- data.frame(x = rep(letters[1:2], 50), y = 1:100)
one_hot_df <- cat_spread(df)
cat_gather(one_hot_df, factor_levels = list(x=c('a','b')))
#> # A tibble: 100 x 2
#>    x         y
#>    <fct> <int>
#>  1 a         1
#>  2 b         2
#>  3 a         3
#>  4 b         4
#>  5 a         5
#>  6 b         6
#>  7 a         7
#>  8 b         8
#>  9 a         9
#> 10 b        10
#> # ... with 90 more rows