xgboost (and many other) modeling functions expect matrix input with factor levels one-hot encoded. cat_spread will one-hot encode any factor or character variable in data and return a one-hot encoded tibble. Conversely, cat_gather applies the inverse operation and converts one-hot encoded columns back into factors.
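To make the two operations concrete, here is a minimal base-R sketch of one-hot encoding a single column and then reversing it. This is not the package's implementation: `one_hot()` and `un_one_hot()` are hypothetical helpers written for illustration, and the `<column>_<level>` naming of the indicator columns is an assumption that may differ from what cat_spread actually produces. The sketch also assumes the column has no missing values.

```r
# Hypothetical helper: one-hot encode a single column of a data frame.
# Assumes df[[col]] contains no missing values.
one_hot <- function(df, col) {
  f <- factor(df[[col]])
  # One indicator column per level; `- 1` drops the intercept so that
  # no level is left out as a reference category.
  ind <- model.matrix(~ f - 1)
  colnames(ind) <- paste(col, levels(f), sep = "_")
  cbind(df[setdiff(names(df), col)], as.data.frame(ind))
}

# Hypothetical helper: rebuild the factor from its indicator columns.
# `lvls` lists the factor levels, in the order the factor should use.
un_one_hot <- function(df, col, lvls) {
  ind_cols <- paste(col, lvls, sep = "_")
  # For each row, find the indicator column holding the 1.
  idx <- max.col(as.matrix(df[ind_cols]), ties.method = "first")
  out <- df[setdiff(names(df), ind_cols)]
  out[[col]] <- factor(lvls[idx], levels = lvls)
  out
}

df <- data.frame(x = rep(letters[1:2], 50), y = 1:100)
encoded <- one_hot(df, "x")                        # columns: y, x_a, x_b
decoded <- un_one_hot(encoded, "x", c("a", "b"))   # columns: y, x
```

The inverse needs the level names because the encoded table no longer records which columns belonged to which factor; this is presumably why cat_gather takes a factor_levels argument.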
cat_spread(data, ...)
cat_gather(data, factor_levels)
data | data with categorical variables (i.e., factors) that need to be spread or gathered. |
---|---|
... | Arguments passed on to |
factor_levels | This parameter is only relevant for |
A tibble with the categorical variables either spread into one-hot encoded columns (cat_spread) or gathered back into factors (cat_gather).
df <- data.frame(x = rep(letters[1:2], 50), y = 1:100)
one_hot_df <- cat_spread(df)
cat_gather(one_hot_df, factor_levels = list(x = c('a', 'b')))
#> # A tibble: 100 x 2
#>    x         y
#>    <fct> <int>
#>  1 a         1
#>  2 b         2
#>  3 a         3
#>  4 b         4
#>  5 a         5
#>  6 b         6
#>  7 a         7
#>  8 b         8
#>  9 a         9
#> 10 b        10
#> # ... with 90 more rows