Oblique Random Forests

Making Leo Breiman’s Masterpiece Accessible and Interpretable with aorsf

Byron C Jaeger

2025-02-19

Hello, my name is Byron

Bottom line up front

  1. Oblique random forests are good at prediction, and they are excellent tools for spectral data (defined later).

  2. aorsf provides a unified, simple, and fast interface for oblique random forests.

Slides

Available online:

Overview

  • Background

    • Supervised learning

    • Decision trees and random forests

  • Oblique random forests

    • What is oblique?

    • aorsf statement of need

    • aorsf demo

Supervised learning

Learners

A learner is a recipe for a prediction model

  • A learner is not the same thing as a prediction model

  • A recipe is not the same thing as food.

  • This distinction is important for cross-validation (defined soon)

Find a good learner for these data

Learner 1: find the line of best fit

Learner 2: Use a spline

Learner 3: Loosen the spline

Cross validation

This technique allows you to objectively compare learners (a code sketch follows the list below):

  • Hold some data out as a testing set

  • Apply each learner to the remaining data (training set)

  • Predict the outcome using each model (one per learner)

  • Evaluate prediction accuracy

  • Repeat with different held out data

  • Compare average prediction accuracy by learner
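A minimal sketch of that loop in plain R, not the code behind the figures. The data frame dat, its columns x and y, and the three learner formulas are hypothetical stand-ins for the learners on the following slides.

# cross-validation sketch: compare three learners by average testing error
# (assumes a data frame `dat` with predictor `x` and outcome `y`)
library(splines)  # ns() for the spline learners

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))

learners <- list(
  line         = y ~ x,               # learner 1: line of best fit
  spline       = y ~ ns(x, df = 4),   # learner 2: spline
  loose_spline = y ~ ns(x, df = 20)   # learner 3: looser spline
)

cv_error <- sapply(learners, function(form) {
  mean(sapply(1:k, function(i) {
    fit <- lm(form, data = dat[folds != i, ])         # train on k - 1 folds
    prd <- predict(fit, newdata = dat[folds == i, ])  # predict held-out fold
    mean((dat$y[folds == i] - prd)^2)                 # testing error (MSE)
  }))
})

sort(cv_error)  # compare average prediction accuracy by learner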

All our data

Select a testing set

Take it away

Apply learner 1

Apply learner 2

Apply learner 3

Assess predictions in testing data

  • Cross-validation highlights learners that overfit: the loose spline has the lowest training error but the highest testing error.

    Learner        Training error   Testing error
    Line           0.41             0.34
    Spline         0.28             0.22
    Loose spline   0.27             0.35

Decision trees and random forests

Decision trees grow by recursively splitting data.

Splits should create groups with different outcomes.

Splitting continues until stopping criteria are met.
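As a hedged illustration of recursive splitting and stopping criteria, here is an axis-based tree grown with rpart on the built-in iris data (illustrative settings, not the data behind the figures).

# grow an axis-based decision tree by recursive splitting
library(rpart)

tree <- rpart(
  Species ~ .,
  data    = iris,
  control = rpart.control(
    minsplit = 20,   # stopping criterion: need >= 20 rows to attempt a split
    cp       = 0.01  # stopping criterion: a split must improve overall fit
  )                  #   by at least this fraction
)

tree  # printed output lists each split and the groups it creates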

The same splits, visualized as a tree

Decision trees

  • Pros

    • Simple and intuitive visualization.

    • Captures conditional relationships.

  • Cons

    • Difficulty with linear relationships.

    • Overfits when trees grow too deep.

Random forests

Defn: an ensemble of de-correlated decision trees

  • Each tree on its own is fairly weak at prediction.

  • However, the aggregate prediction is usually very good.

  • Why? Consider this example

    # suppose we ask 5000 independent weak learners a yes/no question.
    # Individually, weak learners are right 51% of the time. However,
    # the probability that a majority of weak learners are right is 92%.
    # proof:
    1 - pbinom(q = 2500, size = 5000, prob = 0.51)
    [1] 0.9192858

Random forests

Defn: an ensemble of de-correlated decision trees

  • Each tree on its own is fairly weak at prediction.

  • However, the aggregate prediction is usually very good.

  • How are they de-correlated? (sketched in code after this list)

    • Random subset of (bootstrapped) data for each tree.

    • Random subset of predictors considered for each split.
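A minimal sketch of those two sources of randomness, using ranger and the built-in iris data (illustrative values, not the settings behind the figures).

# the two randomization knobs of a random forest, shown with ranger
library(ranger)

fit <- ranger(
  Species ~ .,
  data      = iris,
  num.trees = 500,
  replace   = TRUE,  # each tree trains on a bootstrapped subset of the rows
  mtry      = 2      # each split considers a random subset of 2 predictors
)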

Predictions from a single randomized tree

Predictions from ensemble of 5 randomized trees

Predictions from ensemble of 100 randomized trees

Predictions from ensemble of 500 randomized trees

Oblique random forests

What is oblique?

An oblique split uses a linear combination of predictors, rather than a single predictor, to split the data.

Predictions from a single oblique tree

Predictions from an oblique random forest

Are oblique splits helpful?

For prediction, the answer is usually yes.

  • Leo Breiman, author of the random forest, noted this:

Are oblique splits helpful?

For computational efficiency, the answer is no.

# packages used in this benchmark
library(dplyr)           # mutate()
library(tidyr)           # drop_na()
library(palmerpenguins)  # penguins data (assumed source)
library(microbenchmark)
library(ranger)
library(randomForestSRC) # rfsrc()
library(aorsf)           # orsf()
library(ODRF)

data_bench <- as.data.frame(
 mutate(drop_na(penguins), species = factor(species))
)

bench <- microbenchmark(
 axis_ranger = ranger(formula = species ~ bill_length_mm + flipper_length_mm,
                      data = data_bench),
 axis_rfsrc = rfsrc(formula = species ~ bill_length_mm + flipper_length_mm,
                    data = data_bench),
 oblique_aorsf = orsf(formula = species ~ bill_length_mm + flipper_length_mm,
                      data = data_bench),
 oblique_odrf = ODRF(formula = species ~ bill_length_mm + flipper_length_mm,
                     data = data_bench),
 times = 10
)

Are oblique splits helpful?

For computational efficiency, the answer is no.

print(bench, signif=3, unit='relative')
Unit: relative
          expr    min     lq   mean median      uq    max neval cld
   axis_ranger   1.00   1.00   1.00   1.00   1.000   1.00    10  a 
    axis_rfsrc   1.78   1.77   1.22   1.75   0.732   1.08    10  a 
 oblique_aorsf   1.57   1.96   1.43   1.93   0.920   1.39    10  a 
  oblique_odrf 554.00 530.00 291.00 463.00 174.000 134.00    10   b

Computational efficiency is important

30-day downloads from CRAN:

  • axis-based packages:

    • ranger: 42,012

    • randomForestSRC: 5,388

  • oblique packages:

    • aorsf: 1,470

    • ODRF: 264

aorsf

Statement of need

  • Oblique random forests are under-utilized, partly because of their high computational cost. Existing software focuses on specific implementations, limiting its scope.

  • aorsf unifies oblique random forest software (interface sketched after this list).

    • Fast C++ backend based on ranger.
    • Supports survival, regression, and classification.
    • Supports custom functions for oblique splitting.
    • Part of tidymodels and mlr3.
    • Fast variable importance and partial dependence.
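A minimal sketch of that unified interface, assuming the pbc_orsf and penguins_orsf example data bundled with aorsf: the same orsf() call handles survival and classification (and regression, as in the demo below).

# unified interface sketch: same orsf() call, different outcome types
library(aorsf)
library(survival)  # Surv()

# survival outcome (pbc_orsf is an example dataset bundled with aorsf)
fit_surv <- orsf(Surv(time, status) ~ . - id, data = pbc_orsf)

# classification outcome (penguins_orsf is also bundled with aorsf)
fit_clas <- orsf(species ~ ., data = penguins_orsf)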

“a” stands for “accelerated”

Our approach for linear combinations of predictors:

  • Fit a regression model to the data in the current tree node.

    • Note: logistic and Cox models iterate until convergence.

  • Instead of iterating until convergence, stop after one iteration.

  • Use the beta coefficients from that model as the coefficients for the linear combination of predictors (sketched in code below).
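A plain-R sketch of that idea for a classification node, not aorsf's internal C++ routine; the helper name one_step_coefs and its inputs are hypothetical.

# one Newton-Raphson step of logistic regression (plain R sketch);
# x is a numeric matrix of predictors in the current node, y is a 0/1 vector
one_step_coefs <- function(x, y) {
  x1   <- cbind(1, x)                            # add intercept column
  beta <- rep(0, ncol(x1))                       # start from beta = 0
  p    <- as.vector(1 / (1 + exp(-x1 %*% beta))) # predicted probabilities
  w    <- p * (1 - p)                            # working weights
  # single update: beta + (X'WX)^{-1} X'(y - p); no further iterations
  beta + solve(crossprod(x1, x1 * w), crossprod(x1, y - p))
}

# linear combination of predictors for the oblique split
# (drop the intercept; it only shifts the cut-point)
# lc <- x %*% one_step_coefs(x, y)[-1]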

Demo with spectral data

Spectral data include continuous, correlated predictors.

  • Example: modeldata::meats

    protein   x_001     x_002     ...   x_100
       16.7   2.61776   2.61814   ...   2.81920
       13.5   2.83454   2.83871   ...   3.17942
       20.5   2.58284   2.58458   ...   2.54816
       20.7   2.82286   2.82460   ...   2.79622

Demo with spectral data

  • Description: Data are recorded on a Tecator Infratec Food and Feed Analyzer working in the wavelength range 850 - 1050 nm by the Near Infrared Transmission principle. Each sample contains finely chopped pure meat with different moisture, fat and protein contents.

  • Details: For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry.

Demo with spectral data

Make train & test sets:

library(aorsf)      # orsf()
library(ranger)     # ranger()
library(yardstick)  # rsq_vec()
library(modeldata)  # meats data

trn_rows <- 
 sample(nrow(meats), 100)

meats_train <- meats[trn_rows, ]

meats_test <- meats[-trn_rows, ]

Train axis & oblique forests:

fit_aorsf <- 
 orsf(protein ~ ., 
      data = meats_train)
fit_ranger <- 
 ranger(protein ~.,
        data = meats_train)

Evaluate R² of predictions (higher is better):

prd_aorsf <- predict(fit_aorsf, new_data = meats_test, pred_simplify = TRUE)
prd_ranger <- predict(fit_ranger, data = meats_test)$predictions

rsq_vec(estimate = prd_aorsf, truth = meats_test$protein)
[1] 0.9356059
rsq_vec(estimate = prd_ranger, truth = meats_test$protein)
[1] 0.6872002

Demo with spectral data

aorsf supports variable importance

orsf_vi(fit_aorsf)[1:3]
      fat     x_029     x_028 
0.2355890 0.2328767 0.2315789 

And multivariable-adjusted summaries for each predictor:

orsf_summarize_uni(fit_aorsf, n_variables = 1)

-- fat (VI Rank: 1) ------------------------

        |--------- Expected value ---------|
  Value     Mean   Median   25th %   75th %
 <char>    <num>    <num>    <num>    <num>
   5.90 18.29500 19.33016 16.37537 20.19645
   7.28 18.27809 19.32573 16.36435 20.17653
   15.7 18.12412 19.20343 16.31901 19.95582
   29.2 17.88790 18.98651 15.98738 19.70517
   37.9 17.75646 18.84186 15.91031 19.60750

 Predicted expected value for top 1 predictors 

Demo with spectral data

aorsf can also look for pairwise interactions (details here)

top_preds <- names(orsf_vi(fit_aorsf)[1:10])

# warning: this can get very computationally expensive.
# use subsets of <= 10 predictors to keep it efficient.
vint <- orsf_vint(fit_aorsf, predictors = top_preds)

vint[1:10, ]
     interaction       score          pd_values
          <char>       <num>             <list>
 1: x_029..x_014 0.010221001 <data.table[25x9]>
 2: x_030..x_035 0.010012501 <data.table[25x9]>
 3: x_029..x_034 0.008032039 <data.table[25x9]>
 4: x_028..x_014 0.008020588 <data.table[25x9]>
 5: x_034..x_026 0.007513260 <data.table[25x9]>
 6: x_034..x_035 0.007083606 <data.table[25x9]>
 7: x_029..x_035 0.006955590 <data.table[25x9]>
 8: x_030..x_014 0.006674831 <data.table[25x9]>
 9: x_027..x_014 0.006570676 <data.table[25x9]>
10: x_034..x_030 0.006246572 <data.table[25x9]>

Demo with spectral data

and sometimes it finds them.

Demo with spectral data

Sanity check the result using regression

library(splines)  # bs()

anova(
 lm(
  protein ~ bs(x_029)*bs(x_014),
  data = meats_train
 )
)

anova(
 lm(
  protein ~ bs(x_030)*bs(x_035),
  data = meats_train
 )
)
# A tibble: 4 × 2
  term                p.value
  <chr>               <chr>  
1 bs(x_029)           <.001  
2 bs(x_014)           <.001  
3 bs(x_029):bs(x_014) <.001  
4 Residuals           --     
# A tibble: 4 × 2
  term                p.value
  <chr>               <chr>  
1 bs(x_030)           <.001  
2 bs(x_035)           <.001  
3 bs(x_030):bs(x_035) .20    
4 Residuals           --     

More spectral data

  • Here is a much larger spectral data example

  • Example: modeldatatoo::data_chimiometrie_2019()

# A tibble: 6,915 × 7
   soy_oil wvlgth_001 wvlgth_002 wvlgth_003 wvlgth_004 ...   wvlgth_550
     <dbl>      <dbl>      <dbl>      <dbl>      <dbl> <chr>      <dbl>
 1     2.1      0.208      0.207      0.207      0.207 ...        0.592
 2     2.1      0.206      0.206      0.206      0.206 ...        0.600
 3     2.1      0.207      0.207      0.207      0.206 ...        0.602
 4     0.5      0.206      0.206      0.205      0.205 ...        0.608
 5     0.5      0.201      0.200      0.200      0.200 ...        0.601
 6     0.5      0.206      0.205      0.205      0.205 ...        0.613
 7     0.5      0.202      0.202      0.201      0.201 ...        0.614
 8     0.5      0.205      0.205      0.204      0.204 ...        0.585
 9     0.5      0.205      0.205      0.204      0.204 ...        0.594
10     0.5      0.207      0.207      0.207      0.207 ...        0.589
# ℹ 6,905 more rows

More spectral data

  • Description: This data set was published as the challenge at the Chimiometrie 2019 conference held in Montpellier and is available at the conference homepage. The data consist of 6915 training spectra and 600 test spectra measured at 550 (unknown) wavelengths. The target was the amount of soy oil (0-5.5%), lucerne (0-40%) and barley (0-52%) in a mixture.

Bigger benchmark

Now we’ll fit 6 learners in addition to aorsf and use nested cross-validation to tune each approach. The pipeline will:

  • consider 16 data pre-processing approaches (details here)

    • this includes an option for each learner to use a variable selection step and a data transformation (e.g., principal component analysis)

  • implement tuning for each learner (details here)

All code available here

Even with fully developed tuning pipelines, it is difficult to beat oblique random forests on spectral data.

Conclusion

  1. Oblique random forests are good at prediction, and they are excellent tools for spectral data.

  2. aorsf provides a unified, simple, and fast interface for oblique random forests.

  3. Learn more here

Thank you!

Bonus round

Subsetting data by tree allows for out-of-bag prediction.

About 2/3 of the data are in-bag for each tree.

The out-of-bag remainder is external to the tree.

Each observation’s denominator (the number of trees where it is out-of-bag) is tracked.

Repeat until all trees are grown.
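A plain-R sketch of this bookkeeping, with a linear model standing in for a single tree; the data and helper objects are made up for illustration.

# out-of-bag bookkeeping sketch (lm() stands in for a single tree)
n   <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- dat$x + rnorm(n)

oob_sum   <- numeric(n)  # running sum of out-of-bag predictions
oob_count <- numeric(n)  # each observation's denominator

for (tree in 1:100) {
  in_bag  <- sample(n, n, replace = TRUE)  # ~2/3 of rows end up in-bag
  out_bag <- setdiff(seq_len(n), in_bag)   # the remainder is out-of-bag
  fit <- lm(y ~ x, data = dat[in_bag, ])   # "grow" the tree on in-bag rows
  oob_sum[out_bag]   <- oob_sum[out_bag] + predict(fit, dat[out_bag, ])
  oob_count[out_bag] <- oob_count[out_bag] + 1  # track the denominator
}

oob_pred <- oob_sum / oob_count  # aggregate out-of-bag prediction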

Why out-of-bag predictions matter

They are almost as important as the random forest itself

  1. Unbiased assessment of external prediction accuracy.

  2. The basis for computing permutation variable importance.

  3. A necessity for consistency of causal random forests.

As a bonus, assessing out-of-bag prediction accuracy is also much faster than cross-validation.