Making Leo Breiman’s Masterpiece Accessible and Interpretable with aorsf
2025-02-19
Oblique random forests are good at prediction, and they are excellent tools for spectral data (defined later).
aorsf provides a unified, simple, and fast interface for oblique random forests.
Available online:
Google “Byron Jaeger talk”
Background
Supervised learning
Decision trees and random forests
Oblique random forests
What is oblique?
aorsf statement of need
aorsf demo
A learner is a recipe for a prediction model
A learner is not the same thing as a prediction model
A recipe is not the same thing as food.
This distinction is important for cross-validation (defined soon)
This technique allows you to objectively compare learners
Hold some data out as a testing set
Apply each learner to the remaining data (training set)
Predict the outcome using each model (one per learner)
Evaluate prediction accuracy
Repeat with different held out data
Compare average prediction accuracy by learner (a toy sketch follows below)
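To make these steps concrete, here is a toy sketch in base R with simulated data and two simple learners of my own choosing (not the talk's code):

```r
library(splines)

# simulated data: a nonlinear signal plus noise
set.seed(1)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)

# each learner is a recipe (a fitting function), not a fitted model
learners <- list(
  line   = function(d) lm(y ~ x, data = d),
  spline = function(d) lm(y ~ ns(x, df = 5), data = d)
)

folds <- sample(rep(1:5, length.out = nrow(dat)))  # assign each row to a fold

cv_error <- sapply(learners, function(fit_fun) {
  fold_mse <- sapply(1:5, function(k) {
    model <- fit_fun(dat[folds != k, ])                   # train on remaining data
    preds <- predict(model, newdata = dat[folds == k, ])  # predict held-out rows
    mean((dat$y[folds == k] - preds)^2)                   # prediction error, fold k
  })
  mean(fold_mse)  # average prediction error across folds
})

cv_error  # compare learners by average testing error
```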
Learner | Training error | Testing error |
---|---|---|
Line | 0.41 | 0.34 |
Spline | 0.28 | 0.22 |
Loose spline | 0.27 | 0.35 |
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, a member of the Long Term Ecological Research Network.
Decision trees grow by recursively splitting data.
Splits should create groups with different outcomes.
Splitting continues until stopping criteria are met (example below).
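For example, a single tree grown on the penguins data used in the figures might look like this (the package and stopping settings here are my own illustration, not necessarily the talk's):

```r
library(rpart)
library(palmerpenguins)

fit_tree <- rpart(
  species ~ bill_length_mm + flipper_length_mm,
  data = na.omit(penguins),
  control = rpart.control(
    minsplit = 20,  # stop: do not split nodes with fewer than 20 observations
    cp = 0.01       # stop: require each split to improve fit by at least cp
  )
)

fit_tree  # printed output lists the recursive splits
```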
The same splits, visualized as a tree
Pros
Simple and intuitive visualization.
Captures conditional relationships.
Cons
Difficulty with linear relationships.
Overfits when trees grow too deep.
Defn: an ensemble of de-correlated decision trees
Each tree on its own is fairly weak at prediction.
However, the aggregate prediction is usually very good.
Why? Consider this example
How are they de-correlated?
Random subset of (bootstrapped) data for each tree.
Random subset of predictors considered for each split (illustrated below).
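With ranger, for instance, both randomization knobs are exposed directly (the settings below are illustrative):

```r
library(ranger)
library(palmerpenguins)

fit_rf <- ranger(
  species ~ bill_length_mm + flipper_length_mm,
  data = na.omit(penguins),
  num.trees = 500,
  replace = TRUE,  # each tree sees a bootstrapped sample of the rows
  mtry = 1         # number of predictors considered at each split
)
```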
Predictions from a single randomized tree
Predictions from ensemble of 5 randomized trees
Predictions from ensemble of 100 randomized trees
Predictions from ensemble of 500 randomized trees
Predictions from a single oblique tree
Predictions from an oblique random forest
Are oblique random forests better than their axis-based counterparts? For prediction, the answer is usually yes.
This result has been replicated in multiple studies.
Evidence on consistency of oblique trees is emerging.
For computational efficiency, the answer is no.
# packages needed to reproduce the benchmark
library(dplyr)            # mutate()
library(tidyr)            # drop_na()
library(palmerpenguins)   # penguins data
library(microbenchmark)
library(ranger)           # axis-based random forest
library(randomForestSRC)  # axis-based random forest (rfsrc)
library(aorsf)            # oblique random forest (orsf)
library(ODRF)             # oblique random forest (ODRF)

data_bench <- as.data.frame(
  mutate(drop_na(penguins), species = factor(species))
)

bench <- microbenchmark(
  axis_ranger   = ranger(formula = species ~ bill_length_mm + flipper_length_mm,
                         data = data_bench),
  axis_rfsrc    = rfsrc(formula = species ~ bill_length_mm + flipper_length_mm,
                        data = data_bench),
  oblique_aorsf = orsf(formula = species ~ bill_length_mm + flipper_length_mm,
                       data = data_bench),
  oblique_odrf  = ODRF(formula = species ~ bill_length_mm + flipper_length_mm,
                       data = data_bench),
  times = 10
)
For computational efficiency, the answer is no.
Unit: relative
expr min lq mean median uq max neval cld
axis_ranger 1.00 1.00 1.00 1.00 1.000 1.00 10 a
axis_rfsrc 1.78 1.77 1.22 1.75 0.732 1.08 10 a
oblique_aorsf 1.57 1.96 1.43 1.93 0.920 1.39 10 a
oblique_odrf 554.00 530.00 291.00 463.00 174.000 134.00 10 b
30-day downloads from CRAN:
axis-based packages:
ranger: 42,012
randomForestSRC: 5,388
oblique packages:
aorsf: 1,470
ODRF: 264
aorsf
Oblique random forests are under-utilized, partly due to their high computational cost, and existing software focuses on specific implementations, limiting scope.
aorsf is unifying oblique random forest software: it runs at speeds comparable to ranger and is compatible with tidymodels and mlr3.
Our approach for linear combinations of predictors:
Fit a regression model to data in the current tree node
Instead of iterating until convergence, stop after one iteration.
Use the beta coefficients from that model as the coefficients for the linear combination of predictors (a minimal sketch follows).
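A minimal sketch of that idea in plain R, assuming a binary outcome; glm.fit() with maxit = 1 stands in for the single Newton-Raphson iteration (this illustrates the approach, not aorsf's internal C++ routine):

```r
# One iteration of logistic regression on the data in a node; the resulting
# coefficients define the oblique linear combination used to split the node.
one_iteration_coefs <- function(x, y) {  # x: numeric matrix, y: 0/1 outcome
  fit <- suppressWarnings(               # a "did not converge" warning is expected
    glm.fit(
      x = cbind(intercept = 1, x),
      y = y,
      family = binomial(),
      control = glm.control(maxit = 1)   # stop after one iteration
    )
  )
  fit$coefficients[-1]                   # drop intercept; keep predictor weights
}

# The node is then split on z = x %*% one_iteration_coefs(x, y),
# e.g., at a cut-point of z chosen to separate the outcomes.
```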
Spectral data include continuous, correlated predictors.
modeldata::meats
protein | x_001 | x_002 | … | x_100 |
---|---|---|---|---|
16.7 | 2.61776 | 2.61814 | … | 2.81920 |
13.5 | 2.83454 | 2.83871 | … | 3.17942 |
20.5 | 2.58284 | 2.58458 | … | 2.54816 |
20.7 | 2.82286 | 2.82460 | … | 2.79622 |
Description: Data are recorded on a Tecator Infratec Food and Feed Analyzer working in the wavelength range 850 - 1050 nm by the Near Infrared Transmission principle. Each sample contains finely chopped pure meat with different moisture, fat and protein contents.
Details: For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry.
Make train & test sets:
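A sketch of the split, assuming the meats data from the modeldata package shown above (the seed and test proportion are my choices):

```r
library(modeldata)

data("meats", package = "modeldata")

set.seed(329)
test_rows <- sample(nrow(meats), size = round(0.25 * nrow(meats)))

meats_test  <- meats[test_rows, ]
meats_train <- meats[-test_rows, ]
```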
Evaluate \(R^2\) of predictions (higher is better):
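For example, fitting an oblique random forest for protein with aorsf and scoring it on the test set (a sketch; the exact learners and metric code used in the talk may differ, and the formula protein ~ . is an assumption consistent with fat appearing in the importance output below):

```r
library(aorsf)

# all remaining columns (water, fat, and the 100 spectral channels) are predictors
fit_aorsf <- orsf(data = meats_train, formula = protein ~ .)

pred_test <- predict(fit_aorsf, new_data = meats_test)

# R^2 = 1 - residual sum of squares / total sum of squares
rsq <- 1 - sum((meats_test$protein - pred_test)^2) /
  sum((meats_test$protein - mean(meats_test$protein))^2)

rsq
```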
aorsf supports variable importance
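For example, reusing the fit_aorsf object from above:

```r
# variable importance, sorted from most to least important
orsf_vi(fit_aorsf)[1:5]
```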
And multivariable-adjusted summaries for each predictor:
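A summary like the one below can be produced with orsf_summarize_uni(); the exact arguments used in the talk are my assumption:

```r
# predicted values across observed values of the top-ranked predictor
orsf_summarize_uni(fit_aorsf, n_variables = 1)
```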
-- fat (VI Rank: 1) ------------------------
|--------- Expected value ---------|
Value Mean Median 25th % 75th %
<char> <num> <num> <num> <num>
5.90 18.29500 19.33016 16.37537 20.19645
7.28 18.27809 19.32573 16.36435 20.17653
15.7 18.12412 19.20343 16.31901 19.95582
29.2 17.88790 18.98651 15.98738 19.70517
37.9 17.75646 18.84186 15.91031 19.60750
Predicted expected value for top 1 predictors
aorsf can also look for pairwise interactions (details here)
top_preds <- names(orsf_vi(fit_aorsf)[1:10])
# warning: this can get very computationally expensive.
# use subsets of <= 10 predictors to keep it efficient.
vint <- orsf_vint(fit_aorsf, predictors = top_preds)
vint[1:10, ]
interaction score pd_values
<char> <num> <list>
1: x_029..x_014 0.010221001 <data.table[25x9]>
2: x_030..x_035 0.010012501 <data.table[25x9]>
3: x_029..x_034 0.008032039 <data.table[25x9]>
4: x_028..x_014 0.008020588 <data.table[25x9]>
5: x_034..x_026 0.007513260 <data.table[25x9]>
6: x_034..x_035 0.007083606 <data.table[25x9]>
7: x_029..x_035 0.006955590 <data.table[25x9]>
8: x_030..x_014 0.006674831 <data.table[25x9]>
9: x_027..x_014 0.006570676 <data.table[25x9]>
10: x_034..x_030 0.006246572 <data.table[25x9]>
and sometimes it finds them.
Sanity check the result using regression
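One way to run that check (a sketch; the talk's exact model and output formatting may differ) is a spline-by-spline interaction model followed by an F-test:

```r
library(splines)

# test the x_029 by x_014 interaction
anova(lm(protein ~ bs(x_029) * bs(x_014), data = meats_train))

# test the x_030 by x_035 interaction
anova(lm(protein ~ bs(x_030) * bs(x_035), data = meats_train))
```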
# A tibble: 4 × 2
term p.value
<chr> <chr>
1 bs(x_029) <.001
2 bs(x_014) <.001
3 bs(x_029):bs(x_014) <.001
4 Residuals --
# A tibble: 4 × 2
term p.value
<chr> <chr>
1 bs(x_030) <.001
2 bs(x_035) <.001
3 bs(x_030):bs(x_035) .20
4 Residuals --
Here is a much larger spectral data example
Example: modeldatatoo::data_chimiometrie_2019()
# A tibble: 6,915 × 7
soy_oil wvlgth_001 wvlgth_002 wvlgth_003 wvlgth_004 ... wvlgth_550
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 2.1 0.208 0.207 0.207 0.207 ... 0.592
2 2.1 0.206 0.206 0.206 0.206 ... 0.600
3 2.1 0.207 0.207 0.207 0.206 ... 0.602
4 0.5 0.206 0.206 0.205 0.205 ... 0.608
5 0.5 0.201 0.200 0.200 0.200 ... 0.601
6 0.5 0.206 0.205 0.205 0.205 ... 0.613
7 0.5 0.202 0.202 0.201 0.201 ... 0.614
8 0.5 0.205 0.205 0.204 0.204 ... 0.585
9 0.5 0.205 0.205 0.204 0.204 ... 0.594
10 0.5 0.207 0.207 0.207 0.207 ... 0.589
# ℹ 6,905 more rows
Now we’ll fit 6 learners in addition to aorsf and use nested cross-validation to tune each approach, including:
considering 16 data pre-processing approaches (details here)
implementing tuning for each learner (details here)
All code available here
Even with fully developed tuning pipelines, it is difficult to beat oblique random forests in spectral data.
Oblique random forests are good at prediction, and they are excellent tools for spectral data.
aorsf provides a unified, simple, and fast interface for oblique random forests.
Learn more here
Subsetting data by tree allows for out-of-bag prediction.
About 2/3 of the data are in-bag for each tree.
The out-of-bag remainder is external to the tree.
Each observation’s denominator (the number of trees for which it was out-of-bag) is tracked.
Repeat until all trees are grown (a toy sketch follows below).
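A toy sketch of that bookkeeping, with rpart standing in for the oblique trees and a numeric outcome assumed (this illustrates the idea, not aorsf's implementation):

```r
library(rpart)

oob_predict <- function(dat, outcome = "y", n_trees = 100) {
  n <- nrow(dat)
  oob_sum <- numeric(n)  # running sum of out-of-bag predictions
  oob_den <- numeric(n)  # denominator: trees for which a row was out-of-bag

  for (b in seq_len(n_trees)) {
    in_bag <- sample(n, size = n, replace = TRUE)  # bootstrap rows (~2/3 unique)
    oob    <- setdiff(seq_len(n), in_bag)          # remaining ~1/3 of rows

    tree <- rpart(reformulate(".", response = outcome), data = dat[in_bag, ])

    oob_sum[oob] <- oob_sum[oob] + predict(tree, newdata = dat[oob, ])
    oob_den[oob] <- oob_den[oob] + 1
  }

  oob_sum / pmax(oob_den, 1)  # out-of-bag prediction for each observation
}
```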
Out-of-bag predictions are almost as important as the random forest itself
Unbiased assessment of external prediction accuracy.
The basis for computing permutation variable importance.
A necessity for consistency of causal random forests.
As a bonus, assessing out-of-bag prediction accuracy is also much faster than cross-validation.