
Merge with MemoryConstrainedTreeBoosting.jl? #91

@brianhempel


Maybe we can merge projects?

EvoTrees:

  • More loss functions supported
  • Cleaner code
  • Working on integrating with Julia ecosystem

MemoryConstrainedTreeBoosting:

  • 4-5x faster on CPU
  • Early stopping

MemoryConstrainedTreeBoosting.jl is a library I've been working on for a couple of years. It lets me control the loading and binning of data, so I can (a) do feature engineering in Julia and (b) use all the memory on my machine for the binned data. I've also spent a lot of time on speed, because for my data sets 10% faster means training finishes roughly a day sooner. I think it is quite fast. I didn't bother documenting the library until today, however.
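To illustrate the binning idea (this is only a sketch of equal-frequency binning, not MemoryConstrainedTreeBoosting's actual implementation), each Float32 feature column can be mapped to a `UInt8` bin index, so the binned data costs 1 byte per value instead of 4:

```julia
using Statistics

# Hypothetical sketch: bin a feature column into `nbins` equal-frequency bins.
function bin_feature(col::AbstractVector{<:Real}, nbins::Integer = 64)
    # Split points at equally spaced interior quantiles (nbins - 1 of them).
    splits = quantile(col, range(0, 1; length = nbins + 1))[2:end-1]
    # searchsortedfirst yields a 1-based bin index in 1:nbins.
    return UInt8[searchsortedfirst(splits, x) for x in col]
end

binned = bin_feature(rand(Float32, 1000), 64)
```

With 64 bins the tree learner only needs to scan 64 histogram buckets per feature rather than every raw value.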

With our powers combined...!

A benchmark is below, run with 4 threads on my 2013 quad-core i7-4960HQ.


 pkg> add https://github.com/brianhempel/MemoryConstrainedTreeBoosting.jl

using Statistics
using StatsBase: sample
using Revise
using EvoTrees

nrounds = 200

# EvoTrees params
params_evo = EvoTreeRegressor(T=Float32,
        loss=:logistic, metric=:logloss,
        nrounds=nrounds,
        λ=0.5, γ=0.0, η=0.05,
        max_depth=6, min_weight=1.0,
        rowsample=1.0, colsample=0.5, nbins=64)

# MemoryConstrainedTreeBoosting params
params_mctb = (
        weights                 = nothing,
        bin_count               = 64,
        iteration_count         = nrounds,
        min_data_weight_in_leaf = 1.0,
        l2_regularization       = 0.5,
        max_leaves              = 32,
        max_depth               = 6,
        max_delta_score         = 1.0e10, # Before shrinkage.
        learning_rate           = 0.05,
        feature_fraction        = 0.5, # Per tree.
        bagging_temperature     = 0.0,
      )
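For reference, here is how I read the rough correspondence between the two parameter sets above (my own mapping, not an official one; the benchmark uses the same values on both sides):

```julia
# Hypothetical mapping from EvoTrees keyword names to
# MemoryConstrainedTreeBoosting keyword names, as used in this benchmark.
param_map = Dict(
    :nrounds    => :iteration_count,          # 200
    :λ          => :l2_regularization,        # 0.5
    :η          => :learning_rate,            # 0.05
    :max_depth  => :max_depth,                # 6
    :min_weight => :min_data_weight_in_leaf,  # 1.0
    :colsample  => :feature_fraction,         # 0.5, per tree
    :nbins      => :bin_count,                # 64
)
```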

nobs = Int(1e6)
num_feat = Int(100)
@info "testing with: $nobs observations | $num_feat features."
X = rand(Float32, nobs, num_feat)
Y = Float32.(rand(Bool, size(X, 1)))


@info "evotrees train CPU:"
params_evo.device = "cpu"
@time m_evo = fit_evotree(params_evo, X, Y);
@time fit_evotree(params_evo, X, Y);
@info "evotrees predict CPU:"
@time pred_evo = EvoTrees.predict(m_evo, X);
@time EvoTrees.predict(m_evo, X);


import MemoryConstrainedTreeBoosting

@info "MemoryConstrainedTreeBoosting train CPU:"
@time bin_splits, trees = MemoryConstrainedTreeBoosting.train(X, Y; params_mctb...);
@time MemoryConstrainedTreeBoosting.train(X, Y; params_mctb...);
@info "MemoryConstrainedTreeBoosting predict CPU, JITed:"
save_path = tempname()
MemoryConstrainedTreeBoosting.save(save_path, bin_splits, trees)
unbinned_predict = MemoryConstrainedTreeBoosting.load_unbinned_predictor(save_path)
@time pred_mctb = unbinned_predict(X)
@time unbinned_predict(X)

Run and output:

$ JULIA_NUM_THREADS=4 julia --project=. experiments/benchmarks_v2.jl
[ Info: testing with: 1000000 observations | 100 features.
[ Info: evotrees train CPU:
 98.929771 seconds (64.89 M allocations: 21.928 GiB, 2.12% gc time)
 83.160324 seconds (187.35 k allocations: 18.400 GiB, 1.69% gc time)
[ Info: evotrees predict CPU:
  2.458015 seconds (4.50 M allocations: 246.320 MiB, 38.75% compilation time)
  1.598223 seconds (4.59 k allocations: 4.142 MiB)
[ Info: MemoryConstrainedTreeBoosting train CPU:
  20.320708 seconds (16.04 M allocations: 2.480 GiB, 1.48% gc time, 0.01% compilation time)
  15.954224 seconds (3.10 M allocations: 1.714 GiB, 2.66% gc time)
[ Info: MemoryConstrainedTreeBoosting predict CPU, JITed:
 14.364365 seconds (11.80 M allocations: 692.582 MiB, 25.95% compilation time)
  0.778851 seconds (40 allocations: 30.520 MiB)
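Back-of-the-envelope, comparing the warm (second) timings above:

```julia
# Speedups of MemoryConstrainedTreeBoosting over EvoTrees, warm runs only.
train_speedup   = 83.160324 / 15.954224  # ≈ 5.2x faster training
predict_speedup = 1.598223 / 0.778851    # ≈ 2.1x faster unbinned prediction
```

Allocations tell a similar story: 18.4 GiB vs 1.7 GiB during training.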
