CatBoost: Open-source gradient boosting library (catboost.ai)
43 points by tosh on March 6, 2024 | 13 comments


I think me trying to understand AI terms that others seem to take for granted is giving me a good idea of what it feels like to my family when I try to talk to them about web development.

"Gradient boosting on decision trees?" Man, that's crazy. Hey, did you catch the game last night?


Believe it or not, this stuff is easy mode once you learn to ignore the terminology. It takes relatively little knowledge to get data into the right format, set sensible training parameters and check results -- basically a handful of function calls. The trained models are also highly interpretable: you can literally plot them with Graphviz to understand how the model is treating values of each feature ("column").
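
To make that concrete, here is a minimal sketch of the "handful of function calls" workflow using CatBoost's Python API. The toy data and settings are illustrative, not from any real project, and plotting needs the graphviz package installed:

    from catboost import CatBoostClassifier, Pool

    # Toy data: two numeric columns and one categorical column.
    X = [[1.0, 3.2, "red"], [0.5, 1.1, "blue"], [2.3, 0.7, "red"],
         [1.8, 2.9, "blue"], [0.9, 1.5, "red"], [2.1, 0.4, "blue"]]
    y = [1, 0, 1, 0, 1, 0]

    train_pool = Pool(X, y, cat_features=[2])  # declare which column is categorical
    model = CatBoostClassifier(iterations=50, depth=2, learning_rate=0.1, verbose=False)
    model.fit(train_pool)

    print(model.get_feature_importance(train_pool))       # per-feature importance
    graph = model.plot_tree(tree_idx=0, pool=train_pool)   # graphviz.Digraph of tree 0
    graph.render("catboost_tree0", format="png")           # write the tree diagram to disk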


It's so easy to understand if every step is dumbed down and visualized. Teachers who write dense math on whiteboards need to learn a thing or two about pedagogy. The math should supplement the pictures, not the other way around, unless your audience is other experts.


Easy to understand but hard to master. Any slip that mixes training data into your validation or test data and your metrics are useless. That kind of leakage can be difficult or impossible to spot, and it gives newbies false confidence in their model.
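
The fix is mostly about ordering: split before you fit anything, including preprocessing. A minimal sketch with made-up data (the same principle applies to target encoding, feature selection, etc.):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = rng.integers(0, 2, size=1000)

    # Leaky: the scaler is fit on every row, so validation statistics bleed into training.
    X_leaky = StandardScaler().fit_transform(X)
    X_tr_bad, X_val_bad, _, _ = train_test_split(X_leaky, y, test_size=0.2, random_state=0)

    # Safe: split first, fit preprocessing on the training rows only, then apply to both.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)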


The same is true pretty much everywhere else (cryptography, sport, electrical engineering, aeronautics, ...). There is no greater teacher than a head injury or an electrical fire; the question is whether OP's fear is warranted or whether it should be encouraged. Perhaps they should not set the bit rate on a video codec until they completely understand the fundamentals of that technology either? I'd suggest that, as in many similar areas, gatekeeping is a not insubstantial force.


A year ago, I did a review of the GBT algorithms and settled on LightGBM for predictive performance and CPU time (not GPU). The benchmark differences are small enough that I'm reluctant to change, but I'd love to hear feedback.
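
Roughly, the kind of CPU-only comparison I mean: same data, same tree budget, wall-clock fit time plus a hold-out metric. (All settings below are illustrative placeholders, not the actual benchmark.)

    import time
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier
    from catboost import CatBoostClassifier

    X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    candidates = {
        "xgboost": XGBClassifier(n_estimators=300, tree_method="hist", n_jobs=4),
        "lightgbm": LGBMClassifier(n_estimators=300, n_jobs=4),
        "catboost": CatBoostClassifier(iterations=300, thread_count=4, verbose=False),
    }

    for name, model in candidates.items():
        start = time.perf_counter()
        model.fit(X_tr, y_tr)
        elapsed = time.perf_counter() - start
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        print(f"{name}: fit {elapsed:.1f}s, AUC {auc:.4f}")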

Has anyone ever created a meta-ensemble model of several GBT algorithms?


I worked with catboost, lightgbm, and xgboost when doing the Zillow Kaggle competition a long time ago, 2016 I think. I used what is called blending, where you give each model a weight, such as:

    0.5 * xgboost + 0.25 * catboost + 0.25 * lightgbm
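
In code, the blend is just a weighted sum of each model's hold-out predictions. Rough sketch with synthetic data (the weights and model settings are illustrative; in practice you tune the weights on a hold-out set):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor
    from lightgbm import LGBMRegressor
    from catboost import CatBoostRegressor

    X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    models = {
        "xgboost": XGBRegressor(n_estimators=200, random_state=0),
        "catboost": CatBoostRegressor(iterations=200, verbose=False, random_seed=0),
        "lightgbm": LGBMRegressor(n_estimators=200, random_state=0),
    }
    weights = {"xgboost": 0.5, "catboost": 0.25, "lightgbm": 0.25}

    # Fit each model on the training split, predict on the hold-out split, then blend.
    preds = {name: m.fit(X_tr, y_tr).predict(X_val) for name, m in models.items()}
    blend = sum(weights[name] * preds[name] for name in models)
    rmse = np.sqrt(np.mean((blend - y_val) ** 2))
    print(f"blended RMSE: {rmse:.4f}")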


Interesting - did this work better than a single model? In general, do meta-ensembles work better? My sense was that just xgboost was the main winner of kaggle competitions.


If you're clever about how you blend models, you can pretty much ensure that the performance of the (weighted) averaged model is strictly better than the (weighted) average performance of the individual models.

And increasing the space of possible models pretty much guarantees the performance will improve, provided you can still find a good enough optimum.
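
For a convex loss like squared error this follows from Jensen's inequality: the loss of the weighted-average prediction is never worse than the weighted average of the individual losses, and strictly better whenever the models disagree. Quick numeric check with synthetic predictions:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(size=1000)
    preds = [y + rng.normal(scale=s, size=1000) for s in (0.5, 0.8, 1.2)]  # three fake models
    w = np.array([0.5, 0.25, 0.25])

    blend = sum(wi * p for wi, p in zip(w, preds))
    mse_blend = np.mean((blend - y) ** 2)                              # MSE of the blended prediction
    mse_avg = sum(wi * np.mean((p - y) ** 2) for wi, p in zip(w, preds))  # weighted average of MSEs
    assert mse_blend <= mse_avg + 1e-12
    print(mse_blend, mse_avg)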


XGBoost, LightGBM, and CatBoost are all used quite frequently in competitions. LightGBM is actually marginally more popular than the other two now, but it's pretty close. In the M5 forecasting competition a few years back, many of the top solutions primarily used LightGBM.


Meta-ensembles are the new trend for tabular data on Kaggle.
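
One common form is stacking: train several GBT base models, then fit a simple meta-learner on their out-of-fold predictions. A rough sketch using scikit-learn's StackingRegressor (all settings here are illustrative, not from any particular winning solution):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import StackingRegressor
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBRegressor
    from lightgbm import LGBMRegressor
    from catboost import CatBoostRegressor

    X, y = make_regression(n_samples=5000, n_features=30, noise=0.1, random_state=0)

    stack = StackingRegressor(
        estimators=[
            ("xgb", XGBRegressor(n_estimators=200, random_state=0)),
            ("lgbm", LGBMRegressor(n_estimators=200, random_state=0)),
            ("cat", CatBoostRegressor(iterations=200, verbose=False, random_seed=0)),
        ],
        final_estimator=RidgeCV(),  # meta-learner fit on out-of-fold base predictions
        cv=5,
    )
    print(cross_val_score(stack, X, y, cv=3, scoring="neg_root_mean_squared_error"))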



Related:

Yandex open sources CatBoost, a gradient boosting ML library - https://news.ycombinator.com/item?id=14795673 - July 2017 (28 comments)



