XGBoost: A Fast and Accurate Boosting Trees Model
Posted by Tong He
Updated: Oct 15, 2015
The Author: Tong He is a data scientist at SupStat Inc. and a master's student at Simon Fraser University. His current research interests include machine learning, data mining, and bioinformatics.
In data analysis work, we usually build models to make predictions on the data. Among the choices in R, randomForest, gbm and glmnet are three exceptionally popular packages, since they appear in almost all the data mining competitions on Kaggle. In my personal experience, gbm costs less memory and time than randomForest, and users indeed prefer it. In Python's sklearn library, we also have the GradientBoostingClassifier module.
A boosting classifier belongs to the family of ensemble models. The basic idea is to aggregate hundreds of less accurate tree-based models into one very accurate model, usually by generating a new tree-based model at each iteration. People have proposed various ways to build a reasonable base model. Friedman's Gradient Boosting Machine incorporates gradient descent to build each tree so that it decreases the objective along the direction of the gradient. In practice we need to generate thousands of trees to get an excellent result on a relatively large data set, but the existing implementations of the algorithm are not fast enough, so we may need to wait a long time for the result.
Now we have XGBoost to solve this problem. XGBoost is short for "eXtreme Gradient Boosting". It is a gradient boosting implementation in C++, and its author is Tianqi Chen, a Ph.D. student at the University of Washington. He felt limited by the efficiency of the existing boosting libraries, so he started the project in early 2014, and the tool took shape over the summer of 2014. Its algorithm improves on the vanilla gradient boosting model, and it automatically runs in parallel on a multi-threaded CPU. XGBoost made its debut in the Higgs boson signal competition on Kaggle and became popular afterwards. Nowadays many competition winners use XGBoost in their models.
To make the tool accessible to more users, Tianqi developed its Python interface and I developed the R interface, which is on CRAN now. The following sections give a general overview of the R interface; I suggest readers first get a basic idea of XGBoost's features and then learn the exact interface from the documentation.
1. Basic functions
First we can install the package from CRAN:
install.packages('xgboost')
To get the latest development version, we can install it from GitHub:
devtools::install_github('dmlc/xgboost',subdir='R-package')
Time to code! Run the following code to load the sample:
require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
This data set asks us to judge whether a mushroom is poisonous based on its attributes. Each attribute is encoded as 1 if present and 0 if absent, so the data is naturally stored as a sparse matrix.
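If you are curious, you can inspect the loaded objects. A quick sketch, assuming the sample data loaded above:
class(train$data)    # a dgCMatrix sparse matrix from the Matrix package
dim(train$data)      # number of mushrooms by number of attributes
table(train$label)   # counts of the two classes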
Don't worry about this, because XGBoost accepts both dense and sparse matrices as input. Here comes the training command:
> bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1,
+ nround = 2, objective = "binary:logistic")
[0] train-error:0.046522
[1] train-error:0.022263
We have iterated twice, and the training error at each round is printed. If the data is too large to load into R, users can set data = 'path_to_file' to read it directly from disk. Currently XGBoost supports local data files in the libsvm format.
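For example, a minimal sketch of training directly from a local libsvm-format file ('train.libsvm' is a hypothetical path):
bst <- xgboost(data = 'train.libsvm', max.depth = 2, eta = 1,
               nround = 2, objective = "binary:logistic")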
It takes only one line to make predictions:
pred <- predict(bst, test$data)
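The predictions are probabilities, so a small sketch of turning them into class labels and computing the test error (the 0.5 cutoff is an assumption for this binary task) could be:
err <- mean(as.numeric(pred > 0.5) != test$label)  # fraction of misclassified mushrooms
print(paste("test-error =", err))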
It is also very convenient to do cross-validation, since the xgb.cv function only asks for one additional parameter, nfold, compared with xgboost.
> cv.res <- xgb.cv(data = train$data, label = train$label, max.depth = 2,
+ eta = 1, nround = 2, objective = "binary:logistic",
+ nfold = 5)
[0] train-error:0.046522+0.001102 test-error:0.046523+0.004410
[1] train-error:0.022264+0.000864 test-error:0.022266+0.003450
> cv.res
train.error.mean train.error.std test.error.mean test.error.std
1: 0.046522 0.001102 0.046523 0.004410
2: 0.022264 0.000864 0.022266 0.003450
The return value is a data.table containing the error measurements on the training and test folds, so one can easily track the best number of rounds.
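For instance, a quick sketch of picking the round with the lowest mean test error from the table above (best_round is a hypothetical variable name):
best_round <- which.min(cv.res$test.error.mean)  # round index with the smallest test error
best_round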
2. Fast and accurate
The above code is a very brief introduction and the data is too small to show the power of XGBoost. XGBoost is fast for the following reasons:
- XGBoost uses OpenMP, which can parallelize the code on a multithreaded CPU automatically.
- XGBoost defines a data structure, DMatrix, to store the data matrix. This data structure performs some preprocessing on the data so that later iterations are faster. A sketch of using it explicitly follows this list.
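A minimal sketch of constructing a DMatrix explicitly and setting the number of threads (nthread = 4 is an arbitrary choice; by default XGBoost uses all available threads):
dtrain <- xgb.DMatrix(data = train$data, label = train$label)  # preprocess once, reuse later
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
               nthread = 4, objective = "binary:logistic")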
We tried our best to keep all the parameters the same and ran the following experiment:
Model and Parameter | gbm | XGBoost (1 thread) | XGBoost (2 threads) | XGBoost (4 threads) | XGBoost (8 threads)
---|---|---|---|---|---
Time (in secs) | 761.48 | 450.22 | 102.41 | 44.18 | 34.04
The CPU for this experiment is an i7-4700MQ. Python's sklearn has similar efficiency to gbm. You can try to reproduce the result by downloading the data and running the code here.
Besides the significantly boosted speed, XGBoost also achieves high accuracy in competitions. At the beginning of the Higgs boson competition, people were surprised to find that gbm in R and Python could not beat the official benchmark, while XGBoost came out and made it into the top 10 at that time. The main reason for the improvement in accuracy is the newly defined regularization term and the pruning approach, which make the learned model more stable. For more details please check the official documentation.
3. Advanced features
Besides speed and accuracy, XGBoost has a lot of other useful features. The following list contains some of them. Readers can click the demo link for the corresponding sample code.
- As long as you can calculate the first and second derivatives of the loss function, you can customize the training objective in XGBoost (see the sketch after this list). demo
- Users are allowed to define the evaluation metric in cross-validation, for example RMSE or RMSLE for regression and error rate, AUC or F1-score for classification, or even the unusual AMS metric from the Higgs boson competition. demo
- The cross-validation function can return the predictions on each test fold, which makes it easier for users to build ensemble models. demo
- Users can iterate 1,000 times first, check the model's strength, and then keep training for another 1,000 iterations on top of the previous result. demo
- The model can output the id of the leaf each data sample falls into. This is one part of the model described in a Facebook paper. demo
- The model can calculate the feature importance and plot the trees. demo
- Users can boost regularized linear models instead of trees. demo
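As an example of the first item, here is a minimal sketch of a customized objective (a hand-written logistic loss) on the agaricus data from above; logregobj is a hypothetical name:
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))   # raw scores to probabilities
  grad <- preds - labels           # first derivative of the logistic loss
  hess <- preds * (1 - preds)      # second derivative of the logistic loss
  list(grad = grad, hess = hess)
}
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bst <- xgb.train(params = list(max.depth = 2, eta = 1), data = dtrain,
                 nrounds = 2, obj = logregobj)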
These features enable users to apply the tool in a variety of scenarios. Actually, many of them come from user requests.
4. Learning Sources
The information in this article is limited. We have provided several scripts to help you understand the tool better:
- The folder for all the sample scripts
- The script for the Higgs boson competition
- The script for the Otto competition
If you are interested in a deeper understanding of the algorithm or the tool, you may find the following links useful:
- The slides from Tianqi Chen.
- The official documentation of XGBoost, especially the section on the details of the model.
- Our paper on the model of XGBoost.