XGBoost is a popular algorithm for working with structured and semi-structured data, and it is widely used in industry by companies such as Alibaba. A highly versatile tool, it handles most regression and classification problems as well as user-defined objective functions.

In this blog, you will get an introduction to XGBoost, learn about its key features, see a simplified view of its algorithm, and meet some of the companies that use it in their products.

XGBoost stands for eXtreme Gradient Boosting. It is open-source software, easily accessible through a variety of platforms and interfaces, and it has been used to win numerous Kaggle competitions.
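
To make this concrete, here is a minimal sketch of training an XGBoost classifier in Python with its scikit-learn wrapper. The dataset is synthetic and the hyperparameters are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Generate a toy binary-classification dataset (stand-in for real data).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a gradient-boosted tree ensemble with illustrative settings.
model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)

# Evaluate on the held-out split.
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.3f}")
```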

Where XGBoost differentiates itself from other gradient boosting machines (GBMs) is in systems-level optimization and enhancements to the algorithm. Some of its distinctive features include (a parameter sketch follows the list):

  • Parallelized construction of decision trees.
  • Distributed computing methods for evaluating vast and complex models.
  • Using out-of-core computing to analyze massive datasets.
  • Implementing cache optimization to make the best use of resources.
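
Several of these optimizations are exposed as ordinary parameters. Below is a brief sketch of the knobs that control parallelism and the histogram-based tree builder; the specific values are illustrative.

```python
from xgboost import XGBRegressor

# tree_method="hist" selects the fast histogram-based tree builder,
# and n_jobs controls how many threads build trees in parallel.
model = XGBRegressor(
    tree_method="hist",  # histogram-based split finding (fast, memory-friendly)
    n_jobs=-1,           # use all available CPU cores
    n_estimators=200,
    max_depth=6,
)
# Fitting and prediction then work exactly as in the earlier example.
```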

Beyond these engineering features, what distinguishes XGBoost from most other models is its boosting algorithm. So how does this algorithm work, in simplified form?

Step 1: A base learner (typically a shallow decision tree) is fit to the training data and produces an initial set of predictions.

Step 2: The errors of the current ensemble are measured. The next base learner is trained to correct these errors: in gradient boosting, it is fit to the gradient of the loss (for squared error, simply the residuals), and its damped predictions are added to the ensemble. Thus, the model learns from its previous set of mistakes and readjusts itself.

Step 3: Step 2 is repeated until a fixed number of boosting rounds is reached or the error stops improving.
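
To see step 2 in miniature, here is a from-scratch sketch of gradient boosting for squared-error regression, where each new tree is fit to the residuals of the ensemble so far. It uses small scikit-learn trees for brevity and is only a conceptual illustration, not how XGBoost is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=50, learning_rate=0.1):
    """Toy gradient boosting for squared-error loss."""
    # Step 1: start from a constant prediction (the mean of y).
    prediction = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_rounds):
        # Step 2: for squared error, the negative gradient is the residual.
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residuals)
        # Add a damped correction from the new tree to the ensemble.
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    # Step 3: stop after a fixed number of rounds.
    return trees

# Example usage on toy data: y = x^2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)
trees = gradient_boost(X, y)
```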

This state-of-the-art algorithm, combined with the unique features above, gives the model the following benefits:

  1. Speed and performance: Since the core algorithm is parallelizable, XGBoost automatically performs parallel computation on Windows and Linux. It is generally around ten times faster than the classical gradient boosting machine, with comparable or better accuracy.
  2. A plethora of applications: It can be used for regression, classification, ranking, and user-defined objective functions (see the custom-objective sketch after this list).
  3. Portability: Easily accessible, and runs on different platforms and operating systems such as Windows, Linux, and OS X.
  4. Languages: Supports most popular programming languages, including C++, Python, R, Java, Scala, and Julia.
  5. Cloud integration: Supports AWS and Azure, and works well with Flink, Spark, and other ecosystems.
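
As an example of point 2, XGBoost accepts a user-defined objective as a function that returns the gradient and Hessian of the loss. Below is a sketch that re-implements squared error as a custom objective; the function name `squared_error` and the toy data are illustrative.

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

def squared_error(preds, dtrain):
    """Custom objective: gradient and Hessian of 0.5 * (pred - label)^2."""
    labels = dtrain.get_label()
    grad = preds - labels        # first derivative w.r.t. the prediction
    hess = np.ones_like(preds)   # second derivative is constant
    return grad, hess

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

# Pass the custom objective via the `obj` argument of xgb.train.
booster = xgb.train(
    {"max_depth": 3, "eta": 0.1}, dtrain, num_boost_round=50, obj=squared_error
)
```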

Despite being built on a simple concept, this model is considered sophisticated and is one of the most popular choices among teams competing in Kaggle competitions. It is also used by various companies around the world in their products. Here are some of them:

  • XGBoost Distributed is used in the ODPS Cloud Service by Alibaba.
  • XGBoost is incorporated as part of GraphLab Create for scalable machine learning.
  • Tencent's data platform team reported that they use distributed XGBoost for click-through prediction in WeChat shopping and look-alike modeling.
  • The autohome.com ad platform team revealed that their click-through rate improved greatly thanks to the "awesome XGBoost."