Boruta Feature Selection
Purpose
Boruta is designed to determine which variables (features) are relevant for predicting the outcome in a given dataset. It is particularly useful when dealing with high-dimensional data.
How It Works
- Random Forest Base: It starts by fitting a Random Forest model to the original data.
- Shadow Features: For each feature in the dataset, it creates a duplicate (shadow) feature by shuffling its values. This helps establish a baseline for what “random” looks like.
- Importance Scores: The algorithm calculates importance scores for both original and shadow features based on how well they help in predicting the outcome.
- Comparison: Each original feature is then compared to the best-performing shadow feature. If an original feature has a higher importance score, it is considered significant.
- Decision Process: Boruta iteratively tests features, deciding whether to confirm them as important, reject them, or keep them under consideration based on their importance scores across multiple iterations; a minimal sketch of this loop follows the list.
- Final Output: The result is a clear classification of each feature into three categories: confirmed (important), rejected (not important), or tentative (needs further assessment).
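To make the loop concrete, here is a minimal, illustrative sketch of the shadow-feature comparison using scikit-learn's RandomForestClassifier. The function name boruta_hits and the parameters n_iter and random_state are our own, not part of any Boruta library; a full implementation would additionally apply a statistical test (e.g., a binomial test on the hit counts) to confirm or reject features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_hits(X, y, n_iter=20, random_state=0):
    """Count how often each feature outperforms the best shadow feature."""
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)

    for i in range(n_iter):
        # Shadow features: every original column with its values shuffled,
        # establishing a baseline for what "random" importance looks like.
        shadows = rng.permuted(X, axis=0)
        X_ext = np.hstack([X, shadows])

        # Fit a Random Forest on originals + shadows and read importances.
        rf = RandomForestClassifier(n_estimators=100, random_state=i)
        rf.fit(X_ext, y)
        importances = rf.feature_importances_

        # A feature scores a "hit" if it beats the best shadow feature.
        best_shadow = importances[n_features:].max()
        hits += (importances[:n_features] > best_shadow).astype(int)

    return hits
```

Over the n_iter rounds, features whose hit counts are significantly higher than chance would be confirmed, those significantly lower rejected, and the remainder left tentative.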
Advantages
- Identifies all relevant features, not just the best-performing subset, which can be crucial in some analyses.
- Reduces the likelihood of overfitting by providing a more comprehensive feature selection process.
Implementation
Boruta is available in several programming environments, with popular implementations in R (the Boruta package) and Python (BorutaPy).
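As a sketch of the Python route, the following uses the boruta package (BorutaPy); the toy data is purely illustrative, and note that BorutaPy expects NumPy arrays rather than pandas DataFrames.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# Toy data: only the first two of ten features carry signal (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators='auto', random_state=42)
selector.fit(X, y)

print(selector.support_)       # boolean mask of confirmed features
print(selector.support_weak_)  # boolean mask of tentative features
print(selector.ranking_)       # 1 = confirmed, 2 = tentative, >2 = rejected
```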