Commit 62b3efa

Merge remote-tracking branch 'origin/master'
* origin/master:
  [MNT] add automatic test workflow (scikit-learn-contrib#146)
  add error message when user passes decision trees (scikit-learn-contrib#141)
  migrate from unittest to pytest (scikit-learn-contrib#140)
  Bump pypa/gh-action-pypi-publish in /.github/workflows (scikit-learn-contrib#144)
  evaluate features only after 5th iteration (scikit-learn-contrib#137)
  add gitignore (scikit-learn-contrib#139)
  fix typo in docstring (scikit-learn-contrib#138)
2 parents af0a12d + 0135a91 commit 62b3efa

8 files changed: +264 -69 lines changed

.github/workflows/publish-to-pypi.yml

Lines changed: 1 addition & 1 deletion
@@ -38,4 +38,4 @@ jobs:
     - name: Build a binary wheel
       run: python setup.py sdist bdist_wheel
     - name: Publish distribution 📦 to PyPI
-      uses: pypa/gh-action-pypi-publish@v1.9.0
+      uses: pypa/gh-action-pypi-publish@v1.13.0

.github/workflows/test_package.yml

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+name: Test boruta
+
+on:
+  push:
+    branches: [ "master" ]
+  pull_request:
+    branches: [ "master" ]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+
+    strategy:
+      matrix:
+        include:
+          # Regular Python versions (no special package versions)
+          - python-version: "3.10"
+          - python-version: "3.12"
+          - python-version: "3.13"
+
+          # Python 3.11 with different scikit-learn versions
+          - python-version: "3.11"
+            sklearn-version: "1.5.2"
+          - python-version: "3.11"
+            sklearn-version: "1.6.1"
+          - python-version: "3.11"
+            sklearn-version: "1.7.0"
+
+          # Python 3.11 with different NumPy versions
+          - python-version: "3.11"
+            numpy-version: "1.26.4"
+          - python-version: "3.11"
+            numpy-version: "2.0.1"
+          - python-version: "3.11"
+            numpy-version: "2.1.1"
+          - python-version: "3.11"
+            numpy-version: "2.2.2"
+          - python-version: "3.11"
+            numpy-version: "2.3.1"
+
+    name: >-
+      Python ${{ matrix.python-version }}
+      ${{ matrix.sklearn-version && format('(scikit-learn {0})', matrix.sklearn-version) || '' }}
+      ${{ matrix.numpy-version && format('(NumPy {0})', matrix.numpy-version) || '' }}
+
+    steps:
+      - uses: actions/checkout@v5
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Display Python version
+        run: python -c "import sys; print(sys.version)"
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt
+          pip install -r test_requirements.txt
+
+          # Install specific scikit-learn version if defined
+          if [ -n "${{ matrix.sklearn-version }}" ]; then
+            echo "Installing scikit-learn==${{ matrix.sklearn-version }}"
+            pip install scikit-learn==${{ matrix.sklearn-version }}
+          fi
+
+          # Install specific NumPy version if defined
+          if [ -n "${{ matrix.numpy-version }}" ]; then
+            echo "Installing numpy==${{ matrix.numpy-version }}"
+            pip install numpy==${{ matrix.numpy-version }}
+          fi
+
+      - name: Test with pytest
+        run: |
+          pip install pytest
+          pytest
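
Note on the matrix in test_package.yml: because the matrix defines only `include` entries, each entry becomes its own job, which gives 11 jobs in total. The sketch below restates in plain Python how the optional `sklearn-version` and `numpy-version` keys turn into extra pip pins in the "Install dependencies" step. `MATRIX` and `pip_pins` are illustrative names and are not part of the repository.

```python
# Illustrative only: mirrors the `include` entries of test_package.yml.
MATRIX = [
    {"python-version": "3.10"},
    {"python-version": "3.12"},
    {"python-version": "3.13"},
    {"python-version": "3.11", "sklearn-version": "1.5.2"},
    {"python-version": "3.11", "sklearn-version": "1.6.1"},
    {"python-version": "3.11", "sklearn-version": "1.7.0"},
    {"python-version": "3.11", "numpy-version": "1.26.4"},
    {"python-version": "3.11", "numpy-version": "2.0.1"},
    {"python-version": "3.11", "numpy-version": "2.1.1"},
    {"python-version": "3.11", "numpy-version": "2.2.2"},
    {"python-version": "3.11", "numpy-version": "2.3.1"},
]


def pip_pins(entry):
    """Return the extra version pins a matrix entry would install."""
    pins = []
    # Same checks as the `if [ -n ... ]` shell guards in the workflow.
    if entry.get("sklearn-version"):
        pins.append(f"scikit-learn=={entry['sklearn-version']}")
    if entry.get("numpy-version"):
        pins.append(f"numpy=={entry['numpy-version']}")
    return pins


for entry in MATRIX:
    print(entry["python-version"], pip_pins(entry))
```

Entries without a pinned version fall back to whatever `requirements.txt` resolves to, so the three plain Python jobs exercise the latest compatible releases.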

.gitignore

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# pyenv
+.python-version
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+
+# Miscelaneous
+.idea
+.vscode
+*.DS_Store
+*.db
+*.pptx

boruta/boruta_py.py

Lines changed: 24 additions & 11 deletions
@@ -45,13 +45,13 @@ class BorutaPy(BaseEstimator, SelectorMixin):
         crucial parameter. For more info, please read about the perc parameter.
     - Automatic tree number:
         Setting the n_estimator to 'auto' will calculate the number of trees
-        in each itartion based on the number of features under investigation.
+        in each iteration based on the number of features under investigation.
         This way more trees are used when the training data has many features
         and less when most of the features have been rejected.
     - Ranking of features:
         After fitting BorutaPy it provides the user with ranking of features.
         Confirmed ones are 1, Tentatives are 2, and the rejected are ranked
-        starting from 3, based on their feautre importance history through
+        starting from 3, based on their feature importance history through
         the iterations.
 
     We highly recommend using pruned trees with a depth between 3-7.
@@ -140,7 +140,7 @@ class BorutaPy(BaseEstimator, SelectorMixin):
     support_weak_ : array of shape [n_features]
 
         The mask of selected tentative features, which haven't gained enough
-        support during the max_iter number of iterations..
+        support during the max_iter number of iterations.
 
     ranking_ : array of shape [n_features]
 
@@ -328,7 +328,7 @@ def _fit(self, X, y):
 
         # set n_estimators
         if self.n_estimators != 'auto':
-            self.estimator.set_params(n_estimators=self.n_estimators)
+            self._set_n_estimators(self.n_estimators)
 
         # main feature selection loop
         while np.any(dec_reg == 0) and _iter < self.max_iter:
@@ -337,7 +337,7 @@ def _fit(self, X, y):
                 # number of features that aren't rejected
                 not_rejected = np.where(dec_reg >= 0)[0].shape[0]
                 n_tree = self._get_tree_num(not_rejected)
-                self.estimator.set_params(n_estimators=n_tree)
+                self._set_n_estimators(n_estimators=n_tree)
 
             # make sure we start with a new tree in each iteration
             if self._is_lightgbm:
@@ -358,13 +358,15 @@ def _fit(self, X, y):
             # register which feature is more imp than the max of shadows
             hit_reg = self._assign_hits(hit_reg, cur_imp, imp_sha_max)
 
-            # based on hit_reg we check if a feature is doing better than
-            # expected by chance
-            dec_reg = self._do_tests(dec_reg, hit_reg, _iter)
+            # Only test after the 5th round.
+            if _iter > 4:
+                # based on hit_reg we check if a feature is doing better than
+                # expected by chance
+                dec_reg = self._do_tests(dec_reg, hit_reg, _iter)
 
-            # print out confirmed features
-            if self.verbose > 0 and _iter < self.max_iter:
-                self._print_results(dec_reg, _iter, 0)
+                # print out confirmed features
+                if self.verbose > 0 and _iter < self.max_iter:
+                    self._print_results(dec_reg, _iter, 0)
             if _iter < self.max_iter:
                 _iter += 1
 
@@ -454,6 +456,17 @@ def _transform(self, X, weak=False, return_df=False):
             X = X[:, indices]
         return X
 
+    def _set_n_estimators(self, n_estimators):
+        try:
+            self.estimator.set_params(n_estimators=n_estimators)
+        except ValueError:
+            raise ValueError(
+                f"The estimator {self.estimator} does not take the parameter "
+                "n_estimators. Use Random Forests or gradient boosting machines "
+                "instead."
+            )
+        return self
+
     def _get_support_mask(self):
         check_is_fitted(self, 'support_')
         return self.support_

boruta/test/test_boruta.py

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+import numpy as np
+import pandas as pd
+import pytest
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
+
+from boruta import BorutaPy
+
+
+@pytest.mark.parametrize("tree_n,expected", [(10, 44), (100, 141)])
+def test_get_tree_num(tree_n, expected):
+    rfc = RandomForestClassifier(max_depth=10)
+    bt = BorutaPy(rfc)
+    assert bt._get_tree_num(tree_n) == expected
+
+
+@pytest.fixture(scope="module")
+def Xy():
+    np.random.seed(42)
+    y = np.random.binomial(1, 0.5, 1000)
+    X = np.zeros((1000, 10))
+
+    z = (y - np.random.binomial(1, 0.1, 1000) +
+         np.random.binomial(1, 0.1, 1000))
+    z[z == -1] = 0
+    z[z == 2] = 1
+
+    # 5 relevant features
+    X[:, 0] = z
+    X[:, 1] = (y * np.abs(np.random.normal(0, 1, 1000)) +
+               np.random.normal(0, 0.1, 1000))
+    X[:, 2] = y + np.random.normal(0, 1, 1000)
+    X[:, 3] = y**2 + np.random.normal(0, 1, 1000)
+    X[:, 4] = np.sqrt(y) + np.random.binomial(2, 0.1, 1000)
+
+    # 5 irrelevant features
+    X[:, 5] = np.random.normal(0, 1, 1000)
+    X[:, 6] = np.random.poisson(1, 1000)
+    X[:, 7] = np.random.binomial(1, 0.3, 1000)
+    X[:, 8] = np.random.normal(0, 1, 1000)
+    X[:, 9] = np.random.poisson(1, 1000)
+
+    return X, y
+
+
+def test_if_boruta_extracts_relevant_features(Xy):
+    X, y = Xy
+    rfc = RandomForestClassifier()
+    bt = BorutaPy(rfc)
+    bt.fit(X, y)
+    assert list(range(5)) == list(np.where(bt.support_)[0])
+
+
+def test_if_it_works_with_dataframe_input(Xy):
+    X, y = Xy
+    X_df, y_df = pd.DataFrame(X), pd.Series(y)
+    bt = BorutaPy(RandomForestClassifier())
+    bt.fit(X_df, y_df)
+    assert list(range(5)) == list(np.where(bt.support_)[0])
+
+
+def test_dataframe_is_returned(Xy):
+    X, y = Xy
+    X_df, y_df = pd.DataFrame(X), pd.Series(y)
+    rfc = RandomForestClassifier()
+    bt = BorutaPy(rfc)
+    bt.fit(X_df, y_df)
+    assert isinstance(bt.transform(X_df, return_df=True), pd.DataFrame)
+
+
+@pytest.mark.parametrize("tree", [ExtraTreeClassifier(), DecisionTreeClassifier()])
+def test_boruta_with_decision_trees(tree, Xy):
+    msg = (
+        f"The estimator {tree} does not take the parameter "
+        "n_estimators. Use Random Forests or gradient boosting machines "
+        "instead."
+    )
+    X, y = Xy
+    bt = BorutaPy(tree)
+    with pytest.raises(ValueError) as record:
+        bt.fit(X, y)
+
+    assert str(record.value) == msg
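
Note on `test_boruta_with_decision_trees`: scikit-learn's `set_params` raises `ValueError` for a parameter the estimator does not define, and `_set_n_estimators` catches that to re-raise with a message that names the fix. The sketch below reproduces the wrapping pattern without scikit-learn. `SingleTreeStandIn` and the module-level `set_n_estimators` are hypothetical stand-ins written for this illustration. Chaining with `from err` is an addition here; the committed code raises without it.

```python
class SingleTreeStandIn:
    """Hypothetical estimator that, like a single tree, has no n_estimators."""

    _valid_params = {"max_depth"}

    def set_params(self, **params):
        # Reject unknown parameters, as scikit-learn estimators do.
        for name in params:
            if name not in self._valid_params:
                raise ValueError(f"Invalid parameter {name!r}")
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def __repr__(self):
        return "SingleTreeStandIn()"


def set_n_estimators(estimator, n_estimators):
    """Set n_estimators, translating the generic error into a hint."""
    try:
        estimator.set_params(n_estimators=n_estimators)
    except ValueError as err:
        raise ValueError(
            f"The estimator {estimator} does not take the parameter "
            "n_estimators. Use Random Forests or gradient boosting machines "
            "instead."
        ) from err
    return estimator


try:
    set_n_estimators(SingleTreeStandIn(), 100)
except ValueError as err:
    message = str(err)
    print(message)
```

Because the message embeds the estimator's repr, the test above can compare `str(record.value)` against an f-string built from the same parametrized `tree` object.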

boruta/test/unit_tests.py

Lines changed: 0 additions & 57 deletions
This file was deleted.

requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+numpy>=1.26.4
+pandas>=2.2.0
+scikit-learn>=1.5.2

test_requirements.txt

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+-r requirements.txt
+pytest>=5.4.1
+
+# repo maintenance tooling
+black>=21.5b1
+flake8>=3.9.2
+isort>=5.8.0
