Scaredy_brushwagg Asks: From logistic regression to XGBoost - selecting features to run the model with

I have been asked to look at XGBoost (as implemented in R, and with a maximum of around 50 features) as an alternative to an existing logistic regression model, not developed by me, built from a very large set of credit risk data containing a few thousand predictors.

The documentation surrounding the logistic regression is very well prepared, so track has been kept of the reason for excluding each variable:

- automated data audit (through an internal tool) - e.g. an excessive number of missing values, extremely low variance, etc.
- lack of monotonic trend - for u-shaped variables, after attempts at coarse classing
- high correlation (>70%) - on the raw level or after binning
- low GINI / Information Value - on the raw level or after binning
- low representativeness - assessed through the population stability index (PSI)

A huge number of the variables are derived (including aggregates like the min / max / avg of the standard deviation of other predictors), and some have been deemed too synthetic for inclusion. We have decided not to use those in XGBoost either.

The regression was initially run with 44 predictors (the output of a stepwise procedure), whereas the final approved model includes only 10.

Because I am rather new to XGBoost, I was wondering whether the feature selection process differs substantially from what has already been done in preparation for the logistic regression, and what some rules / good practices would be.

Based on what I have been reading, perfect correlation and missing values are both handled automatically in XGBoost.

I suspect monotonicity of trend should not be a concern (since the focus, unlike in regression, is on non-linear relations), hence binning is likely out; otherwise I am a bit unsure about the handling of u-shaped variables.

Since GINI is used to decide on the best split in decision trees under the CART ("Classification and Regression Trees") approach, maybe this is one criterion worth keeping.

I have been entertaining the idea of running our internal automated data audit tool, removing the std aggregates (too synthetic, as per above), removing low-GINI and low-PSI variables, potentially treating very high (95%+) correlation, and then applying lasso / elastic net and taking it from there.

I am aware that Boruta is relevant here but as of now still have no solid opinion on it.
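On the missing-value point: XGBoost learns a default branch direction for missing values at every split, so no imputation step is needed before training. Below is a minimal sketch on made-up data (the Python API is used here for brevity; the question concerns the R package, but the native missing-value handling is the same in both interfaces):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
# Inject ~20% missing values; no imputation is performed anywhere below.
X[rng.random(X.shape) < 0.2] = np.nan
y = (np.nan_to_num(X[:, 0]) + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# NaNs are passed through as-is: at each split, XGBoost learns which
# branch observations with a missing value should follow.
clf = xgb.XGBClassifier(n_estimators=200, max_depth=3)
clf.fit(X_tr, y_tr)
print((clf.predict(X_te) == y_te).mean())
```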
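On u-shaped variables: trees split on raw values, so a non-monotone relation needs no coarse classing. A small sketch, again on synthetic data, showing a boosted model recovering a u-shaped signal without any binning:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(4000, 1))
# U-shaped target: event probability is high at both extremes, low in the middle.
y = (x[:, 0] ** 2 + rng.normal(scale=0.5, size=4000) > 4).astype(int)

clf = xgb.XGBClassifier(n_estimators=100, max_depth=3)
clf.fit(x, y)

# Predicted probabilities trace the u-shape across a grid of raw values.
grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(np.round(clf.predict_proba(grid)[:, 1], 2))
```

Conversely, if a regulator does require a monotonic trend for particular variables, XGBoost's `monotone_constraints` parameter can enforce an increasing or decreasing relationship per feature.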
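If the GINI screen is kept as a pre-filter, it can be computed per raw predictor as 2*AUC - 1. A hypothetical helper (the `univariate_gini` name and the 0.05 cut-off are illustrative, not an established rule; it assumes numeric predictors and a binary target):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def univariate_gini(df: pd.DataFrame, target: pd.Series) -> pd.Series:
    """Gini (= 2*AUC - 1) of each raw predictor against the target.
    Rows with a missing value are dropped column by column."""
    scores = {}
    for col in df.columns:
        mask = df[col].notna()
        auc = roc_auc_score(target[mask], df[col][mask])
        scores[col] = abs(2 * auc - 1)  # direction-agnostic
    return pd.Series(scores).sort_values(ascending=False)

# Hypothetical usage on a predictor frame X_df and binary target y:
# kept = univariate_gini(X_df, y).loc[lambda s: s > 0.05].index
```

One caveat that ties back to the u-shape question: a univariate Gini on the raw level will under-rate u-shaped variables (their AUC sits near 0.5), so a screen like this can discard exactly the non-monotone predictors that trees handle well.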
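Finally, the proposed pipeline (treating 95%+ correlation, then applying lasso / elastic net) might look roughly like the sketch below; `drop_high_corr` and the synthetic frame are illustrative stand-ins, with an L1-penalised logistic regression as the final screen whose surviving coefficients define the candidate set handed to XGBoost:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def drop_high_corr(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Greedily drop one column from every pair whose absolute
    pairwise correlation exceeds `threshold`."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Synthetic stand-in for the cleaned credit-risk frame.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(2000, 20)),
                 columns=[f"var_{i}" for i in range(20)])
X["var_dup"] = X["var_0"] * 0.99 + rng.normal(scale=0.05, size=2000)  # near-duplicate
y = (X["var_0"] + X["var_1"] + rng.normal(size=2000) > 0).astype(int)

X_red = drop_high_corr(X)  # removes the near-duplicate column

# Lasso screen: features with non-zero coefficients survive.
lasso = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5, max_iter=5000),
)
lasso.fit(X_red, y)
selected = X_red.columns[lasso[-1].coef_.ravel() != 0]
print(list(selected))
```

Switching `penalty="l1"` to `penalty="elasticnet"` (with `l1_ratios` supplied) gives the elastic-net variant mentioned in the question.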