Feature selection in machine learning
I need to clarify the following. Scikit-learn provides RFE to determine feature importance. If my dataset contains categorical data and RFE is applied directly to it without one-hot encoding the categorical columns, will the result from RFE be correct? For example, if I have a categorical column named year, will 1990 be considered less important than 2013? Likewise, for a column like city name, label-encoded numbers do not convey any importance for a city. Or is there a specific way to determine feature importance for a dataset with categorical features?
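A minimal sketch (not from the question; the column names, toy data, and estimator are assumptions) of one-hot encoding the categorical columns first, so that RFE ranks indicator columns rather than arbitrary label-encoded integers:

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "year": ["1990", "2013", "2013", "1990"],   # treated as a category, not a number
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "target": [0, 1, 1, 0],
})
X = pd.get_dummies(df[["year", "city"]])         # one-hot encode the categorical columns
y = df["target"]

rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2)
rfe.fit(X, y)
print(dict(zip(X.columns, rfe.ranking_)))        # ranking per indicator column
```

Any estimator exposing coef_ or feature_importances_ (for example a RandomForestClassifier) can be dropped in as the RFE estimator instead.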
See also questions close to this topic
Having trouble with custom loss function for YOLO using keras or tensorflow
I am trying to define a custom loss function for YOLO to detect the presence of a single-class object and locate its centre (a kind of landmark detection), as learned from Andrew Ng. Each of the 49 grid cells outputs a vector of depth 3 (the full output tensor is 7*7*3). The first channel indicates the probability of an object in that grid cell, and the other two predict the coordinates of the centre of my object for landmark detection. I have lately been trying to get my head around tensor calculus to avoid a faulty loss function, but I am having trouble improving accuracy.
I am simply subtracting all three channels of y_true from y_pred, but multiplying the results from the 2nd and 3rd channels by the 1st-channel matrix of y_true, because we don't want to account for coordinates predicted in the 2nd and 3rd channels if the 1st channel does not predict the presence of an object in the first place.
import tensorflow as tf
from keras import backend as K

def yol_loss(y_true, y_pred):
    # Mask of ones with the same shape as the objectness channel
    a = tf.ones_like(y_true[:, :, :, 0])
    # Objectness error plus coordinate errors, the coordinate terms masked by the true objectness channel
    loss = (K.mean(K.square(tf.multiply(y_true[:, :, :, 0], a) - tf.multiply(y_pred[:, :, :, 0], a)), axis=(1, 2))
            + 64 * K.mean(K.square(tf.multiply(y_true[:, :, :, 0], y_true[:, :, :, 1]) - tf.multiply(y_true[:, :, :, 0], y_pred[:, :, :, 1])), axis=(1, 2))
            + 64 * K.mean(K.square(tf.multiply(y_true[:, :, :, 0], y_true[:, :, :, 2]) - tf.multiply(y_true[:, :, :, 0], y_pred[:, :, :, 2])), axis=(1, 2)))
    return loss
Imgaug Module Not Found
I am trying to run a model that uses the imgaug package. I have a line of code that goes

from imgaug import augmenters as iaa

I believe I have successfully installed imgaug, because in Command Prompt, when I enter the line

pip install imgaug

I get the following message:
Requirement already satisfied: imgaug in c:\python36\lib\site-packages (0.2.6)
Requirement already satisfied: scipy in c:\python36\lib\site-packages (from imgaug) (1.1.0)
Requirement already satisfied: scikit-image>=0.11.0 in c:\python36\lib\site-packages (from imgaug) (0.14.0)
Requirement already satisfied: numpy>=1.7.0 in c:\python36\lib\site-packages (from imgaug) (1.15.1)
Requirement already satisfied: six in c:\users\ashok\appdata\roaming\python\python36\site-packages (from imgaug) (1.11.0)
Requirement already satisfied: dask[array]>=0.9.0 in c:\python36\lib\site-packages (from scikit-image>=0.11.0->imgaug) (0.19.2)
Requirement already satisfied: networkx>=1.8 in c:\python36\lib\site-packages (from scikit-image>=0.11.0->imgaug) (2.2)
Requirement already satisfied: cloudpickle>=0.2.1 in c:\python36\lib\site-packages (from scikit-image>=0.11.0->imgaug) (0.5.6)
Requirement already satisfied: pillow>=4.3.0 in c:\python36\lib\site-packages (from scikit-image>=0.11.0->imgaug) (5.2.0)
Requirement already satisfied: PyWavelets>=0.4.0 in c:\python36\lib\site-packages (from scikit-image>=0.11.0->imgaug) (1.0.0)
Requirement already satisfied: toolz>=0.7.3; extra == "array" in c:\python36\lib\site-packages (from dask[array]>=0.9.0->scikit-image>=0.11.0->imgaug) (0.9.0)
Requirement already satisfied: decorator>=4.3.0 in c:\python36\lib\site-packages (from networkx>=1.8->scikit-image>=0.11.0->imgaug) (4.3.0)
however when I run the code in Jupyter Notebook I get the error message
ModuleNotFoundError: No module named 'imgaug'.
What is the reason for this error? Do I have to install or run imgaug within a particular environment? I am using Windows so sudo pip install will not work. Thanks very much for your attention and help.
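A hedged guess (environment details are assumptions): pip may be installing into a different Python than the one backing the Jupyter kernel. Checking which interpreter the notebook actually runs, and installing into that same interpreter from a notebook cell, usually resolves the mismatch.

```python
import sys
print(sys.executable)  # run in a notebook cell: shows which Python the kernel uses

# Install into the kernel's own interpreter from a notebook cell:
# !{sys.executable} -m pip install imgaug
```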
LSTM Training Input Versus Live Evaluation Input - Dynamic RNN?
I am having trouble wrapping my head around RNNs for this problem.
The problem: live binary classification of video using image sequences. Meaning I am receiving a video one image at a time and need to predict either Class A or Class B for the most recent image received.
Training - I use a CNN as a feature extractor on a full sequence of images. I then feed the resulting features, shaped (lstm-len, cnn-feature-size), into the LSTM.
Live Evaluation - I receive 1 frame at a time and run it through the CNN. I add the new features to a queue of length lstm-len, then I take all the features from the queue and feed them into the LSTM.
What I don't understand
Why is it that I have to keep track of and feed all of the features into the LSTM at evaluation time? The point of an LSTM is to remember past inputs so it seems redundant for me to input all the previous images at every time step. What I would like to be able to do is simply calculate the features for the most recent images and then feed those new features into the LSTM while the LSTM remembers the last lstm-len number of frames.
Am I using the RNN incorrectly in this case? Should I be able to simply use the previous LSTM state as input into the other LSTM cells and provide feature input for only the newest image?
I'm thinking something like TensorFlow's dynamic_rnn may be the solution to this problem.
Pretty confused about this. Thanks for the help!
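A minimal sketch (not from the question) of one way to get the desired behaviour in Keras: a stateful LSTM keeps its internal state between predict calls, so at evaluation time only the newest frame's CNN features need to be fed in. The layer sizes, cnn_feature_size, and the cnn/model names are assumptions.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

cnn_feature_size = 2048  # assumed CNN feature dimension

model = Sequential()
# batch size fixed to 1 and one time step per call; stateful=True carries state across calls
model.add(LSTM(128, stateful=True, batch_input_shape=(1, 1, cnn_feature_size)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')

# Live evaluation: one frame at a time, no queue of past features needed
# features = cnn.predict(frame)                                   # hypothetical CNN feature extractor
# prob = model.predict(features.reshape(1, 1, cnn_feature_size))  # Class A / Class B probability
# model.reset_states()                                            # call when a new video starts
```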
Scikit-learn KerasClassifier evaluation error
I have created a Keras classifier for k-fold validation. Below is the function for building the classifier.
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(activation="relu", input_dim=11, units=6, kernel_initializer="uniform"))
    classifier.add(Dense(activation="relu", units=6, kernel_initializer="uniform"))
    classifier.add(Dense(activation="sigmoid", units=1, kernel_initializer="uniform"))
    classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return classifier
I have initialized the classifier as below.
classfier = KerasClassifier(build_fn = build_classifier, batch_size=10, epochs=100)
I am trying to get the accuracy score from this 10 fold validation.
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
However, it shows me a TypeError:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator does not.
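A hedged guess (not confirmed by the question): the wrapper above is assigned to the name classfier, while cross_val_score is given classifier, so the estimator that reaches scikit-learn may not be the KerasClassifier at all and therefore lacks a score method. A sketch with one consistent name:

```python
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

# One consistent name, so cross_val_score receives the scikit-learn wrapper (which provides score())
classifier = KerasClassifier(build_fn=build_classifier, batch_size=10, epochs=100)
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10, n_jobs=-1)
```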
Order of class in n_support_ (sklearn svm)
In the sklearn SVM SVC documentation I was trying to figure out in what order of classes the n_support_ attribute gives the number of support vectors. I couldn't find it mentioned anywhere. Please, can somebody tell me how I can find that out?
Example: For binary classification of classes -1,+1
In : print(svm_fit.n_support_)
Out: [6388 6383]
Now here I am not sure which class the first value belongs to.
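A small sketch (assuming a fitted sklearn.svm.SVC named svm_fit, as above): the classes_ attribute holds the class labels in the sorted order scikit-learn uses internally, and n_support_ reports the per-class support-vector counts in that same order, so printing both side by side shows which count belongs to which class.

```python
print(svm_fit.classes_)    # e.g. [-1  1]
print(svm_fit.n_support_)  # e.g. [6388 6383] -> 6388 SVs for the first label, 6383 for the second
```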
Python 2.7 cannot import sklearn
I have installed sklearn (version 0.19.2) but cannot import it.
Please see the errors below:
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-61-b7c74cbf5af0> in <module>()
----> 1 import sklearn

c:\python27\lib\site-packages\sklearn\__init__.py in <module>()
    131     # process, as it may not be compiled yet
    132     else:
--> 133     from . import __check_build
    134     from .base import clone
    135     __check_build  # avoid flakes unused variable error

ImportError: cannot import name __check_build
Run Specflow Tests in Parallel sorted by Features
I've been using SpecFlow+ Runner with Selenium for a while now, and I want to run tests in parallel so I can save time.
The problem is, when running in parallel by changing the testThreadCount="2" attribute in the Default.srprofile file, it runs tests from the same feature, which I don't want to happen because it messes up the ExtentReport at the end of the test.
Is there a way to apply some sort of filter so it only runs tests from different feature files in parallel? Also, is there a way to execute the [BeforeTestRun] and [AfterTestRun] hooks just once, instead of once per thread?
Python - Suggestions on using model in production 1 test at a time
I have created an Artificial Neural Network with 4 categorical features and a binary outcome, either 1 for suspicious or 0 for non-suspicious:
                                    ParentPath       ParentExe
0  C:\Program Files (x86)\Wireless AutoSwitch      wrlssw.exe
1  C:\Program Files (x86)\Wireless AutoSwitch  WrlsAutoSW.exs
2  C:\Program Files (x86)\Wireless AutoSwitch  WrlsAutoSW.exs
3  C:\Windows\System32                             svchost.exe
4  C:\Program Files (x86)\Wireless AutoSwitch  WrlsAutoSW.exs

                                     ChildPath             ChildExe  Suspicious
   C:\Windows\System32                                  conhost.exe           0
   C:\Program Files (x86)\Wireless AutoSwitch            wrlssw.exe           0
   C:\Program Files (x86)\Wireless AutoSwitch            wrlssw.exe           0
   C:\Program Files\Common Files                 OfficeC2RClient.exe           0
   C:\Program Files (x86)\Wireless AutoSwitch            wrlssw.exe           1
   C:\Program Files (x86)\Wireless AutoSwitch            wrlssw.exe           0
I have used sklearn for label encoding and one hot encoding on the data:
#Import the dataset
X = DBF2.iloc[:, 0:4].values
#X = DBF2[['ParentProcess', 'ChildProcess']]
y = DBF2.iloc[:, 4].values  #.ravel()

#Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#Label Encode Parent Path
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
#Label Encode Parent Exe
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])
#Label Encode Child Path
labelencoder_X_3 = LabelEncoder()
X[:, 2] = labelencoder_X_3.fit_transform(X[:, 2])
#Label Encode Child Exe
labelencoder_X_4 = LabelEncoder()
X[:, 3] = labelencoder_X_4.fit_transform(X[:, 3])

#Create dummy variables
onehotencoder = OneHotEncoder(categorical_features=[0, 1, 2, 3])
X = onehotencoder.fit_transform(X)
I have split the data into a training and test set and run it on my GPU box with an NVIDIA 1080. I have tuned the hyperparameters and am now ready to use the trained model in a production environment, with one test sample being evaluated at a time. Let's say I just want to test one sample:
             ParentPath     ParentExe          ChildPath  ChildExe
0  C:\Windows\Malicious  badscipt.exe  C:\Windows\System   cmd.exe
The issue I am running into is that the training set has seen the ChildPath "C:\Windows\System" and the ChildExe "cmd.exe", which are normal, but it has not seen the ParentPath "C:\Windows\Malicious" or the ParentExe "badscipt.exe", so these have not been label encoded or one-hot encoded. My big question is: how do I handle a test sample when part of it has not been seen during training?
I have seen examples using feature hashing, but I'm not sure how to apply that or whether it would even solve this problem. Any help or pointers would be greatly appreciated.
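A minimal sketch (an assumption on my part, requiring scikit-learn >= 0.20, where OneHotEncoder accepts string columns directly): handle_unknown='ignore' makes the encoder map categories never seen during fit to an all-zero row instead of raising, which sidesteps the unseen ParentPath/ParentExe problem. X_train_raw and model are hypothetical names.

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)
encoder.fit(X_train_raw)  # X_train_raw: the four raw string columns used for training (hypothetical name)

# Unseen ParentPath/ParentExe values simply encode to all zeros instead of raising an error
x_new = [["C:\\Windows\\Malicious", "badscipt.exe", "C:\\Windows\\System", "cmd.exe"]]
prediction = model.predict(encoder.transform(x_new))  # model: the trained ANN (hypothetical name)
```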
How to best use zipcodes in Random Forest model training?
I have a dataset with a zipcode column. It has some significance for the output and I want to use it as a feature. I am using a random forest model.
I need suggestions on the best way to use the zipcode column as a feature. (For example, should I get lat/long for each zipcode rather than directly feeding in zipcodes, etc.?)
Thanks in advance !!
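A rough sketch (not from the question) of two simple options for a random forest: frequency-encode the zipcode as a category, or join it to latitude/longitude from an external lookup table. The zip_latlong table is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"zipcode": ["94107", "10001", "94107", "60601"]})

# Option 1: frequency encoding - replace each zipcode by how often it appears
freq = df["zipcode"].value_counts(normalize=True)
df["zipcode_freq"] = df["zipcode"].map(freq)

# Option 2: join a zipcode -> lat/long lookup table (zip_latlong is hypothetical)
# df = df.merge(zip_latlong, on="zipcode", how="left")
```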