Implementation of Isolation Forest in Python
I am new to machine learning and am trying to learn and implement the Isolation Forest algorithm in Python.
My input contains 40 features; the training set has 4000 records and the test set has 1000. Can someone help with sample code and show how to plot the output?
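Since the question asks for sample code, here is a minimal sketch using scikit-learn's IsolationForest. The data is synthetic (random numbers shaped like the question: 4000 train / 1000 test rows, 40 features), and the contamination rate and output file name are assumptions to adapt:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the real data: 40 features, 4000 train / 1000 test rows.
rng = np.random.RandomState(42)
X_train = rng.randn(4000, 40)
X_test = rng.randn(1000, 40)

# contamination is the assumed fraction of outliers; tune it for your data.
clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
clf.fit(X_train)

pred = clf.predict(X_test)               # +1 = inlier, -1 = outlier
scores = clf.decision_function(X_test)   # lower score = more anomalous

# With 40 features there is no direct scatter plot; a histogram of the
# anomaly scores is a simple way to visualise the output.
plt.hist(scores, bins=50)
plt.xlabel('anomaly score (decision_function)')
plt.ylabel('count')
plt.title('Isolation Forest scores on the test set')
plt.savefig('iforest_scores.png')

print('outliers flagged:', int((pred == -1).sum()))
```

To see the flagged points in feature space, a common trick is to project the data to 2-D with PCA and colour the points by `pred`.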
See also questions close to this topic
Detecting face-like patterns using a CNN-based face detector
I have a CNN-based object detector trained on the WIDER Face dataset. It can successfully detect human faces in a given image.
Now I am trying to detect abstract faces and minimalist face patterns in clouds, houses, etc., but I am having no success.
Initially, I thought neural-network-based object detectors would generalize somewhat and that I could simply lower the detection threshold to pick up such patterns, but that didn't work.
Is there any way other than collecting and labeling such training examples (face-like patterns) to solve this problem?
Improving the accuracy of a Random Forest classifier on the Titanic dataset (Kaggle)
How can I improve the prediction accuracy of my Random Forest model? I have already tried parameter tuning through grid search.
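Since grid search alone has been tried, a hedged sketch of two other levers: judge the model by cross-validation rather than training accuracy, and regularise the trees (minimum leaf size, feature subsampling). The data below is synthetic, not the actual Titanic set, and the parameter values are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for Titanic-style tabular data.
rng = np.random.RandomState(0)
X = rng.randn(400, 6)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# More trees plus a minimum leaf size often helps when the forest is
# overfitting; evaluate with cross-validation, not train accuracy.
clf = RandomForestClassifier(n_estimators=300, min_samples_leaf=3,
                             max_features='sqrt', random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print('5-fold accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```

On Titanic specifically, feature engineering (e.g. a family-size feature from SibSp + Parch, or the title extracted from the Name column) usually moves the score more than further hyperparameter tuning.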
H2o error on parsing a file
I am parsing a file that also contains a UUID-typed column. The parse fails with this error:
DistributedException from /127.0.0.1:54321: 'NewChunk has type Numeric, but the Vec is of type UUID', caused by java.lang.AssertionError: NewChunk has type Numeric, but the Vec is of type UUID
Anyone know what this means?
Using mlxtend's StackingClassifier with a Keras classifier neural net
I've used mlxtend's StackingClassifier successfully with most of sklearn's classifiers, but I can't seem to get it to work with the Keras classifier sklearn wrapper. I think it has to do with the way the data is transformed for the neural net (using the values attribute of the dataset), but I can't figure out how to change the data so that it can be used both in the Keras classifier and in the stacking classifier.
Here is my code:
nn_data = training_data.values
nn = prediction_data.drop(['id', 'era', 'data_type'], axis=1)
nn_prediction = nn.values
x = nn_data[:, 3:53]
y = nn_data[:, 53]
clf1 = KerasClassifier(build_fn=nn_model, epochs=9, batch_size=2000, verbose=2)
lr = LR.LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1], meta_classifier=lr, use_probas=True)
sclf.fit(x, y)
and here is my error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-cb767df93cc9> in <module>()
      4 mlp = MLPClassifier()
      5 sclf = StackingClassifier(classifiers=[clf1], meta_classifier=lr, use_probas=True)
----> 6 sclf.fit(x, y)
      7 y_prediction_sclf = sclf.predict_proba(x_pred)
      8 print ('final_model logloss = ' + str(metrics.log_loss(y_pred, y_prediction_sclf)))

/Users/wahabkazi/anaconda/lib/python3.6/site-packages/mlxtend/classifier/stacking_classification.py in fit(self, X, y)
    118
    119         if not self.use_features_in_secondary:
--> 120             self.meta_clf_.fit(meta_features, y)
    121         else:
    122             self.meta_clf_.fit(np.hstack((X, meta_features)), y)

/Users/wahabkazi/anaconda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)
   1215         X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype,
   1216                          order="C")
-> 1217         check_classification_targets(y)
   1218         self.classes_ = np.unique(y)
   1219         n_samples, n_features = X.shape

/Users/wahabkazi/anaconda/lib/python3.6/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171                       'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
    173
    174

ValueError: Unknown label type: 'unknown'
Here is a sample value of x: array([0.49282, 0.58077, 0.48948, 0.56762, 0.56107, 0.51168, 0.47458999999999996, 0.56968, 0.47402, 0.40326, 0.54119, 0.5319699999999999, 0.31899, 0.43153, 0.35538000000000003, 0.6613100000000001, 0.42477, 0.65484, 0.49437, 0.6126699999999999, 0.60285, 0.38813000000000003, 0.49818999999999997, 0.59332, 0.63041, 0.40815, 0.47767, 0.4869, 0.51394, 0.5371600000000001, 0.49223999999999996, 0.44978, 0.49446999999999997, 0.46531999999999996, 0.51057, 0.52177, 0.54243, 0.61623, 0.56988, 0.66293, 0.50138, 0.40333, 0.52337, 0.60795, 0.35748, 0.49677, 0.28295, 0.65342, 0.57915, 0.51136], dtype=object)
Thanks in advance for any help!
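The traceback points at the label check rather than the stacking itself: `y = nn_data[:, 53]`, sliced from a mixed-dtype DataFrame's `.values`, comes out with `dtype=object` (visible in the sample of x above), and scikit-learn's `check_classification_targets` reports object arrays of numbers as 'unknown'. A minimal sketch of the likely fix (the sample labels here are made up) is to cast to a concrete numeric dtype before calling `sclf.fit(x, y)`:

```python
import numpy as np

# Labels sliced from a mixed-dtype DataFrame's .values arrive as dtype=object ...
y = np.array([0.0, 1.0, 1.0, 0.0], dtype=object)

# ... which scikit-learn's label check rejects as 'unknown'. Casting to a
# concrete numeric dtype before fitting resolves it:
y_fixed = y.astype('int64')   # or 'float64' if the labels are continuous

print(y.dtype, y_fixed.dtype)
```

The feature matrix may need the same treatment (`x = x.astype('float64')`) so the Keras wrapper receives a plain float array.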
GridSearchCV with scoring='f1' returns an error
I'm trying to find the classifier parameters that give the highest precision/recall score, but GridSearchCV always outputs this error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
Here's the part of my classifier code:
### Pipeline to improve workflow
### SelectKBest, then classifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

select = SelectKBest(k=5)
clf0 = RandomForestClassifier()
steps = [('feature_selection', select), ('random_forest', clf0)]
pipeline = Pipeline(steps)
pipeline.fit(features_train, labels_train)

prediction = pipeline.predict(features_test)
report = classification_report(labels_test, prediction)  # y_true comes first
print(report)

### Grid search to fine-tune the classifier
from sklearn.grid_search import GridSearchCV
parameters = dict(feature_selection__k=[5, 7, 10, 15, 17, 'all'],
                  random_forest__n_estimators=[10, 20, 30],
                  random_forest__min_samples_split=[2, 3, 4, 5, 10])
cv = GridSearchCV(pipeline, param_grid=parameters, scoring='f1')
cv.fit(features_train, labels_train)
prediction = cv.predict(features_test)
report = classification_report(labels_test, prediction)
print(report)

print("Best score: %0.3f" % cv.best_score_)
print("Best parameters set:")
best_parameters = cv.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Grid search runs fine without scoring='f1'. What can I do to fix this issue?
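The error message says the labels have more than two classes, while scoring='f1' defaults to the binary average. A sketch of the usual fix, shown on made-up 3-class labels, is to request an explicit multiclass average:

```python
from sklearn.metrics import f1_score, make_scorer

# Explicit multiclass average; pass either this scorer object or simply the
# string 'f1_weighted' / 'f1_macro' as GridSearchCV's scoring argument, e.g.
#   GridSearchCV(pipeline, param_grid=parameters, scoring='f1_weighted')
f1_weighted = make_scorer(f1_score, average='weighted')

# Illustration on made-up 3-class labels:
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
print(f1_score(y_true, y_pred, average='weighted'))
```

'weighted' averages the per-class F1 scores weighted by class support; 'macro' averages them unweighted, which penalises poor performance on rare classes more.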
How to use AdaBoost as an ensemble method in R?
I am trying to learn ensemble methods and came across the idea that AdaBoost can be built on top of ordinary machine learning methods such as random forests: the method uses the misclassified examples in the training set to build more accurate classification models.
However, I searched online and couldn't find an implementation.
I am wondering how to build AdaBoost on top of a random forest for a classification problem, to minimize the errors.
Let's just say I have:
a training set (data frame): Train
a test set (data frame): Test
a set of features called Feature
and the outcome column, called Outcome (Train$Outcome).
My normal model would be (assuming the caret package, whose train() takes the predictors first and the outcome second):
mymodel_rf <- train(Train[, Feature], Train[, Outcome], method = "rf", trControl = ...)
How do I move forward from there to build the AdaBoost stage using this model's output?
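To the best of my knowledge, caret does not chain a fitted method = "rf" model into a boosting stage; boosting is selected as its own method (e.g. method = "ada" or "AdaBoost.M1", both assumptions to verify against caret's model list). For illustration, the boosting mechanism itself, reweighting the examples the previous round misclassified, is sketched below in scikit-learn on synthetic data with illustrative parameters:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class problem standing in for Train[, Feature] / Outcome.
rng = np.random.RandomState(1)
X = rng.randn(300, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# AdaBoost reweights the training examples each round so the next base
# learner focuses on previously misclassified cases. Note boosting is
# normally built from weak learners (default: decision stumps), not from
# an already-strong ensemble like a full random forest.
clf = AdaBoostClassifier(n_estimators=100, random_state=1)
scores = cross_val_score(clf, X, y, cv=5)
print('5-fold accuracy: %.3f' % scores.mean())
```

The same reweighting logic is what the R boosting packages (ada, adabag, gbm) implement, so picking one of those as the caret method replaces the "rf" model rather than wrapping it.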