Out-of-sample data classification
A question from an online course:
A difficulty that arises when trying to classify out-of-sample data is that the actual classification may not be known, making it hard to confirm that the result is accurate.
True or False?
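The point can be made concrete with a short sketch (using the iris dataset purely as a stand-in): with held-out labels we can score predictions, but for a genuinely out-of-sample point there is no label to compare against.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = clf.predict(X_test)

# With held-out labels we can score the predictions...
accuracy = np.mean(preds == y_test)

# ...but for a truly out-of-sample point there is no y to compare against:
X_new = np.array([[5.0, 3.4, 1.5, 0.2]])  # hypothetical unlabeled sample
new_pred = clf.predict(X_new)             # a prediction, not a verified answer
```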
See also questions close to this topic

Shape Error in Andrew NG Logistic Regression using Scipy.opt
I've been trying to implement Andrew Ng's logistic regression problem in Python, using scipy.opt to optimize the function. However, I get a ValueError saying I have mismatched dimensions. I've tried to flatten() my theta array, since scipy.opt doesn't seem to work well with single-column/row vectors, but the problem persists.
Kindly point me in the right direction as to what is causing the problem and how to avoid it.
Thanks a million!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt

dataset = pd.read_csv("Students Exam Dataset.txt", names=["Exam 1", "Exam 2", "Admitted"])
print(dataset.head())

positive = dataset[dataset["Admitted"] == 1]
negative = dataset[dataset["Admitted"] == 0]

#Visualizing Dataset
plt.scatter(positive["Exam 1"], positive["Exam 2"], color="blue", marker="o", label="Admitted")
plt.scatter(negative["Exam 1"], negative["Exam 2"], color="red", marker="x", label="Not Admitted")
plt.xlabel("Exam 1 Score")
plt.ylabel("Exam 2 Score")
plt.title("Admission Graph")
plt.legend()
#plt.show()

#Preprocessing Data
dataset.insert(0, "x0", 1)
col = len(dataset.columns)
x = dataset.iloc[:,0:col-1].values
y = dataset.iloc[:,col-1:col].values
b = np.zeros([1,col-1])
m = len(y)
print(f"X Shape: {x.shape} Y Shape: {y.shape} B Shape: {b.shape}")

#Defining Functions
def hypothesis(x, y, b):
    h = 1 / (1+np.exp(-x @ b.T))
    return h

def cost(x, y, b):
    first = (y.T @ np.log(hypothesis(x, y, b)))
    second = (1-y).T @ np.log(1 - hypothesis(x, y, b))
    j = -(1/m) * np.sum(first+second)
    return j

def gradient(x, y, b):
    grad_step = ((hypothesis(x, y, b) - y) @ x.T) / m
    return b

#Output
initial_cost = cost(x, y, b)
print(f"\nInitial Cost = {initial_cost}")
final_cost = opt.fmin_tnc(func=cost, x0=b.flatten(), fprime=gradient, args=(x,y))
print(f"Final Cost = {final_cost} \nTheta = {b}")
Dataset Used: ex2.txt
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
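The mismatch usually comes from the optimizer's calling convention: SciPy passes theta to the cost and gradient functions as a flat 1-D array and as the first argument, while the code above keeps b as a (1, n) matrix and y as a column vector, and its gradient function returns b rather than a gradient of matching shape. A minimal sketch of the flat-theta convention on synthetic data (not the exam dataset), using the modern scipy.optimize.minimize interface with the same TNC algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # theta arrives from the optimizer as a flat 1-D array, first argument
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # must return an array with the same flat shape as theta
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

# synthetic stand-in for the exam data: intercept column + two features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=100) > 0).astype(float)

theta0 = np.zeros(X.shape[1])  # flat, and y above is flat too
res = minimize(cost, theta0, jac=gradient, args=(X, y), method="TNC")
```

With both theta and y flattened, every shape lines up and no dimension error occurs.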

Are there pretrained models to recognize characters?
I extracted the characters from a number plate into individual pictures.
https://i.stack.imgur.com/k09UB.jpg
I wonder whether there is a pretrained CNN model to recognize these characters. If so, please recommend an API that performs this. Thanks a lot.

How to tune hyperparameters of LightGBM using Bayesian optimization with RMSE as the metric
I'm trying to tune the hyperparameters of a LightGBM model with RMSE as the metric.

lgbBO = BayesianOptimization(lgb_eval, {'num_leaves': (24, 45),
                                        'feature_fraction': (0.1, 0.9),
                                        'bagging_fraction': (0.8, 1),
                                        'max_depth': (5, 8.99),
                                        'lambda_l1': (0, 5),
                                        'lambda_l2': (0, 3),
                                        'min_split_gain': (0.001, 0.1),
                                        'min_child_weight': (5, 50)}, random_state=0)

After this, what should I do?

Forcing a column to be the primary split column in RandomForest
I have data for various institutes, such that certain institutes provide us more fields than others. These additional data fields seem to have a high correlation with the binary outcome we are trying to predict, so ignoring them is not an option. Also, we don't want to build institute-specific models.
One of the options we are considering is including the institute value as a feature, with the idea that a single model will use it as the primary splitting feature. Thus, if we imagine a tree-based model, each institute gets its own tree within a single model.
How could we force a feature to be the primary split feature?
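scikit-learn's RandomForestClassifier exposes no option to force a particular feature into the root split, but the effect described, each institute getting its own trees inside one model, can be emulated by fitting one forest per institute behind a single wrapper. A sketch (the PerInstituteForest class and the synthetic data are hypothetical, not a library API):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class PerInstituteForest:
    """One RandomForest per institute value -- behaviourally equivalent to
    forcing the institute column to be the root split of every tree."""
    def __init__(self, group_col=0, **rf_kwargs):
        self.group_col = group_col
        self.rf_kwargs = rf_kwargs

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.models_ = {}
        for g in np.unique(X[:, self.group_col]):
            mask = X[:, self.group_col] == g
            forest = RandomForestClassifier(**self.rf_kwargs)
            forest.fit(np.delete(X[mask], self.group_col, axis=1), y[mask])
            self.models_[g] = forest
        return self

    def predict(self, X):
        X = np.asarray(X)
        out = np.zeros(len(X), dtype=int)
        for g, forest in self.models_.items():
            mask = X[:, self.group_col] == g  # institutes unseen at fit time stay 0
            if mask.any():
                out[mask] = forest.predict(np.delete(X[mask], self.group_col, axis=1))
        return out

# synthetic data: column 0 is the institute id; the label rule differs per institute
rng = np.random.default_rng(0)
inst = rng.integers(0, 2, size=400)
f1 = rng.normal(size=400)
X = np.column_stack([inst, f1])
y = np.where(inst == 0, (f1 > 0).astype(int), (f1 < 0).astype(int))

model = PerInstituteForest(group_col=0, n_estimators=25, random_state=0).fit(X, y)
train_acc = np.mean(model.predict(X) == y)
```

This keeps everything inside one fitted object while letting each institute's trees use whatever extra fields that institute provides.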

Probabilistic classification with Gaussian Bayes Classifier vs Logistic Regression
I have a binary classification problem with a few strong features that can predict almost 100% of the test data correctly, because the problem is relatively simple.
However, as the nature of the problem requires, I cannot afford to make mistakes. So instead of giving a prediction I am not sure of, I would rather have the output as a probability, set a threshold, and be able to say: "if I am less than 95% sure, I will call this 'NOT SURE' and act accordingly". Saying "I don't know" is better than making a mistake.
So far so good.
For this purpose, I tried the Gaussian Bayes classifier (I have a continuous feature) and logistic regression, both of which provide the probability as well as the predicted class.
Coming to my Problem:
GBC has around a 99% success rate, while logistic regression is lower, at around 96%, so I would naturally prefer GBC. However, as successful as GBC is, it is also very sure of itself: the probabilities I get are either 1 or very close to 1, such as 0.9999997. That makes things tough for me, because in practice GBC no longer gives me usable probabilities.
Logistic regression performs worse, but at least gives better and more 'sensible' probabilities.
Due to the nature of my problem, the cost of misclassifying grows as a power of 2, so if I misclassify 4 of the products, I lose 2^4 times more (it's unitless, but it gives an idea).
In the end, I would like to classify with a higher success rate than logistic regression, but also have meaningful probabilities so I can set a threshold and flag the cases I am not sure of.
Any suggestions?
Thanks in advance.
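One commonly suggested middle ground is to keep the stronger classifier but recalibrate its probabilities, e.g. with scikit-learn's CalibratedClassifierCV: the near-1 outputs of a Gaussian naive Bayes model are a classic calibration problem, and calibration pulls them toward honest frequencies without changing the underlying decision power much. A sketch on synthetic data (the dataset and the 0.95 threshold are placeholders):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# synthetic binary problem with one informative continuous feature
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# sigmoid (Platt) calibration wraps GaussianNB and rescales its probabilities
clf = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# abstain whenever the calibrated probability is not decisive enough
threshold = 0.95
sure = (proba >= threshold) | (proba <= 1 - threshold)
```

Predictions where `sure` is False would be reported as "NOT SURE"; method="isotonic" is an alternative when there is plenty of data.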

Python: Imbalanced data for XGBoost multi-class classification
I have a dataset of a stock's returns where the Y label is the price change direction (= 2 for an upward tick, = 1 for a downward tick, and = 0 for no move). Some of the features, X, include lagged label values (i.e. the previous day's price change direction).
I am trying to run an XGBoost classification model; however, my data is highly imbalanced: most of the Y label values are 0, meaning the stock price did not move.
How can I account for this imbalance in a multi-class XGBoost classification problem?
My code is the following:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

X = df[["ret_D_lag_1", "ret_D_lag_2", "ret_D_lag_3"]]
y = df["ret_D_t1"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# use DMatrix for xgboost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# set xgboost params
param = {
    'max_depth': 3,                 # the maximum depth of each tree
    'eta': 0.3,                     # the training step for each iteration
    'silent': 1,                    # logging mode - quiet
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3}                 # the number of classes that exist in this dataset
num_round = 20  # the number of training iterations

# Train the model
bst = xgb.train(param, dtrain, num_round)

# Predict and choose highest probability for each label
preds = bst.predict(dtest)
best_preds = np.asarray([np.argmax(line) for line in preds])