Splitting a dataset into train, cross-validation and test sets under a constraint
I need to split the feature set into 2 parts: (a) a train set (50%) and (b) a test set (50%) which is kept aside.
The train set is used for training and tuning the classifier. Hence, I have again split the 50% train set into a smaller train set (25%) and a validation set (25%).
My dataset is imbalanced, so I have to split the feature set carefully so that the train set, the validation set and the test set each contain at least some datapoints from the minority class. This is the constraint. I have tried to apply it but have got stuck.
- The total number of datapoints = 1000, out of which a = 950 datapoints belong to the majority class (label 0) and b = 50 datapoints belong to the minority class (label 1).
- The untouched test set is denoted by the variable `MTest` and should contain 500 datapoints.
- The training set is denoted by the variable `allData` and should contain the remaining 500 datapoints.
- `allData` is further split into a validation set inside the k-fold validation for loop.
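For context, the constraint above amounts to a stratified split: shuffle and cut each class separately, so every subset is guaranteed to keep minority examples and the two halves stay disjoint. A minimal sketch of that idea (in Python for illustration; the function and variable names here are mine, not from the code below):

```python
import random

def stratified_split(majority_idx, minority_idx, frac=0.5, seed=0):
    """Split index lists into two disjoint halves, each keeping both classes."""
    rng = random.Random(seed)
    maj, mino = majority_idx[:], minority_idx[:]
    rng.shuffle(maj)
    rng.shuffle(mino)
    cut_maj = int(len(maj) * frac)   # 50% of the majority class
    cut_min = int(len(mino) * frac)  # 50% of the minority class
    first = maj[:cut_maj] + mino[:cut_min]    # e.g. the train half
    second = maj[cut_maj:] + mino[cut_min:]   # e.g. the held-out half
    return first, second

# 950 majority (label 0) and 50 minority (label 1) datapoints, as above
train_idx, test_idx = stratified_split(list(range(950)), list(range(950, 1000)))
```

Because the cut is made per class, each half receives 475 majority and 25 minority indices, and no index can appear in both halves. The same bookkeeping carries over to MATLAB with `randperm` on each class's rows separately.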
PROBLEMS: After splitting, I checked and found that the combined number of elements in the train set (`allData`) and the test set (`MTest`) has increased to 1004 instead of 1000. Also, there are datapoints in the test set `MTest` that are repeated from the training set `allData`. Can somebody please help with the implementation and correct me where any of my concepts are wrong? Thank you.
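The 1004 count and the overlap both follow from appending two extra minority indices to each half *after* the full permutation has already been split in two: the appended indices necessarily duplicate positions that were already assigned to one of the halves. A quick arithmetic check of that effect (Python sketch; the fixed permutation and appended indices are stand-ins for `randperm`'s output):

```python
# perm stands in for r1 = randperm(1000); tot = 500 as in the code below
perm = list(range(1000))
tot = len(perm) // 2
train_idx = perm[:tot] + [951, 952]  # two extra minority indices appended
test_idx = perm[tot:] + [953, 954]   # two more appended to the test half

total = len(train_idx) + len(test_idx)    # 502 + 502 = 1004, not 1000
overlap = set(train_idx) & set(test_idx)  # appended train indices also sit in the test half
```

Here `total` is 1004 and `overlap` is {951, 952}: the indices appended to the train half already belong to the test half of the permutation (and the ones appended to the test half duplicate entries within it). This matches the symptoms described above.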
```matlab
clear all
Data1 = []; Data2 = []; Data = [];
featSize = 2;
y = zeros(1,featSize);
s = 1:featSize;
t = 0.105;
a = 950;  % majority-class count
b = 50;   % minority-class count
for i = 1:a
    x = randn(1,featSize);
    Data1 = [Data1; x];
end
for i = 1:b
    y = randn(1,featSize) + t.*s;
    Data2 = [Data2; y];
end
Data = [Data1; Data2];  % Data is created
% label the data, gives 0 to Normal data and 1 to abnormal data
Data(:,featSize+1) = [zeros(1,a) ones(1,b)];
M11 = Data(Data(:,end) == 0,:);  % a
M22 = Data(Data(:,end) == 1,:);  % b
aClass = size(M11,1);
bClass = size(M22,1);
DataSet = [M11; M22];  % featureSet
r1 = randperm(numel(DataSet(:,featSize+1)));
tot = floor(numel(DataSet(:,featSize))*0.5);
rarray = randperm(bClass) + aClass;
rand1 = rarray(1);
rand2 = rarray(2);
trainSetIdx = r1(1:tot);
trainSetIdx = [trainSetIdx rand1 rand2];
allData = DataSet(trainSetIdx,1:featSize);
targets = DataSet(r1(1:tot),featSize+1);
aClass_Train = size((DataSet(trainSetIdx, featSize)==0),1);
bClass_Train = size((DataSet(trainSetIdx, featSize)==1),1);
testIdx = r1(tot+1:end);
testSetTarg = DataSet(r1(tot+1:end),featSize+1);
% This avoids the test data to have empty set of abnormal data
rarray = randperm(bClass) + aClass;
rand1 = rarray(1);
rand2 = rarray(2);
testIdx = [testIdx rand1 rand2];
testSetTarg = [testSetTarg; 0];
testSetTarg = [testSetTarg; 0];
MTest = DataSet(testIdx,1:featSize);
sizeTrain = size(allData,1);
sizeTest = size(MTest,1);
nTotal = sizeTrain + sizeTest
kFolds = 5;
for k = 1:kFolds
    r = randperm(numel(targets));
    tot = floor(numel(targets)*0.5);
    trainData = allData(r(1:tot),:);
    trainTarg = targets(r(1:tot));
    trainIdx = r(1:tot);
    valIdx = r(tot+1:end);
    valTarg = targets(r(tot+1:end));
    % This avoids the validation data to have empty set of abnormal data
    rarray = randperm(bClass_Train) + aClass_Train;
    rand1 = rarray(1);
    rand2 = rarray(2);
    valIdx = [valIdx rand1 rand2];
    valTarg = [valTarg; 0];
    valTarg = [valTarg; 0];
    valData = allData(valIdx,:);
    size_trainingData = 100*(length(trainData)/size(allData,1));
    size_testTarg = 100*(length(valTarg)/size(allData,1));
    disp(['TrainingSet Size: ',num2str(size_trainingData), '%;', ' CV size: = ', num2str(size_testTarg), '%'])
end
```