Splitting a dataset into train, cross-validation and test sets with a constraint

I need to split the feature set into two parts: (a) a train set (50%) and (b) a test set (50%) that is kept aside.

The train set is used for training and tuning the classifier. Hence, I again split the 50% train set into a smaller train set (25% of the total) and a validation set (25% of the total).

My dataset is imbalanced, so I have to split the feature set carefully so that the train set, the validation set, and the test set each contain at least some datapoints from the minority class. This is the constraint. I have tried to apply it but have got stuck.

  • The total number of datapoints is 1000, of which a = 950 datapoints belong to the majority class (label 0) and b = 50 datapoints belong to the minority class (label 1).
  • The untouched test set is denoted by the variable MTest and should contain 500 datapoints.
  • The training set, denoted by the variable allData, should contain the remaining 500 datapoints. allData is further split into a validation set inside the k-fold cross-validation for loop.
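For comparison, I am aware that a stratified hold-out split can be done with cvpartition (Statistics and Machine Learning Toolbox); a minimal sketch using the class sizes above:

```matlab
% Sketch: stratified 50/50 hold-out split with cvpartition
% (Statistics and Machine Learning Toolbox).
labels = [zeros(950,1); ones(50,1)];      % a = 950 majority, b = 50 minority
c = cvpartition(labels, 'HoldOut', 0.5);  % stratified by label by default
trainLabels = labels(training(c));        % roughly 475 zeros and 25 ones
testLabels  = labels(test(c));            % the remaining rows, disjoint
```

Both halves then keep roughly the 95:5 class ratio, so neither split loses the minority class. However, I would like to implement the constraint manually, as below.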

PROBLEMS: After splitting, I checked and found that the total number of elements in the train and test sets (allData and MTest) had increased to 1004 instead of 1000. Also, there are datapoints in the test set MTest that are repeated from the training set allData. Can somebody please help with the implementation and correct me where any of my concepts are wrong? Thank you.
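I suspect the extra 4 elements come from appending rand1 and rand2 on top of index vectors that may already contain them. One alternative I have been considering is to permute each class separately and concatenate, which would guarantee minority rows in both halves without reusing any index (a sketch, using the same aClass/bClass sizes as in my code below):

```matlab
% Sketch: stratified split by permuting each class separately, so no
% index is used twice and both halves contain minority rows.
aClass = 950; bClass = 50;        % majority rows 1..950, minority rows 951..1000
majorIdx = randperm(aClass);
minorIdx = aClass + randperm(bClass);
trainIdx = [majorIdx(1:aClass/2), minorIdx(1:bClass/2)];         % 475 + 25 = 500
testIdx  = [majorIdx(aClass/2+1:end), minorIdx(bClass/2+1:end)]; % 475 + 25 = 500
% numel(trainIdx) + numel(testIdx) is exactly 1000, with no repeats.
```

Is this the right way to think about the constraint, or is there a flaw in my original approach that can be fixed more directly?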

clear all
Data1 =[];
Data2 =[];
Data =[];
featSize=2;
y = zeros(1,featSize);
s = 1:featSize;
t = 0.105;
a=950;
b=50;
for i = 1:a
    x = randn(1,featSize);
    Data1 = [Data1; x];
end

for i = 1:b

    y = randn(1,featSize) + t.*s;

    Data2 = [Data2; y];
end

Data=[Data1; Data2];                                 % Data is created

% label the data: 0 for normal (majority) and 1 for abnormal (minority)
Data(:,featSize+1) = [zeros(1,a) ones(1,b)];


M11 = Data(Data(:,end) == 0,:); %a
M22 = Data(Data(:,end) == 1,:);  %b
aClass = size(M11,1);
bClass = size(M22,1);

DataSet  = [M11; M22]; %featureSet;


r1=randperm(numel(DataSet(:,featSize+1)));
tot=floor(numel(DataSet(:,featSize))*0.5);
rarray = randperm(bClass )+aClass ;
rand1 = rarray(1);
rand2 = rarray(2);
trainSetIdx = r1(1:tot);
trainSetIdx = [trainSetIdx rand1  rand2];

allData=DataSet(trainSetIdx,1:featSize);
targets = DataSet(r1(1:tot),featSize+1  );


aClass_Train =  size((DataSet(trainSetIdx, featSize )==0),1);
bClass_Train =  size((DataSet(trainSetIdx, featSize )==1),1);



testIdx =  (r1(tot+1:end) );
testSetTarg = DataSet(r1(tot+1:end),featSize+1);
% This keeps the test set from having an empty set of abnormal data
rarray = randperm(bClass )+aClass ;
rand1 = rarray(1);
rand2 = rarray(2);
testIdx = [testIdx rand1  rand2];
testSetTarg = [testSetTarg; 0];
testSetTarg = [testSetTarg; 0];

MTest = DataSet(testIdx,1:featSize );

sizeTrain = size(allData,1);
sizeTest = size(MTest,1);
nTotal = sizeTrain+sizeTest
kFolds =5;

for k = 1:kFolds
    r=randperm(numel(targets));
    tot=floor(numel(targets)*0.5);
    trainData=allData(r(1:tot),:);
    trainTarg = targets(r(1:tot) );

    trainIdx = r(1:tot);
    valIdx =  (r(tot+1:end) );
    valTarg = targets(r(tot+1:end));
    % This keeps the validation set from having an empty set of abnormal data
    rarray = randperm(bClass_Train)+aClass_Train;
    rand1 = rarray(1);
    rand2 = rarray(2);
    valIdx = [valIdx rand1  rand2];
    valTarg = [valTarg; 0];
    valTarg = [valTarg; 0];
    valData = allData(valIdx,:);
    size_trainingData = 100*(length(trainData)/size(allData,1));
    size_testTarg = 100*(length(valTarg)/size(allData,1));
    disp(['TrainingSet Size: ',num2str(size_trainingData), '%;', ' CV size: = ', num2str(size_testTarg), '%'])
end
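The same concern applies to the inner validation split. If cvpartition is an option, I believe a stratified k-fold over the training half would look roughly like this (assuming targets is a label vector aligned row-for-row with allData):

```matlab
% Sketch: stratified k-fold cross-validation over the training half
% with cvpartition; each fold keeps roughly the same class ratio.
kFolds = 5;
cv = cvpartition(targets, 'KFold', kFolds);
for k = 1:kFolds
    trainData = allData(training(cv, k), :);
    valData   = allData(test(cv, k), :);
    % train and evaluate the classifier here
end
```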