# Splitting a dataset into train, cross-validation and test sets under a constraint

I need to split the feature set into two parts: (a) a train set (50%) and (b) a test set (50%), which is kept aside.

The train set is used for training and tuning the classifier. Hence, I have again split the 50% train set into another train set (25%) and a validation set (25%).

My dataset is imbalanced, so I have to split the feature set carefully so that none of the train, validation, or test sets ends up with no samples from the minority class. This is the constraint. I have tried to apply it but have got stuck.

- The total number of datapoints = 1000, of which `a = 950` datapoints belong to the majority class (label 0) and `b = 50` datapoints belong to the minority class (label 1).
- The untouched test set is denoted by the variable `MTest` and should contain 500 datapoints.
- The training set, denoted by the variable `allData`, should contain the remaining 500 datapoints. `allData` is further split into a validation set inside the `kfold` cross-validation for loop.
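
For reference, the class proportions I want in each half are what a stratified hold-out would give. A sketch using `cvpartition` (which assumes the Statistics and Machine Learning Toolbox is installed; this is not what my code below does, I am trying to implement the split manually):

```
% Sketch only: stratified 50/50 hold-out via cvpartition
% (assumes the Statistics and Machine Learning Toolbox).
labels = [zeros(950, 1); ones(50, 1)];    % 950 majority, 50 minority
c = cvpartition(labels, 'HoldOut', 0.5);  % stratified by the labels
trainIdx = training(c);                   % logical mask: ~475 + ~25 rows
testIdx  = test(c);                       % the other ~475 + ~25 rows
```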

**PROBLEMS**: After splitting, I checked and saw that the total number of elements in the train set (`allData`) and the test set (`MTest`) increased to `1004` instead of 1000. Also, datapoints from the training set `allData` are repeated in the test set `MTest`. Can somebody please help with the implementation and correct me where any concepts are wrong? Thank you.

```
clear all
Data1 = [];
Data2 = [];
Data = [];
featSize = 2;
y = zeros(1, featSize);
s = 1:featSize;
t = 0.105;
a = 950;
b = 50;
for i = 1:a
    x = randn(1, featSize);
    Data1 = [Data1; x];
end
for i = 1:b
    y = randn(1, featSize) + t.*s;
    Data2 = [Data2; y];
end
Data = [Data1; Data2];                    % Data is created
% label the data: 0 for normal data, 1 for abnormal data
Data(:, featSize+1) = [zeros(1, a) ones(1, b)];
M11 = Data(Data(:, end) == 0, :);         % majority class (a rows)
M22 = Data(Data(:, end) == 1, :);         % minority class (b rows)
aClass = size(M11, 1);
bClass = size(M22, 1);
DataSet = [M11; M22];                     % feature set
r1 = randperm(numel(DataSet(:, featSize+1)));
tot = floor(numel(DataSet(:, featSize)) * 0.5);
rarray = randperm(bClass) + aClass;
rand1 = rarray(1);
rand2 = rarray(2);
trainSetIdx = r1(1:tot);
trainSetIdx = [trainSetIdx rand1 rand2];
allData = DataSet(trainSetIdx, 1:featSize);
targets = DataSet(r1(1:tot), featSize+1);
aClass_Train = size((DataSet(trainSetIdx, featSize) == 0), 1);
bClass_Train = size((DataSet(trainSetIdx, featSize) == 1), 1);
testIdx = r1(tot+1:end);
testSetTarg = DataSet(r1(tot+1:end), featSize+1);
% This avoids the test data having an empty set of abnormal data
rarray = randperm(bClass) + aClass;
rand1 = rarray(1);
rand2 = rarray(2);
testIdx = [testIdx rand1 rand2];
testSetTarg = [testSetTarg; 0];
testSetTarg = [testSetTarg; 0];
MTest = DataSet(testIdx, 1:featSize);
sizeTrain = size(allData, 1);
sizeTest = size(MTest, 1);
nTotal = sizeTrain + sizeTest
kFolds = 5;
for k = 1:kFolds
    r = randperm(numel(targets));
    tot = floor(numel(targets) * 0.5);
    trainData = allData(r(1:tot), :);
    trainTarg = targets(r(1:tot));
    trainIdx = r(1:tot);
    valIdx = r(tot+1:end);
    valTarg = targets(r(tot+1:end));
    % This avoids the validation data having an empty set of abnormal data
    rarray = randperm(bClass_Train) + aClass_Train;
    rand1 = rarray(1);
    rand2 = rarray(2);
    valIdx = [valIdx rand1 rand2];
    valTarg = [valTarg; 0];
    valTarg = [valTarg; 0];
    valData = allData(valIdx, :);
    size_trainingData = 100 * (length(trainData) / size(allData, 1));
    size_testTarg = 100 * (length(valTarg) / size(allData, 1));
    disp(['TrainingSet Size: ', num2str(size_trainingData), '%;', ' CV size: = ', num2str(size_testTarg), '%'])
end
```
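
To show what I mean by the constraint: I believe shuffling and halving each class's row indices separately (a per-class split, which is not what my code above does) would keep `allData` and `MTest` disjoint while guaranteeing minority samples in both halves. A rough sketch of that idea, reusing `DataSet` and `featSize` from above:

```
% Sketch: per-class 50/50 split so that train and test indices are
% disjoint and each half keeps some minority-class rows by construction.
idx0 = find(DataSet(:, end) == 0);              % majority-class rows
idx1 = find(DataSet(:, end) == 1);              % minority-class rows
idx0 = idx0(randperm(numel(idx0)));             % shuffle within class
idx1 = idx1(randperm(numel(idx1)));
h0 = floor(numel(idx0) / 2);                    % 475 of 950
h1 = floor(numel(idx1) / 2);                    % 25 of 50
trainSetIdx = [idx0(1:h0); idx1(1:h1)];         % 500 rows, 25 minority
testIdx = [idx0(h0+1:end); idx1(h1+1:end)];     % the other 500 rows
allData     = DataSet(trainSetIdx, 1:featSize);
targets     = DataSet(trainSetIdx, end);
MTest       = DataSet(testIdx, 1:featSize);
testSetTarg = DataSet(testIdx, end);
% trainSetIdx and testIdx are disjoint and together cover all 1000 rows,
% so there are no duplicates and no extra appended indices.
```

Is this the right way to enforce the constraint, or is there a flaw in this reasoning too?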