How to perform undersampling for class imbalance in Matlab
The article "FraudMiner: A Novel Credit Card Fraud Detection Model Based on Frequent Itemset Mining" (download link: https://www.hindawi.com/journals/tswj/2014/252797/) and many other resources suggest undersampling the majority class as an approach to handle class imbalance. What I don't understand: suppose that out of 1500 examples, only 30 belong to the minority class. Would undersampling then shrink the dataset to 60 examples? That means I am losing a lot of data.
Another approach is a weighted SVM that assigns each class a penalty equal to the inverse of its frequency; this needs no sampling. I tried this approach and it did not work, so I have posted a separate question about that (Training by stratified cross-validation `cvpartition` yields poor performance on test data).
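For reference, the cost-sensitive idea can be expressed through `fitcsvm`'s `'Cost'` name-value pair instead of resampling. A minimal sketch on toy data (`X`, `y`, and `costMat` here are placeholders of my own, not variables from the code below):

```matlab
% Minimal sketch: cost-sensitive SVM via the 'Cost' matrix of fitcsvm.
% X and y are toy placeholders, not the variables used further down.
rng(0);
X = [randn(100,3); randn(5,3) + 2];   % 100 majority rows, 5 minority rows
y = [zeros(100,1); ones(5,1)];
ratio = sum(y==0) / sum(y==1);        % inverse class-frequency ratio
% Cost(i,j) = cost of predicting class j when the true class is i;
% rows/columns follow the sorted class order [0 1], so misclassifying
% the rare class (row 2) is penalized by 'ratio'.
costMat = [0 1; ratio 0];
mdl = fitcsvm(X, y, 'KernelFunction','RBF', 'KernelScale','auto', ...
              'Cost', costMat);
```

Whether this works better than sampling depends on the data, but it keeps every observation in training.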
Questions: Which data set should be downsampled? In the following code, `Data` is the entire feature set, and I applied `cvpartition` with the stratified option to get:
```matlab
CVO = cvpartition(targets, 'KFold', kFolds, 'Stratify', true);
% The main outer loop runs for as many folds as specified in kFolds and
% prepares a training set and a testing set.
for k = 1:CVO.NumTestSets
    trainIdx  = CVO.training(k);
    testIdx   = CVO.test(k);
    trainData = Data(trainIdx, 1:featSize);
    trainTarg = Data(trainIdx, featSize+1);
    testData  = Data(testIdx, 1:featSize);
    testTarg  = Data(testIdx, featSize+1);
end
```
Should I downsample the dataset denoted by the variable `trainData`, and how do I do that? Later I will use `trainData` for feature selection and hyperparameter tuning, and `testData` for checking the performance of the model in each fold. (That part is not in this question but in the other one I posted.)
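If you do undersample, a common practice is to apply it only to the training fold inside the CV loop, leaving the test fold untouched so that evaluation reflects the true class ratio. A minimal sketch of random undersampling, assuming `trainData` and `trainTarg` from the loop above (the `Bal`-suffixed names are mine):

```matlab
% Random undersampling of the majority class (label 0) inside one CV fold.
% Assumes trainData and trainTarg exist as in the loop above.
majIdx = find(trainTarg == 0);
minIdx = find(trainTarg == 1);
keep   = majIdx(randperm(numel(majIdx), numel(minIdx)));  % sample w/o replacement
balIdx = [keep; minIdx];
balIdx = balIdx(randperm(numel(balIdx)));                 % shuffle the balanced rows
trainDataBal = trainData(balIdx, :);   % balanced features for this fold
trainTargBal = trainTarg(balIdx);      % balanced labels for this fold
```

Note that with 30 minority examples and 5 folds, each training fold has roughly 24 rare examples, so the balanced training set has only about 48 rows; this small size is exactly why cost weighting or oversampling is often preferred at such extreme ratios.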
I also have another dataset, used only to test the performance of the model; it is extremely imbalanced and has never been used in training. Please find the full code below. Thank you for helping.
```matlab
clear all
rng('default');
data1 = [];
data2 = [];
allData = [];
featSize = 3;

% Random simulation of some data
s = 1:featSize;
t = 0.105;
a = 1470;   % number of examples of the majority class
b = 30;     % number of examples of the minority (rare event) class
for i = 1:a
    x = randn(1, featSize);
    data1 = [data1; x];
end
for i = 1:b
    y = randn(1, featSize) + t.*s;
    data2 = [data2; y];
end
allData = [data1; data2];

% Label the data: 0 for normal data, 1 for abnormal data
allData(:, featSize+1) = [zeros(1,a) ones(1,b)];
targets = allData(:, featSize+1);   % these are the labels

RARE_DATA   = allData(allData(:,end)==1, :);
NORMAL_DATA = allData(allData(:,end)==0, :);
aClass = size(NORMAL_DATA,1);   % size of the normal class (label 0)
bClass = size(RARE_DATA,1);     % size of the abnormal class (label 1)
data = [NORMAL_DATA; RARE_DATA];

% Assign weights as the inverse of class frequency
data(:, featSize+2) = [1/aClass*ones(1,aClass) 1/bClass*ones(1,bClass)];
weight = data(:, featSize+2);

% Shuffle the rows, then re-extract the labels so they stay aligned
% with the shuffled Data (using the unshuffled targets here would
% mismatch labels and rows)
indx = randperm(numel(targets));
Data = data(indx, :);
targets = Data(:, featSize+1);

kFolds = 5;   % this is where you specify your number of folds
CVO = cvpartition(targets, 'KFold', kFolds, 'Stratify', true);
% The main outer loop runs for as many folds as specified in kFolds and
% prepares a training set and a testing set.
for k = 1:CVO.NumTestSets
    trainIdx  = CVO.training(k);
    testIdx   = CVO.test(k);
    trainData = Data(trainIdx, 1:featSize);
    trainTarg = Data(trainIdx, featSize+1);
    testData  = Data(testIdx, 1:featSize);
    testTarg  = Data(testIdx, featSize+1);

    size_training = size(trainTarg, 1);
    size_testTarg = size(testTarg, 1);
    disp(['TrainingSet Size: ', num2str(size_training), ...
          ' CV size: ', num2str(size_testTarg)])

    bestSVM = fitcsvm(trainData, trainTarg, ...
        'KernelFunction', 'RBF', 'KernelScale', 'auto');
end

% Generate a TEST SET: a new simulated imbalanced set,
% then apply the `bestSVM` model to predict on it
```
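For the final evaluation step sketched in the last comment, something like the following could follow, assuming `testSet` is a new simulated matrix with the same column layout as `allData` (the name `testSet` is my placeholder):

```matlab
% Predict with the trained model and score the rare class.
% testSet is an assumed placeholder with the same layout as allData.
[pred, ~] = predict(bestSVM, testSet(:, 1:featSize));
cm = confusionmat(testSet(:, featSize+1), pred);   % rows = true class [0; 1]
recall    = cm(2,2) / sum(cm(2,:));                % sensitivity on the rare class
precision = cm(2,2) / max(sum(cm(:,2)), 1);        % guard against no positive preds
disp([recall precision])
```

On a set this imbalanced, recall and precision on the rare class are far more informative than overall accuracy, which a majority-class predictor would already make look high.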