Outlier Treatment using Python
I am a novice in Data Science, and in the problem I am trying to solve I am stuck at outlier detection and treatment. Some insights about the dataset:
 It's a regression problem
 It has both numerical and categorical features
 The numerical features include both discrete and continuous columns
 The categorical features are mostly nominal and ordinal columns
 I've already done missing-value imputation and categorical data transformation
I am stuck since I don't know how to detect and treat outliers in the numerical data. I would appreciate any help in proceeding further.
Please let me know if you want a snapshot of the numerical data in order to give a solution.
I haven't added one since it's a generic doubt; I don't even know how and what to use for outlier detection and treatment.
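For concreteness, one common starting point is Tukey's IQR rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are treated as outliers and either dropped or capped. A minimal sketch, assuming the numerical data lives in a pandas DataFrame (the income column and its values are made up for illustration):

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Clip values outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

df = pd.DataFrame({"income": [30, 32, 35, 31, 33, 500]})
df["income_capped"] = cap_outliers_iqr(df["income"])
```

Capping (winsorizing) keeps every row, which matters when dropping rows would also discard valid categorical information; dropping is the alternative when the extreme values are clearly erroneous.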
See also questions close to this topic

Spring validator custom HTTP status
I'd like to return a custom HTTP status 422 instead of a default 400 on a spring validation.
My validator:
@Component
@RequiredArgsConstructor
public class EmailUpdateDtoValidator implements Validator {

    private Errors errors;
    private EmailUpdateDto emailUpdateDto;

    @Override
    public boolean supports(Class<?> clazz) {
        return EmailUpdateDto.class.equals(clazz);
    }

    @Override
    public void validate(Object object, Errors errors) {
        this.errors = errors;
        this.emailUpdateDto = (EmailUpdateDto) object;
        validateEmail();
    }

    private void validateEmail() {
        if (!Email.isValid(emailUpdateDto.getEmail())) {
            errors.rejectValue("email", UserValidationErrorCodes.EMAIL_NOT_VALID.name());
        }
    }
}
How I set up the validation in the Controller:
@Slf4j
@RestController
@RequiredArgsConstructor
public class UserController {

    private final EmailUpdateDtoValidator emailUpdateDtoValidator;

    @InitBinder("emailUpdateDto")
    protected void initEmailValidationBinder(final WebDataBinder binder) {
        binder.addValidators(emailUpdateDtoValidator);
    }

    @RequestMapping(value = "/users/{hashedId}/email", method = RequestMethod.PUT)
    public void updateEmail(@RequestBody @Valid EmailUpdateDto emailUpdateDto) {
        ...
    }
}
Using this setup I always get a 400. How can I customize the HTTP status of the response?
Thanks

Form Validation multiple checkbox with text input (React)
I would like to make a table form like the one I attached.
I can check multiple checkboxes.
If I check one or more, I must put a reason on the right side, so I need to handle both.
If I don't check anything, it should show the error 'need to check at least one'.
If I checked some but didn't put a reason, it should show 'need to put reason'.
Then when I submit, it should show on the next page (don't worry about the next page).
How do I make these checkboxes with a text area, and how can I handle them?
It should be a table form.
Thanks

Turning a list of sheet names into a data validation
Below is some code I am trying to turn into part of a larger script. A little background: I'm trying to make a list of sheet names that I can turn into a data validation list, then go into the sheet I've picked and create another data validation list from that sheet (this all happens in my master sheet). The code below is a custom formula that I pair with googleclock; it creates a list, and I then use that list to create my data validation list. The code I'm trying to write is my way of skipping the middleman and making it dynamic.
function sheetnames() {
  var out = new Array();
  var sheets = SpreadsheetApp.getActiveSpreadsheet().getSheets();
  for (var i = 2; i < sheets.length; i++) {
    out.push([sheets[i].getName()]);
  }
  return out;
}
I also have all of the sheet names auto-populate cell A1 in each sheet, if that helps.

Building a model on Keras correctly
I'm new to neural networks and Keras, and I want to build a CNN that predicts certain values of an image (three values describing the size, length, and width of a blur placed on top of the image). All 3 values can range from 0 to 1, and I have a large data set.
I am not exactly sure how to build the CNN to do this, though, as all the prototype code I have built so far gives me predictions of the format
[1.,0.,0.]
instead of ranges between 0 and 1 for each value. On top of that, despite changing the number of epochs and the decay value in the SGD optimizer, I don't get any change in my loss function at all. Can you please tell me where I am going wrong? Here is what I have so far:

images, labels = load_dataset("images")  # function that loads images
images = np.asarray(images)  # images are flattened 424*424 arrays (grayscale)
labels = np.asarray(labels)  # labels are 3-arrays, each value is a float from 0-1

# I won't write this part but here I split into train_imgs and test_imgs

model = keras.Sequential()

# explicitly define SGD so that I can change the decay rate
sgd = keras.optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)

model.add(keras.layers.Dense(32, input_shape=(424*424,)))
model.add(keras.layers.Activation('relu'))
model.add(keras.layers.Dense(3, activation='softmax'))

model.compile(loss='mean_squared_error', optimizer=sgd)
# note: I also tried defining a weighted binary crossentropy but it changed nothing

checkpoint_name = 'Weights-{epoch:03d}-{val_loss:.5f}.hdf5'
checkpoint = ModelCheckpoint(checkpoint_name, monitor='val_loss', verbose=0,
                             save_best_only=True, mode='auto')
callbacks_list = [checkpoint]

model.fit(train_imgs, train_labls, epochs=20, batch_size=32,
          validation_split=0.2, callbacks=callbacks_list)

predictions = model.predict(test_imgs)  # make predictions on same test set!
Now I know that I am leaving out dropout layers, but I WANT the CNN to overfit my data, at this point I just want it to do anything. When I predict on the same set of images, I would hopefully get exact results, no? I'm not quite sure what I'm missing. Thanks for the help!
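One detail worth noting in the code above: a softmax output layer forces the three outputs to be non-negative and sum to 1, so three independent values in [0, 1] cannot be represented, which is consistent with predictions like [1., 0., 0.]. A plain numpy sketch (not Keras-specific) of the difference between the two activations:

```python
import numpy as np

def softmax(z):
    # outputs form a probability distribution: non-negative, summing to 1
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    # each output lies independently in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([2.0, -1.0, 0.5])
soft = softmax(z)   # components sum to 1, so they compete with each other
sig = sigmoid(z)    # components are independent of each other
```

This is only an illustration of the activation behavior; whether sigmoid or a linear output fits best depends on the data.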

I ask for help with a problem in choosing machine learning techniques and algorithms:
A college student must enroll in courses 001 and 002 in the university. Knowing the data from last year's students with their performances, he wants to know what the chances of simultaneous approval are.
Dataset:
Student   Course   Average   Situation
0125489   DIM001   7.0       Approved
0125489   DIM002   8.0       Approved
0125455   DIM001   3.0       Disapproved
0225455   DIM003   9.0       Approved
0225455   DIM002   2.0       Disapproved
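Whatever model ends up being chosen, the per-course rows above usually need to be reshaped to one row per student first (courses as columns), so that both course grades appear as features of the same example. A minimal sketch of that reshaping, assuming pandas:

```python
import pandas as pd

# the sample dataset from the question
df = pd.DataFrame({
    "Student":   ["0125489", "0125489", "0125455", "0225455", "0225455"],
    "Course":    ["DIM001", "DIM002", "DIM001", "DIM003", "DIM002"],
    "Average":   [7.0, 8.0, 3.0, 9.0, 2.0],
    "Situation": ["Approved", "Approved", "Disapproved", "Approved", "Disapproved"],
})

# one row per student, one column per course; missing courses become NaN
wide = df.pivot(index="Student", columns="Course", values="Average")
```

From this wide table, a probabilistic classifier (e.g., logistic regression) over the relevant course columns could estimate the chance of simultaneous approval.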

Can we extract data like account number and cheque number from a raw input text file using machine learning techniques like CRF or NER?
Say I have a string "hi this a cheque with acct no. 1234567 and a cheque no. 0198". I want to extract 'account no: 1234567' and 'cheque no: 0198'. Using NER or CRF techniques we can extract only entities; how do I go on with this problem? Can anyone suggest any algorithms or approaches?
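Before reaching for CRF or NER, a pattern-based baseline often covers fixed phrasings like these, since the field labels ("acct no.", "cheque no.") appear literally next to the values. A sketch, assuming the labels are always spelled this way:

```python
import re

text = "hi this a cheque with acct no. 1234567 and a cheque no. 0198"

# capture the digits that immediately follow each field label
acct = re.search(r"acct\s*no\.?\s*(\d+)", text, re.IGNORECASE)
chq = re.search(r"cheque\s*no\.?\s*(\d+)", text, re.IGNORECASE)
```

A learned sequence model becomes worthwhile once the phrasing varies too much for a manageable set of patterns.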

Function to remove outliers by group from dataframe
I am trying to remove the outliers from my dataframe containing x and y variables grouped by the variable cond. I have created a function to remove the outliers based on boxplot statistics, returning df without outliers. The function works well when applied to raw data. However, when applied to grouped data the function does not work and I get back an error:

Error in mutate_impl(.data, dots) : Evaluation error: argument "df" is missing, with no default.
Please, how can I correct my function to take the vectors df$x and df$y as arguments, and correctly get rid of outliers by group?
My dummy data:
set.seed(955)
# Make some noisily increasing data
dat <- data.frame(cond = rep(c("A", "B"), each = 22),
                  xvar = c(1:10 + rnorm(20, sd = 3), 40, 10,
                           11:20 + rnorm(20, sd = 3), 85, 115),
                  yvar = c(1:10 + rnorm(20, sd = 3), 200, 60,
                           11:20 + rnorm(20, sd = 3), 35, 200))

removeOutliers <- function(df, ...) {
  # first, identify the outliers and store them in a vector
  outliers.x <- boxplot.stats(df$x)$out
  outliers.y <- boxplot.stats(df$y)$out
  # remove the outliers from the original data
  df <- df[which(df$x %in% outliers.x), ]
  df[which(df$y %in% outliers.y), ]
}

# Remove outliers (try if function works)
removeOutliers(dat)

# Apply the function to group
# Not working!!!
dat_noOutliers <- dat %>%
  group_by(cond) %>%
  mutate(removeOutliers)
I have found this function to remove the outliers from vector data. However, I would like to remove outliers from both the df$x and df$y vectors in a dataframe.

remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

Flagging outliers
I'm trying to: (1) define outlier criteria (an upper (hi) and a lower (lo) fence), (2) flag the outliers, and (3) compute this for every observation (per column).
My dataset, h_median (a pandas dataframe), has 30 columns and 4 rows, so I need a loop for it. I'm at the point of defining the criteria:
def remove_outlier(h_median, variables):
    q1 = h_median[variables].quantile(0.25)
    q3 = h_median[variables].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = h_median.loc[(h_median[variables] > fence_low) & (h_median[variables] < fence_high)]
    return df_out
Thank you!
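The per-column loop can be avoided entirely, since pandas computes quantiles for all columns at once and comparisons broadcast column-wise. A minimal sketch of flagging (rather than removing) outliers across every column, assuming the frame is all numeric (the demo frame is made up for illustration):

```python
import pandas as pd

def flag_outliers(df, k=1.5):
    """Return a boolean DataFrame: True where a value lies outside its column's IQR fences."""
    q1 = df.quantile(0.25)   # Series of per-column Q1
    q3 = df.quantile(0.75)   # Series of per-column Q3
    iqr = q3 - q1
    lo = q1 - k * iqr
    hi = q3 + k * iqr
    # lt/gt align the per-column fences against each column
    return df.lt(lo) | df.gt(hi)

demo = pd.DataFrame({"a": [1, 2, 3, 100], "b": [10, 11, 12, 13]})
flags = flag_outliers(demo)
```

Keeping the flags as a separate boolean frame leaves the original data intact for inspection before any removal.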

Remove outliers in many columns using Mean and Standard Deviation in python
My csv file contains a dataframe with more than 400 columns and a datetime index. I want to remove outliers in each column using the mean and standard deviation (SD). Rows to be removed are those containing values that lie beyond (Mean - 2*SD) and (Mean + 2*SD).
For a single column, I can use the following code to list data to be kept:
col1 = df['Col_1']
mean = np.mean(col1, axis=0)
sd = np.std(col1, axis=0)

include = [i for i in col1 if (i > mean - 2 * sd)]
include = [i for i in col1 if (i < mean + 2 * sd)]
print(include)
My question: how can I do this in a single shot for the entire df, returning a final df of inliers only, while keeping the original table format with the datetime index and columns?
Other articles I have found so far deal only with a single array.
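A row-wise boolean mask handles this in one shot: compute each column's mean and SD, flag rows where any column falls outside mean ± 2*SD, and index the original frame with the mask so the datetime index and columns survive. A sketch, assuming all columns are numeric (the toy frame below is made up; ddof=0 matches np.std from the snippet above):

```python
import pandas as pd

# toy frame with a datetime index; 'b' has one value far outside mean +/- 2*SD
idx = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame({"a": [1., 2., 3., 2., 1., 2.],
                   "b": [10., 11., 9., 10., 100., 11.]}, index=idx)

# keep rows where every column is within mean +/- 2*SD of that column
mask = ((df - df.mean()).abs() <= 2 * df.std(ddof=0)).all(axis=1)
inliers = df[mask]
```

Because the mask is applied to the original frame, inliers keeps the full table format, just with the offending rows dropped.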