Outliers for all numerical values to mean SAS
I am working in SAS with a dataset with a lot of numeric values which I have standardised as follows:
proc standard data=df mean=0 std=1
out=df;
run;
Is there any easy way to deal with outliers (+/ 3standard deviation) for all numeric values? Ideally I would want to change all of those to + or  3x standard deviation, or in worst case remove them.
1 answer

You have to run through the data twice. There are may ways you can adjust your output. Here's a simple way using a datastep:
Assuming your dataset has a standardized variable called 'test':
Data adjusted; set df; if test > 3 then test=3; if test < 3 then test =3; run;
just remember your new dataset will no longer have a mean of 0 and a standard deviation of 1
See also questions close to this topic

AS 9.4: Proc Transpose for Height variable, dont know how to include another column
I am using SAS 9.4 and trying to transpose an Excel table with 13 observations and 5 variables. Specifically, I have something that looks like this:
Tree ID Latin Name Height Pt1 Pinus Taeda 169 Pt1 Pinus Taeda 52 Pt1 Pinus Taeda 34 Pt2 Pinus Taeda 148.7 Id1 Ilex Decidua 25 Id1 Ilex Decidua 198 Id2 Ilex Decidua 141
I am trying to transpose it so that individual trees have a single record recording their multiple height measurements. Specifically, I’m looking to have something like this:
Tree ID Latin Name Height 1 Height 2 Height 3 Pt1 Pinus taeda 169 52 34 Pt2 Pinus taeda 148.7 . . ... ... ... ... ... Id1 Ilex Decidua 25 198 .
My code so far only gives me the Tree ID column and the Height 1 Height2 Height3 columns with the correct values but I cannot figure out how to get it to also show the Latin Name column?

trying to concatenate values from data sets into an array in sas
I am trying to add a Data step that creates the
work.orders_fin_qtr_tot
data set from thework.orders_fin_tot
data set. This new data set should contain new variables for quarterly sales and profit. Use two arrays to create the new variables:QtrSales1QtrSales4
andQtrProfit1QtrProfit4
. These represent total sales and total profit for the quarter (14). Use the quarter number of the year in which the order was placed to index into the correct variable to add either the TotalSales or TotalProfit to the new appropriate variable.Add a Proc step that displays the first 10 observations of the
work.orders_fin_qtr_tot
data set.My issue is that I can't seem to get the two diff arrays to meld with out spaces
proc sort data=work.orders_fin_tot_qtr; by workqtr; run; data work.orders_fin_tot_qtr; set work.orders_fin_tot_qtr; array QtrSales{4} quarter1quarter4 ; do i = 1 by 1 until (last.order_id); if workqtr=i then QtrSales{i}=totalsales; end; drop totalsales totalprofit _TYPE_ _FREQ_; run; proc print data=work.orders_fin_tot_qtr; run;

reading in data in sas works on some platforms and not on others
Why does this code not work in SAS University Edition even though it works on SAS 9.4 for Windows?
DATA HealthPlan; INFILE DATALINES FIRSTOBS = 2; INPUT Department :$10. HealthPlan $; DATALINES; Department HealthPlan Accounting A Food A Security B ; RUN;
It seems to be something with the treatment of delimiters. If I add
DLM = '09'x
, then it will work on University Edition, but otherwise the output from my log file is as follows:1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 72 73 DATA HealthPlan; 74 INFILE DATALINES FIRSTOBS = 2; 75 INPUT Department :$10. HealthPlan $; 76 DATALINES; NOTE: LOST CARD. RULE: +1+2+3+4+5+6+7+8+9+0 81 ; Department=SecurityB HealthPlan= _ERROR_=1 _N_=2 NOTE: SAS went to a new line when INPUT statement reached past the end of a line. NOTE: The data set WORK.HEALTHPLAN has 1 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds 81 ; 82 RUN; 83 84 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 97
Could it be something with the encoding?

Function to remove outliers by group from dataframe
I am trying to remove the outliers from my dataframe containing
x
andy
variables grouped by variablecond
.I have created a function to remove the outliers based on a boxplot statistics, and returning
df
without outliers. The function works well when applied for a raw data. However, if applied on grouped data, the function does not work and I got back an error:Error in mutate_impl(.data, dots) : Evaluation error: argument "df" is missing, with no default.
Please, how can I correct my function to take vectors
df$x
anddf$y
as arguments, and correctly get rid of outliers by group?
My dummy data:
set.seed(955) # Make some noisily increasing data dat < data.frame(cond = rep(c("A", "B"), each = 22), xvar = c(1:10+rnorm(20,sd=3), 40, 10, 11:20+rnorm(20,sd=3), 85, 115), yvar = c(1:10+rnorm(20,sd=3), 200, 60, 11:20+rnorm(20,sd=3), 35, 200)) removeOutliers<function(df, ...) { # first, identify the outliers and store them in a vector outliers.x<boxplot.stats(df$x)$out outliers.y<boxplot.stats(df$y)$out # remove the outliers from the original data df<df[which(df$x %in% outliers.x),] df[which(df$y %in% outliers.y),] } # REmove outliers (try if function works) removeOutliers(dat) # Apply the function to group # Not working!!! dat_noOutliers< dat %>% group_by(cond) %>% mutate(removeOutliers)
I have found this function to remove the outliers from a vector data . However, I would like to remove outliers from both
df$x
anddf$y
vectors in a dataframe.remove_outliers < function(x, na.rm = TRUE, ...) { qnt < quantile(x, probs=c(.25, .75), na.rm = na.rm, ...) H < 1.5 * IQR(x, na.rm = na.rm) y < x y[x < (qnt[1]  H)] < NA y[x > (qnt[2] + H)] < NA y }

Flagging outliers
I'm trying to : 1 Define an outlier criteria ( up (hi) and low (lo) ). 3 Compute every observation ( per column) 2 Flag the outliers
My dataset,h_median (pandas dataframe) has 30 columns and 4 rows, so I need a loop for it. I'm at the point of defining the criteria:
def remove_outlier(h_median,variables): q1 = h_median[variables].quantile(0.25) q3 = h_median[variables].quantile(0.75) iqr = q3q1 #Interquartile range fence_low = q11.5*iqr fence_high = q3+1.5*iqr df_out = h_median.loc[(h_median[variables]> fence_low) & (h_median[variables]< fence_high)] return df_out`
Thank you!

Remove outliers in many columns using Mean and Standard Deviation in python
My csv file contains dataframe with more than 400 columns and datetime index. I want to remove outliers in each column using Mean and Standard Deviation (SD). Rows to be removed are those containing values that lie beyond (Mean  2* SD) and (Mean + 2*SD).
For a single column, I can use the following code to list data to be kept:
col1 = df['Col_1'] mean = np.mean(col1, axis=0) sd = np.std(col1, axis=0) include = [i for i in col1 if (i > mean  2 * sd)] include = [i for i in col1 if (i < mean + 2 * sd)] print(include)
My questions: how to do it by a single python shot to the entire df, returning a final df of inliers only, and keeping the original table format with datetime index and columns.
Other articles found so far deal with a single array.