Outliers for all numerical values to mean SAS
I am working in SAS with a dataset with a lot of numeric values which I have standardised as follows:
proc standard data=df mean=0 std=1
out=df;
run;
Is there an easy way to deal with outliers (beyond +/- 3 standard deviations) for all numeric values? Ideally I would want to change all of those to + or - 3x the standard deviation, or in the worst case remove them.
1 answer

You have to run through the data twice. There are many ways you can adjust your output. Here's a simple way using a data step:
Assuming your dataset has a standardized variable called 'test':
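data adjusted;
  set df;
  /* cap values beyond +/- 3 standard deviations */
  if test > 3 then test = 3;
  /* SAS treats missing values as smaller than any number, so skip them */
  else if not missing(test) and test < -3 then test = -3;
run;
Just remember that your new dataset will no longer have a mean of exactly 0 and a standard deviation of exactly 1.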
See also questions close to this topic

SAS if variable exist and less than 1, then do
I need to write a condition for variables that may or may not be present in my dataset.
Something like: "IF a variable with a name starting with VAR1 (the first few letters of the variable's name) exists and it is less than 1, then do: error=error1, else error=0".
Is there a function for it? Or what is the best way to do it?

Data step multiple sets with column
I am new to SAS and have a rather simple issue that I could not solve myself.
I have 12 monthly datasets named data_1404 to data_1503, all with the same set of variables. My task is to append them into a yearly dataset and add a month column to it.
I tried the below.
DATA data_yearly; SET data_1404 data_1405...; RUN;
It works; however, I cannot figure out how to add the month column to it. I have tried a macro with &do, but with no luck.
I believe this should be something very simple, and it would be much appreciated if someone could help me out with this. Thank you!!

How to write ods latex template in SAS
Based on this document:
https://support.sas.com/resources/papers/proceedings14/20332014.pdf
I created the attached result using the attached code,
but I want to produce my own template that creates output like this (tables with multirows and partial grids):
Does anyone know how to achieve this, or know a good source for learning the language used in ODS templates? Many thanks!
ods path(prepend) work.templat(update);

proc template;
  define tagset Tagsets.event1;
    define event colspec_entry;
      put just "" /if ^cmp( just, "d");
      put "r" /if cmp( just, "d");
    end;
    define event table;
      start:
        put NL;
        put NL;
        put "\begin{longtable}";
      finish:
        put "\hline \end{longtable}" NL;
        put NL;
    end;
    define event stacked_cell;
      start:
        put NL;
        put "\begin{tabular}";
        trigger alignment;
      finish:
        put "\end{tabular}" NL;
    end;
    define event colspecs;
      start:
        put "{";
      finish:
        put "}\hline" NL;
    end;
    define event colspec_entry;
      put just /if ^cmp( just, "d");
      put "r" /if cmp( just, "d");
    end;
    define event row;
      finish:
        put "\\" NL;
    end;
    define event header;
      start:
        trigger data;
      finish:
        trigger data;
    end;
    define event data;
      start:
        put VALUE /if cmp( $sascaption, "true");
        break /if cmp( $sascaption, "true");
        put %nrstr(" & ") /if ^cmp( COLSTART, "1");
        put " ";
        unset $colspan;
        set $colspan colspan;
        do /if exists( $colspan) | exists ( $cell_align );
          put "\multicolumn{";
          put colspan /if $colspan;
          put "1" /if ^$colspan;
          put "}{";
          put "" /if ^$instacked;
          put just;
          put "" /if ^$instacked;
          put "}{";
        done;
        put tranwrd(VALUE,"","$$") /if contains( HTMLCLASS, "data");
        put VALUE /if ^contains( HTMLCLASS, "data");
      finish:
        break /if cmp( $sascaption, "true");
        put "}" /if exists( $colspan) | exists ( $cell_align );
    end;
    define event rowspanfillsep;
      put %nrstr(" & ");
    end;
    define event rowspancolspanfill;
      put " ~";
    end;
    define event image;
      put "\includegraphics{";
      put BASENAME /if ^exists( NOBASE);
      put URL;
      put "}" NL;
    end;
    parent = tagsets.latex;
  end;
run;

data t;
  length v1 $20;
  input v1 $ v2;
  datalines;
f=ma 21
sqrt(abd) 22
;
run;

ods tagsets.simplelatex file="/scratch/columbia/zz89/t1.tex";
proc print data=t;
run;
ods tagsets.simplelatex close;

ods tagsets.event1 file="/scratch/columbia/zz89/t2.tex";
proc print data=t;
run;
ods tagsets.event1 close;

data re1;
  infile "/scratch/columbia/zz89/t1.tex" pad missover;
  file "/scratch/columbia/zz89/t1_out.tex";
  length c1 $200;
  input c1 $ 1-200;
  p1=index(c1,'sqrt');
  if p1>0 then do;
    c2=substr(c1,(p1+5));
    p2=index(c2,')');
    c3=substr(c1,1,(p1-1))||'$\sqrt{'||scan(c2,1,')')||'}$'||substr(c2,(p2+1));
    put c3;
  end;
  else put c1;
run;

Deleting the same outliers in two timeseries
I have a question about eliminating outliers from two time series. One time series contains spot market prices and the other contains power outputs. Both series run from 2012 to 2016 and are CSV files with a timestamp followed by a value. As an example, for the power output: 20120101 00:00:00,2335.2152646951617 and for the price: 20120101 00:00:00,17.2
Because the spot market prices are very volatile and have a lot of outliers, I have filtered them. For the second time series, I have to delete the values whose timestamps were eliminated from the price series. I thought about generating a list of the deleted values and writing a loop to delete the values with the same timestamp in the second time series. But so far that has not worked and I'm not getting anywhere. Does anyone have an idea?
My Python code looks as follows:
import pandas as pd
import matplotlib.pyplot as plt

power_output = pd.read_csv("./data/external/power_output.csv", delimiter=",", parse_dates=[0], index_col=[0])
print(power_output.head())
plt.plot(power_output)

spotmarket = pd.read_csv("./data/external/spotmarket_dhp.csv", delimiter=",", parse_dates=[0], index_col=[0])
print(spotmarket.head())

# Percentage changes of the spot price
r = spotmarket['price'].pct_change().dropna() * 100
print(r)
plt.plot(r)

# Keep only changes within 2 * IQR of the quartiles
Q1 = r.quantile(.25)
Q3 = r.quantile(.75)
q1 = Q1-2*(Q3-Q1)
q3 = Q3+2*(Q3-Q1)
a = r[r.between(q1, q3)]
print(a)
plt.plot(a)
Can somebody help me?
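One way to carry the filter over to the second series, sketched under assumptions: both frames are indexed by timestamp, and the rows to keep are exactly those whose timestamp survives in the filtered price series. The helper name `drop_matching_timestamps`, the sample values and the 0 to 100 price band are hypothetical and only serve to make the example self-contained.

```python
import pandas as pd


def drop_matching_timestamps(filtered, other):
    """Keep only the rows of `other` whose index label is still in `filtered`."""
    return other.loc[other.index.isin(filtered.index)]


# Hypothetical sample data: five hourly timestamps, one obvious price spike.
idx = pd.to_datetime([
    "2012-01-01 00:00:00",
    "2012-01-01 01:00:00",
    "2012-01-01 02:00:00",
    "2012-01-01 03:00:00",
    "2012-01-01 04:00:00",
])
price = pd.Series([17.2, 17.5, 250.0, 17.1, 16.9], index=idx, name="price")
power = pd.DataFrame({"power": [2335.2, 2340.1, 2338.7, 2329.4, 2331.0]}, index=idx)

# Stand-in for the real outlier filter: drop prices outside an assumed band.
kept_price = price[price.between(0, 100)]

# The power rows at the removed timestamps are dropped without an explicit loop.
power_aligned = drop_matching_timestamps(kept_price, power)
print(power_aligned)
```

Applied to the code in the question, `filtered` would be the series `a` and `other` would be `power_output`. Because `a` comes from `pct_change().dropna()`, its first timestamp is missing as well, so that row would also be dropped from the power series.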

Removing outlier from dataframe by filtering single column
I have a dataframe like this:
A   B   C
1   10  121
5   6   122
7   8   123
9   10  124
12  23  125
10  24  1500
13  36  1600
By applying the mean +/- 2 standard deviations method to column C, I wish to remove the outliers from C and filter the dataframe, so that I finally get
A   B   C
1   10  121
5   6   122
7   8   123
9   10  124
12  23  125
This is my code:
target = df['C']
mean = target.mean()
sd = target.std()
lower_boundary = [x for x in target if (x < mean - 2 * sd)]
upper_boundary = [x for x in target if (x > mean - 2 * sd)]
selected_df = df[(target == lower_boundary) & (target == upper_boundary)]
selected_df
But it shows
TypeError: invalid type comparison
error. Could you tell me where I make a mistake, please?
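A minimal sketch of the filter with a boolean mask instead of lists, assuming the goal is to keep rows where C lies within mean +/- k standard deviations. The function name `filter_by_std` and the parameter `k` are hypothetical. The likely cause of the error is that the original code compares a Series with plain Python lists.

```python
import pandas as pd


def filter_by_std(df, column, k=2.0):
    """Keep rows whose `column` value lies within mean +/- k * std."""
    mean = df[column].mean()
    sd = df[column].std()  # pandas default: sample standard deviation (ddof=1)
    mask = df[column].between(mean - k * sd, mean + k * sd)
    return df[mask]


# The dataframe from the question.
df = pd.DataFrame({
    "A": [1, 5, 7, 9, 12, 10, 13],
    "B": [10, 6, 8, 10, 23, 24, 36],
    "C": [121, 122, 123, 124, 125, 1500, 1600],
})

kept_k2 = filter_by_std(df, "C", k=2.0)
kept_k1 = filter_by_std(df, "C", k=1.0)
print(kept_k2)
print(kept_k1)
```

On this seven-row sample the two large values inflate the standard deviation so much that k=2 keeps every row. Only a tighter band such as k=1 returns the five expected rows, so the threshold may need adjusting for small datasets.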

Converting a random dataset into a Gaussian distribution
I have to mark outliers in a pandas dataset, and everything I have found recommends using the Z-score. My dataset doesn't follow a Gaussian distribution, and I was wondering what the best method is to fit it to one in Python.
dataset:
20180513 33 68 30 64
20180514 31 43 17 56
20180515 31 42 17 48
20180516 37 40 20 63
20180517 24 37 33 67
20180518 36 31 20 89
20180519 36 34 21 61
20180520 46 54 26 65
20180521 30 40 15 67
20180522 30 38 15 70
20180523 21 37 13 47
20180524 26 58 9 62
20180525 27 54 7 64
20180526 25 48 17 66
20180527 43 71 16 74
20180528 33 72 19 75
20180529 35 42 13 47
20180530 26 36 15 53
20180531 16 40 10 65
20180601 32 35 14 61
20180602 27 45 11 52
20180603 50 56 20 49
20180604 36 42 16 63
20180605 33 39 11 53
20180606 28 36 13 54
20180607 35 38 12 72
20180608 24 40 12 55
20180609 30 48 12 67
20180610 42 37 11 55
20180611 32 32 16 61
Thanks for the help.
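This sketch does not convert the data to a Gaussian. It shows an alternative that may be worth considering: the modified Z-score, which uses the median and the median absolute deviation (MAD) and is less sensitive to non-normal data. The constant 0.6745 and the cutoff 3.5 are the commonly cited defaults. The function names are hypothetical, and the sample takes the first ten values of the first column above plus one made-up extreme value (200) added for illustration.

```python
import pandas as pd


def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    median = values.median()
    mad = (values - median).abs().median()
    if mad == 0:
        # Simplification for this sketch: with zero MAD, score everything as 0.
        return pd.Series(0.0, index=values.index)
    return 0.6745 * (values - median) / mad


def mark_outliers(values, threshold=3.5):
    """Boolean Series that is True where the modified Z-score exceeds the threshold."""
    return modified_z_scores(values).abs() > threshold


# First ten values of the first data column, plus a made-up extreme value.
sample = pd.Series([33, 31, 31, 37, 24, 36, 36, 46, 30, 30, 200])
flags = mark_outliers(sample)
print(sample[flags])
```

For a dataframe with several numeric columns, the same function could be applied per column with `df.apply(mark_outliers)`, assuming each column should be judged on its own scale.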