
Handling Outliers in SAS During Data Analysis
Published: 2015-08-10

 

Data analysis inevitably involves dealing with outliers, and Winsorizing is a technique that is used frequently in SAS.

 

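Winsorizing is a way of handling extreme values: outliers are pulled in so that they equal the largest or smallest value of the normal range of the data. For example, suppose most of your data lie in the interval [70, 90], but a few values are unusually large or small, such as 60, 65, 95, and 125. Winsorizing adjusts these extreme values so that they fall within a reasonable range that you define. For the upper limit, if values more than 10% above 90 (that is, above 99) count as outliers, then 95 is within the limit and appears unchanged in the Winsorized data set, whereas 125 is treated as an outlier: its original value does not appear in the Winsorized data set, because it is replaced by the limit. The lower limit works the same way.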
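The [70, 90] example can be sketched in a few lines of SAS/IML. This is only an illustration with fixed limits: the sample values and the lower limit of 63 (10% below 70) are assumptions made for the sketch, and the module later in this article uses a proportion of ranked observations rather than fixed limits.

```sas
proc iml;
/* Hypothetical sample: most values lie in [70, 90], plus four extreme values */
x = {60, 65, 72, 78, 81, 85, 90, 95, 125};
lower = 63;   /* assumed lower limit: 10% below 70 */
upper = 99;   /* upper limit from the text: 10% above 90 */
/* <> is the elementwise maximum operator, >< is the elementwise minimum */
w = (x <> lower) >< upper;
print x w;    /* 60 becomes 63 and 125 becomes 99; 65 and 95 are unchanged */
quit;
```

Every observation is kept; only the values outside the limits change, which is what separates Winsorizing from trimming.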

 

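Winsorizing gets somewhat more complicated when the data contain missing values or repeated values. You could start by sorting the data, but missing values distort the calculation. The SAS/IML module below avoids these common problems by counting only the nonmissing values and working with ranks.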

 

Winsorizing in SAS:

 

%let DSName = sashelp.heart;
proc iml;
/* SAS/IML module to Winsorize each column of a matrix.
   Input proportion of observations to Winsorize: prop < 0.5.
   Ex: y = Winsorize(x, 0.1) computes the two-sided 10% Winsorized data */
start Winsorize(x, prop);
   p = ncol(x);            /* number of columns */
   w = x;                  /* copy of x */
   do i = 1 to p;
      z = x[,i];           /* copy i_th column */
      n = countn(z);       /* count nonmissing values */
      k = ceil(prop*n);    /* number of obs to trim from each tail */
      r = rank(z);         /* rank values in i_th column */
      /* find target values and obs with smaller/larger values */
      lowIdx = loc(r<=k & r^=.);
      lowVal = z[loc(r=k+1)];
      highIdx = loc(r>=n-k+1);
      highVal = z[loc(r=n-k)];
      /* Winsorize (replace) k smallest and k largest values */
      w[lowIdx,i] = lowVal;
      w[highIdx,i] = highVal;
   end;
   return(w);
finish;

/* test the algorithm on numerical vars in a data set */
use &DSName;
read all var _NUM_ into X[colname=varNames];
close;
winX = Winsorize(X, 0.1);

 

 

In the code, the matrix winX contains the Winsorized data. If you want to print the Winsorized data and inspect it, it is easier to work with a small data set; for example, change the first line to %let DSName = sashelp.class;
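If the Winsorized values are needed outside the PROC IML session, one option is to write winX to a SAS data set. The statements below are a minimal sketch meant to run inside the same PROC IML session, after winX has been computed; the data set name WinsorData is a hypothetical name chosen for this example.

```sas
/* Run inside the same PROC IML session, after winX exists */
print (winX[1:5,])[colname=varNames];           /* first five Winsorized rows */

create WinsorData from winX[colname=varNames];  /* WinsorData: hypothetical name */
append from winX;                               /* write all rows of winX */
close WinsorData;
```

The colname= option reuses the variable names that were read from the original data set, so the output columns keep their names.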

 

Before processing a large volume of data, you may want to verify that the Winsorizing code is correct. One way is to compare the Winsorized means computed in SAS/IML with the Winsorized means computed by PROC UNIVARIATE.
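For reference, the quantity being compared can be written out explicitly. Assume a column has n nonmissing values, sorted as x_(1) <= ... <= x_(n), and that k = ceil(prop*n) values are replaced in each tail, as in the module above. Under those assumptions the Winsorized mean is:

```latex
\bar{x}_{w} = \frac{1}{n}\left[ (k+1)\,x_{(k+1)}
            + \sum_{i=k+2}^{n-k-1} x_{(i)}
            + (k+1)\,x_{(n-k)} \right]
```

The k smallest values are replaced by x_(k+1) and the k largest by x_(n-k), so each of those two values is counted k+1 times and the total number of terms is still n.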

 

/* Compute Winsorized mean, which is mean of the Winsorized data */
winMean = mean(winX);
print winMean[c=varNames f=8.4];

/* Validation: compute Winsorized means by using UNIVARIATE */
ods exclude all;
proc univariate data=&dsname winsorized=0.1;
   ods output WinsorizedMeans=winMeans;
run;
ods exclude none;

proc print data=winMeans;
   var VarName Mean;
run;

 

Source: SAS中文論壇 (SAS Chinese Forum)