Subsetting for proportional representation in R
I can't wrap my tiny brain around this one. One dataframe contains observations, each with a gender and an age bracket. I'm trying to write a function that returns a subset of the rows of this dataframe in which each age/gender combination appears in a proportion roughly equal to the value in the "props" dataframe. Ideally, the function will trim as few observations as possible. The results can be approximate (by "approximate"/"roughly equal", I mean that each group's representation in the output should be within 5% of the desired proportion, and the deviation should generally be as small as possible).
ages <- c("18-29", "30-39", "40-49", "50-59", "60+")
genders <- c("M", "F")
set.seed(101)
df <- data.frame("id" = paste0("p", c(1:500)),
                 "gender" = sample(genders, replace = TRUE, size = 500),
                 "age" = sample(ages, replace = TRUE, size = 500))
props <- data.frame("age" = c(ages, ages),
                    "gender" = genders,
                    "pcts" = c(.0835, .1145, .1145, .1145, .073,
                               .0835, .1145, .1145, .1145, .073))

select_max <- function(df, props) {
  ....
  return(subset)
}
I experimented with solutions using least common multiples and greatest common divisors, but these fell apart when the proportions didn't work nicely together. I'm considering a solution that adds and subtracts rows one at a time until it gets close enough to the desired proportions, but I feel there must be a more elegant one. All help is appreciated. This is a fun one, for sure.
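One approach that avoids LCM/GCD arithmetic entirely, sketched here in Python (the arithmetic carries over to R directly): for each age/gender group g with n_g available rows and target proportion p_g, the largest total size the data can support is N = min over g of floor(n_g / p_g); then draw round(N * p_g) rows from each group. The function name and the example group counts below are hypothetical.

```python
import math

def select_max(counts, props):
    """counts: dict group -> available rows; props: dict group -> target share.
    Returns dict group -> rows to keep, maximizing the total kept."""
    # Largest total N such that every group can supply round(N * p) rows.
    n_total = min(math.floor(n / props[g]) for g, n in counts.items())
    # Per-group quota; since N * p_g <= n_g, rounding never exceeds supply.
    return {g: round(n_total * p) for g, p in props.items()}

# Hypothetical example: two groups with a 70/30 target split.
counts = {"M_18-29": 40, "F_18-29": 25}
props = {"M_18-29": 0.7, "F_18-29": 0.3}
print(select_max(counts, props))  # {'M_18-29': 40, 'F_18-29': 17}
```

With the 5% slack allowed, one could also search downward from this N for totals whose rounding error is smaller, but the min/floor bound already trims as few observations as possible.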
See also questions close to this topic

Group several columns then aggregate a set of columns in Pandas (It crashes badly compared to R's data.table)
I am relatively new to the world of Python and am trying to use it as a backup platform for data analysis. I generally use data.table for my data-analysis needs. The issue is that when I run a group-aggregate operation on a big CSV file (randomized, zipped, uploaded at http://www.filedropper.com/ddataredact_1), Python throws:
return getattr(obj, method)(*args, **kwds)
ValueError: negative dimensions are not allowed

OR (I have even encountered...)

File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in cartesian_product
  for i, x in enumerate(X)]
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in <listcomp>
  for i, x in enumerate(X)]
File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 445, in repeat
  return _wrapfunc(a, 'repeat', repeats, axis=axis)
File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 51, in _wrapfunc
  return getattr(obj, method)(*args, **kwds)
MemoryError
I have spent three days trying to reduce the file size (I was able to reduce it by 89%), adding breakpoints, and debugging, but I was not able to make any progress.
Surprisingly, I thought of running the same group/aggregate operation in data.table in R, and it hardly took 1 second. Moreover, I didn't have to do any data-type conversion etc., as suggested at https://www.dataquest.io/blog/pandas-big-data/. I also researched other threads: Avoiding Memory Issues For GroupBy on Large Pandas DataFrame; Pandas: df.groupby() is too slow for big data set. Any alternatives methods?; and pandas groupby with sum() on large csv file?. It seems those threads are more about matrix multiplication. I'd appreciate it if you wouldn't tag this as a duplicate.
Here's my Python code:
finaldatapath = "..\Data_R"
ddata = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv",
                    low_memory=False, encoding="ISO-8859-1")

#before optimization: 353MB
ddata.info(memory_usage="deep")

#optimize file: Object types are the biggest culprit.
ddata_obj = ddata.select_dtypes(include=['object']).copy()

#Now convert this to category type:
#Float type didn't help much, so I am excluding it here.
for col in ddata_obj:
    del ddata[col]
    ddata.loc[:, col] = ddata_obj[col].astype('category')

#release memory
del ddata_obj

#after optimization: 39MB
ddata.info(memory_usage="deep")

#Create a list of grouping variables:
group_column_list = [
    "Business", "Device_Family", "Geo", "Segment", "Cust_Name",
    "GID", "Device ID", "Seller", "C9Phone_Margins_Flag",
    "C9Phone_Cust_Y_N", "ANDroid_Lic_Type", "Type", "Term",
    'Cust_ANDroid_Margin_Bucket', 'Cust_Mobile_Margin_Bucket',
    # 'Cust_Android_App_Bucket',
    # 'ANDroind_App_Cust_Y_N'
]

print("Analyzing data now...")

def ddata_agg(x):
    names = {
        'ANDroid_Margin': x['ANDroid_Margin'].sum(),
        'Margins': x['Margins'].sum(),
        'ANDroid_App_Qty': x['ANDroid_App_Qty'].sum(),
        'Apple_Margin': x['Apple_Margin'].sum(),
        'P_Lic': x['P_Lic'].sum(),
        'Cust_ANDroid_Margins': x['Cust_ANDroid_Margins'].mean(),
        'Cust_Mobile_Margins': x['Cust_Mobile_Margins'].mean(),
        'Cust_ANDroid_App_Qty': x['Cust_ANDroid_App_Qty'].mean()
    }
    return pd.Series(names)

ddata = ddata.reset_index(drop=True)
ddata = ddata.groupby(group_column_list).apply(ddata_agg)
The code crashes in the above .groupby operation. Can someone please help me? Compared to my other posts, I have probably spent the most time on this StackOverflow post, trying to fix it and learn new things about Python. However, I have reached saturation; it frustrates me even more because R's data.table package processes this file in under 2 seconds. This post is not about the pros and cons of R and Python, but about using Python more productively. I am completely lost, and I'd appreciate any help.
Here's my data.table R code:

path_r = "../ddata_redact.csv"
ddata <- data.table::fread(path_r, stringsAsFactors = FALSE,
                           data.table = TRUE, header = TRUE)

group_column_list <- c(
  "Business", "Device_Family", "Geo", "Segment", "Cust_Name",
  "GID", "Device ID", "Seller", "C9Phone_Margins_Flag",
  "C9Phone_Cust_Y_N", "ANDroid_Lic_Type", "Type", "Term",
  'Cust_ANDroid_Margin_Bucket', 'Cust_Mobile_Margin_Bucket'
  # 'Cust_Android_App_Bucket',
  # 'ANDroind_App_Cust_Y_N'
)

ddata <- ddata[, .(ANDroid_Margin = sum(ANDroid_Margin, na.rm = TRUE),
                   Margins = sum(Margins, na.rm = TRUE),
                   Apple_Margin = sum(Apple_Margin, na.rm = TRUE),
                   Cust_ANDroid_Margins = mean(Cust_ANDroid_Margins, na.rm = TRUE),
                   Cust_Mobile_Margins = mean(Cust_Mobile_Margins, na.rm = TRUE),
                   Cust_ANDroid_App_Qty = mean(Cust_ANDroid_App_Qty, na.rm = TRUE),
                   ANDroid_App_Qty = sum(ANDroid_App_Qty, na.rm = TRUE)),
               by = group_column_list]
I have a 4-core, 16 GB RAM, Win10 x64 machine. I can provide any details needed by experts.
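Not part of the question, but a hedged guess at the crash above: the traceback runs through pandas' cartesian_product, which is invoked when grouping by category dtype columns with the default observed=False; pandas then materializes every combination of category levels, not just the ones present in the data, and with 15+ keys that count explodes. A minimal sketch of the effect, and of the observed=True escape hatch:

```python
import pandas as pd

# Tiny stand-in for ddata: two category columns whose level combinations
# mostly never co-occur, mimicking many high-cardinality group keys.
df = pd.DataFrame({
    "a": pd.Categorical([f"a{i}" for i in range(100)]),
    "b": pd.Categorical([f"b{i}" for i in range(100)]),
    "x": range(100),
})

# observed=False materializes every combination of category levels
# (100 * 100 = 10,000 groups here; astronomically many with 15+ keys),
# which is what blows up memory.
wide = df.groupby(["a", "b"], observed=False)["x"].sum()

# observed=True keeps only the combinations actually present.
narrow = df.groupby(["a", "b"], observed=True)["x"].sum()

print(len(wide), len(narrow))  # 10000 100
```

If that is the culprit, passing observed=True (and replacing the .apply with .agg and a dict of column-to-function mappings, which avoids a Python-level call per group) should bring the pandas run much closer to the data.table timing.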

Installing RODBC in R
I am trying to install RODBC, and when I do so I get this error:

Installing package into ‘\\lakesh/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/RODBC_1.3-15.zip'
Content type 'application/zip' length 879575 bytes (858 KB)
downloaded 858 KB

package ‘RODBC’ successfully unpacked and MD5 sums checked
Warning in install.packages :
  cannot remove prior installation of package ‘RODBC’
Then I tried to remove the package:
> remove.packages("RODBC", lib = NULL)
Removing package from ‘\\lakesh/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
Error in remove.packages : there is no package called ‘RODBC’
I need some guidance on installing the RODBC package properly.

Rearranging columns in R
I have this table:
ID  Gene1  Gene2  Gene3
a   1a     2a     3a
b   1b     2b     3b
c   1c     2c     3c
I want two new columns "Gene" and "effect" to generate this table:

ID  Gene  effect
a   1     1a
a   2     2a
a   3     3a
b   1     1b
b   2     2b
b   3     3b
c   1     1c
c   2     2c
c   3     3c

Is a[i] really the same as *(a + i) in C?
#include <stdio.h>

int sum2d(int row, int col, int p[row][col]);

int main(void)
{
    int a[2][3] = {{1, 2, 3}, {4, 5, 6}};
    printf("%d\n", sum2d(2, 3, a));
    return 0;
}

int sum2d(int row, int col, int p[row][col])
{
    int total = 0;
    for (int i = 0; i < row; i++)
        for (int j = 0; j < col; j++)
            total += (*(p + i))[j];
    return total;
}
Look at the above code. It works perfectly.
However, after I changed p[row] into *(p + row),
#include <stdio.h>

int sum2d(int row, int col, int (*(p + row))[col]);

int main(void)
{
    int a[2][3] = {{1, 2, 3}, {4, 5, 6}};
    printf("%d\n", sum2d(2, 3, a));
    return 0;
}

int sum2d(int row, int col, int (*(p + row))[col])
{
    int total = 0;
    for (int i = 0; i < row; i++)
        for (int j = 0; j < col; j++)
            total += (*(p + i))[j];
    return total;
}
it can't be compiled and displays the following error message :
test.c:2:38: error: expected ‘)’ before ‘+’ token
 int sum2d(int row, int col, int (*(p + row))[col]);
                                      ^
test.c: In function ‘main’:
test.c:7:2: warning: implicit declaration of function ‘sum2d’ [-Wimplicit-function-declaration]
  printf("%d\n", sum2d(2, 3, a));
  ^
test.c: At top level:
test.c:12:38: error: expected ‘)’ before ‘+’ token
 int sum2d(int row, int col, int (*(p + row))[col])
At my current level, I barely understand it.
In C, I thought a[i] was the same as *(a + i).
Why is my code not correct?

Lambdas assigned to variables in Kotlin. Why?
I noticed that I get the same effect if I define this trivial function:
fun double ( i: Int ) = i*2
and if I define a variable and assign a lambda (with an identical body) to it:
var double = { i: Int -> i * 2 }
I get the same result if I call double(a) with either declaration. This leaves me confused. When is it needed, recommended, or advantageous to assign a lambda to a variable rather than define a function?
How to give a value only to the second parameter of a function with default values in Python
def func(a=2, b=3):
    print(a*b)
so this code prints 6 if you call func() and prints 10 if you call func(5, 2). So how do you give the second parameter a value and leave the first one with its default? I already tried func(,4), but that doesn't work. This looks good: func(b=4).
But I want to read the values, which might or might not be there, from an xml file like this: {"a,b":"2,4"} or this: {"a,b":",4"}. What do you suggest?
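func(b=4) is indeed the idiomatic way to skip the first parameter. For the file entries, one hedged sketch (call_from_entry is a hypothetical helper, assuming the {"a,b": ",4"} shape shown above): treat empty fields as "use the default" and pass only the named values as keyword arguments:

```python
def func(a=2, b=3):
    return a * b

# Keyword arguments let you skip earlier defaulted parameters:
print(func(b=4))  # a stays 2 -> 8

def call_from_entry(entry):
    """Call func from an entry like {"a,b": ",4"}: empty fields keep defaults."""
    (key, value), = entry.items()
    names = key.split(",")
    values = value.split(",")
    # Build kwargs only for fields that actually carry a value.
    kwargs = {n: int(v) for n, v in zip(names, values) if v != ""}
    return func(**kwargs)

print(call_from_entry({"a,b": "2,4"}))  # -> 8
print(call_from_entry({"a,b": ",4"}))   # a defaults to 2 -> 8
```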

Creating an algorithm for selecting multiple lottery numbers (mathematics and statistics)
In the Spanish bet system there is a concept called multiple, which means that if the game you want to play has bets of 6 numbers, you can create a special bet of 7 or 8 numbers, or even 9, 10 or 11 numbers. That special bet translates into X normal bets of 6 numbers which combine the given numbers.

The multiple bet of 7 numbers translates into 7 bets of 6 numbers.
The multiple bet of 8 numbers translates into 28 bets of 6 numbers.
The multiple bet of 9 numbers translates into 84 bets of 6 numbers.
The multiple bet of 10 numbers translates into 210 bets of 6 numbers.
The multiple bet of 11 numbers translates into 462 bets of 6 numbers.
Sample of multiple of 7 with the numbers 1,2,3,4,5,6,7:
234567 134567 124567 123567 123467 123457 123456
Sample of multiple of 8 with the numbers 1,2,3,4,5,6,7,8:
123456 123457 123458 123467 123468 123478 123567 123568 123578 123678 124567 124568 124578 124678 125678 134567 134568 134578 134678 135678 145678 234567 234568 234578 234678 235678 245678 345678
My first goal is an algorithm in Java for generating multiples. I mean, each bet has a cost of 1 coin, so, given for example 30 numbers and 800 coins, spend the 800 coins on X multiple bets of X numbers. The multiple bets must combine the 30 numbers in a more or less equal number of appearances. The total cost of the multiples must be near 800 euros; it can be a little less but never more than 800 euros. The algorithm will offer different proposals, for example, a result near 800 euros with multiples of 7, a result near 800 with multiples of 8, etc., and the user will select whichever they prefer. I have no idea how to achieve this; I am not good at mathematics or statistics, so I would appreciate help with this problem. On this website there is a web multiple generator which can generate multiples of 7 and of 8, but its code is not public: http://www.miramiprimi.miraestudio.es/MetodoMultiplePrimitiva.php
Thanks a lot.
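The counts above are binomial coefficients C(k, 6). A small sketch (in Python rather than the requested Java, since the combinatorics translate directly; expand_multiple and affordable_multiples are hypothetical helpers) of expanding a multiple into its bets and budgeting coins:

```python
from itertools import combinations
from math import comb

# Each "multiple" of k numbers expands to C(k, 6) normal 6-number bets.
for k in range(7, 12):
    print(k, comb(k, 6))  # 7 7 / 8 28 / 9 84 / 10 210 / 11 462

def expand_multiple(numbers):
    """All 6-number bets covered by one multiple bet."""
    return [set(c) for c in combinations(sorted(numbers), 6)]

bets = expand_multiple([1, 2, 3, 4, 5, 6, 7, 8])
print(len(bets))  # 28

def affordable_multiples(budget, k):
    """How many multiples of size k fit in the budget (1 coin per bet)."""
    return budget // comb(k, 6)

print(affordable_multiples(800, 8))  # 28 multiples of 8, costing 784 coins
```

Balancing which numbers appear in which multiple (so all 30 numbers get roughly equal coverage) is a separate combinatorial-design step on top of this expansion.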

Why is Modulo in MySQL (with negative number) giving unexpected results?
While I'll admit I'm a bit sketchy on modulo operations with negative numbers, I do know that -2 mod 50 = 48 in just about every online modulo calculator, as well as in LibreOffice Calc:

=MOD(-2,50)

And Python:

-2 % 50

This number (48) conveniently is the answer I need for a function I am writing in a MySQL procedure. However, in MySQL:

SELECT MOD(-2,50)

gives me -2 as a result. What is the reason for this, and how do I get the result I am looking for in MySQL?
For Posterity:
As of 2018 the following languages provide these different results for -2 % 50:

48
 Python
 R
 Google Calculator
 Google Sheets
 LibreOffice Calc
 Ubuntu Calculator

-2
 Javascript
 MySQL
 Microsoft Calculator
 PHP
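The split in the list above comes from two modulo conventions: Python and R floor the quotient, so the result takes the divisor's sign, while MySQL, JavaScript and PHP truncate toward zero, so the result takes the dividend's sign. A sketch using Python's math.fmod (which truncates, like MySQL), plus the usual portable fix, which in MySQL would read MOD(MOD(n, m) + m, m):

```python
import math

n, m = -2, 50

# Python's % is a "floored" modulo: result takes the sign of the divisor.
print(n % m)            # 48

# math.fmod truncates toward zero, like MySQL's MOD: the result takes
# the sign of the dividend.
print(math.fmod(n, m))  # -2.0

# Portable fix: shift a truncated result back into [0, m).
print((math.fmod(n, m) + m) % m)  # 48.0
```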

Evaluating Polish Expressions and Reversing them
I have a Polish (prefix) expression, but I'm unsure how to go about evaluating it. How would I go about this, and also in reverse form?
↑ + / − 1 7 2 ∗ 5 1 6
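A prefix expression can be evaluated by consuming tokens left to right, recursing once per operand. A sketch in Python, reading the expression above with single-digit operands as ^ + / - 1 7 2 * 5 1 6 (that tokenization is an assumption about the garbled spacing):

```python
def eval_prefix(tokens):
    """Evaluate a Polish (prefix) expression, consuming tokens left to right."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
        "^": lambda a, b: a ** b,
    }
    token = tokens.pop(0)
    if token in ops:
        left = eval_prefix(tokens)   # first operand
        right = eval_prefix(tokens)  # second operand
        return ops[token](left, right)
    return float(token)

# ^(+(/(-(1,7), 2), *(5,1)), 6) = ((1-7)/2 + 5*1) ^ 6 = 2 ^ 6
print(eval_prefix("^ + / - 1 7 2 * 5 1 6".split()))  # 64.0
```

The same recursion, emitting the two operand subtrees before the operator instead of applying it, converts the expression to reverse Polish (postfix) form.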

Running T Tests on a Subset of Multiple Subsets
I need to subset a dataframe based on var1 into four separate data sets. Then, I need to run a t-test (and a few other things) on var2 within each of them, conditioned on var3. Basically, I first thought to subset the original dataframe and use a loop.
library(effsize)
subsetteddata1 <- filter(na.omit(dataframe), var1 <= 2.5)
for (i in 1:4) {
  data <- subsetteddata[i]
  pVals[i] <- t.test(var2 ~ var3, data = data)$p.value
  TStat[i] <- t.test(var2 ~ var3, data = data)$statistic
  effectSize[i] <- cohen.d(var2 ~ var3, data = data, hedges.correction = T)
}
That didn't work. Then, I decided to try and write a function and use sapply. This didn't work either.
RunItAll <- function(dependent, conditionedOn, datadf) {
  pVal <- t.test(dependent ~ conditionedOn, data = datadf)$p.value
  tStat <- t.test(dependent ~ conditionedOn, data = datadf)$statistic
  effectSize <- cohen.d(dependent ~ conditionedOn, data = datadf, hedges.correction = T)
}
Functions in R can only return one thing, though (although that one thing can be a list). So I guess I could use sapply on each of these as separate functions rather than all together. I also think I might be able to use split() to do the initial subsetting, which might be helpful, but I'm not sure which is the best route.
Thank you!

Subsetting data based on a condition of the current and previous entity in R
I have data with a status column. I want to subset my data to rows with 'f' status, plus the row immediately preceding each 'f' status (within the same id). To simplify:

df

id status time
1  n  1
1  n  2
1  f  3
1  n  4
2  f  1
2  n  2
3  n  1
3  n  2
3  f  3
3  f  4
my result should be:

id status time
1  n  2
1  f  3
2  f  1
3  n  2
3  f  3
3  f  4
How can I do this in R?
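In R this is a grouped lead/lag problem (dplyr's lead() inside group_by(id), or data.table's shift(), would express it); the same logic sketched in pandas with the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    "id":     [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "status": ["n", "n", "f", "n", "f", "n", "n", "n", "f", "f"],
    "time":   [1, 2, 3, 4, 1, 2, 1, 2, 3, 4],
})

# Keep rows that are 'f', or whose *next* row within the same id is 'f'
# (i.e. the row immediately preceding an 'f').
next_status = df.groupby("id")["status"].shift(-1)
result = df[(df["status"] == "f") | (next_status == "f")]
print(result)
```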

Filter data.table with another data.table with different column names
I have this dataset:
library(data.table)
dt <- data.table(
  record = c(1:20),
  area = rep(LETTERS[1:4], c(4, 6, 3, 7)),
  score = c(1, 1:3, 2:3, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1:3),
  cluster = c("X", "Y", "Z")[c(1, 1:3, 3, 2, 1, 1:3, 1, 1:3, 3, 3, 3, 1:3)]
)
and I have used the solution from this post to create this summary:
dt_summary = dt[
  , .N, keyby = .(area, score, cluster)
][, {
  idx = frank(-N, ties.method = 'min') == 1
  NN = sum(N)
  .(
    cluster_mode = cluster[idx],
    cluster_pct = 100 * N[idx] / NN,
    cluster_freq = N[idx],
    record_freq = NN
  )
}, by = .(area, score)]

dt_score_1 <- dt_summary[score == 1]
setnames(dt_score_1, "area", "zone")
I would like to use the results from dt_score_1 to filter dt based on area/zone and cluster/cluster_mode. So, in a new data.table, the only rows taken from dt for area A should belong to cluster X, for area D they should be cluster Z, etc.
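In data.table this is typically a join on the renamed keys, something like dt[dt_score_1, on = .(area = zone, cluster = cluster_mode), nomatch = 0] (hedged; written from memory of the data.table join syntax). The equivalent semi-join sketched in pandas, with hypothetical stand-in data:

```python
import pandas as pd

# Stand-ins for dt and dt_score_1 (only the columns that matter here).
dt = pd.DataFrame({
    "record":  [1, 2, 3, 4],
    "area":    ["A", "A", "D", "D"],
    "cluster": ["X", "Y", "Z", "X"],
})
dt_score_1 = pd.DataFrame({
    "zone":         ["A", "D"],
    "cluster_mode": ["X", "Z"],
})

# An inner merge on the differently named key pairs acts as the filter;
# dropping the right-hand key columns leaves just the surviving dt rows.
filtered = (
    dt.merge(dt_score_1,
             left_on=["area", "cluster"],
             right_on=["zone", "cluster_mode"])
      .drop(columns=["zone", "cluster_mode"])
      .sort_values("record")
)
print(filtered)  # keeps record 1 (A/X) and record 3 (D/Z)
```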