Subsetting for proportional representation in R
I can't wrap my tiny brain around this one. One dataframe contains observations, each with a gender and an age bracket. I'm trying to write a function that returns a subset of the rows of this dataframe in which each age-gender combination appears in a proportion roughly equal to the corresponding value in the "props" dataframe. Ideally, the function will trim as few observations as possible. The results can be approximate (by approximate/roughly equal, I mean that each group's representation in the output should deviate from the desired proportion by at most 5%, and generally by as little as possible).
ages <- c("18-29", "30-39", "40-49", "50-59", "60+")
genders <- c("M", "F")
set.seed(101)
df <- data.frame("id" = paste0("p", c(1:500)),
                 "gender" = sample(genders, replace=TRUE, size=500),
                 "age" = sample(ages, replace=T, size=500))
props <- data.frame("age" = c(ages, ages),
                    "gender" = genders,
                    "pcts" = c(.0835, .1145, .1145, .1145, .073, .0835, .1145,
                               .1145, .1145, .073))
select_max <- function(df, props) {
  ....
  return(subset)
}
I experimented with solutions using least common multiples and greatest common divisors, but these fell apart when the proportions didn't work nicely together. I'm considering a solution which adds and subtracts rows one at a time until it gets close enough to the desired proportions, but I feel there must be some more elegant solution. All help is appreciated. This is a fun one, for sure.
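One way to think about it that avoids LCM/GCD arithmetic entirely: for each group, ask how large the whole output could be if that group were the binding constraint (its count divided by its target proportion), take the minimum over groups as the output size, and then draw round(N * p) rows from each group. A rough sketch of that sizing logic in Python (function and key names are my own, and it assumes every group is present with a nonzero target share):

```python
import math

def proportional_subset_sizes(counts, props):
    """Given per-group row counts and target proportions, return how many rows
    to keep from each group so the kept rows roughly match props while
    trimming as few rows as possible."""
    # Largest total N such that every group g can still supply round(N * props[g]) rows.
    n_total = min(math.floor(counts[g] / props[g]) for g in counts)
    return {g: min(counts[g], round(n_total * props[g])) for g in counts}

# Toy example: group B is the binding constraint (30 / 0.25 = 120 is not the
# minimum; C's 20 / 0.25 = 80 is), so the output has 80 rows total.
sizes = proportional_subset_sizes({"A": 50, "B": 30, "C": 20},
                                  {"A": 0.5, "B": 0.25, "C": 0.25})
print(sizes)  # {'A': 40, 'B': 20, 'C': 20}
```

The same sizing step ports directly to R (e.g. counts from table(df$age, df$gender), then sample() that many rows per group).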
See also questions close to this topic

R. How to check if a dataset contains the same elements in another dataset
I have 2 datasets, "Dataset2016_17" and "PlayOffDataset2016_17". Dataset2016_17$TEAM looks like the following: [1] "Atlanta Hawks" "Boston Celtics" "Brooklyn Nets", etc. I would like to know whether each value in Dataset2016_17$TEAM occurs in PlayOffDataset2016_17$TEAM; if so, I want something like a table of TRUE and FALSE.
I have already tried something like this
highlight_flag <- grepl(PlayOffDataset2016_17$TEAM, Dataset2016_17$TEAM)
But it did not work. Please let me know if there are any suggestions.
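grepl treats its first argument as a single regular expression pattern, so it is not a per-element membership test; what is being asked for is an element-wise "is this value in the other vector" check (in R, that is what %in% does). A minimal sketch of the same logic in Python, with made-up team lists:

```python
teams = ["Atlanta Hawks", "Boston Celtics", "Brooklyn Nets"]
playoff_teams = {"Boston Celtics"}

# One boolean per element of `teams`: is it present among the playoff teams?
flags = [t in playoff_teams for t in teams]
print(flags)  # [False, True, False]
```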

how to capture a repeated group
I'm trying to create a regular expression to capture repeated groups using the stringr package.
my_text <- c("LPC 14:0", "PC 16:0_18:1", "TAG 18:0_20:1_22:2")
I'm trying to capture all the numbers:
- from LPC I want the 14 and 0,
- from PC I want the 16, 0, 18 and 1,
- from TAG I want 18, 0, 20, 1, 22 and 2.
So far I tried:
str_match_all(string = my_text, pattern = "^[A-Z]+ (([0-9]{2}):([0-9]{1})_?)*")
and several variations on this. I only succeed in capturing the first match or the last match. On regex101.com I get the message:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations.
But I just can't get it to work. Any help appreciated!!
Cheers, Rico
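A repeated capturing group does keep only its last iteration, so a common workaround is to stop nesting the group inside one anchored pattern and instead match the repeated unit globally, so each number becomes its own match. A sketch of that idea using Python's re (the same pattern idea carries over to stringr's str_extract_all):

```python
import re

my_text = ["LPC 14:0", "PC 16:0_18:1", "TAG 18:0_20:1_22:2"]

# Each match of \d+ is one "iteration" of the repeated group,
# so all the numbers come back instead of just the last pair.
numbers = [re.findall(r"\d+", s) for s in my_text]
print(numbers)
# [['14', '0'], ['16', '0', '18', '1'], ['18', '0', '20', '1', '22', '2']]
```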

Merge dataframes of list and obtain names of dataframes as column
I merged all data frames from a list into just one data frame.
The dataframes inside the list are called
TAI, NAM, HEE
and each data frame looks like this:
YrM    Compound1 Compound2
201501     0.002      0.15
201502     0.004      0.02
201503     0.01       0.09
when I merge all dataframes with
meanall <- do.call(rbind, meaneach)
I get
       YrM    Compound1 Compound2
TAI.1  201501     0.002      0.15
TAI.2  201502     0.004      0.02
TAI.3  201503     0.01       0.09
.
.
.
NAM.1  201501     0.03       0.4
NAM.2  201502     0.001      0.005
I would like to get a column with the names of the list elements, not as rownames (like above), and without the numbers (TAI.1, TAI.2, ...); I just want the name TAI.
So that I get this:
List  YrM    Compound1 Compound2
TAI   201501     0.002      0.15
TAI   201502     0.004      0.02
TAI   201503     0.01       0.09
.
.
.
NAM   201501     0.03       0.4
NAM   201502     0.001      0.005
How can I do this?
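In R this is usually handled by the binding function itself: data.table::rbindlist(meaneach, idcol="List") and dplyr::bind_rows(meaneach, .id="List") both turn the list's element names into a column. The underlying bookkeeping, sketched in plain Python with a made-up dict of row tuples standing in for the named list of dataframes:

```python
# Each key stands in for one dataframe's name; each value is that frame's rows.
frames = {
    "TAI": [(201501, 0.002, 0.15), (201502, 0.004, 0.02)],
    "NAM": [(201501, 0.03, 0.4)],
}

# Bind all rows into one table, prepending the bare source name
# (no ".1", ".2" suffixes, and no information hidden in rownames).
merged = [(name, *row) for name, rows in frames.items() for row in rows]
print(merged)
```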

Neural Network - problem with back propagation - invalid syntax
The following code that I pulled from here in an effort to better understand how Machine Learning and Neural Networks work, isn't working. It keeps producing an "invalid syntax" error at line 31.
self.weights1 = self.weights1 + d_weights1
Here is the full code; any suggestions would be incredibly helpful as I'm pushing the limits of what I understand with Python.
import numpy as np

def sigmoid(x):
    # ACTIVATION FUNCTION - dictates if the numeric output is true or false - grades each layer
    return 1.0/(1 + np.exp(-x))

def sigmoid_derivative(x):
    # BACKPROPAGATION - tells the appropriate amount to adjust the weights and biases
    return x * (1.0 - x)

class NeuralNetwork:
    # NEURAL NETWORK - where the training through generations happens
    def __init__(self, x, y):
        self.input = x
        self.weights1 = np.random.rand(self.input.shape[1], 4)
        self.weights2 = np.randow.rand(4, 1)
        self.y = y
        self.output = np.zeros(y.shape)

    def feedforward(self):
        # calculates through each layer of the network
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

    def backprop(self):
        # application of the chain rule to find derivative of the loss function with respect to weights2 and weights1
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivative(self.output)))
        d_weights1 = np.dot(self.input.T, (np.dot(2*(self.y - self.output) * sigmoid_derivative(self.output), self.weights2.T) * sigmoid_derivative(self.layer1))
        # update the weights with the derivative (slope) of the loss function
        self.weights1 += d_weights1
        self.weights2 += d_weights2

if __name__ == "__main__":
    X = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 1]])
    y = np.array([[0], [1], [1], [0]])
    nn = NeuralNetwork(X, y)
    for i in range(1500):  # GENERATIONS = range(generations)
        nn.feedforward()
        nn.backprop()
    print(nn.output)
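For what it's worth, the "invalid syntax" on the self.weights1 line is, as best I can tell, caused by the line just above it: the d_weights1 expression opens one more parenthesis than it closes, and Python only notices at the next statement (there is also a separate np.randow typo in __init__ that would surface as an AttributeError afterwards). A quick stdlib-only check of that line's parenthesis balance:

```python
# The d_weights1 line opens one more "(" than it closes, so Python reports
# "invalid syntax" on the NEXT line (the self.weights1 update).
line = "d_weights1 = np.dot(self.input.T, (np.dot(2*(self.y - self.output) * sigmoid_derivative(self.output), self.weights2.T) * sigmoid_derivative(self.layer1))"
balance = line.count("(") - line.count(")")
print(balance)  # 1 -> one unclosed parenthesis
```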

slow function by groups in data.table r
My experimental design has trees measured in various forests, with repeated measurements across years.
DT <- data.table(forest=rep(c("a","b"),each=6), year=rep(c("2000","2010"),each=3), id=c("1","2","3"), size=(1:12))
DT[,id:=paste0(forest,id)]
> DT
    forest year id size
 1:      a 2000 a1    1
 2:      a 2000 a2    2
 3:      a 2000 a3    3
 4:      a 2010 a1    4
 5:      a 2010 a2    5
 6:      a 2010 a3    6
 7:      b 2000 b1    7
 8:      b 2000 b2    8
 9:      b 2000 b3    9
10:      b 2010 b1   10
11:      b 2010 b2   11
12:      b 2010 b3   12
For each tree i, I want to calculate a new variable equal to the sum of the sizes of all the other individuals in the same group/year that are bigger than tree i.
I have created the following function:
f.new <- function(i,n){
  DT[forest==DT[id==i, unique(forest)] & year==n  # select the same forest & year of the tree i
     & size>DT[id==i & year==n, size],            # select the trees larger than the tree i
     sum(size, na.rm=T)]                          # sum the sizes of all such selected trees
}
When applied within the data table, I got the correct results.
DT[,new:=f.new(id,year), by=.(id,year)]
> DT
    forest year id size new
 1:      a 2000 a1    1   5
 2:      a 2000 a2    2   3
 3:      a 2000 a3    3   0
 4:      a 2010 a1    4  11
 5:      a 2010 a2    5   6
 6:      a 2010 a3    6   0
 7:      b 2000 b1    7  17
 8:      b 2000 b2    8   9
 9:      b 2000 b3    9   0
10:      b 2010 b1   10  23
11:      b 2010 b2   11  12
12:      b 2010 b3   12   0
Note that I have a large dataset with several forests (40) & repeated years (6) & single individuals (20,000), for a total of almost 50,000 measurements. When I run the above function it takes 8-10 minutes (Windows 7, i5-6300U CPU @ 2.40 GHz, RAM 8 GB). I need to repeat it often with several small modifications, and it takes a lot of time.
- Is there any faster way to do it? I checked the *apply functions but cannot figure out a solution based on them.
- Can I make a generic function that doesn't rely on the specific structure of the dataset (i.e. I could use as "size" different columns)?
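The per-row subsetting inside the by step is what makes this slow: every tree rescans the whole table. A much faster pattern is, within each forest/year group, to sort sizes in decreasing order and keep a running sum, so each tree's answer is simply "the sum of everything seen before my value" (ties handled so only strictly larger trees count). A hedged sketch of that per-group logic in Python (in R, the same idea fits in one data.table by-group expression over sorted sizes; the function name is mine):

```python
def sum_larger(sizes):
    """For each element, the sum of all STRICTLY larger elements in the list."""
    order = sorted(range(len(sizes)), key=lambda i: -sizes[i])
    out = [0] * len(sizes)
    running = 0  # sum of elements strictly larger than the current value
    i = 0
    while i < len(order):
        # group ties together: equal values all get the same "running" sum
        j = i
        while j < len(order) and sizes[order[j]] == sizes[order[i]]:
            j += 1
        for k in range(i, j):
            out[order[k]] = running
        running += sum(sizes[order[k]] for k in range(i, j))
        i = j
    return out

print(sum_larger([1, 2, 3]))  # [5, 3, 0] -- matches group a/2000 above
```

Sorting makes this O(n log n) per group instead of rescanning the table for every row.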

Create columns like some sort of contingency table based on a solution variable
As my previous question about this topic from a month ago was not yet answered completely and I am missing 1 reputation point to add a bounty, I decided to ask it again with some added information.

I have a list of stores and I have a product (apples). I ran a system of linear equations to get the column 'var'; this value represents the amount of apples a store will receive from or has to give to another store. I can't figure out how to make an 'actionable dataframe' from it, and I can't find the right terms to explain exactly what I want, so I hope the example below helps:
Data:
df <- data.frame(store = c('a', 'b', 'c', 'd', 'e', 'f'),
                 sku = c('apple', 'apple', 'apple', 'apple', 'apple', 'apple'),
                 var = c(-1, -4, 6, 1, -5, 3))
Output I want (or something similar):
output <- data.frame(store = c('a', 'b', 'c', 'd', 'e', 'f'),
                     sku = c('apple', 'apple', 'apple', 'apple', 'apple', 'apple'),
                     var = c(-1, -4, 6, 1, -5, 3),
                     ship_to_a = c(0,0,1,0,0,0),
                     ship_to_b = c(0,0,4,0,0,0),
                     ship_to_c = c(0,0,0,0,0,0),
                     ship_to_d = c(0,0,0,0,0,0),
                     ship_to_e = c(0,0,1,1,0,3),
                     ship_to_f = c(0,0,0,0,0,0))
Bonus: ideally, I would like to fill the ship_to_store columns until all minus values are 'gone', even when sum(df$var) doesn't add up to zero.
This function was created by another user:
fun <- function(DF){
  n <- nrow(DF)
  mat <- matrix(0, nrow = n, ncol = n)
  VAR <- DF[["var"]]
  neg <- which(DF[["var"]] < 0)
  for(k in neg){
    S <- 0
    Tot <- abs(DF[k, "var"])
    for(i in seq_along(VAR)){
      if(i != k){
        if(VAR[i] > 0){
          if(S + VAR[i] <= Tot){
            mat[k, i] <- VAR[i]
            S <- S + VAR[i]
            VAR[i] <- 0
          }else{
            mat[k, i] <- Tot - S
            S <- Tot
            VAR[i] <- VAR[i] - Tot + S
          }
        }
      }
    }
  }
  colnames(mat) <- paste0("ship_to_", DF[["store"]])
  cbind(DF, mat)
}
The function worked in my specific example above, but it doesn't work in all cases, as it does not save the number of apples each store has already received and therefore results in a store receiving too many apples. For example:
df <- data.frame(store = c('a', 'b', 'c', 'd', 'e'),
                 sku = c('apple', 'apple', 'apple', 'apple', 'apple'),
                 var = c(44, 151, -100, -52, 43))
The output has store B giving 100 apples to store C and store A giving 44 apples to C. That makes 144 instead of the 100 they should get.
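The core fix is to track each receiver's remaining demand and each giver's remaining supply in one pass, so a store can never be promised more apples than it asked for. A hedged sketch of that bookkeeping in Python (the greedy giver order and the function name are illustrative, not from the original R function):

```python
def allocate(stores, var):
    """Greedy matching: negative var = apples needed, positive var = apples to give.
    Returns a list of (from_store, to_store, qty) shipments."""
    supply = {s: v for s, v in zip(stores, var) if v > 0}
    shipments = []
    for s, v in zip(stores, var):
        if v >= 0:
            continue
        need = -v                      # remaining demand for this receiver
        for giver in list(supply):
            if need == 0:
                break
            qty = min(need, supply[giver])
            if qty > 0:
                shipments.append((giver, s, qty))
                supply[giver] -= qty   # supply is used up across receivers...
                need -= qty            # ...and demand is never exceeded

    return shipments

# Store c now receives exactly 100 (44 from a, 56 from b), never 144.
print(allocate(['a', 'b', 'c', 'd', 'e'], [44, 151, -100, -52, 43]))
```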

Speeding up a series of calculations across a large matrix in R
Any suggestions, programmatic or mathematical, for speeding up this calculation in R? I have included some generated data that closely match the real data scenario I am working with. I have also attempted to use apply and parApply, and tried to turn the matrix into a sparse matrix since it has so many 0's, but so far this is the fastest method I have come up with. Any suggestions for making it faster? I need to do these calculations 10,000's of times.
Data that closely match my scenario:
set.seed(7)
# same size matrix as my real data puzzle
A <- matrix(rbeta((13163*13163), 1, 1), ncol = 13163, nrow = 13163)
# turn a bunch to 0 to more closely match that I have a lot of 0's in real data
A[A < 0.5] <- 0
# create binary matrix
z <- matrix(rbinom((13163*13163), 1, 0.25), ncol = 13163, nrow = 13163)
I have found that Rfast::rowsums gives me the quickest results.
start1 <- Sys.time()
testA <- 1 - exp(Rfast::rowsums(log(1 - A*z)))
stop1 <- Sys.time()
stop1 - start1
Pardon my clunky benchmarking approach...
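One observation that may open up speedups: the expression is just a row product in disguise, since 1 - exp(sum_j log(1 - A[i,j]*z[i,j])) equals 1 - prod_j (1 - A[i,j]*z[i,j]), and any entry where A or z is zero contributes a factor of exactly 1 that a sparse layout could skip entirely. A plain-Python sketch of that equivalence on a tiny example (illustrative only, not a drop-in for the R code):

```python
import math

def row_probs(A, Z):
    # Per row: 1 - prod_j (1 - A[i][j] * Z[i][j]).
    # Entries where A*Z == 0 contribute a factor of 1 and could be
    # skipped entirely when the matrices are stored sparsely.
    return [1 - math.prod(1 - a * z for a, z in zip(row_a, row_z))
            for row_a, row_z in zip(A, Z)]

A = [[0.5, 0.0, 0.8], [0.0, 0.0, 0.0]]
Z = [[1, 1, 0], [1, 1, 1]]
print(row_probs(A, Z))  # [0.5, 0.0]
```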

Problem Defining Variable in Python Net Worth Update Function
I am trying to write code that will allow someone to input an original net worth, input a change, and receive an updated net worth, but I am having trouble with one of the variables; everything else works. Any help would be appreciated.
def net_worth_updater():
    print("This tool will allow you to enter your last known net worth and a value that has changed and your net worth will be updated for you.")

def update():
    net_worth = int(input("What is your last known net worth? "))
    type1 = input("Is the change in the category asset or liability? ")
    if type1 == "asset":
        first = int(input("Input Original Value "))
        second = int(input("Input New Value "))
        change = second - first
        net_worth = net_worth + change
        return net_worth
    else:
        first = int(input("Input Original Value "))
        second = int(input("Input New Value "))
        change = second - first
        net_worth = net_worth - change
        return net_worth

while input("Would you like to report a change? ") == "yes":
    update()
    print("Your updated net worth is $" + str(net_worth))
When I'm running the code the net_worth variable fails out at the end.
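The failure looks like a scoping issue: net_worth is local to update(), so the name does not exist where the final print runs, and the value update() returns is thrown away. A sketch of the fix, with the arithmetic pulled into a pure helper (the helper's name and signature are my own) and the returned value actually captured:

```python
def apply_change(net_worth, category, first, second):
    # Asset gains increase net worth; liability gains decrease it.
    change = second - first
    if category == "asset":
        return net_worth + change
    return net_worth - change

# Capture the returned value instead of relying on a variable
# that only existed inside the function:
net_worth = apply_change(1000, "asset", 10, 30)          # 1000 + 20
net_worth = apply_change(net_worth, "liability", 0, 20)  # 1020 - 20
print(net_worth)  # 1000
```

The original loop works the same way once it does `net_worth = update()` and prints that.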

StackOverflowException when I call the same function many times
I am creating a function that creates a point at specific coordinates and then calls itself, moving toward each of the cardinal points (up to a specific limit).
I get a StackOverflowException when more than 5,000 positions are stored.
Put more simply: I have created points with coordinates moving only to the north, and it still gives the same error.
*NorthLimit, LatitudeDeviation and LongitudeDeviation are constants.
public void CreatePosition(decimal latitude, decimal longitude)
{
    bool end = true;
    Positions.Add(new Position(latitude, longitude));
    if (NorthLimit > (latitude + LatitudeDeviation))
    {
        CreatePosition(latitude + LatitudeDeviation, longitude);
        end = false;
    }
    if (end == true)
    {
        // It ends :)
    }
}
What measures should I take?
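Since each position depends only on the previous one, the usual remedy is to replace the recursion with a loop, which uses constant stack space no matter how many positions are stored. A sketch of that rewrite in Python (the C# version has the same shape: a while loop advancing latitude until the limit; parameter names here are mine):

```python
def create_positions(latitude, longitude, north_limit, lat_deviation):
    """Iterative equivalent of the recursive CreatePosition: no call-stack growth."""
    positions = [(latitude, longitude)]
    while north_limit > (latitude + lat_deviation):
        latitude += lat_deviation
        positions.append((latitude, longitude))
    return positions

pts = create_positions(0.0, 0.0, north_limit=1.0, lat_deviation=0.25)
print(len(pts))  # 4 positions: latitudes 0.0, 0.25, 0.5, 0.75
```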

Creating a subset in a table in R from an existing data frame?
Hi all, I am trying to create the table below from an existing data frame I have on movies, which can be found at www.the-numbers.com. There are columns for mpaa_rating, rpd (not the average, just rpd), and the distributor (or studio) in the existing data frame. However, I am not sure how to create this table structure in R. I have tried the following:
mpaaTable <- data.table(movies$distributor <- sample(c('Universal', 'Disney', 'Paramount')),
                        movies$mpaa_rating <- sample(c('G', 'PG', 'PG-13', 'R')),
                        movies$rpd <- mean(sample(20)),
                        key=c('a', 'b'))
to no avail. Can someone please guide me on how to set up this table structure, or give me an example of how to subset my table like this? Thank you in advance.

Removing row if number of NA's is larger than 2 (or any number) in a certain amount of rows
I have the following panel data frame:
       X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5
Ind 1   7 NA NA NA NA  1  4  6  8  6
Ind 2   2 NA 16 NA NA  5 16 12  3  4
Ind 3  NA NA NA 19 92 13 NA 12 NA NA
Ind 4  32  5 12  3  5 NA NA NA NA  4
Ind 5  44  3 46  3 47  3  2 NA  3  4
Ind 6  NA 34 NA  8 NA 14 15 12  3  4
Ind 7  49 55 67 49 89  6 17  2  3  4
Ind 8  NA NA 49 NA NA 11 20  6 NA  4
Ind 9   1  1  5 NA  9 NA NA NA NA NA
In pastable format:
df <- read.table(text="Index_name X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5
Ind_1 7 NA NA NA NA 1 4 6 8 6
Ind_2 2 NA 16 NA NA 5 16 12 3 4
Ind_3 NA NA NA 19 92 13 NA 12 NA NA
Ind_4 32 5 12 3 5 NA NA NA NA 4
Ind_5 44 3 46 3 47 3 2 NA 3 4
Ind_6 NA 34 NA 8 NA 14 15 12 3 4
Ind_7 49 55 67 49 89 6 17 2 3 4
Ind_8 NA NA 49 NA NA 11 20 6 NA 4
Ind_9 1 1 5 NA 9 NA NA NA NA NA", row.names=1, header=TRUE, stringsAsFactors=FALSE)
I want to filter out all rows that don't have at least 2 non-NA values in both the columns that start with X and the columns that start with Y. For example:
- Ind_1: Drop (only 1 value in X1-X5)
- Ind_2: Keep (cause here there are at least 2 numbers in X)
- Ind_3: Keep, cause both X and Y have 2 or more observations
- Ind_4: Delete (only 1 value in Y1-Y5)
- Ind_5: Keep
- Ind_6: Keep
- Ind_7: Keep
- Ind_8: Delete (only 1 value in X1-X5)
- Ind_9: Delete (though X is ok, Y is not okay)
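The rule itself boils down to two non-NA counts per row. A sketch of that check in plain Python, with None standing in for NA (in R, the analogous building block would be rowSums(!is.na(...)) computed separately on the X and Y column blocks):

```python
def keep_row(x_vals, y_vals, min_non_na=2):
    # Keep only if BOTH the X block and the Y block have >= min_non_na values.
    non_na = lambda vals: sum(v is not None for v in vals)
    return non_na(x_vals) >= min_non_na and non_na(y_vals) >= min_non_na

# Ind_1 from the example: one X value, five Y values -> dropped
print(keep_row([7, None, None, None, None], [1, 4, 6, 8, 6]))   # False
# Ind_2: two X values, five Y values -> kept
print(keep_row([2, None, 16, None, None], [5, 16, 12, 3, 4]))   # True
```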

Expand dataframe by ID to generate a special column
I have the following dataframe
df <- data.frame("ID"=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
                 'A_Frequency'=c(1,2,3,4,5,1,2,3,4,5),
                 'B_Frequency'=c(1,2,NA,4,6,1,2,5,6,7))
The dataframe appears as follows
   ID A_Frequency B_Frequency
1   A           1           1
2   A           2           2
3   A           3          NA
4   A           4           4
5   A           5           6
6   B           1           1
7   B           2           2
8   B           3           5
9   B           4           6
10  B           5           7
I Wish to create a new dataframe df2 from df that looks as follows
   ID CFreq
1   A     1
2   A     2
3   A     3
4   A     4
5   A     5
6   A     6
7   B     1
8   B     2
9   B     3
10  B     4
11  B     5
12  B     6
13  B     7
The new dataframe has a column CFreq that takes the unique values from A_Frequency and B_Frequency, grouped by ID, ignoring the NA values.
I have tried dplyr but am unable to get the required response
df2 <- df %>%
  group_by(ID) %>%
  select(ID, A_Frequency, B_Frequency) %>%
  mutate(Cfreq = unique(A_Frequency, B_Frequency))
This yields the following which is quite different
   ID    A_Frequency B_Frequency Cfreq
   <fct>       <dbl>       <dbl> <dbl>
 1 A               1           1     1
 2 A               2           2     2
 3 A               3          NA     3
 4 A               4           4     4
 5 A               5           6     5
 6 B               1           1     1
 7 B               2           2     2
 8 B               3           5     3
 9 B               4           6     4
10 B               5           7     5
Could someone please help me with this?
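The wanted operation is "per ID, pool both frequency columns, drop NA, and emit one row per unique value" (in R, one route is to stack the two columns into long format and then take sorted unique values by group). A plain-Python sketch of that logic, with None standing in for NA and the function name my own:

```python
def combined_freqs(rows):
    """rows: (id, a_freq, b_freq) triples. Returns (id, value) pairs holding
    the sorted unique non-missing values pooled from both columns, per id."""
    pooled = {}
    for id_, a, b in rows:
        pooled.setdefault(id_, set()).update(v for v in (a, b) if v is not None)
    return [(id_, v) for id_ in pooled for v in sorted(pooled[id_])]

rows = [("A", 1, 1), ("A", 2, 2), ("A", 3, None), ("A", 4, 4), ("A", 5, 6),
        ("B", 1, 1), ("B", 2, 2), ("B", 3, 5), ("B", 4, 6), ("B", 5, 7)]
print(combined_freqs(rows))  # 13 (ID, CFreq) pairs: A gets 1..6, B gets 1..7
```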