Matrix calculation (in R) - least squares
I have a matrix in R as per the following:
(72  NA  90  NA  60)
(60  83  NA  NA  NA)
(50  NA  77  83  NA)
(NA  40  NA  84  65)
Reproducible as follows:
B = matrix(c(72,60,50,NA,NA,83,NA,40,90,NA,77,NA,NA,NA,83,84,60,NA,NA,65),nrow=4,ncol=5)
It is a sparse matrix in the sense that not all elements are populated. What I want to do is find an adjustment vector (added to each column) so that the average of the row averages is equal to the average of the column averages. The answer is:
(13.04, 10.93, 14.68, 22.70, 5.90)
I computed this iteratively, and it is rounded to 2 dp so it doesn't quite equate exactly, but, for example, it should be the case that if you add the adjustments to the original matrix:
A = t(t(B)+c(13.04,10.93,14.68,22.70,5.90))
then the average of the row averages should equal the average of the column averages:
mean(colMeans(A, na.rm = TRUE, dims = 1))
mean(rowMeans(A, na.rm = TRUE, dims = 1))
However, I think this is a least squares problem that can be solved algebraically, but I don't know how to do it or how to set it out. The question is: is there a way to find the adjustment algebraically using matrix calculation rather than iteratively? Can anyone help?
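Whether this is fully determined depends on exactly which averages are being equated, but the single stated condition, mean(rowMeans) = mean(colMeans), is linear in the adjustment vector, so it can be written as one equation a·d = b and solved in closed form. A sketch in NumPy using the minimum-norm solution (note: with one equation and five unknowns this picks one of infinitely many solutions, so it will generally differ from the iterative answer above, which may satisfy additional conditions):

```python
import numpy as np

# The question's matrix B, with NaN for the missing entries
B = np.array([[72, np.nan, 90, np.nan, 60],
              [60, 83, np.nan, np.nan, np.nan],
              [50, np.nan, 77, 83, np.nan],
              [np.nan, 40, np.nan, 84, 65]])

m = ~np.isnan(B)                    # mask of observed entries
n_row = m.sum(axis=1)               # observed entries per row
rm0 = np.nanmean(B, axis=1)         # row means before adjustment
cm0 = np.nanmean(B, axis=0)         # column means before adjustment

# Adding d_j to column j shifts mean(rowMeans) by mean_i((m_i . d)/n_i)
# and mean(colMeans) by mean(d), so the condition
# mean(rowMeans(B+d)) == mean(colMeans(B+d)) is one linear equation a . d = b
a = (m / n_row[:, None]).mean(axis=0) - 1.0 / B.shape[1]
b = cm0.mean() - rm0.mean()

# Minimum-norm solution of the underdetermined equation
d = a * b / (a @ a)

A = B + d   # d is broadcast across rows, i.e. added to each column
```

If additional constraints are intended (for example, each individual row mean equal to the grand mean), they can be stacked as extra rows of a rectangular system and solved with np.linalg.lstsq in the same way.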
See also questions close to this topic

Group several columns then aggregate a set of columns in Pandas (It crashes badly compared to R's data.table)
I am relatively new to the world of Python and trying to use it as a backup platform to do data analysis. I generally use data.table for my data analysis needs. The issue is that when I run a group-aggregate operation on a big CSV file (randomized, zipped, uploaded at http://www.filedropper.com/ddataredact_1), Python throws:
grouping pandas
return getattr(obj, method)(*args, **kwds)
ValueError: negative dimensions are not allowed
OR (I have even encountered...)
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in cartesian_product
    for i, x in enumerate(X)]
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in
    for i, x in enumerate(X)]
File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 445, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 51, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
MemoryError
I have spent three days trying to reduce the file size (I was able to reduce the size by 89%), adding breakpoints, debugging it, but I was not able to make any progress.
Surprisingly, I thought of running the same group/aggregate operation in data.table in R, and it hardly took 1 second. Moreover, I didn't have to do any data type conversion etc., as suggested at https://www.dataquest.io/blog/pandas-big-data/. I also researched other threads: "Avoiding Memory Issues For GroupBy on Large Pandas DataFrame", "Pandas: df.groupby() is too slow for big data set. Any alternatives methods?", and "pandas groupby with sum() on large csv file?". It seems these threads are more about matrix multiplication. I'd appreciate it if you wouldn't tag this as duplicate.
Here's my Python code:
finaldatapath = "..\Data_R"
ddata = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv",
                    low_memory=False, encoding="ISO-8859-1")

# before optimization: 353MB
ddata.info(memory_usage="deep")

# optimize file: Object types are the biggest culprit.
ddata_obj = ddata.select_dtypes(include=['object']).copy()

# Now convert this to category type:
# Float type didn't help much, so I am excluding it here.
for col in ddata_obj:
    del ddata[col]
    ddata.loc[:, col] = ddata_obj[col].astype('category')

# release memory
del ddata_obj

# after optimization: 39MB
ddata.info(memory_usage="deep")

# Create a list of grouping variables:
group_column_list = [
    "Business", "Device_Family", "Geo", "Segment", "Cust_Name", "GID",
    "Device ID", "Seller", "C9Phone_Margins_Flag", "C9Phone_Cust_Y_N",
    "ANDroid_Lic_Type", "Type", "Term",
    'Cust_ANDroid_Margin_Bucket', 'Cust_Mobile_Margin_Bucket',
    # 'Cust_Android_App_Bucket', 'ANDroind_App_Cust_Y_N'
]

print("Analyzing data now...")

def ddata_agg(x):
    names = {
        'ANDroid_Margin': x['ANDroid_Margin'].sum(),
        'Margins': x['Margins'].sum(),
        'ANDroid_App_Qty': x['ANDroid_App_Qty'].sum(),
        'Apple_Margin': x['Apple_Margin'].sum(),
        'P_Lic': x['P_Lic'].sum(),
        'Cust_ANDroid_Margins': x['Cust_ANDroid_Margins'].mean(),
        'Cust_Mobile_Margins': x['Cust_Mobile_Margins'].mean(),
        'Cust_ANDroid_App_Qty': x['Cust_ANDroid_App_Qty'].mean()
    }
    return pd.Series(names)

ddata = ddata.reset_index(drop=True)
ddata = ddata.groupby(group_column_list).apply(ddata_agg)
The code crashes in the above .groupby operation. Can someone please help me? Compared to my other posts, I have probably spent the most time on this StackOverflow post, trying to fix it and learn new stuff about Python. However, I have reached saturation; it frustrates me even more because R's data.table package processes this file in <2 seconds. This post is not about the pros and cons of R and Python, but about using Python more productively. I am completely lost, and I'd appreciate any help.
Here's my data.table R code:

path_r = "../ddata_redact.csv"
ddata <- data.table::fread(path_r, stringsAsFactors = FALSE,
                           data.table = TRUE, header = TRUE)

group_column_list <- c(
  "Business", "Device_Family", "Geo", "Segment", "Cust_Name", "GID",
  "Device ID", "Seller", "C9Phone_Margins_Flag", "C9Phone_Cust_Y_N",
  "ANDroid_Lic_Type", "Type", "Term",
  'Cust_ANDroid_Margin_Bucket', 'Cust_Mobile_Margin_Bucket'
  # 'Cust_Android_App_Bucket', 'ANDroind_App_Cust_Y_N'
)

ddata <- ddata[, .(ANDroid_Margin = sum(ANDroid_Margin, na.rm = TRUE),
                   Margins = sum(Margins, na.rm = TRUE),
                   Apple_Margin = sum(Apple_Margin, na.rm = TRUE),
                   Cust_ANDroid_Margins = mean(Cust_ANDroid_Margins, na.rm = TRUE),
                   Cust_Mobile_Margins = mean(Cust_Mobile_Margins, na.rm = TRUE),
                   Cust_ANDroid_App_Qty = mean(Cust_ANDroid_App_Qty, na.rm = TRUE),
                   ANDroid_App_Qty = sum(ANDroid_App_Qty, na.rm = TRUE)),
               by = group_column_list]
I have a 4-core, 16GB RAM, Win10 x64 machine. I can provide any details needed by experts.
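One thing worth checking, given the cartesian_product frames in the traceback above: when the grouping keys have category dtype (as created by the optimization step), pandas groupby defaults to observed=False and materialises the full cross-product of all category levels, which can exhaust memory in exactly this way. A minimal sketch with toy data (not the question's file):

```python
import pandas as pd

# Toy frame with categorical grouping keys (stand-ins for the real columns)
df = pd.DataFrame({"g1": pd.Categorical(["a", "a", "b"]),
                   "g2": pd.Categorical(["x", "y", "y"]),
                   "val": [1.0, 2.0, 3.0]})

# observed=False (the default) would build a row for every (g1, g2) combination;
# observed=True keeps only combinations that actually occur in the data
out = df.groupby(["g1", "g2"], observed=True)["val"].sum().reset_index()
```

Separately, replacing apply(ddata_agg) with a dict passed to .agg({"col": "sum", ...}) avoids calling a Python function per group and is usually much faster.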

Installing RODBC in R
I am trying to install RODBC, and when I do so I get this error:
Installing package into ‘\\lakesh/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/RODBC_1.3-15.zip'
Content type 'application/zip' length 879575 bytes (858 KB)
downloaded 858 KB

package ‘RODBC’ successfully unpacked and MD5 sums checked
Warning in install.packages :
  cannot remove prior installation of package ‘RODBC’
Then I tried to remove the package:
> remove.packages("RODBC", lib = NULL)
Removing package from ‘\\lakesh/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
Error in remove.packages : there is no package called ‘RODBC’
I need some guidance on installing the RODBC package properly.

Rearranging columns in R
I have this table:
ID  Gene1  Gene2  Gene3
a   1a     2a     3a
b   1b     2b     3b
c   1c     2c     3c
I want two new columns, "Gene" and "effect", to generate this table:

ID  Gene  effect
a   1     1a
a   2     2a
a   3     3a
b   1     1b
b   2     2b
b   3     3b
c   1     1c
c   2     2c
c   3     3c

LAPACK matrix inversion: unable to program
I am developing embedded software on ARM Cortex-A53 (armv7/v8) for image processing, and I need some help with a specific algorithm.
It involves inverting complex-valued (real/imaginary parts) matrices of large size (up to 24x24). The most significant constraint is execution time.
I am currently looking for a library, optimized for ARM (using NEON SIMD?), which offers this kind of operation (matrix inversion of large size, with complex numbers).
I did try the Eigen library: it works fine, but it is too slow. Our processor runs at 1 GHz, and I got an average duration of 0.82 ms for a complex double matrix inversion of size 24x24.
I also had a look at:
- Ne10 library: it does not offer matrix inversion with complex numbers (no templates used) or large sizes (maximum size is 4x4).
- ARM Compute Library: I cannot find a matrix inversion function. I am not sure it is implemented.
- ARM Performance Libraries: I am still looking for the matrix inversion; does anyone know which function/class I should look at?
Please, could you suggest any C++ libraries (templated, for use with complex numbers) optimized for ARM?
Thanking you in advance!
Laurent.
UPDATE
I am currently testing the C LAPACK (= LAPACKE) library to perform a matrix inversion.
I have a problem linking my code. Here is the C++ code (a simple example taken from the internet):
#include "test_lapacke.hh"

extern "C" {
#include <lapacke.h>
}

lapack_int matInv(double *A, unsigned n)
{
    int ipiv[n+1];
    lapack_int ret;

    ret = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A, n, ipiv);
    if (ret != 0)
        return ret;

    ret = LAPACKE_dgetri(LAPACK_COL_MAJOR, n, A, n, ipiv);
    return ret;
}

/*********************************************************************/
void Test_lapacke()
{
    double A[] = {
        0.378589, 0.971711, 0.016087, 0.037668, 0.312398,
        0.756377, 0.345708, 0.922947, 0.846671, 0.856103,
        0.732510, 0.108942, 0.476969, 0.398254, 0.507045,
        0.162608, 0.227770, 0.533074, 0.807075, 0.180335,
        0.517006, 0.315992, 0.914848, 0.460825, 0.731980
    };

    for (int i = 0; i < 25; i++) {
        if ((i % 5) == 0) putchar('\n');
        printf("%+12.8f ", A[i]);
    }
    putchar('\n');

    matInv(A, 5);

    for (int i = 0; i < 25; i++) {
        if ((i % 5) == 0) putchar('\n');
        printf("%+12.8f ", A[i]);
    }
    putchar('\n');
}
I added all the BLAS/LAPACK libraries to the C++ linker in my project:

-llapack -llapacke -lrefblas -lcblas -ltmglib

But I still get this error:
/path/lib/liblapacke.a(lapacke_dgetrf_work.o): In function `LAPACKE_dgetrf_work':
lapacke_dgetrf_work.c:(.text+0xd8): undefined reference to `dgetrf_'
lapacke_dgetrf_work.c:(.text+0x160): undefined reference to `dgetrf_'
/path/lib/liblapacke.a(lapacke_dgetri_work.o): In function `LAPACKE_dgetri_work':
lapacke_dgetri_work.c:(.text+0xe4): undefined reference to `dgetri_'
lapacke_dgetri_work.c:(.text+0x174): undefined reference to `dgetri_'
collect2: error: ld returned 1 exit status
make: *** [Test_matrice] Erreur 1

Any ideas to solve this?
Many thanks! Regards, Laurent.

R - Comparing data from two data.frames with a loop, element by element
Basically, I want to compare the same cells (same [X,Y] positions) of two data frames and apply a few operations to them afterward, before storing the result in a new data frame. The tricky part is that I need to process all the values of these data frames. In other words, I need to compare the value of dataA[1,1] with dataB[1,1] and, if they respect certain conditions, perform one operation, else another operation, then store the result in a new data.frame. Then rinse and repeat for dataA[1,2] and dataB[1,2], up to dataA[100,100] and dataB[100,100].
Obviously, I have to use a loop here (and some if/else), but I can't seem to figure out the proper structure. Being used to PHP, I tried the foreach package in R, but it only returns FALSE (and it does so in vector format instead of a matrix with multiple columns). If I do the operation by hand, there are more TRUEs than FALSEs, so obviously something's wrong here:
x <- foreach(dataIDH, dataPIB) %do% {
  if (dataPIB <= dataIDH+5 & dataPIB >= rankIDH-5) {
    x <- mean(dataPIB, dataIDH)
  } else {
    x <- FALSE
  }
}
x
I did try a for loop, but I'm simply unable to put the results in a data.frame (even less one that matches the layout of the data frames used here, which I need to do):
x <- for(idh in 1:nrow(dataIDH)) {
  for(idh in 1:ncol(dataIDH)) {
    for(pib in 1:nrow(dataPIB)) {
      for(pib in 1:ncol(dataPIB)) {
        if (pib <= idh+5 & pib >= idh-5) {
          x <- mean(pib, idh)
        } else {
          x <- FALSE
        }
      }
    }
  }
}
x
For information: the data frames contain numeric values for a set of countries (rows) over a few years (columns).
Any ideas on how to get out of this mess?
Edit 1: an extract of the two data frames used (the first row and column displayed here are actually headers):
dataIDH

CountryCode, 2005, 2006, 2007, 2008
AFG,  14, 14, 16, 16
ALB, 100, 98, 99, 98
DZA,  85, 86, 90, 86
dataPIB

CountryCode, 2005, 2006, 2007, 2008
AFG,  69, 18, 70, 71
ALB, 102, 98, 97, 63
DZA,  85, 89, 91, 137
Edit 2: and the final result should be a new data.frame, with the same layout:

x

CountryCode, 2005, 2006, 2007, 2008
AFG, FALSE, 16,   FALSE, FALSE
ALB, 101,   98,   98,    FALSE
DZA, 85,    87.5, 90.5,  FALSE
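Since both frames have the same shape, no explicit loop is needed at all: an element-wise condition produces a logical matrix, and a vectorised select fills each cell with either the mean or a "miss" marker (in R the analogue is a single ifelse call). A sketch with NumPy on the extract above, using NaN where the question uses FALSE:

```python
import numpy as np

# Numeric parts of the Edit 1 extracts (rows AFG, ALB, DZA; cols 2005-2008)
dataIDH = np.array([[ 14,  14,  16,  16],
                    [100,  98,  99,  98],
                    [ 85,  86,  90,  86]], dtype=float)
dataPIB = np.array([[ 69,  18,  70,  71],
                    [102,  98,  97,  63],
                    [ 85,  89,  91, 137]], dtype=float)

# Element-wise condition: PIB within +/-5 of IDH
close = (dataPIB <= dataIDH + 5) & (dataPIB >= dataIDH - 5)

# Where the condition holds, take the cell-wise mean; otherwise mark as missing
x = np.where(close, (dataPIB + dataIDH) / 2, np.nan)
```

This reproduces the Edit 2 table (16, 101, 98, 98, 85, 87.5, 90.5, with NaN in place of FALSE).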

Speed up matrix operations
I have two matrices of the same dimensions (which in reality are of 2,500 x 15,000):
set.seed(1)
a.mat <- matrix(rnorm(25*150), nrow=25, ncol=150,
                dimnames=list(paste0("p",1:25), paste0("c",1:150)))
b.mat <- matrix(rnorm(25*150), nrow=25, ncol=150,
                dimnames=list(paste0("p",1:25), paste0("c",1:150)))
And I'm computing this with them:
res.mat <- do.call(cbind, lapply(1:ncol(a.mat), function(i){
  t.mat <- a.mat - a.mat[,i]
  t.mat <- log10(abs(t.mat)+1) * sign(t.mat)
  return(suppressWarnings(cor(t.mat, b.mat[,i])))
}))
Any idea if and how this can be done faster than the way I'm currently doing it? Perhaps by running it in parallel with multidplyr? Per multidplyr, here's what I'm attempting:
library(dplyr)
df <- do.call(rbind, lapply(1:ncol(a.mat), function(i){
  cbind(reshape2::melt(a.mat) %>% dplyr::rename(id=Var1, cell=Var2, a.value=value),
        do.call("rbind", replicate(ncol(a.mat),
          data.frame(cell.i=colnames(a.mat)[i], a.value.i=a.mat[,i]), simplify=F)),
        do.call("rbind", replicate(ncol(a.mat),
          data.frame(b.value.i=b.mat[,i]), simplify=F)))
})) %>%
  dplyr::mutate(t.value = a.value - a.value.i) %>%
  dplyr::mutate(t.value = log10(abs(t.value)+1)*sign(t.value)) %>%
  dplyr::group_by(cell.i)
And then:
group.size <- 3
n.groups <- ceiling(ncol(a.mat)/group.size)
for(i in 1:n.groups){
  start.idx <- (i-1)*group.size+1
  end.idx <- min(i*group.size, ncol(a.mat))
  current.df <- df %>% dplyr::filter(cell.i %in% colnames(a.mat)[start.idx:end.idx])
  current.df <- current.df %>%
    multidplyr::partition(cell.i) %>%
    multidplyr::cluster_library("tidyverse") %>%
    multidplyr::cluster_library("MASS") %>%
    multidplyr::cluster_assign_value("myFunction", myFunction) %>%
    do(results = myFunction(.)) %>%
    dplyr::collect() %>%
    .$results %>%
    dplyr::bind_rows()
}
Where:
myFunction <- function(df) {
  return(df %>%
    dplyr::group_by(e.cell) %>%
    dplyr::mutate(cor = cor(t.value, b.value.i)))
}
However, this hits this error message:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
  3 nodes produced errors; first error: Can't convert an environment to function
Call `rlang::last_error()` to see a backtrace
In addition: Warning message:
group_indices_.grouped_df ignores extra arguments
So, is using multidplyr this way even worth it? And if so, any idea what I'm doing wrong here?
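As a point of comparison, the original column loop maps directly onto array operations. A NumPy sketch of the same computation (random test data only: NumPy's generator will not reproduce R's set.seed(1) draws; and since column i of t.mat is constant, its correlation comes out NaN, just as R's cor warns):

```python
import numpy as np

# Same shapes as the question's example
rng = np.random.default_rng(1)
a = rng.normal(size=(25, 150))
b = rng.normal(size=(25, 150))

n = a.shape[0]
res = np.empty((a.shape[1], a.shape[1]))
for i in range(a.shape[1]):
    t = a - a[:, [i]]                                 # a.mat - a.mat[, i]
    t = np.log10(np.abs(t) + 1) * np.sign(t)
    with np.errstate(invalid="ignore", divide="ignore"):
        tz = (t - t.mean(axis=0)) / t.std(axis=0)     # column i becomes NaN
        bz = (b[:, i] - b[:, i].mean()) / b[:, i].std()
        res[:, i] = tz.T @ bz / n                     # Pearson r of each column of t with b[, i]
```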
How to get weights for rbf kernel or any custom kernel in SVM?
I need to get the weights from a scikit-learn SVM fitted with a custom kernel or RBF kernel. I saw that dual_coef_ gives a few values, but I'm not sure what they are.
Thanks in advance.
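In case it helps: dual_coef_ holds the products y_i * alpha_i for the support vectors, one entry per support vector. For a linear kernel this lets you reconstruct the primal weight vector; for an RBF or other non-linear kernel there is no finite-dimensional weight vector to extract, only these dual coefficients. A small sketch on toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy problem
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

# dual_coef_ is (n_classes-1, n_SV): each entry is y_i * alpha_i.
# For a *linear* kernel the primal weights are this combination of the
# support vectors, which matches clf.coef_ (coef_ is undefined for RBF).
w = clf.dual_coef_ @ clf.support_vectors_
```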

predict system of circular reference equations
If I have the data frame x:

x = data.frame(y1 = c(1, 5, 8, 9, 10, 0, NA),
               y2 = c(2, 6, 9, 0, 1, 7, NA),
               x1 = c(3, 5, 8, 2, 5, 11, 12),
               x2 = c(0, 1, 2, 8, 0, 1, 12),
               x3 = 1:7)
And found that the best regression for y1 relies on y2, and vice versa, such that:

fit1 = lm(y1 ~ y2 + x1, data = x[1:6, ])
fit2 = lm(y2 ~ y1 + x2 + x3, data = x[1:6, ])
How could I come up with a prediction for y1 and y2 for the 7th observation?
Does linear regression work with a categorical independent variable & continuous dependent variable?
I have a dataset where:
X1 - categorical independent variable
X2 - continuous independent variable
y - continuous dependent variable
And I'm looking to use X1 and X2 to predict y. Is linear regression appropriate for this (does it even make sense to regress over a categorical independent variable?)? If so, how can I use linear regression when X1 is a categorical independent variable (e.g. eye colour)?
Should I create a separate linear regression model for each of the categories in X1? Or try to create a multiple linear regression model?
Taking a look online there are mostly resources concerning continuous independent > continuous dependent (linear regression), or continuous independent > categorical dependent (logistic regression).
Would appreciate being pointed to any resources/tools that could help me.
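For what it's worth, the standard approach is a single multiple regression with the categorical variable dummy-coded (one-hot with one level dropped), which is what R's lm or statsmodels' formula interface do automatically for factors. A minimal sketch with made-up data (the eye-colour levels and y values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up illustrative data: X1 categorical, X2 continuous, y continuous
df = pd.DataFrame({"X1": ["blue", "brown", "green", "blue", "brown", "green"],
                   "X2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   "y":  [2.1, 4.5, 6.2, 5.9, 8.8, 10.1]})

# Dummy-code X1, dropping one level so the columns are not collinear with
# the intercept (the dropped level becomes the baseline category)
X = pd.get_dummies(df[["X2", "X1"]], columns=["X1"], drop_first=True).astype(float)
X.insert(0, "intercept", 1.0)

# One multiple regression over the full design matrix
beta, *_ = np.linalg.lstsq(X.to_numpy(), df["y"].to_numpy(), rcond=None)
```

Fitting a separate model per category is equivalent to adding interaction terms between X1 and X2; the single model with dummies is the usual starting point unless you believe the slope of X2 differs by category.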

cvxpy constrained normalization equations (abs)
I am working on an optimization problem (A*v = b) where I would like to rank a set of alternatives X = {x1, x2, x3, x4}. However, I have the following normalization constraint: |v[i] - v[j]| <= 1, which can be written in the form -1 <= v[i] - v[j] <= 1. My code is as follows:
import cvxpy as cp

n = len(X)  # set of alternatives
v = cp.Variable(n)
objective = cp.Minimize(cp.sum_squares(A*v - b))
constraints = [0 <= v]

# Normalization condition -1 <= v[i] - v[j] <= 1
for i in range(n):
    for j in range(n):
        constraints = [-1 <= v[i]-v[j], 1 >= v[i]-v[j]]

prob = cp.Problem(objective, constraints)
# The optimal objective value is returned by `prob.solve()`.
result = prob.solve()
# The optimal value for v is stored in `v.value`.
va2 = v.value
Which outputs:
[0.15 0.45 0.35 0.05]
This result is not close to what it should be, and it even has negative values. I think my code for the normalization constraint is most probably wrong.

How to broom::glance or summarize several phylogenetic gls and save as tibble?
I have several phylogenetic GLS and would like to find a better way to summarize the results. I am not using loops to obtain several PGLS since I will be clumping several traits from different matrices. Example below:
library(caper)  # for the pgls method
library(dplyr)
library(broom)
library(purrr)

data(shorebird)
birdie <- comparative.data(shorebird.tree, shorebird.data, Species)

pgls.EggVsMass   <- pgls(Egg.Mass ~ M.Mass,  birdie)
pgls.EggVsClutch <- pgls(Egg.Mass ~ Cl.size, birdie)

SUM.EggVsMass   <- summary(pgls.EggVsMass)$coefficients
SUM.EggVsClutch <- summary(pgls.EggVsClutch)$coefficients

GL <- mget(ls(pattern = "SUM.*"))
tidier <- GL %>% purrr::map(., glance)
But the object 'tidier' is not a simple tibble, and I can't properly save it. How can I improve this strategy and obtain a tibble with the information from the summary or the coefficients?
A similar code but for lm worked so I would like to solve the PGLS in a similar way. Using map_df in the case above does not work.
LMTRAIT <- names(data)[1] %>%
  paste('trait1 ~', .) %>%
  map(lm, data) %>%
  purrr::map_df(., broom::glance)
# as_tibble(LMTRAIT)

Weighted Least Squares with scipy.least_squares
I'm trying to leverage the scipy.least_squares library for solving large, sparse, nonlinear problems. So far, it's been great!
However, one shortcoming that I've run into, and so far have not been able to work around, is utilizing a weight matrix for the observations and/or the residual vector. NOTE: I know that it's possible to set up parameter bounds; however, that is not what I'm trying to accomplish.
This example was taken from a previous Stack Overflow question, but the answer involved scipy.optimize, not scipy.optimize.least_squares. While this specific example is somewhat trivial, the problems I'm attempting to solve can involve hundreds of thousands of observations and tens of thousands of parameters, hence why it's imperative to utilize least_squares.
In this case, I would like to use the weights from the residuals of the original fit and attempt a refit via least squares using those weights. The weights should be the inverse of the residuals (or the inverse of the residuals multiplied by a scaling factor).
import numpy as np
from scipy.optimize import least_squares

ydata = [9.7372923, 10.0587245, 10.3838510, 10.6931371, 10.9616260,
         11.1833220, 11.3806770, 11.5248917, 11.7353000]
xdata = np.array([j+5 for j in range(len(ydata))])

def get_weights(resid):
    """
    This function calculates the weights per (x,y) by using the inverse
    of the squared residuals divided by the total sum of the inverse of
    the squared residuals. This might be incorrect but should work for
    the sake of example.
    """
    total = sum([abs(resid[i]) for i in range(len(resid))])
    fract = np.array([resid[i]/total for i in range(len(resid))])
    return fract

def get_least_squares_fit(xs, ys):
    """
    This function finds alpha (1/slope) and the y-intercept in the
    equation of a line y = x / alpha + intercept = mx + b
    """
    ## SPECIFY INITIAL GUESS OF OPTIMIZED PARAMETERS
    params_guess = [1/3, 9]  ## ps = [alpha, intercept]

    ## DEFINE FITTING FUNCTIONS
    fit_func = lambda ps, x: x/ps[0] + ps[1]
    err_func = lambda ps, x, y: fit_func(ps, x) - y

    ## GET OPTIMIZED PARAMETERS, RESIDUALS & WEIGHTS
    res = least_squares(err_func, params_guess, args=(xs, ys))
    alpha, intercept, residuals = res.x[0], res.x[1], res.fun
    weights = get_weights(residuals)
    return alpha, intercept, residuals, weights

alpha, intercept, residuals, weights = get_least_squares_fit(xdata, ydata)

print(alpha, intercept)
>> 4.03378447722 8.6198247828
print(residuals)
>> [ 0.12206326 0.04853721 0.02868313 0.09006308 0.11064582 0.08443567
    0.03388451 0.06980694 0.1073048 ]
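least_squares has no weights argument, but a common workaround is to fold the weights into the residual function itself: minimising sum(w_i * r_i^2) is the same as returning sqrt(w_i) * r_i as the residuals. A sketch along the lines of the code above (the two-pass reweighting scheme is illustrative, not a recommendation):

```python
import numpy as np
from scipy.optimize import least_squares

ydata = np.array([9.7372923, 10.0587245, 10.3838510, 10.6931371, 10.9616260,
                  11.1833220, 11.3806770, 11.5248917, 11.7353000])
xdata = np.array([j + 5 for j in range(len(ydata))], dtype=float)

fit_func = lambda ps, x: x / ps[0] + ps[1]   # same model as above

def weighted_err(ps, x, y, w):
    # Scaling each residual by sqrt(w_i) makes least_squares minimise
    # sum(w_i * r_i**2), i.e. a weighted least-squares objective
    return np.sqrt(w) * (fit_func(ps, x) - y)

# First pass with uniform weights (equivalent to the unweighted fit)
res = least_squares(weighted_err, [1/3, 9],
                    args=(xdata, ydata, np.ones_like(ydata)))

# Second pass: weights proportional to the inverse absolute residuals
w = 1.0 / np.abs(res.fun)
w = w / w.sum()
res2 = least_squares(weighted_err, res.x, args=(xdata, ydata, w))
```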