Calculating the row-wise cosine similarity using matrix multiplication in R
In the example below I have calculated the row-wise cosine similarity for data in a matrix using a custom function and a for loop. The output that I would like is a symmetric matrix.
I would like to implement this calculation using matrix multiplication (linear algebra) without a for loop, as the actual input matrix I need to work on is much larger and a loop would be too slow.
x = c(0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1)
x = matrix(x, nrow = 3, byrow = TRUE)
cosine_similarity = function(a, b){
y = crossprod(a, b) / sqrt(crossprod(a) * crossprod(b))
return(y)
}
N_row = dim(x)[1]
similarity_matrix = matrix(0, nrow = N_row, ncol = N_row)
for (i in 1:(N_row - 1)) {
for (j in (i + 1):N_row) {
similarity_matrix[i,j] = cosine_similarity(x[i,], x[j,])
}
}
similarity_matrix = similarity_matrix + t(similarity_matrix)
1 answer

We could use outer to make this faster:

outer(seq_len(nrow(x)), seq_len(nrow(x)),
      FUN = Vectorize(function(i, j) cosine_similarity(x[i, ], x[j, ])))

Output:

#          [,1]      [,2]      [,3]
#[1,] 1.0000000 0.5000000 0.4082483
#[2,] 0.5000000 1.0000000 0.4082483
#[3,] 0.4082483 0.4082483 1.0000000
Or another option is combn:

out <- diag(nrow(x)) * 0
out[upper.tri(out)] <- combn(seq_len(nrow(x)), 2,
                             FUN = function(i) c(cosine_similarity(x[i[1], ], x[i[2], ])))
out <- out + t(out)
diag(out) <- 1
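For the loop-free linear-algebra route the question asks about, the standard trick is to L2-normalize each row and take a single cross-product; in R that is tcrossprod(x / sqrt(rowSums(x^2))). The same idea, sketched in NumPy for illustration:

```python
import numpy as np

# Same 3x4 matrix as in the question (byrow = TRUE).
x = np.array([[0, 1, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 1]], dtype=float)

# L2-normalise each row, then one matrix product yields every pairwise cosine.
norms = np.sqrt((x * x).sum(axis=1, keepdims=True))
normed = x / norms
sim = normed @ normed.T  # symmetric, with 1s on the diagonal
```

This replaces both loops with one matrix product, which is what lets it scale to much larger inputs.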
See also questions close to this topic

dplyr: group_by, sum various columns, and apply a function based on grouped row sums?
I'm trying to use dplyr to summarize a dataframe of bird species abundance in forests which are fragmented to some degree.
The first column, percent_cover, has 4 possible values: 10, 25, 50, 75. Then there are ten columns of bird species counts: 'species1' through 'species10'.
I want to group by percent_cover, then sum the other columns and calculate these sums as a percentage of the 4 row sums.
Getting the column sums is easy enough:
%>% group_by(Percent_cover) %>% summarise_at(vars(contains("species")), sum)
...but what I need is sum/rowSum*100. It seems that some kind of 'rowwise' operation is needed.
Also, out of interest, why does the following not work?
%>% group_by(Percent_cover) %>% summarise_at(vars(contains("species")), sum*100)
At this point, it's tempting to go back to 'for' loops....or Excel pivot tables.
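For comparison, the grouped percent-of-group-total computation can be sketched in pandas (the two species columns and the numbers here are made up; the asker's frame has ten species columns):

```python
import pandas as pd

# Made-up abundance data standing in for the real frame.
df = pd.DataFrame({
    "percent_cover": [10, 10, 25, 25],
    "species1": [2, 3, 1, 0],
    "species2": [1, 1, 4, 2],
})

# Sum each species within a cover class, then express every sum as a
# percentage of that class's total across all species (sum / rowSum * 100).
sums = df.groupby("percent_cover").sum()
pct = sums.div(sums.sum(axis=1), axis=0) * 100
```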

Group by and sum columns together
I have created a list that contains 64 dataframes. As you can see in the code below, they contain the results of a sum (PST[rownames(pct_kritiek),] * pct_kritiek[,i] * 1000000). This works fine. Next I am using an inner join to merge my data with a conversion table. With some editing and the group_by function I managed to aggregate the rows from 380 down to about 10.
My PROBLEM is that I need to do the same thing for the columns.
The following code creates my list of dataframes and the vertical group by:
library(tidyr)
supply_agg = vector(length = 64, mode = 'list')
for (i in 1:64) {
  supply_agg[[i]] <- PST[rownames(pct_kritiek), ] * pct_kritiek[, i] * 1000000
  supply_agg[[i]] <- cbind("gg" = rownames(supply_agg[[i]]), supply_agg[[i]])
  supply_agg[[i]] <- inner_join(Conversiontablerows, supply_agg[[i]], by = "gg")
  supply_agg[[i]] <- supply_agg[[i]][, -1]
  supply_agg[[i]] <- supply_agg[[i]] %>% group_by(gg_mm) %>% summarise_each(funs(sum))
}
I have another dataset including 2 columns as a conversion table:
a Cat1
b Cat2
c Cat1
d Cat1
e Cat3
etc.
I have my data (supply_agg[[i]]) and I have a dataframe with conversion info. How can I sum my columns down from 147 to 10 categories?
I started working with R on Monday, so I need some help with this problem. (Can't publish a sample of the data as it's confidential...)
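The column-wise aggregation can be sketched in pandas by transposing, grouping the (former) columns with a label mapping, and transposing back; the mapping dict here is a made-up stand-in for the asker's conversion table:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
# Made-up conversion table: column label -> category.
mapping = {"a": "Cat1", "b": "Cat2", "c": "Cat1"}

# Group the columns by category: transpose, group the rows by the
# label mapping, sum within each group, then transpose back.
agg = df.T.groupby(mapping).sum().T
```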

Assigning colors to edges using edge weight in ggraph
I am quite new to network analysis. I have code that creates a network basically as I want it, except that I cannot manage to assign colors to the edges based on their weight. I would like to color edges with a negative weight red, edges with a weight of zero blue, and positive weights green. Below is an example of the data and the code I already have. As I said, everything is working except for the colors.
Thanks a lot for your help :)
Code and data:
> dput(twonw_78_87_nor)
structure(c(0, 0, 0, 0.0625, 0.333333333333333, 0.181818181818182,
0.333333333333333, 0.1875, 0, 0, 0, 0.0625, 0.111111111111111,
0.272727272727273, 0.166666666666667, 0.25, 0, 0, 0.166666666666667,
0.125, 0.111111111111111, 0.0909090909090909, 0, 0.0625, 0.111111111111111,
0.0909090909090909, 0, 0.0625, 0.111111111111111, 0.363636363636364,
0.166666666666667, 0.0625, 0, 0, 0.166666666666667, 0), .Dim = c(4L, 9L),
.Dimnames = list(c("Bündnis90/die Grünen", "CDU/CSU", "FDP", "SPD"),
    c("ABzÖ: Neugestaltung d. Wirtschaftens", "ABzÖ: VW höhere Prio als ökol. Probleme",
      "ABzÖ: Ökol. Anpassung d. Wirtschaft", "ABzÖ: Ökol. Umbau Zukunft",
      "SH: Anreize+Subventionen", "SH: Selbstverantwortung",
      "SH: Verbote_Bürger:innen", "SH: Verbote_Industrie", "SH: ökol. Steuerreform")),
start = structure(252457200, tzone = "", class = c("POSIXct", "POSIXt")),
stop = structure(567989999, tzone = "", class = c("POSIXct", "POSIXt")),
call = dna_network(connection = conn, networkType = "twomode",
    statementType = "DNA Statement", variable1 = "organization",
    variable2 = "concept", qualifier = "agreement",
    qualifierAggregation = "subtract", normalization = "activity",
    duplicates = "document", start.date = "01.01.1978",
    stop.date = "31.12.1987", excludeValues = list(concept = codes_vp,
        concept = codes_probleme, concept = codes_ep, concept = codes_t),
    verbose = TRUE), class = c("dna_network_twomode", "matrix"))

dna_plotNetwork(twonw_78_87_nor, truncate = 20, label_repel = 0.25,
                node_size = 4, font_size = 7) +
  coord_flip() +
  scale_edge_colour_discrete(h = c(0, 360) + 15, c = 100, l = 65,
                             h.start = 0.2, direction = 1,
                             na.value = "grey50", limits = c(-0.2, 0.2))

Matrix multiplication in Fixed Point for 16 bits
I need to perform the matrix multiplication between different layers in a neural network. That is:

W0, W1, W2, ..., Wn

are the weights of the neural network and the input is data. The resulting matrices are:

Out1 = data * W0
Out2 = Out1 * W1
Out3 = Out2 * W2
...
OutN = Out(N-1) * Wn
I know the absolute maximum value in the weight matrices, and I also know that the input data values range from 0 to 1 (the input is normalized). The matrix multiplication is in 16-bit fixed point. The weights are scaled to the optimal fixed-point format. For example: if the absolute maximum value in W0 is 2.5, I know the minimum number of bits in the integer part is 2 and the fractional part will have 14 bits. Because the input data is in the range [0, 1], I also know its format is 1.15 (integer and fractional bits).
My question is: how can I know the minimum number of integer bits in the resulting matrix to avoid overflow? Is there any way to study and infer the maximum value of a matrix multiplication? I know about the determinant and the norm of a matrix, but I think the problem lies in runs of consecutive negative or positive values in the matrix rows and columns. For example, if I have this row vector and this column vector, and the result is in 8-bit fixed point:
A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [-1, -2, -3, -4, -5, -6, 7, 8]
A * B = (1*-1) + (2*-2) + (3*-3) + (4*-4) + (5*-5) + (6*-6) + (7*7) + (8*8) = -91 + 49 + 64 = 22
When the sum accumulator goes below -64, overflow occurs, although the final result is contained in [-64, 63].
Another example: if I have this row vector and this column vector, and the result is in 8-bit fixed point:
A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [1, -2, 3, -4, 5, -6, 7, -8]
A * B = (1*1) - (2*2) + (3*3) - (4*4) + (5*5) - (6*6) + (7*7) - (8*8) = -36
Here the sum accumulator at no moment exceeds the maximum range for 8 bits.
To sum up: I'm looking for a way to analyze the weight matrices to avoid overflow in the sum accumulator. The way I do the matrix multiplication is (just an example, where matrices A and B have been scaled to 1.15 format):
A1 -> 1.15 bits
B1 -> 1.15 bits
A2 -> 1.15 bits
B2 -> 1.15 bits
mult_1 = (A1 * B1) >> 15; // Right shift (divide by 2^15) to align the operands
mult_2 = (A2 * B2) >> 15; // Right shift (divide by 2^15) to align the operands
sum_acc = mult_1 + mult_2; // Sum accumulator
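One way to bound the accumulator without simulating every input: with data in [0, 1], the running sum for an output element always stays between the sum of that weight column's negative entries and the sum of its positive entries, whatever order the products are accumulated in. A sketch of that worst-case analysis in plain Python (the weight values are hypothetical):

```python
import math

# Hypothetical 3x2 weight matrix; the input vector is assumed to lie in [0, 1]^3.
W = [[ 2.5, -1.0],
     [-3.0,  0.5],
     [ 1.5, -2.0]]

n_rows, n_cols = len(W), len(W[0])

# With x in [0, 1]^n, every partial sum of x[0]*W[0][c] + ... + x[n-1]*W[n-1][c]
# stays between the column's negative-entry sum and its positive-entry sum.
upper = [sum(max(W[r][c], 0.0) for r in range(n_rows)) for c in range(n_cols)]
lower = [sum(min(W[r][c], 0.0) for r in range(n_rows)) for c in range(n_cols)]

# Magnitude bits needed so the accumulator can never overflow (sign bit extra).
worst = max(max(upper), -min(lower))
int_bits = math.floor(math.log2(worst)) + 1
```

Sizing the accumulator's integer part from these per-column bounds guarantees no intermediate overflow, even when the final result would fit in fewer bits.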

Multiplication of matrix rows with each other and with another list
I am trying to multiply a matrix containing voxel sizes with a list of single numbers (containing the number of voxels). Something like this:

a = [(1, 2, 3), (2, 3, 4)]
b = [5, 6]
hocuspocus = [1 * 2 * 3 * 5, 2 * 3 * 4 * 6] = [30, 144]

Because I need to report the voxels in cubic millimeters, I need to multiply the entries of each matrix row with each other and then with list b. I haven't yet figured out how to do that in Python. Does anyone have any suggestions? Thanks.
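A minimal NumPy sketch of exactly this: take the product across each row, then multiply elementwise by b.

```python
import numpy as np

a = np.array([(1, 2, 3), (2, 3, 4)])  # voxel sizes, one row per image
b = np.array([5, 6])                  # voxel counts

# Product across each row (voxel volume), scaled by the voxel count.
hocuspocus = np.prod(a, axis=1) * b   # -> [30, 144]
```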

Calculate viewmatrix from opengl data
I want to convert my player position to pixels so I can display it on screen. 1) I have already found my player position. 2) I have also found some kind of view matrix. Viewmatrix
The game is definitely OpenGL, but it uses a Direct3D-style format by the looks of it. Is xyz in the bottom row? How would one go about calculating it?

Compute cosine similarity between every pair of sentences and add average scores of sentences in new column
I want to compute the cosine similarity between every pair of sentence BERT embeddings and add the average score of each sentence in a new column as its rank. I wrote the following code to compute the cosine similarity:
from scipy import spatial
from sent2vec.vectorizer import Vectorizer

for i in range(0, len(features)):
    for j in range(i + 1, len(features)):
        sentence_1 = i
        sentence_2 = j
        temp_sim_value = spatial.distance.cosine(features[i], features[j])
features is the BERT embedding of each sentence in df['sents'], with numpy.ndarray type. Now I want to compute the average cosine score of each sentence over its pairs and add it to the related sentence in the dataframe, like this:

sents rank
s1    0.6
s2    0.3
...

How can I do it?
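One way to get those averages is to build the full pairwise cosine matrix in one shot and take row means excluding the diagonal self-similarity; a sketch with NumPy, where the small features array is a made-up stand-in for the BERT embeddings:

```python
import numpy as np

# Made-up stand-in for the BERT embedding matrix (one row per sentence).
features = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0]])

# Pairwise cosine similarity in one shot: normalise rows, one matrix product.
normed = features / np.linalg.norm(features, axis=1, keepdims=True)
sim = normed @ normed.T

# Average similarity of each sentence to all *other* sentences:
# drop the diagonal 1.0 self-similarity before averaging.
n = len(features)
rank = (sim.sum(axis=1) - 1.0) / (n - 1)
# df["rank"] = rank would then attach it to the dataframe
```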

Tensorflow 1: How to calculate tensors' cosine similarity to form a similarity matrix?
First, I have a tensor like this:
a = [[A B],[C D]]
I'd like to calculate the cosine similarity between each pair of rows, i.e. cos([A B],[A B]), cos([A B],[C D]), cos([C D],[A B]), and cos([C D],[C D]), to form a similarity matrix like this:
[[cos([A B],[A B]),cos([A B],[C D])], [cos([C D],[A B]),cos([C D],[C D])]]
I tried to use the following code to get the similarity matrix, but it didn't work:
`tf.losses.cosine_distance(tf.expand_dims(a, 0), tf.expand_dims(a, 1), axis = 2)`
How can I do this with efficient vectorization in TF1? Thanks for your reply.
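The usual vectorized recipe is to L2-normalize the rows and take a single matrix product; in TF1 that would be something like tf.nn.l2_normalize(a, axis=1) followed by tf.matmul(normed, normed, transpose_b=True). The same idea, sketched in NumPy:

```python
import numpy as np

# Rows play the role of [A B] and [C D] in the question.
a = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Normalise each row to unit length, then one matrix product
# gives cos(row_i, row_j) for every pair at once.
normed = a / np.linalg.norm(a, axis=1, keepdims=True)
sim = normed @ normed.T
```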

Calculating cosine similarity in pandas
I want to plot a heatmap visualizing the cosine similarity of two dataframes from csv files. I have two datasets: a.csv, which contains information about how many GitHub repositories use a specific programming language, and b.csv, which contains information about how many times two programming languages are used in a common repository. I want to plot a heatmap visualizing the cosine similarity of pairs of programming languages with respect to their co-usage in GitHub repositories.
I used pivot to give them the same shape, but I'm not sure if this is what I have to do to get the cosine similarity.
data1 = pd.read_csv('a.csv')
data1.head
<bound method NDFrame.head of
         cnt         lang
0    1160725   JavaScript
1     871264          CSS
2     814370         HTML
3     671755        Shell
4     567150       Python
..       ...          ...
333        4      Omgrofl
334        4      Befunge
335        4       RUNOFF
336        3  NetLinx+ERB
337        0          NaN
[338 rows x 2 columns]>

data2 = pd.read_csv('b.csv')
data2.head
<bound method NDFrame.head of
            lgn       t2_lgn     cnt
0    JavaScript          CSS  716441
1    JavaScript         HTML  602955
2          HTML          CSS  589971
3         Shell   JavaScript  221484
4         Shell       Python  217501
..          ...          ...     ...
995     Gnuplot            C    2199
996        Ruby  AppleScript    2192
997          XS            C    2192
998       SQLPL    Batchfile    2189
999      Smarty          C++    2188
[1000 rows x 3 columns]>

a = pd.pivot_table(data1, values=['cnt'], index=['lang'])
a.head
<bound method NDFrame.head of
                 cnt
lang
1C Enterprise    315
ABAP             483
AGS Script       730
AMPL             832
ANTLR           2666
...              ...
mupad             79
nesC             510
ooc              130
wisp              25
xBase            356
[337 rows x 1 columns]>

b = pd.pivot_table(data2, values=['cnt'], index=['lgn'], columns=['t2_lgn'])
b.head
<bound method NDFrame.head of
                   cnt
t2_lgn             ASP  ActionScript  ApacheConf  AppleScript  Arduino  Assembly
lgn
Assembly       13367.0           NaN         NaN          NaN      NaN       NaN
Awk            11052.0           NaN         NaN          NaN      NaN   18215.0
Batchfile       4921.0           NaN      9816.0          NaN      NaN   10466.0
Bison              NaN           NaN         NaN          NaN      NaN    3048.0
C              15086.0           NaN      4141.0       2513.0   7981.0   48471.0
...                ...           ...         ...          ...      ...       ...
Visual Basic       NaN           NaN         NaN          NaN      NaN    2471.0
Vue                NaN           NaN         NaN          NaN      NaN       NaN
XS                 NaN           NaN         NaN          NaN      NaN       NaN
XSLT            6534.0           NaN      3401.0          NaN      NaN    7838.0
Yacc            5394.0           NaN         NaN          NaN      NaN   10922.0
...
t2_lgn           VimL
lgn
Assembly          NaN
Awk               NaN
Batchfile         NaN
Bison             NaN
C                 NaN
...               ...
Visual Basic      NaN
Vue               NaN
XS                NaN
XSLT           5300.0
Yacc              NaN
[94 rows x 86 columns]>
I need to place these dataframes into the cosine similarity formula. The dataframes have different lengths. I found the code below and tried to use it, but it did not work.
sum = 0
suma1 = 0
sumb1 = 0
for i, j in zip(a, b):
    suma1 += i * i
    sumb1 += j * j
    sum += i * j
cosinesim = sum / ((sqrt(suma1)) * (sqrt(sumb1)))
print(cosinesim)
I get an error:
TypeError                                 Traceback (most recent call last)
<ipython-input-26-53d1b0bb4770e> in <module>
     28
     29 for i,j in zip(a,b):
---> 30     suma1 += i * i
     31     sumb1 += j * j
     32     sum += i * j
TypeError: can't multiply sequence by non-int of type 'str'
Thank you for your help!
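The zip(a, b) loop iterates over column labels (strings), which is what triggers the TypeError. An alternative route: treat each language's row of the pivoted co-occurrence table as its vector, fill the NaNs with 0, and compute every pairwise cosine at once; the resulting square frame can go straight into a heatmap. A sketch with a tiny made-up stand-in for the pivoted b.csv:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the pivoted co-occurrence matrix built from b.csv.
b = pd.DataFrame(
    {"CSS": [716441.0, np.nan], "HTML": [602955.0, 589971.0]},
    index=["JavaScript", "Shell"],
)

# Missing co-occurrences count as 0, then normalise rows and take one product.
m = b.fillna(0).to_numpy(dtype=float)
normed = m / np.linalg.norm(m, axis=1, keepdims=True)
sim = pd.DataFrame(normed @ normed.T, index=b.index, columns=b.index)

# sim is square and symmetric; e.g. seaborn.heatmap(sim) would plot it.
```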