How to convert a PySpark matrix to a DataFrame
Convert a PySpark CoordinateMatrix into a PySpark DataFrame. I got the CoordinateMatrix from a cosine similarity calculation and need to convert it to a PySpark DataFrame so I can convert it to a pandas DataFrame for analysis.
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

rdd = dfcon.rdd.map(tuple)
mymatrix = IndexedRowMatrix(rdd.map(lambda row: IndexedRow(row[0], row[1:]))) \
    .toBlockMatrix() \
    .transpose() \
    .toIndexedRowMatrix()
mysimilarities = mymatrix.columnSimilarities()
I used this for cosine similarity, and the resulting matrix contains the coordinates of the cosine similarities.
See also questions close to this topic

How to start another window using eel in Python?
I am creating a GUI application using the eel library in Python.
index.html contains the login form; if the login is successful, I want to open retrieve.html.
This is my Python code (please assume the login is successful):
@eel.expose
def retrieve():
    eel.start('retrieve.html', size=(1000, 700))

eel.start('index.html', size=(1000, 700))
This is my JavaScript code:

function login_func() {
    var username = document.getElementById("username").value
    var pword = document.getElementById("pword").value
    eel.login(username, pword)(set_result)
}

function set_result(result) {
    if (result == "Failed") {
        window.alert("Please insert correct username and password")
    } else {
        window.close()
        eel.retrieve()
    }
}
Everything works fine, but I get the following error message:
OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted: ('localhost', 8000)
How can I avoid this?

Get combinations from a list of arrays
I have a list like:

list = [
    [14, 13, 10, 9, 7, 6, 5, 2, 1, 0],
    [11, 7, 4, 2, 1, 0],
    [15, 14, 13, 12, 7, 6, 5, 2, 0],
    [14, 13, 12, 11, 9, 6, 4, 2, 1, 0],
    [15, 14, 13, 8, 7, 6, 3, 1, 0],
    [15, 14, 11, 10, 8, 7, 6, 3, 2, 1],
    [15, 14, 9, 8, 7, 4, 2, 1],
    [15, 14, 13, 12, 10, 7, 5, 3, 2, 1],
]
How can I generate all possible combinations that have 2 elements in common? I already tried with a loop, but I can't get to the result.

Expected output, something like this:

(10 3): 2 times
(15 14)(7 2): 4 times
(15 14): 5 times
(13 6)(1 0): 3 times
(7 2): 6 times
(12 5): 2 times
(11 4): 2 times
(13 6): 4 times
(15 14)(7 2)(1): 3 times
(1 0): 4 times
(14 9): 2 times
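One reading of the task, sketched with itertools: count, for every unordered pair of values, how many of the sub-lists contain both, then keep the pairs shared by at least two sub-lists (the first three sub-lists from the question are reused below):

```python
from itertools import combinations
from collections import Counter

lists = [
    [14, 13, 10, 9, 7, 6, 5, 2, 1, 0],
    [11, 7, 4, 2, 1, 0],
    [15, 14, 13, 12, 7, 6, 5, 2, 0],
]

pair_counts = Counter()
for sub in lists:
    # combinations() of the sorted sub-list yields each unordered pair once
    pair_counts.update(combinations(sorted(sub), 2))

# Keep only the pairs that occur in at least 2 of the sub-lists
common = {pair: n for pair, n in pair_counts.items() if n >= 2}
```

With all eight sub-lists this produces counts like "(7, 2): 6 times" directly from `common`.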

Difference between countplot and catplot
In Python's seaborn, what is the difference between countplot and catplot? E.g.:
sns.catplot(x='class', y='survived', hue='sex', kind='bar', data=titanic);
sns.countplot(y='deck', hue='class', data=titanic);
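In short, countplot is the axes-level function and catplot is the figure-level interface; catplot(kind="count") draws the same plot. A small self-contained sketch, using toy data instead of the titanic dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"class": ["First", "First", "Second", "Second", "Third"],
                   "sex": ["male", "female", "male", "male", "female"]})

ax = sns.countplot(x="class", hue="sex", data=df)             # axes-level: returns an Axes
g = sns.catplot(x="class", hue="sex", kind="count", data=df)  # figure-level: returns a FacetGrid
```

countplot only counts rows; catplot can additionally aggregate a `y` variable (as in the `kind='bar'` example above, which plots the mean of `survived`) and switch between bar, box, violin, etc. via `kind=`.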

How to unfold a matrix in Matlab?
I have a given matrix H and I would like to unfold (expand) it to find a matrix B, following the method below.

Let H be a matrix of dimension m × n, and let x = gcd(m, n).
The matrix H is cut in two parts, the cutting pattern being such that:
- the "diagonal cut" is made by alternately moving c = n/x units to the right (we move c units to the right several times);
- we alternately move c - b = m/x units down (i.e. b = (n - m)/x) (we move b units down several times).
After applying this "diagonal cut" to the matrix, we copy and paste the two parts repeatedly to obtain the matrix B.
Example: let the matrix H of dimension m × n = 5 × 10 be defined by:

1 0 1 1 1 0 1 1 0 0
0 1 1 0 0 1 1 0 1 1
1 1 0 1 1 1 0 1 0 0
0 1 1 0 1 0 1 0 1 1
1 0 0 1 0 1 0 1 1 1

Let's calculate x = gcd(m, n) = gcd(5, 10) = 5.
- Alternately move to the right: c = n/x = 10/5 = 2.
- Alternately move down: b = (n - m)/x = (10 - 5)/5 = 1.
Diagonal cutting diagram: the matrix H is cut in two parts. The cutting pattern is such that:
- we repeatedly move c = 2 units to the right;
- we repeatedly move c - b = 1 unit downwards.
After applying this "diagonal cut" to the matrix, we copy and paste the two parts repeatedly to obtain the resulting matrix (figure omitted).
Remark: in the matrices X, X1 and X2 (figures omitted) the dashes are zeros. The resulting matrix B, where L is a factor, is shown in the omitted figure.
Any suggestions?

Matrix multiplication in fixed point with 16 bits
I need to perform the matrix multiplication between the different layers of a neural network. That is, W0, W1, W2, ..., Wn are the weights of the neural network and the input is data. The resulting matrices are:

Out1 = data * W0
Out2 = Out1 * W1
Out3 = Out2 * W2
.
.
.
OutN = Out(N-1) * Wn
I know the absolute maximum value in the weight matrices, and I also know that the input data values range from 0 to 1 (the inputs are normalized). The matrix multiplication is in 16-bit fixed point. The weights are scaled to the optimal fixed-point format. For example: if the absolute maximum value in W0 is 2.5, I know that the minimum number of bits in the integer part is 2 and the fractional part will have 14 bits. Because the input data is in the range [0, 1], I also know its integer and fractional bits are 1.15.

My question is: how can I know the minimum number of integer bits in the resulting matrix to avoid overflow? Is there any way to study and infer the maximum value of a matrix multiplication? I know about the determinant and the norm of a matrix, but I think the problem lies in runs of consecutive negative or positive values in the matrix rows and columns. For example, if I have this row vector and this column vector, and the result is in 8-bit fixed point:
A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [-1, -2, -3, -4, -5, -6, 7, 8]
A * B = -(1*1) - (2*2) - (3*3) - (4*4) - (5*5) - (6*6) + (7*7) + (8*8) = -91 + 49 + 64 = 22

When the sum accumulator drops below -64, overflow occurs, although the final result is contained in [-64, 63].
Another example: if I have this row vector and this column vector, and the result is in 8-bit fixed point:

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [1, -2, 3, -4, 5, -6, 7, -8]
A * B = (1*1) - (2*2) + (3*3) - (4*4) + (5*5) - (6*6) + (7*7) - (8*8) = -36

Here the sum accumulator at no moment exceeds the maximum range for 8 bits.
To sum up: I'm looking for a way to analyze the weight matrices to avoid overflow in the sum accumulator. This is how I do the matrix multiplication (just an example, assuming matrices A and B have been scaled to 1.15 format):

A1 -> 1.15 bits    B1 -> 1.15 bits
A2 -> 1.15 bits    B2 -> 1.15 bits

mult_1 = (A1 * B1) >> 15;  // right shift to align the operands
mult_2 = (A2 * B2) >> 15;  // right shift to align the operands
sum_acc = mult_1 + mult_2; // sum accumulator
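One way to bound the integer bits needed, given that every input activation lies in [0, 1]: the accumulator for output j can never exceed the sum of absolute weights in column j, so ceil(log2(...)) of the largest column sum, plus a sign bit, is safe for every partial sum, not just the final result. A sketch with hypothetical weights:

```python
import math

# Hypothetical weight matrix (rows = inputs, columns = outputs)
W = [[ 1.5, -2.25],
     [-0.75, 1.0 ],
     [ 2.0, -0.5 ]]

def integer_bits_needed(weights):
    # Worst-case |accumulator| for column j is sum(|w| for w in column j),
    # because each activation is at most 1.
    worst = max(sum(abs(row[j]) for row in weights)
                for j in range(len(weights[0])))
    return math.ceil(math.log2(worst)) + 1  # +1 for the sign bit

bits = integer_bits_needed(W)
```

Applying this per layer (with the previous layer's output range in place of [0, 1]) gives a conservative integer-bit budget for each Out matrix.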

Android custom View matrix and its Drawable: incorrect placement of lines
I have the following setup:
- a custom View (inflated via XML) that extends ImageView and has scaleType = matrix;
- a Drawable (with its own onDraw method) that is used as the drawable of the custom View above.
The custom View has its own onTouch method. Inside onTouch I update static values stored in MainActivity that are used for scaling and translating the View's matrix. When any of these values change inside onTouch, the drawable is invalidated with the new values.
These values are also used by the drawable in order to translate and zoom in/out of the drawable. I am able to get event.getX and event.getY from the onTouch method and draw the exact location of the touch point within the drawable. See the pictures (red circle).
My goal is for the user to be able to read the coordinates (on a math graph) of any touch point on the screen. I am trying to draw a line to the middleX and middleY positions of the canvas, where middleX = canvas.getWidth()/2f and similarly for middleY.
Everything works fine when I have not zoomed into the view/drawable. When zooming in, I get the line, but its endpoint (which should be the user's touch point) is out of place.
The touch point itself is drawn in the correct place where the user touches the screen, but using it as the endpoint of the segment does not work. I need it to work!
I know I overcomplicated the setup; I should have just stuck to creating a custom view with its own onDraw and onTouch methods. But at this point it would take many days and hours to merge the drawable into the custom view's onDraw method.
Here is the code inside the onDraw method of the Drawable that draws the red line and the red dot (the user touch point):

canvas.drawCircle(ballx1, bally1, radius, redpaint); // ballx1, bally1 are static floats computed inside onTouch of the custom View
canvas.translate(/* some amounts */);
canvas.scale(/* some amounts, based on a midpoint */); // amounts are computed inside onTouch
canvas.drawLine(ballx1, bally1, middleX, middleY);     // this is where the problem is
What I don't understand is that the line otherwise gets drawn correctly; only its endpoint (ballx1, bally1) lands in the wrong place. I need the endpoint to be at the exact location of the red dot.
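A plausible fix, sketched outside Android for clarity: the touch point is captured in screen coordinates, but drawLine() after translate()/scale() interprets its arguments in the transformed canvas space, so the point has to be mapped through the inverse of the transform first. The function below assumes a translate followed by a scale about a pivot; all names are illustrative, not taken from the original code:

```python
def to_canvas_space(x, y, tx, ty, scale, pivot_x, pivot_y):
    # Invert the translate, then invert the scale about the pivot:
    # screen = (tx, ty) + pivot + scale * (point - pivot)
    cx = (x - tx - pivot_x) / scale + pivot_x
    cy = (y - ty - pivot_y) / scale + pivot_y
    return cx, cy

# Example: a canvas point (10, 20) drawn with tx=ty=5, scale=2, pivot=(0, 0)
# appears on screen at (25, 45); the inverse mapping recovers (10, 20).
recovered = to_canvas_space(25, 45, 5, 5, 2.0, 0, 0)
```

In Android the equivalent is to keep the same Matrix used for drawing, call `matrix.invert(inverse)`, and run the touch point through `inverse.mapPoints(...)` before passing it to drawLine().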

Low CPU usage while running a PySpark SVM model
I am trying to run an SVM on a very, very large dataset, which I am unable to do using sklearn: it takes endless time. So I decided to use PySpark. Here is my Spark configuration:
[('spark.app.id', 'local-1606562652917'),
 ('spark.executor.id', 'driver'),
 ('spark.app.name', 'SVM'),
 ('spark.driver.maxResultSize', '6g'),
 ('spark.driver.port', '60042'),
 ('spark.executor.cores', '6'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.host', '192.168.56.1'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.cores.max', '6')]
Here is spark session
spark = SparkSession.builder \
    .appName('SVM') \
    .master('local[*]') \
    .getOrCreate()
Here is SVM code
from pyspark.ml.classification import LinearSVC, OneVsRest

clf = OneVsRest(classifier=LinearSVC(labelCol='label', featuresCol='features'))
clf = clf.fit(train)
CPU consumption is less than 10% when I check via Task Manager.

PySpark: extracting rows of a dataframe where a value contains a string of characters
I'm using PySpark and I have a large dataframe with only a single column of values, in which each row is a long string of characters:
col1
----------------------------
'20201120;id09;150.09,20.02'
'20201120;id44;151.78,25.14'
'20201120;id78;148.24,22.67'
'20201120;id55;149.77,27.89'
...
I'm trying to extract the rows of the dataframe where 'idxx' matches a list of strings such as ["id01", "id02", "id22", "id77", ...]. Currently, the way I extract rows from the dataframe is:

df.filter(df.col1.contains("id01") | df.col1.contains("id02") | df.col1.contains("id22") | ...)
Is there a way to make this more efficient instead of having to hard code every string item into the filter function?
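One assumption-light alternative: join the wanted ids into a single regular expression and filter once with rlike() instead of chaining contains() calls. Sketched with plain Python `re` to show the pattern; the Spark calls are in the comments:

```python
import re

ids = ["id01", "id02", "id22", "id77"]
pattern = "|".join(re.escape(i) for i in ids)  # "id01|id02|id22|id77"

# Hypothetical Spark usage with the same pattern:
#   df.filter(df.col1.rlike(pattern))
# or, if the id is always the second ';'-separated field:
#   df.filter(F.split(df.col1, ";")[1].isin(ids))

sample = "20201120;id22;148.24,22.67"
matched = re.search(pattern, sample) is not None
```

The isin() variant avoids regex entirely and lets Spark use a set membership test, which scales better for long id lists.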

Python: write a PySpark dataframe to JSON without the header
My apologies for the similarity to a previously asked question; this one is in Python, and I can't find a correct solution. I have the following dataframe df1:

SomeJson
=================
[{
    "Number": "1234",
    "Color": "blue",
    "size": "Medium"
}, {
    "Number": "2222",
    "Color": "red",
    "size": "Small"
}]
and I am trying to write just the contents of this dataframe as a json.
df0.coalesce(300).write.mode('append').json(<json_Path>)
It brings in the first key as well, like:

{
    "SomeJson": [{
        "Number": "1234",
        "Color": "blue",
        "size": "Medium"
    }, {
        "Number": "2222",
        "Color": "red",
        "size": "Small"
    }]
}

But I would not like to have the { "SomeJson": ... } wrapper in the output file. I have tried the line below, but I am getting lost at writing the custom Python function to eliminate the first header. Any assistance is highly appreciated.
df0.rdd.map(<custom_function>).saveAsTextFile(<json_Path>)
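A sketch of what the <custom_function> could look like: serialize only the row's SomeJson value, dropping the column name. The SimpleNamespace stand-in below is hypothetical; a Spark Row exposes the field the same way, via attribute access:

```python
import json
from types import SimpleNamespace

def strip_header(row):
    # Emit only the value of the "SomeJson" field as a JSON string,
    # without the {"SomeJson": ...} wrapper.
    return json.dumps(row.SomeJson)

# Hypothetical Spark usage:
#   df0.rdd.map(strip_header).saveAsTextFile(json_path)

demo_row = SimpleNamespace(
    SomeJson=[{"Number": "1234", "Color": "blue", "size": "Medium"}])
line = strip_header(demo_row)
```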

How to write a tab.gz file using pyspark dataframe
I have a PySpark dataframe and I want my output files to have the .tab.gz extension.

df.write \
    .option("delimiter", "\t") \
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .save(
        s3_directory,
        format='csv',
        header=True,
        emptyValue='',
        compression="gzip"
    )
This creates the output files as part-xyz.csv.gz. How can I change the config to make it save them as part-xyz.tab.gz, please?
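As far as I know, the CSV writer derives the .csv.gz suffix from the format and codec and exposes no extension option, so one workaround is renaming the part files after the write. A sketch for a local or mounted directory (for S3 the same rename would go through boto3 or the Hadoop FileSystem API):

```python
import glob
import os
import tempfile

def rename_parts(directory, old=".csv.gz", new=".tab.gz"):
    # Rename every Spark part file from part-*.csv.gz to part-*.tab.gz
    for path in glob.glob(os.path.join(directory, "part-*" + old)):
        os.rename(path, path[:-len(old)] + new)

# Demo on a throwaway directory standing in for the Spark output folder
out_dir = tempfile.mkdtemp()
open(os.path.join(out_dir, "part-00000.csv.gz"), "w").close()
rename_parts(out_dir)
```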

Comparing data between two Spark dataframes and populating PASS if they match and FAIL in the corresponding columns
I have two Spark data frames (shown in the omitted screenshots). I am using PySpark to compare the data between the two sources, using Snapshot_Date as the key column, and I want to display the result in another dataframe (also omitted). The compare color coding is for easy understanding and is not needed.
Thanks in advance.

How to parse and write XML stream data using Azure Databricks?
I'm new to Databricks, and we are trying to get streaming messages in XML format from Event Hub using Databricks. We were able to read the streaming messages, but we get an error when trying to parse the streaming XML data. Please suggest.

Compute cosine similarity between every pair of sentences and add the average score of each sentence in a new column
I want to compute the cosine similarity between every pair of sentences as BERT embeddings and add the average score of each sentence in a new column as a rank. I wrote the following code to compute the cosine similarity:

from scipy import spatial
from sent2vec.vectorizer import Vectorizer

for i in range(0, len(features)):
    for j in range(i+1, len(features)):
        sentence_1 = i
        sentence_2 = j
        # note: spatial.distance.cosine returns the cosine *distance*, i.e. 1 - similarity
        temp_sim_value = spatial.distance.cosine(features[i], features[j])
features is a BERT embedding for each sentence in df['sents'], of numpy.ndarray type. Now I want to compute the average cosine score of each sentence over its pairs and add it to the related sentence in the dataframe, as follows:

sents  rank
s1     0.6
s2     0.3
...
how can I do it?
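A vectorized sketch of the averaging step, assuming features is already a 2-D array of embeddings (the toy vectors below stand in for BERT embeddings). The matrix product of L2-normalized rows gives similarities directly, avoiding the distance/similarity confusion from spatial.distance.cosine:

```python
import numpy as np
import pandas as pd

features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])  # stand-in sentence embeddings

# Normalize each row, then one matrix product yields every pairwise cosine
norm = features / np.linalg.norm(features, axis=1, keepdims=True)
sim = norm @ norm.T

# Average similarity of each sentence against the others (drop the diagonal)
n = len(features)
rank = (sim.sum(axis=1) - 1.0) / (n - 1)

df = pd.DataFrame({"sents": ["s1", "s2", "s3"], "rank": rank})
```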

TensorFlow 1: How to calculate tensors' cosine similarity to form a similarity matrix?
First, I have a tensor like this:

a = [[A B], [C D]]

I'd like to calculate the cosine similarity between each pair of rows, i.e. cos([A B],[A B]), cos([A B],[C D]), cos([C D],[A B]) and cos([C D],[C D]), to form a similarity matrix like this:

[[cos([A B],[A B]), cos([A B],[C D])],
 [cos([C D],[A B]), cos([C D],[C D])]]

I tried to use the following code to get the similarity matrix, but it didn't work:
tf.losses.cosine_distance(tf.expand_dims(a, 0), tf.expand_dims(a, 1), axis=2)
How can I use efficient vectorization to do this in TF1? Thanks for your reply.
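The standard vectorized recipe is to L2-normalize the rows and take one matrix product; in TF1 that would be tf.nn.l2_normalize(a, axis=1) followed by tf.matmul(a_norm, a_norm, transpose_b=True). The same computation, sketched in NumPy with toy values for A, B, C, D:

```python
import numpy as np

a = np.array([[1.0, 2.0],   # row 0 plays the role of [A B]
              [3.0, 4.0]])  # row 1 plays the role of [C D]

# L2-normalize each row, then sim[i, j] = cos(row_i, row_j)
a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
sim = a_norm @ a_norm.T
```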

Calculating cosine similarity in pandas
I want to plot a heatmap visualizing the cosine similarity of two dataframes from CSV files. I have two datasets: a.csv, which contains how many GitHub repositories use each specific programming language, and b.csv, which contains how many times two programming languages are used in a common repository. I want to plot a heatmap visualizing the cosine similarity of pairs of programming languages with respect to their co-usage in GitHub repositories.

I used pivot to give them the same shape, but I am not sure whether this is what I have to do to get the cosine similarity.
data1 = pd.read_csv('a.csv')
data1.head
<bound method NDFrame.head of
         cnt         lang
0    1160725   JavaScript
1     871264          CSS
2     814370         HTML
3     671755        Shell
4     567150       Python
..       ...          ...
333        4      Omgrofl
334        4      Befunge
335        4       RUNOFF
336        3  NetLinx+ERB
337        0          NaN

[338 rows x 2 columns]>

data2 = pd.read_csv('b.csv')
data2.head
<bound method NDFrame.head of
            lgn       t2_lgn     cnt
0    JavaScript          CSS  716441
1    JavaScript         HTML  602955
2          HTML          CSS  589971
3         Shell   JavaScript  221484
4         Shell       Python  217501
..          ...          ...     ...
995     Gnuplot            C    2199
996        Ruby  AppleScript    2192
997          XS            C    2192
998       SQLPL    Batchfile    2189
999      Smarty          C++    2188

[1000 rows x 3 columns]>

a = pd.pivot_table(data1, values=['cnt'], index=['lang'])
a.head
<bound method NDFrame.head of
                cnt
lang
1C Enterprise   315
ABAP            483
AGS Script      730
AMPL            832
ANTLR          2666
...             ...
mupad            79
nesC            510
ooc             130
wisp             25
xBase           356

[337 rows x 1 columns]>

b = pd.pivot_table(data2, values=['cnt'], index=['lgn'], columns=['t2_lgn'])
b.head
<bound method NDFrame.head of
                   cnt
t2_lgn             ASP  ActionScript  ApacheConf  AppleScript  Arduino  Assembly
lgn
Assembly       13367.0           NaN         NaN          NaN      NaN       NaN
Awk            11052.0           NaN         NaN          NaN      NaN   18215.0
Batchfile       4921.0           NaN      9816.0          NaN      NaN   10466.0
Bison              NaN           NaN         NaN          NaN      NaN    3048.0
C              15086.0           NaN      4141.0       2513.0   7981.0   48471.0
...                ...           ...         ...          ...      ...       ...
Visual Basic       NaN           NaN         NaN          NaN      NaN    2471.0
Vue                NaN           NaN         NaN          NaN      NaN       NaN
XS                 NaN           NaN         NaN          NaN      NaN       NaN
XSLT            6534.0           NaN      3401.0          NaN      NaN    7838.0
Yacc            5394.0           NaN         NaN          NaN      NaN   10922.0

...

t2_lgn            VimL
lgn
Assembly           NaN
Awk                NaN
Batchfile          NaN
Bison              NaN
C                  NaN
...                ...
Visual Basic       NaN
Vue                NaN
XS                 NaN
XSLT            5300.0
Yacc               NaN

[94 rows x 86 columns]>
I need to place these dataframes in the cosine similarity formula. The dataframes have different lengths. I found the code below and tried to use it, but it did not work:

sum = 0
suma1 = 0
sumb1 = 0
for i, j in zip(a, b):
    suma1 += i * i
    sumb1 += j * j
    sum += i * j
cosinesim = sum / ((sqrt(suma1)) * (sqrt(sumb1)))
print(cosinesim)
I get an error:

TypeError                                 Traceback (most recent call last)
<ipython-input-265-3d1b0bb4770e> in <module>
     28
     29 for i,j in zip(a,b):
---> 30     suma1 += i * i
     31     sumb1 += j * j
     32     sum += i * j
TypeError: can't multiply sequence by non-int of type 'str'
Thank you for your help!
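For what it's worth, the TypeError comes from iterating zip(a, b): iterating a DataFrame yields its column labels (strings), not rows. A sketch of one way to do it instead: align the two counts on language first, then apply the cosine formula to the aligned numeric vectors (the toy series stand in for the CSV data):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the per-language counts from a.csv and b.csv
a = pd.Series({"Python": 3, "C": 4})
b = pd.Series({"Python": 6, "C": 8, "Go": 1})

# Align on the shared index (language); drop languages missing from either side
aligned = pd.concat([a, b], axis=1, keys=["a", "b"]).dropna()
x = aligned["a"].to_numpy(dtype=float)
y = aligned["b"].to_numpy(dtype=float)

cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
```

For the full pairwise heatmap, sklearn's cosine_similarity on the pivoted matrix (with NaNs filled with 0) is the usual shortcut.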