Soft cosine distance between two vectors (Python)
I am wondering whether there is a good way to calculate the soft cosine distance between two vectors of numbers. So far I have only found solutions for sentences, which unfortunately did not help me.
Say I have two vectors like this:
a = [0,.25,.25,0,.5]
b = [.5,.0,.0,0.25,.25]
Now, I know that the features in the vectors exhibit some degree of similarity among them. This is described via:
s = [[0,.67,.25,0.78,.53],
     [.53,0,.33,0.25,.25],
     [.45,.33,0,0.25,.25],
     [.85,.04,.11,0,0.25],
     [.95,.33,.44,0.25,0]]
So a and b are 1x5 vectors, and s is a 5x5 matrix describing how similar the features in a and b are.
Now, I would like to calculate the soft cosine distance between a and b, accounting for between-feature similarity. I found this formula, which should calculate what I need:
[image: soft cosine formula]
I already tried implementing it using numpy:
import numpy as np
soft_cosine = 1 - (np.dot(a, np.dot(s, b)) / (np.sqrt(np.dot(a, np.dot(s, b))) * np.sqrt(np.dot(a, np.dot(s, b)))))
It is supposed to produce a number between 0 and 1, with a higher number indicating a higher distance between a and b. However, I am running this on a larger dataframe with multiple vectors a and b, and for some it produces negative values. Clearly, I am doing something wrong.
Any help is greatly appreciated, and I am happy to clarify anything that needs clarification!
Best, Johannes
1 answer

From what I can see, it may just be a formula error. Could you please try with mine?
soft_cosine = a @ (s@b) / np.sqrt( (a @ (s@a) ) * (b @ (s@b) ) )
I use the @ operator (which is shorthand for np.matmul on ndarrays), as I find it cleaner to write: it's just matrix multiplication, no matter if 1D or 2D. It is a simple way to compute a dot product between two 1D arrays, with less code than the usual np.dot function.
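For reference, here is a small self-contained sketch of that formula applied to the vectors from the question. One assumption is flagged in the code: the s in the question has zeros on its diagonal, but in the usual definition each feature is fully similar to itself (s_ii = 1); with a zero diagonal, the "similarity" can exceed 1, so 1 minus it goes negative, which matches what the asker observed.

```python
import numpy as np

a = np.array([0, .25, .25, 0, .5])
b = np.array([.5, .0, .0, 0.25, .25])
s = np.array([[0, .67, .25, 0.78, .53],
              [.53, 0, .33, 0.25, .25],
              [.45, .33, 0, 0.25, .25],
              [.85, .04, .11, 0, 0.25],
              [.95, .33, .44, 0.25, 0]])
np.fill_diagonal(s, 1)  # assumption: each feature is fully similar to itself

# numerator a^T S b, denominator sqrt(a^T S a) * sqrt(b^T S b)
soft_cosine_sim = a @ (s @ b) / np.sqrt((a @ (s @ a)) * (b @ (s @ b)))
soft_cosine_dist = 1 - soft_cosine_sim
print(soft_cosine_sim, soft_cosine_dist)
```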
See also questions close to this topic

SyntaxError: invalid syntax python, what's the cause?
I tried this
for i in range(len(current_pos)):
    vector.append(to_pos[i] - current_pos[i])
distance = math.sqrt((vector[0]**2) + (vector[1]**2))
angle = math.degrees(math.atan(vector[1]/vector[0]))
print(vector[1], vector[0])
SyntaxError: invalid syntax
What is the cause of this syntax error?

Binning multidimensional array in numpy
I have a 4d numpy array (these are stacks of imaging data) and would like to perform mean binning along all but one of the axes.
starting with say
x=np.random.random((3,100,100,100))
I want to apply binning to axes 1,2,3 with bin size 10 and average the values in each bin.
expected result would be an array of shape (3,10,10,10)
I have looked into np.reshape like so:
result = x.reshape(3, -1, 10, 100, 100).mean(axis=1)
result = result.reshape(3, 10, -1, 10, 100).mean(axis=2)
and so on, but this messes up the structure of the image arrays. Is there a more straightforward way to do this?
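A common pattern for this (a sketch, not from the question) is a single reshape that splits each binned axis into (number of bins, bin size), followed by one mean over the per-bin element axes:

```python
import numpy as np

x = np.random.random((3, 100, 100, 100))

# Split each of axes 1-3 into (10 bins, 10 elements per bin), giving shape
# (3, 10, 10, 10, 10, 10, 10), then average over the bin-size axes (2, 4, 6).
binned = x.reshape(3, 10, 10, 10, 10, 10, 10).mean(axis=(2, 4, 6))
print(binned.shape)  # (3, 10, 10, 10)
```

Because all splits happen in one reshape, each output cell is exactly the mean of one 10x10x10 block of the original stack.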

package install issues after upgrading pip to 21.1.1
After upgrading pip to 21.1.1, I get the following errors when trying to install some private packages. (It works fine with pip 20.2.3. Any idea how I can get this resolved?)
ERROR: Could not find a version that satisfies the requirement privatepackagename (unavailable) (from versions: none)
ERROR: No matching distribution found for privatepackagename (unavailable)
Why does my Webscraper built using python return an empty list when it should return scraped data?
I am trying to scrape product details such as product name, price, category, and color from https://nike.co.in. Despite giving the script what I believe is the correct XPath, it does not seem to scrape the details and gives an empty list. Here's my complete script:
import time
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager


def scrape_nike(shop_by_category):
    website_address = ['https://nike.co.in']
    options = webdriver.ChromeOptions()
    options.add_argument('--start-maximized')
    options.add_argument("--window-size=1200x600")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
    delays = [7, 4, 6, 2, 10, 19]
    delay = np.random.choice(delays)
    for crawler in website_address:
        browser.get(crawler)
        time.sleep(2)
        time.sleep(delay)
        browser.find_element_by_xpath('//*[@id="VisualSearchInput"]').send_keys(shop_by_category, Keys.ENTER)
        product_price = browser.find_elements_by_xpath('//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[3]/div/div/div/div')
        product_price_list = [elem.text for elem in product_price]
        product_category = browser.find_elements_by_xpath('//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[1]/div/div[2]')
        product_category_list = [elem.text for elem in product_category]
        product_name = browser.find_elements_by_xpath('//*[@id="Nike Air Zoom Vomero 15"]')
        product_name_list = [elem.text for elem in product_name]
        product_colors = browser.find_elements_by_xpath('//*[@id="Wall"]/div/div[5]/div/main/section/div/div[4]/div/figure/div/div[2]/div/button/div')
        product_colors_list = [elem.text for elem in product_colors]
        print(product_price_list)
        print(product_category_list)
        print(product_name_list)
        print(product_colors_list)


if __name__ == '__main__':
    category_name_list = ['running']
    for category in category_name_list:
        scrape_nike(category)
The output that I want is something like:
[Rs 1000, Rs 2990, Rs 3000,....] [Mens running shoes, Womens running shoes, ...] [Nike Air Zoom Pegasus, Nike Quest 3, ...] [5 colors, 1 colors, 3 colors, ...]
But the output that I am getting right now is:
[] [] [] []
What is the exact issue because of which I am getting empty lists? I do not understand. Please help!!
EDIT: I am now able to get just a single product's details in my lists, whereas I want all products. Here's my change in the code:
product_price = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[3]/div/div/div/div')))
product_price_list = [elem.text for elem in product_price]
product_category = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[1]/div/div[2]')))
product_category_list = [elem.text for elem in product_category]
product_name = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Nike Air Zoom Vomero 15"]')))
product_name_list = [elem.text for elem in product_name]
product_colors = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Wall"]/div/div[5]/div/main/section/div/div[4]/div/figure/div/div[2]/div/button/div')))
product_colors_list = [elem.text for elem in product_colors]
This gives:
['₹13,495'] ["Men's Running Shoe"] ['Nike Air Zoom Vomero 15'] ['5 Colours']
I want multiple such entries
EDIT 2: I have also tried using beautifulsoup4, but that also returned an empty output.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd


def adidas(shop_by_category):
    driver = webdriver.Chrome("F:\\chromedriver\chromedriver.exe")
    titles = []    # List to store name of the product
    prices = []    # List to store price of the product
    category = []  # List to store category of the product
    colors = []    # List to store the no of colors of the product
    # URL to fetch from. Can be looped over / crawled multiple urls
    driver.get('https://nike.co.in')
    driver.find_element_by_xpath('//*[@id="VisualSearchInput"]').send_keys(shop_by_category, Keys.ENTER)
    content = driver.page_source
    soup = BeautifulSoup(content, features="lxml")
    # Parsing content
    for div in soup.findAll('div', attrs={'class': 'product-card__body'}):
        name = div.find('div', attrs={'class': 'product-card__title'})
        price = div.find('div', attrs={'class': 'product-price css-11s12ax is--current-price'})
        subtitle = div.find('div', attrs={'class': 'product-card__subtitle'})
        color = div.find('div', attrs={'class': 'product-card__product-count'})
        titles.append(name.text)
        prices.append(price.text)
        category.append(subtitle.text)
        colors.append(color.text)
    # Storing scraped content
    df = pd.DataFrame({'Product Name': titles, 'Price': prices, 'Category': category, 'Colors': colors})
    df.to_csv('adidas.csv', index=False, encoding='utf8')


if __name__ == '__main__':
    category_name_list = ['running']
    for category in category_name_list:
        adidas(category)

how can I filter using pandas dataframes by multiple values in one column
I have an Excel list with 2 columns (a and b) and 502000 rows. I also have a second list with 1 column (a) and 55 rows.
How can I filter the first dataframe by the 55 values in that one column using the second list, and then write the result to a different file?
Thanks
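A minimal sketch with made-up stand-in data (in practice both frames would come from pd.read_excel or pd.read_csv); the usual tool is Series.isin, which keeps exactly the rows whose column value appears in the second list:

```python
import pandas as pd

# Hypothetical stand-ins for the two files from the question.
big = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': ['u', 'v', 'w', 'x', 'y']})
small = pd.DataFrame({'a': [2, 4]})

# Keep only rows of the big list whose 'a' value appears in the small list.
filtered = big[big['a'].isin(small['a'])]
print(filtered)

# Write the result to a different file (to_excel works the same way).
filtered.to_csv('filtered.csv', index=False)
```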

actively rotating an image using a Kivy.garden Knob
I am trying to use a knob to rotate an image (in my case a polarizer) in kivy. My idea is a function that is active while the knob is being pressed, reads the angle, and applies it to the image's canvas. The problem is that I don't know whether there is a knob property that gives me its "status", so that I can do something like:
while knob_is_active:
    self.canvasid.angle = self.knobid.value
Thanks!

Remove points from a plot legend
I have this code that shows the Lagrange interpolation between sets of points (x, y coordinates), using matplotlib:
import numpy as np
from scipy.interpolate import lagrange
import matplotlib.pyplot as plt

x1 = [0.2, 0.4, 0.6, 0.8]
y1 = [1, 2, 4, 6]
x2 = [0.2, 0.4, 0.6, 0.8]
y2 = [3, 10, 19, 43]
x_new = np.arange(0.2, 0.8, 0.1)
x_new2 = np.arange(0.2, 0.8, 0.1)
f = lagrange(x1, y1)
f2 = lagrange(x2, y2)
fig = plt.figure(figsize=(8, 6))
plt.plot(x_new, f(x_new), 'y', x1, y1, 'ro', label='$r = 20')
plt.plot(x_new2, f2(x_new2), 'b', x2, y2, 'ro', label='$r = 40')
plt.legend(loc='best')
plt.title('Lagrange Polynomial')
plt.grid()
plt.xlabel('r')
plt.ylabel('cut size')
plt.show()
My output:
I want to remove the red points in the legend. How can I do that?
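One common fix (a sketch, not a confirmed solution from the thread): give the curve and the points separate plot calls and label only the curve; artists whose label starts with an underscore (such as '_nolegend_') are skipped by plt.legend, so the red points never appear in it:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

x1, y1 = [0.2, 0.4, 0.6, 0.8], [1, 2, 4, 6]

fig = plt.figure()
plt.plot(x1, y1, 'y', label='$r = 20$')      # curve: labelled, shows in legend
plt.plot(x1, y1, 'ro', label='_nolegend_')   # points: hidden from the legend
leg = plt.legend(loc='best')
print([t.get_text() for t in leg.get_texts()])  # ['$r = 20$']
```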

Fitting a Quadratic-Plateau in python - scipy optimize.curve_fit of a function whose return value depends on a conditional parameter
I'm trying to fit a quadratic-plateau model to agricultural data; in particular, nitrogen fertilization and the corn yield response to it. It's a common practice in research.
It's very common to do this in R, as in the following example: https://gradcylinder.org/quadplateau/
But it lacks examples and resources when it comes to Python. I've managed to find a great library called eonr (https://eonr.readthedocs.io/en/latest/) that does what I'm looking for (and much more), but I need more flexibility and more options for visualization.
Through the eonr gallery I found the function it uses and the parameters for the fitting done by scipy.curve_fit.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

x = df['N_Rate'].values.reshape(-1)
y = df['Yield'].values.reshape(-1)

def quad_plateau(x, b0, b1, b2):
    crit_x = -b1/(2*b2)
    y = 0
    y += (b0 + b1*x + b2*(x**2)) * (x < crit_x)
    y += (b0 - (b1**2) / (4*b2)) * (x >= crit_x)
    return y

guess = [10, 0.0001, 10]
popt, pcov = curve_fit(quad_plateau, x, y, p0=guess, maxfev=1500)
plt.plot(x, y, 'bo')
plt.plot(x, quad_plateau(x, *popt), 'r')
plt.show()
I overcame a lot of issues, but I can't understand why the graph shows only the linear part of the curve... what am I doing wrong? Thanks a lot!!

nnz is too large even though I have set max_features to 500 and floats to 32-bit
I would like to find the similarity between teleplays.
from sklearn.feature_extraction.text import TfidfVectorizer

# define a tfidf vectorizer object and remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english', max_features=500)

# construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(merged_data['name'])

# output the shape of the tfidf_matrix
tfidf_matrix.shape
The shape of the tfidf_matrix is (6472256, 500)
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel, sigmoid_kernel

# we use linear kernel similarity since it is faster than cosine similarity
# computing cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

from sparse_dot_mkl import dot_product_mkl
result = dot_product_mkl(random_tc, random_p)
But I got the error message:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-78-1e4fbfd6e689> in <module>
      5 #computing cosine similarity matrix
      6
----> 7 cosine_sim=cosine_similarity(tfidf_matrix,tfidf_matrix)
      8 from sparse_dot_mkl import dot_product_mkl
      9

/opt/conda/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
   1187
   1188     K = safe_sparse_dot(X_normalized, Y_normalized.T,
-> 1189                         dense_output=dense_output)
   1190
   1191     return K

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64
     65             # extra_args > 0

/opt/conda/lib/python3.7/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    150             ret = np.dot(a, b)
    151     else:
--> 152         ret = a @ b
    153
    154     if (sparse.issparse(a) and sparse.issparse(b)

/opt/conda/lib/python3.7/site-packages/scipy/sparse/base.py in __matmul__(self, other)
    558             raise ValueError("Scalar operands are not allowed, "
    559                              "use '*' instead")
--> 560         return self.__mul__(other)
    561
    562     def __rmatmul__(self, other):

/opt/conda/lib/python3.7/site-packages/scipy/sparse/base.py in __mul__(self, other)
    478             if self.shape[1] != other.shape[0]:
    479                 raise ValueError('dimension mismatch')
--> 480             return self._mul_sparse_matrix(other)
    481
    482         # If it's a list or whatever, treat it like a matrix

/opt/conda/lib/python3.7/site-packages/scipy/sparse/compressed.py in _mul_sparse_matrix(self, other)
    507             np.asarray(self.indices, dtype=idx_dtype),
    508             np.asarray(other.indptr, dtype=idx_dtype),
--> 509             np.asarray(other.indices, dtype=idx_dtype))
    510
    511         idx_dtype = get_index_dtype((self.indptr, self.indices,

RuntimeError: nnz of the result is too large
Therefore, I tried changing the data from float64 to float32:
tfidf_matrix=tfidf_matrix.astype(np.float32)
But I still got the error message. Should I reduce the size of the tfidf_matrix? By how much? I am currently using Google Colab / Kaggle Notebook and Jupyter Notebook, and I am trying to find a platform that can run this so I can predict the ratings.
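Changing dtype will not help here: a 6,472,256 x 6,472,256 similarity matrix cannot be materialized regardless of float size. A common workaround (a sketch with toy data, not from the question) is to process rows in chunks and keep only the top-k most similar items per row, so only a chunk of the similarity matrix exists at any time:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def top_k_cosine(matrix, k=3, chunk=1000):
    """Yield (row index, top-k column indices, top-k similarities) per row,
    never holding more than `chunk` rows of the similarity matrix at once."""
    n = matrix.shape[0]
    for start in range(0, n, chunk):
        block = cosine_similarity(matrix[start:start + chunk], matrix)  # (chunk, n)
        for i, row in enumerate(block):
            idx = np.argpartition(row, -k)[-k:]      # unordered top-k
            idx = idx[np.argsort(row[idx])[::-1]]    # sort descending
            yield start + i, idx, row[idx]

# toy demo: 5 items, 4 features (each row's best match is itself, sim = 1.0)
np.random.seed(0)
m = sparse.csr_matrix(np.random.rand(5, 4))
for i, idx, sims in top_k_cosine(m, k=2, chunk=2):
    print(i, idx, sims)
```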

Calculate cosine similarity and create a graph in R
I have datasets like this:
x <- c(1,4,6,8,0,5)
y <- c(2,3,5,8,1,14)
z <- c(3,5,23,51,3,15)
t <- c(14,14,23,4,16,17)
NB: I actually have 20 vectors; this is a simplified example.
I want to automatically calculate the cosine similarity between all vectors and then create a network of these vectors (each vector connected to all the others), where the width of the edge depends on the value of the cosine similarity. I hope I explained myself well. Thanks

iterate through nested dictionary and calculate cosine similarity
dic = {'Michael_Bay': {'Bruce Willis': 1, 'Ben Affleck': 2, 'Liam Neeson': 2},
       'Steven_Spielberg': {'Liam Neeson': 1, 'Tome Hanks': 1, 'Denzel Washington': 1}}
I need to compare the keys Michael_Bay and Steven_Spielberg, and if an actor doesn't exist in the second director's values, append 0.
The result should be as below:
Michael Bay vector = (1,2,2,0,0)
Steven Spielberg vector = (0,0,1,1,1)
Finally: calculate the cosine similarity between Michael Bay and Steven Spielberg from the above vectors. Please help.
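A sketch of one way to do this (using the dictionary from the question): build a shared actor axis from the union of both directors' actors, fill missing actors with 0, then apply the plain cosine formula:

```python
import math

dic = {'Michael_Bay': {'Bruce Willis': 1, 'Ben Affleck': 2, 'Liam Neeson': 2},
       'Steven_Spielberg': {'Liam Neeson': 1, 'Tome Hanks': 1, 'Denzel Washington': 1}}

# Shared axis: every actor either director has, in first-seen order.
actors = list(dict.fromkeys(a for d in dic.values() for a in d))

# Missing actors get 0, matching the vectors in the question.
vectors = {director: [ratings.get(a, 0) for a in actors]
           for director, ratings in dic.items()}
# vectors['Michael_Bay']      -> [1, 2, 2, 0, 0]
# vectors['Steven_Spielberg'] -> [0, 0, 1, 1, 1]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

sim = cosine(vectors['Michael_Bay'], vectors['Steven_Spielberg'])
print(round(sim, 4))  # 0.3849
```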