Document Clustering
I am clustering news articles. I used the Universal Sentence Encoder for the document embeddings and passed them to the HDBSCAN clustering algorithm, using cosine distance as the metric, but the resulting clusters are very problematic. I must note that my dataset is very noisy (it contains ads, comments, etc. that are passed off as articles). What would be the best approach to get good results? My initial idea is to apply a dimensionality reduction technique (e.g. UMAP or NMF) and then pass the result to the KMeans algorithm. The cosine distance matrix I am passing to HDBSCAN is 4000x4000. If I perform dimensionality reduction, how do I choose the target dimensionality (to what extent should I reduce)?
PS. I must note this is my first time working on an NLP task, so please don't be too harsh.
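One common heuristic for the "to what extent should I reduce" question is to keep enough components to explain a fixed share (say 90%) of the variance. A minimal sketch with scikit-learn, using PCA as a stand-in for the reducer and random data in place of the real USE embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the real document embeddings (e.g. 512-dim USE vectors).
embeddings = rng.normal(size=(400, 64))

# Fit a full PCA once and pick the smallest number of components
# whose cumulative explained variance reaches 90%.
pca = PCA().fit(embeddings)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.90)) + 1

# Reduce, then cluster the reduced vectors with KMeans.
reduced = PCA(n_components=n_components).fit_transform(embeddings)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(reduced)
```

UMAP does not expose an explained-variance ratio, so with UMAP the target dimensionality is usually picked empirically (small values like 5 to 50) and validated with a cluster-quality score; the variance heuristic above applies to linear reducers such as PCA or TruncatedSVD.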
See also questions close to this topic

[Python]: mpi4py parallel numpy dot product
So I was trying to parallelize numpy's dot product using mpi4py on a cluster. The basic idea is to split the first matrix into smaller ones, multiply each of the smaller ones with the second matrix, and then stack the results into one.
I am facing an issue though: the result of the parallel multiplication is different from the one computed on a single process, except for the first row.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
world = comm.size
rank = comm.Get_rank()
name = MPI.Get_processor_name()

a = np.random.randint(10, size=(10, 10))
b = np.random.randint(10, size=(10, 10))
c = np.dot(a, b)

# Parallel Multiplication
if world == 1:
    result = np.dot(a, b)
else:
    if rank == 0:
        a_row = a.shape[0]
        if a_row >= world:
            split = np.array_split(a, world, axis=0)
    else:
        split = None
    split = comm.scatter(split, root=0)
    split = np.dot(split, b)
    data = comm.gather(split, root=0)
    if rank == 0:
        result = np.vstack(data)

# Compare matrices
if rank == 0:
    print("{}  {}".format(result.shape, c.shape))
    if np.array_equal(result, c):
        print("Multiplication was successful")
    else:
        print("Multiplication was unsuccessful")
        print(result - c)
I have tried executing the split, scatter, gather, and vstack steps without the dot product, and the gathered, stacked matrix was matrix A. That probably means the gathered pieces aren't getting shuffled between the processes. Since I think it is impossible for np.dot to do the dot product incorrectly, I guess the issue is my algorithm. What am I missing here?
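For what it's worth, the blockwise scheme itself is mathematically sound: splitting a row-wise, multiplying each block by the same b, and stacking reproduces np.dot(a, b) exactly, as this pure-numpy check (my own sketch, not from the original post) shows. One thing worth verifying in the MPI version is that every rank really uses the same b: np.random.randint runs on every process without a shared seed, so each rank likely generates its own a and b, and the scattered blocks of rank 0's a then get multiplied by different b matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(10, size=(10, 10))
b = rng.integers(10, size=(10, 10))

# Row-wise split, per-block dot with the SAME b, then stack:
blocks = np.array_split(a, 4, axis=0)
result = np.vstack([np.dot(block, b) for block in blocks])

# The blockwise result matches the full product exactly.
assert np.array_equal(result, np.dot(a, b))
```

In the MPI program the equivalent fix would be to generate a and b only on rank 0 and broadcast b (e.g. with comm.bcast) before scattering the blocks.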
How can I migrate minikube k8s cluster from Linux to Windows?
How can I migrate a minikube k8s cluster from Linux to Windows? Are there tools or programs for this?

Redis memory usage CLI does not work on cluster
I am using the go-redis library to check the memory usage of a specific key on a Redis cluster. The library fails sporadically with the error "redis: nil", which usually means that it queried the wrong Redis instance for the key. The go-redis library uses the Redis COMMAND command to get the list of arguments for each command and to find the position of the key in the argument list.
Specifically for the memory CLI, the output of the "command" CLI is:
157) 1) "memory"
     2) (integer) 2
     3) 1) readonly
        2) random
     4) (integer) 0
     5) (integer) 0
     6) (integer) 0
According to the Redis documentation (https://redis.io/commands/command), items 4 and 5 are the positions of the first key and the last key in the arguments.
But the values here are zero. According to the MEMORY USAGE documentation (https://redis.io/commands/memoryusage), items 4 and 5 should both have the value 3.
Is this a bug in the output of the redis "command" CLI, or am I misunderstanding this?

optimize matrix multiplication in for loop RcppArmadillo
The aim is to implement a fast version of the orthogonal projective nonnegative matrix factorization (opnmf) in R. I am translating the matlab code available here.
I implemented a vanilla R version, but it is much slower (about 5.5x) than the Matlab implementation on my data (~225000 x 150) for a 20-factor solution.
So I thought using C++ might speed things up, but its speed is similar to R's. I think this can be optimized, but I am not sure how, as I am a newbie to C++. Here is a thread that discusses a similar problem.
Here is my RcppArmadillo implementation.
// [[Rcpp::export]]
Rcpp::List arma_opnmf(const arma::mat & X, const arma::mat & W0,
                      double tol = 0.00001, int maxiter = 10000, double eps = 1e-16) {
  arma::mat W = W0;
  arma::mat Wold = W;
  arma::mat XXW = X * (X.t() * W);
  double diffW = 9999999999.9;
  Rcout << "The value of maxiter : " << maxiter << "\n";
  Rcout << "The value of tol : " << tol << "\n";
  int i;
  for (i = 0; i < maxiter; i++) {
    XXW = X * (X.t() * W);
    W = W % XXW / (W * (W.t() * XXW));
    // W = W % (X*(X.t()*W)) / (W*((W.t()*X)*(X.t()*W)));
    arma::uvec idx = find(W < eps);
    W.elem(idx).fill(eps);
    W = W / norm(W, 2);
    diffW = norm(Wold - W, "fro") / norm(Wold, "fro");
    if (diffW < tol) {
      break;
    } else {
      Wold = W;
    }
    if (i % 10 == 0) {
      Rcpp::checkUserInterrupt();
    }
  }
  return Rcpp::List::create(Rcpp::Named("W") = W,
                            Rcpp::Named("iter") = i,
                            Rcpp::Named("diffW") = diffW);
}
This related issue confirms that Matlab is quite fast, so is there no hope of matching it with R / C++?
The tests were made on Windows 10 and Ubuntu 16 with R version 4.0.0.
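For reference, the multiplicative update in the loop above can be prototyped in a few lines of numpy (a sketch with random data, not the author's benchmark). It also shows why the XXW factoring matters: computing X @ (X.T @ W) never forms the huge X Xᵀ matrix, which for ~225000 rows would be infeasible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 40))   # data matrix (n x m)
W = rng.random((100, 5))    # factor matrix (n x k)
eps = 1e-16

for _ in range(200):
    XXW = X @ (X.T @ W)             # X X^T W, computed right-to-left
    W = W * XXW / (W @ (W.T @ XXW)) # multiplicative opnmf update
    W = np.maximum(W, eps)          # floor tiny entries, as in the C++ code
    W = W / np.linalg.norm(W, 2)    # normalize by the spectral norm
```

If the C++ version is not beating R, the bottleneck is likely the BLAS backend rather than the loop itself; both R and RcppArmadillo delegate these products to whatever BLAS is linked, which is consistent with the Microsoft R Open (MKL-backed) timings below.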
EDIT
After the interesting comments in the answer below, I am posting additional details. I ran tests on a Windows 10 machine with R 3.5.3 (as that's what Microsoft provides), and the comparison shows that RcppArmadillo with Microsoft's R is fastest.
R

   user  system elapsed
 213.76    7.36  221.42

R with RcppArmadillo

   user  system elapsed
 179.88    3.44  183.43

Microsoft's R Open

   user  system elapsed
 167.33    9.96   45.94

Microsoft's R Open with RcppArmadillo

   user  system elapsed
  85.47    4.66   23.56

Short text in the context of topic modeling
I am working on topic modeling and I am curious what exactly counts as short text in this context. For example, for a research paper, would the paper's title and abstract be considered short text?

How to do NMF topic modeling on a .txt file (book)?
I already have code for NMF topic modeling on a .csv file. Now I want to perform it on a .txt file (a book). Is it possible to do NMF topic modeling on a .txt file? If yes, can it be done by changing the existing code (below), or is completely different code needed for .txt files?
Below is the code I used for NMF topic modeling on the CSV file.
import pandas as pd
import numpy as np

reviews_datasets = pd.read_csv(r'Preprocessed file.csv')
reviews_datasets = reviews_datasets.head(20000)
reviews_datasets.dropna()

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = tfidf_vect.fit_transform(reviews_datasets['review'].values.astype('U'))

from sklearn.decomposition import NMF
nmf = NMF(n_components=10, random_state=42)
nmf.fit(doc_term_matrix)

import random
for i in range(10):
    random_id = random.randint(0, len(tfidf_vect.get_feature_names()))
    print(tfidf_vect.get_feature_names()[random_id])

first_topic = nmf.components_[0]
top_topic_words = first_topic.argsort()[-10:]
for i in top_topic_words:
    print(tfidf_vect.get_feature_names()[i])

for i, topic in enumerate(nmf.components_):
    print(f'Top 10 words for topic #{i}:')
    print([tfidf_vect.get_feature_names()[i] for i in topic.argsort()[-10:]])
    print('\n')
I am using Python 3.7. Thanks in Advance.

HDBSCAN cluster caching and persistance
HDBSCAN has a flag to cache its cluster data as a param like mentioned below:
prediction_data : boolean, optional
    Whether to generate extra cached data for predicting labels or
    membership vectors for new unseen points later. If you wish to
    persist the clustering object for later reuse you probably want
    to set this to True. (default False)
Now I see that at a specified location, the below folder structure is created:
> joblib
  > hdbscan
    > hdbscan_
      > _hdbscan_boruvka_balltree
        > f1bd5f351764560c3532dbe30f273481
            metadata.json
            output.pkl
        func_code.py
As the HDBSCAN docs suggest, we can use these files (probably the pickle file) as a persistence store, and they can later be reused for finding cluster labels for new data points. But I can't find a way of doing it.

Python HDBScan class always fails on second iteration before even entering first function
I am attempting to look at conglomerated outlier information, utilizing several different scikit-learn, HDBSCAN, and custom outlier-detection classes. However, for some reason I consistently run into an error where any class utilizing HDBSCAN cannot be iterated over, while all the other scikit-learn and custom classes can. The issue consistently occurs on the second pass over the HDBSCAN class and happens immediately upon algorithm.fit(tmp). Upon debugging the script, it looks like the error is thrown before execution even reaches the first line of the class.
Any help? Below is the minimum viable reproduction:
import numpy as np
import pandas as pd
import hdbscan
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

class DBClass():

    def __init__(self, random = None):
        self.random = random

    def fit(self, data):
        self.train_data = data
        cluster = hdbscan.HDBSCAN()
        cluster.fit(self.train_data)
        self.fit = cluster

    def predict(self, data):
        self.predict_data = data
        if self.train_data.equals(self.predict_data):
            return self.fit.probabilities_

def OutlierEnsemble(df, anomaly_algorithms = None, num_slices = 5,
                    num_columns = 7, outliers_fraction = 0.05):
    if isinstance(df, np.ndarray):
        df = pd.DataFrame(df)
    assert isinstance(df, pd.DataFrame)

    if not anomaly_algorithms:
        anomaly_algorithms = [
            ("Robust covariance", EllipticEnvelope(contamination=outliers_fraction)),
            ("OneClass SVM", OneClassSVM(nu=outliers_fraction, kernel="rbf")),
            ("Isolation Forest", IsolationForest(contamination=outliers_fraction)),
            ("HDBScan LOF", DBClass()),
        ]

    data = []
    for i in range(1, num_slices + 1):
        data.append(df.sample(n = num_columns, axis = 1, replace = False))

    predictions = []
    names = []
    for tmp in data:
        counter = 0
        for name, algorithm in anomaly_algorithms:
            algorithm.fit(tmp)
            predictions.append(algorithm.predict(tmp))
            counter += 1
            names.append(f"{name}{counter}")
    return predictions

blobs, labels = make_blobs(n_samples=3000, n_features=12)
OutlierEnsemble(blobs)
The error provided is not the most helpful.
Traceback (most recent call last):

  File "<ipython-input-4-e1d4b63cfccd>", line 75, in <module>
    OutlierEnsemble(blobs)

  File "<ipython-input-4-e1d4b63cfccd>", line 66, in OutlierEnsemble
    algorithm.fit(tmp)

TypeError: 'HDBSCAN' object is not callable
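One likely culprit (my reading of the traceback, not confirmed in the original post): DBClass.fit stores the fitted clusterer as self.fit, so the instance attribute shadows the method, and the second pass tries to call an HDBSCAN object instead of the method. A stripped-down reproduction of the same failure mode:

```python
class Shadow:
    def fit(self, data):
        # Rebinding "fit" on the instance shadows the method itself.
        self.fit = data

s = Shadow()
s.fit([1, 2, 3])          # first call works: fit is still the bound method

second_call_failed = False
try:
    s.fit([4, 5])         # second call: s.fit is now the list stored above
except TypeError:         # "'list' object is not callable"
    second_call_failed = True
```

Storing the model under a different attribute name (e.g. self.model = cluster) avoids the collision.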

DBscan Machine Learning
I am now working with the DBSCAN algorithm, but I have a big problem with the high dimensionality of the dataset (sklearn make_blobs): it has 100 dimensions and 2 or 3 million points. I tried parameters such as DB = DBSCAN(eps=epsilon, algorithm='ball_tree', min_samples=min_samples, n_jobs=-1).fit(X), but the server died; I had set n_jobs = -1 as the number of parallel jobs to run (-1 means using all processors).
I have been trying to run it on an AWS server with 128 GB of RAM, but the process is always killed without an obvious reason. My professor tells me I should use indexing or parallelism, and I have read a lot about Spark, OpenMP, and indexing techniques, but I haven't found a good example. I have only found pseudocode from papers (https://arxiv.org/pdf/1912.06255.pdf).
Please give me a suggestion on how to do this in parallel. I don't have much time.
Thanks
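If exact DBSCAN over 2-3 million 100-dimensional points won't fit in memory, one common workaround (a sketch, not a drop-in for the original setup) is to cluster a random sample and then assign each remaining point the label of its nearest sampled point:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

# Smaller stand-in for the real 2-3M x 100 dataset.
X, _ = make_blobs(n_samples=20000, n_features=10, centers=5, random_state=0)

# 1. Cluster a manageable random sample with DBSCAN.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=2000, replace=False)
sample = X[idx]
db = DBSCAN(eps=3.0, min_samples=10, n_jobs=-1).fit(sample)

# 2. Assign every remaining point the label of its nearest sampled point.
nn = NearestNeighbors(n_neighbors=1).fit(sample)
_, nearest = nn.kneighbors(X)
labels = db.labels_[nearest.ravel()]
```

The memory cost is then dominated by the sample, not the full dataset, and the nearest-neighbor assignment step can itself be processed in chunks. The eps and min_samples values above are placeholders tuned to this synthetic data, not recommendations for the real dataset.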