Optimal number of clusters in data retrieval
I need to figure out a way to determine the best number of clusters for a data set. I'm aware you can run the kmeans function in a for loop and plot the results to see, but how do I return the optimal value computationally, i.e. without looking at the plot?
for loop example:
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
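One computational answer is to pick the k that maximizes the average silhouette score instead of eyeballing the elbow. A minimal sketch in Python with scikit-learn (the toy three-blob data is illustrative, standing in for the question's mydata):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# toy data: three well-separated 2-D blobs, 50 points each
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

# score each candidate k by the mean silhouette of its clustering
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

# the k with the highest silhouette is returned, no plot needed
best_k = max(scores, key=scores.get)
```

The same idea works on the within-cluster sum of squares from the R loop above, e.g. by taking the k at the largest second difference of wss (the numerical "elbow").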
See also questions close to this topic

Mysql query works well at workbench but takes too long in r
I have a query to run in R which retrieves data from the database and performs operations on it. When I run it in MySQL Workbench it works just fine, but in R it takes way too long and may hang the entire system. I also tried to run it at the command prompt but got the error:
Error: memory exhausted (limit reached?)
MySQL query:
library(DBI)
library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), dbname = "mydb", host = "localhost",
                 port = 3306, user = "root", password = "")
pedigree <- dbGetQuery(con, "SELECT aa.name AS person, mother AS mom, father AS dad
  FROM addweight
  LEFT JOIN aa ON addweight.name2 = aa.name2 OR addweight.name = aa.name
  LEFT JOIN death ON addweight.name2 = death.name2 OR addweight.name = death.name
  WHERE ((death.dodeath > CURDATE()
          OR aa.name2 NOT IN (SELECT name2 FROM death)
          OR aa.name NOT IN (SELECT name FROM death))
         AND (dob < CURDATE() AND domove < CURDATE()))")
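The "memory exhausted" error typically means the whole result set is being materialized at once; the usual remedy is to stream it in chunks (in R, dbSendQuery plus repeated dbFetch(res, n = ...)). The same pattern, sketched in Python against a throwaway SQLite database standing in for the MySQL server (table and row counts are made up for illustration):

```python
import sqlite3

# toy in-memory database standing in for the MySQL server
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE addweight (name TEXT, weight REAL)")
con.executemany("INSERT INTO addweight VALUES (?, ?)",
                [(f"animal{i}", float(i)) for i in range(1000)])

# stream the result set in fixed-size chunks instead of loading it all
cur = con.execute("SELECT name, weight FROM addweight")
total = 0
while True:
    chunk = cur.fetchmany(100)   # at most 100 rows held in memory per batch
    if not chunk:
        break
    total += len(chunk)          # process the batch here, then discard it
```

Each batch is processed and dropped before the next is fetched, so peak memory stays bounded by the chunk size rather than the result size.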

Can I use Zeppelin as an alternative to Shiny?
I've read that Zeppelin can also do R visualizations using spark.r
My question is: can I use it to do visualizations based on user inputs? These users would not have any R/Zeppelin technical experience.

R Write data in a file
I'm trying to save data to a file, but every time I hit the save button it saves the new data and deletes the data I already have there. What could be the problem?
saveData <- function(data) {
  data <- as.data.frame(t(data))
  if (exists("responses")) {
    responses <<- rbind(responses, data)
  } else {
    responses <<- data
  }
  write.csv(responses, file = "read.csv", row.names = FALSE)
}
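The likely culprit is that write.csv rewrites the file from scratch, and responses only accumulates within one session; once the session restarts, the first save overwrites the file with a single row. Opening the file in append mode avoids this. A minimal sketch in Python (file path and column names are made up for illustration):

```python
import csv, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "responses.csv")

def save_row(row):
    # "a" (append) keeps earlier rows; "w" would overwrite them,
    # which is the data-loss behavior described in the question
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["name", "value"])  # write the header only once
        writer.writerow(row)

save_row(["a", 1])
save_row(["b", 2])

with open(path) as f:
    rows = list(csv.reader(f))
```

In R the analogous fix is write.table(..., append = TRUE, col.names = !file.exists(path)).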

How to combine regular clustering (like KMeans) with SNA clustering (like Louvain)?
I have a specific clustering problem for which I did not find a known answer:
I have a data set about entities that have both connections to each other and independent variables of their own.
For example, entity Xi might have both (weighted) connections to entities Xj, Xk, Xl, and might have the Color "Red", the shape "Square", and the Length 3.8.
I want to cluster them into groups (hopefully hierarchical clustering), but I am not sure how to combine the independent variables and the connections.
Suppose I use some SNA clustering algorithm, such as Louvain: then all the independent variables (the variables that are not connections) are lost. Suppose instead I use some "spatial" clustering algorithm on the independent variables, such as KMeans: then all the connection data is lost.
What I thought about is converting all the independent variables into a sort of "weak" connection. For example, if I have the variable "Color", I would connect all the "Red"-labeled entities to each other with an extra, lightly weighted connection, to encourage them to group together.
My fear is that such artificial connections might take over the SNA algorithm, and then I will find all of the red entities in one huge cluster (or clique).
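An alternative to injecting artificial edges is to blend the two information sources into a single affinity matrix and run a graph-based clusterer on that; the mixing weight then controls exactly how much the attributes can "take over". A sketch with scikit-learn's SpectralClustering, where the planted two-group data and the weight alpha are illustrative assumptions, not part of the original question:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
n = 60
# independent variables: two attribute blobs of 30 entities each
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])

# connections: random graph, denser within each planted group
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        p = 0.3 if (i < 30) == (j < 30) else 0.02
        if rng.random() < p:
            A[i, j] = A[j, i] = 1.0

alpha = 0.5  # hypothetical mixing weight: attribute similarity vs. edges
S = alpha * rbf_kernel(X) + (1 - alpha) * A

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
```

Sweeping alpha from 0 (pure network) to 1 (pure attributes) makes the trade-off explicit and tunable, rather than baked into hand-crafted edge weights.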

Clustering Customers with Python (sklearn)
I work at an ecommerce company and I'm responsible for clustering our customers based on their transactional behavior. I've never worked with clustering before, so I'm having a bit of a rough time.
1st) I've gathered data on customers and I've chosen 12 variables that specify very nicely how these customers behave. Each line of the dataset represents 1 user, where the columns are the 12 features I've chosen.
2nd) I've removed some outliers and built a correlation matrix in order to check for redundant variables. It turns out some of them are highly correlated (> 0.8 correlation)
3rd) I used sklearn's RobustScaler on all 12 variables to put them on a comparable scale (StandardScaler did a poor job with my silhouette)
4th) I ran KMeans on the dataset and got a very good result for 2 clusters (silhouette of >70%)
5th) I tried doing a PCA after scaling / before clustering to reduce my dimensionality from 12 to 2 and, to my surprise, my silhouette dropped to 30~40%; when I plot the data points, it's just a big mass at the center of the graph.
My question is:
1) What's the difference between RobustScaler and StandardScaler on sklearn? When should I use each?
2) Should I do: Raw Data > Cleaned Data > Normalization > PCA/TSNE > Clustering? Or should PCA come before normalization?
3) Is a 12 > 2 dimension reduction through PCA too extreme? That might be causing the horrible silhouette score.
Thank you very much!
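On question 1: StandardScaler centers on the mean and scales by the standard deviation, both of which outliers drag around, while RobustScaler centers on the median and scales by the IQR, which outliers barely move. A minimal sketch with a toy single-feature array (not the question's dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# one feature whose last value is an extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

std = StandardScaler().fit(x)
rob = RobustScaler().fit(x)

mean_center = std.mean_[0]      # the outlier drags the mean to 22
robust_center = rob.center_[0]  # the median stays at 3
robust_scale = rob.scale_[0]    # IQR of [1,2,3,4,100] is 4 - 2 = 2
```

This is why RobustScaler is the usual choice when outliers remain in the data. On question 2, normalization generally comes before PCA, since PCA's directions of maximum variance are sensitive to the scale of each feature.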

Cluster analysis with nominal, ordinal and metric data
I've got a data set with nominal, ordinal and metric variables. I want to perform a cluster analysis; since I have mixed scales, it seems that k-modes clustering is the most appropriate way to explore the data. Or does anyone have a better approach in mind? I'm thankful for any advice!
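A common alternative to k-modes for mixed scales is a Gower-style distance (mismatch for nominal variables, range-normalized differences for ordinal and metric ones) fed into hierarchical clustering. A hand-rolled sketch on toy data, assuming the ordinal variable is already rank-coded as numbers:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# toy mixed data: one nominal, one ordinal (rank-coded), one metric variable
nominal = np.array(["red", "red", "blue", "blue"])
ordinal = np.array([1.0, 2.0, 3.0, 3.0])
metric = np.array([1.0, 1.2, 8.0, 8.5])

def gower(i, j):
    # nominal: 0/1 mismatch; ordinal/metric: absolute difference over range
    d_nom = float(nominal[i] != nominal[j])
    d_ord = abs(ordinal[i] - ordinal[j]) / (ordinal.max() - ordinal.min())
    d_met = abs(metric[i] - metric[j]) / (metric.max() - metric.min())
    return (d_nom + d_ord + d_met) / 3.0  # unweighted average of the parts

n = len(nominal)
D = np.array([[gower(i, j) for j in range(n)] for i in range(n)])

# average-linkage hierarchical clustering on the precomputed distances
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Because the distance is precomputed, any distance-based method (hierarchical, PAM/k-medoids, DBSCAN with metric="precomputed") works on top of it without forcing the mixed variables into one scale type.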