Optimal number of clusters in data retrieval
I need to figure out a way to determine the best number of clusters for a set of data. I'm aware you can run the kmeans function in a for loop and plot the results to see, but how do I return the most optimal value computationally, i.e. without looking at the plot?
for loop example:

    wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
    for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
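One way to pick k without inspecting the plot is to detect the elbow numerically, e.g. as the point of largest second difference (maximum curvature) of the WSS curve. A minimal sketch in base R, using the built-in iris data purely for illustration; the 2:15 range, `nstart = 20`, and the seed are arbitrary choices:

```r
# Pick k automatically via the "elbow" of the within-cluster sum of squares
mydata <- iris[, 1:4]  # illustrative numeric data

set.seed(42)
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))  # WSS for k = 1
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i, nstart = 20)$tot.withinss)

# The second difference approximates curvature; its maximum marks the elbow.
second_diff <- diff(diff(wss))
best_k <- which.max(second_diff) + 1  # second_diff[1] corresponds to k = 2

best_k
```

Other common computational criteria (average silhouette width via the cluster package, or the 30 indices in NbClust) follow the same pattern: score each k, then take the arg max/min.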
See also questions close to this topic
- Remove the nth percentage of cells randomly from a raster in R
Python 3 / R: UCI robot failure data set
Hi, I am trying to get this data into Python 3 / R. How do I read it into .csv format?
Preprocessing tweets.json file in R
I have extracted tweets into a JSON file using twitterscraper. Now I have to preprocess these tweets in R. How do I handle the tweets JSON file in R? Can anyone help me with this problem?
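If the twitterscraper export is a JSON array of tweet objects, the jsonlite package can read it straight into a data frame. A minimal sketch, assuming jsonlite is installed and the export has a `text` field; the two-tweet string below is a made-up stand-in for the real file:

```r
library(jsonlite)

# Made-up stand-in for the contents of tweets.json;
# for the real file use: tweets <- fromJSON("tweets.json")
json_txt <- '[{"text": "hello world http://example.com"}, {"text": "second tweet"}]'
tweets <- fromJSON(json_txt)  # simplifies an array of objects to a data frame

# Basic preprocessing: strip URLs and lowercase the text
tweets$text <- tolower(gsub("http\\S+", "", tweets$text))
```

From there the usual text-mining workflow (tm or tidytext) applies to the `text` column.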
Need ideas about Clustering of data with R
I have item data and need to cluster it with K-Means, DBSCAN, or hierarchical clustering, where the goal is to find critical customers. Each customer (column `customer`) has many invoices (column `invoice`), and each invoice has many items. Every other column is additional information that characterizes the item.
If I run a clustering method on this data, I think I might get clusters that correspond exactly to customers (by similarity), which is not helpful if I want to label customers as critical or not. Maybe I am wrong; that is why I need the opinion of somebody who is familiar with data mining in R.
I have the following ideas:
1. Discard the Customer column. (But then I somehow need to work out which cluster a particular customer mostly falls into.)
2. Use something like ORDER BY in clustering (not sure that is possible).
Maybe I am thinking too narrowly and missing something simpler. I am doing data mining in R for the first time and don't have much experience.
Thanks in advance.
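Idea 1 above can be made to work: cluster the item rows without the customer column, then cross-tabulate customers against the cluster assignments to see where each customer's items concentrate. A minimal sketch with made-up data; the column names, the random features, and the choice of k = 3 are all illustrative:

```r
# Cluster item rows without the customer column, then map customers back
set.seed(1)
df <- data.frame(
  customer = rep(c("A", "B", "C"), each = 10),  # hypothetical customers
  feat1 = rnorm(30),                            # hypothetical item features
  feat2 = rnorm(30)
)

km <- kmeans(df[, c("feat1", "feat2")], centers = 3, nstart = 10)

# Rows = customers, columns = clusters: shows which cluster(s) each
# customer's items mostly land in
table(df$customer, km$cluster)
```

The row profiles of that table (per-customer cluster proportions) can themselves be used as features to label or cluster customers.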
Birch package replacement in latest R version
I am studying the BIRCH algorithm in R and tried to install the old birch package in the latest R version (3.4.1), but it doesn't work.
Is there any solution for this case? Is there a replacement for the birch package for the BIRCH algorithm, or any better way to make the old birch package work in the latest R (version 3.4.1)?
Edit: I have seen this post and tried to install the archived birch package on Windows using Rtools. The installation completed without issue, but using the package in R always causes a fatal error (an aborted R session).
So I am just wondering whether some newer package has replaced this old birch package to implement the BIRCH algorithm.
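For completeness: archived CRAN packages can also be installed by version with the remotes package, though that will not by itself fix a crash in the package's compiled code. A sketch of the command, where the version string is illustrative, not a recommendation:

```r
# Install a specific archived version from the CRAN archive
# (requires the remotes package; the version number shown is illustrative)
remotes::install_version("birch", version = "1.1-3",
                         repos = "https://cloud.r-project.org")
```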
BigQuery topic modelling of ecommerce data
I have some dummy user data from an ecommerce site like this (hosted on BigQuery):
    user | search_query          | time_searched
    -----|-----------------------|--------------
    abc  | japanese dvd          | T1
    abc  | canon cartridge       | T2
    abc  | canon cartridge tx100 | T3
    abc  | ink tx100 canon       | T4
    abc  | printer ink canon     | T5
    xyz  | nike shoes            | T1
I would like to count the number of search attempts made per user, for each type of item.
The problem comes with clustering/sorting which search queries point to the same item.
For example, the user above made four searches while browsing for printer ink cartridges, even though they are phrased differently. He made one search for a Japanese DVD, which is clearly a separate category from printer ink.
What I've tried:
I've downloaded a smaller subset of the data (n = 5000) to R and tried analysing it with the topicmodeling package, which runs LDA. However, it didn't work well; I suspect the search queries were too short to model anything.
I thought of hard-coding a set of "keywords" (i.e., the main types of items people tend to search for), counting the number of times a user searches for each of them, and making an estimate from there. However, I'd rather not do this unless there is no more accurate way.
What's the best way to approach this problem? If it's a specific type of machine learning, which areas should I look at?
I'm new to ML but willing to learn. Just point me in the right direction.
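Since the queries are so short, one alternative to topic models is to cluster the queries directly by token overlap (Jaccard distance) and cut the resulting dendrogram. A base-R sketch on the example queries from the table above; the 0.8 cut height is an arbitrary illustrative threshold:

```r
# Group short search queries by token overlap (Jaccard distance)
queries <- c("japanese dvd", "canon cartridge", "canon cartridge tx100",
             "ink tx100 canon", "printer ink canon", "nike shoes")
tokens <- strsplit(queries, "\\s+")

# Jaccard distance between two token sets: 1 - |intersection| / |union|
jaccard <- function(a, b) 1 - length(intersect(a, b)) / length(union(a, b))

n <- length(queries)
d <- matrix(0, n, n)
for (i in 1:n) for (j in 1:n) d[i, j] <- jaccard(tokens[[i]], tokens[[j]])

# Average-linkage hierarchical clustering, cut at an illustrative height
hc <- hclust(as.dist(d), method = "average")
groups <- cutree(hc, h = 0.8)
split(queries, groups)
```

On this toy input the four printer-ink queries end up in one group, with "japanese dvd" and "nike shoes" each in their own, which is the grouping the question describes. For larger data, the same idea scales better with the stringdist package or with character n-gram features.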