Is there any clustering algorithm that can work with data which has linear clusters?
Is there any clustering algorithm that can work with data that has linear clusters and does not require a predefined number of clusters? If there isn't, how can I solve this problem? One term I have come across is clusterwise linear regression.
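One family worth trying is density-based clustering: DBSCAN, for example, needs no predefined number of clusters and recovers elongated, line-shaped groups, because it chains together nearby points regardless of cluster shape. A minimal sketch with scikit-learn on two synthetic linear clusters (the data, eps, and min_samples values here are illustrative assumptions, not a recipe):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two elongated, well-separated "linear" clusters around the lines
# y = 2x and y = -x + 40 (purely illustrative data)
t = rng.uniform(0, 10, 200)
line1 = np.column_stack([t, 2 * t + rng.normal(0, 0.1, 200)])
line2 = np.column_stack([t, -t + 40 + rng.normal(0, 0.1, 200)])
X = np.vstack([line1, line2])

# eps/min_samples are illustrative; no number of clusters is supplied
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(len(set(labels) - {-1}))  # clusters found (label -1 marks noise)
```

If the clusters really are lines and you also want their slopes and intercepts, clusterwise linear regression or RANSAC-style iterative line fitting are the more specialized options.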
See also questions close to this topic

Suggestions on full-text search or existing search algorithms
Can someone suggest how to solve the search problem below? Is there an existing algorithm for this, or would full-text search suffice?
The items are classified as follows:
ItemCategory  ItemCluster      ItemSubCluster     SubCluster   Items
Vegetable     Root vegetables  Root               WithOutSkin  potato, sweet potato, yam
Vegetable     Root vegetables  Root               WithSkin     onion, garlic, shallot
Vegetable     Greens           Leafy green        Leaf         lettuce, spinach, silverbeet
Vegetable     Greens           Cruciferous        Flower       cabbage, cauliflower, Brussels sprouts, broccoli
Vegetable     Greens           Edible plant stem  Stem         celery, asparagus
The inputs will be some thing like,
sweet potato, yam
Yam, Potato
garlik, onion
lettuce, spinach, silverbeet
lettuce, silverbeet
lettuce, silverbeet, spinach
From the input, I want to map each item to its ItemCategory, ItemCluster, ItemSubCluster, and SubCluster.
Any help will be much appreciated.
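If the item table is small and fixed like the one above, you may not need full-text search at all: a dictionary lookup plus a fuzzy fallback (for misspellings such as "garlik") can cover inputs like these. A rough sketch in Python; the table literals are copied from the question, everything else is an illustrative assumption:

```python
import difflib

# flattened version of the classification table from the question
TABLE = [
    ("Vegetable", "Root vegetables", "Root", "WithOutSkin", ["potato", "sweet potato", "yam"]),
    ("Vegetable", "Root vegetables", "Root", "WithSkin", ["onion", "garlic", "shallot"]),
    ("Vegetable", "Greens", "Leafy green", "Leaf", ["lettuce", "spinach", "silverbeet"]),
    ("Vegetable", "Greens", "Cruciferous", "Flower", ["cabbage", "cauliflower", "Brussels sprouts", "broccoli"]),
    ("Vegetable", "Greens", "Edible plant stem", "Stem", ["celery", "asparagus"]),
]

# item name -> (ItemCategory, ItemCluster, ItemSubCluster, SubCluster)
INDEX = {}
for cat, cluster, subcluster, sub, items in TABLE:
    for item in items:
        INDEX[item.lower()] = (cat, cluster, subcluster, sub)

def classify(name):
    """Exact lookup first, then a fuzzy fallback for misspellings like 'garlik'."""
    key = name.strip().lower()
    if key in INDEX:
        return INDEX[key]
    close = difflib.get_close_matches(key, INDEX, n=1, cutoff=0.8)
    return INDEX[close[0]] if close else None

print(classify("Yam"))
print(classify("garlik"))
```

For large or frequently changing catalogues, an inverted index or a real full-text engine would be the next step.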

Writing multithreaded code to generate incremental index
I want to write code for the following scenario: we are designing an invoice generator that also generates an incremental invoice number. Note that, for various reasons, we cannot use something like SQL auto-increment here; the increment logic is written in C#. Since several users can send requests at the same time, we must never hand the same invoice number to two of them (I handle this with a lock).
So far I have written the increment logic, handled concurrent requests with a lock, and written the CRUD layer for the MongoDB database.
My problem is that my increment logic is naive: each time, I find the maximum existing number and increment it by one, which is not scalable. The other problem is deletion: we don't want invoice numbers to ever repeat, but since I take the max, deleting the record that held the maximum value means the next invoice would get a duplicate number.
I really don't know how to handle this, especially the deletion case. Should I store the maximum value in a file, or change the increment logic completely?
I appreciate any help.
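A common fix is to keep the next invoice number in its own counter record instead of scanning for the max: the counter only ever moves forward, so deletions can never cause a reuse, and issuing a number is O(1). In MongoDB this is typically a small counters document updated atomically with $inc (e.g. via find_one_and_update). The idea in a language-neutral Python sketch (the class and names are illustrative, not your C# code):

```python
import threading

class InvoiceCounter:
    """Monotonic counter: the next number is stored separately from the
    invoices, so deleting an invoice can never cause a number to be reused,
    and there is no O(n) max() scan on every request."""
    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def next_number(self):
        with self._lock:          # one request at a time gets a number
            n = self._next
            self._next += 1
            return n

counter = InvoiceCounter()
results = []

def worker():
    for _ in range(1000):
        results.append(counter.next_number())

# simulate several concurrent users requesting invoice numbers
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(set(results)))  # all issued numbers are distinct
```

Persist `_next` (in a file, or better, in the counters document itself) so the sequence survives restarts; never derive it from the remaining invoices.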

Training an ML model on two different datasets before using test data?
So I have the task of using a CNN for facial recognition: classifying faces into different classes of people, each individual person being its own separate class. The training data I am given is very limited: I only have one image for each class, and I have 100 classes (so 100 images in total, one image of each person). The approach I am using is transfer learning on the GoogLeNet architecture.
However, instead of just training GoogLeNet on the images of the people I have been given, I want to first train it on a separate, larger set of different face images, so that by the time I train it on my provided data, the model has already learnt the features it needs to classify faces generally. Does this make sense/will this work?
Using Matlab, as of now, I have changed the fully connected layer and the classification layer to train the network on the Yale Face Database, which consists of 15 classes. I achieved 91% validation accuracy using this database. Now I want to retrain this saved model on my provided data (100 classes with one image each).
What would I have to do to this saved model to train it on the new dataset without losing the features it has learned from the Yale database? Do I just change the last fully connected and classification layers again and retrain? Will this be pointless and mean I lose all of the progress from before, i.e. will it make new weights from scratch, or will it use the previously learned weights to train even better on my new dataset? Or should I train the model with my training data and the Yale database all at once?
I have a separate set of test data provided for me, which I do not have the labels for; this is what is used to test the final model and give me my score/grade. Please help me understand whether what I'm saying is viable or nonsense; I'm confused, so I would appreciate being pointed in the right direction.
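On the weights question: replacing only the classification head keeps the learned feature weights; the earlier layers are not reinitialized unless you explicitly reset them, and you can freeze them to be sure. A toy NumPy sketch of the idea (a stand-in feature extractor plus a new softmax head; the shapes, data, and training loop are illustrative, not your Matlab/GoogLeNet setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in "base" feature extractor, pretending its weights were learned on
# the larger face dataset (Yale); transfer learning keeps these weights
W_base = rng.normal(size=(64, 32)) / 8
W_base_before = W_base.copy()          # saved so we can check nothing is lost

def features(X):
    return np.maximum(X @ W_base, 0)   # frozen ReLU features

n_classes = 5                          # small stand-in for the 100 one-shot classes
X = rng.normal(size=(n_classes, 64))   # one flattened "image" per class
y = np.arange(n_classes)

# new classification head trained from scratch; W_base is never touched
W_head = np.zeros((32, n_classes))
for _ in range(200):                   # plain softmax regression on frozen features
    F = features(X)
    logits = F @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n_classes), y] -= 1    # softmax gradient: probabilities minus one-hot
    W_head -= 0.1 * (F.T @ p) / n_classes

pred = (features(X) @ W_head).argmax(axis=1)
print((pred == y).mean())
```

In Matlab the analogous step is replacing the final fullyConnectedLayer and classificationLayer while leaving the earlier layers untouched (optionally with reduced learning rates), so the Yale-trained features are reused, not discarded.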

What's the best way to select variables in a random forest model?
I am training RF models in R. What is the best way to select variables for my models? The datasets are pretty big; each has around 120 variables in total. I know there is a cross-validation approach to variable selection for other classification algorithms such as KNN. Is there a similar approach for variable selection or parameter tuning when training RF models?
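For random forests specifically, two common options are the model's own importances (impurity-based or permutation importance) and cross-validated wrapper methods such as recursive feature elimination (RFECV in scikit-learn; R's caret has rfe). A small scikit-learn sketch of importance-based selection on synthetic data shaped like the question (120 variables); the "mean" threshold is an illustrative choice, not a rule:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# synthetic stand-in shaped like the question: 120 variables, few informative
X, y = make_classification(n_samples=500, n_features=120, n_informative=10,
                           n_redundant=0, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# keep variables whose impurity-based importance exceeds the mean importance
selector = SelectFromModel(rf, threshold="mean", prefit=True)
X_small = selector.transform(X)
print(X_small.shape)
```

In R the same information is in randomForest::importance() / varImpPlot(), and caret's rfe() wraps the cross-validated elimination loop.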

How would I put my own dataset into this code?
I have been looking at a Tensorflow tutorial for unsupervised learning, and I'd like to feed in my own dataset; the code currently uses the MNIST dataset. I know how to create my own datasets in Tensorflow, but I have trouble adapting the code used here to my own data. I am pretty new to Tensorflow. The paths to my dataset in my project are
\data\training
and
\data\testval\
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
# TensorFlow ≥2.0-preview is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"
# Common imports
import numpy as np
import os

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full.astype(np.float32) / 255
X_test = X_test.astype(np.float32) / 255
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

def rounded_accuracy(y_true, y_pred):
    return keras.metrics.binary_accuracy(tf.round(y_true), tf.round(y_pred))

tf.random.set_seed(42)
np.random.seed(42)

conv_encoder = keras.models.Sequential([
    keras.layers.Reshape([28, 28, 1], input_shape=[28, 28]),
    keras.layers.Conv2D(16, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(32, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(64, kernel_size=3, padding="SAME", activation="selu"),
    keras.layers.MaxPool2D(pool_size=2)
])
conv_decoder = keras.models.Sequential([
    keras.layers.Conv2DTranspose(32, kernel_size=3, strides=2, padding="VALID",
                                 activation="selu", input_shape=[3, 3, 64]),
    keras.layers.Conv2DTranspose(16, kernel_size=3, strides=2, padding="SAME", activation="selu"),
    keras.layers.Conv2DTranspose(1, kernel_size=3, strides=2, padding="SAME", activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])
conv_ae = keras.models.Sequential([conv_encoder, conv_decoder])

conv_ae.compile(loss="binary_crossentropy", optimizer=keras.optimizers.SGD(lr=1.0),
                metrics=[rounded_accuracy])
history = conv_ae.fit(X_train, X_train, epochs=5, validation_data=[X_valid, X_valid])
conv_encoder.summary()
conv_decoder.summary()
conv_ae.save("\models")
Do note that I got this code from another StackOverflow answer.
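Whatever loader you use for \data\training, the key step is ending up with the same array shape and scaling the tutorial code expects: grayscale 28x28 images as float32 in [0, 1], split into train/validation. A sketch of just that conversion (the random array stands in for your loaded images; the actual loading could be done with PIL or tf.keras.utils.image_dataset_from_directory):

```python
import numpy as np

# stand-in for images loaded from \data\training; in reality you would read
# the files from disk -- only the shapes and dtypes matter for the tutorial
raw = np.random.randint(0, 256, size=(1000, 28, 28), dtype=np.uint8)

# reproduce the tutorial's preprocessing: float32, scaled to [0, 1]
X_full = raw.astype(np.float32) / 255

# hold out the last 200 images for validation, mirroring the tutorial's split
X_train, X_valid = X_full[:-200], X_full[-200:]
print(X_train.shape, X_valid.shape)
```

If your images are not 28x28 grayscale, the Reshape([28, 28, 1]) at the top of the encoder and the decoder's input_shape=[3, 3, 64] must change accordingly.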

Why do KMedoids and hierarchical clustering return different results?
I have a huge dataframe which contains only 0s and 1s. I tried to use scipy.cluster.hierarchy to get the dendrogram and then sch.fcluster to get the clusters at a specific cutoff (the metric for the distance matrix is Jaccard; the linkage method is "centroid"). However, when I wanted to work out the optimal number of clusters for my dataframe, I noticed that KMedoids combined with the elbow method can help me. Then, after finding the best number of clusters (say 2), I tried
KMedoids(n_clusters=2, metric='jaccard').fit(dataset)
to get the clusters, but the result is different from the hierarchical method. (The reason I don't use KMeans is that it is too slow for my dataframe.) Therefore, I did a test (indices 0, 1, 2, 3 will be grouped):
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist

label1 = np.random.choice([0, 1], size=20)
label2 = np.random.choice([0, 1], size=20)
label3 = np.random.choice([0, 1], size=20)
label4 = np.random.choice([0, 1], size=20)
dataset = pd.DataFrame([label1, label2, label3, label4])
dataset
Method KMedoids:
Since there are only 4 indices, the cluster number was set to 2.
from sklearn_extra.cluster import KMedoids

cobj = KMedoids(n_clusters=2, metric='jaccard').fit(dataset)
labels = cobj.labels_
labels
The clustering result is shown below:
Method Hierarchical:
import scipy.cluster.hierarchy as sch

# calculate distance matrix
disMat = sch.distance.pdist(dataset, metric='jaccard')
disMat1 = sch.distance.squareform(disMat)
# cluster:
Z2 = sch.linkage(disMat1, method='centroid')
sch.fcluster(Z2, t=1, criterion='distance')
To get the same number of clusters I tried several cutoffs; the number of clusters was 2 when the cutoff was set to 1. Here is the result:
I also googled and learned that the dataframe passed to KMedoids should be the original dataframe, not the distance matrix. But it seems that KMedoids converts the original dataframe to a new one, for reasons I don't know, because I got this data conversion warning:
DataConversionWarning: Data was converted to boolean for metric jaccard warnings.warn(msg, DataConversionWarning)
I also got warning when I perform Hierarchical method:
ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
Purpose:
What I want is a method to get the clusters once I know the optimal number of clusters. The hierarchical method needs to try different cutoffs, while KMedoids doesn't, but KMedoids returns a different result.
Can anybody explain this to me? And are there better ways to perform clustering?
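Part of the discrepancy is expected: k-medoids and hierarchical clustering optimize different criteria, so they can legitimately disagree. But two things in the posted setup make it worse: linkage was given the squareform (redundant) matrix, which scipy then treats as raw observations (hence the ClusterWarning), and "centroid" linkage assumes Euclidean coordinates, so it is not meaningful on Jaccard distances. A sketch that feeds the same condensed Jaccard matrix to both approaches; the brute-force 2-medoids stands in for KMedoids(metric='precomputed') and all parameter choices are illustrative:

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(4, 20)).astype(bool)

# one Jaccard distance matrix shared by both methods
condensed = pdist(data, metric="jaccard")
D = squareform(condensed)

# hierarchical clustering: pass the *condensed* matrix and use a linkage that
# is valid for arbitrary metrics ('average'); 'centroid' assumes Euclidean data
hier = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")

# brute-force 2-medoids on the same matrix (what KMedoids(metric='precomputed')
# approximates): pick the medoid pair with minimal total assignment cost
best = min(combinations(range(4), 2), key=lambda m: D[:, m].min(axis=1).sum())
kmed = D[:, best].argmin(axis=1)

print(hier, kmed)
```

Even with a shared, correctly condensed matrix the two partitions can still differ on larger data; agreement on a shared metric just removes the avoidable sources of disagreement.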

R: Double Clustering of Standard Errors in Panel Regression
So I am analysing fund data. I use a fixed-effects model and want to double-cluster my standard errors along "ISIN" and "Date" with plm().
The output of dput(data) is:
> dput(nd[1:100, ]) structure(list(Date = structure(c(1517356800, 1519776000, 1522454400, 1525046400, 1527724800, 1530316800, 1532995200, 1535673600, 1538265600, 1540944000, 1543536000, 1546214400, 1548892800, 1551312000, 1553990400, 1556582400, 1559260800, 1561852800, 1564531200, 1567209600, 1569801600, 1572480000, 1575072000, 1577750400, 1580428800, 1582934400, 1585612800, 1588204800, 1590883200, 1593475200, 1596153600, 1598832000, 1601424000, 1604102400, 1606694400, 1609372800, 1612051200, 1614470400, 1617148800, 1619740800, 1622419200, 1625011200, 1627689600, 1630368000, 1632960000, 1635638400, 1638230400, 1640908800, 1517356800, 1519776000, 1522454400, 1525046400, 1527724800, 1530316800, 1532995200, 1535673600, 1538265600, 1540944000, 1543536000, 1546214400, 1548892800, 1551312000, 1553990400, 1556582400, 1559260800, 1561852800, 1564531200, 1567209600, 1569801600, 1572480000, 1575072000, 1577750400, 1580428800, 1582934400, 1585612800, 1588204800, 1590883200, 1593475200, 1596153600, 1598832000, 1601424000, 1604102400, 1606694400, 1609372800, 1612051200, 1614470400, 1617148800, 1619740800, 1622419200, 1625011200, 1627689600, 1630368000, 1632960000, 1635638400, 1638230400, 1640908800, 1517356800, 1519776000, 1522454400, 1525046400), tzone = "UTC", class = c("POSIXct", "POSIXt")), Dummy = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), ISIN = c("LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", 
"LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "LU1883312628", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "NL0000289783", "DE0008474008", "DE0008474008", "DE0008474008", "DE0008474008"), Returns = c(0.12401, 4.15496, 1.39621, 4.46431, 2.28814, 0.58213, 3.61322, 3.56401, 0.6093, 4.73124, 0.88597, 5.55014, 5.12313, 2.65441, 1.3072, 2.99972, 5.1075, 3.51965, 0.24626, 2.21961, 4.48332, 0.03193, 2.19313, 1.81355, 2.2836, 8.3185, 14.58921, 4.47981, 4.52948, 5.51294, 2.16857, 2.56992, 2.04736, 6.17825, 14.71218, 1.24079, 1.33888, 3.5197, 8.09674, 1.43074, 3.79434, 0.47398, 1.57474, 2.48837, 3.08439, 3.68851, 2.93803, 6.43656, 2.67598, 3.39767, 5.27997, 4.76756, 4.89914, 0.95931, 2.22484, 3.01478, 1.63997, 6.64158, 3.46497, 8.54853, 7.40113, 5.68973, 1.64367, 4.35256, 5.09351, 3.43618, 2.16774, 0.77703, 3.16832, 1.65626, 4.91897, 1.76163, 1.49508, 5.16847, 9.53639, 12.74246, 3.08746, 3.4028, 0.09515, 5.66077, 2.85661, 2.58972, 9.53565, 2.93138, 
0.32556, 2.92393, 5.02059, 0.98137, 0.58733, 4.91219, 2.21603, 2.52087, 3.87762, 7.66159, 0.04559, 4.48257, 2.83511, 6.27841, 3.98683, 4.99554), Flows = c(0.312598458, 37.228563578, 119.065088084, 85.601069424, 46.613436838, 20.996760878, 12.075112555, 40.571568112, 16.210315254, 54.785115578, 55.93565336, 25.073939479, 16.513305702, 111.112262813, 17.260252326, 44.287088276, 84.358676293, 12.73665543, 14.846322594, 30.353217826, 43.002634628, 31.293725624, 32.291532262, 21.145334594, 33.460150254, 22.458849454, 34.690817528, 34.088358344, 4.069613214, 7.841523244, 6.883674001, 11.99060429, 19.155102931, 20.274682083, 33.509645025, 25.764368282, 22.451403457, 39.075362392, 9.772306537, 7.214728071, 10.462230506, 12.550102699, 0.439609898, 16.527865041, 15.938402293, 10.916678964, 11.041205907, 11.627537098, 13.797947969, 18.096144272, 29.879529566, 51.895196556, 3.192064966, 1.469562773, 9.739671656, 35.108549922, 19.490401121, 36.459406559, 66.213269625, 8.105824198, 17.078089399, 59.408458411, 1.227033593, 42.501421101, 15.275983037, 19.425363714, 23.165013159, 19.68599313, 20.478530269, 19.566890333, 19.63229278, 59.274372862, 37.128708445, 5.129404763, 2.650978954, 0.566245645, 14.80700799, 4.891308881, 18.16286654, 17.570559084, 2.726629634, 14.482219321, 35.795673521, 10.119935801, 14.37900783, 20.385053784, 4.550848701, 17.672355509, 14.270420088, 1.440911458, 8.924636198, 5.749771862, 12.284920947, 23.093834986, 13.553880939, 31.572182943, 22.977082191, 8.076560195, 11.825577374, 9.263872938), TNA = c(2474.657473412, 2327.75517961, 2171.146502197, 2175.433117247, 2082.147188171, 2042.121760963, 2031.311390907, 1918.904748403, 1914.140451001, 1765.867322561, 1724.972362171, 1600.059421422, 1605.009162592, 1539.205393073, 1540.8291693, 1538.550310809, 1370.631945404, 1404.091772234, 1351.60138448, 1290.98574898, 1309.942298579, 1280.634128059, 1278.146819041, 1281.50075434, 1189.563983023, 1062.001168646, 859.735053702, 868.096185968, 894.397805491, 
933.614731653, 885.975121845, 897.018097461, 854.196359787, 781.178047528, 863.00585297, 846.859512502, 796.10866733, 784.290994645, 838.747509395, 841.511540715, 863.678978862, 854.663205271, 856.363306246, 859.460891875, 816.275861034, 836.347760358, 800.867957871, 842.657752288, 2742.709413, 2629.70296, 2518.690562, 2516.902480001, 2635.037923, 2606.124805, 2672.082125, 2715.556617, 2738.845915, 2591.318371, 2613.260789, 2396.060545001, 2554.437804, 2638.160519, 2680.990319, 2753.467368, 2533.347075001, 2637.887076, 2670.127393, 2628.138778001, 2688.643794, 2711.56785, 2823.634535001, 2811.983963001, 2835.218976, 2672.765021, 2413.332814, 2718.586512, 2727.69596, 2823.040628, 2805.482839, 2944.602701, 2855.870812, 2765.189256, 2990.804719, 3066.36598, 3059.603769, 3126.458368, 3276.612153, 3289.257788, 3291.864476, 3397.759970999, 3461.462599, 3540.518638, 3388.702548, 3622.641661, 3604.82519, 3732.115875999, 4129.617979, 3857.780349, 3687.848268001, 3858.323607), Age = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 62, 62, 62, 62)), row.names = c(NA, 100L), class = c("tbl_df", "tbl", "data.frame"))
My code initially yielded a result. I didn't change anything, but all of a sudden it won't let me execute the last line of code.
library(plm)
attach(nd)
library(lmtest)
library(stargazer)
library(sandwich)
library(etable)
library(pacman)
library(fixest)
library(multiwayvcov)
library(foreign)

# cleaning
# adjust units of TNA and Flows
nd <- nd %>% mutate(TNA = TNA / 1000000, Flows = Flows / 1000000) # 1mio and 1mio

# drop NAs
# nd <- nd %>%
#   drop_na()

# variable creation for model
Y <- cbind(nd$Flows)
X <- cbind(nd$Dummy, lag(nd$Returns), lag(nd$TNA), nd$Age)

# descriptive statistics
summary(Y)
summary(X)

# random effects
random2 <- plm(Y ~ X, nd, model='random', index=c('ISIN', 'Date'))
summary(random2)

# fixed effects model
fixed2 <- plm(Y ~ X, nd, model='within', index=c('ISIN', 'Date'))

# Breusch-Pagan test
bptest(fixed2)

# test which model to use: fixed effects or random effects
# Hausman test
phtest(random2, fixed2) # we take fixed effects

## Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
  vcovHC(x, cluster="ISIN", ...) + vcovHC(x, cluster="Date", ...) -
    vcovHC(x, method="white1", ...)
}

# visualize SEs
coeftest(fixed2, vcov=function(x) vcovDC(x, type="HC1"))
stargazer(coeftest(fixed2, vcov=function(x) vcovDC(x, type="HC1")), type = "text")
Now, when i try to run:
coeftest(fixed2, vcov=function(x) vcovDC(x, type="HC1"))
I get the error: Error in match.arg(cluster) : 'arg' should be one of “group”, “time”. Before, it didn't.
I highly appreciate any answer. I'd also like to know whether the formula I used for the double-clustered standard errors is correct. I followed the approach from "Double clustered standard errors for panel data" (the comment from Iandorin).
Edit: I rewrote the code and now it works:
library(plm)
attach(nd)
library(lmtest)
library(stargazer)
library(sandwich)
library(etable)
library(pacman)
library(fixest)
library(multiwayvcov)
library(foreign)

# cleaning
# adjust units of TNA and Flows
# nd <- nd %>%
#   mutate(TNA = TNA / 1000000, Flows = Flows / 1000000) # 1mio and 1mio

# drop NAs
# nd <- nd %>%
#   drop_na()

# variable creation for model
Y <- cbind(nd$Flows)
X <- cbind(nd$Dummy, lag(nd$Returns), lag(nd$TNA), nd$Age)

# descriptive statistics
summary(Y)
summary(X)

# random effects
random2 <- plm(Y ~ X, nd, model='random', index=c('ISIN', 'Date'))
summary(random2)

# fixed effects model
fixed2 <- plm(Y ~ X, nd, model='within', index=c('ISIN', 'Date'))

# Breusch-Pagan test
bptest(fixed2)

# test which model to use: fixed effects or random effects
# Hausman test
phtest(random2, fixed2) # we take fixed effects

## Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
  vcovHC(x, cluster="ISIN", ...) + vcovHC(x, cluster="Date", ...) -
    vcovHC(x, method="white1", ...)
}

testamk <- plm(Y ~ X, nd, model='within', index=c('ISIN', 'Date'))
summary(testamk)
coeftest(testamk, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
Many thanks in advance! Joe

Seurat - cannot plot the same DimPlot again
I am trying to reproduce the code from this paper: https://doi.org/10.1038/s42003-020-0837-0
I have written the code step by step based on the instructions in the methods section. But after clustering, when I plot the clusters with DimPlot, I get a plot that differs from the corresponding plot in the paper.
I wonder what the problem is? I have tailored every parameter to reproduce the plot, but it hasn't worked yet.
Graph of the paper
My graph
Please help me to solve this issue. 
Linear regression with gradient descent won't converge
I've written linear regression from scratch. The fit function computes the partial derivatives with respect to the slope (m) and bias (b) at each epoch and updates both variables accordingly.
My loss decreases, but it gets stuck at a curious place.
The code:
class LinearRegression:
    def __init__(self):
        self.m = 3
        self.b = 2

    def fit(self, x, y, epochs, lr):
        for epoch in range(epochs):
            pred = (self.m * x) + self.b
            cur_loss = np.sum((y - pred)**2)
            if epoch % 1000 == 0:
                print(cur_loss)
            pd_wrt_m = -(2/x.size) * np.sum((y - pred) * x)
            pd_wrt_b = -(2/x.size) * np.sum(y - pred)
            self.m -= (lr * pd_wrt_m)
            self.b -= (lr * pd_wrt_b)

    def predict(self, x):
        return (self.m * x) + self.b

lr_model = LinearRegression()
lr_model.fit(x, y, 10000, 0.01)
Y_pred = lr_model.predict(x)

plt.scatter(x, y)
plt.plot([min(x), max(y)], [min(Y_pred), max(Y_pred)], color='red') # regression line
plt.show()
And this is the output when the above code is executed.
I've tried smaller and larger learning rates, smaller and larger epoch counts, but it seems to be the case that wherever the initial m and b values start out, they always stop converging at this particular position.
I've reread my code and could not find a mistake. What could I be doing wrong?
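For reference, here is a minimal gradient-descent fit on synthetic data that does converge; the two easy-to-lose details are the leading minus sign on the gradients and the -= updates. (Separately, the plotting call in the post uses [min(x), max(y)], where max(x) was presumably intended, which can make even a converged line look wrong.) All data and hyperparameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 100)   # true slope 3, intercept 2

m, b = 0.0, 0.0
lr = 0.01
for _ in range(5000):
    pred = m * x + b
    # gradients of the mean squared error; note the leading minus sign and
    # the in-place -= updates -- dropping either one stalls or diverges
    grad_m = -(2 / x.size) * np.sum((y - pred) * x)
    grad_b = -(2 / x.size) * np.sum(y - pred)
    m -= lr * grad_m
    b -= lr * grad_b

print(m, b)
```

If the loss decreases and then plateaus away from the true parameters, check the update signs first; a learning rate that is too large for the feature scale produces oscillation rather than a plateau.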

predict function producing too high y values
Hi, can anyone tell me why my linear regression line is being displayed like this: https://imgur.com/gallery/u3L2avz. I reverted to a previous version of my Python script that was working before; however, I must have edited something for it to be displayed like this.
the y values being predicted are: https://pastebin.com/HkR2JzvU
the actual y values are: https://pastebin.com/gTW90urJ
This is my code:
data = pd.read_csv('food.csv')
x = data['Date'].values
x = pd.to_datetime(x, errors="coerce")
x = x.values.astype("float64").reshape(-1, 1)
y = data['TOTAL'].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(x, y)
print(f"The slope is {reg.coef_[0][0]} and the intercept is {reg.intercept_[0]}")
predictions = reg.predict(x.reshape(-1, 1))
x = data['Date'].astype('str')
plt.scatter(x, y, c='black')
plt.plot(x, predictions, c='blue', linewidth=2)
plt.show()
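Two things commonly produce a plot like that: the reshape calls (scikit-learn expects reshape(-1, 1); a literal reshape(1, 1) would fail or mean something else), and fitting on raw datetime64 nanosecond values, which makes the slope astronomically small and easy to misread. A self-contained sketch with a toy DataFrame standing in for food.csv (all names and values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# tiny stand-in for food.csv with hypothetical 'Date' and 'TOTAL' columns
data = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "TOTAL": np.arange(10) * 2.0 + 5.0,
})

# datetime64[ns] -> int64 nanoseconds; scale to seconds so the slope is legible
x = data["Date"].values.astype("int64").reshape(-1, 1) / 1e9
y = data["TOTAL"].values.reshape(-1, 1)

reg = LinearRegression().fit(x, y)
pred = reg.predict(x)
print(float(np.abs(pred - y).max()))  # fits the linear toy data almost exactly
```

If the predicted y values are wildly too high, compare the scale of the x you fit on with the scale of the x you predict on; mixing nanoseconds and seconds (or strings) between fit and predict produces exactly that symptom.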

Improve my knowledge while I work under this project
I’d like to develop an algorithm that helps identify geometric patterns in images. The program would then learn to suggest similar images by considering the geometric pattern.
I know I need to improve my knowledge of recursion algorithms and Python. But I would like to get a better idea of how to develop my project, and that's the reason I'm posting this inquiry.

How to get clustering on variables using gower in R?
I have a dataset with mixed types: continuous, binary, categorical.
I read some articles saying that 'gower' is a good clustering distance for mixed-type data. So I would like to try it out and make an exploratory heatmap (clustering both rows and columns). For a minimal example:
library(cluster)
data(agriculture)
agriculture$test <- as.factor(ifelse(agriculture$y %% 2 == 0, "yes", "no"))
head(agriculture)

      x    y test
B  16.8  2.7   no
DK 21.3  5.7   no
D  18.7  3.5   no
GR  5.9 22.2   no
E  11.4 10.9   no
F  17.8  6.0  yes
I can get a dissimilarity matrix on the samples using
gower_sample_dist <- daisy(agriculture, metric = "gower")
However, for the heatmap I also need clustering on the variables, which I am not able to run successfully using
gower_variable_dist <- daisy(t(agriculture), metric = "gower")

> daisy(t(agriculture), metric = "gower")
Error in daisy(t(agriculture), metric = "gower") :
  x is not a dataframe or a numeric matrix.
Is there a way to get a clustering/dissimilarity matrix on the variables using gower?
Thank you!
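daisy() refuses t(agriculture) because transposing a mixed-type data frame forces everything to character, so each "column" no longer has a single type; Gower distance is defined between observations, not between variables. For clustering variables you need a variable-level dissimilarity instead (e.g. 1 - |correlation| for numeric variables, or an association measure for mixed types). For intuition, here is what Gower does on the observations, as a minimal Python sketch (the three-row data frame mirrors the head of agriculture; the implementation is a simplified stand-in for daisy, not its exact algorithm):

```python
import numpy as np
import pandas as pd

def gower_matrix(df):
    """Minimal Gower dissimilarity: numeric columns contribute range-scaled
    absolute differences, non-numeric columns a 0/1 mismatch, averaged
    over all variables."""
    n = len(df)
    D = np.zeros((n, n))
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            col_range = s.max() - s.min()
            d = np.abs(s.values[:, None] - s.values[None, :]) / (col_range if col_range else 1)
        else:
            d = (s.values[:, None] != s.values[None, :]).astype(float)
        D += d
    return D / df.shape[1]

df = pd.DataFrame({"x": [16.8, 21.3, 18.7], "y": [2.7, 5.7, 3.5],
                   "test": ["no", "no", "yes"]})
print(np.round(gower_matrix(df), 3))
```

Note that each column is compared within its own type; a transposed frame would mix 16.8 and "no" in one column, which is exactly why the variable-wise call fails.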