Use python seaborn to set Heatmap correlations ONLY between certain values
I've got some data from a plastic extruder machine that I'm looking for patterns in. Using this answer, I got part of the way to a solution by showing correlations over a certain threshold with a seaborn heatmap. Because of the way the machine operates, many of the values I need to analyze are negatively correlated: for example, increasing the extruder speed decreases the weight of the product made, and quantifying this is of interest to the operators.
What I have so far comes from the answer linked above and works fine for correlations over the threshold set in kot:
import matplotlib.pyplot as plt
import seaborn as sns

corr = df2.corr()
kot = corr[corr >= 0.8]  # keep only correlations at or above the threshold
plt.figure(figsize=(60, 40))
sns.heatmap(kot, cmap="Greens")
Can someone help me define kot so that it also shows correlations below -0.8? It would be really helpful if the same display could show the correlations above +0.8 as well. How would I set kot to do that?
Many thanks.
1 answer
-
answered 2021-01-19 12:13
r.b.leon
fig, ax = plt.subplots()
kot1 = corr[corr >= 0.8]   # strong positive correlations
kot2 = corr[corr < -0.8]   # strong negative correlations
sns.heatmap(kot1, cmap="Greens", ax=ax)
ax2 = ax.twinx()
sns.heatmap(kot2, cmap="Greens", ax=ax2)
plt.show()

Of course you can overlay them.
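Alternatively, both thresholds can be applied in one mask, so a single heatmap shows the strong correlations of either sign. A sketch with a small made-up frame standing in for the extruder data (df2 and the 0.8 threshold come from the question):

```python
import numpy as np
import pandas as pd

# made-up stand-in for the extruder data in the question
rng = np.random.default_rng(0)
speed = rng.normal(size=100)
df2 = pd.DataFrame({
    "speed": speed,
    "weight": -speed + rng.normal(scale=0.1, size=100),  # strongly negatively correlated
    "noise": rng.normal(size=100),                       # uncorrelated
})

corr = df2.corr()
# keep any cell whose absolute correlation is at least 0.8, positive or negative
kot = corr[corr.abs() >= 0.8]
```

Plotting is then one call, e.g. sns.heatmap(kot, cmap="vlag", center=0); a diverging colormap keeps positive and negative cells distinguishable, and the cells below the threshold are NaN and stay blank, just as in the original snippet.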
See also questions close to this topic
-
NLTK: how to split based on a given pattern
if I have this code:
import nltk

source_text = ('To convert some docs, <g id="1">just click “Add Books” </g>button '
               'and then click “<g id="2">Convert”.</g> Set the output format and '
               'click <g id="3">“OK”</g>.')
abbreviations = ['e.g', 'i.e']
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer._params.abbrev_types.update(abbreviations)
tokenizer_dicts = tokenizer.tokenize(source_text)
for tokenized in tokenizer_dicts:
    print(tokenized)
this will output:
To convert some docs, <g id="1">just click “Add Books” </g>button and then click “<g id="2">Convert”. </g> Set the output format and click <g id="3">“OK”</g>.
whereas the expected result is:
To convert some docs, <g id="1">just click “Add Books” </g>button and then click “<g id="2">Convert”.</g> Set the output format and click <g id="3">“OK”</g>.
Notice the </g>: in the actual output it sits at the start of the second string, whereas in the expected result it is "moved and appended" to the end of the first string. How do I give the tokenizer a pattern so it splits like the expected result, or is there any way around it?
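One workaround is to post-process the tokenizer's output rather than reconfigure Punkt: move a closing tag from the start of a sentence back onto the previous one. fix_tags below is a hypothetical helper, not part of NLTK:

```python
import re

def fix_tags(sentences):
    """Move a leading closing </g> tag back onto the previous sentence."""
    fixed = []
    for sent in sentences:
        match = re.match(r'\s*(</g>)\s*', sent)
        if match and fixed:
            fixed[-1] += match.group(1)   # append the tag to the prior sentence
            sent = sent[match.end():]     # and strip it (plus whitespace) here
        fixed.append(sent)
    return fixed
```

Applied to the two sentences from the question, this moves the stray </g> from the start of the second sentence onto the end of the first.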
-
How to split two kinds of values in a column and count the number of occurrences?
I have a dataframe with a Catalog column of TRUE and FALSE values. I want to get a dataframe like:
Sample Input
Country | Class | Catalog
A | abc | TRUE
A | abc | FALSE
B | def | TRUE
C | ghi | FALSE
Sample Output
Country | Class | TRUE | FALSE | TOTAL
A | abc | 1 | 1 | 2
B | def | 1 | 0 | 1
C | ghi | 0 | 1 | 1
I had tried :
df.groupby(['Country','Class','Country'])['Catalog'].value_counts()
but I did not get the desired results.
Any help around this?
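A sketch of one way to get that shape with pandas crosstab (the frame below re-creates the sample input; with a boolean Catalog column, crosstab emits the False column before the True one):

```python
import pandas as pd

# re-create the sample input from the question
df = pd.DataFrame({
    "Country": ["A", "A", "B", "C"],
    "Class":   ["abc", "abc", "def", "ghi"],
    "Catalog": [True, False, True, False],
})

# count TRUE/FALSE per (Country, Class), then add a row total
out = pd.crosstab([df["Country"], df["Class"]], df["Catalog"])
out.columns = ["FALSE", "TRUE"]           # crosstab orders False before True
out["TOTAL"] = out["FALSE"] + out["TRUE"]
out = out.reset_index()
```

The result has one row per (Country, Class) pair with the TRUE, FALSE, and TOTAL counts as in the sample output.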
-
Using pygame in google colab
I just tried to import pygame in Google Colab, and when I try to open a .wav file it keeps producing an error like this. I don't know what to do; please, can anyone help me sort out this issue?
Please check the code and error.
-
Providing dtypes for Dataframe.apply()
Problem description
When given a func that returns a list of numerical values of different dtypes, DataFrame's apply up-converts all the returned values to a common type. For example, in the code below the elements in the second column, the integer 3, are converted by apply() to the complex number (3.0+0.0j).
df = pd.DataFrame([1, 2, 3])
df.apply(lambda row: [1 + 5j, 3], axis='columns', result_type='expand')

          0         1
0  1.0+5.0j  3.0+0.0j
1  1.0+5.0j  3.0+0.0j
2  1.0+5.0j  3.0+0.0j
This behavior is inherited from Numpy's type determination:
If not given, then the type will be determined as the minimum type required to hold the objects in the sequence.
Is there any way to provide a dtype parameter to the DataFrame's apply ?
Expected output:

          0  1
0  1.0+5.0j  3
1  1.0+5.0j  3
2  1.0+5.0j  3
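apply itself accepts no dtype parameter. One workaround (a sketch, sidestepping apply's NumPy-style common-type conversion) is to build the expanded frame through the DataFrame constructor, which infers a dtype per column:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3])
# materialize the per-row lists, then let the constructor infer column dtypes:
# column 0 becomes complex128 while column 1 stays int64
rows = [[1 + 5j, 3] for _ in df.index]
out = pd.DataFrame(rows, index=df.index)
```

Casting the affected columns back after a normal apply is another route, though casting complex to int discards the imaginary part.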
-
How do I display more than one pandas describe() output in a single jupyter cell?
This is a really basic question but I haven't been able to find an answer:
In Jupyter, if I execute two pandas df.describe() calls in the same cell, only the last one's output is displayed. The same is true for .info(), .head(), etc. How do I persuade Jupyter and pandas to display all of the above outputs as intended?
FWIW example code would be:
df1.describe()
df2.describe()  # only the result of the final call is displayed
-
How do I use different colors for x, y, and reg line in seaborn jointplot?
I am trying to have the sepal_width data points and marginal histogram in blue, and the sepal_length data points and marginal histogram in green. I would also like the regression line to be in a different color, say, brown. Here is the iris data:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='white', color_codes=True)

columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'sth']
iris = pd.read_csv('iris.csv', names=columns)
sns.jointplot(data=iris, x='sepal_length', y='sepal_width', kind='reg',
              height=4, color='blue', marginal_kws={'color': 'green'})
r = iris['sepal_length'].corr(iris['sepal_width'])
plt.text(6, 4, 'r = ' + str(round(r, 5)), fontsize=13)
plt.show()
Thank you!
-
how do I plot std of data by column?
Edit: tl;dr
Imagine you have data like np.array([x0, x1, ...]) where 0 ≤ x ≤ 1. Only one value will be exactly 1 and you should probably throw that out. You want a bar chart of the values, with lines in the chart showing where one standard deviation lies, where two lie, and so on.
How do you draw this chart?
a little more detail
I was thinking something perhaps like this but this is wrong:
corr = results[results['deciles'] > 0].corr()
data = corr[np.abs(corr['predictions']) > my_sigma * (l + np.std(corr['predictions'])).mean()]['predictions']
# print(data)
# 1567450        0.179339
# 1948520        0.183407
# 3004299       -0.191020
# predictions    1.000000

fig, ax = plt.subplots()
xs = np.arange(len(data.index))
width = 1
labels = list(data.index)
plt.bar(x=xs, y=data.values, height=2)
plt.xticks(xs, labels)
plt.yticks(data.values)
ax.bar(y=data.values, height=d_firm.columns)
plt.show()
but this throws:
ValueError: setting an array element with a sequence.
on the plt.bar line
original
I am predicting sales associate chances of creating new accounts. I have a count of the number of new accounts in the previous year and I am using the boolean of that to do a binary classification. Afterwards, I want to see if any subgrouping of attributes are important. some categories have thousands of types, so a correlation matrix is pure grey.
I have data like:
# predicting the binary indication of a count column: so 0 = 0, > 0 = 1 in y
results = pd.get_dummies(df['Firm ID'])
X, y = apply_to_pipeline(df)
results = predict(X)  # returns a df with ['predictions']
results['actual'] = y
results['deciles'] = apply_to_qcut(results)
Note that the actual column is the binary class, where 1 means the raw target column is > 0 and 0 means it is == 0. Also, the predictions column comes from predict_proba, not predict, so its values are between 0 and 1. Now I want to correlate the results with some of the columns from X to see if any subgrouping of attributes is important... I am going to do this same thing once for each categorical column (like "Firm" in this example). The correlation matrix (mostly just mapping the group dummies to the actual and predictions columns) will center around the std of values per dummy, so I like this approach to limit things:

corr = results[results['deciles'] > 0].corr()
corr[np.abs(corr['predictions']) > (1.5 * np.std(corr, axis=1).mean())]['predictions']
Note that the last line selects only the predictions column, so I want to plot just a single column. That gives me just the few that are most over- and under-performing. I would like to plot this.
I think what makes the most sense is to draw a line at 1, 2, and 3 std, and then plot each remaining value in corr on the x-axis with its number of standard deviations on the y-axis. How can I do that?
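A sketch of the chart described above, with made-up numbers standing in for the filtered predictions column. Note that matplotlib's bar signature is bar(x, height): the values go in as height, not as a y keyword, which is one likely source of the error in the snippet above:

```python
import matplotlib
matplotlib.use("Agg")              # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# made-up stand-in for the filtered 'predictions' correlations
data = pd.Series([0.179339, 0.183407, -0.191020],
                 index=["1567450", "1948520", "3004299"])
sd = data.std()

fig, ax = plt.subplots()
ax.bar(np.arange(len(data)), data.values)   # bar(x, height)
ax.set_xticks(np.arange(len(data)))
ax.set_xticklabels(list(data.index))
for k in (1, 2, 3):                         # guide lines at +/- k standard deviations
    ax.axhline(k * sd, linestyle="--", linewidth=0.8)
    ax.axhline(-k * sd, linestyle="--", linewidth=0.8)
fig.savefig("std_bars.png")
```

Each bar is one surviving correlation value; the dashed horizontal lines mark 1, 2, and 3 standard deviations on either side of zero.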
-
Mapping HEX color codes onto individual columns in seaborn
I have a CSV with thousands of color codes (first column) and values assigned to them (second column).
I would like to create a graph where the codes are shown as their actual colors.
Is this possible in Seaborn?
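Seaborn sits on matplotlib, which accepts hex strings directly as per-bar colors, so this doesn't need anything seaborn-specific. A sketch with a small made-up frame standing in for the CSV:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

# made-up stand-in for the CSV: hex code in the first column, value in the second
df = pd.DataFrame({"hex":   ["#ff0000", "#00ff7f", "#1e90ff"],
                   "value": [3, 7, 5]})

fig, ax = plt.subplots()
# color each bar with its own hex code
ax.bar(df["hex"], df["value"], color=df["hex"])
fig.savefig("hex_bars.png")
```

A list of hex codes can similarly be handed to seaborn plotting functions as a palette, though the exact keyword depends on the plot type and seaborn version.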
-
R: Understanding and Controlling "color shading" in R
I am using the R programming language. I am following this tutorial: https://www.r-graph-gallery.com/2d-density-plot-with-ggplot2.html and I made the following plots:
# load library
library(ggplot2)

# create data
a <- data.frame(x=rnorm(20000, 10, 1.9), y=rnorm(20000, 10, 1.2))
b <- data.frame(x=rnorm(20000, 14.5, 1.9), y=rnorm(20000, 14.5, 1.9))
c <- data.frame(x=rnorm(20000, 9.5, 1.9), y=rnorm(20000, 15.5, 1.9))
data <- rbind(a, b, c)

# plot 1
ggplot(data, aes(x=x, y=y)) +
  stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
  scale_fill_distiller(palette=4, direction=-1) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme(legend.position='none')

# plot 2
ggplot(data, aes(x=x, y=y)) +
  stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
  scale_fill_distiller(palette=4, direction=1) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme(legend.position='none')

# plot 3
ggplot(data, aes(x=x, y=y)) +
  stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
  scale_fill_distiller(palette= "Spectral", direction=1) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme(legend.position='none')
Can anyone please walk me through the process by which a 2-dimensional coordinate (data$x, data$y) is converted into a continuous, shaded color plot? Is each 2-dimensional coordinate converted into 3 color coordinates (e.g. Red, Green, Blue, or Hue, Saturation, Luminance) using some mapping function? And how do regions of the map that do not contain an exact coordinate from the dataset get their colors "interpolated"?
Are there any videos, documentation, website, blogs etc. that explain the math behind this process?
Thanks
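In outline: the density estimate assigns a scalar to every grid cell (including cells with no data point nearby, because the kernel spreads each point out), the scalars are normalized to [0, 1], and a colormap then maps each scalar to an (R, G, B, A) tuple; no interpolation of colors between data points is involved. A Python sketch of the same pipeline, using a 2D histogram as a stand-in for the kernel density estimate ggplot uses:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10, 1.9, 20000)
y = rng.normal(10, 1.2, 20000)

# step 1: a density value per grid cell (ggplot uses a kernel estimate instead)
density, _, _ = np.histogram2d(x, y, bins=100, density=True)
# step 2: rescale the densities to [0, 1]
norm = plt.Normalize(density.min(), density.max())
# step 3: the colormap maps each scalar in [0, 1] to an (R, G, B, A) tuple
rgba = plt.cm.Blues(norm(density))   # shape (100, 100, 4): one color per cell
```

The smoothness of the final image comes from the density estimate itself, not from any color-space interpolation.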
-
pheatmap discrete colormap placement of tick labels
I have a matrix with thousands of cells, with values ranging from 0 to 5. I want to use a discrete color palette to indicate the value of each cell. The code below already works pretty well, but the position of the colormap labels is off. I simply want each tick to sit in the center of its corresponding color.
library("pheatmap")
library("RColorBrewer")

matrix <- round(matrix(rexp(200, rate=.1), ncol=20)/10)
color <- brewer.pal(max(matrix)+1, "Blues")
pheatmap(matrix, color=color, cluster_rows = F, cluster_cols = F)
The example produces a heatmap like this:
I want to move the colorbar labels, so it looks more like this:
If anyone has an idea how to do this, I'd be very thankful!
-
plotting duplicated values in python - using pandas or matplotlib
I want to plot in a heatmap all the duplicated values in the entire dataset. The code for missing values is:

sns.heatmap(training_data.isnull(), cbar=False)

I want something like that, but for duplicated values. Please help out.
Thanks
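A sketch of one way to build the analogous boolean mask: flag each cell whose value has already appeared earlier in its column, which gives a frame shaped exactly like the data (other definitions of "duplicated", such as whole duplicated rows, would need a different mask):

```python
import pandas as pd

# made-up stand-in for training_data
training_data = pd.DataFrame({"a": [1, 1, 2],
                              "b": ["x", "x", "y"]})

# True where the value repeats an earlier value in the same column
dup_mask = training_data.apply(lambda col: col.duplicated())
```

The plot call is then the same shape as the missing-value one: sns.heatmap(dup_mask, cbar=False).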
-
Generate two negative binomial distributed random variables with predefined correlation
Assume I have a negative binomial distributed variable X1 with NB(mu=MU1, size=s1) and a negative binomial distributed variable X2 with NB(mu=MU2, size=s2). I fitted a negative binomial regression to estimate the mu's and size's from my data.
I can use the rnbinom() function in R to generate random draws from these distributions:

X1model <- rnbinom(n=1000, mu=MU1fitted, size=s1fitted)
X2model <- rnbinom(n=1000, mu=MU2fitted, size=s2fitted)
Those draws are independent, however. How can I draw from these distributions so that they exhibit a predefined correlation r, namely the correlation I observe between my original data X1 and X2,
so that:
cor(X1,X2,method="spearman") = r = cor(X1model,X2model,method="spearman")
- or, even better, draw from them with any arbitrary preset correlation r?
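The question is posed in R, but the usual trick here, a Gaussian copula, is language-agnostic: draw correlated normals, push them through the normal CDF to get correlated uniforms, then through each negative binomial quantile function. A Python/SciPy sketch (the mu/size values are hypothetical stand-ins for the fitted parameters; SciPy's nbinom(n, p) relates to R's (mu, size) by n = size, p = size/(size+mu)):

```python
import numpy as np
from scipy.stats import norm, nbinom

rng = np.random.default_rng(0)
r = 0.7                                      # target (copula) correlation
mu1, size1 = 10.0, 2.0                       # hypothetical fitted parameters
mu2, size2 = 20.0, 3.0

# 1. correlated standard normals
z = rng.multivariate_normal([0.0, 0.0], [[1.0, r], [r, 1.0]], size=5000)
# 2. the normal CDF turns them into correlated uniforms
u = norm.cdf(z)
# 3. each quantile function turns a uniform into a negative binomial draw
x1 = nbinom.ppf(u[:, 0], size1, size1 / (size1 + mu1))
x2 = nbinom.ppf(u[:, 1], size2, size2 / (size2 + mu2))
```

The realized Spearman correlation lands near, not exactly at, r; hitting an arbitrary preset r exactly typically needs an iterative adjustment of the copula correlation (as in the NORTA approach).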
-
Correlation in excel for range of values
I have two columns x and y. I want to observe how the correlation changes when x is in 0-10, 10-20, 20-30, and so on.
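A sketch of one way to do this in pandas: bin x with pd.cut, then compute the x/y correlation within each bin (the frame below is made-up stand-in data):

```python
import numpy as np
import pandas as pd

# made-up stand-in for the two spreadsheet columns
rng = np.random.default_rng(0)
x = rng.uniform(0, 30, 300)
df = pd.DataFrame({"x": x, "y": x + rng.normal(scale=5, size=300)})

# assign each row to a 10-wide bin of x ...
df["bin"] = pd.cut(df["x"], bins=[0, 10, 20, 30])
# ... then correlate x and y inside each bin
per_bin = (df.groupby("bin", observed=True)[["x", "y"]]
             .apply(lambda g: g["x"].corr(g["y"])))
```

per_bin is a Series indexed by the intervals (0, 10], (10, 20], (20, 30], holding one correlation per range.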
-
regarding the cholesky decomposition of a crossproduct of a matrix
With respect to the following R implementation
Y %*% solve(chol(crossprod(Y)))
I can see that it performs a Cholesky decomposition of the cross-product Y'Y and then multiplies Y by the inverse of the resulting factor.
What is it used for in the data processing? I do not quite understand the underlying mechanism.
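Algebraically, R's chol() returns the upper-triangular R with Y'Y = R'R, so Q = Y %*% solve(R) satisfies Q'Q = R^{-T}(R'R)R^{-1} = I: the expression orthonormalizes the columns of Y (it is the Q factor of a thin QR decomposition, up to column signs), a common building block in least-squares and whitening computations. A NumPy sketch to check this (numpy's cholesky returns the lower factor, hence the transpose):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))

# R's chol(crossprod(Y)) is upper triangular with Y'Y = R'R
R = np.linalg.cholesky(Y.T @ Y).T
# the R expression Y %*% solve(chol(crossprod(Y)))
Q = Y @ np.linalg.inv(R)

# the columns of Q are orthonormal
print(np.allclose(Q.T @ Q, np.eye(3)))
```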