Overlay a normal distribution on a histogram of non-normally distributed values in ggplot (R)
I'm trying to overlay a normal bell curve on top of a histogram of these fake data, which are intentionally NOT normally distributed. My goal is to show other students how non-normally distributed data look in comparison to a normal distribution.
While I have figured out how to overlay the bell curve from other questions that have been asked, my y axis is acting strange. For a density plot, I would assume that the axis would go from 0 to 1, but for some values it says the density is 2 (see the screenshot below). I want bars that show the density and a bell curve that shows the normal distribution. Any help would be appreciated!
Here's the fake dataset:
library(dplyr)

tester2 <- tibble(
  fake = c(2, 2, 2, 2, 10, 10, 10, 10, 5, 3, 4, 5, 6, 7, 8, 9, 10, 10, 5, 2, 4, 5, 6, 7, 8, 4, 4, 5, 5, 2, 2, 2, 2, 2, 10, 10, 10, 10, 5, 2, 2, 2, 2, 2, 10, 10, 10, 10, 5, 2, 2, 2, 2, 2, 10, 10, 10, 10, 5, 2, 3, 4, 5, 5, 5, 5, 5, 4, 6, 5),
  also_fake = c(1, 2, 2, 2, 3, 3, 3.3, 4, 4, 5, 1, 2, 2, 2, 3, 3.6, 3, 4, 4, 5, 1, 2, 2, 2.1, 3, 3, 3, 4, 4, 5, 1, 2, 2, 2, 3.1, 3, 3, 4.6, 4, 5, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)
)
Here's my code so far:
testing <- ggplot(tester2, aes(x = also_fake)) +
  geom_histogram(aes(y = ..density..)) +
  geom_rug() +
  stat_function(fun = dnorm,
                color = "blue",
                args = list(mean = mean(tester2$also_fake),
                            sd = sd(tester2$also_fake)))
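An aside on the strange y axis (sketched in Python only because the point is language-agnostic; `density_histogram` is an illustrative helper, not ggplot's implementation): a density histogram is scaled so that height times bin width sums to 1, so bars can legitimately be taller than 1 whenever the bins are narrower than one unit. They are densities, not probabilities.

```python
def density_histogram(data, edges):
    """Bar heights scaled so that sum(height * bin_width) == 1.

    Bins are half-open [lo, hi), as in most histogram implementations.
    """
    n = len(data)
    heights = []
    for lo, hi in zip(edges, edges[1:]):
        count = sum(lo <= x < hi for x in data)
        heights.append(count / (n * (hi - lo)))
    return heights

# Four points packed into one 0.5-wide bin give a density of 2,
# yet the total area under the bars is still exactly 1.
heights = density_histogram([2.0, 2.1, 2.2, 2.3], [2.0, 2.5, 3.0])
```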
And here's what it produces:
EDIT: This question is different from this question because I do not want a density plot: Superimpose a normal distribution to a density using ggplot in R
It is also different from this question because my values are intentionally non-normally distributed: ggplot2: histogram with normal curve.
See also questions close to this topic

How to make a function that loops over two lists
I have an event A that is triggered when the majority of coin tosses in a series of tosses comes up heads. I have an unfair coin and I'd like to see how the likelihood of A changes as the number of tosses change and the probability in each toss changes.
This is my function, assuming 3 tosses:
n <- 3 # victory requires majority of tosses heads
# tosses only occur in odd intervals
k <- seq(n/2 + .5, n)
victory <- function(n, k, p){
  for (i in p) {
    x <- 0
    for (i in k) {
      x <- x + choose(n, k) * p^k * (1 - p)^(n - k)
    }
    z <- x
  }
  return(z)
}
p <- seq(0, 1, .1)
victory(n, k, p)
My hope is the victory() function would:
1 - find the probability of each of the outcomes where the majority of tosses are heads, given a particular value p
2 - sum up those probabilities and add them to a vector z
3 - go back and do the same thing given another probability p
I tested this with n <- 3, k <- c(2, 3) and p <- c(.5, .75), and the output was 0.75000, 0.84375. I know that the output should've been 0.625, 0.0984375.
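As a cross-check on the target quantity (a Python sketch, since the arithmetic is the point; `p_majority_heads` is my own illustrative name): the probability of a heads majority is the binomial upper-tail sum over k = floor(n/2)+1 through n.

```python
from math import comb

def p_majority_heads(n, p):
    """Probability that strictly more than half of n tosses land heads."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))
```

For n = 3 this sums the k = 2 and k = 3 terms, matching k <- c(2, 3) in the question.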
Exponentiation of Log Transformed Values in Mixed Effects Model
I have run a linear mixed-effects model in R using the nlme package in which my response variable (Proximal_Lead_Bowing) was transformed to log10 scale (Log_Bowing) due to a non-normal distribution of values. The estimated differences in Log_Bowing between different Deep Brain Stimulation Electrodes (DBS_Electrode), as estimated by the model using the glht() function for multiple comparisons of means (Tukey contrasts), are as follows (view screenshot for full glht() output: https://imgur.com/WVJ9KM6):
Linear Hypothesis:
Medtronic 3389 - Boston Scientific Versice == 0            Estimate: 0.5766*
St. Jude Medical Infinity - Boston Scientific Versice == 0  Estimate: 0.2208
St. Jude Medical Infinity - Medtronic 3389 == 0            Estimate: -0.3558*
*Denotes significance
Exponentiating these values (10^abs(Estimate)) provides me with the following estimates of the true differences in Proximal_Lead_Bowing as estimated by our mixed-effects model:
Linear Hypothesis:
Medtronic 3389 - Boston Scientific Versice == 0            3.77 (in millimeters)
St. Jude Medical Infinity - Boston Scientific Versice == 0  1.66
St. Jude Medical Infinity - Medtronic 3389 == 0            2.27
These values do not make sense considering that the average Proximal_Lead_Bowing ± 95% CI for each DBS_Electrode in the sample is as follows:
Boston Scientific Versice: 2.10 ± 0.67 (in millimeters)
Medtronic 3389: 2.95 ± 0.58
St. Jude Medical Infinity: 2.00 ± 0.35
Thus I would expect the true differences in Proximal_Lead_Bowing estimated by our linear mixed model to be approximately 1.0 mm between the Medtronic 3389 and the other DBS_Electrode models, but the exponentiated values I have calculated don't seem to make sense. Am I missing something in the process of exponentiating log10 values and/or in the use of the glht() function for multiple comparisons of means? Any feedback would be appreciated.
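One arithmetic point worth checking (shown in Python purely for the arithmetic; the values are taken from the question): back-transforming a difference of log10 means gives a ratio of geometric means, not an additive difference in millimeters.

```python
from math import log10, isclose

# For any positive a, b: 10**(log10(a) - log10(b)) == a / b.
a, b = 2.95, 2.10                      # sample means (mm) from the question
ratio = 10 ** (log10(a) - log10(b))
assert isclose(ratio, a / b)

# So the glht estimate of 0.5766 on the log10 scale corresponds to a
# roughly 3.77-fold multiplicative difference, not a 3.77 mm difference.
fold_change = 10 ** 0.5766
```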

What kind of statistical method should I use to test enrichment or overrepresentation for a rank-ordered vector with binary status?
I have gene expression data from 1065 different cell lines for, let's say, the BRAF gene. The BRAF gene expression levels are ordered. Most TP53-mutated cell lines have high BRAF expression (see the figure below). So what kind of statistical method should I use to test the enrichment or overrepresentation of TP53 status (WT vs Mutant) along BRAF expression?
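One commonly used option for exactly this setup (my suggestion, not something stated in the question) is a rank-based two-sample test: compare the BRAF-expression ranks of TP53-mutant vs wild-type lines with a Mann-Whitney/Wilcoxon statistic, or use a GSEA-style enrichment score. A minimal pure-Python version of the U statistic, for illustration only; in practice scipy.stats.mannwhitneyu also supplies the p-value:

```python
def mann_whitney_u(x, y):
    """U statistic for sample x against sample y; ties count 1/2."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u
```

If mutant lines really sit at the top of the ranking, U approaches len(x) * len(y), its maximum.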

Geom_sf does not use geometry coordinates in axes but plots correct shape of polygon?
My overall aim is to combine multiple shape files (polygons of river sub-basins from within a large river basin) into one file and plot it as a map. This combined file will later be joined with variable data (e.g. rainfall) and plotted via aes().
My problem is: ggplot() + geom_sf() plots the correct shapes of the polygons but doesn't have the correct coordinates on the axes: it doesn't use the values given in the geometry column.
My thoughts on what is wrong (though I'm not sure how to correct it):
- The shape file read in has geometry in 'long'/'lat' (crs = 4326) but the crs says the coordinates are in UTM Zone 48N WGS84 (crs = 32648). If I try to force the crs to 4326, the coordinate values change as if the conversion formula is trying to correct them.
- geom_sf and coord_sf are doing something that I don't understand!
library(sp)
library(raster)
library(ggplot2)
library(sf)
library(ggsf)
library(rgdal)
library(plyr)
library(dplyr)
library(purrr)

setwd("/Users/.../Sub_Basin_Outlines_withSdata/")
list.files('/Users/.../Sub_Basin_Outlines_withSdata/', pattern='\\.shp$')

Read in individual polygon shape files from the folder. Combine with an ID.

bangsai <- st_read("./without_S_data/", "Nam Bang Sai")
BasinID <- "BGS"
bangsai <- cbind(bangsai, BasinID)

ing <- st_read("./without_S_data/", "Nam Ing Outline")
BasinID <- "ING"
ing <- cbind(ing, BasinID)
The two individual shape files import as simple features; see the image of the R output. Combine the individual sub-basin polygon shape files into one shapefile with multiple features:
all_sub_basins < rbind(bangsai,ing)
The image shows the values of the coordinates of the polygons/features in all_sub_basins$geometry. They are in long/lat format, yet the proj4string suggests UTM?
Plot the all_sub_basins simple feature shapefile in ggplot:
subbasins <- ggplot() +
  geom_sf(data = all_sub_basins, colour = "red", fill = NA)
subbasins
The result is a correctly plotted shape file with multiple features (there are more polygons in this image than read in above). However, the axes are incorrect (nonsense values) and do not show the same values as in the geometry field.
If I add in coord_sf and confirm the crs:
subbasins <- ggplot() +
  geom_sf(data = all_sub_basins, colour = "red", fill = NA) +
  coord_sf(datum = st_crs(32648), xlim = c(94, 110), ylim = c(9, 34))
subbasins
then I get the correct axes values, but not as coordinates with N and E. It seems as if the geometry isn't recognised as coordinates, just as plain numbers?
I don't mind if the coordinates are UTM Zone 48N or lat long. Could I fix it in any of these ways? If so, how do I achieve that?
- Change the shape file crs without changing the values in the geometry column, so geom_sf would know to plot the correct axes text.
- Extract the geometry from the shape file into a two-column .csv file with long and lat columns, convert the csv into an sf object, and create my own shape file with the correct crs.
- As a last resort, leave the plot as it is and replace the axes text manually.
Any help is much appreciated!

How do I force ggplot to use the ordered x axis
I have this data called test.melted below. I also have code to plot this data, but it doesn't plot the x-axis values in order (the x axis should be 100pc, 95pc, 90pc and so on). How can I fix this? I also wanted to plot a line instead of points, but changing geom_point to geom_line gives a blank plot.
data:
test.melted <- structure(list(
  `diluted sample` = c("100pc", "95pc", "90pc", "85pc", "0pc", "100pc", "95pc", "90pc", "85pc", "0pc", "100pc", "95pc", "90pc", "85pc"),
  variable = c(" of self", " of self", " of self", " of self", " of self", " with NA12878", " with NA12878", " with NA12878", " with NA12878", " with NA12878", " with NA12877", " with NA12877", " with NA12877", " with NA12877"),
  value = c(0.96, 0.87, 0.78, 0.71, 0.96, 1.13, 1.03, 0.98, 0.96, 0, 0, 0.03, 0.07, 0.14)
), .Names = c("diluted sample", "variable", "value"),
row.names = c(1L, 2L, 3L, 4L, 21L, 22L, 23L, 24L, 25L, 42L, 43L, 44L, 45L, 46L),
class = "data.frame")
code:
p = ggplot(test.melted, aes(x = `diluted sample`, y = value, color = variable))
p + geom_point()
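The root cause is that strings like "100pc" sort alphabetically unless an explicit order is supplied; in ggplot this is done by making the column a factor, e.g. factor(`diluted sample`, levels = c("100pc", "95pc", "90pc", "85pc", "0pc")). The same idea, sketched in Python for illustration:

```python
# Define the desired level order explicitly, then sort by position in it.
desired = ["100pc", "95pc", "90pc", "85pc", "0pc"]
observed = ["0pc", "100pc", "85pc", "90pc", "95pc"]   # default string sort
reordered = sorted(observed, key=desired.index)
```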

Log and break in y axis (ggplot2)
I have this graph
Code:
library("tidyverse")
library("scales")

# data
head(Vesself, n = 20L)
   AREA VESSELm VESSEL Clust
1   A10       5      1     4
2   A13       5      1     4
3   A16       5      1     4
4    A2       5      2     4
5   A23       5      1     4
6   A25       3      2     4
7   A25       5      5     4
8   A26       5      5     4
9   A26       3      2     4
10  A26       2      1     4
11  A27       5      1     4
12  A28       3      1     4
13  A28       5      6     4
14  A36       3      1     4
15  A39       5      1     2
16  A43       5      5     2
17  B25       5      1     4
18  B25       3      1     4
19  B26       3      1     4
20  B26       5      2     4

my_breaksx = c(1, 4, 16, 64, 256, 660)

# Plot
ggHist <- ggplot(data = Vesself, aes(VESSEL, color = Clust, fill = Clust)) +
  geom_bar(stat = "count", width = 0.08) +
  scale_color_manual(values = cols, name = "Group") +
  scale_fill_manual(values = cols, name = "Group") +
  scale_x_continuous(trans = log2_trans(), breaks = my_breaksx) +
  labs(x = "Density of ships per area", y = "Number of area",
       title = "Distribution of ship density", subtitle = "by scales") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        legend.background = element_rect(fill = "grey90", size = 0.5,
                                         linetype = "solid", colour = "black"),
        aspect.ratio = 1) +
  facet_wrap(~VESSELm)
ggHist
When I try to apply a logarithmic transformation to the y axis, I don't get the same result as on the x axis. The values are incredibly high, and I don't understand why.
The result of the transformation without manual breaks:
scale_y_continuous(trans = log2_trans())
And the result with manual breaks:
my_breaksy = c(1, 4, 16, 64, 150)
scale_y_continuous(trans = log2_trans(), breaks = my_breaksy)
My goal is to get a representation on the y axis equivalent to the one on the x axis.

How to Isolate connected and fused characters from a text image using opencv
I have been developing an OCR system. I was able to segment out lines and words from the images fairly easily, but I am completely stuck with character segmentation. How do I extract characters from these blocks? I have read many answers on forums, but they were applied to text with enough spacing between characters. I have tried the histogram method to isolate characters from the words, but the results are not satisfactory. My words look like this: image with characters touching each other, and one more sample. The thresholding process only adds to the problem by fusing the characters further. How can I segment these characters? This is what I have done so far; the function accepts a 2D list containing words of size 500*500 pixels to be segmented:
import cv2
import numpy as np
from cv2.ximgproc import THINNING_ZHANGSUEN

def character_detector(words):
    characters = []
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    for i in words:
        j = i
        l = []
        for k in j:
            s = k
            f = k
            s = cv2.ximgproc.thinning(k, s, thinningType=THINNING_ZHANGSUEN)
            s = cv2.dilate(s, kernel, iterations=3)
            k = cv2.threshold(k, 75, 255, cv2.THRESH_OTSU)[1]
            m = cv2.Canny(k, 100, 200)
            k = k - m
            i = 0
            while i < 5:
                m = cv2.Canny(k, 100, 200)
                k = k - m
                i += 1
            y_sum = cv2.reduce(k, 0, cv2.REDUCE_AVG)
            y_sum = y_sum[0]
            y_avg = sum(y_sum) // 500
            hist = []
            for i in range(0, 500):
                if y_sum[i] == 0:
                    hist.append(False)
                else:
                    hist.append(True)
            j = 1
            y_start = 0
            y_end = 0
            y_coord = []
            i = 0
            while i < 500:
                j = 1
                if not hist[i]:
                    i = i + 1
                    continue
                else:
                    y_start = i
                    temp = i
                    j = 0
                    while temp < 500 and hist[temp]:
                        temp += 1
                        j += 1
                    i = i + j
                    y_end = y_start + j
                    y_coord.append((y_start, y_end))
            for i in range(0, len(y_coord), 1):
                roi = f[0:500, y_coord[i][0]:y_coord[i][1]]
                cv2.imshow('thresh', roi)
                cv2.waitKey(0)
    cv2.destroyAllWindows()

python: opencv comparing histograms gives weird results
I'm using built-in opencv functions to open an image, remove the background, crop the image, and then calculate the histogram of the file, to compare it with the histogram of a different file. To compare histograms I'm using the BGR color space with the function:
cv2.compareHist(hist_1, hist_2, cv2.HISTCMP_CORREL)
My code is
import cv2

def cv_histogram(image, channels=[0, 1, 2], hist_size=[10, 10, 10],
                 hist_range=[0, 256, 0, 256, 0, 256], hist_type='BGR'):
    # convert to a different color space if needed
    if hist_type == 'HSV':
        image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    elif hist_type == 'GRAY':
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    elif hist_type == 'RGB':
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image_hist = cv2.calcHist([image], channels, None, hist_size, hist_range)
    image_hist = cv2.normalize(image_hist, image_hist).flatten()
    return image_hist

def cv_compare_images_histogram(img_base, img_compare, method='correlation'):
    hist_1 = cv_histogram(img_base)
    hist_2 = cv_histogram(img_compare)
    if method == "intersection":
        comparison = cv2.compareHist(hist_1, hist_2, cv2.HISTCMP_INTERSECT)
    else:
        comparison = cv2.compareHist(hist_1, hist_2, cv2.HISTCMP_CORREL)
    return comparison

im1 = image_remove_background(cv2.imread("1.jpg"), bg_lower_bgr, bg_upper_bgr)
im2 = image_remove_background(cv2.imread("2.jpg"), bg_lower_bgr, bg_upper_bgr)
sim = cv_compare_images_histogram(im1, im2)
img_new = image_stack(im1, im2)
cv2.imshow('img_new', img_new)
print("Histogram similarity is: ", sim)
As the screenshot below shows, the images have different colors/objects, but I receive a very high correlation: 0.9198019904818888.
The script works perfectly for most files; any idea why the results are so weird?
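One possible explanation to check (an assumption on my part, not established by the question): after background removal, both histograms may be dominated by a single "background" bin, and Pearson correlation is then driven almost entirely by that shared bin. A stdlib sketch of the effect, with `pearson` and the toy histograms being my own illustrative constructions:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy 5-bin histograms: one dominant shared bin (e.g. the colour left
# behind by masking), the remaining bins completely permuted.
hist_a = [900, 10, 40, 30, 20]
hist_b = [900, 40, 10, 20, 30]
```

pearson(hist_a, hist_b) comes out above 0.99 even though the non-dominant bins disagree, while dropping the shared bin makes the correlation negative. Checking whether the masked-out background is being counted by calcHist (its mask argument is None here) would rule this in or out.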

Python: Unable to save matplotlib hist2d plot to file using PdfPages
I'm trying to save multiple 2D histograms generated using hist2d to a multipage pdf created with PdfPages, using the following code:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import warnings
import subprocess
import os

warnings.simplefilter("ignore", category=PendingDeprecationWarning)

x1 = np.random.randn(100000)
y1 = np.random.randn(100000) + 5

pp = PdfPages("somepdf.pdf")

fig = plt.figure()
plt.hist2d(x=x1, y=y2, bins=50)
plt.title(row['smRNAname'])
plt.xlabel("Position(BP)")
plt.ylabel("Read Length")
cb = plt.colorbar()
cb.set_label('counts in bin')
pp.savefig(fig, dpi=300, transparent=True)
plt.close()

fig = plt.figure()
fig = plt.hist2d(x=x1, y=y1, bins=50)
plt.title(row['PIWIname'])
plt.xlabel("Position(BP)")
plt.ylabel("Read Length")
cb = plt.colorbar()
cb.set_label('counts in bin')
pp.savefig(fig, dpi=300, transparent=True)
plt.close()

pp.close()
but I'm getting the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-87-ccbb61958687> in <module>()
     61 cb = plt.colorbar()
     62 cb.set_label('counts in bin')
---> 63 pp.savefig(fig, dpi=300, transparent = True)
     64 plt.close()
     65

/anaconda3/lib/python3.7/site-packages/matplotlib/backends/backend_pdf.py in savefig(self, figure, **kwargs)
   2519             manager = Gcf.get_active()
   2520         else:
-> 2521             manager = Gcf.get_fig_manager(figure)
   2522         if manager is None:
   2523             raise ValueError("No figure {}".format(figure))

/anaconda3/lib/python3.7/site-packages/matplotlib/_pylab_helpers.py in get_fig_manager(cls, num)
     39         figure and return the manager; otherwise return *None*.
     40         """
---> 41         manager = cls.figs.get(num, None)
     42         if manager is not None:
     43             cls.set_active(manager)

TypeError: unhashable type: 'numpy.ndarray'
Judging from the error itself, I understand that it might be due to the fact that hist2d returns a 2D array instead of a reference to a figure(?). Saving the histogram directly using plt.savefig("test.pdf") works just fine. I'm not sure what I'm doing wrong, or is it just not possible?

Use bar graphs as parameter filter in Tableau
I've created a parameter that changes a top-10 list based on the selected measure. I also created a sheet with all the measures as bar graphs, as shown in the image.
Is it possible for me to create some kind of filter where selecting a measure's bar changes the parameter and the list?
Also, are there any other cool ways to have a selected box or container change the parameter?
Appreciate your time
Parameter:
Bars:
Bars2:

Link filters to queries on Superset
I have created a visualisation in Apache Superset based on a Saved Query. How can I update the query based on the values filtered within a Filter Box?
I have experimented with Jinja and managed to pass hardcoded variables to my query through the template parameters. Now I just need to connect Jinja to the Filter Box such that the values are obtained through the filter rather than hard coded.

Duplicated links between sectors Circlize Package R
I'm trying to use the circlize package for circular visualisation, but unfortunately it duplicates the links between sectors. This is my data set, from which I want to see if different domains share the same Session ID:
mat <- structure(list(
  Website = c("domain1", "domain2", "domain3", "domain2", "domain4", "domain1", "domain2"),
  ClientID = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzz", "zzz"),
  SessionId = c("d.0686", "d.0686", "f.1871", "f.1871", "f.1871", "n.9210", "n.9210")
), .Names = c("Website", "ClientID", "SessionId"), row.names = c(NA, -7L), class = "data.frame")

domains <- unique(mat$Website)
output <- matrix(0, length(domains), length(domains))
colnames(output) <- rownames(output) <- domains
for (x in domains) {
  X <- unique(mat[mat$Website == x, 'SessionId'])
  for (y in domains) {
    Y <- unique(mat[mat$Website == y, 'SessionId'])
    output[rownames(output) == x, y] <- length(intersect(X, Y))
  }
}
Then I plot the resulting matrix using chordDiagram():
chordDiagram(output, annotationTrack = "grid", preAllocateTracks = 1,
             transparency = 0.5, self.link = 1, symmetric = FALSE)
circos.track(track.index = 1, panel.fun = function(x, y) {
  circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index,
              facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5), cex = 0.5)
}, bg.border = NA)  # here setting bg.border to NA is important
Unfortunately the circlize package is giving me duplicated links, as you can see below.
I don't understand why it duplicates the links from the matrix instead of merging them together. Is it because of my data (is the matrix correct for circlize), or because of missing parameters in the circlize plotting call?
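A hedged observation (my reading of the code, not something stated in the question): the nested loop fills both output[x, y] and output[y, x] with the same session overlap, so the matrix is fully symmetric, and chordDiagram draws one link per non-zero cell, i.e. two per pair of domains; zeroing one triangle of the matrix, or passing symmetric = TRUE, is the usual fix. Reproducing the overlap computation in Python to show the symmetry:

```python
# Session sets per domain, transcribed from the `mat` data frame above.
sessions = {
    "domain1": {"d.0686", "n.9210"},
    "domain2": {"d.0686", "f.1871", "n.9210"},
    "domain3": {"f.1871"},
    "domain4": {"f.1871"},
}

# Same double loop as the R code: every ORDERED pair gets a count,
# so (a, b) and (b, a) are filled with identical values.
overlap = {(a, b): len(sa & sb)
           for a, sa in sessions.items()
           for b, sb in sessions.items()}
```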

Area Under Half Normal Distribution is too big
I have a project where I need to look at what I think is a half-normal distribution.
x is roughly normally distributed and centered at 0. What I need to look at is y = |x|.
I've never used half-normal distributions before.
To get more familiar with them, I've been experimenting with both R and Excel.
I created a standard normal distribution and verified that the area under the curve is one.
Then, I doubled those probabilities for z scores greater than or equal to 0.
I verified this method by thinking about it, and by using the R package "extraDistr".
Now the rub is that the area under the half normal is equal to 1.4. Here is a simple table to illustrate my problem:
    Z   norm   half
-6.00   0.00   0.00
-5.00   0.00   0.00
-4.00   0.00   0.00
-3.00   0.00   0.00
-2.00   0.05   0.00
-1.00   0.24   0.00
 0.00   0.40   0.80
 1.00   0.24   0.48
 2.00   0.05   0.11
 3.00   0.00   0.01
 4.00   0.00   0.00
 5.00   0.00   0.00
 6.00   0.00   0.00
I am pretty certain that the area under any pdf should not exceed 1.
Can someone please enlighten me? Best of luck, thanks!
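A sketch of what is likely going on (stdlib Python, with my own helper names): the half-normal pdf integrates to exactly 1 over [0, ∞), but summing pdf values at integer z scores, which is what the table does, is a Riemann sum with step size 1; for a steeply decreasing function that overshoots, landing near 1.4. A finer grid recovers an area of 1:

```python
from math import exp, pi, sqrt

def half_normal_pdf(z):
    """Twice the standard normal pdf; valid for z >= 0."""
    return 2.0 * exp(-z * z / 2.0) / sqrt(2.0 * pi)

def trapezoid_area(f, a, b, n):
    """Trapezoid-rule integral of f over [a, b] with n steps."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

coarse = sum(half_normal_pdf(z) for z in range(0, 7))    # step 1, like the table
fine = trapezoid_area(half_normal_pdf, 0.0, 8.0, 8000)   # step 0.001
```

`coarse` reproduces the table's total of about 1.4, while `fine` is 1 to within rounding, so nothing is wrong with the half-normal itself, only with the step size of the sum.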

Python function to draw random numbers based on chosen distribution
I am writing a function that will draw random numbers from a given distribution such as Poisson, Normal, or Binomial. It takes one argument for the number of samples and a second argument for the type of distribution, and accepts additional parameters based on the distribution chosen; for Normal samples, that would be the mean and sd.
Is there an optimal way of writing this?
My code
import matplotlib.pyplot as plt
import numpy as np

def randNumberDistribution(samples, distribution, *optional):
    if distribution.capitalize() == 'Normal':
        if len(optional) == 2:
            mean, sd = optional
            s = np.random.normal(mean, sd, samples)
            print(s)
            count, bins, ignored = plt.hist(s, 20, density=True)
            plt.plot(bins,
                     1/(sd * np.sqrt(2 * np.pi)) *
                     np.exp(-(bins - mean)**2 / (2 * sd**2)),
                     linewidth=3, color='y')
            plt.show()
        else:
            print("Invalid number of arguments")
    if distribution.capitalize() == 'Binomial':
        if len(optional) == 2:
            numOfTrials, probSuccess = optional  # number of trials, probability of success (each trial)
            s = np.random.binomial(numOfTrials, probSuccess, samples)
            count, bins, ignored = plt.hist(s, 14, density=True)
        else:
            print("Invalid number of arguments")
    if distribution.capitalize() == 'Poisson':
        if len(optional) == 1:
            exp, = optional  # expectation of interval (should be >= 0)
            s = np.random.poisson(exp, samples)
            count, bins, ignored = plt.hist(s, 14, density=True)
        else:
            print("Invalid number of arguments")

print(randNumberDistribution(5, 'Poisson', 5))
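One tidier structure to consider (a sketch under assumptions: it uses the stdlib random module as a stand-in for numpy, and the `SAMPLERS` registry with only two entries is hypothetical): map each distribution name to a sampling function, so adding a distribution is one dict entry rather than another if branch, and a wrong parameter count fails naturally via the sampler's signature.

```python
import random

# Hypothetical registry; extend with e.g. numpy-backed samplers as needed.
SAMPLERS = {
    "normal":  lambda mean, sd: random.gauss(mean, sd),
    "uniform": lambda low, high: random.uniform(low, high),
}

def draw(samples, distribution, *params):
    """Draw `samples` values from the named distribution."""
    try:
        sampler = SAMPLERS[distribution.lower()]
    except KeyError:
        raise ValueError("unknown distribution: " + distribution)
    return [sampler(*params) for _ in range(samples)]
```

Usage mirrors the original: draw(5, "Normal", 0.0, 1.0) returns five samples, and an unknown name raises ValueError instead of silently printing.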

Calculating the centroids of two superposed gaussian functions
I am trying to find a solution to the following problem. I have a set of points which should model a sum of 2 Gaussian functions centered at different points, and I need to find these two centers. Up to now my approach has been to find the centroid of the whole set and cut the data below and above it; then I calculate the centroid of each piece, and those are my centers. This approach, however, cuts off the part of, say, the left Gaussian that leaks into the right half of the data, which makes the procedure fail when the Gaussians are close together. Is there a way to do this more intelligently? Due to the computational difficulty, I would prefer a solution that doesn't involve curve fitting.