Overlay a normal distribution on a histogram of non-normally distributed values in ggplot (R)
I'm trying to overlay a normal bell curve on top of the histogram of these fake data that are intentionally NOT normally distributed. My goal is to show other students how non-normally distributed data look in comparison to a normal distribution.
While I have figured out how to add the bell curve from other questions that have been asked, my y-axis is acting strangely. For a density plot, I assumed the axis would run from 0 to 1, but for some values it says the density is 2 (see the screenshot below). I want bars that show the density and a bell curve that shows the normal distribution. Any help would be appreciated!
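On the y-axis point: a density is not a probability, so it is not capped at 1. Only the total area of the bars (height times bin width) adds up to 1, which means bins narrower than one unit can produce bars taller than 1. A minimal base R sketch of that idea, assuming a small sample `x` and 30 equal-width bins (both chosen for illustration, not the exact setup of the plot):

```r
# Hypothetical sample: the first ten values of also_fake.
x <- c(1, 2, 2, 2, 3, 3, 3.3, 4, 4, 5)

# 30 equal-width bins across the data range (bin width = 4 / 30).
breaks <- seq(min(x), max(x), length.out = 31)
h <- hist(x, breaks = breaks, plot = FALSE)

# Density = count / (n * bin width), so narrow bins give tall bars.
bin_width <- diff(h$breaks)
total_area <- sum(h$density * bin_width)
tallest_bar <- max(h$density)

print(total_area)   # the bar areas add up to 1
print(tallest_bar)  # a single bar can still be taller than 1
```

If this reading is right, a density of 2 on the axis is expected behaviour for narrow bins rather than a plotting error.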
Here's the fake dataset:
library(dplyr)

tester2 <- tibble(
  fake = c(
    2, 2, 2, 2, 10, 10, 10, 10, 5, 3,
    4, 5, 6, 7, 8, 9, 10, 10, 5, 2,
    4, 5, 6, 7, 8, 4, 4, 5, 5, 2,
    2, 2, 2, 2, 10, 10, 10, 10, 5, 2,
    2, 2, 2, 2, 10, 10, 10, 10, 5, 2,
    2, 2, 2, 2, 10, 10, 10, 10, 5, 2,
    3, 4, 5, 5, 5, 5, 5, 4, 6, 5
  ),
  also_fake = c(
    1, 2, 2, 2, 3, 3, 3.3, 4, 4, 5,
    1, 2, 2, 2, 3, 3.6, 3, 4, 4, 5,
    1, 2, 2, 2.1, 3, 3, 3, 4, 4, 5,
    1, 2, 2, 2, 3.1, 3, 3, 4.6, 4, 5,
    1, 2, 2, 2, 3, 3, 3, 4, 4, 5,
    1, 2, 2, 2, 3, 3, 3, 4, 4, 5,
    1, 2, 2, 2, 3, 3, 3, 4, 4, 5
  )
)
Here's my code so far:
library(ggplot2)

testing <- ggplot(tester2, aes(x = also_fake)) +
  geom_histogram(aes(y = ..density..)) +
  geom_rug() +
  stat_function(
    fun = dnorm,
    color = "blue",
    args = list(mean = mean(tester2$also_fake), sd = sd(tester2$also_fake))
  )
And here's what it produces:
EDIT: This question is different from this question because I do not want a density plot: Superimpose a normal distribution to a density using ggplot in R
It is also different from this question because my values are intentionally non-normally distributed: ggplot2: histogram with normal curve.
See also questions close to this topic
- How do I apply a scale to my y-axis when it includes negatives?
Dictionary-like matching on string in R
I have a dataframe in which a string variable is an informal list of elements that can be split on a symbol. I would like to perform operations on these elements based on another dataset.
e.g. task: Calculate the sum of the elements
df_1 <- data.frame(element = c(1:2), groups = c("A,B,C", "A,D"))
df_2 <- data.frame(groups = c("A", "B", "C", "D"), values = c(1:4))
desired <- data.frame(element = c(1:2), groups = c("A,B,C", "A,D"), sum = c(6, 5))
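One possible way to get the desired sums, sketched in base R: turn `df_2` into a named lookup vector, split each `groups` string on the comma, and sum the matched values. The name `lookup` is introduced here for illustration only.

```r
df_1 <- data.frame(element = 1:2, groups = c("A,B,C", "A,D"),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(groups = c("A", "B", "C", "D"), values = 1:4,
                   stringsAsFactors = FALSE)

# Named vector acting as a dictionary: key = group, value = its number.
lookup <- setNames(as.numeric(df_2$values), df_2$groups)

# Split each string into keys, look the keys up, and add the values.
keys_per_row <- strsplit(df_1$groups, ",", fixed = TRUE)
df_1$sum <- vapply(keys_per_row, function(keys) sum(lookup[keys]), numeric(1))

print(df_1)
```

A key that is missing from `df_2` would yield `NA` for that row, which may or may not be the behaviour you want.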
How to test code I added to a cloned GitHub repo of an R package?
I have a general question about best practices for writing and testing code that is from a Github repo (package hosted on CRAN), specifically for very large functions (300+ lines) which cannot be run independently of other functions from the same package.
So far, I:
1. Cloned the repo of the "package" from GitHub.
2. Using a text editor, opened the .R files within the cloned package repo (let's call this "package-dev").
3. Added a few lines of code to an existing function and saved the .R file within "package-dev".
4. ...? (I want to test the function I added code to.)
5. Git add, commit, push.
Regarding step 4., I simply want to test the function (literally by calling it) with my newly-added changes, even before running any unit tests. However, this function is part of a huge wrapper function and a larger pipeline, and requires package-specific inputs. Therefore, my plan is to load this package "package-dev" and run through the functions in this pipeline.
Is this the correct way to test code you're contributing to that is part of a package? i.e. load the local version of the package, "package-dev", whose code you made changes to, and run functions from it?
I tried re-installing "package-dev" from a different file path using devtools with the intention of testing the functions which now have my snippets of code added to them, but seem to be having issues, possibly because of the same name(s).
.onLoad failed in loadNamespace() for 'rJava', details: call: inDL(x, as.logical(local), as.logical(now), ...) error: unable to load shared object 'path-to-rJava': LoadLibrary failure: %1 is not a valid Win32 application.
What do people normally do for their Step 4?
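One common approach for this step, sketched under assumptions: `devtools::load_all()` loads a package from its source directory without installing it, so edited functions can be called directly. The path is a placeholder, and the guard only keeps the snippet harmless when devtools or the directory is missing.

```r
# Hypothetical path to the cloned repository; adjust to your checkout.
pkg_path <- "path/to/package-dev"

if (requireNamespace("devtools", quietly = TRUE) && dir.exists(pkg_path)) {
  # Loads the package from source so the edited functions can be
  # called right away, without a separate install step.
  devtools::load_all(pkg_path)
} else {
  message("devtools or the package directory is not available; nothing loaded.")
}
```

This does not address the rJava error itself, which looks like a separate loading problem.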
ggplot `expand_scale()` for the axes - inconsistent
library(tidyverse)

ggplot(mtcars) +
  geom_bar(aes(x = factor(cyl))) +
  scale_y_continuous(expand = expand_scale(mult = c(0, 0)))
My issue seems to be that ggplot expand_scale() is not consistent in its behavior. But that statement is probably incorrect. Let's start with the plot above as our baseline and dig into this.
If I understand the argument correctly, mult = c(X, Y) lets me expand ggplot scales X% below the plot and Y% above the plot. That's what I get with the code below.
ggplot(mtcars) +
  geom_bar(aes(x = factor(cyl))) +
  scale_y_continuous(expand = expand_scale(mult = c(1, 0)))

ggplot(mpg %>% filter(displ > 6, displ < 8), aes(displ, cty)) +
  geom_point() +
  facet_grid(vars(drv), vars(cyl)) +
  geom_text(aes(label = trans)) +
  scale_x_continuous(expand = c(0, 0)) +
  coord_cartesian(clip = "off")
Here's the next baseline I want to work off for examples three and four.
ggplot(mpg %>% filter(displ > 6, displ < 8), aes(displ, cty)) +
  geom_point() +
  facet_grid(vars(drv), vars(cyl)) +
  geom_text(aes(label = trans)) +
  scale_x_continuous(expand = c(1, 0)) +
  coord_cartesian(clip = "off")
Using the same logic as in example one, I'd think mult = c(X, Y) lets me expand ggplot scales X% to the left of the plot and Y% to the right of the plot. BUT, my scale_x_continuous(expand = c(1, 0)) doesn't seem to expand the scale 1 = 100% to the left of the plot and 0 = 0% to the right of the plot.
scale_x_continuous(expand = c(1, 0)) instead puts some extra space to the left of the plot and a lot more extra space to the right of the plot?
What is happening? Why?
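As far as I can tell from the ggplot2 documentation, a plain length-two vector passed to `expand` is read as `c(mult, add)` and applied to both ends of the axis, whereas `expand_scale(mult = c(X, Y))` sets the lower and upper multipliers separately. Under that reading, `expand = c(1, 0)` and `expand_scale(mult = c(1, 0))` are different requests. The sketch below mimics the arithmetic in base R; `expanded_range` is a hypothetical helper, not a ggplot2 function, and the limits are made up.

```r
# Sketch of how continuous limits grow under multiplicative and
# additive expansion. Hypothetical helper, not part of ggplot2.
expanded_range <- function(limits, mult = c(0, 0), add = c(0, 0)) {
  mult <- rep_len(mult, 2)  # lower, upper
  add <- rep_len(add, 2)    # lower, upper
  width <- diff(limits)
  c(limits[1] - mult[1] * width - add[1],
    limits[2] + mult[2] * width + add[2])
}

limits <- c(6, 7)  # illustrative data range

# expand_scale(mult = c(1, 0)): 100% of the range below, nothing above.
one_sided <- expanded_range(limits, mult = c(1, 0))

# expand = c(1, 0) read as c(mult, add): mult = 1, add = 0 on BOTH sides.
both_sides <- expanded_range(limits, mult = 1, add = 0)

print(one_sided)   # 5 7
print(both_sides)  # 5 8
```

Any remaining asymmetry in the real plot could come from other layers or facets, which this sketch does not model.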
ggplot2: How to indicate data subsets in a time series geom_bar plot?
I have a bar plot based on a time series of dependent variable observations. However, I would also like to include in the graph some indication on the subsets of the data. The subsets are defined by explanatory variables that do not completely correspond to the dependent variable.
require(dplyr)
require(ggplot2)

df <- data.frame(year = 1995:2020) %>%
  mutate(
    values = runif(26, 0, 1),
    dumOne = case_when(year %in% 2000:2010 ~ 1, TRUE ~ 0),
    dumTwo = case_when(year %in% 2003:2009 ~ 1, TRUE ~ 0)
  )

ggplot(df, aes(year, values)) +
  geom_bar(stat = "identity")
To this graph I would like to add horizontal lines that correspond to the variables dumOne and dumTwo, and possibly some explanatory text. Any ideas how I can achieve this?
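One way to prepare such lines is to reduce each dummy variable to the start and end year of its active period, giving one row per horizontal segment. A base R sketch; `flag_spans` is a hypothetical helper, and the data frame is rebuilt without dplyr so the snippet stands alone.

```r
# Same structure as the question's data, built without dplyr.
df <- data.frame(year = 1995:2020)
df$dumOne <- as.numeric(df$year %in% 2000:2010)
df$dumTwo <- as.numeric(df$year %in% 2003:2009)

# Hypothetical helper: start and end year of each run where flag == 1.
flag_spans <- function(year, flag, label) {
  runs <- rle(flag)
  end <- cumsum(runs$lengths)
  start <- end - runs$lengths + 1
  keep <- runs$values == 1
  data.frame(label = rep(label, sum(keep)),
             xmin = year[start[keep]],
             xmax = year[end[keep]])
}

spans <- rbind(
  flag_spans(df$year, df$dumOne, "dumOne"),
  flag_spans(df$year, df$dumTwo, "dumTwo")
)
print(spans)
```

The `spans` data frame could then be passed as the `data` argument of `geom_segment()` at a height of your choosing, with `geom_text()` for the explanatory labels.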
How to change histogram of an image?
I want to compute the histogram of my input_image, do some processing on it, and then apply the new histogram to the input_image. How can I apply the new histogram to the image?
Modifying the histogram curve for positive x
I have some histogram code as follows:
plt.subplots(figsize=(10, 8), dpi=100)
sns.distplot(x1, color='k', label='a', norm_hist=True)
sns.distplot(x2, color='g', label='b', norm_hist=True)
sns.distplot(x3, color='b', label='b', norm_hist=True)
sns.distplot(x4, color='r', label='c', norm_hist=True)
sns.distplot(x5, color='y', label='c', norm_hist=True)
This works, but what I'm really trying to do is fit the curve only on positive x values. A negative duration doesn't make physical sense. Is there an option for that?
- How to find x-value of highest peak in histogram?
Python Altair - Add Categorical Circles to Heatmap Visualization
I am following the following tutorial : https://altair-viz.github.io/gallery/interactive_cross_highlight.html
I have this up and running no problem, and even have it working on my data to an extent. However, I am having difficulty adjusting it.
I'm curious, can those circles be replaced by ANOTHER categorical variable? For example :
Instead of "records in selection" imagine that there is a "Trustworthy Score" of between 1-5 for each entry. So we want the circle to display instead the average of this "trust" column for all of the records, rather than the raw count.
Can this be done, or is this getting too complex?
Tl;dr Want the circles to not be a raw count, but rather an aggregate (average in this case) of yet another column.
Edit : Also as a follow-up, can circles be differentiated in the original example by size AND color. So instead of just getting smaller, can they get smaller and also change color?
Building stacked bar plot for specific data organisation
I would like to build a stacked bar plot with the x-axis representing the number of genomes (or just organisms) and the y-axis representing the number of gene clusters that occur in exactly that number of genomes. Since I know which organisms these genes came from, I would like each bar to show the contribution of each genome to that bar.
Example of my data:
df = data.frame(
  genomes_envoled = c(1, 2, 2, 3, 3, 1),
  number_of_genes = c(1, 3, 2, 3, 3, 2),
  genome1_genes = c("A", "B", "*", "B", "A,M", "*"),
  genome2_genes = c("*", "C,B", "E", "D", "N", "*"),
  genome3_genes = c("*", "*", "L", "H", "O", "P")
)
rows are gene clusters;
1) the first column show the number of genomes involved in each gene cluster;
2) the second column represents the number of genes in the cluster;
3) columns 3-5 represent concrete names of genes from different genomes;
"*" shows that there are no genes in the cluster for this genome.
It has more or less specific organisation, that's why I am not sure how to put it in the right way, for example in this ggplot function:
ggplot(df, aes(x = factor(Time), y = Value, fill = factor(Type))) + geom_bar(stat="identity", position = "stack")
As a result, I want 3 bars on the x-axis, representing the number of genomes (1, 2 or all 3); the y-axis representing the number of clusters found in 1, 2 or all 3 genomes; and each bar showing the contribution, as a percentage, of each genome to that bar.
How to apply a Vertical texture to a QT Surface3D?
I'm using a Qt Surface3D graph to plot some 2D/3D data. I need to apply a texture with a color scale to my data. The function surface3dSeries->setTexture works only on a horizontal surface.
Is it possible to apply a texture to a vertical surface (in order to obtain something like a terrain slice)?
Drawing from truncated normal distribution delivers wrong standard deviation in R
I draw random numbers from a truncated normal distribution. The truncated normal distribution is supposed to have mean 100 and standard deviation 60 after truncation at 0 from the left. I wrote an algorithm to compute the mean and sd of the normal distribution prior to the truncation (mean_old and sd_old). The function vtruncnorm gives me the (wanted) variance of 60^2. However, when I draw random variables from the distribution, the standard deviation is around 96. I don't understand why the sd of the random draws differs from the computed value of 60.
I tried increasing the amount of draws - still results in sd around 96.
require(truncnorm)

mean_old = -5425.078
sd_old = 745.7254
val = rtruncnorm(10000, a = 0, mean = mean_old, sd = sd_old)

sd(val)
sqrt(vtruncnorm(a = 0, mean = mean_old, sd = sd_old))
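An independent cross-check is possible with the textbook closed-form moments of a normal truncated from the left at a. With alpha = (a - mu) / sigma and lambda = dnorm(alpha) / (1 - pnorm(alpha)), the mean is mu + sigma * lambda and the variance is sigma^2 * (1 + alpha * lambda - lambda^2). The base R sketch below evaluates these with the question's parameters; the variable names are introduced here for illustration.

```r
# Parameters from the question (normal before truncation at a = 0).
mean_old <- -5425.078
sd_old <- 745.7254
a <- 0

# Standardised truncation point.
alpha <- (a - mean_old) / sd_old

# Upper-tail probability computed directly, which avoids the round-off
# that 1 - pnorm(alpha) would suffer this far out in the tail.
lambda <- dnorm(alpha) / pnorm(alpha, lower.tail = FALSE)

# Closed-form mean and sd of the left-truncated normal.
trunc_mean <- mean_old + sd_old * lambda
trunc_sd <- sd_old * sqrt(1 + alpha * lambda - lambda^2)

print(trunc_mean)
print(trunc_sd)
```

Comparing `trunc_sd` with both `sd(val)` and `sqrt(vtruncnorm(...))` should show which of the two numbers agrees with the closed form, assuming the formulas apply to this parameterisation.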
Generate Random Data for Each Player that Follows a Normal Distribution
I have a python dataframe that contains goals scored for players in the NHL from multiple seasons. My dataframe looks like this:
Player  2018-2019  2017-2018  2016-2015
John    25         22         23
James   27         20         24
Joe     18         19         18
What I'd like to do is for each player, I'd like to generate 1000 random numbers that follow a normal distribution based on their career mean and standard deviation, and a 95% confidence interval for those 1000 numbers.
I know I will need to use numpys random.normal function to calculate the random numbers, but I'm not sure about calculating the confidence interval within python.
I'm thinking the pseudo code for this process would be something like:
for rows in df:
    s = np.random.normal(avg, std_dev, 1000)
    df['Confidence Interval'] = 95% confidence interval function (s)
Thank you for any help!
Imposing normal distribution to column bars by factor
I have a dataframe with 3 columns and several rows, with this structure
  Label Year Frequency
1 a     1    86.45
2 b     1    35.32
3 c     1    10.94
4 a     2    13.55
5 b     2    46.30
6 c     2    12.70
up until 20 years. I plot it like this:
ggplot(data = df, aes(x = df$Year, y = df$Frequency, fill = df$Label)) +
  geom_col(position = position_dodge2(width = 0.1, preserve = "single")) +
  scale_fill_manual(name = NULL, labels = c("A", "B", "C"), values = c("red", "cyan", "green")) +
  scale_x_continuous(breaks = seq(0, 20, by = 1), limits = c(0, 20)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 90), breaks = seq(0, 90, by = 10)) +
  theme_bw()
What I want to do is add three normal distributions to the plot, so that each group of data (A, B, C) can be visually compared with the normal distribution most similar to its own distribution, using the same colors (the normal distribution for label A will be red, and so on).
From the data used here as an example, I would expect to see a red distribution that is taller and narrower than the green distribution, which would be shorter and wider. How can I add them to the plot?
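One possible reading of "the normal distribution most similar to its distribution" is a normal curve whose mean and sd are the Frequency-weighted mean and sd of Year within each label. That interpretation is an assumption, as is scaling each curve by the label's total Frequency so it is comparable to the bar heights. A base R sketch using only the six rows shown; `normal_curve` is a hypothetical helper.

```r
# The six rows shown in the question.
df <- data.frame(
  Label = c("a", "b", "c", "a", "b", "c"),
  Year = c(1, 1, 1, 2, 2, 2),
  Frequency = c(86.45, 35.32, 10.94, 13.55, 46.30, 12.70)
)

# Hypothetical helper: normal curve for one label, treating Frequency
# as a weight on Year and scaling the curve by the total Frequency.
normal_curve <- function(d, grid) {
  w <- d$Frequency / sum(d$Frequency)
  mu <- sum(w * d$Year)
  sigma <- sqrt(sum(w * (d$Year - mu)^2))
  data.frame(Label = d$Label[1],
             Year = grid,
             Frequency = sum(d$Frequency) * dnorm(grid, mean = mu, sd = sigma))
}

grid <- seq(0, 20, by = 0.1)
curves <- do.call(rbind, lapply(split(df, df$Label), normal_curve, grid = grid))

head(curves)
```

The resulting `curves` data frame could be drawn on top of the bars with something like `geom_line(data = curves, aes(x = Year, y = Frequency, colour = Label))`, matching the colours manually.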