How to make multiple bar graphs for factors in R
I would love to make a figure like what I have for my numeric features
hist(df[ , purrr::map_lgl(df, is.numeric)])
If I try to do the same thing with factors,
hist(df[ , purrr::map_lgl(df[, interest_factors], is.factor)])
I get an error.
Any suggestions? I just want to quickly view them
Thanks
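One sketch of a quick overview for factor columns: hist() only accepts numeric input, so draw one barplot() of each factor's table() instead. Here iris with an added factor column stands in for your df:

```r
# iris plus a second factor column, as a stand-in for your df (an assumption).
df <- transform(iris, Size = factor(ifelse(Sepal.Length > 5.8, "big", "small")))

# Select the factor columns, then one barplot of level counts per column.
factor_cols <- names(df)[vapply(df, is.factor, logical(1))]
old_par <- par(mfrow = c(1, length(factor_cols)))
for (col in factor_cols) {
  barplot(table(df[[col]]), main = col)  # counts per factor level
}
par(old_par)
```

This mirrors the map_lgl(df, is.numeric) selection you already use, just with is.factor and base vapply().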
See also questions close to this topic

Add sequence of dates to data.table (R)
I have a data table that contains locations of places that have recurring events at different frequencies. The date of the last event is provided, as well as how frequently it occurs.
Example:
dt
#    Location Last_Occurrence Frequency
# 1:     Home       7-19-2018        30
# 2:   School        6-6-2018        60
# 3:     Moon        1-5-1993        90
What I would like to do is add a new column that includes all of the future event dates for each location up through the end of the year 2018.
So, I would like a table that looks something as follows:
dt
#    Location Last_Occurrence Frequency Next_Dates
# 1:     Home       7-19-2018        30  7-19-2018
# 2:     Home       7-19-2018        30  8-18-2018
# 3:     Home       7-19-2018        30  9-17-2018
# 4:     Home       7-19-2018        30 10-17-2018
# 5:     Home       7-19-2018        30 11-16-2018
# 6:     Home       7-19-2018        30 12-16-2018
# 7:   School        6-6-2018        60   6-6-2018
# 8:   School        6-6-2018        60   8-5-2018
# 9:   School        6-6-2018        60  10-4-2018
# etc.
How should I go about doing this? I suspect an lapply call would be useful, since I'm doing this over each location...
I've figured out how to use a "while" loop to generate a vector of future dates:
Last_Sample_Date <- Sys.Date()  # For testing
increase <- 5                   # For testing
NextDate <- Last_Sample_Date + increase
multiplier <- 1
# Create vector of next sampling dates - updated with each iteration of the while loop
NextDates <- c(Last_Sample_Date, NextDate)
while (year(NextDate) == 2018) {
  multiplier <- multiplier + 1
  NextDate <- NextDate + multiplier * increase
  # Add to vector of next sampling dates
  NextDates <- append(NextDates, NextDate)
}
(I realize this actually generates a vector containing the final date in 2019, but I'm OK with that.)
Could I use this while loop somehow, or is there another way I should go about this?
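Rather than a while loop, one sketch (assuming Last_Occurrence can be parsed to Date and Frequency is a number of days) expands each row with seq() inside a by-group data.table call:

```r
library(data.table)

# Toy version of the table; parsing Last_Occurrence to Date is an assumption.
dt <- data.table(Location        = c("Home", "School"),
                 Last_Occurrence = as.Date(c("2018-07-19", "2018-06-06")),
                 Frequency       = c(30, 60))

# seq() steps from the last event to year-end by Frequency days; grouping by all
# three columns makes each input row its own group, so each row is expanded.
out <- dt[, .(Next_Dates = seq(Last_Occurrence, as.Date("2018-12-31"), by = Frequency)),
          by = .(Location, Last_Occurrence, Frequency)]
out[Location == "Home"]  # six rows: 2018-07-19, 2018-08-18, ..., 2018-12-16
```

This gives the long format shown in the desired output, one row per future date.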

Find rows in a data frame where the text in one column can be found in another column, in R
I want to identify rows in a data frame where the text in one column can be found in another column. For example, in the data frame below, I would like to identify the rows in which the model column contains the text in the gear column (in this case, rows 1, 2, 7, 8, 32).
mydf <- cbind.data.frame(model = rownames(mtcars),
                         gear = as.character(mtcars$gear),
                         stringsAsFactors = FALSE)
mydf
                 model gear
1            Mazda RX4    4
2        Mazda RX4 Wag    4
3           Datsun 710    4
4       Hornet 4 Drive    3
5    Hornet Sportabout    3
6              Valiant    3
7           Duster 360    3
8            Merc 240D    4
9             Merc 230    4
10            Merc 280    4
11           Merc 280C    4
12          Merc 450SE    3
13          Merc 450SL    3
14         Merc 450SLC    3
15  Cadillac Fleetwood    3
16 Lincoln Continental    3
17   Chrysler Imperial    3
18            Fiat 128    4
19         Honda Civic    4
20      Toyota Corolla    4
21       Toyota Corona    3
22    Dodge Challenger    3
23         AMC Javelin    3
24          Camaro Z28    3
25    Pontiac Firebird    3
26           Fiat X1-9    4
27       Porsche 914-2    5
28        Lotus Europa    5
29      Ford Pantera L    5
30        Ferrari Dino    5
31       Maserati Bora    5
32          Volvo 142E    4
It seems like I should be able to use something like grep or match in combination with something like apply or map, or even ifelse, but I can't quite figure it out. (I could of course do a for loop but I have several million rows of data and would prefer not to.)
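One sketch along those lines: grepl() is not vectorised over its pattern argument, but mapply() can run it pairwise over (gear, model); fixed = TRUE treats the gear text as a literal string rather than a regex.

```r
# The question's data frame: model names paired with gear counts as text.
mydf <- cbind.data.frame(model = rownames(mtcars),
                         gear  = as.character(mtcars$gear),
                         stringsAsFactors = FALSE)

# Pairwise test: does model[i] contain the literal text gear[i]?
hits <- mapply(grepl, mydf$gear, mydf$model, MoreArgs = list(fixed = TRUE))
which(unname(hits))  # 1 2 7 8 32
```

mapply() avoids the explicit for loop, and stringr::str_detect(mydf$model, fixed(mydf$gear)) would be an equivalent vectorised alternative.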

R: Changing scale and format of axis labels
I'm trying to put together a graph of data points as a function of time elapsed over the date, but the problem is I have too many data points for the date string size as you can see in the graph below.
I'd prefer if I could have the x-axis show just %Y-%m-%d instead of the full date and time, but I can't seem to get scale_x_date, scale_x_datetime, xlim, or xmin and xmax to work. Errors I've gotten:
Error: Invalid input: time_trans works with objects of class POSIXct only
Error: Invalid input: date_trans works with objects of class Date only
Code I have so far (with failures commented out):
library(ggplot2)
library(scales)
mydata <- read.csv("/Users/user/R/restore_graphs/CSV/store.csv.tmp")
restore.df <- data.frame(
  Time = mydata$start,
  Duration = mydata$time,
  Labels = gsub(" [0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}", "", mydata$start)
)
p <- ggplot(restore.df, aes(x = Time, y = Duration)) + geom_point(colour = "red")
# p <- ggplot(restore.df, aes(x = Time, y = Duration)) + geom_point(colour = "red") +
#   scale_x_datetime(date_labels = "%Y-%m-%d %H")
# p + scale_x_date(date_labels = "%y-%m-%d",
#                  limits = c(as.Date("2018-06-14", "%Y-%m-%d"), as.Date("2018-06-20", "%Y-%m-%d")))
# p + xlim(as.Date("2018-06-14", "%Y-%m-%d"), as.Date("2018-06-20", "%Y-%m-%d"))
# dput(restore.df$Time)
print(p)
When I run the line with ggplot changed to:
p <- ggplot(restore.df, aes(x = Time, y = Duration,
                            xmin = as.Date("2018-06-14", "%Y-%m-%d"),
                            xmax = as.Date("2018-06-20", "%Y-%m-%d"))) +
  geom_point(colour = "red")
It changes the graph to have every point shoved to the left of the screen.
Sample data:
uuid,db,table,start,stop,time,size
941439639,test,,"2018-06-14 17:35:07","2018-06-14 17:35:07",62.9666666666667,141329782065
890252165,test,,"2018-06-14 23:35:38","2018-06-14 23:35:38",61.7166666666667,141380294237
943883747,test,,"2018-06-15 05:38:39","2018-06-15 05:38:39",77.7666666666667,141469254934
827384296,test,,"2018-06-15 11:35:11","2018-06-15 11:35:11",63.4166666666667,141276941916
454468935,test,,"2018-06-15 17:35:23","2018-06-15 17:35:23",64.4333333333333,141380122325
705894402,test,,"2018-06-15 23:35:29","2018-06-15 23:35:29",63.9,141715941073
396694772,test,,"2018-06-16 05:39:59","2018-06-16 05:39:59",75.0666666666667,141789270192
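The pair of errors suggests the Time column is still character (read.csv leaves the quoted timestamps as text), so neither date scale has a date or datetime to transform. A sketch, with two rows of the sample data inlined to keep it self-contained: parse start to POSIXct first, then scale_x_datetime() can format the labels.

```r
library(ggplot2)

# Two rows of the sample data, inlined instead of read from the CSV.
restore.df <- data.frame(
  Time = c("2018-06-14 17:35:07", "2018-06-15 05:38:39"),
  Duration = c(62.97, 77.77)
)

# Parse the character timestamps; scale_x_datetime requires POSIXct on the axis.
restore.df$Time <- as.POSIXct(restore.df$Time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

p <- ggplot(restore.df, aes(x = Time, y = Duration)) +
  geom_point(colour = "red") +
  scale_x_datetime(date_labels = "%Y-%m-%d")  # dates only on the labels
```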

How to generate a distribution with a given mean, variance, skew and kurtosis in MATLAB?
There are many questions like this on stackoverflow but they are either talking about Python or R. How can I do this thing in MATLAB?
There is a function normpdf(x, mu, sigma) in MATLAB which generates a distribution with the desired mu and sigma. Equivalently, is there any way I can add skewness and kurtosis to a distribution generated by the normpdf function?
Should samples from np.random.normal sum to zero?
I am working on the motion model of a robot. In every time step, the robot's motion is measured, then I sample the normal distribution with the measurement as the mean and a small sigma value for covariance in order to simulate noise. This noisy motion is then added to the robot's previous state estimate.
But when I keep the robot still, these noisy measurements seem to accumulate and the robot "thinks it's moving."
Shouldn't these random samples not accumulate, but sum to zero?
In other words, would you expect the following to be true:
0 ~ np.sum([np.random.normal(0, 0.1) for _ in range(1000)])
I have tried writing out the above in an explicit loop and seeding the random number generator with a different number before taking every sample, but the sums still deviate far from zero.
Is this simply a limitation of random number generators, or am I misunderstanding the fact(?) that many samples from the normal distribution should sum to zero?
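They shouldn't sum to zero. The sum of n independent N(0, σ) draws is itself normally distributed with standard deviation σ·√n, so for σ = 0.1 and n = 1000 a typical sum lands a few units away from zero; the spread shrinks only relative to n, not in absolute terms. A quick check:

```python
import numpy as np

rng = np.random.default_rng(42)
n, sigma = 1000, 0.1

# Repeat the 1000-sample sum many times and look at its spread.
sums = np.array([rng.normal(0.0, sigma, n).sum() for _ in range(2000)])
print(sums.std())           # close to sigma * np.sqrt(n) ~= 3.16
print(np.abs(sums).mean())  # typical |sum| is a couple of units, not ~0
```

For the motion model this matters: adding fresh zero-mean noise to the state at every step is a random walk whose variance grows linearly with time, which is exactly the "thinks it's moving" drift being observed.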

Quantile equal to mathematical expectation
I am trying to find any distribution satisfying the following:
E(X) = x_5, where
F(x_5) <= 0.05 ; F(x_5 + 0) >= 0.05
Are there any distributions with that property?
I tried to find one among the exponential distribution, the lognormal, and so on, but I didn't succeed.
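One sketch of a construction (my own, not from the question): a lognormal is right-skewed, so its mean sits far above its 5% quantile; but its reflection X = -Y with Y ~ Lognormal(0, σ) is left-skewed. Then E(X) = -e^(σ²/2) and the 0.05-quantile of X is -q_0.95(Y) = -e^(1.645σ), and the two coincide exactly when σ/2 equals the 0.95 standard normal quantile:

```r
sigma <- 2 * qnorm(0.95)          # ~3.29, chosen so the mean hits the 5% quantile

mean_X <- -exp(sigma^2 / 2)       # E(-Y) for Y ~ Lognormal(0, sigma)
q05_X  <- -qlnorm(0.95, 0, sigma) # 0.05-quantile of X = minus the 0.95-quantile of Y

all.equal(mean_X, q05_X)          # TRUE: E(X) equals the 0.05-quantile
```

Any sufficiently left-skewed distribution can be tuned the same way; the reflected lognormal just makes the algebra closed-form.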
How to stratify groups in the R cmprsk package when doing cumulative incidence analysis
I am working on a project and would like to further stratify the groups in my analysis. For example, currently the graph is differentiated by p16 status. What I would like is to have 4 lines instead, with the groups also differentiated by the treatment they received (1 if they received it, 0 if they didn't).
Here is my code
## CIF and the variance of point estimates
timepoints(resCumIncByDis, times = c(0, 1, 2, 3, 4, 5))
plot(resCumIncByDis, xlab = "Years from Diagnosis",
     ylab = "Probability of In-Field Recurrence")

## Regression (crr can only take a covariate matrix)
dsv$CIStatus1 <- factor(dsv$CICenInField, levels = c(0, 1),
                        labels = c("Censored", "InField"))
dsv$p16Yes1 <- factor(dsv$p16Yes, levels = c(0, 1), labels = c("p16-", "p16+"))
bmtDisMat <- matrix(as.numeric(dsv$p16Yes == 1))
colnames(bmtDisMat) <- "dis"
resCrrRelByDis <- crr(
  ftime   = dsv$DiagCITimeInField / 12, # vector of failure/censoring times
  fstatus = dsv$CIStatus1,  # vector with a unique code for each failure type and censoring
  cov1    = bmtDisMat,      # matrix (nobs x ncovs) of fixed covariates
  ## cov2 = ,     # matrix of covariates that will be multiplied by functions of time
  ## tf = ,       # functions of time
  ## cengroup = , # vector with different values for each group with a distinct censoring distribution
  failcode = "InField",  # code of fstatus that denotes the failure type of interest
  cencode  = "Censored"  # code of fstatus that denotes censored observations
)
summary(resCrrRelByDis, conf.int = 0.95)
Here is the output
As you can see, I tried using the strata option (with a margin of (1 + 0) on status), but nothing happened.
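A sketch of one approach (the column name dsv$Treatment and its 0/1 coding are assumptions, and the data below is fake): cuminc() takes a group argument, so building a single factor as the interaction of p16 status and treatment yields four curves on one plot.

```r
library(cmprsk)

# Fake data standing in for dsv; Treatment as a 0/1 column is an assumption.
set.seed(1)
dsv <- data.frame(
  DiagCITimeInField = rexp(200, 1 / 24),
  CIStatus1 = factor(sample(c("Censored", "InField"), 200, TRUE),
                     levels = c("Censored", "InField")),
  p16Yes1   = factor(sample(c("p16-", "p16+"), 200, TRUE)),
  Treatment = sample(0:1, 200, TRUE)
)

# One grouping factor with 4 levels -> 4 cumulative incidence curves.
dsv$grp <- interaction(dsv$p16Yes1, dsv$Treatment, sep = " / ")
res4 <- cuminc(ftime = dsv$DiagCITimeInField / 12, fstatus = dsv$CIStatus1,
               group = dsv$grp, cencode = "Censored")
plot(res4, xlab = "Years from Diagnosis")
```

(The strata argument of cuminc() adjusts the tests for a stratifying variable; it does not add curves, which is why it appeared to do nothing.)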

Python - inserting randomly-extracted regression coefficients within the same DataFrame
I have this script; it basically regresses multiple columns of a database against a generic Y value. It is designed to exhaust all possible column combinations within my dataframe (or at least it tries to, with discrete success) by extracting the columns urn-style, in a binomial fashion.
The first part of the script is shared with you in the hope that it may be an extra reward for those passing by; if you want to see the problem, its presentation is a little bit below:
import random as r
import numpy as np
import pandas as pd
import statsmodels.api as sm

j = 0
# Here below I'm just saying that j has to stay below n*(n-1)*(n-2)*(n-3)*(n-4),
# which is the number of combinations you can form by extracting 5 different
# coefficients from an n-column database.
while j < ((Dataframe.shape[1]) * (Dataframe.shape[1] - 1) * (Dataframe.shape[1] - 2)
           * (Dataframe.shape[1] - 3) * (Dataframe.shape[1] - 4)):
    # Below the regressors X, randomized from the dataframe
    X = Dataframe.iloc[:, r.sample(range(4, Dataframe.shape[1]), 5)]
    # Below our model, a classic OLS with constant
    X = sm.add_constant(X, prepend=True)
    Model = sm.OLS(Y, X)
    # Below the results for a single regression
    Results = Model.fit()
Within the same while loop, I have created a single database 'DB_Parameters', which contains all the coefficient names, I need, however, to populate it with the coefficient values.
# The Results.params object is a pandas.core.series.Series,
# so I have transformed the values within Results.params into a DataFrame
Parameters = pd.DataFrame(Results.params).reset_index()
Parameters.columns = ['names', 'values']

# Below you will see our database, named DB_Parameters; right now it's nothing
# but a repository for all the 47 coefficient names.
DB_Parameters = pd.DataFrame(np.unique(np.append(DB_Parameters, Parameters.iloc[:, 0])))
DB_Parameters.columns = ['names']

if all((Parameters['names']).isin(Databank.iloc[:, 0])):
    # Here I want to start populating the database with the regression parameters
Now my problem starts! I want to populate an empty column with the estimated coefficients, the problems are two:
1) The values must be appended to the correctly indexed row, and not to empty spots that should be assigned to other parameters. Let's assume the extracted coefficients are [const, B_1, B_2, B_5]; a hypothetical solution should produce a result similar to this one:
print(DB_Parameters)
 names  values
 const    0.04
   B_1    0.05
   B_2    0.03
   B_3     NaN
   B_4     NaN
   B_5    0.96
The problem comes from the fact that DB_Parameters has 47 rows, while each regression has just 5 coefficients per iteration.
Finally, the coefficient names and values are entirely random: sometimes you will get B_1 and B_2, another time B_5 and B_42.
Problem 2: since the whole thing sits inside a while cycle and we are "binomially" extracting the independent variables, the same coefficient can be estimated multiple times.
This is a problem when populating the rows: we cannot afford to have a parameter that is estimated twice simply overwrite the value in the previous position.
Instead, a new column needs to be created for that particular regression. A clarifying example is provided below:
In iteration 1, these could be some of our estimated parameters:
const = 0.04 B_1 = 0.05
and in iteration 2 the result could be like this:
const = 0.06 B_1 = 0.12
hence we could have the same coefficients with different values, and by the nature of the problem, they are both to be stored in the same dataframe.
Therefore, in this case, instead of having a conflict, a new column should be created. A desirable pandas dataframe could look something like this:
name    value   value(2)  ....
const   0.04    0.06
B_1     0.05    0.015
B_2     0.03    "Eventually, another value, if extracted"
B_3     NaN     "Eventually, another value, if extracted"
...     ...     NaN
At the end of the computation there should be close to 184 million models, so manually inserting them is not a viable option by any means.
A possible solution may account for the fact that we have two indexes: one lives in DB_Parameters['names'], while a random subset of it is generated in Parameters['names'].
The subset is randomly refreshed as a "child", while the whole database remains unmodified as the "mother" index.
As the final part of my code, this was my tentative solution, which doesn't work:
if all((Parameters['names']).isin(DB_Parameters.iloc[:, 0])):
    list = Parameters['value'].tolist()
    for x in list:
        DB_Parameters.ix[x, j] = Parameters.ix[x, 1]
    print(DB_Parameters)
This is the error that I get. exclusively from this last part:
Traceback (most recent call last):
  File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 811, in pandas._libs.hashtable.Int64HashTable.get_item
TypeError: an integer is required
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Python\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 141, in pandas._libs.index.IndexEngine.get_loc
KeyError: 'Formazione'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 811, in pandas._libs.hashtable.Int64HashTable.get_item
TypeError: an integer is required

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  line 164, in multiregressor
    Databank.ix[x, j] = DataParams.ix[x, 1]
  File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 121, in __getitem__
    return self._getitem_tuple(key)
  File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 858, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 991, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 1108, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 145, in _get_label
    return self.obj._xs(label, axis=axis)
  File "C:\Python\lib\site-packages\pandas\core\generic.py", line 2344, in xs
    loc = self.index.get_loc(key)
  File "C:\Python\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 141, in pandas._libs.index.IndexEngine.get_loc
KeyError: 'Formazione'
Where 'Formazione' it's just a key within the index.
Thank you. I'm fairly new to Python, and while I have tried several solutions (using dictionaries, merges, appends, concat, series, etc.), I still can't make a forest out of these trees.
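A sketch of one way to get the alignment for free (the names and values below are illustrative, not from the real data): keep DB_Parameters indexed by the master name list and reindex() each iteration's coefficient Series against it before storing it as a fresh column. Missing coefficients become NaN, and repeated names never overwrite an earlier run because every run gets its own column.

```python
import pandas as pd

# Master index of all coefficient names (47 in the real case; 6 here for illustration).
names = ['const', 'B_1', 'B_2', 'B_3', 'B_4', 'B_5']
DB_Parameters = pd.DataFrame(index=pd.Index(names, name='names'))

def add_run(db, params, run):
    """Align one iteration's coefficients to the master index; store as a new column."""
    db[f'value({run})'] = params.reindex(db.index)  # unmatched names -> NaN
    return db

# Two hypothetical iterations, as in the const/B_1 example above.
run1 = pd.Series({'const': 0.04, 'B_1': 0.05, 'B_2': 0.03, 'B_5': 0.96})
run2 = pd.Series({'const': 0.06, 'B_1': 0.12})
DB_Parameters = add_run(DB_Parameters, run1, 1)
DB_Parameters = add_run(DB_Parameters, run2, 2)
print(DB_Parameters)
```

In the real loop, params would be Results.params (already a name-indexed Series). Note also that .ix is deprecated; label-based access should go through .loc, and the KeyError above comes from looking up coefficient *values* as if they were index labels.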