How to programmatically get parameter names and values in scipy
Is there any way to get the parameters of a distribution? I know almost every distribution has "loc" and "scale", but there are differences between them; for example, alpha has "a", while beta has "a" and "b".
What I want to do is programmatically print (after fitting a distribution) the parameter/value pairs.
But I don't want to write a print routine for every possible distribution.
2 answers

inspecting the _pdf method appears to work:

import inspect

# keys
[p for p in inspect.signature(stats.beta._pdf).parameters if not p == 'x']
# ['a', 'b']

# keys and values
dist = stats.alpha(a=1)
inspect.signature(stats.alpha._pdf).bind('x', *dist.args, **dist.kwds).arguments
# OrderedDict([('x', 'x'), ('a', 1)])
# 'x' probably doesn't count as a parameter

In the end, what I did was:

parameter_names = [p for p in inspect.signature(distribution._pdf).parameters if not p == 'x'] + ["loc", "scale"]
parameters = distribution.fit(pd_series)
distribution_parameters_dictionary = dict(zip(parameter_names, parameters))
Where pd_series is a pandas series of the data being fitted.
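For what it's worth, the shape parameter names are also available without touching the private _pdf: scipy distributions expose a public shapes attribute (a comma-separated string, or None when there are only loc and scale). A sketch of the same idea built on that attribute (the helper name and test data are my own):

```python
# Sketch: pair fitted parameter values with their names using the public
# `shapes` attribute instead of inspecting the private `_pdf`.
import numpy as np
from scipy import stats

def fitted_params(distribution, data):
    """Fit `distribution` to `data` and return a {name: value} dict."""
    shapes = (distribution.shapes or "").replace(" ", "")
    names = (shapes.split(",") if shapes else []) + ["loc", "scale"]
    return dict(zip(names, distribution.fit(data)))

rng = np.random.RandomState(0)
data = rng.gamma(shape=2.0, scale=3.0, size=500)
print(fitted_params(stats.gamma, data))  # keys: 'a', 'loc', 'scale'
print(fitted_params(stats.norm, data))   # keys: 'loc', 'scale'
```

The order of the tuple returned by fit() is shapes first, then loc, then scale, which is why the zip works for any continuous distribution.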
See also questions close to this topic

Tensorflow performance: Numpy matrix or TF matrix?
I have the following code:
with tf.Session() as sess:
    sess.run(init_vars)
    cols = sess.run(tf.shape(descriptors)[1])
    descriptor_matrix = np.zeros((n_batches * batch_size, cols))
    while True:
        batch_descriptor = sess.run(descriptors, feed_dict={dropout_prob: 1})
        descriptor_matrix[i:i + elements_in_batch] = np.array(batch_descriptor)
I am mixing tensors and numpy arrays. Does this have an important impact on performance? Why is that? Should I just use tensors instead?

Return dataframe rows where the values in a column are not of type date
I have a dataframe
df
that looks like:

Name   Date of birth
Bob
Steve  22/07/1963
Jo     pencil
Karen  03/02/1953
Frank  29/09/1994
Is there a way to return rows where Date of birth is not a date? In the above example I would have returned:
Name  Date of birth
Bob
Jo    pencil
where Date of birth is not a date. I can identify where there is a blank value for Date of birth using:
missingDoBError = df.loc[df['Date of birth'].isnull()]
I have tried to find Date of birth values where the value is not in a date format and set them to NaT by using:
if pd.to_datetime(df['Date of birth'], format='%d%b%Y', errors='coerce').notnull().all():
But I can't get this to work.
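Two things likely trip up the attempt above: the format string '%d%b%Y' does not match dates like 22/07/1963, and to_datetime returns a Series, so .all() tests the whole column instead of selecting rows. A sketch of the coerce-then-filter approach (column values are invented to mirror the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "Steve", "Jo", "Karen", "Frank"],
    "Date of birth": [None, "22/07/1963", "pencil", "03/02/1953", "29/09/1994"],
})

# Values that don't parse as dd/mm/yyyy become NaT, so .isnull() flags both
# blanks and non-dates; the boolean mask then selects those rows.
parsed = pd.to_datetime(df["Date of birth"], format="%d/%m/%Y", errors="coerce")
not_a_date = df[parsed.isnull()]
print(not_a_date["Name"].tolist())  # ['Bob', 'Jo']
```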

How to implement tSNE in a model?
I split my data into train/test sets. When I use PCA it is straightforward.
from sklearn.decomposition import PCA
pca = PCA()
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
From here I can use X_train_pca and X_test_pca in the next step, and so on.
But when I use tSNE:
from sklearn.manifold import TSNE
X_train_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_train)
I can't seem to transform the test set so that I can use the tSNE data for the next step, e.g. SVM.
Any help?
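scikit-learn's TSNE implements only fit_transform, with no separate transform for new data, so a fitted embedding cannot project the test set. A common workaround (sketched below on made-up data) is to embed train and test together and split afterwards; note this leaks test data into the embedding, so it is not equivalent to a true train/test transform:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X_train = rng.rand(40, 5)
X_test = rng.rand(10, 5)

# Embed everything at once, then split back into train/test parts.
X_all = np.vstack([X_train, X_test])
emb = TSNE(n_components=2, random_state=0, perplexity=10).fit_transform(X_all)
X_train_tsne, X_test_tsne = emb[:len(X_train)], emb[len(X_train):]
print(X_train_tsne.shape, X_test_tsne.shape)  # (40, 2) (10, 2)
```

Parametric alternatives that do offer a transform method (e.g. UMAP) are often preferred when the downstream step is a classifier such as an SVM.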

raise ValueError("Unknown label type: %s" % repr(ys)) ValueError: Unknown label type: (array
I'm trying to make a machine learning approach, but I'm having some problems. This is my code:
import sys
import scipy
import numpy
import matplotlib
import pandas
import sklearn
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

dataset = pandas.read_csv('Libro111.csv')
array = numpy.asarray(dataset, dtype=numpy.float64)  # all values are float64
X = array[:, 1:49]
Y = array[:, 0]
validation_size = 0.2
seed = 7.0
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
scoring = 'accuracy'

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
And then I get two different errors.
For Logistic Regression:
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
I found someone who had the same problem, but I couldn't sort it out yet.
And (most important):
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 97, in unique_labels
    raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array([ 0.5, 0. , 1. , 1. , 0.5, 0.5, 1. , 0.5, 0. , 0.5, 1. , 0. , 0. , 0. , 1. , 1......
In both cases the error comes when I execute the "cv_results" line... So, I hope you can help me.
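The error itself is unrelated to the cross-validation loop: the classifiers reject Y because it contains non-integer values such as 0.5, which scikit-learn treats as a continuous target. If those values really are class codes, encoding them as integers resolves it (sketched below on a made-up y); if the target is genuinely continuous, a regressor should be used instead of classifiers:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

y = np.array([0.5, 0.0, 1.0, 1.0, 0.5, 0.5, 1.0, 0.5, 0.0, 0.5])
X = np.arange(20, dtype=float).reshape(10, 2)

# LabelEncoder maps the sorted unique values 0.0, 0.5, 1.0 to classes 0, 1, 2
y_enc = LabelEncoder().fit_transform(y)
print(y_enc.tolist())  # [1, 0, 2, 2, 1, 1, 2, 1, 0, 1]
LogisticRegression().fit(X, y_enc)  # no longer raises "Unknown label type"
```

As a side note, random_state expects an integer, so seed = 7 rather than seed = 7.0 is safer in the original script.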

scipy.loadmat() doesn't accept relative path
I tried to give a relative path to scipy.loadmat(), but it looks like it's not accepting relative paths (it only accepts an absolute system path). I did a workaround for this:
cwd = os.getcwd()
model_path = os.path.join(cwd, 'Models.mat')
Is there a better way to do this? I am asking because with this approach the model file must be in the same directory as the Python file (which is not always possible).
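scipy.io.loadmat does resolve relative paths, but relative to the current working directory, which often differs from the script's directory when the program is launched from elsewhere. Anchoring paths to the script file itself avoids both problems; the helper name and the example layout below are my own:

```python
import os

def sibling_path(script_file, *parts):
    """Build a path relative to the directory that contains `script_file`."""
    return os.path.join(os.path.dirname(os.path.abspath(script_file)), *parts)

# Typical usage inside a script (the 'data' subfolder is hypothetical):
#   from scipy.io import loadmat
#   mat = loadmat(sibling_path(__file__, 'data', 'Models.mat'))
print(sibling_path('/home/user/project/run.py', 'data', 'Models.mat'))
```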

Numpy output based on 3 sequential conditions?
I am trying to build a vectorized/parallel stock backtesting program. I implemented a sequential version with loops, but now I'm stuck vectorizing the functionality. I'm looking to use Pandas/NumPy for that; here's a quick outline:
There are 2 given columns: the left is the order quantity (to be added to the position), the right is the stops (if stop is 1, the position gets reset to 0).
M = [[ 0.1, 0],   # left column is order quantity, right is stop
     [ 0.1, 0],
     [ 0.5, 0],
     [ 0.5, 0],
     [ 0.3, 0],
     [-0.3, 0],   # negative order quantity means short or sell
     [ 0.1, 1]]   # right column (stop) is 1, so position is reset to 0
And 2 columns which I want to calculate based on the initial matrix M: the left column is the position (which ranges from -1 to 1 but can't go beyond) based on the order quantity, and the right column is the executed order quantity.
R = [[0.1,  0.1],
     [0.2,  0.1],
     [0.7,  0.5],   # position (left column) is equal to cumsum of order quantity (from last stop trigger)
     [1,    0.3],   # executed quantity is < order quantity as it's the remainder to position's max of 1
     [1,    0],
     [0.7, -0.3],
     [0.1,  0.8]]   # stop triggered, so position is reset to 0, and then 0.1 in order quantity is executed
- Position is basically the cumsum of order quantity, but only until -1 or 1, and only if stops are not triggered
- Executed order quantity is either the order quantity, if position limits are not exceeded, otherwise the remainder
- Stops (when 1) reset the position to 0
The problem is that each condition is based on the other one. Does that mean this task can't be solved in parallel?
I can imagine an approach with quantity cumsum and indices where stops trigger, applied on the cumsum to calculate the executed quantity. I would appreciate any tips for elegant ways to solve this. Maybe which Numpy functions to look into, besides cumsum.
Edit: A very simplified version of the sequential version:
orders = [{'quantity': 0.1, 'stop': 0},
          {'quantity': 0.1, 'stop': 0},
          {'quantity': 0.5, 'stop': 0},
          {'quantity': 0.5, 'stop': 0},
          {'quantity': 0.3, 'stop': 0},
          {'quantity': -0.3, 'stop': 0},
          {'quantity': 0.1, 'stop': 1}]

position = 0
for order in orders:
    position_beginning = position
    if order['stop'] == 1:
        position = 0
    if order['quantity'] + position <= 1 and order['quantity'] + position >= -1:
        position += order['quantity']
    elif position < 0 and order['quantity'] < 0:
        position = -1
    elif position > 0 and order['quantity'] > 0:
        position = 1
    executed_quantity = abs(position - position_beginning) * (1 if position > position_beginning else -1)
    print(position, executed_quantity)
In the actual app, the order quantities are much more complex, e.g. divided into sub quantities. The fact that the backtester has to run over millions of orders with sub quantities, makes things really slow using this loop approach.
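The stop-reset part can be vectorized on its own: a cumulative sum can be "restarted" at each stop row by subtracting the amount accumulated before the most recent stop (sketched below on the example data; variable names are mine). The ±1 cap, however, makes each row depend on the clipped value of the previous one, so that part generally still needs a sequential pass (numba or Cython are the usual escape hatches):

```python
import numpy as np

q    = np.array([0.1, 0.1, 0.5, 0.5, 0.3, -0.3, 0.1])
stop = np.array([0,   0,   0,   0,   0,    0,   1])

cs = np.cumsum(q)
n = len(q)
# Index of the most recent stop row so far (-1 if none yet), via a running max.
last_stop = np.maximum.accumulate(np.where(stop == 1, np.arange(n), -1))
# Amount accumulated *before* each stop row; subtracting it restarts the cumsum.
offset = np.where(last_stop >= 0, (cs - q)[np.maximum(last_stop, 0)], 0.0)
position_unclipped = cs - offset
print(position_unclipped)  # ≈ [0.1, 0.2, 0.7, 1.2, 1.5, 1.2, 0.1]
```

This reproduces the positions before clipping; in the full problem the clipped value feeds into later rows, which is exactly the sequential coupling that plain cumsum cannot express.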

Creating an algorithm for selecting multiple lottery numbers (mathematics and statistics)
In the Spanish bet system there is a concept called "multiple", which means that if the game you want to play has bets of 6 numbers, you can create a special bet of 7 or 8 numbers, or even 9, 10 or 11 numbers. That special bet translates into X normal bets of 6 numbers which combine the given numbers.
The multiple bet of 7 numbers translates into 7 bets of 6 numbers.
The multiple bet of 8 numbers translates into 28 bets of 6 numbers.
The multiple bet of 9 numbers translates into 84 bets of 6 numbers.
The multiple bet of 10 numbers translates into 210 bets of 6 numbers.
The multiple bet of 11 numbers translates into 462 bets of 6 numbers.
Sample of multiple of 7 with the numbers 1,2,3,4,5,6,7:
234567 134567 124567 123567 123467 123457 123456
Sample of multiple of 8 with the numbers 1,2,3,4,5,6,7,8:
123456 123457 123458 123467 123468 123478 123567 123568 123578 123678 124567 124568 124578 124678 125678 134567 134568 134578 134678 135678 145678 234567 234568 234578 234678 235678 245678 345678
My first goal is to write an algorithm in Java to generate multiples. I mean, each bet has a cost of 1 coin, so, given for example 30 numbers and 800 coins, spend the 800 coins on X multiple bets of X numbers. The multiple bets must combine the 30 numbers in a more or less equal quantity of appearances. The total cost of the multiples must be near 800 euros; it can be a little less but never more than 800 euros. The algorithm will offer different proposals, for example, a result near 800 euros with multiples of 7, a result near 800 with multiples of 8, etc., and the user will select which one they prefer. I have no idea how to achieve this; I am not good at mathematics or statistics, so I would appreciate help with this problem.
On this website there is a multiple generator which can generate multiples of 7 and of 8, but its code is not public: http://www.miramiprimi.miraestudio.es/MetodoMultiplePrimitiva.php
Thanks a lot.
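For context, the bet counts quoted above are binomial coefficients C(n, 6): a multiple of n numbers expands into every 6-number combination of those n numbers. That makes both the cost of a multiple and its expansion easy to compute (sketched in Python for brevity, though the question asks for Java; the combination-generating step has direct Java analogues):

```python
from itertools import combinations
from math import comb

# Cost in coins of a multiple of n numbers = C(n, 6)
for n in range(7, 12):
    print(n, comb(n, 6))  # 7 7, 8 28, 9 84, 10 210, 11 462

# The expansion itself: every 6-number bet contained in the multiple.
bets = list(combinations([1, 2, 3, 4, 5, 6, 7], 6))
print(len(bets))  # 7, matching the sample for a multiple of 7
```

Budgeting then reduces to choosing n and a number of multiples so that (number of multiples) × C(n, 6) stays under the coin limit, while rotating which of the 30 numbers enter each multiple to balance appearances.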

Conducting a series of t-tests between two data frames with covariates
I have two dataframes, one with covariates for patient samples, and one with methylation data for the samples. I need to perform t-tests to compare the methylation data by sex.
My dataframes look somewhat like this. Covariates:

        patient  sex  ethnicity
sample1 p1       0    caucasian
sample2 p2       1    caucasian
sample3 p3       1    caucasian
sample4 p4       0    caucasian
sample5 p5       0    caucasian
sample6 p6       1    caucasian
and continues up to sample46
Methylation:
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
probe1  0.1111  0.2222  0.3333  0.4444  0.5555  0.6666  0.7777  0.8888  0.9999  1.111
probe2  0.1111  0.2222  0.3333  0.4444  0.5555  0.6666  0.7777  0.8888  0.9999  1.111
probe3  0.1111  0.2222  0.3333  0.4444  0.5555  0.6666  0.7777  0.8888  0.9999  1.111
probe4  0.1111  0.2222  0.3333  0.4444  0.5555  0.6666  0.7777  0.8888  0.9999  1.111
and so on for 80,000 different probes and 46 different samples. So if I want to do a series of t-tests comparing the methylation data to sex for the first 8 samples, could I just specify:
t.test(t(methylation[,1:8]) ~ covariates$sex)
? Or is there a way that I can tie in the sample names (sample1, sample2...)? (Sorry in advance, I'm very new to both R and statistics) 
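t.test() handles one response vector at a time, so the matrix call above won't do what's intended; the usual shape of the computation is one test per probe. The idea is sketched below in Python/scipy with synthetic numbers (in R the equivalent is an apply over rows, running t.test(row ~ sex) per probe):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
methylation = rng.rand(4, 8)              # 4 probes x 8 samples
sex = np.array([0, 1, 1, 0, 0, 1, 0, 1])  # one label per sample, same column order

# One independent-samples t-test per probe, splitting the columns by sex.
pvals = [stats.ttest_ind(row[sex == 0], row[sex == 1]).pvalue
         for row in methylation]
print(len(pvals))  # one p-value per probe
```

With 80,000 probes, a multiple-testing correction (e.g. Benjamini-Hochberg) on the resulting p-values matters. Tying in sample names amounts to keeping the covariate rows in the same order as the methylation columns.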
Best way to analyze correlation between 3 different categorical variables
I'm trying to run some analysis and running into a roadblock (more like a mental block)...
Goal
I have 3 different factor variables:
- Cohort: Analyst, Associate, Manager, Sr. Manager, Director, ED, VP
- Gender: Male, Female
- Timeframe: MidYear, YearEnd, Beyond
I want to check to see if there is any difference in Gender across Cohort and Timeframe. I.e., are female analysts more likely to fall into Timeframe = "Beyond" than their male counterparts?
Code
My initial thought is to do something like this:
library(dplyr)
x <- df %>%
  filter(Gender %in% c("Male", "Female")) %>%
  filter(!is.na(Timeframe)) %>%
  group_by(Timeframe, Cohort, Gender) %>%
  summarise(n = n()) %>%
  mutate(freq = 100 * (n / sum(n)))
But this is giving me percentages that don't quite make sense. Ideally I'd like to conclude: "In the Analyst cohort, there is or is not a big difference in the timeframe YearEnd or MidYear or Beyond for gender."
Data
dput(head(df1,30)) structure(list(V1 = c("Female", "Male", "Male", "Male", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Female", "Female", "Male", "Female", "Female", "Male", "Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female"), V2 = c("Executive Director", "Executive", "Vice President", "Manager", "Director", "Executive Director", "Manager", "Senior Manager", "Senior Manager", "Vice President", "Director", "Senior Manager", "Manager", "Senior Manager", "Senior Manager", "Senior Manager", "Executive Director", "Senior Manager", "Manager", "Director", "Senior Manager", "Associate", "Vice President", "Senior Manager", "Executive Director", "Manager", "Executive Director", "Director", "Associate", "Senior Manager"), V3 = c("Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "MidYear Promotion", "Beyond", "Year End Promotion", "Beyond", "Year End Promotion", "Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "Year End Promotion", "Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "Year End Promotion", "Beyond", "Beyond", "Beyond", "Year End Promotion", "Beyond", "Beyond", "Beyond", "Beyond")), row.names = c("1", "2", "4", "5", "6", "7", "8", "10", "11", "12", "13", "14", "15", "16", "17", "19", "21", "22", "23", "24", "25", "27", "28", "29", "30", "31", "32", "33", "34", "35"), class = "data.frame")
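For association between categorical variables, the standard tool is a chi-squared test on the contingency table, e.g. Gender vs Timeframe within one cohort (in R: chisq.test(table(Gender, Timeframe))). A sketch with invented counts, using scipy for concreteness:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts for a single cohort:
# rows = Gender (Male, Female); cols = Timeframe (MidYear, YearEnd, Beyond)
table = np.array([[10, 12, 30],
                  [8, 20, 15]])
chi2, p, dof, expected = chi2_contingency(table)
print(dof)  # (2-1) * (3-1) = 2
print(p)    # a small p suggests Gender and Timeframe are not independent
```

The freq column from the dplyr pipeline gives the descriptive side of the same question; the test gives the inferential side.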

How to code for a probability map in opencv?
Hope I am posting in the correct forum.
I just want to sound out my ideas and approach to solving the problem. I would welcome any pointers and help (code would definitely be ideal :) ).
Problem: I want to compute a probability distribution (in a 400 x 400 map) in order to find the spatial location (x, y) of a line (let us call it fL), based upon the probability in the probability map.
I have obtained a nearly horizontal line cue (call it lC) from prior processing, to be used to calculate the probability to determine fL. fL is estimated to lie at distance D away from this horizontal line cue. My task is to calculate this probability map.
Approach:
1) I would take the probability map distribution to be Gaussian:

P(fL | point) = exp(-(x - D)^2 / sigma^2)
which gives the probability of the line fL given that the point in line cue lC is at distance D away, depending on sigma (which defines how fast the probability decreases).
2) I would use a LineIterator to find every single pixel that lies on the line cue lC (given that I know the start and end points of the line). Let's say I have gotten n pixels in this line.
3) For every single pixel in the 400 x 400 image, I would calculate the probability using 1) as described above for all n points that I have gotten for the line. I would sum up each line point's contribution.
4) After finishing all the pixel calculations in the 400 x 400 image, I would then normalize the probability by the largest pixel probability value. This part I am unsure about: should I normalize by the sum of all pixel probabilities, or by the maximum as in this step?
5) After this I would multiply this probability map with other probability maps. So I would get

P(fL | Cue_from_this_line, Cue_from_some_other, ...) = P(fL | Cue_from_this_line) * P(fL | Cue_from_some_other) * ...

And I would set pixels with near-0 probability to 0.001.
6) That outlines my approach
Question
1) Is this workable? Or is there any better method for doing this, i.e. getting the probability map?
2) How do I normalize the map: by normalizing with the sum of all pixel probabilities, or by normalizing with the max value?
Thanks in advance for reading this long post.
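Steps 2-4 can be vectorized with NumPy instead of looping over all 400 x 400 pixels in Python; a sketch follows (line endpoints, D and sigma are invented). It also shows the difference between the two normalizations: dividing by the max gives a map whose peak is 1, while dividing by the sum makes the map a proper distribution summing to 1, which is what is needed if the maps are later multiplied as probabilities:

```python
import numpy as np

H = W = 400
sigma, D = 10.0, 50.0
ys, xs = np.mgrid[0:H, 0:W]   # coordinate grids for every pixel

# Stand-in for the LineIterator output: points on a nearly horizontal cue line
line_pts = [(100.0, float(x)) for x in range(100, 300)]

prob = np.zeros((H, W))
for py, px in line_pts:
    dist = np.hypot(ys - py, xs - px)             # pixel distance to this cue point
    prob += np.exp(-((dist - D) ** 2) / sigma ** 2)

peak_normalized = prob / prob.max()   # max value is 1 (good for visualization)
distribution = prob / prob.sum()      # sums to 1 (a true probability map)
print(distribution.sum())             # 1.0 up to floating point
```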

How to generate random numbers with uniform distribution in Java?
So, I'm having trouble generating random numbers with a uniform distribution in Java, given the maximum and the minimum values of some attributes in a data set (Iris from UCI, for machine learning). What I have is the Iris dataset in a 2D array called samples. I put the random values, according to the maximum and the minimum value of each attribute in the Iris data set (without the class attribute), in a 2D array called gworms (which has some extra fields for some other values of the algorithm).
So far, the full algorithm is not working properly, and my thought is that maybe the gworms (the points in 4D space) are not being generated correctly or with good randomness. I think that the points are too close to each other (I think this because of some results obtained later, whose code is not shown here). So, I'm asking for your help to validate this code, in which I implement "uniform distribution" for gworms (for the first 4 positions):
/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package glowworms;

import java.lang.Math;
import java.util.ArrayList;
import java.util.Random;
import weka.core.AttributeStats;
import weka.core.Instances;

/**
 * @author oscareduardo937
 */
public class GSO {

    /* ************ Initializing parameters of CGSO algorithm ******************** */
    int swarmSize = 1000;     // Swarm size m
    int maxIte = 200;
    double stepSize = 0.03;   // Step size for the movements
    double luciferin = 5.0;   // Initial luciferin level
    double rho = 0.4;         // Luciferin decay parameter
    double gamma = 0.6;       // Luciferin reinforcement parameter
    double rs = 0.38;         // Initial radial sensor range. This parameter depends on the data set
                              // and needs to be found by running experiments
    double gworms[][] = null; // Glowworms of the swarm.

    /* ************ Initializing parameters of clustering problem and data set ******************** */
    int numAtt;               // Dimension of the position vector
    int numClasses;           // Number of classes
    int total_data;           // Number of instances
    int threshold = 5;
    int runtime = 1;          /* Algorithm can be run many times in order to see its robustness */

    double minValuesAtts[] = new double[this.numAtt];              // Minimum values for all attributes
    double maxValuesAtts[] = new double[this.numAtt];              // Maximum values for all attributes
    double samples[][] = new double[this.total_data][this.numAtt]; // Samples of the selected dataset.
    ArrayList<Integer> candidateList;
    double r;                 /* a random number in the range [0,1) */

    /* *********** Method to put the instances in a matrix and get max and min values for attributes ******************* */
    public void instancesToSamples(Instances data) {
        this.numAtt = data.numAttributes();
        System.out.println("********* NumAttributes: " + this.numAtt);
        AttributeStats attStats = new AttributeStats();
        if (data.classIndex() == -1) {
            //System.out.println("reset index...");
            data.setClassIndex(data.numAttributes() - 1);
        }
        this.numClasses = data.numClasses();
        this.minValuesAtts = new double[this.numAtt];
        this.maxValuesAtts = new double[this.numAtt];
        System.out.println("********* NumClasses: " + this.numClasses);
        this.total_data = data.numInstances();
        samples = new double[this.total_data][this.numAtt];
        double[] values = new double[this.total_data];
        for (int j = 0; j < this.numAtt; j++) {
            values = data.attributeToDoubleArray(j);
            for (int i = 0; i < this.total_data; i++) {
                samples[i][j] = values[i];
            }
        }
        for (int j = 0; j < this.numAtt - 1; j++) {
            attStats = data.attributeStats(j);
            this.maxValuesAtts[j] = attStats.numericStats.max;
            this.minValuesAtts[j] = attStats.numericStats.min;
            //System.out.println("** Min Value Attribute " + j + ": " + this.minValuesAtts[j]);
            //System.out.println("** Max Value Attribute " + j + ": " + this.maxValuesAtts[j]);
        }
        // Checking
        /*
        for (int i = 0; i < this.total_data; i++) {
            for (int j = 0; j < this.numAtt; j++) {
                System.out.print(samples[i][j] + "** ");
            }
            System.out.println();
        }
        */
    } // End of method instancesToSamples

    public void initializeSwarm(Instances data) {
        this.gworms = new double[this.swarmSize][this.numAtt + 2]; // D-dimensional vector plus luciferin, fitness and intra-distance.
        double intraDistance = 0;
        Random r = new Random();
        for (int i = 0; i < this.swarmSize; i++) {
            for (int j = 0; j < this.numAtt - 1; j++) {
                // Uniform randomization of d-dimensional position vector
                this.gworms[i][j] = this.minValuesAtts[j] + (this.maxValuesAtts[j] - this.minValuesAtts[j]) * r.nextDouble();
            }
            this.gworms[i][this.numAtt - 1] = this.luciferin; // Initial luciferin level for all swarm
            this.gworms[i][this.numAtt] = 0;                  // Initial fitness for all swarm
            this.gworms[i][this.numAtt + 1] = intraDistance;  // Intra-distance for gworm i
        }
        // Checking gworms
        /*
        for (int i = 0; i < this.swarmSize; i++) {
            for (int j = 0; j < this.numAtt + 2; j++) {
                System.out.print(gworms[i][j] + "** ");
            }
            System.out.println();
        }
        */
    } // End of method initializeSwarm
}
The main class is this one:
package uniformrandomization;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileNotFoundException;
import weka.core.Instances;
import glowworms.GSO;

/**
 * @author oscareduardo937
 */
public class UniformRandomization {

    public UniformRandomization() {
        super();
    }

    // Loading the data from the filename file to the program. It can be .arff or .csv
    public static BufferedReader readDataFile(String filename) {
        BufferedReader inputReader = null;
        try {
            inputReader = new BufferedReader(new FileReader(filename));
        } catch (FileNotFoundException ex) {
            System.err.println("File not found: " + filename);
        }
        return inputReader;
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception {
        BufferedReader datafile1 = readDataFile("src/data/iris.arff");
        Instances data = new Instances(datafile1);
        GSO gso = new GSO();
        gso.instancesToSamples(data);
        gso.initializeSwarm(data);
        System.out.println("Fin...");
    }
}
So I want to know whether, with this code, the number at position (i, j) of gworms is generated within the range of the max and min values of attribute j.
Thanks so much in advance.

Using the mean to calculate the most likely value
At a car hire service 50% of cars are returned on time. A sample of 20 car hires is studied. In order to calculate the probability that all 20 cars are returned on time, I use the binomial distribution:
dbinom(x=20, size=20, prob=0.5)
How can I calculate the mean to determine the most likely number of returned cars? To calculate the mean I use:
mean(dbinom(x=20, size=20, prob=0.5))
which returns :
[1] 9.536743e-07
How can I then use the mean to calculate the most likely number of returned cars?
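For a binomial the mean is n * p, and the most likely (modal) count sits at or next to that mean; with n = 20 and p = 0.5 both are 10. The mean(dbinom(x=20, ...)) call in the question averages a single density value, so it just returns P(all 20 returned on time) = 0.5^20 ≈ 9.54e-07. Sketched with scipy for concreteness:

```python
import numpy as np
from scipy.stats import binom

n, p = 20, 0.5
print(n * p)  # 10 -> expected number of cars returned on time

k = np.arange(n + 1)
mode = k[np.argmax(binom.pmf(k, n, p))]
print(mode)   # 10 -> the most likely count

print(binom.pmf(20, n, p))  # ~9.5367e-07, the value dbinom(20, 20, 0.5) returns
```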

Converting a binary array into a probability distribution
I have a 2D binary array indicating the presence of half-channels at a particular coordinate (0 = not present, 1 = present). I need to convert this array into a probability distribution to plot on a map of the globe using matplotlib.
I tried dividing each element of the array by the amount of time over which the values were calculated. For example, if the data was taken over a period of one month, I divided by 30. I also tried taking the exponent of each value, like so:
return np.exp(-x ** 2)
but nothing looks right. Any suggestions? Thanks.
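If the goal is a distribution over map coordinates, the usual move is to normalize the array by its own total, so the values are non-negative and sum to 1; dividing by a time window or exponentiating guarantees neither property. A minimal sketch on a made-up presence grid:

```python
import numpy as np

presence = np.array([[0, 1, 0],
                     [1, 1, 0],
                     [0, 0, 1]], dtype=float)

prob = presence / presence.sum()   # each cell: fraction of all observations
print(prob.sum())  # 1.0
```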

Weka Logistic Regression Understanding Predictions
I did a logistic regression using Weka and I selected output as plain text, as I am trying to get the probability results. There is something weird that I cannot understand, which is the following.
For instances 209 (0.994) and 216 (0.811), the algorithm predicted '1', and yet for, let's say, instance 176 (0.845) it predicted '0'.
Isn't the algorithm supposed to classify as '1' or '0' for prediction > 0.5 or < 0.5? I cannot understand the linkage between the prediction number and the predicted class.
inst#  actual  predicted  error  prediction
161    1:0     2:1        +      0.547
162    1:0     1:0               0.782
163    2:1     2:1               0.809
164    1:0     1:0               0.98
165    1:0     1:0               0.839
166    1:0     1:0               0.962
167    2:1     1:0        +      0.787
168    1:0     1:0               0.921
169    1:0     1:0               0.94
170    1:0     1:0               0.959
171    2:1     2:1               0.645
172    2:1     1:0        +      0.737
173    2:1     1:0        +      0.781
174    1:0     1:0               0.911
175    1:0     1:0               0.901
176    1:0     1:0               0.845
177    2:1     1:0        +      0.852
178    1:0     1:0               0.807
179    1:0     1:0               0.886
209    2:1     2:1               0.994
210    1:0     1:0               0.965
211    1:0     1:0               0.719
212    1:0     1:0               0.96
213    1:0     1:0               0.929
214    1:0     1:0               0.956
215    1:0     1:0               0.943
216    2:1     2:1               0.811
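The likely explanation: in this output, the prediction column is the probability of the class the model predicted, not the probability of class '1'. Instance 176 therefore shows 0.845 because the model is 84.5% confident in class '0'. A small sketch of that convention (probabilities invented):

```python
probs_class1 = [0.547, 0.155]   # hypothetical P(class = 1) for two instances

rows = []
for p1 in probs_class1:
    predicted = 1 if p1 > 0.5 else 0
    confidence = p1 if predicted == 1 else 1 - p1  # what the "prediction" column shows
    rows.append((predicted, confidence))
print(rows)  # [(1, 0.547), (0, ~0.845)]
```

Under this reading every row in the table is consistent: the reported number is always above 0.5 because it is always the winning class's probability.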

Why doesn't "dnorm" sum to one as a probability?
This may be a basic/fundamental question about the 'dnorm' function in R. Let's say I create some z-scores through a z-transformation and try to get the sum out of 'dnorm'.
data = c(232323, 4444, 22, 2220929, 22323, 13)
z = (data - mean(data)) / sd(data)
result = dnorm(z, 0, 1)
sum(result)
[1] 1.879131
As shown above, the sum of 'dnorm' is neither 1 nor 0.
Then let's say I use a zero mean and a standard deviation of one even in my z-transformation.

data = c(232323, 4444, 22, 2220929, 22323, 13)
z = (data - 0) / 1
result = dnorm(z, 0, 1)
sum(result)
[1] 7.998828e-38
I still do not get either 0 or 1 as the sum.
If my purpose is to get the sum of the probabilities equal to one, as I will need for further usage, what method do you recommend, using 'dnorm' or even other PDF functions?
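The underlying point is that dnorm is a probability density, not a probability: density values evaluated at a handful of points need not sum to anything in particular; what equals 1 is the integral of the density over the whole real line. If weights summing to one are needed at specific data points, normalize the density values by their own sum. Illustrated here with scipy's equivalent of dnorm (norm.pdf):

```python
import numpy as np
from scipy.stats import norm

# The density integrates to 1 (approximated below with a Riemann sum) ...
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
print((norm.pdf(x) * dx).sum())   # ~1.0

# ... but its values at arbitrary points sum to whatever they sum to.
z = np.array([-1.2, 0.0, 0.7, 2.5])
w = norm.pdf(z)                   # like dnorm(z, 0, 1)
print(w.sum())                    # not 1
print((w / w.sum()).sum())        # 1.0 (up to floating point) after normalizing
```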