How to programmatically get parameter names and values in scipy
Is there any way to get the parameters of a distribution? I know almost every distribution has "loc" and "scale", but there are differences between them; for example, alpha has "a", while beta has "a" and "b".
What I want to do is programmatically print (after fitting a distribution) key-value pairs of parameter and value.
But I don't want to write a print routine for every possible distribution.
2 answers

inspect-ing the _pdf method appears to work:

import inspect

# keys
[p for p in inspect.signature(stats.beta._pdf).parameters if not p == 'x']
# ['a', 'b']

# keys and values
dist = stats.alpha(a=1)
inspect.signature(stats.alpha._pdf).bind('x', *dist.args, **dist.kwds).arguments
# OrderedDict([('x', 'x'), ('a', 1)])  # 'x' probably doesn't count as a parameter

In the end, what I did was:

parameter_names = [p for p in inspect.signature(distribution._pdf).parameters if not p == 'x'] + ["loc", "scale"]
parameters = distribution.fit(pd_series)
distribution_parameters_dictionary = dict(zip(parameter_names, parameters))
Where pd_series is a pandas series of the data being fitted.
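The two pieces above can be wrapped into one small helper. A minimal sketch, assuming scipy.stats and synthetic beta-distributed data (`fitted_params` is a made-up name):

```python
import inspect

import numpy as np
from scipy import stats

def fitted_params(distribution, data):
    """Map each parameter name of a fitted scipy distribution to its value."""
    names = [p for p in inspect.signature(distribution._pdf).parameters
             if p != 'x'] + ["loc", "scale"]
    return dict(zip(names, distribution.fit(data)))

rng = np.random.default_rng(0)
params = fitted_params(stats.beta, rng.beta(2.0, 5.0, size=1000))
# params has the keys 'a', 'b', 'loc', 'scale'
```

scipy also exposes the shape parameter names directly as `distribution.shapes` (a comma-separated string, e.g. `'a, b'` for beta, or `None` for distributions with no shape parameters), which avoids touching the private `_pdf`.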
See also questions close to this topic

I have an issue with an automatic basket grab in Python
I got a script from GitHub, but it doesn't seem to work anymore. Does anyone know a possible solution for it?
It's on this website: https://github.com/MartijnDevNull/ticketnak

Symmetric Difference of Two CSV Files in Python
I want to write a Python script to compare two specified columns of two separate CSV files to output a third CSV file with the complete rows of the unique values in CSV 1 and 2. So, for example, if both CSVs have a column ID, I want to see, which rows of CSV 1 and CSV 2 have unique ID values and output those as a third CSV file.
I was thinking of using a set of the two CSV files, but how do I specify the shared column?
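Assuming pandas is an option, one way to sketch this, with in-memory stand-ins for the two files and `ID` as the assumed shared column:

```python
import io

import pandas as pd

# stand-ins for the two CSV files; "ID" is the assumed shared column
csv1 = io.StringIO("ID,val\n1,a\n2,b\n3,c\n")
csv2 = io.StringIO("ID,val\n2,x\n3,y\n4,z\n")
df1, df2 = pd.read_csv(csv1), pd.read_csv(csv2)

# keep the complete rows whose ID occurs in only one of the two tables
unique = pd.concat([df1[~df1["ID"].isin(df2["ID"])],
                    df2[~df2["ID"].isin(df1["ID"])]])
unique.to_csv("unique_rows.csv", index=False)  # the third CSV
```

With the sample data above, `unique` holds the full rows with IDs 1 and 4.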

scipy ode integrate to unknown limit of t
I am modeling charged particles moving through an electromagnetic field and am using scipy ode. The code here is simplified, obviously, but works as an example. The problem I have is that I want to end the integration after a limit on r, not on t. So, integrate dx/dt up to the point where norm(x) > r.
I don't want to just change the function to integrate over r, however, because the position is a function of t. Can I do a definite integral over an unrelated variable or something?
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint

def RHS(state, t, Efield, q, mp):
    ds0 = state[3]
    ds1 = state[4]
    ds2 = state[5]
    ds3 = q/mp * Efield*state[0]
    ds4 = q/mp * Efield*state[1]
    ds5 = q/mp * Efield*state[2]
    r = np.linalg.norm((state[0], state[1], state[2]))
    # if r > 30000 then stop integration.....?
    # return the two state derivatives
    return [ds0, ds1, ds2, ds3, ds4, ds5]

ts = np.arange(0.0, 10.0, 0.1)
state0 = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0]
Efield = 1.0
q = 1.0
mp = 1.0
stateFinal = odeint(RHS, state0, ts, args=(Efield, q, mp))
print(np.linalg.norm(stateFinal[1, 0:2]))
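One way to stop on a radius rather than a time is `scipy.integrate.solve_ivp`, which supports terminal events (the `args=` keyword needs scipy >= 1.4). A sketch with the same toy field, assuming a hypothetical limit `R_LIMIT`:

```python
import numpy as np
from scipy.integrate import solve_ivp

R_LIMIT = 30000.0  # hypothetical stopping radius

def rhs(t, state, Efield, q, mp):
    # same toy field as above: acceleration proportional to position
    x, y, z, vx, vy, vz = state
    a = q / mp * Efield
    return [vx, vy, vz, a * x, a * y, a * z]

def hit_radius(t, state, Efield, q, mp):
    # zero exactly when the particle reaches the radius limit
    return np.linalg.norm(state[:3]) - R_LIMIT

hit_radius.terminal = True   # stop the integration at this event
hit_radius.direction = 1     # only trigger while r is increasing

sol = solve_ivp(rhs, (0.0, 1000.0), [1.0, 2.0, 3.0, 0.0, 0.0, 0.0],
                args=(1.0, 1.0, 1.0), events=hit_radius)
# sol.status == 1 means a terminal event ended the run;
# sol.t_events[0] holds the crossing time, sol.y[:, -1] the state there
```

The upper time bound just needs to be large enough that the event fires first; the integration ends at the radius crossing, not at the bound.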

numpy.ma.masked_where function is giving a smear outside of the mask
I am currently trying to make a template-matching algorithm between 2 images using cross-correlation (the scipy correlate2d function). I am then planning on iterating this as I get too many hits. My plan for iterating was to mask out the region above a certain cross-correlation score.
I found the numpy.ma.masked_where function, which seems to be perfect for my needs, but when I save the images I get a 'smear' of values outside of the obviously masked-out areas. It also does not result in a change in correlation, but that may be a different issue.
Has anyone else had an issue with this 'smearing' of values before?

Passing extra arguments to broyden1
I'm trying to execute scipy's broyden1 function with extra parameters (called "data" in the example); here is the code:
data = [radar_wavelen, satpos, satvel, ellipsoid_semimajor_axis, ellipsoid_semiminor_axis, srange]
target_xyz = broyden1(Pixlinexyx_2Bsolved, start_xyz, args=data)

def Pixlinexyx_2Bsolved(target, *data):
    radar_wavelen, satpos, satvel, ellipsoid_semimajor_axis, ellipsoid_semiminor_axis, srange = data
    print target
    print radar_wavelen, satpos, satvel, ellipsoid_semimajor_axis, ellipsoid_semiminor_axis, srange
Pixlinexyx_2Bsolved is the function whose root I want to find.
start_xyz is initial guess of the solution:
start_xyz = [4543557.208584103, 1097477.4119051248, 4176990.636060918]
And data is this list containing a lot of numbers, that will be used inside the Pixlinexyx_2Bsolved function:
data = [0.056666, [5147114.2523595653, 1584731.770061729, 4715875.3525346108], [5162.8213179936156, 365.24378919717839, 5497.6237250296626], 6378144.0430000005, 6356758.789000001, 850681.12442702544]
When I call broyden1 (as in the second line of the example code), I get the following error:
target_xyz = broyden1(Pixlinexyx_2Bsolved, start_xyz, args=data)
  File "<string>", line 5, in broyden1
TypeError: __init__() got an unexpected keyword argument 'args'
What am I doing wrong?
Now, looking at the documentation of fsolve, it seems to be able to take extra args for the callable func... Here is a similar question to mine.
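broyden1, unlike fsolve, does not accept an `args` keyword, which is exactly what the TypeError says. A common workaround is to close over the extra data with `functools.partial` or a lambda; `residual` and its parameters below are a made-up stand-in for Pixlinexyx_2Bsolved:

```python
from functools import partial

import numpy as np
from scipy.optimize import broyden1

def residual(x, a, b):
    # hypothetical residual whose root is (sqrt(a), sqrt(b))
    return [x[0] ** 2 - a, x[1] ** 2 - b]

# bind the extra arguments, leaving only x for the solver to vary
root = broyden1(partial(residual, a=4.0, b=9.0), [1.0, 1.0], f_tol=1e-10)
# an equivalent one-liner: broyden1(lambda x: residual(x, 4.0, 9.0), [1.0, 1.0])
```

Either form gives the solver a callable of x alone, so no `args` keyword is needed.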

"Filtering" input data for analysis in Python
I have a large set of data on which I have to perform a lot of search operations. In order to reduce the number of data points, the data is "compressed" by merging every continuous positive-slope or negative-slope run into a single point representing a local maximum or minimum, and also recording the number of original data points that the new point represents.
For example, if my data points are 2, 3, 4, 6, 4, 1, the compressed data is recorded as:
 2,1 - represents the initial value of 2 with a local min after 1 cycle
 6,4 - represents the local max of 6, which occurred after 3 more cycles
 1,2 - represents the local min of 1, which occurred after 2 more cycles
Is this some sort of decimation or downsampling filter?
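This is essentially turning-point (extrema-based) compression rather than a classic decimation filter, and it can be sketched as a single scan for direction changes. A sketch assuming each output pair is (extremum value, cycles since the previous recorded point), which matches the prose of the example ("after 3 more cycles", "after 2 more cycles"):

```python
def compress(data):
    """Keep the first point plus every turning point, with run lengths."""
    out = [(data[0], 1)]          # the initial value
    run = 0
    direction = 0                 # +1 rising, -1 falling, 0 unknown
    for prev, cur in zip(data, data[1:]):
        step = (cur > prev) - (cur < prev)
        if direction and step and step != direction:
            out.append((prev, run))   # slope reversed: prev was an extremum
            run = 0
        if step:
            direction = step
        run += 1
    out.append((data[-1], run))       # always record the final point
    return out

compress([2, 3, 4, 6, 4, 1])
# -> [(2, 1), (6, 3), (1, 2)]
```

Note the counts here follow the question's prose (3 cycles up to the max, 2 down to the min); the "6,4" in the example appears to use a slightly different counting convention.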
Thanks in advance.

ARTool package in R  multiple within factors
I have recently discovered the ARTool package for R (https://cran.r-project.org/web/packages/ARTool/) when looking for a non-parametric alternative to a repeated-measures ANOVA.
I have used ARTool and find it really very useful, but I came across a problem, that I am not sure how to deal with. Specifically, the Df.res seem to be strongly inflated as soon as I have more than one within factor. I have not come across this when I tried it with two between factors, a between and a within factor, or two between and one within factor, but whenever I add a second within factor, Df.res seems to become inflated.
I just wondered whether I am misunderstanding something or maybe there is an explanation that I am not aware of.
Any response would be greatly appreciated.
Many thanks!

adehabitat compana() doesn't work or returns lambda=NaN
I'm trying to do the compositional analysis of habitat use with the compana() function in the adehabitatHS package (I use adehabitat because I can't install adehabitatHS). compana() needs two matrices: one of habitat use and one of available habitat.
When I try to run the function it doesn't work (it never stops), so I have to abort the RStudio session.
I read that one problem could be the 0 values in some habitat types for some animals in the 'available' matrix, whereas other animals have positive values for the same habitat. As done by other people, I replaced the 0 values with small values (0.001), ran compana, and it worked, BUT the lambda values it returned were NaN.
The problem is similar to the one found here (adehabitatHS compana test returns lambda = NaN?), where they said they resolved it by using the counts (integers), not the proportions, as the 'used' habitat matrix. I tried this approach too, but nothing changed (it freezes when there are 0 values in the available matrix, or returns a NaN value for lambda if I replace the 0 values with small values). I checked all the matrices and they are fine, so I'm going crazy.
I have 6 animals and 21 habitat types. Can anyone help me resolve this big problem?

Impossible: 4 columns permutation with limit of 11 in VBA Excel
What I am looking for is a seemingly impossible permutation/combination. I have been on this for some time and can't seem to wrap an Excel VBA procedure around it.
I have 5 columns.
Each has a list of names (maximum limit of names per column is 50):
 a list of fruits
 a list of vegetables
 a list of nuts
 a list of herbs
 a list of spices
Each column will have a minimum and a maximum range and the total output will have a limit.
So …
 column 1: have min 1 max 3,
 column 2: min 2 max 4,
 column 3: min 3 max 4,
 column 4: min 1 max 2,
 column 5 min 0 max 0,
 Total limit is 11
So the code has to pick, for example, any three names from the first column and two names from the second, and so on until the fifth.
If the last column is empty it can skip that and work with only 4 columns.
Then, from all five columns, within the minimum and maximum defined limits, the count of names should total 11. That means, for example, the macro can choose 3 Fruits, 4 Vegetables, 2 Nuts, 2 Herbs, 0 Spices to reach a total of 11.
Examples:
Apple, Apricot, Avocado, Fennel, Potato, Cauliflower, Garlic, Acorn, Almond, Anise, Basil.
Avocado, Banana, Bilberry, Fennel, Potato, Cauliflower, Garlic, Acorn, Almond, Anise, Basil.
The combinations should go on until all the combinations are generated. The output can be in 11 columns, with each combination in its own row. If the macro can report the number of combinations before starting, then we can edit or limit the lists to accommodate the number of rows in Excel. From what I can see, with one combination of counts (5, 2, 1, 3, 0) I can get up to 1100 combinations.
I have been at this for a long time and my solution so far is not great. Can anyone help me with this?
Screenshot: Excel input table:
The code I was working with so far. The problem is that it is not able to pick a range from each list and enforce a total limit of 11:
Sub Combinations(DataRange As Range, DataHasHeaders As Boolean, _
                 ResultRange As Range, HeadersInResult As Boolean)
    ' NOTE: the original Sub header was cut off; this signature is reconstructed
    ' from the variables the body uses.
    Dim rngData As Range
    Dim rngResults As Range
    Dim lngCount As Long
    Dim lngCol As Long
    Dim lngNumberRows As Long
    Dim ItemCount() As Long
    Dim RepeatCount() As Long
    Dim PatternCount() As Long
    Dim lngForRow As Long
    Dim lngForPattern As Long
    Dim lngForItem As Long
    Dim lngForRept As Long
    Dim DataArray() As Variant
    Dim ResultArray() As Variant

    Set rngData = DataRange
    If DataHasHeaders Then
        Set rngData = rngData.Offset(1).Resize(rngData.Rows.Count - 1)
    End If
    DataArray = rngData.Value
    lngCol = rngData.Columns.Count
    ReDim ItemCount(1 To lngCol)
    ReDim RepeatCount(1 To lngCol)
    ReDim PatternCount(1 To lngCol)

    For lngCount = 1 To lngCol
        ItemCount(lngCount) = _
            Application.WorksheetFunction.CountA(rngData.Columns(lngCount))
        If ItemCount(lngCount) = 0 Then
            MsgBox "Column " & lngCount & " does not have any items in it."
            Exit Sub
        End If
    Next

    lngNumberRows = Application.Product(ItemCount)
    ReDim ResultArray(1 To lngNumberRows, 1 To lngCol)
    RepeatCount(lngCol) = 1
    For lngCount = (lngCol - 1) To 1 Step -1
        RepeatCount(lngCount) = ItemCount(lngCount + 1) * RepeatCount(lngCount + 1)
    Next lngCount
    For lngCount = 1 To lngCol
        PatternCount(lngCount) = lngNumberRows / _
            (ItemCount(lngCount) * RepeatCount(lngCount))
    Next

    For lngCount = 1 To lngCol
        lngForRow = 1
        For lngForPattern = 1 To PatternCount(lngCount)
            For lngForItem = 1 To ItemCount(lngCount)
                For lngForRept = 1 To RepeatCount(lngCount)
                    ResultArray(lngForRow, lngCount) = DataArray(lngForItem, lngCount)
                    lngForRow = lngForRow + 1
                Next lngForRept
            Next lngForItem
        Next lngForPattern
    Next lngCount

    Set rngResults = ResultRange(1, 1).Resize(lngNumberRows, lngCol)
    If DataHasHeaders And HeadersInResult Then
        rngResults.Rows(1).Value = DataRange.Rows(1).Value
        Set rngResults = rngResults.Offset(1)
    End If
    rngResults.Value = ResultArray()
End Sub
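Not VBA, but the core selection logic (per-column counts within min/max that sum to a fixed total, then per-column combinations of names) can be sketched compactly; the function names and the small sample lists below are my own:

```python
import itertools

def pick_counts(ranges, total):
    # yield one count per column, each within its (min, max), summing to `total`
    for counts in itertools.product(*(range(lo, hi + 1) for lo, hi in ranges)):
        if sum(counts) == total:
            yield counts

def combos(columns, ranges, total):
    # for each valid count vector, take every combination of names per column
    for counts in pick_counts(ranges, total):
        pools = [itertools.combinations(col, k) for col, k in zip(columns, counts)]
        for chosen in itertools.product(*pools):
            yield [name for group in chosen for name in group]

columns = [["Apple", "Apricot", "Avocado"], ["Potato", "Cauliflower"]]
ranges = [(1, 2), (1, 2)]
rows = list(combos(columns, ranges, 3))  # every row has exactly 3 names
```

With the five real columns and total 11, the same two functions apply unchanged; `len(list(pick_counts(ranges, 11)))` also gives the count-vector tally up front, before any rows are generated.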

how to simulate using python code?
How do I code this problem using Python? I'm having trouble setting it up.

For a subset of 5 vertices find P(all edges between these vertices are present in G)
For a random graph, G, on n vertices, each possible edge is present independently with probability k, 0 <= k <= 1. I seek P(all edges between these vertices are present in G).
My thoughts so far:
 If we have the empty subset, p = 1.
 If we have a one-element set, p = 1.
 If we have a two-element set, p = k.
 If we have a three-element set, p = k^3.
 If we have a four-element set, p = k^6.
 If we have a five-element set, p = k^10.
If the above is correct, then I can capture the probability as the following: P = k^(n C 2)
However, this only works for two- to five-element sets. If I have a zero- or one-element set, the formula is incorrect. If I am understanding everything correctly up to this point, how can I capture the other two cases?
Is the only possibility a piecewise-defined function? If n = 0 or n = 1, 1; otherwise, k^(n C 2).
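In Python, the piecewise case may be unnecessary: `math.comb(n, 2)` is already 0 for n < 2, so k**comb(n, 2) gives 1 there. A sketch of both the closed form and a Monte Carlo check (the function names are my own):

```python
import math
import random

def edge_prob(n, k):
    # math.comb(n, 2) is 0 for n < 2, so k**0 == 1 covers those cases
    return k ** math.comb(n, 2)

def simulate(n, k, trials=100_000, seed=0):
    # draw each of the C(n, 2) possible edges independently with probability k
    rng = random.Random(seed)
    m = math.comb(n, 2)
    hits = sum(all(rng.random() < k for _ in range(m)) for _ in range(trials))
    return hits / trials
```

The simulation only needs to draw the edges inside the chosen subset, since the other edges of G are irrelevant to the event.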

Unittesting a probability distribution with conditionals
I have a function

choose(elems) -> elem

that calls rand(), which makes it nondeterministic. To be able to better test this, I figured that I could split this function in two:

generate_choices(elems, ...) -> distribution
choose(distribution) -> elem

where choose() is a thin wrapper around rand() and generate_choices() generates a distribution from which to draw an element. I could then deterministically test that this probability distribution is as expected.
The distribution is uniform but with two conditionals:
 If there are not enough elems, add a random fallback element uniformly.
 If there are still not enough elems, add a random default element uniformly.
Some examples:

generate_choices([a, b, c, d], [], []) -> [a, b, c, d]
generate_choices([a, b, c], [fallback1], []) -> [a, b, c, fallback1]
generate_choices([a, b, c], [fb1, fb2], []) -> [a, b, c, (fb1 | fb2)]
generate_choices([a, b], [fb1, fb2], [default1]) -> [a, b, (fb1 | fb2), default1]
generate_choices([a, b], [fb1, fb2], [d1, d2]) -> [a, b, (fb1 | fb2), (d1 | d2)]
generate_choices([a], [fb1, fb2], [d1, d2]) -> [a, (fb1 | fb2), (d1 | d2)]
My question is then: how should I model distribution?
 If I choose a simple list and call rand() from within generate_choices() to fill the fallback and default, then I can only test some deterministic parts of generate_choices().
 If I choose three lists, (elems, fallback, default), then generate_choices() is fully deterministic, but then choice() becomes less trivial and must be tested more thoroughly anyway.
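A sketch of the second option (unresolved groups kept as tuples inside one returned list), with hypothetical names and a `minimum` threshold of 4 chosen to match the examples above:

```python
import random

def generate_choices(elems, fallbacks, defaults, minimum=4):
    """Deterministic part: build the pool; unresolved groups stay as tuples."""
    choices = list(elems)
    if len(choices) < minimum and fallbacks:
        choices.append(tuple(fallbacks))   # one slot, resolved uniformly later
    if len(choices) < minimum and defaults:
        choices.append(tuple(defaults))    # one more slot if still short
    return choices

def choose(distribution, rng=random):
    # thin nondeterministic wrapper: pick a slot, then resolve grouped slots
    slot = rng.choice(distribution)
    return rng.choice(slot) if isinstance(slot, tuple) else slot
```

Here generate_choices() is fully deterministic and directly assertable, while choose() stays a thin two-step wrapper around the random source; the grouped slots deliberately share a single slot's probability, matching the grouped entries in the examples.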
 If there are not enough

Random Forest Classifier Probabilities
My dataset has 140k rows with 5 attributes and 1 Attrition as target variable (value can either be 0 (Customer Does not churn) or 1 (Customer churn)). I divided my dataset in 80% training and 20% testing. My dataset is heavily imbalanced. 84% of my dataset has 0 as target variable and only 16% has 1 as target variable.
The feature importance of my training dataset is as follows:
ColumnA = 28%, ColumnB = 27%, AnnualFee = 17%, ColumnD = 17% and ColumnE = 11%
I initially wanted to do a very simple check of my model. After creating a Random Forest Classifier I tested the model on a dataset with just 5 rows. I kept all variables constant except Column AnnualFee. Below is a snapshot of my test dataset:
Column A   Column B   AnnualFee   ColumnD   ColumnE
4500       3.9        5%          2.1       7
4500       3.9        10%         2.1       7
4500       3.9        15%         2.1       7
4500       3.9        20%         2.1       7
4500       3.9        25%         2.1       7
I expected that as annual fee increases the probability of customer churn also increases. But my rf.predict_proba(X_test) seems to be all over the place. I am not sure why this is happening:
I tried two different pieces of code, but the anomaly seems to happen with both:
Code 1:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=400, random_state=0, min_samples_split=2,
                            min_samples_leaf=5, class_weight={0: .0001, 1: .9999})
rf.fit(X_train, Y_train)
Code 2: Not My Code  Got it Online
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

clf_4 = RandomForestClassifier(class_weight={0: 1, 1: 5})
estimators_range = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25])
depth_range = np.array([11, 21, 35, 51, 75, 101, 151, 201, 251, 301, 401, 451, 501])
kfold = 5
skf = StratifiedKFold(n_splits=kfold, random_state=42)
model_grid = [{'max_depth': depth_range, 'n_estimators': estimators_range}]
grid = GridSearchCV(clf_4, model_grid, cv=StratifiedKFold(n_splits=5, random_state=42),
                    n_jobs=8, scoring='roc_auc')
grid.fit(X_train, Y_train)
I would really appreciate any help on this!
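As a sanity check, it can help to rebuild the experiment on synthetic data where the label really is driven by one column, and to index predict_proba through rf.classes_ (the column order follows rf.classes_, not necessarily [0, 1]). Everything below is made-up data, not the dataset from the question:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
# churn (1) driven mostly by column 2, the "AnnualFee" stand-in
y = (X[:, 2] + 0.1 * rng.standard_normal(1000) > 0.5).astype(int)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

probe = np.tile(X[:1], (5, 1))               # 5 copies of one row
probe[:, 2] = np.linspace(0.05, 0.95, 5)     # vary only the fee analogue
churn_col = list(rf.classes_).index(1)       # locate the class-1 column
p_churn = rf.predict_proba(probe)[:, churn_col]
# here p_churn rises with column 2, because that feature truly drives the label
```

If the same probe on the real model is "all over the place", the usual suspects are the other four features interacting with AnnualFee in the trees, plus the extreme class_weight settings distorting the leaf probabilities.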

Choosing the best distribution for any given data by using R
So, I am a beginner in R and would like to know if it is possible to automatically best-fit a distribution to any given set of data in R. (Arena Simulation's Input Analyzer is a very powerful tool for this, and I want to perform a similar task in R.)
Note 1: I need a solution without using Q-Q plots. Note 2: Corresponding p-values could work for me if that is the only possible way to output such data.
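Not R, but the underlying "fit every candidate and rank by a goodness-of-fit p-value" loop can be sketched in Python/scipy; the candidate list and the KS-test ranking are my own choices (in R, the fitdistrplus package's fitdist() plays a similar role):

```python
import numpy as np
from scipy import stats

def best_fit(data, candidates=("norm", "expon", "gamma", "lognorm")):
    """Fit each candidate distribution and rank by the KS-test p-value."""
    results = []
    for name in candidates:
        params = getattr(stats, name).fit(data)
        pvalue = stats.kstest(data, name, args=params).pvalue
        results.append((pvalue, name, params))
    return max(results)  # highest p-value wins

rng = np.random.default_rng(0)
pvalue, name, params = best_fit(rng.normal(10.0, 2.0, size=500))
```

One caveat: a KS test whose parameters were estimated from the same data gives optimistic p-values, so treat them as a ranking device rather than a formal test.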