One-dimensional input and multi-dimensional output
My task is to predict five days' worth of results using data for one day, i.e.
Let's say the features under consideration are: height, weight, size and colour.
The shape of each input data point is (1, 4). The shape of each output data point is (5, 4).
Is there any regression method with which I can model this problem? Or, better, can this problem also be modelled using a dense neural network?
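For what it's worth, many sklearn regressors accept multi-output targets directly, so one option is to flatten each (5, 4) target into a 20-dimensional vector, fit, and reshape the prediction back. A minimal sketch on synthetic data (not the asker's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: 100 samples, 4 input features, and a (5, 4) target per sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = rng.normal(size=(100, 5, 4))

# Flatten each (5, 4) target into 20 columns; sklearn regressors treat a
# 2-D y as a multi-output regression problem.
model = LinearRegression()
model.fit(X, Y.reshape(100, -1))

# Predict for one new day and reshape the 20 outputs back to (5, 4).
pred = model.predict(rng.normal(size=(1, 4))).reshape(5, 4)
print(pred.shape)  # (5, 4)
```

The same flattening trick works for a dense neural network: an input layer of width 4 and an output layer of width 20, reshaped to (5, 4) afterwards.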
See also questions close to this topic

List of xy coordinates to predict a xy target
I have a database containing coordinate points (X, Y). A column corresponds to a single coordinate: for n points, I therefore have 2n columns following the pattern X1, Y1, X2, Y2, ..., Xn, Yn.
Each line corresponds to a polygon, described by a sequence of (X, Y) coordinates (these are my features). For each line, I have an output target which is a point with XY coordinates, i.e. two outputs (TX and TY).
My output corresponds to the location of a target to predict in (X, Y) according to the points.
I have already done some work on the database by transforming all these point coordinates into vector coordinates (to add cohesion). This allowed me to build linear regression models, especially with Ridge, but the predictions do not satisfy me; I would like to be more precise.
Have you ever worked on a similar problem? If so, I am looking for some avenues to explore; otherwise, your ideas are welcome.
The aim of my project is to predict the location of electrical outlets in a given room. For that I translated my room into a polygon, and the electrical outlet into a target T.
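If Ridge (a linear model) is not precise enough, one avenue is a non-linear multi-output regressor; random forests, for example, handle a 2-column (TX, TY) target out of the box. A sketch on fabricated polygon data (the centroid-based target here is purely illustrative, not the real relationship):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical data: each row is a polygon flattened as X1, Y1, ..., Xn, Yn,
# and the target is the outlet location (TX, TY).
rng = np.random.default_rng(0)
n_points = 8
X = rng.uniform(0, 10, size=(500, 2 * n_points))

# Fake target: roughly the polygon centroid plus noise, just for the demo.
Y = np.stack([X[:, 0::2].mean(axis=1), X[:, 1::2].mean(axis=1)], axis=1)
Y += rng.normal(scale=0.1, size=Y.shape)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Random forests accept a 2-D target and can capture non-linear structure
# that a linear Ridge model cannot.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, Y_tr)
print(model.score(X_te, Y_te))
```

Gradient boosting (wrapped in `MultiOutputRegressor`) or a small neural network are other candidates worth comparing against the Ridge baseline.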

Clustering group with few samples
I would like feedback from someone with more experience.
I have a dataframe in the format shown in the attached image, with about 1 million samples and 50 features.
What I'm looking for are customers similar to those who own 'Product A'. I thought about using dummies on the categorical variables and then clustering. Problem: customers who own 'Product A' represent about 1% of all customers, so I am not sure a cluster will be able to separate the group I am looking for. Is clustering appropriate in this case? If so, which algorithm would be most efficient? I have only worked with K-means, and I don't know if it is ideal here, since it requires specifying the number of clusters in advance.
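Since ownership of 'Product A' is a known label, one alternative to unsupervised clustering is to frame this as a classification problem and rank customers by predicted probability of ownership; the highest-scoring non-owners are the "similar" customers. A minimal sketch on synthetic data (the feature and label construction here is invented for the demo):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataframe: 10,000 customers, 50 features,
# with rare (roughly 1-2%) ownership of "Product A".
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))
owns_a = (X[:, 0] + rng.normal(scale=0.5, size=10_000)) > 2.3

X_tr, X_te, y_tr, y_te = train_test_split(
    X, owns_a, stratify=owns_a, random_state=0)

# class_weight="balanced" compensates for the heavy class imbalance.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Rank customers by predicted probability of owning Product A; the
# top-scoring ones are the most "similar" to existing owners.
scores = clf.predict_proba(X_te)[:, 1]
top_similar = np.argsort(scores)[::-1][:100]
```

If clustering is still preferred, density-based methods (DBSCAN/HDBSCAN) avoid fixing the number of clusters, but with a 1% subgroup a supervised ranking like the above is usually more direct.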

What should I do if there are too many zero values in the outlier handling part?
I am working on a data science project about churn analysis (whether a customer is leaving or not). I am doing the outlier-handling part, but I have a question about how to approach it when my data has many zero values. I know the zeros may carry meaning, but please see the attached results (value counts, z-scores, and outlier bounds).
What should I do for better results, and should I keep all the zero values? Any suggestions?
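One common option (an assumption about the data, not a universal rule) is to treat zero as its own "no activity" category and run outlier detection only on the non-zero part of the distribution, so the spike at zero does not distort the mean and standard deviation. A sketch with fabricated data:

```python
import numpy as np
import pandas as pd

# Toy column: mostly zeros (e.g. "number of support calls"),
# with a long-tailed positive part.
s = pd.Series([0] * 900 +
              list(np.random.default_rng(0).lognormal(2, 1, 100)))

# Compute z-scores on the non-zero values only; the zeros are kept
# as a separate, meaningful category rather than treated as outliers.
nonzero = s[s > 0]
z = (nonzero - nonzero.mean()) / nonzero.std()
outliers = nonzero[z.abs() > 3]

print(f"{(s == 0).mean():.0%} zeros, "
      f"{len(outliers)} outliers among non-zero values")
```

Whether to keep the zeros then becomes a modelling question (they often predict churn well), separate from how you cap or remove extreme non-zero values.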

Get coefficients of a logit model
Dear Stackoverflow users,
I have been working on a machine learning project. A few months ago I trained a logistic regression model and saved it using pickle, so I could apply it to my datasets. I use this code to load the model when I need it.
import pickle

infile = open('classifier', 'rb')
MODEL = pickle.load(infile)
infile.close()
MODEL

output:

GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=500, n_splits=5, random_state=1234),
             estimator=Pipeline(steps=[('transformer', QuantileTransformer(random_state=1234)),
                                       ('scaler', RobustScaler()),
                                       ('logreg', LogisticRegression(penalty='elasticnet', solver='saga'))]),
             n_jobs=1,
             param_grid={'logreg__C': [0.1], 'logreg__l1_ratio': [0.1],
                         'transformer__output_distribution': ['uniform']},
             return_train_score=True, scoring='roc_auc')
The model is not just the logit; the pipeline has several steps.
I would like to get the coefficients of the logit, but when I use .coef_ I get an error:

logreg = MODEL.estimator.steps[2][1]
logreg.coef_

AttributeError: 'LogisticRegression' object has no attribute 'coef_'
Any ideas on how to solve this?
Thanks in advance!
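For reference, `GridSearchCV.estimator` is the *unfitted* template pipeline passed to the search, which would explain the missing `coef_`; the refitted pipeline lives in `best_estimator_` (available because `refit=True` is the default). A self-contained sketch on toy data (not the asker's saved model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Small reproduction of the pickled object's structure.
X, y = make_classification(n_samples=200, random_state=0)
MODEL = GridSearchCV(
    Pipeline([('scaler', RobustScaler()),
              ('logreg', LogisticRegression(penalty='elasticnet', solver='saga',
                                            l1_ratio=0.1, max_iter=5000))]),
    param_grid={'logreg__C': [0.1]},
    cv=3,
).fit(X, y)

# MODEL.estimator is the unfitted template; the fitted pipeline is in
# MODEL.best_estimator_, and named_steps avoids fragile positional indexing.
logreg = MODEL.best_estimator_.named_steps['logreg']
print(logreg.coef_)
```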

can't apply sklearn.compose.ColumnTransformer on only one column of pandas dataframe
I have defined a custom transformer that takes a pandas dataframe, applies a function to only one column, and leaves all the remaining columns untouched. The transformer works fine during testing, but not when I include it as part of a Pipeline.
Here's the transformer:
import re
from sklearn.base import BaseEstimator, TransformerMixin

class SynopsisCleaner(BaseEstimator, TransformerMixin):
    def __init__(self):
        return None

    def fit(self, X, y=None, **fit_params):
        # nothing to learn from data.
        return self

    def clean_text(self, text):
        text = text.lower()
        text = re.sub(r'@[a-zA-Z0-9_]+', '', text)
        text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)
        text = re.sub(r'www.[^ ]+', '', text)
        text = re.sub(r'[a-zA-Z0-9]*www[a-zA-Z0-9]*com[a-zA-Z0-9]*', '', text)
        text = re.sub(r'[^a-zA-Z]', ' ', text)
        text = [token for token in text.split() if len(token) > 2]
        text = ' '.join(text)
        return text

    def transform(self, X, y=None, **fit_params):
        for i in range(X.shape[0]):
            X[i] = self.clean_text(X[i])
        return X
When I test it manually like this, it is working just as expected.
train_synopsis = SynopsisCleaner().transform(train_data['Synopsis'])
But, when I include it as a part of sklearn pipeline:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# part 1: defining a column transformer that learns on only one column and transforms it
synopsis_clean_col_tran = ColumnTransformer(
    transformers=[('synopsis_clean_col_tran', SynopsisCleaner(), ['Synopsis'])],
    # set remainder to passthrough to pass along all the unspecified columns untouched to the next steps
    remainder='passthrough')

# make a pipeline now with all the steps
pipe_1 = Pipeline(steps=[('synopsis_cleaning', synopsis_clean_col_tran)])
pipe_1.fit(train_data)
I get a KeyError, as shown below (traceback condensed to the relevant frames):
KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>()
      7 pipe_1 = Pipeline(steps=[('synopsis_cleaning', synopsis_clean_col_tran)])
----> 8 pipe_1.fit(train_data)
...
/usr/local/lib/python3.6/dist-packages/sklearn/pipeline.py in _fit_transform_one(...)
--> 728             res = transformer.fit_transform(X, y, **fit_params)
<ipython-input> in transform(self, X, y, **fit_params)
     20     def transform(self, X, y=None, **fit_params):
     21         for i in range(X.shape[0]):
---> 22             X[i] = self.clean_text(X[i])
     23         return X
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)
--> 2902         indexer = self.columns.get_loc(key)
...
KeyError: 0
What am I doing wrong here?
EDIT 1: without brackets and with the column name specified as a string, this is the error I see (traceback condensed):
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      7 pipe_1 = Pipeline(steps=[('synopsis_cleaning', synopsis_clean_col_tran)])
----> 8 pipe_1.fit(train_data)
...
/usr/local/lib/python3.6/dist-packages/sklearn/compose/_column_transformer.py in _validate_output(self, result)
    400                 raise ValueError(
    401                     "The output of the '{0}' transformer should be 2D (scipy "
--> 402                     "matrix, array, or pandas DataFrame).".format(name))

ValueError: The output of the 'synopsis_clean_col_tran' transformer should be 2D (scipy matrix, array, or pandas DataFrame).
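For reference, both errors are consistent with ColumnTransformer handing the transformer the selected column(s) as a DataFrame (so `X[i]` does a column lookup with the label `0`) and with `transform` returning a 1-D result. A sketch of a rewritten transformer that accepts whatever it is given and returns a 2-D array; the regexes are simplified and the sample data is invented:

```python
import re
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

class SynopsisCleaner(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def clean_text(self, text):
        text = text.lower()
        text = re.sub(r'@[a-zA-Z0-9_]+', '', text)
        text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)
        text = re.sub(r'[^a-zA-Z]', ' ', text)
        return ' '.join(tok for tok in text.split() if len(tok) > 2)

    def transform(self, X, y=None):
        # ColumnTransformer passes the selected column(s) as a DataFrame;
        # np.asarray + ravel makes the loop indifferent to that, and the
        # final reshape returns the 2-D output ColumnTransformer expects.
        values = np.asarray(X).ravel()
        cleaned = [self.clean_text(str(v)) for v in values]
        return np.array(cleaned).reshape(-1, 1)

train_data = pd.DataFrame({'Synopsis': ['Visit https://example.com NOW!!',
                                        'a short one'],
                           'Other': [1, 2]})
pipe = Pipeline([('synopsis_cleaning',
                  ColumnTransformer([('clean', SynopsisCleaner(), ['Synopsis'])],
                                    remainder='passthrough'))])
out = pipe.fit_transform(train_data)
print(out)
```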

How to apply tfidf to rows of text
I have rows of blurbs (in text format) and I want to use tfidf to define the weight of each word. Below is the code:
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

df["punc_blurb"] = df["blurb"].apply(remove_punctuations)
df = pd.DataFrame(df["punc_blurb"])

vectoriser = TfidfVectorizer()
x = vectoriser.fit_transform(df["punc_blurb"])
df["blurb_Vect"] = list(x.toarray())

df_vectoriser = pd.DataFrame(x.toarray(), columns=vectoriser.get_feature_names())
print(df_vectoriser)
All I get is a massive array of numbers, and I am no longer sure whether it is giving me TF or TF-IDF, since frequent words (the, and, etc.) all have scores above 0.
The goal is to see the weights in the tfidf column shown below, and I am unsure whether I am doing this in the most efficient way:

R: Variable lengths differ
I'm trying to create a linear model based off a time series analysis such as the following:
Model 1 = novice_crash ~ time + grad + time.after + month
I have the following code that creates the variables in question above:
grad <- c(replicate(66, 0), replicate(30, 1))
grad <- ts(grad, start=c(2002, 1), frequency=12)
time <- seq(1, 96, by=1)
time <- ts(time, start=c(2002, 1), frequency=12)
time.after <- c(replicate(66, 0), replicate(30, 1))
time.after <- ts(time.after, start=c(2002, 1), frequency=12)
#month <- seasonaldummy(novice_crashes)
month <- time
grad.lag1 <- lag(grad)
time.after.lag1 <- lag(time.after)
'novice_crashes' is a ts object that comes from the following code (where 'crash' is read from a CSV file):

novice <- crash$novice_crash
total <- crash$total_crash
novice_crashes <- ts(novice, start=c(2002, 12), end=c(2009, 12), frequency=12)
When I try to run this:

model1 <- lm(novice_crashes ~ time + grad + time.after + month)

I get the following error:

Error in model.frame.default(formula = novice_crashes ~ time + grad + :
  variable lengths differ (found for 'time')
I have checked the lengths of time, grad, time.after and month (which are all 96 units long).
The dataset
crash
had NAs present, but I removed them with crash <- na.omit(crash).
I'm much more used to Python, so I may be missing something here...

How to plot statsmodels timeseries plots side by side and customize x axis in Python
I am creating time-series plots, specifically STL decompositions, and have already managed to get all the components into one figure. The issue is showing the figures side by side, as in the linked solution. I tried that solution, but it did not work; I kept getting an empty plot at the top. I have four time-series plots and managed to output them below one another, but I would like them side by side, or two side by side with the last two below, also side by side.
Then, for the dates on the x-axis, I have already tried using
ax.xaxis.set_major_formatter(DateFormatter('%b %Y'))
but it is not working in the code below, since the res.plot function won't allow it. I have searched everywhere but can't find a solution to my issue. I would appreciate any help.
Data
     Date        Crime
0    2018-01-01  149
1    2018-01-02   88
2    2018-01-03   86
3    2018-01-04  100
4    2018-01-05  123
...  ...         ...
664  2019-10-27  142
665  2019-10-28  113
666  2019-10-29  126
667  2019-10-30  120
668  2019-10-31  147
Code
from statsmodels.tsa.seasonal import STL
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
from matplotlib.dates import DateFormatter

register_matplotlib_converters()
sns.set(style='whitegrid', palette=sns.color_palette('winter'),
        rc={'axes.titlesize': 17, 'axes.labelsize': 17, 'grid.linewidth': 0.5})
plt.rc("axes.spines", top=False, bottom=False, right=False, left=False)
plt.rc('font', size=13)
plt.rc('figure', figsize=(17, 12))

#fig = plt.figure()
#fig, axes = plt.subplots(2, sharex=True)
#fig, (ax, ax2, ax3, ax4) = plt.subplots(1, 4, sharey=True)
#fig, ax = plt.subplots()
#fig, axes = plt.subplots(1, 3, sharex=True, sharey=True, figsize=(12, 5))
#ax.plot([0, 0], [0, 1])

stl = STL(seatr, seasonal=13)
res = stl.fit()
res.plot()
plt.title('Seattle', fontsize=20, pad=670)

stl2 = STL(latr, seasonal=13)
res2 = stl2.fit()
res2.plot()
plt.title('Los Angeles', fontsize=20, pad=670)

stl3 = STL(sftr, seasonal=13)
res3 = stl3.fit()
res3.plot()
plt.title('San Francisco', fontsize=20, pad=670)

stl4 = STL(phtr, seasonal=13)
res4 = stl4.fit()
res4.plot()
plt.title('Philadelphia', fontsize=20, pad=670)

#ax.xaxis.set_major_formatter(DateFormatter('%b %Y'))

What is the definition of InstantVector in Prometheus?
Instant vector: a set of time series containing a single sample for each time series, all sharing the same timestamp
http_request_count => results in:
 http_request_count{status="200"} 20
 http_request_count{status="404"} 3
 http_request_count{status="500"} 5
Questions
What does "single sample" mean in this definition? And what does it mean that all the series share the same timestamp? How is it possible that multiple values, each collected at its own time (date?), end up with the same timestamp?

Models are not all fitted to the same size of dataset during ANOVA analysis
I have these two simple regressions
lm3 = lm(mntlhlth ~ attend + rstress, data=g)
lm4 = lm(mntlhlth ~ attend + rstress + married, data=g)
Then I try to do an ANOVA analysis:

anova(lm3, lm4)

However, I get the error:

Error in anova.lmlist(object, ...) :
  models were not all fitted to the same size of dataset
How can I fix it?
Thank you!

Why do I have a negative R squared if my model has an intercept?
Don't know what else to say: I am running a first-difference panel IV model and getting a negative R squared. I imagine it has something to do with the instrumental variables but can't figure out what. I am using the ivreg function in R from the AER package and have 700+ observations. Any ideas?
Having an intercept is relevant because, with an intercept, the model could always achieve a better fit (a lower sum of squared errors) than a negative R squared implies, simply by setting all coefficients to 0 and the intercept equal to the mean of the dependent variable.
Here are the last few lines of summary(model):

Residual standard error: 5.281 on 646 degrees of freedom
Multiple R-Squared: -11.07, Adjusted R-squared: -12.28
Wald test: 0.6517 on 65 and 646 DF, p-value: 0.984