Keep Constant Columns h2o
I'm trying to implement a gradient boosting machine (GBM) model using R's h2o package. However, the model keeps dropping a column that I know, from other model builds, is important.
Warning message: In .h2o.startModelJob(algo, params, h2oRestApiVersion) : Dropping bad and constant columns:['mycolumn']
How do I stop h2o from dropping this column? Here is what I tried:
gbm_fit <- h2o.gbm(x, y, train_set,
                   nfolds = 10, ntrees = 250, learn_rate = 0.15, max_depth = 7,
                   validation_frame = validate_set, seed = 233,
                   ignore_const_cols = FALSE)
Make sure the column class is correct and accepted by the function.
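For what it's worth, here is a minimal sketch of how one might check and convert the column type before fitting, assuming train_set is an H2OFrame and "mycolumn" is the column named in the warning; whether this alone keeps the column is not guaranteed:

# Sketch only: inspect the frame, then coerce the suspect column to the type you expect.
# "mycolumn" stands in for the column named in the warning.
h2o.describe(train_set)                     # type, min/max and missing count per column
train_set[["mycolumn"]] <- h2o.asnumeric(train_set[["mycolumn"]])  # or h2o.asfactor() for categoricals

gbm_fit <- h2o.gbm(x = x, y = y, training_frame = train_set,
                   validation_frame = validate_set,
                   nfolds = 10, ntrees = 250, learn_rate = 0.15, max_depth = 7,
                   seed = 233,
                   ignore_const_cols = FALSE)   # explicitly keep constant columns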
See also questions close to this topic
DiagrammeR export_graph Invalid asm.js
I'm having a problem exporting graphs in R to PDFs using DiagrammeR.
There is an example below to reproduce the problem. The PDFs are produced inconsistently, and sometimes not at all.
The error message I get comes from the export_graph call in the code snippet below.
I'm using RStudio Version 1.1.463 and R 3.5.2 on Windows 10.
"\<"unknown">":1919791: Invalid asm.js: Function definition doesn't match use"
library(data.tree)
library(yaml)
library(DiagrammeR)
library(DiagrammeRsvg)

fileName <- system.file("extdata", "jennylind.yaml", package = "data.tree")
cat(readChar(fileName, file.info(fileName)$size))
lol <- yaml.load_file(fileName)
jl <- as.Node(lol)
pic <- ToDiagrammeRGraph(jl)
render_graph(pic)
export_graph(pic, "C:/Tmp/plot.pdf", file_type = "pdf")
Plotting in ggplot after converting to data.frame with a single column?
I'm trying to convert some simple data into a form I thought ggplot2 would accept.
I grab some simple stock data and now I just want to plot it. Later I want to add, say, a 10-day moving average or a 30-day historical volatility series to the same plot, which is why I'm using ggplot.
I thought it would work something like this line of pseudocode:
library(quantmod)
library(ggplot2)

start = as.Date("2008-01-01")
end = as.Date("2019-02-13")
tickers = c("AMD")
getSymbols(tickers, src = 'yahoo', from = start, to = end)
closing_prices = as.data.frame(AMD$AMD.Close)
ggplot(closing_prices, aes(y = 'AMD.Close'))
But I can't even get this to work. The problem, of course, appears to be that I don't have an x-axis. How do I tell ggplot to use the index column as the x-axis? Can this not work? Do I have to create a new "date" or "day" column?
This line, for instance, using the regular R plot() function, works just fine. It works without requiring me to supply an explicit x-axis and produces a graph; however, I haven't figured out how to layer other lines onto that same graph. Evidently ggplot is better for that, so I tried it.
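One possible way to get the missing x-axis, sketched under the assumption that the dates ended up as the row names of closing_prices after as.data.frame() (which is what converting an xts object normally gives you):

# Sketch: turn the row names back into a Date column, then map it to x.
closing_prices$Date <- as.Date(rownames(closing_prices))

ggplot(closing_prices, aes(x = Date, y = AMD.Close)) +   # note: unquoted column names in aes()
  geom_line()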
Using scale_color_gradient2 with a variable of class Date
I'm trying to color points by date with ggplot2, but when I try to customize the color using scale_color_gradient2, I get an error:
Error in as.Date.numeric(value) : 'origin' must be supplied.
I can't seem to figure out how to pass the origin to scale_color_gradient2.
I've provided an example below. Any advice?
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
day <- sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"), 100)
myData <- data.frame(x1, x2, day)

# this plot works as expected
ggplot(myData, aes(x = x1, y = x2, color = day)) +
  geom_point()

# scale_color_gradient2() asks for an origin, but I can't figure out how to supply one
ggplot(myData, aes(x = x1, y = x2, color = day)) +
  geom_point() +
  scale_color_gradient2()
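One workaround I can think of (not necessarily the intended ggplot2 solution) is to map the numeric form of the date and format the legend labels back into dates; the midpoint choice and the "1970-01-01" origin below are assumptions:

# Sketch: colour by the numeric date, centre the diverging scale on the mean date,
# and convert the legend labels back into readable dates.
ggplot(myData, aes(x = x1, y = x2, color = as.numeric(day))) +
  geom_point() +
  scale_color_gradient2(midpoint = as.numeric(mean(myData$day)),
                        labels = function(x) format(as.Date(x, origin = "1970-01-01")),
                        name = "day")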
Word2vec compact models
Are there any word2vec models that do not require a vocabulary? Everything I found in torchtext first wants to build the vocabulary with build_vocab. But if I have a huge body of text, I would like a model that works at the level of phrases, and I did not find one.
Supervised learning for a parcours
For my school project I have to implement a neural network for a parcours. I know it's a bit useless, but I want the neural net to learn a simple algorithm:
if front right is bigger than front left -> go right, else -> go left.
I want to use supervised learning. I have 2 input neurons, 2 hidden neurons and 1 output neuron. The goal is that when the player has to go left, the output is a number under 0.5, and when the player has to go right, the network returns a number greater than 0.5.
Somehow I made a mistake and the network always converges to an output of 0.5. Do you know what I did wrong and what I can do about it?
That's how the parcours looks.
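For comparison, here is a minimal supervised baseline for the rule described above, sketched with R's nnet package; the 2-2-1 layout and the under/over 0.5 target encoding mirror the post, and all variable names are made up:

library(nnet)

set.seed(1)
# Simulated sensor readings: distance to the wall on the front-left and front-right.
front_left  <- runif(500, 0, 10)
front_right <- runif(500, 0, 10)
# Target: 1 = go right when front_right > front_left, otherwise 0 = go left.
go_right <- as.numeric(front_right > front_left)

train <- data.frame(front_left, front_right, go_right)

# 2 inputs -> 2 hidden units -> 1 sigmoid output, trained on the labelled examples.
net <- nnet(go_right ~ front_left + front_right, data = train,
            size = 2, maxit = 500, decay = 1e-4)

# Outputs above 0.5 mean "go right", below 0.5 mean "go left".
predict(net, data.frame(front_left = 2, front_right = 7))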
Categorical Variables and too many NA for ML model
We have a data set of 250 variables and 50,000 records. One variable is numeric, 248 variables are categorical, and one variable is binary (the target). Each categorical variable has more than 3,000 levels, and there are many NAs. Each row is the record of diseases a patient has suffered, which is why there are so many NAs: one patient may have suffered 100 diseases while another has suffered only one. The objective is to predict whether a patient will have a specific disease from the information about the other diseases they have suffered. How can this data set be handled in machine learning?
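Without knowing the exact layout, one common way to handle this kind of record is to turn the 248 disease columns into one binary indicator column per disease code (a multi-hot encoding), so the 3,000+ levels become sparse 0/1 features and the NAs disappear by construction. A rough sketch with tidyr, assuming hypothetical columns patient_id and disease_1 … disease_248:

library(dplyr)
library(tidyr)

# Sketch only: reshape "one column per disease slot" into "one 0/1 column per disease code".
long <- patients %>%
  pivot_longer(starts_with("disease_"), values_to = "disease",
               values_drop_na = TRUE)            # NAs simply drop out here

wide <- long %>%
  distinct(patient_id, disease) %>%
  mutate(present = 1L) %>%
  pivot_wider(id_cols = patient_id, names_from = disease,
              values_from = present, values_fill = 0L)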
H2O.ai AutoML exclude_algos = "GBM" not working, how to fix it
I am using H2O.ai for an ML problem, and I am using its h2o.automl function. I want it not to use the GBM algorithm, so per the documentation I call it like this:
mA <- h2o.automl(2:4, 5, train,
                 max_runtime_secs = 120,
                 exclude_algos = "GBM")
It runs and creates its leaderboard, but when I print the leaderboard the winner is a GBM:
Slot "leaderboard": model_id mean_residual_deviance rmse mse mae rmsle 1 GBM_grid_0_AutoML_20190213_233340_model_7 0.3920007 0.6260996 0.3920007 0.4174691 0.05897232 2 DeepLearning_grid_0_AutoML_20190213_233340_model_3 0.4267542 0.6532643 0.4267542 0.4636161 0.06139260
Is there a way I can stop it from using a GBM, or save the second-place model, the DeepLearning one?
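Two things worth trying, sketched here with assumed object names; passing exclude_algos as a character vector matches the documented form, and the leaderboard lookup is just one way to grab the runner-up:

# Pass the excluded algorithms as a vector, as the docs show.
mA <- h2o.automl(x = 2:4, y = 5, training_frame = train,
                 max_runtime_secs = 120,
                 exclude_algos = c("GBM"))

# Alternatively, pull a non-GBM model out of the leaderboard afterwards.
lb_ids <- as.data.frame(mA@leaderboard$model_id)[, 1]
second_best <- h2o.getModel(lb_ids[2])              # here: the DeepLearning model
h2o.saveModel(second_best, path = "models")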
Local interpretation of H2O model using Lime
H2O models are now supported in LIME, which means the only remaining problem is reading the data into R memory to build the explainer. If that is not possible because the data is too big (which is probably why anyone would use H2O or Spark in the first place), I guess we just sample part of the training data and build the explainer on that.
However, when I build the model using H2o and try LIME, I get this error:
td = as.data.frame(na.omit(train[train$cost > 1000, ]))      # 1% of the train data
ts = as.data.frame(na.omit(test[test$cost > 1000000, ]))     # only 2 observations

# Building an explainer
explainer <- lime(x = td, model = h2o.xgb)
summary(explainer)

# Explaining selected samples
explanation_xgb <- explain(x = ts, explainer = explainer,
                           n_permutations = 5000, dist_fun = "gower",
                           kernel_width = .75, n_features = 10,
                           feature_select = "highest_weights")

Error in UseMethod("explain") :
  no applicable method for 'explain' applied to an object of class "data.frame"
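One hedged guess about that error: if another attached package also exports an explain() generic, the wrong generic may be dispatched when explain() is called unqualified, and lime's data.frame method is never found. Calling it with an explicit namespace rules that out; everything else here is copied from the snippet above:

# Same call as before, but dispatched explicitly through the lime namespace.
explanation_xgb <- lime::explain(x = ts, explainer = explainer,
                                 n_permutations = 5000, dist_fun = "gower",
                                 kernel_width = 0.75, n_features = 10,
                                 feature_select = "highest_weights")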
H2OConnectionError: Unexpected HTTP error: How to increase memory in H2O?
While loading my dataset using python code on the AWS server using Spyder, I get the following error:
File "<ipython-input-19-7b2e7b5812b3>", line 1, in <module> ffemq12 = load_h2odataframe_returns(femq12) #; ffemq12 = add_fold_column(ffemq12) File "D:\Ashwin\do\init_sm.py", line 106, in load_h2odataframe_returns fr=h2o.H2OFrame(python_obj=returns) File "C:\Program Files\Anaconda2\lib\site-packages\h2o\frame.py", line 106, in __init__ column_names, column_types, na_strings, skipped_columns) File "C:\Program Files\Anaconda2\lib\site-packages\h2o\frame.py", line 147, in _upload_python_object self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings, skipped_columns) File "C:\Program Files\Anaconda2\lib\site-packages\h2o\frame.py", line 321, in _upload_parse ret = h2o.api("POST /3/PostFile", filename=path) File "C:\Program Files\Anaconda2\lib\site-packages\h2o\h2o.py", line 104, in api return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to) File "C:\Program Files\Anaconda2\lib\site-packages\h2o\backend\connection.py", line 415, in request raise H2OConnectionError("Unexpected HTTP error: %s" % e)
I am running this Python code in Spyder on the AWS server. The code works fine for up to half the dataset (1.5 GB of 3 GB) but throws an error if I increase the data size. I tried increasing the RAM from 61 GB to 122 GB, but it still gives me the same error.
Loading the data file
femq12 = pd.read_csv(r"H:\Ashwin\dta\datafile.csv")
ffemq12 = load_h2odataframe_returns(femq12)
h2o.init(nthreads = -1,max_mem_size="150G")
Connecting to H2O server at http://127.0.0.1:54321... successful.

H2O cluster uptime:         01 secs
H2O cluster timezone:       UTC
H2O data parsing timezone:  UTC
H2O cluster version:        18.104.22.168
H2O cluster version age:    18 days
H2O cluster total nodes:    1
H2O cluster free memory:    133.3 Gb
H2O cluster total cores:    16
H2O cluster allowed cores:  16
H2O cluster status:         accepting new members, healthy
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.15 final
I suspect it is a memory issue. But even after increasing RAM and max_mem_size, the dataset is not loading.
Any ideas to fix the error would be appreciated. Thank you.
How do Gradient Boosted Trees calculate errors in classification?
I understand how gradient boosting works for regression, where we build the next model on the residual error of the previous model: if we use, for example, linear regression, the residual error becomes the target of the next model, and we sum all the models at the end to get a strong learner.
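For reference, the regression case described above can be written compactly in standard notation, assuming squared-error loss (this notation is not from the original post):

F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \qquad
h_m = \arg\min_h \sum_i \bigl(r_i - h(x_i)\bigr)^2, \qquad
r_i = y_i - F_{m-1}(x_i)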
But how is this done with gradient boosted classification trees? Let's say we have a binary classification problem with outcome 0/1: what is the residual error for the next model to be trained on? And how is it calculated? It can't simply be y minus predicted y, as it is in the linear regression case.
I am really stuck on this one! The error of a single binary classification tree is the set of points it misclassifies, so is the target for the next model the misclassified points only?
Implausible variable importance for GBM survival: constant difference in importance
I have a question about a GBM survival analysis. I'm trying to quantify variable importances for my variables (n = 453) in a data set of 3,614 individuals. The resulting graph with variable importances looks suspiciously arranged. I have computed GBMs before but never seen this gradual pattern in importance: there are usually varying distances between the importance bars, but here there appears to be a constant difference in importance. My data frame is called df. I cannot upload sample data due to its sensitivity; instead, my question concerns the plausibility of obtaining these variable importances.
from sksurv.ensemble import GradientBoostingSurvivalAnalysis
from sklearn import metrics, model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

predictors = [x for x in df.columns if x not in ['death', 'surv_death']]
target = ['death', 'surv_death']

df_X = df[predictors]
df_y = df[target]
X = df_X.values
arr_y = df_y.values

# Structured array expected by scikit-survival: event indicator and survival time.
n = X.shape[0]
y = np.zeros((n,), dtype=[('death', 'bool'), ('surv_death', 'f8')])
y['death'] = arr_y[:, 0].flatten()
y['surv_death'] = arr_y[:, 1].flatten()

gbm0 = GradientBoostingSurvivalAnalysis(
    criterion='friedman_mse', dropout_rate=0.0, learning_rate=0.01,
    loss='coxph', max_depth=100, max_features=None, max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=10, min_samples_split=20,
    min_weight_fraction_leaf=0.0, n_estimators=1000,
    random_state=10, subsample=1.0, verbose=0)
gbm0.fit(X, y)

feature_importance = gbm0.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
preds = np.array(predictors)[sorted_idx]
pos = np.arange(sorted_idx.shape[0]) + .5

plt.figure(figsize=(10, 100))
plt.subplot(1, 1, 1)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, preds)
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.savefig("df.png")
plt.show()
Implement null distribution for gbm interaction strength
I am trying to determine which interactions in a gbm model are significant using the method described in [Friedman and Popescu 2008]. My gbm is a classification model with 9 different classes. I'm struggling with how to translate Section 8.3 into code to run in R.
I think the overall process is (a rough R sketch follows this list):

1. Train a version of the model with max.depth = 1
2. Simulate response data from this model
3. Train a new model on this data with max.depth the same as the real model
4. Get the interaction strength for this model
5. Repeat steps 1-4 to create a null distribution of interaction strengths
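Here is a rough R sketch of steps 1-4 for a single replicate, using the gbm package and a binary outcome for simplicity; the 9-class case and the exact equation 48/49 weighting are exactly what is still unclear, so treat this purely as scaffolding (dat, y, x1, x2 are made-up names):

library(gbm)

one_null_replicate <- function(dat, real_depth = 4, n_trees = 500) {
  # Step 1: additive reference model (no interactions allowed).
  fit_add <- gbm(y ~ ., data = dat, distribution = "bernoulli",
                 n.trees = n_trees, interaction.depth = 1)

  # Step 2: simulate an artificial response from the additive model's predictions.
  p_add <- predict(fit_add, dat, n.trees = n_trees, type = "response")
  dat_null <- dat
  dat_null$y <- rbinom(nrow(dat), size = 1, prob = p_add)

  # Step 3: refit with the real interaction depth on the simulated response.
  fit_null <- gbm(y ~ ., data = dat_null, distribution = "bernoulli",
                  n.trees = n_trees, interaction.depth = real_depth)

  # Step 4: Friedman's H statistic for a pair of predictors on the null fit.
  interact.gbm(fit_null, dat_null, i.var = c("x1", "x2"), n.trees = n_trees)
}

# Step 5: repeat to build the null distribution for that pair.
null_H <- replicate(100, one_null_replicate(dat))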
The part that I am finding most confusing is implementing equations 48 and 49. (You will have to look at the linked article since I can't reproduce them here)
This is what I think I understand but please correct me if I'm wrong:
y_i is a new vector of the response that we will use to train a new model which will provide the null distribution of interaction statistics.
F_A(x_i) is the prediction from a version of the gbm model trained with max.depth = 1
b_i is a probability between 0 and 1 based on the prediction from the additive model F_A(x_i)
- What is subscript i? Is it the number of iterations in the bootstrap?
- How is each artificial data set different from the others?
- Are we subbing the Pr(b_i = 1) into equation 48?
- How can this be done with multinomial classification?
- How would one implement this in R? Preferably using the gbm package.
Any ideas or references are welcome!