What's the best way to select variables in a random forest model?
I am training RF models in R. What is the best way of selecting variables for my models? The datasets are pretty big; each has around 120 variables in total. I know that there is a cross-validation approach to selecting variables for other classification algorithms such as KNN. Is there a similar approach for variable selection or parameter tuning when training RF models?
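One possible starting point, as a sketch rather than a definitive answer: the randomForest package ships rfcv(), which cross-validates the prediction error over nested subsets of the most important predictors. The data frame df and response column y below are placeholders for your own data.

    library(randomForest)

    set.seed(42)
    x <- df[, setdiff(names(df), "y")]   # the ~120 candidate predictors
    y <- df$y

    # 5-fold CV error as the predictor set is repeatedly halved by importance
    cv_res <- rfcv(x, y, cv.fold = 5, step = 0.5)
    with(cv_res, plot(n.var, error.cv, type = "b", log = "x",
                      xlab = "number of variables", ylab = "CV error"))

    # Importance from a single full fit, to decide which variables to keep
    fit <- randomForest(x, y, importance = TRUE)
    varImpPlot(fit)

The plot shows how few variables you can keep before the cross-validated error starts to rise, which is usually a more honest guide than importance scores alone.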
See also questions close to this topic
-
pivot_wider does not keep all the variables
I would like to keep the variable cat (category) in the output of my function. However, I am not able to keep it. The idea is to apply a function similar to m <- 1 - (1 - se * p2)^df$n based on the category. But in order to perform that step, I need to keep the variable category. Here's the code:
    #script3
    suppressPackageStartupMessages({
      library(mc2d)
      library(tidyverse)
    })

    sim_one <- function() {
      df <- data.frame(
        id    = 1:30,
        cat   = c(rep("a", 12), rep("b", 18)),
        month = c(1:6, 1, 6, 4, 1, 5, 2, 3, 2, 5, 4, 6, 3:6, 4:6, 1:5, 5),
        n     = rpois(30, 5)
      )
      nr <- nrow(df)
      df$n[df$n == 0] <- 3
      se <- rbeta(nr, 96, 6)
      epi.a <- rpert(nr, min = 1.5, mode = 2, max = 3)
      p <- 0.2
      p2 <- epi.a * p
      m <- 1 - (1 - se * p2)^df$n
      results <- data.frame(month = df$month, m, df$cat)
      results %>%
        arrange(month) %>%
        group_by(month) %>%
        mutate(n = row_number(), .groups = "drop") %>%
        pivot_wider(
          id_cols = n,
          names_from = month,
          names_glue = "m_{.name}",
          values_from = m
        )
    }

    set.seed(99)
    iters <- 1000
    sim_list <- replicate(iters, sim_one(), simplify = FALSE)
    sim_list[[1]]
    #> # A tibble: 7 x 7
    #>       n   m_1   m_2   m_3   m_4   m_5   m_6
    #>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1     1 0.970 0.623 0.905 0.998 0.929 0.980
    #> 2     2 0.912 0.892 0.736 0.830 0.890 0.862
    #> 3     3 0.795 0.932 0.553 0.958 0.931 0.798
    #> 4     4 0.950 0.892 0.732 0.649 0.777 0.743
    #> 5     5    NA    NA    NA 0.657 0.980 0.945
    #> 6     6    NA    NA    NA 0.976 0.836    NA
    #> 7     7    NA    NA    NA    NA 0.740    NA
Created on 2022-05-07 by the reprex package (v2.0.1)
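A possible fix, sketched and untested against the asker's intended output: give the category column an explicit name (data.frame() otherwise calls it df.cat) and carry it through id_cols so pivot_wider() keeps it:

    results <- data.frame(month = df$month, m, cat = df$cat)
    results %>%
      arrange(month) %>%
      group_by(month) %>%
      mutate(n = row_number()) %>%   # number rows within each month
      ungroup() %>%
      pivot_wider(
        id_cols = c(n, cat),         # keep cat as an identifying column
        names_from = month,
        names_glue = "m_{.name}",
        values_from = m
      )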
-
calculate weighted average over several columns with NA
I have a data frame like this one:
    ID  duration1  duration2  total_duration  quantity1  quantity2
     1          5          2               7          3          1
     2         NA          4               4          3          4
     3          5         NA               5          2         NA
I would like to do a weighted mean for each subject like this:
df$weighted_mean <- (df$duration1 * df$quantity1 + df$duration2 * df$quantity2) / df$total_duration
But because of the NAs, this command returns NA for those rows, and it is not very elegant.
The result would be this:
    ID  duration1  duration2  total_duration  quantity1  quantity2  weighted_mean
     1          5          2               7          3          1           2.43
     2         NA          4               4          3          4              4
     3          5         NA               5          2         NA              2
Thanks in advance for the help
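One possible approach, sketched with the column names from the example above: rowSums() with na.rm = TRUE drops the NA products from the sum, which reproduces the expected results (2.43, 4, 2).

    # duration/quantity pairs; NA terms are treated as zero in the sum
    dur <- df[, c("duration1", "duration2")]
    qty <- df[, c("quantity1", "quantity2")]
    df$weighted_mean <- rowSums(dur * qty, na.rm = TRUE) / df$total_duration

This also scales to more than two duration/quantity pairs by extending the two column lists.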
-
How to extract data from a netCDF file at a specific location using R? The code I've written is shown below and produces an error at the end
I need some help with extracting data from NetCDF files using R. I downloaded them from CORDEX (the Coordinated Regional Climate Downscaling Experiment). In total I have several files. The files have dimensions (longitude, latitude, time) and the variable maximum temperature (tasmax). At a specific location, I need to extract tasmax at different times. I wrote the code using R, but at the end of the code an error appeared: location subscript out of bounds.
    getwd()
    setwd("C:/Users/20120/climate change/rcp4.5/tasmax")
    dir()

    library(ncdf4)
    library(ncdf4.helpers)
    library(chron)

    ncin <- nc_open("tasmax_AFR-44_ICHEC-EC-EARTH_rcp45_r1i1p1_KNMI-RACMO22T_v1_mon_200601-201012.nc")
    lat  <- ncvar_get(ncin, "lat")
    lon  <- ncvar_get(ncin, "lon")
    tori <- ncvar_get(ncin, "time")

    title       <- ncatt_get(ncin, 0, "title")
    institution <- ncatt_get(ncin, 0, "institution")
    datasource  <- ncatt_get(ncin, 0, "source")
    references  <- ncatt_get(ncin, 0, "references")
    history     <- ncatt_get(ncin, 0, "history")
    Conventions <- ncatt_get(ncin, 0, "Conventions")

    tunits <- ncatt_get(ncin, "time", "units")  # moved before its first use
    tustr  <- strsplit(tunits$value, " ")       # split the units string on spaces
    ncin$dim$time$units
    ncin$dim$time$calendar

    tas_time <- nc.get.time.series(ncin, v = "tasmax", time.dim.name = "time")
    tas_time[c(1:3, length(tas_time) - 2:0)]

    tmp.array <- ncvar_get(ncin, "tasmax")
    dunits    <- ncatt_get(ncin, "tasmax", "units")
    tmp.array <- tmp.array - 273.15             # Kelvin to Celsius
    nc_close(ncin)

    which.min(abs(lat - 28.9))
    which.min(abs(lon - 30.2))
    tmp.slice <- tmp.array[126, 32981, ]
    tmp.slice
Error in tmp.array[126, 32981, ] : subscript out of bounds
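A hedged explanation and sketch of a likely fix: CORDEX regional output is usually on a rotated grid, so lat and lon are 2-D arrays. which.min() then returns a single linear index into the whole matrix (hence a value like 32981), which is not valid as an index for one dimension of tmp.array. Assuming lat and lon have the same shape as the first two dimensions of tmp.array, one way around this is:

    # squared distance in degrees; adequate for finding the nearest grid cell
    dist2 <- (lat - 28.9)^2 + (lon - 30.2)^2

    # convert the linear index of the minimum into a (row, column) pair
    idx <- arrayInd(which.min(dist2), dim(lat))

    tmp.slice <- tmp.array[idx[1], idx[2], ]   # tasmax time series at that cell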
-
Training an ML model on two different datasets before using test data?
I have the task of using a CNN for facial recognition: classifying faces into different classes of people, each individual person being its own separate class. The training data I am given is very limited; I only have one image per class, and there are 100 classes (so 100 images in total, one image of each person).

The approach I am using is transfer learning with the GoogLeNet architecture. However, instead of training GoogLeNet only on the images of the people I have been given, I want to first train it on a separate, larger set of different face images, so that by the time I train it on my own data, the model has already learned the features it needs to classify faces in general. Does this make sense, and will it work?

Using Matlab, I have so far replaced the fully connected layer and the classification layer and trained the network on the Yale Face Database, which consists of 15 classes. I achieved 91% validation accuracy with this database. Now I want to retrain this saved model on my provided data (100 classes with one image each). What would I have to do to the saved model to train it on the new dataset without losing the features it has learned from the Yale database? Do I just change the last fully connected and classification layers again and retrain? Will this be pointless, meaning I lose all of the progress from before, i.e. will it initialise new weights from scratch, or will it build on the previously learned weights and adapt even better to my new dataset? Or should I train the model on my training data and the Yale database all at once?

I have a separate set of test data, for which I do not have the labels; this is what the final model is tested on to give me my score/grade. Please help me understand whether what I am proposing is viable or nonsense; I am confused and would appreciate being pointed in the right direction.
-
How would I put my own dataset into this code?
I have been looking at a TensorFlow tutorial for unsupervised learning, and I'd like to put in my own dataset; the code currently uses the MNIST dataset. I know how to create my own datasets in TensorFlow, but I have trouble adapting the code used here to my own data. I am pretty new to TensorFlow, and the file paths to my dataset in my project are \data\training and \data\test-val\.
    # Python ≥3.5 is required
    import sys
    assert sys.version_info >= (3, 5)

    # Scikit-Learn ≥0.20 is required
    import sklearn
    assert sklearn.__version__ >= "0.20"

    # TensorFlow ≥2.0-preview is required
    import tensorflow as tf
    from tensorflow import keras
    assert tf.__version__ >= "2.0"

    # Common imports
    import numpy as np
    import os

    (X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
    X_train_full = X_train_full.astype(np.float32) / 255
    X_test = X_test.astype(np.float32) / 255
    X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
    y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

    def rounded_accuracy(y_true, y_pred):
        return keras.metrics.binary_accuracy(tf.round(y_true), tf.round(y_pred))

    tf.random.set_seed(42)
    np.random.seed(42)

    conv_encoder = keras.models.Sequential([
        keras.layers.Reshape([28, 28, 1], input_shape=[28, 28]),
        keras.layers.Conv2D(16, kernel_size=3, padding="SAME", activation="selu"),
        keras.layers.MaxPool2D(pool_size=2),
        keras.layers.Conv2D(32, kernel_size=3, padding="SAME", activation="selu"),
        keras.layers.MaxPool2D(pool_size=2),
        keras.layers.Conv2D(64, kernel_size=3, padding="SAME", activation="selu"),
        keras.layers.MaxPool2D(pool_size=2)
    ])
    conv_decoder = keras.models.Sequential([
        keras.layers.Conv2DTranspose(32, kernel_size=3, strides=2, padding="VALID",
                                     activation="selu", input_shape=[3, 3, 64]),
        keras.layers.Conv2DTranspose(16, kernel_size=3, strides=2, padding="SAME",
                                     activation="selu"),
        keras.layers.Conv2DTranspose(1, kernel_size=3, strides=2, padding="SAME",
                                     activation="sigmoid"),
        keras.layers.Reshape([28, 28])
    ])
    conv_ae = keras.models.Sequential([conv_encoder, conv_decoder])

    conv_ae.compile(loss="binary_crossentropy",
                    optimizer=keras.optimizers.SGD(lr=1.0),
                    metrics=[rounded_accuracy])
    history = conv_ae.fit(X_train, X_train, epochs=5,
                          validation_data=[X_valid, X_valid])
    conv_encoder.summary()
    conv_decoder.summary()
    conv_ae.save("\models")
Do note that I got this code from another StackOverflow answer.
-
Log-Likelihood for Random Forest models
I'm trying to compare multiple species distribution modeling approaches via k-fold cross-validation. Currently I'm calculating the RMSE and AUC to compare model performance. A friend suggested additionally using the sum of log-likelihoods as a metric to compare models. However, one of the models is a random forest fitted with the ranger package. If it is possible at all, how would I calculate the log-likelihood for a random forest model, and would it actually be a comparable metric to use against the other models (GAM, GLM)?
Thanks for your help.
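One possibility, sketched under explicit assumptions: if the forest is grown as a probability forest (ranger(..., probability = TRUE)) and the response is binary coded 0/1, a Bernoulli log-likelihood can be computed from the predicted class probabilities on the held-out fold, analogous to what logLik() returns for a GLM. train, test, and the response y below are placeholders.

    library(ranger)

    fit <- ranger(y ~ ., data = train, probability = TRUE)
    p   <- predict(fit, data = test)$predictions[, "1"]  # P(y = 1) per row

    eps <- 1e-15                        # clip to avoid log(0)
    p   <- pmin(pmax(p, eps), 1 - eps)

    # sum of per-observation Bernoulli log-likelihoods on the held-out fold
    ll <- sum(ifelse(test$y == 1, log(p), log(1 - p)))

Whether this is strictly comparable to a GAM or GLM likelihood is debatable, since the forest has no explicit likelihood model; it is safer to treat it as a scoring rule (the negative log-loss) applied identically to all models.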
-
Using the random forest method for classification and tuning the model on a validation set, not cross-validation
I separate my dataset into three sets: a training set, a validation set, and a test set. I want to use the random forest method to train the data. But to find the best ntree, mtry, and nnodes I want to use the validation set and see which parameters are best, and then use those parameters for my training set. I do not want to use the caret package, since it uses cross-validation. I am dealing with a classification problem.
    a <- as.numeric(2:15)
    for (i in 2:15) {
      model2 <- randomForest(as.factor(V2) ~ ., data = vset, ntree = 500,
                             mtry = i, importance = TRUE)
      predValid2 <- predict(model2, newdata = test, type = "class")
      a[i - 1] <- mean(predValid2 == test$V2)
    }

    n.tree <- seq(from = 100, to = 5000, by = 100)
    n.mtry <- seq(from = 1, to = 15, by = 1)
    model3 <- randomForest(as.factor(V2) ~ ., data = vset, ntree = n.tree,
                           mtry = n.mtry, importance = TRUE)
I used the code above to write a loop, but I believe it is not correct. I'd appreciate it if you could help me find the best parameters based on the validation set, not cross-validation.
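A possible shape for this, as a sketch assuming data frames train, valid, and test with a response column V2 (your own object names will differ). Two things it corrects: randomForest() accepts a single ntree and mtry per call, so the grid must be looped over explicitly, and the model should be fitted on the training set and scored on the validation set, not fitted on the validation set.

    library(randomForest)

    grid <- expand.grid(ntree = seq(100, 1000, by = 100), mtry = 1:15)
    grid$accuracy <- NA_real_

    for (i in seq_len(nrow(grid))) {
      fit  <- randomForest(as.factor(V2) ~ ., data = train,
                           ntree = grid$ntree[i], mtry = grid$mtry[i])
      pred <- predict(fit, newdata = valid, type = "class")
      grid$accuracy[i] <- mean(pred == valid$V2)       # validation accuracy
    }

    best  <- grid[which.max(grid$accuracy), ]          # tuned on validation only
    final <- randomForest(as.factor(V2) ~ ., data = train,
                          ntree = best$ntree, mtry = best$mtry)
    mean(predict(final, newdata = test, type = "class") == test$V2)

The test set is touched exactly once, at the end, so its accuracy remains an unbiased estimate of the tuned model's performance.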