Results of ridge regression
I've run into a problem with ridge regression.
As is well known, ridge regression is used when the features are strongly ill-conditioned. That is exactly my case: the determinant of my inter-factor correlation matrix is of the order of 10^(-18), so there is severe multicollinearity. The data sample consists of 8 quantitative features.
Ridge regression gives results that are worse than (or at best the same as) standard linear regression.
What leads to this result? How can the results be improved?
1 answer

Ridge regression has one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which generally select models involving just a subset of the variables, ridge regression includes all predictors in the final model. The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. However, have you already considered selecting the tuning parameter using cross-validation?
Reference: Chapter 6, "Linear Model Selection and Regularization", ISLR.
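For concreteness, here is a minimal scikit-learn sketch of the answer's suggestion: pick the ridge penalty by cross-validation, and compare against the lasso, which can drop weak predictors. The synthetic low-rank data and the alpha grid are illustrative stand-ins for the asker's 8-feature sample, not their actual data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8-feature, strongly collinear dataset:
# effective_rank=3 makes the design matrix nearly rank-deficient.
X, y = make_regression(n_samples=200, n_features=8, effective_rank=3,
                       noise=5.0, random_state=0)

# Standardize first: the ridge/lasso penalties are scale-sensitive.
alphas = np.logspace(-3, 3, 50)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
ridge.fit(X, y)
print("chosen ridge alpha:", ridge.named_steps["ridgecv"].alpha_)

# The lasso additionally zeroes out coefficients of weak predictors.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X, y)
print("nonzero lasso coefficients:",
      int(np.sum(lasso.named_steps["lassocv"].coef_ != 0)))
```

If the cross-validated alpha comes out near zero, the penalty is not helping, which matches the "worse or the same as OLS" symptom in the question.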
See also questions close to this topic

I am looking for an ML framework that has a standardized way of integrating with a Hadoop database.

Finding Weights - Machine Learning
You are given a dataset for which you do not have access to the target function f that maps X to Y; you must learn it from the data. In this problem, that amounts to learning the parameters of the line that separates the two classes. The line can be represented as ∑ W(i)X(i) = 0, where i runs from 0 to D.
The goal is to correctly find W. The algorithm to find it is a simple iterative process:
Randomly choose a W to begin with. Keep adjusting the value of W as follows until all data samples are correctly classified:
Randomly choose a sample from the dataset without replacement and check whether it is correctly classified. If yes, move on to another sample. If not, update the weights as W(t+1) = W(t) + y⋅x and go back to the previous step (of randomly choosing a sample).
Here W(t+1) is the value of W at iteration t+1, W(t) is the value of W at iteration t, y is the class label of the sample under consideration, and x is the data point under consideration.
Write a function that implements this learning algorithm. The input to the function is a dataset represented by the input variable X and the target variable y. The output of the function should be the chosen W.
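The algorithm above is the classic perceptron learning rule. A minimal NumPy sketch follows; the dataset is a made-up separable example, and for simplicity a misclassified sample is drawn at random each step rather than cycling strictly without replacement.

```python
import numpy as np

def perceptron(X, y, max_iter=1000, seed=0):
    """Perceptron learning: find W with sign(W @ x) matching y.

    X is (n, d); a constant 1 is prepended so W has d+1 entries
    (W(0) acts as the bias). y must be in {-1, +1} and the classes
    linearly separable for convergence to be guaranteed.
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])   # bias column X(0) = 1
    W = rng.normal(size=Xb.shape[1])            # random starting weights
    for _ in range(max_iter):
        mis = [i for i in range(len(y)) if np.sign(Xb[i] @ W) != y[i]]
        if not mis:                             # all samples classified
            break
        i = rng.choice(mis)                     # a misclassified sample
        W = W + y[i] * Xb[i]                    # update: W(t+1) = W(t) + y.x
    return W

# Tiny separable example: class +1 lies above the line x1 + x2 = 1.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 0.8]])
y = np.array([-1, -1, 1, 1])
W = perceptron(X, y)
Xb = np.hstack([np.ones((len(X), 1)), X])
print(all(np.sign(Xb @ W) == y))  # True
```

By the perceptron convergence theorem, the number of updates is bounded when the data are separable, so `max_iter` is just a safety net here.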

How to improve Kaggle's Titanic problem score
I've just submitted my code for Kaggle's Titanic problem. I got a lousy accuracy score of 0.69. What are some ways you would suggest to increase my accuracy score?
In my code I am only using a Decision Tree Classifier. I'm also writing 0s into NaN fields, so I assume these two are the reasons why I got such a low prediction score.
The following is my code:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

"""Assigning the train & test datasets' addresses to variables"""
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
test_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\test.csv"

"""Using pandas' read_csv() function to read the datasets and then assigning them to their own variables"""
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

"""Using pandas' factorize() function to represent genders (male/female) with binary values (0/1)"""
train_data['Sex'] = pd.factorize(train_data.Sex)[0]
test_data['Sex'] = pd.factorize(test_data.Sex)[0]

"""Replacing missing values in the training and test datasets with 0"""
train_data.fillna(0.0, inplace=True)
test_data.fillna(0.0, inplace=True)

"""Selecting features for training"""
columns_of_interest = ['Pclass', 'Sex', 'Age']

"""Dropping missing/NaN values from the training dataset"""
filtered_titanic_data = train_data.dropna(axis=0)

"""Using the predictor features in the data as the x axis"""
x = filtered_titanic_data[columns_of_interest]

"""The survival (what we're trying to predict) is the y axis"""
y = filtered_titanic_data.Survived

"""Splitting the training data into train and validation sets"""
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

"""Assigning the DecisionTreeClassifier model to a variable"""
titanic_model = DecisionTreeClassifier()

"""Fitting the x and y values with the model"""
titanic_model.fit(train_x, train_y)

"""Predicting on the validation set"""
val_predictions = titanic_model.predict(val_x)

"""Assigning the feature columns from the test set to a variable"""
test_x = test_data[columns_of_interest]

"""Predicting the test set by feeding its x axis into the model"""
test_predictions = titanic_model.predict(test_x)

submission = pd.DataFrame({
    'PassengerId': test_data.PassengerId.values,
    'Survived': test_predictions
})
submission.to_csv("my_submission.csv", index=False)
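On the NaN handling the asker suspects: filling a numeric column such as Age with 0 drags its distribution toward zero, whereas median imputation keeps the column's center intact. A toy sketch (the values are invented, not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic 'Age' column (real CSV not loaded here).
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 58.0]})

# Median imputation: missing ages become 35.0, the median of the observed
# values, instead of the implausible age 0.
df["Age"] = df["Age"].fillna(df["Age"].median())
print(df["Age"].tolist())  # [22.0, 35.0, 35.0, 35.0, 58.0]
```

Note also that in the posted code, `dropna` runs after `fillna(0.0)` and therefore drops nothing, so every imputed 0 reaches the model.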

How to get exact relative topic prevalence out of STM, over multiple time periods?
I am trying to use the stm package in R to calculate the relative prevalence or proportion of topics in a corpus at different periods. For example, at time period 1 the composition of topics among the documents of that period could be 80% topic A, 10% topic B, and 10% topic C; in the next time period, again containing a bunch of documents, the composition could be 70% A, 30% B, and 0% C. Structural topic models sounded well suited for this, as you can define covariates such as time. Except... I can't seem to figure out how to do that exactly. Here is a minimal example using some data from the quanteda package:

library(quanteda)
library(stm)
library(ggplot2)
library(dplyr)
library(tidyr)

# get data: US presidents' inaugural speeches
df = dfm(data_corpus_inaugural, tolower = T, stem = T,
         remove = stopwords(), remove_punct = T) %>%
  dfm_trim(min_termfreq = 10, max_docfreq = 0.75, docfreq_type = "prop") %>%
  convert("stm")

# group the speeches by decade, instead of year
df$meta$Decade = as.numeric(gsub("^(...).*", "\\1", df$meta$Year))

smod <- stm(df$documents, df$vocab, K = 5, verbose = FALSE,
            prevalence = ~Decade, data = df$meta)
summary(smod)

# attempt 1: use doc-topic proportions - but this is per document, not per decade...
labs = labelTopics(smod)
rownames(smod$theta) = 1:58
colnames(smod$theta) = 1:5
d = as.data.frame.table(smod$theta)
ggplot(d, aes(x = Var1, y = Freq, group = Var2, colour = Var2)) +
  geom_point() + geom_line()

# attempt 1.2: use doc-topic proportions, but take the mean per time period (decade);
# this sort of gives an idea how much on average each topic was present among the documents
d2 = cbind(smod$theta, df$meta$Decade); colnames(d2)[6] = "decade"
d2 %>% as.data.frame() %>%
  gather(topic, proportion, 1:5, factor_key = T) %>%
  group_by(decade, topic) %>%
  summarise(mean = mean(proportion))
# problem: won't sum to one; also not sure if this is the correct approach

# attempt 2: try using the prevalence estimation
est = estimateEffect(1:5 ~ s(Decade), smod, df$meta)  # based on stm help
plot(est, "Decade", model = smod, method = "continuous")
abline(h = 0)

The last bit produces an object that, when plotted, does look like prevalence over time. Some observations, though: it is smoothed (due to the s() in the formula), while I am looking for exact period-by-period topic proportions; also, some values go below 0 (not just the confidence intervals, some estimate curves as well). Putting just "Decade" in the regression formula gets me a linear regression, which is useless for this task. The est$parameters object contains an intercept and a value for each spline (times 25 simulations); saving the plot() call as an object with method="point" gives access to the calculated mean proportion values for each topic (but these are still from the smooth spline; increasing the df value for s() gives more values, though). Anyway, the smooth regression, via a plot command, seems like a roundabout way of getting the proportions.

Thus the question: how do I properly get period-by-period relative topic compositions (proportions) out of an stm model?
(Alternatively: if this is a completely wrong method for getting such topic proportions, what would be a better way?)

What is the error in the iterative implementation of gradient descent algorithm below?
I have attempted to implement the iterative version of the gradient descent algorithm, which, however, does not work correctly. The vectorized implementation of the same algorithm works fine.
Here is the iterative implementation:

function [theta] = gradientDescent_i(X, y, theta, alpha, iterations)
  % get the number of rows and columns
  nrows = size(X, 1);
  ncols = size(X, 2);

  % initialize the hypothesis vector
  h = zeros(nrows, 1);

  % initialize the temporary theta vector
  theta_temp = zeros(ncols, 1);

  % run gradient descent for the specified number of iterations
  count = 1;
  while count <= iterations

    % calculate the hypothesis values and fill into the vector
    for i = 1 : nrows
      for j = 1 : ncols
        term = theta(j) * X(i, j);
        h(i) = h(i) + term;
      end
    end

    % calculate the gradient
    for j = 1 : ncols
      for i = 1 : nrows
        term = (h(i) - y(i)) * X(i, j);
        theta_temp(j) = theta_temp(j) + term;
      end
    end

    % update the gradient with the factor
    fact = alpha / nrows;
    for i = 1 : ncols
      theta_temp(i) = fact * theta_temp(i);
    end

    % update the theta
    for i = 1 : ncols
      theta(i) = theta(i) - theta_temp(i);
    end

    % update the count
    count += 1;
  end
end
And below is the vectorized implementation of the same algorithm:

function [theta, theta_all, J_cost] = gradientDescent(X, y, theta, alpha)
  % set the learning rate
  learn_rate = alpha;

  % set the number of iterations
  n = 1500;

  % number of training examples
  m = length(y);

  % initialize the theta_new vector
  l = length(theta);
  theta_new = zeros(l, 1);

  % initialize the cost vector
  J_cost = zeros(n, 1);

  % initialize the vector to store all the calculated theta values
  theta_all = zeros(n, 2);

  % perform gradient descent for the specified number of iterations
  for i = 1 : n
    % calculate the hypothesis
    hypothesis = X * theta;

    % calculate the error
    err = hypothesis - y;

    % calculate the gradient
    grad = X' * err;

    % calculate the new theta
    theta_new = (learn_rate / m) .* grad;

    % update the old theta
    theta = theta - theta_new;

    % update the cost
    J_cost(i) = computeCost(X, y, theta);

    % store the calculated theta value
    if i < n
      index = i + 1;
      theta_all(index, :) = theta';
    end
  end
end
Link to the dataset can be found here
The filename is ex1data1.txt
ISSUES
For initial theta = [0, 0] (this is a vector!), a learning rate of 0.01, and 1500 iterations, I get the optimal theta as:
 theta0 = -3.6303
 theta1 = 1.1664
The above is the output for the vectorized implementation which I know I have implemented correctly (it passed all the test cases on Coursera).
However, when I implement the same algorithm using the iterative method (the first code above) with alpha = 0.01 and 1500 iterations, the theta values I get are:
 theta0 = 0.20720
 theta1 = 0.77392
This implementation fails the test cases, so I know it is incorrect.
I am, however, unable to see where I am going wrong: the iterative code does the same job and the same multiplications as the vectorized one, and when I traced the output of one iteration of both codes by hand (on pen and paper!), the values came out the same, yet they differ when I run them in Octave.
Any help would be greatly appreciated, especially if you can point out where I went wrong and what exactly caused the failure.
Points to consider
 The implementation of the hypothesis is correct: I tested it and both codes gave the same results, so no issues there.
 I printed the output of the gradient vector in both codes and realised that the error lies there, because the outputs were very different!
Additionally, here is the code for preprocessing the data:

function [X, y] = fileReader(filename)
  % load the dataset
  dataset = load(filename);

  % get the dimensions of the dataset
  nrows = size(dataset, 1);
  ncols = size(dataset, 2);

  % generate the X matrix from the dataset
  X = dataset(:, 1 : ncols - 1);

  % generate the y vector
  y = dataset(:, ncols);

  % append 1's to the X matrix
  X = [ones(nrows, 1), X];
end
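As a cross-check in Python (with synthetic data, not the Coursera set): the loop-based and vectorized formulations of this update produce identical results when the hypothesis and gradient accumulators are re-initialized on every iteration, which is the property worth verifying in the Octave code above.

```python
import numpy as np

def gd_vectorized(X, y, theta, alpha, iters):
    """Vectorized batch gradient descent for linear regression."""
    m = len(y)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)          # gradient of squared error
        theta = theta - (alpha / m) * grad
    return theta

def gd_iterative(X, y, theta, alpha, iters):
    """Same update written with explicit loops."""
    m, n = X.shape
    theta = theta.copy()
    for _ in range(iters):
        h = np.zeros(m)                       # reset every iteration
        for i in range(m):
            for j in range(n):
                h[i] += theta[j] * X[i, j]
        tmp = np.zeros(n)                     # reset every iteration too
        for j in range(n):
            for i in range(m):
                tmp[j] += (h[i] - y[i]) * X[i, j]
        theta = theta - (alpha / m) * tmp
    return theta

# Synthetic univariate data with an intercept column of ones.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 1))])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=20)
t0 = np.zeros(2)
tv = gd_vectorized(X, y, t0, 0.01, 500)
ti = gd_iterative(X, y, t0, 0.01, 500)
print(np.allclose(tv, ti))  # True
```

If an accumulator is instead initialized once, outside the iteration loop, its values carry over between iterations and the two implementations diverge even though each single iteration checks out on paper.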

Partitioned simple linear regression. Find the variance of estimators
After many tries, I still can't solve this question. Does anyone know how to solve it? Thank you!