- Thread starter rogojel
- Tags bookclub statistical learning

It's a pretty good book, by the way. Just in the sweet spot between accessible and rigorous, for me at least.

For anyone reading this, the book is freely and legally available as pdf.

Hi,

I just pledged to work my way through this book (Introduction to Statistical Learning by James, Witten, Hastie, Tibshirani). Anyone willing to join? Keeping the discipline would be more fun if we worked together.

regards

Aha, so ISLR means "Introduction to Statistical Learning with Applications in R". I guess Hastie and Tibshirani made the right decision to skip 'estimation' and call it 'learning', and by that include the machine-learning people.

I have read portions, and may be interested, but I need goals and dialogue between us. In 20 days my schedule opens up and I am going to do all kinds of reading.

Interestingly enough, I bought the new Hastie and Efron book last night.

I am glancing at a book by McElreath, "Statistical Rethinking: A Bayesian Course with Examples in R and Stan", which starts at an elementary level but is interesting.

I am also looking into Schweder and Hjort's "Confidence, Likelihood, Probability". A preliminary version of that book can be found here.

It seems like that book covers some of the same areas as Efron and Hastie do.

http://stan.fit/2016/10/27/intro-to-bayes-webinar

I have also been eyeing the ESL book, but figured you were supposed to read their first book, ISLR, beforehand. I will let you look that one up GG.

Lastly, I keep wanting to read this book in its entirety; they are working on a follow-up book to it.

http://www.targetedlearningbook.com/

PS, I also need to read Gelman's multilevel book, and I need to finally read a longitudinal book. More for leisure, I am planning to read Black Swan at the end of the month.

My first musing/question about the book: in the chapter on classification, linear and quadratic discriminant analysis are discussed. It is also said that for more than two categories LDA is preferred over logistic regression. However, no significance calculation is given for LDA, and there is none in the R output either. Also, I do not recall anyone on this forum ever recommending LDA instead of a logistic regression, not to mention recommending QDA. Is this because we can have no p-values? How would one calculate the sample size?
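
For what it's worth, the reason there are no p-values is that LDA is not fit by estimating regression coefficients at all: you just plug in the class means, a pooled covariance matrix, and the class priors. A minimal numpy sketch of the two-class plug-in rule, on synthetic data rather than anything from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes sharing a covariance matrix -- the LDA model assumption.
n = 200
X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], n)
X1 = rng.multivariate_normal([2, 1], [[1, 0.3], [0.3, 1]], n)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

# Plug-in estimates: class means, pooled covariance, priors. Nothing here
# is a regression coefficient, so there is no Wald-type p-value to report,
# unlike the glm() output for logistic regression.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S = (np.cov(X0.T) * (n - 1) + np.cov(X1.T) * (n - 1)) / (2 * n - 2)
Sinv = np.linalg.inv(S)

def lda_score(x, mu, prior):
    # the linear discriminant function from the chapter
    return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(prior)

pred = (lda_score(X, mu1, 0.5) > lda_score(X, mu0, 0.5)).astype(int)
acc = (pred == y).mean()
print(f"training accuracy: {acc:.2f}")
```

The decision boundary is linear in x, but the "coefficients" are derived quantities of the means and covariance, which is why R's lda() prints group means instead of a coefficient table with significance stars.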

So, I did the logistic regression in three cases. With all the available data, I got Lag2 as the significant predictor. However, if I ran the logistic regression on the training data alone, then Lag1 was significant and Lag2 was not. I guess this means that probably neither of them is significant and all we see is some fluke in the data. I then decided to take a completely random selection of 800 points as the training data, and sure enough, there were no significant predictors there.

Now, apart from the true objective of the exercise, this raises interesting questions about our use of regression and model selection. I would have accepted either Lag1 or Lag2 as a legitimate predictor in any analysis, and I guess anyone else would have accepted them as well. Given the recent discussions on the value of the p-value as a tool, this is quite sobering. Maybe one could extend p-value testing to require that training and test samples be used as well? I am thinking of something like this: either build the model using a training set and validate it on a test set, or work backwards, finding the model using all the data but then requiring that there be some indication of the effect when we use a smaller random subset of the original data.
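
The Lag1/Lag2 flip-flopping is easy to reproduce in miniature. The sketch below (plain numpy, made-up data rather than the Weekly set) puts a deliberately weak slope near the significance threshold and refits on random subsets; the "significant or not" verdict then typically flips from subset to subset:

```python
import numpy as np

rng = np.random.default_rng(1)

# A predictor with a genuinely weak effect: its t-test hovers around the
# usual threshold, so different subsamples can flip the verdict.
n, beta_true = 1000, 0.07
x = rng.normal(size=n)
y = beta_true * x + rng.normal(size=n)

def t_stat(x, y):
    # t statistic of the slope in simple OLS, computed by hand
    xc, yc = x - x.mean(), y - y.mean()
    beta = (xc @ yc) / (xc @ xc)
    resid = yc - beta * xc
    s2 = (resid @ resid) / (len(x) - 2)
    return beta / np.sqrt(s2 / (xc @ xc))

flags = []
for _ in range(200):
    idx = rng.choice(n, size=800, replace=False)
    flags.append(abs(t_stat(x[idx], y[idx])) > 2)

print(f"full sample |t| = {abs(t_stat(x, y)):.2f}")
print(f"'significant' in {np.mean(flags):.0%} of random 800-point subsets")
```

When the full-sample t statistic sits near 2, subsets of 800 points land on either side of the threshold, which is exactly the behaviour described above; a predictor whose significance survives most random subsets is a rather more convincing one.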

BTW, of the possible choices for a classification algorithm, logistic regression with a threshold of 0.5 behaved very poorly, and LDA was only marginally better. Both algorithms essentially bet on an upward movement; the logistic regression only predicted 7 downward movements out of a total of 289. Because there were more upward movements than downward ones, this got them a true positive rate of around 51-52%. Surprisingly, QDA got a whopping 58.5%, with KNN at k=1 being as bad as logistic regression, but k=5 slightly better than logistic regression and QDA. I actually never had QDA on my radar; I guess this will change now.
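
A plausible reason QDA can win in cases like this is that the classes have different covariance structures, which is the one regime a linear boundary cannot capture. A small scikit-learn sketch on synthetic data (not the market data from the book) makes the effect stark: identical class means, different covariances, so LDA is near chance while QDA does well:

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(2)

# Same means, different covariances: exactly the regime where QDA's
# per-class covariance pays off and LDA's shared-covariance assumption fails.
n = 1000
Xtr = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(0, 2, (n, 2))])
ytr = np.repeat([0, 1], n)
Xte = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(0, 2, (n, 2))])
yte = np.repeat([0, 1], n)

lda_acc = LinearDiscriminantAnalysis().fit(Xtr, ytr).score(Xte, yte)
qda_acc = QuadraticDiscriminantAnalysis().fit(Xtr, ytr).score(Xte, yte)
print(f"LDA accuracy {lda_acc:.3f}, QDA accuracy {qda_acc:.3f}")
```

The price is that QDA estimates a full covariance matrix per class, so it needs more data and overfits more easily than LDA, which may be why it is recommended less often.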

Yes, these authors highly champion the use of cross-validation if the sample is moderately large, training/testing/validation sets for larger data, or leave-one-out for small samples. My problem is that I usually don't get large datasets.

Did you do any model scoring? That is something that I have not really done.

Not too familiar with lags. Were they calling the older data the lag? I always think of it as the run-in data, so say the prior 3 days or something like that in panel data. Am I right in thinking this?

@hlsmith: the data is a time series, and the lags are simply the data shifted back by one, two, ..., five periods.
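
In code the lags are nothing more than shifted copies of the series; a tiny plain-Python illustration (the numbers are made up):

```python
# Building lag columns from a univariate series, as in the Weekly data:
# Lag1 is the previous period's value, Lag2 the value two periods back, etc.
returns = [0.8, -1.2, 0.5, 2.1, -0.3, 1.4]

def lagged(series, k):
    # shift the series back by k periods; the first k entries are undefined
    return [None] * k + series[:-k]

lag1 = lagged(returns, 1)
lag2 = lagged(returns, 2)
for today, l1, l2 in zip(returns, lag1, lag2):
    print(today, l1, l2)
```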

I only rated the models based on the true positive rate, and sort of based on aesthetics as well, where I regard a confusion matrix that is well balanced as more satisfying than one that only gets one category right.

Vanilla attempt, on a training set of 80% randomly selected data: 87% TPR, but not nice at all; the model basically just decided to always pick TRUE. It did not get the FALSEs at all, but I had mostly TRUEs in the test dataset, so...

Trying the QDA method and voila: 93% TPR and a well balanced confusion matrix.

My problem with LDA is that it does not give a p-value or any clue as to which variables are important in the model and which aren't, so I just decided to prune the model based on the group means reported in the output, on the basis of "large difference stays, small difference goes".

Group means:
          zn     indus       chas       nox       rm
FALSE 22.882353  6.406639 0.05462185 0.4635324 6.421639
TRUE   1.831325 13.940723 0.13253012 0.6295120 6.206181
           age      dis       rad      tax  ptratio
FALSE 49.58824 5.270933  4.189076 296.1261 17.76891
TRUE  85.70301 2.601809 10.518072 434.3253 18.39518
         black     lstat     medv
FALSE 388.8326  9.114664 25.40756
TRUE  367.2677 14.566928 22.40964
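
One caveat about the "large difference stays" heuristic: the raw differences live on wildly different scales (tax differs by hundreds, nox by tenths), so comparing them directly is misleading. One way to make the pruning comparable across columns is to standardize the mean difference by a pooled standard deviation. A numpy sketch on made-up stand-ins for the Boston columns (this is my formalization of the heuristic, not something from the book):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for the Boston predictors: three columns, of which only the
# first two actually differ between the classes (the names are invented).
n = 200
X_false = np.column_stack(
    [rng.normal(5, 1, n), rng.normal(300, 50, n), rng.normal(0, 1, n)])
X_true = np.column_stack(
    [rng.normal(8, 1, n), rng.normal(430, 50, n), rng.normal(0, 1, n)])
names = ["nox_like", "tax_like", "noise"]

# Standardized difference of group means: |mean_T - mean_F| / pooled sd.
pooled_sd = np.sqrt(
    (X_false.var(axis=0, ddof=1) + X_true.var(axis=0, ddof=1)) / 2)
d = np.abs(X_true.mean(axis=0) - X_false.mean(axis=0)) / pooled_sd

for name, di in sorted(zip(names, d), key=lambda t: -t[1]):
    print(f"{name:10s} {di:.2f}")
```

On this scale a genuinely uninformative column lands near zero regardless of its units, which makes the "stays or goes" cut less arbitrary.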

So, trying the other standard methods: logistic regression performed about as well as LDA, but pruning based on p-values reduced model performance a lot.

KNN was almost as good as the QDA method (almost). Interestingly, increasing k from 1 to 5 did not improve the model at all. I really expected that it would, but apparently k=1 was already capturing all the structure in the data.
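
The k=1 vs k=5 result is worth poking at: on cleanly separated data k=1 can indeed already capture the structure, and averaging over more neighbours mostly helps when the classes overlap and the boundary is noisy. A scikit-learn sketch on synthetic overlapping classes (not the Boston data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(8)

# Noisy, overlapping classes: in this regime a larger k should help,
# because single nearest neighbours are unreliable near the boundary.
n = 500
Xtr = np.vstack([rng.normal(0, 1.5, (n, 2)), rng.normal(1, 1.5, (n, 2))])
ytr = np.repeat([0, 1], n)
Xte = np.vstack([rng.normal(0, 1.5, (n, 2)), rng.normal(1, 1.5, (n, 2))])
yte = np.repeat([0, 1], n)

accs = {k: KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr).score(Xte, yte)
        for k in (1, 5, 25)}
print(accs)
```

If increasing k does nothing, as in the exercise above, that is itself informative: it suggests the local class structure is already clean at the k=1 scale.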

A bit late, due to the year-end hassle, but still keeping at it: I am now working on chapter 6, regression and especially model selection, ridge and the lasso.

I just finished exercise 8, where I had to generate a random X and a Y that was a polynomial function of X with degree 3, plus noise of course, then generate the powers of X up to 10 and try to find a regression model correctly describing the X-Y relationship.

My first surprise was that the regsubsets function from the leaps package did a pretty good job identifying the model with 3 variables. I tried three selection criteria: Cp, adjusted R-squared, and BIC. If I went for the minimum, then only BIC picked the right model, but if I went for the "knee" in the graphical representation, then all three were obviously identifying the model with 3 parameters as the best one.
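
For readers without R at hand, the same exercise can be mimicked with an exhaustive subset search scored by BIC. A numpy sketch with made-up coefficients, capturing the idea behind regsubsets rather than reimplementing leaps:

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)

# y is a cubic in x; the candidate pool is x, x^2, ..., x^10.
n = 100
x = rng.normal(size=n)
y = 2 * x - 3 * x**2 + 0.5 * x**3 + rng.normal(0, 0.5, n)
powers = np.arange(1, 11)
Xfull = np.column_stack([x**p for p in powers])

def bic(cols):
    # OLS fit on the chosen columns plus an intercept, scored by BIC
    X = np.column_stack([np.ones(n), Xfull[:, cols]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + (len(cols) + 1) * np.log(n)

# Exhaustive search over all non-empty subsets of the ten powers.
best = min(
    (s for k in range(1, 11) for s in itertools.combinations(range(10), k)),
    key=lambda s: bic(list(s)),
)
print("selected powers:", [int(powers[i]) for i in best])
```

With ten candidates there are only 1023 subsets, so brute force is fine; leaps is clever mainly because it prunes this search for much larger candidate pools.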

Using the lasso, the "best" model found by cross-validation also identified 3 parameters, but only if I picked lambda.1se and not lambda.min, which was my intuitive choice anyway.
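
For anyone repeating this outside R: scikit-learn's LassoCV only exposes the MSE-minimizing alpha, but glmnet's lambda.1se rule is easy to recompute from the CV path. A sketch on synthetic data; the one-standard-error computation here is my own reimplementation, not a library feature:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)

# Same setup in spirit: cubic signal, ten power features, standardized
# so the lasso penalty treats the columns comparably.
n = 200
x = rng.normal(size=n)
X = np.column_stack([x**p for p in range(1, 11)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2 * x - 3 * x**2 + 0.5 * x**3 + rng.normal(0, 1, n)

cv = LassoCV(cv=10, random_state=0).fit(X, y)

# glmnet's lambda.1se rule by hand: the largest alpha whose mean CV error
# is within one standard error of the minimum.
mean_mse = cv.mse_path_.mean(axis=1)
se = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])
i_min = mean_mse.argmin()
alpha_1se = cv.alphas_[mean_mse <= mean_mse[i_min] + se[i_min]].max()
print(f"alpha.min = {cv.alpha_:.4f}, alpha.1se = {alpha_1se:.4f}")
```

Since alpha.1se is always at least as large as alpha.min, it shrinks harder and tends to select the sparser model, which matches the 3-parameter result above.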

As I knew the parameter values, I could also compare the lm model's estimates to those of the lasso, and interestingly the lm model was somewhat better. Comparing the MSE on a new set of similarly generated data, lm also performed better.

So, I repeated the exercise after adding a lot more noise. In this case the performance of the lasso, MSE-wise, was closer to that of the lm, but the simple lm model was still better.

regards

The task is to build all the models that were developed in the chapter. I generated a random sample of 100 datapoints for testing and left 406 in the training set.

The first thing I learned is that in the presence of some outliers, the test-set performance of the models can be hugely variable. For the exact same model, depending on the test set, I could get an MSE of 100 or of 10. The effect depended on whether some outliers got into the test set or not; of course, an outlier in the test set meant that it had no influence on the model but generated a large residual.
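
The size of this effect is easy to demonstrate: one fitted model, one test set, and a single injected outlier are enough to multiply the test MSE many times over. A numpy sketch on made-up data rather than Boston:

```python
import numpy as np

rng = np.random.default_rng(6)

# One model, two test sets that differ by a single outlier.
n = 400
x = rng.normal(size=n)
y = 3 * x + rng.normal(size=n)
beta = np.polyfit(x, y, 1)  # fit once, on the training data only

def mse(xt, yt):
    return np.mean((yt - np.polyval(beta, xt)) ** 2)

x_test = rng.normal(size=100)
y_test = 3 * x_test + rng.normal(size=100)
clean = mse(x_test, y_test)

# inject one large outlier into the test responses
y_out = y_test.copy()
y_out[0] += 60
with_outlier = mse(x_test, y_out)
print(f"clean MSE {clean:.1f}, with one outlier {with_outlier:.1f}")
```

A squared 60-unit residual contributes 3600/100 = 36 to the MSE by itself, which is why whether the outliers land in the training or the test split dominates the comparison.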

So, comparing the methods: again, the simple regression (with interactions) performed on average better than either the lasso or ridge regression. PCR was somewhere in between the regression and the lasso, while PLS got very close to the simple regression. Given how much more difficult it would be to explain a PLS model compared to the regression, the simple regression still seems to be the winner, but the number of variables was really not high enough to see the advantages of the more sophisticated methods.

Another point: it does make sense to include nonlinearities and interactions in the models. This would be easy with a simple regression; for all the others I just added product columns to the dataset (could try squares as well). The tendency did not change as far as model performance was concerned, but the MSEs went down for all the models.
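
For the methods without a formula interface, "adding interactions" really is just appending product columns to the design matrix; a minimal numpy sketch:

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)

# Three base predictors; append every pairwise product as an extra column,
# which is all the lasso/PCR/PLS inputs need to "see" the interactions.
X = rng.normal(size=(5, 3))

pairs = list(itertools.combinations(range(X.shape[1]), 2))
inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
X_aug = np.hstack([X, inter])
print(X.shape, "->", X_aug.shape)  # (5, 3) -> (5, 6)
```

Squares work the same way (append X[:, i] ** 2); the only cost is that the column count grows quadratically with the number of base predictors.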

Also, the outliers complicate the modelling a lot, so exploratory analysis would be a must for any modelling. This does not seem to be a great discovery, but one tends to forget it in the heat of a project.

So, on to chapter 7...