Max Kuhn Gives Talk at NYC Open Data Meetup
Posted by manager
Updated: Feb 19, 2015
Predictive Analytics
NYC Open Data Meetup was pleased to host Max Kuhn on February 17. He gave a wonderful talk about predictive analytics, focusing on several key points: prediction vs. inferential statistics, the process of model building, the choice of methodology, and next steps in data science. He also spoke about the ethics that can come into play when building predictive models.
Predictive Modeling
"Predictive modeling is the process of creating a model whose primary goal is to achieve high levels of accuracy." The objective here is to make the best possible prediction for an individual data instance. While this might seem like an obvious and trivial definition, Max Kuhn was making a point about the difference between predictive analytics and traditional inferential statistics: they measure different things. He quoted Friedman (2001), who made this observation in the context of boosted trees and maximum likelihood estimation (MLE):
"...degrading the likelihood by overfitting actually improves misclassification error rates. Although perhaps counterintuitive, this is not a contradiction; likelihood and error rate measure different aspects of fit quality."
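To make the distinction in the quote concrete, here is a minimal sketch in Python. The labels and predicted probabilities are invented toy numbers, not data from Friedman's paper or from the talk, and the model names `cautious` and `confident` are hypothetical. The sketch only shows that a model can have a worse likelihood (higher log loss) and still make fewer classification errors.

```python
import math

def error_rate(labels, probs, threshold=0.5):
    """Fraction of cases where the thresholded probability disagrees with the label."""
    wrong = sum((p >= threshold) != bool(y) for y, p in zip(labels, probs))
    return wrong / len(labels)

def log_loss(labels, probs):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(labels, probs))
    return -total / len(labels)

# Toy data (assumption): six cases, three positive and three negative.
labels = [1, 1, 1, 0, 0, 0]

# A hedging model: moderate probabilities, two cases on the wrong side of 0.5.
cautious = [0.6, 0.6, 0.4, 0.4, 0.4, 0.6]

# An overconfident model: extreme probabilities, one case badly wrong.
confident = [0.99, 0.99, 0.99, 0.01, 0.01, 0.999]

for name, probs in [("cautious", cautious), ("confident", confident)]:
    print(name, "error rate:", round(error_rate(labels, probs), 3),
          "log loss:", round(log_loss(labels, probs), 3))
```

In this toy case the overconfident model misclassifies fewer cases, but its single very confident mistake is penalized heavily by the likelihood, so the two measures rank the models in opposite order.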
Inferential Statistics
Traditional inferential statistics focus on the appropriateness of models as related to things like distributional assumptions, parsimony, and degrees of freedom. This is not the case in predictive modeling. In fact, in some cases the entire concept of degrees of freedom may be meaningless, for example when you have more predictors than observations. So one of the key takeaways from Max Kuhn's talk was that it is okay to throw the inferential book away if it makes your predictions better.
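As a small illustration of why degrees of freedom break down when predictors outnumber observations, the sketch below uses an invented data set with two observations and three predictors, and a linear model without an intercept. All values and names are assumptions chosen for illustration. Two different coefficient vectors reproduce the training targets exactly, and the usual residual degrees of freedom (observations minus coefficients) come out negative.

```python
def predict(X, beta):
    """Linear prediction without an intercept: one dot product per row."""
    return [sum(x_j * b_j for x_j, b_j in zip(row, beta)) for row in X]

# Toy data (assumption): 2 observations, 3 predictors.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]

# Two different coefficient vectors, both chosen by hand.
beta_a = [1.0, 2.0, 0.0]
beta_b = [0.0, 1.0, 1.0]

print(predict(X, beta_a))  # matches y exactly
print(predict(X, beta_b))  # also matches y exactly

# Residual degrees of freedom for a no-intercept linear model.
n_obs, n_pred = len(X), len(X[0])
residual_df = n_obs - n_pred
print("residual degrees of freedom:", residual_df)
```

Because the training data cannot distinguish between the two fits, inferential quantities built on residual degrees of freedom are not defined here, while predictive accuracy on held-out data can still be measured.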
Issues in Predictive Modeling
The issues in predictive modeling are not so much related to confidence intervals or probabilities. The issues to focus on are overfitting vs. underfitting and bias vs. variance. The choice of model may be constrained by the nature of the data, for example when there is a lot of multicollinearity or a lot of unlabeled data. Cross-validated cost estimates contribute to model selection. At times, ease of use is also a consideration if the model is going to be deployed.
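The role of cross-validation in model selection can be sketched as follows. This is a minimal pure-Python example, not something shown in the talk: the data set is synthetic (a linear trend plus an alternating offset), the model is a simple k-nearest-neighbors average, and the function names and candidate values of k are assumptions chosen for illustration.

```python
def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    nearest = order[:k]
    return sum(train_y[j] for j in nearest) / len(nearest)

def mse(train_x, train_y, test_x, test_y, k):
    """Mean squared error of k-NN predictions on the test points."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2
              for x, y in zip(test_x, test_y)]
    return sum(errors) / len(errors)

def cross_validated_mse(xs, ys, k, n_folds=5):
    """Average held-out MSE over interleaved folds (index modulo n_folds)."""
    fold_scores = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(xs)) if i % n_folds == fold]
        train_idx = [i for i in range(len(xs)) if i % n_folds != fold]
        fold_scores.append(mse([xs[i] for i in train_idx], [ys[i] for i in train_idx],
                               [xs[i] for i in test_idx], [ys[i] for i in test_idx], k))
    return sum(fold_scores) / n_folds

# Synthetic data (assumption): linear trend plus an alternating +/-0.5 offset as "noise".
xs = [float(i) for i in range(20)]
ys = [0.1 * i + (0.5 if i % 2 == 0 else -0.5) for i in range(20)]

candidates = [1, 4, 16]
train_error = {k: mse(xs, ys, xs, ys, k) for k in candidates}
cv_error = {k: cross_validated_mse(xs, ys, k) for k in candidates}
best_k = min(cv_error, key=cv_error.get)

for k in candidates:
    print(f"k={k:2d}  training MSE={train_error[k]:.3f}  cross-validated MSE={cv_error[k]:.3f}")
print("selected k:", best_k)
```

With k = 1 the model reproduces the training data perfectly, which is the signature of overfitting: its training error is zero while its cross-validated error is not. A very large k averages over nearly everything and underfits the trend. Selecting on the cross-validated estimate rather than the training error is what guards against both.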
Primarily, though, he suggested that when doing predictive modeling one should not be deterred by the complexity of the model itself. Complex or non-linear pre-processing can make predictions better; so too can ensembles of models. Models that are overparameterized (by traditional statistical criteria) but highly regularized and non-linear can often make excellent predictions.
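The point about ensembles can be illustrated with a toy calculation. The targets and the two sets of predictions below are invented numbers, and `model_a` and `model_b` are hypothetical models whose errors happen to point in partly opposite directions. Averaging their predictions is the simplest form of ensemble.

```python
def mean_squared_error(targets, preds):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)

# Toy data (assumption): four targets and two imperfect sets of predictions.
targets = [1.0, 2.0, 3.0, 4.0]
model_a = [1.5, 1.5, 3.5, 3.5]
model_b = [0.8, 2.6, 2.9, 4.4]

# Simple ensemble: average the two predictions for each case.
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]

mse_a = mean_squared_error(targets, model_a)
mse_b = mean_squared_error(targets, model_b)
mse_ensemble = mean_squared_error(targets, ensemble)

print("model A MSE: ", round(mse_a, 4))
print("model B MSE: ", round(mse_b, 4))
print("ensemble MSE:", round(mse_ensemble, 4))
```

Here the averaged prediction beats both individual models because their errors partly cancel. That is not guaranteed in general: an average is only certain to do no worse than the mean of the individual squared errors, and the gain depends on how different the models' mistakes are.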
Model Improvements
Many different models can work well to solve a problem. In fact, Max Kuhn believes that the currently available models are good enough for solving most problems. With a few exceptions, he believes model improvement will have more to do with feature selection than with a new algorithm. The exceptions, areas where he believes we could use better models, are: using unlabeled data, severe class imbalances, feature engineering, applicability domain techniques, and confidence assessments. In most cases, though, it will be feature selection that improves models.
Noisy Data
Max Kuhn told a story about analyzing molecules with an assay that offered a lot of predictors but also had a lot of imperfections, so the data was noisy. A purer assay could provide only a limited number of predictors, yet it was the latter that gave better results. So big data is not always better data. Except for anomaly detection, where one might need a lot of data to get a large enough sample of the target of interest, smaller data sets can do just as well.
Advocating for a Particular Model
Finally, one of the biggest takeaways of the evening was the importance of advocating for a model that is a superior predictor, even if it is difficult to interpret. A somewhat trivial example Max Kuhn used was a spam filter. Nobody will care that the model is difficult to interpret if it is accurate in filtering spam and keeping important email out of the junk folder. In his own career, Max Kuhn once had to defend a model that does diagnostic work when the FDA was asking for a simpler model. But his predictive model was excellent, and he did not want to compromise accuracy. When his daughter had an illness that required this very diagnostic tool, and the instrument used in his daughter's hospital was made by a different company, he wondered if the other group of data scientists had also stood their ground.
The work data scientists do is important, and in some cases can be a matter of life and death.