Preparing Data to Measure True Machine Learning Model Performance

A weekly Python podcast hosted by Christopher Bailey with interviews, coding tips, and conversation with guests from the Python community. The show covers a wide range of topics including Python programming best practices, career tips, and related software development topics. Join us every Friday morning to hear what's new in the world of Python programming and become a more effective Pythonista.

December 02, 2022 | Duration: 0:57:45 | File size: 56.44 MB

How do you prepare a dataset for machine learning (ML)? How do you go beyond cleaning the data and move toward measuring how the model performs? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, returns to talk about strategies for better ML model performance.

Jodie starts by defining some terms for the conversation. We talk about targets, features, and supervised learning.

We discuss three common ways that data can alter model performance and which Python tools can help spot and avoid them. Jodie shares personal experiences of working through these pitfalls. We also share a healthy collection of resources to explore and learn more.

Course Spotlight: Combining Data in pandas With concat() and merge()

In this video course, you’ll learn two techniques for combining data in pandas: merge() and concat(). Combining Series and DataFrame objects in pandas is a powerful way to gain new insights into your data.
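As a rough illustration of the two techniques named above, here is a minimal sketch. The DataFrames and column names are made up for this example and do not come from the course itself.

```python
import pandas as pd

# Hypothetical example data; all names here are illustrative assumptions.
q1 = pd.DataFrame({"user_id": [1, 2], "purchases": [3, 5]})
q2 = pd.DataFrame({"user_id": [3, 4], "purchases": [2, 7]})
users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["AU", "DE", "US"]})

# concat() stacks objects along an axis; here it appends the rows of q2 to q1.
purchases = pd.concat([q1, q2], ignore_index=True)

# merge() joins on a key column, much like a database join.
# how="left" keeps every purchase row even when no user record matches.
combined = purchases.merge(users, on="user_id", how="left")

print(combined)
```

In this sketch, user 4 has no matching row in `users`, so the left join keeps that purchase and leaves its `country` value missing.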

Topics:

00:00:00 – Introduction
00:01:46 – Recent conference talks
00:03:24 – How to prepare your data for model performance
00:04:24 – Vocabulary: target, features, and supervised learning
00:06:28 – The curse of dimensionality
00:08:57 – Overfitting
00:11:08 – Underfitting
00:12:11 – Splitting the dataset
00:13:39 – K-fold cross validation
00:18:30 – Data leakage
00:21:36 – Checking for duplicates
00:26:23 – Applying transformations only after splitting data
00:31:16 – Imbalanced data
00:36:36 – Using ML to balance data
00:41:05 – Informing your model of the imbalance
00:42:56 – Video Course Spotlight
00:44:20 – Accuracy used as a measure
00:49:05 – Scikit-learn function classification_report
00:50:43 – JetBrains blog post and conference talk
00:52:18 – How can people follow your work online?
00:54:39 – Upcoming webinars
00:56:20 – Thanks and goodbye
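
Several of the topics above (checking for duplicates, splitting the dataset, k-fold cross validation, applying transformations only after splitting, and informing the model of an imbalance) can be sketched together with pandas and scikit-learn. This is only one possible arrangement under stated assumptions, not the approach taken in the episode: the data is synthetic, and the choice of logistic regression, standard scaling, five folds, and the F1 score are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (roughly 90% / 10%) standing in for a real dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
df["target"] = y

# Duplicates: drop repeated rows before splitting, so the same row
# cannot end up in both the training and the test set.
df = df.drop_duplicates()

# Splitting: hold out a test set; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"),
    df["target"],
    test_size=0.2,
    stratify=df["target"],
    random_state=0,
)

# Transformations after splitting: a pipeline fits the scaler on training
# data only, which avoids leaking test-set statistics into the model.
# class_weight="balanced" is one way to inform the model of the imbalance.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# K-fold cross validation on the training portion; the scaler is refit
# inside each fold, so the folds do not leak into one another either.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
print("F1 per fold:", scores.round(2))

# Final check on the held-out test set with per-class precision and recall,
# which are more informative than accuracy alone on imbalanced data.
model.fit(X_train, y_train)
report = classification_report(y_test, model.predict(X_test))
print(report)
```

Putting the scaler inside the pipeline is the design choice that matters most here: calling `fit_transform` on the full dataset before splitting would let information from the test rows influence the training features.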

Show Links:

Support the podcast & join our community of Pythonistas
