Scikit-learn Test Train Split

June 28, 2023

The train_test_split function in scikit-learn splits a dataset into two subsets: a training set, used to fit a machine learning model, and a test set, used to evaluate it. Evaluating on data the model never saw during training shows how well the model generalizes to new data and makes overfitting to the training data visible.

The train_test_split function takes a few parameters, including:

X: The features of the dataset to be split.
y: The target variable, i.e. the labels of the dataset.
test_size: The proportion of the dataset to be used as the test set (a float between 0 and 1), or the absolute number of test samples (an integer). If neither test_size nor train_size is given, it defaults to 0.25.
random_state: A seed for the random number generator that shuffles the data before splitting it.

X and y are passed positionally (the function accepts any number of arrays of the same length); test_size and random_state must be passed as keyword arguments.
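As a small illustration of the test_size parameter, the sketch below splits the same toy arrays twice, once with a fraction and once with an absolute count. The toy data and the variable names frac_sizes and int_sizes are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 20 samples with 2 features each, one label per sample
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# test_size as a fraction: 25% of 20 samples -> 5 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
frac_sizes = (len(X_train), len(X_test))
print(frac_sizes)  # (15, 5)

# test_size as an integer: exactly 4 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=4, random_state=0
)
int_sizes = (len(X_train), len(X_test))
print(int_sizes)  # (16, 4)
```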

The train_test_split function returns four objects, in this order:

X_train: The training set features.
X_test: The test set features.
y_train: The training set labels.
y_test: The test set labels.
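One way to convince yourself that the features and labels stay paired after the split is a toy dataset in which each label can be recomputed from its feature row. This is only a sketch; the data and the names aligned_train and aligned_test are invented for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): row i holds the feature value i, its label is 10 * i
X = np.arange(10).reshape(10, 1)
y = np.arange(10) * 10

# Unpack in the order the function returns: both X parts first, then both y parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# If rows and labels were shuffled together, every label is still 10 * its feature
aligned_train = bool(np.array_equal(X_train[:, 0] * 10, y_train))
aligned_test = bool(np.array_equal(X_test[:, 0] * 10, y_test))
print(aligned_train, aligned_test)  # True True
```

Unpacking the result in a different order would not raise an error, which is why the order is worth double-checking.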

The train_test_split function will randomly split the dataset into two subsets, with the test_size proportion of the data going into the test set and the remaining data going into the training set. The random number generator seed can be used to ensure that the same split is produced each time the function is called.
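The effect of the seed can be checked directly. The sketch below, using made-up toy data, calls the function twice with the same random_state and compares the results element by element.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 25 samples with 2 features each
X = np.arange(50).reshape(25, 2)
y = np.arange(25)

# Two independent calls with the same seed
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise between the two calls
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```

Without random_state, each call draws a fresh shuffle, so the rows that land in the test set will generally differ from run to run.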

Example 1: the following code splits the iris dataset into a training set and a test set, with 25% of the data going into the test set:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the iris dataset (assumes iris.csv has a header row, 150 data rows,
# four feature columns, and the label in the last column)
iris = pd.read_csv("iris.csv")

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(iris.values[:, :-1], iris.values[:, -1], test_size=0.25)

# Print the shapes of the training and test sets
print(X_train.shape)
print(X_test.shape)

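For the standard 150-row iris dataset, this code will print the following output:

(112, 4)
(38, 4)

The output shows that the training set has 112 rows and 4 columns, and the test set has 38 rows and 4 columns. 25% of 150 is 37.5, and scikit-learn rounds the number of test samples up to 38.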
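If no iris.csv file is at hand, a variant of the same split can be sketched with the copy of the iris dataset that ships with scikit-learn. Using load_iris here is a substitution made for this illustration, not part of the example above; with 150 rows and a 25% test split, scikit-learn rounds the test count up, giving 112 training rows and 38 test rows.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the bundled iris data as (features, labels): 150 rows, 4 feature columns
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows as the test set; the seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape)  # (112, 4)
print(X_test.shape)   # (38, 4)
```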

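Example 2: the following code splits a dataset into a training set and a test set, with 75% of the data in the training set and 25% of the data in the test set:

import numpy as np
from sklearn.model_selection import train_test_split

# 100 random integers between 0 and 99
data = np.random.randint(0, 100, size=100)

# The same array is passed as both X and y, purely to demonstrate the split
X_train, X_test, y_train, y_test = train_test_split(data, data, test_size=0.25)

The X_train and y_train variables contain the training data (75 values), and the X_test and y_test variables contain the test data (25 values). Because the same array is used for both X and y here, X_train and y_train hold identical values, as do X_test and y_test.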

Example 3: splitting separate feature and label arrays with a fixed random_state:

import numpy as np
from sklearn.model_selection import train_test_split

# Create a dataset
X = np.random.randint(0, 10, (100, 2))
y = np.random.randint(0, 2, 100)

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Print the sizes of the training set and the test set
print(X_train.shape)
print(X_test.shape)

This code will create a dataset of 100 data points with 2 features each. The labels for the data points will be 0 or 1. The train_test_split function will split the dataset into a training set of 75 data points and a test set of 25 data points, so the code prints (75, 2) and (25, 2).
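To connect the split to the evaluation step it exists for, here is a minimal sketch that trains a model on the training set and scores it on both sets. The choice of DecisionTreeClassifier and the rule used to generate the labels are assumptions made for this illustration. Unlike the random labels in the example above, the labels here depend on the features, so there is something for the model to learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only): the label is 1 when the two features sum to more than 9
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 9).astype(int)

# Hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training set only
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Accuracy on data the model has seen versus data it has not
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

A large gap between the two numbers is the usual sign of overfitting; the test accuracy is the one that estimates performance on new data.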

In short, train_test_split gives you a held-out test set in a single call. Fix random_state whenever the split needs to be reproducible.
