Data Pre-processing in Machine Learning
Introduction:
Data pre-processing is an essential step in machine learning projects. It involves transforming raw data into a format suitable for training machine learning models. Pre-processing improves data quality, handles missing values, scales features, and encodes categorical variables. In this tutorial, we will cover the most common techniques used in data pre-processing for machine learning.
1. Importing Libraries:
Before we start, let's import the necessary libraries commonly used in data pre-processing tasks:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
2. Loading the Dataset:
First, we need to load the dataset into our program. Typically, datasets are available in various formats like CSV, Excel, or SQL databases. For this tutorial, we will use the `pandas` library to load a CSV file:
dataset = pd.read_csv('dataset.csv')
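After loading, it is good practice to take a quick look at the data. For example, the following standard pandas calls preview the first rows and count missing values per column:
# Preview the first five rows of the dataset
print(dataset.head())
# Count missing values in each column
print(dataset.isnull().sum())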
3. Handling Missing Data:
Missing data is a common issue in datasets, and we need to handle missing values before training our models. There are several strategies, such as removing rows with missing values, imputing them with the mean or median, or using more advanced techniques. Here, we will use the `SimpleImputer` class from scikit-learn to replace missing values with the mean:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
dataset[['column_name']] = imputer.fit_transform(dataset[['column_name']])
Replace `'column_name'` with the name of the column containing missing values.
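Alternatively, if only a small fraction of rows contain missing values, simply dropping those rows can be a reasonable choice. A one-line sketch using pandas:
# Drop every row that contains at least one missing value
dataset = dataset.dropna()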
4. Encoding Categorical Variables:
Machine learning models often require numerical inputs, so we need to encode categorical variables into numerical representations. Two common techniques for encoding categorical variables are Label Encoding and One-Hot Encoding.
- Label Encoding:
Label Encoding replaces each category with a unique integer. It is suitable for ordinal variables, where the order matters; note that `LabelEncoder` assigns integers in alphabetical order, which may not match the intended ordering of the categories.
label_encoder = LabelEncoder()
dataset['column_name'] = label_encoder.fit_transform(dataset['column_name'])
Replace `'column_name'` with the name of the categorical column to be label encoded.
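To check which integer was assigned to each category, the fitted encoder exposes a `classes_` attribute:
# classes_[i] is the original category that was mapped to integer i
print(label_encoder.classes_)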
- One-Hot Encoding:
One-Hot Encoding creates binary columns for each category, representing the presence or absence of a category.
one_hot_encoder = OneHotEncoder()
encoded_data = one_hot_encoder.fit_transform(dataset[['column_name']])
Replace `'column_name'` with the name of the categorical column to be one-hot encoded.
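By default, `OneHotEncoder` returns a sparse matrix rather than new DataFrame columns. A minimal sketch for attaching the encoded columns back to the DataFrame, assuming scikit-learn 1.0 or newer (for `get_feature_names_out`):
# Convert the sparse result to a DataFrame with readable column names
encoded_df = pd.DataFrame(
    encoded_data.toarray(),
    columns=one_hot_encoder.get_feature_names_out(['column_name']),
    index=dataset.index,
)
# Replace the original column with its one-hot encoded counterparts
dataset = pd.concat([dataset.drop(columns=['column_name']), encoded_df], axis=1)
Alternatively, `pd.get_dummies(dataset, columns=['column_name'])` achieves the same result in a single call.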
5. Feature Scaling:
Feature scaling is important to ensure that all features have a similar scale. It helps algorithms converge faster and prevents features with larger scales from dominating the learning process. The two most common techniques for feature scaling are Standardization and Normalization.
- Standardization:
Standardization scales the features to have zero mean and unit variance.
scaler = StandardScaler()
dataset[['column_name']] = scaler.fit_transform(dataset[['column_name']])
Replace `'column_name'` with the name of the column to be standardized.
- Normalization:
Normalization scales the features to a specific range, often between 0 and 1.
scaler = MinMaxScaler()
dataset[['column_name']] = scaler.fit_transform(dataset[['column_name']])
Replace `'column_name'` with the name of the column to be normalized.
6. Splitting the Dataset:
To evaluate the performance of our machine learning models, we need to split the dataset into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance on unseen data.
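A minimal sketch using scikit-learn's `train_test_split`, where `'target'` is a placeholder for the name of your target column:
# Separate the features (X) from the target column (y)
X = dataset.drop(columns=['target'])
y = dataset['target']
# Hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Note that in practice, imputers and scalers are usually fitted on the training set only and then applied to the test set, so that no information from the test data leaks into the model.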