data preprocessing

machine learning
Public

February 26, 2025

Load libraries and data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

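# Load the CSV: every column except the last is a feature, the last is the target
dataset = pd.read_csv('_data/00-data.csv')
x = dataset.iloc[:, :-1].values  # feature matrix
y = dataset.iloc[:, -1].values   # target vector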
x
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
y
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

Taking care of Missing data

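  1. Delete the rows (or columns) that contain missing values.
  2. Replace the missing values, for example with the column mean, as done here.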
from sklearn.impute import SimpleImputer

# Replace NaN in the two numeric columns (indices 1 and 2) with each column's mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
print(x)
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
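
The two options listed above can also be compared directly in pandas. The following is a minimal sketch, not the method used in this post: it takes four rows from the data shown above, and the column names `Country`, `Age`, and `Salary` are assumptions, since the CSV header is not shown here.

```python
import numpy as np
import pandas as pd

# Four rows copied from the array above; the column names are assumed.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 40.0, np.nan],
    "Salary": [72000.0, 48000.0, np.nan, 52000.0],
})

# Option 1 (delete): drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2 (replace): fill each numeric column's gaps with that column's mean.
filled = df.fillna(df[["Age", "Salary"]].mean())

print(dropped)
print(filled)
```

Deleting is simple but throws away half of this small sample, which is why replacing is usually preferred when the dataset is small.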

Encoding Categorical data

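  • Simply converting a categorical variable into 1, 2, 3 can be interpreted as if the categories had an order.
  • So we apply one-hot encoding instead, representing each category as a vector such as [1, 0, 0] or [0, 0, 1].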
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
print(x)
[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
from sklearn.preprocessing import LabelEncoder

# Encode the target labels 'No' / 'Yes' as 0 / 1
le = LabelEncoder()
y = le.fit_transform(y)
print(y)
[0 1 0 0 1 1 0 1 0 1]
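
To see why integer codes can suggest an order for the country column, the sketch below compares distances between categories under both encodings. It uses only NumPy and is an illustration under my own assumptions, not part of the scikit-learn workflow above.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany"])

# Integer codes: np.unique sorts the labels, so France=0, Germany=1, Spain=2.
labels, codes = np.unique(countries, return_inverse=True)

# One-hot vectors: one row of the identity matrix per category.
one_hot = np.eye(len(labels))[codes]

# Pairwise distances between the three samples under each encoding.
dist_codes = np.abs(codes[:, None] - codes[None, :])
dist_onehot = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=2)

print(dist_codes)   # France is 1 away from Germany but 2 away from Spain
print(dist_onehot)  # every pair of different countries is equally far apart
```

With integer codes a model may read France as "closer" to Germany than to Spain, which is an artifact of alphabetical sorting. With one-hot vectors all distinct categories are equidistant.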

Split dataset into training set and test set

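  • This must be done before feature scaling, because the test set has to stay unseen by the model: the scaler is fitted on the training set only and then applied to the test set.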
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

Feature scaling

from sklearn.preprocessing import StandardScaler

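# Fit the scaler on the training set only, then reuse its mean and std on the test set.
# The one-hot columns (indices 0 to 2) are left unscaled.
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])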
print(X_train)
[[0.0 1.0 0.0 -1.2231822690784795 -1.074632541818236]
 [1.0 0.0 0.0 0.628120624661922 0.5410562913562817]
 [0.0 1.0 0.0 1.4215361505506654 1.5284216894073759]
 [0.0 1.0 0.0 0.09917694073609294 -0.19697441021726317]
 [1.0 0.0 0.0 -0.2975308222082788 0.09225383769669344]
 [0.0 0.0 1.0 -0.16529490122682156 -0.4463091066948125]
 [0.0 0.0 1.0 -1.6198900320228513 -1.6131954862097422]
 [1.0 0.0 0.0 1.157064308587751 1.1693797264797055]]
print(X_test)
[[0.0 0.0 1.0 -0.06244474046346582 -1.2541535232820715]
 [1.0 0.0 0.0 -0.5620026641711934 -0.7155905788905655]]
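
The reason for splitting before scaling can be checked numerically. The sketch below is a small standalone example, assuming only the eight complete (age, salary) rows from the data above; `random_state=0` is my addition to make the split reproducible and is not used in the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# The eight rows without missing values: (age, salary) and the encoded target.
X = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [35.0, 58000.0], [48.0, 79000.0], [50.0, 83000.0], [37.0, 67000.0],
])
y = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# random_state is an assumption added here for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean and std from train only
X_test_scaled = sc.transform(X_test)        # reuses the training statistics

print(sc.mean_)                      # equals the training-set column means
print(X_train_scaled.mean(axis=0))   # approximately 0 for each column
print(X_test_scaled.mean(axis=0))    # generally not 0: test rows were never seen
```

Because the scaler never sees the test rows, the scaled test set does not have exactly zero mean and unit variance. That is expected, and it mirrors how new data would be treated after deployment.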