Data Transformation

data mining

공개

2025년 4월 2일

concat

axis=0 row
axis=1 column

to_csv

train test split

Hold out method: sub sampling

validation data set: 하이퍼파라미터를 튜닝
층화추출법: stratified sampling

resampling method: Hold out을 여러번 반복. variance 극복 bias 극복 x
1. cross validation
- k-fold
- leave-one-out:
1. bootstrap

data의 특성

volume - Data size big - Data size small(long data) velocity variety veracity(정확성) value

import numpy as np
from sklearn.model_selection import LeaveOneOut
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['a', 'b', 'c'])
loo = LeaveOneOut()
loo.get_n_splits(X)

from sklearn.model_selection import KFold

X = np.array([[10, 20], [30, 40], [15, 19], [34, 41], [11, 21], [33, 39]])
y = np.array([0, 1, 0, 1, 0, 1])
kf = KFold(n_splits=3)

kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [2 3 4 5] TEST: [0 1]
TRAIN: [0 1 4 5] TEST: [2 3]
TRAIN: [0 1 2 3] TEST: [4 5]

맨 위로