Data Transformation

data mining
공개

2025년 4월 2일

  1. concat
  1. to_csv

train test split

  1. Hold out method: sub sampling
  • validation data set: 하이퍼파라미터를 튜닝
  • 층화추출법: stratified sampling
  1. resampling method: Hold out을 여러번 반복. variance 극복 bias 극복 x
    1. cross validation
    • k-fold
    • leave-one-out:
    1. bootstrap

data의 특성

volume - Data size big - Data size small(long data) velocity variety veracity(정확성) value

import numpy as np
from sklearn.model_selection import LeaveOneOut
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['a', 'b', 'c'])
loo = LeaveOneOut()
loo.get_n_splits(X)
3
from sklearn.model_selection import KFold

X = np.array([[10, 20], [30, 40], [15, 19], [34, 41], [11, 21], [33, 39]])
y = np.array([0, 1, 0, 1, 0, 1])
kf = KFold(n_splits=3)

kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3 4 5] TEST: [0 1]
TRAIN: [0 1 4 5] TEST: [2 3]
TRAIN: [0 1 2 3] TEST: [4 5]
맨 위로