Text Analysis

Machine Learning

Public

July 30, 2025

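Overview

NLP vs Text Analysis

NLP (natural language processing) refers to the technology that lets computers understand and process human language.
Text analysis focuses on running predictive analysis on, or extracting useful information from, mainly unstructured text data using methods such as machine learning and statistics.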

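Types

Text classification: predicting which class or category a document belongs to (for example, sorting articles into entertainment / politics / sports, or detecting spam mail). Supervised learning.
Sentiment analysis: a technique for analyzing the subjective elements of a text. Supervised or unsupervised.
Text summarization: extracting the topic or central idea of a text.
Text clustering: grouping documents of a similar type. Unsupervised learning.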

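Process

Text preprocessing: normalization work such as upper / lower case conversion, special character removal, tokenization, stop word removal, and stemming
Feature vectorization / extraction: extracting features from the text and assigning vector values to them. BOW and Word2Vec are the representative approaches
ML model building and training / prediction / evaluation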
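The three steps above can be sketched end to end with scikit-learn. This is only a minimal illustration, assuming scikit-learn is installed; the toy corpus, its labels, and the step names `tfidf` and `clf` are hypothetical and not part of the original notes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; labels: 1 = spam, 0 = not spam.
train_texts = [
    "Win a free prize now",
    "Free money, click here",
    "Meeting moved to 3 pm",
    "Lunch tomorrow with the team",
]
train_labels = [1, 1, 0, 0]

pipeline = Pipeline([
    # Steps 1-2: lowercase, tokenize, drop English stop words, build TF-IDF vectors.
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    # Step 3: a linear model that accepts sparse input directly.
    ("clf", LogisticRegression()),
])

pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(["free prize inside", "team meeting tomorrow"])
print(predictions)
```

With a corpus this small the predictions carry no real meaning; the point is only the order of the stages.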

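Preprocessing

Cleansing: removing characters, symbols, and similar noise that get in the way of the analysis beforehand
Tokenization
- Sentence tokenization: splitting text into sentences on periods, newline characters, and so on. Used when the meaning carried by each sentence matters.
- Word tokenization: splitting text into words on spaces, commas, periods, newline characters, and so on.
  - n-gram: treating n consecutive words as a single unit. This preserves at least some of the meaning carried by word order in a sentence.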
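The cleansing step listed above could look roughly like the sketch below, which uses only the standard `re` module. The helper name `cleanse`, the regular expressions, and the sample string are assumptions chosen for illustration; which characters count as noise depends on the task.

```python
import re

def cleanse(text):
    """Lowercase the text, strip HTML-like tags, and drop non-alphabetic characters."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # remove tags such as <p> or <br />
    text = re.sub(r"[^a-z\s]", " ", text)     # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

raw = "<p>The Matrix is EVERYWHERE!!</p> It is all around us... (1999)"
cleaned = cleanse(raw)
print(cleaned)
# the matrix is everywhere it is all around us
```

Note that stripping every non-letter also removes the periods that sentence tokenization relies on, so in practice this kind of cleansing would run after sentences have been split.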

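import nltk
from nltk import sent_tokenize

nltk.download('punkt')      # download the Punkt data used to split sentences (periods, newlines, etc.)
nltk.download('punkt_tab')  # the same data in the tabular format that newer NLTK releases look for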

text_sample = "The Matrix is everywhere its all around us, here even in this room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work, when you go to church, when you pay your taxes."
sentences = sent_tokenize(text_sample)
print(sentences)

['The Matrix is everywhere its all around us, here even in this room.', 'You can see it when you look out your window or when you turn on your television.', 'You can feel it when you go to work, when you go to church, when you pay your taxes.']

[nltk_data] Downloading package punkt to
[nltk_data]     /home/cryscham123/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/cryscham123/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!

from nltk import word_tokenize

sentence = "The Matrix is everywhere its all around us, here even in this room."
words = word_tokenize(sentence)
print(words)

['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.']

def tokenize_text(text):
    sentences = sent_tokenize(text)
    words = [word_tokenize(sentence) for sentence in sentences]

    return words

word_tokens = tokenize_text(text_sample)
word_tokens

[['The',
  'Matrix',
  'is',
  'everywhere',
  'its',
  'all',
  'around',
  'us',
  ',',
  'here',
  'even',
  'in',
  'this',
  'room',
  '.'],
 ['You',
  'can',
  'see',
  'it',
  'when',
  'you',
  'look',
  'out',
  'your',
  'window',
  'or',
  'when',
  'you',
  'turn',
  'on',
  'your',
  'television',
  '.'],
 ['You',
  'can',
  'feel',
  'it',
  'when',
  'you',
  'go',
  'to',
  'work',
  ',',
  'when',
  'you',
  'go',
  'to',
  'church',
  ',',
  'when',
  'you',
  'pay',
  'your',
  'taxes',
  '.']]
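The n-gram idea mentioned under word tokenization can be sketched in plain Python on top of a token list like the ones above. The helper name `make_ngrams` is hypothetical; NLTK also provides its own `nltk.ngrams` utility for the same purpose.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['The', 'Matrix', 'is', 'everywhere']
bigrams = make_ngrams(tokens, 2)
print(bigrams)
# [('The', 'Matrix'), ('Matrix', 'is'), ('is', 'everywhere')]
```

Each bigram keeps one step of word order, which is the information a plain word-level split throws away.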

Stop word removal: removing words that are not needed for the analysis, e.g. articles, prepositions, and conjunctions

from nltk.corpus import stopwords
nltk.download('stopwords')  # download the stop word data set

stopwords.words('english')[:20]

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/cryscham123/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been']

sw = stopwords.words('english')
all_tokens = []
for sentence in word_tokens:
    filtered_words = []
    for word in sentence:
        word = word.lower()
        if word not in sw:
            filtered_words.append(word)
    all_tokens.append(filtered_words)
print(all_tokens)

[['matrix', 'everywhere', 'around', 'us', ',', 'even', 'room', '.'], ['see', 'look', 'window', 'turn', 'television', '.'], ['feel', 'go', 'work', ',', 'go', 'church', ',', 'pay', 'taxes', '.']]
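The filtered result still contains punctuation tokens such as ',' and '.', because they are not in the stop word list. One possible follow-up, shown here as a sketch rather than as part of the original workflow, is to keep only purely alphabetic tokens with `str.isalpha()`. The list below is copied from the output above; `alpha_tokens` is a hypothetical name.

```python
all_tokens = [
    ['matrix', 'everywhere', 'around', 'us', ',', 'even', 'room', '.'],
    ['see', 'look', 'window', 'turn', 'television', '.'],
    ['feel', 'go', 'work', ',', 'go', 'church', ',', 'pay', 'taxes', '.'],
]

# Keep a token only if every character in it is a letter.
alpha_tokens = [[w for w in sentence if w.isalpha()] for sentence in all_tokens]
print(alpha_tokens)
# [['matrix', 'everywhere', 'around', 'us', 'even', 'room'],
#  ['see', 'look', 'window', 'turn', 'television'],
#  ['feel', 'go', 'work', 'go', 'church', 'pay', 'taxes']]
```

This filter is blunt: it also drops tokens that mix letters with other characters, such as "n't" or "3rd", so it only suits cases where those can be discarded.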

Stemming, lemmatization: finding the base form of a word that changes grammatically or semantically
- Stemming is simpler and faster, but lemmatization is more accurate

from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

print(stemmer.stem('working'), stemmer.stem('works'), stemmer.stem('worked'))
print(stemmer.stem('amusing'), stemmer.stem('amuses'), stemmer.stem('amused'))
print(stemmer.stem('happier'), stemmer.stem('happiest'))
print(stemmer.stem('fancier'), stemmer.stem('fanciest'))

work work work
amus amus amus
happy happiest
fant fanciest

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemma = WordNetLemmatizer()

print(lemma.lemmatize('amusing', 'v'), lemma.lemmatize('amuses', 'v'), lemma.lemmatize('amused', 'v'))
print(lemma.lemmatize('happier', 'a'), lemma.lemmatize('happiest', 'a'))
print(lemma.lemmatize('fancier', 'a'), lemma.lemmatize('fanciest', 'a'))

amuse amuse amuse
happy happy
fancy fancy

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/cryscham123/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

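BOW

A bag-of-words model extracts feature values by assigning a frequency value to every word in a document, ignoring context and word order
Count-based vectorization: the higher the frequency, the more important the word is considered
TF-IDF (term frequency - inverse document frequency) based vectorization: a higher frequency is better, but words that appear broadly across all documents are penalized
- \(TF_i \times \log\frac{N}{DF_i}\)
  - \(TF_i\): frequency of word i in an individual document
  - \(DF_i\): number of documents that contain word i
  - \(N\): total number of documents
Sparse matrix problem: the resulting matrix is filled with a large number of unnecessary 0 values
- COO
- CSR
- Alternatively, use algorithms that handle sparse matrices well: logistic regression, linear SVM, naive Bayes, etc.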
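To make the two vectorization schemes concrete, the sketch below builds count vectors and then applies the TF-IDF formula above with NumPy. The tiny pre-tokenized corpus and the variable names are hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant of the formula by default, so their numbers will differ from this direct calculation.

```python
import numpy as np

# Hypothetical pre-tokenized corpus of three documents.
docs = [
    ["matrix", "room", "matrix"],
    ["window", "room"],
    ["taxes", "room", "church"],
]

# Vocabulary in sorted order: ['church', 'matrix', 'room', 'taxes', 'window']
vocab = sorted({word for doc in docs for word in doc})

# Count-based vectorization: tf[d, i] is the frequency of word i in document d.
tf = np.array([[doc.count(word) for word in vocab] for doc in docs])

# df[i] is the number of documents that contain word i.
df = (tf > 0).sum(axis=0)
n_docs = len(docs)

# TF-IDF as written above: TF_i * log(N / DF_i).
tfidf = tf * np.log(n_docs / df)

print(tf)
print(np.round(tfidf, 3))
```

"room" appears in every document, so log(N / DF) is log(1) = 0 and its weight drops to zero, while "matrix" keeps a high weight because it is frequent in one document and absent from the rest. Most entries of `tf` are already 0 even at this size, which is the sparsity problem described above.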

COO

Stores only the non-zero data in a separate array, together with the row and column position of each value.

import numpy as np
from scipy import sparse

dense = np.array([[3, 0, 1], [0, 2, 0]])  # the dense matrix being represented

# Non-zero values and their (row, column) positions
data = np.array([3, 1, 2])
row_pos = np.array([0, 0, 1])
col_pos = np.array([0, 2, 1])
sparse_coo = sparse.coo_matrix((data, (row_pos, col_pos)))
sparse_coo

<COOrdinate sparse matrix of dtype 'int64'
    with 3 stored elements and shape (2, 3)>

sparse_coo.toarray()

array([[3, 0, 1],
       [0, 2, 0]])

CSR

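The same idea as COO, except that instead of storing a row index for every non-zero value, it records only the position in the data array where each row starts.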

from scipy import sparse

dense2 = np.array([[0, 0, 1, 0, 0, 5],
                   [1, 4, 0, 3, 2, 5],
                   [0, 6, 0, 3, 0, 0],
                   [2, 0, 0, 0, 0, 0],
                   [0, 0, 0, 7, 0, 8],
                   [1, 0, 0, 0, 0, 0]])
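data2 = np.array([1, 5, 1, 4, 3, 2, 5, 6, 3, 2, 7, 8, 1])    # non-zero values, row by row
row_pos = np.array([0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5])  # COO-style row index of each value
col_pos = np.array([2, 5, 0, 1, 3, 4, 5, 1, 3, 0, 3, 5, 0])  # column index of each value
# Index in data2 where each row starts, with the total number of values (13) as the last entry
row_pos_ind = np.array([0, 2, 7, 9, 10, 12, 13])

sparse_csr = sparse.csr_matrix((data2, col_pos, row_pos_ind))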
sparse_csr.toarray()

array([[0, 0, 1, 0, 0, 5],
       [1, 4, 0, 3, 2, 5],
       [0, 6, 0, 3, 0, 0],
       [2, 0, 0, 0, 0, 0],
       [0, 0, 0, 7, 0, 8],
       [1, 0, 0, 0, 0, 0]])

sparse_csr = sparse.csr_matrix(dense2)
sparse_csr.toarray()

array([[0, 0, 1, 0, 0, 5],
       [1, 4, 0, 3, 2, 5],
       [0, 6, 0, 3, 0, 0],
       [2, 0, 0, 0, 0, 0],
       [0, 0, 0, 7, 0, 8],
       [1, 0, 0, 0, 0, 0]])
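The `row_pos_ind` array used above does not have to be written by hand. As a sketch, it can be derived from the COO-style `row_pos` array by counting the values in each row and taking a cumulative sum; the names `counts` and `indptr` are chosen here for illustration, with `indptr` matching the attribute name SciPy uses on CSR matrices.

```python
import numpy as np
from scipy import sparse

data2 = np.array([1, 5, 1, 4, 3, 2, 5, 6, 3, 2, 7, 8, 1])
row_pos = np.array([0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5])
col_pos = np.array([2, 5, 0, 1, 3, 4, 5, 1, 3, 0, 3, 5, 0])
n_rows = 6

# Number of non-zero values in each row: [2, 5, 2, 1, 2, 1]
counts = np.bincount(row_pos, minlength=n_rows)

# Row start offsets: a leading 0 followed by the running total of the counts.
indptr = np.concatenate(([0], np.cumsum(counts)))
print(indptr.tolist())
# [0, 2, 7, 9, 10, 12, 13]

# SciPy arrives at the same offsets when converting from COO to CSR.
csr_from_coo = sparse.coo_matrix((data2, (row_pos, col_pos))).tocsr()
print(csr_from_coo.indptr.tolist())
# [0, 2, 7, 9, 10, 12, 13]
```

Row i then occupies `data2[indptr[i]:indptr[i + 1]]`, so 7 offsets replace the 13 row indices that COO stores.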
