카운트 기반의 문서 표현 (1)

4.1 카운트 기반 문서 표현의 개념

문서의 의미를 반영해 벡터를 만드는 과정
텍스트 마이닝에서는 텍스트의 특성을 정의하고 그 값으로 텍스트를 구분
카운트 기반의 문서표현에서는 텍스트의 특성을 단어로 표현하고, 특성이 갖는 값은 그 단어가 텍스트에서 나타나는 횟수로 표현

텍스트는 정의한 특성에 대한 특성 값의 집합으로 변환
카운트 기반의 문서표현에서 단어가 특성, 단어의 빈도가 특성의 값

4.2 BOW 기반의 카운트 벡터 생성

NLTK가 제공하는 영화 리뷰 예시

import nltk
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

"""
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True
"""

from nltk.corpus import movie_reviews

print('#review count:', len(movie_reviews.fileids())) #영화 리뷰 문서의 id를 반환
print('#samples of file ids:', movie_reviews.fileids()[:10]) #id를 10개까지만 출력
print('#categories of reviews:', movie_reviews.categories()) # label, 즉 긍정인지 부정인지에 대한 분류
print('#num of "neg" reviews:', len(movie_reviews.fileids(categories='neg'))) #label이 부정인 문서들의 id를 반환
print('#num of "pos" reviews:', len(movie_reviews.fileids(categories='pos'))) #label이 긍정인 문서들의 id를 반환
fileid = movie_reviews.fileids()[0] #첫번째 문서의 id를 반환
print('#id of the first review:', fileid)
print('#first review content:\n', movie_reviews.raw(fileid)[:200]) #첫번째 문서의 내용을 200자까지만 출력
print()
print('#sentence tokenization result:', movie_reviews.sents(fileid)[:2]) #첫번째 문서를 sentence tokenize한 결과 중 앞 두 문장
print('#word tokenization result:', movie_reviews.words(fileid)[:20]) #첫번째 문서를 word tokenize한 결과 중 앞 스무 단어

"""
#review count: 2000
#samples of file ids: ['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']
#categories of reviews: ['neg', 'pos']
#num of "neg" reviews: 1000
#num of "pos" reviews: 1000
#id of the first review: neg/cv000_29416.txt
#first review content:
 plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
w

#sentence tokenization result: [['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.']]
#word tokenization result: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an']
"""

문서 집합으로부터 특성 벡터 추출 과정
- 1) 각 문서에 대해 텍스트 전처리를 수행해 의미가 있는 최소 단위의 리스트로 변환
- 2) 추출 대상이 되는 단어 집합, 즉 특성 집합을 구성
- 3) 각 문서별로 특성 추출 대상 단어들에 대해 단어의 빈도를 계산해 특성 벡터를 추출

documents = [list(movie_reviews.words(fileid)) for fileid in movie_reviews.fileids()]
print(documents[0][:50]) #첫째 문서의 앞 50개 단어를 출력

## ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', 'what', "'", 's', 'the', 'deal', '?', 'watch']

word_count = {}
for text in documents:
    for word in text:
        word_count[word] = word_count.get(word, 0) + 1

sorted_features = sorted(word_count, key=word_count.get, reverse=True)
for word in sorted_features[:10]:
    print(f"count of '{word}': {word_count[word]}", end=', ')
    
## count of ',': 77717, count of 'the': 76529, count of '.': 65876, count of 'a': 38106, count of 'and': 35576, count of 'of': 34123, count of 'to': 31937, count of ''': 30585, count of 'is': 25195, count of 'in': 21822,

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords #일반적으로 분석대상이 아닌 단어들

tokenizer = RegexpTokenizer("[\w']{3,}") # 정규포현식으로 토크나이저를 정의
english_stops = set(stopwords.words('english')) #영어 불용어를 가져옴

#words() 대신 raw()를 이용해 원문을 가져옴
documents = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]

# stopwords의 적용과 토큰화를 동시에 수행.
tokens = [[token for token in tokenizer.tokenize(doc) if token not in english_stops] for doc in documents]

word_count = {}
for text in tokens:
    for word in text:
        word_count[word] = word_count.get(word, 0) + 1

sorted_features = sorted(word_count, key=word_count.get, reverse=True)

print('num of features:', len(sorted_features))
for word in sorted_features[:10]:
    print(f"count of '{word}': {word_count[word]}", end=', ')
    
    
"""
num of features: 43030
count of 'film': 8935, count of 'one': 5791, count of 'movie': 5538, count of 'like': 3690, count of 'even': 2564, count of 'time': 2409, count of 'good': 2407, count of 'story': 2136, count of 'would': 2084, count of 'much': 2049, 
"""

word_features = sorted_features[:1000] #빈도가 높은 상위 1000개의 단어만 추출하여 features를 구성

def document_features(document, word_features):
    word_count = {}
    for word in document: #document에 있는 단어들에 대해 빈도수를 먼저 계산
        word_count[word] = word_count.get(word, 0) + 1

    features = []
    for word in word_features: #word_features의 단어에 대해 계산된 빈도수를 feature에 추가
        features.append(word_count.get(word, 0)) #빈도가 없는 단어는 0을 입력
    return features

word_features_ex = ['one', 'two', 'teen', 'couples', 'solo']
doc_ex = ['two', 'two', 'couples']
print(document_features(doc_ex, word_features_ex))

## [0, 2, 0, 1, 0]

feature_sets = [document_features(d, word_features) for d in tokens]

# 첫째 feature set의 내용을 앞 20개만 word_features의 단어와 함께 출력
for i in range(20):
    print(f'({word_features[i]}, {feature_sets[0][i]})', end=', ')
    
    
## (film, 5), (one, 3), (movie, 6), (like, 3), (even, 3), (time, 0), (good, 2), (story, 0), (would, 1), (much, 0), (also, 1), (get, 3), (character, 1), (two, 2), (well, 1), (first, 0), (characters, 1), (see, 2), (way, 3), (make, 5),

print(feature_sets[0][-20:])

## [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

※ 해당 내용은 <파이썬 텍스트 마이닝 완벽 가이드>의 내용을 토대로 학습하며 정리한 내용입니다.

저작자표시 동일조건 (새창열림)

'텍스트 마이닝' 카테고리의 다른 글

카운트 기반의 문서 표현 (3) (0)	2023.06.27
카운트 기반의 문서 표현 (2) (0)	2023.06.26
그래프와 워드 클라우드 (3) (0)	2023.06.24
그래프와 워드 클라우드 (2) (0)	2023.06.23
그래프와 워드 클라우드 (1) (0)	2023.06.22

IT & technology

카운트 기반의 문서 표현 (1)

4.1 카운트 기반 문서 표현의 개념

4.2 BOW 기반의 카운트 벡터 생성

'텍스트 마이닝' 카테고리의 다른 글

티스토리툴바

카운트 기반의 문서 표현 (1)

4.1 카운트 기반 문서 표현의 개념

4.2 BOW 기반의 카운트 벡터 생성

'텍스트 마이닝' 카테고리의 다른 글

'텍스트 마이닝' Related Articles

티스토리툴바