Gensim的应用
Why use Gensim?
- Scalable statistical semantics 可伸缩的统计语法
- Analyze plain-text documents for semantic structure 纯文本的分析语法结构
- Retrieve semantically similar documents 检索相似语义文档
Example
语料处理
from collections import defaultdict
raw_corpus = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
##移除虚词
stoplist = set('for a of the and to in'.split(' '))
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in raw_corpus]
#移除只出现一次的单词
frequency = defaultdict(int)
for text in texts:
for token in text:
frequency[token] += 1
precessed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
向量化
from gensim import corpora
dictionary = corpora.Dictionary(precessed_corpus)
#指定单词的向量及数量
new_doc = "human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)
## 统计每一个单词
bow_corpus = [dictionary.doc2bow(text) for text in precessed_corpus]
print(bow_corpus)
TF-IF模型
计算每个单词的权重
tfidf = models.TfidfModel(bow_corpus)
string = "system minors"
string_bow = dictionary.doc2bow(string.lower().split())
string_tfidf = tfidf[string_bow]
Word2vec 模型
计算相似度
vec=Word2Vec(precessed_corpus, min_count=1)
sim=vec.most_similar("human")
print(sim)
print("ok")