Applications of Gensim

An Introduction to Gensim and Its Applications

Posted by YangLong on January 20, 2018

Why use Gensim?

  • Scalable statistical semantics
  • Analyze plain-text documents for semantic structure
  • Retrieve semantically similar documents

Example

Corpus preprocessing

from collections import defaultdict
raw_corpus = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
 
# Remove stop words (common function words)
stoplist = set('for a of the and to in'.split(' '))
texts = [[word for word in document.lower().split() if word not in stoplist]
          for document in raw_corpus]

 
# Remove words that appear only once in the whole corpus
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

precessed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]

Vectorization

from gensim import corpora
dictionary = corpora.Dictionary(precessed_corpus)

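# Convert a new document into a bag-of-words vector of (token id, count) pairs;
# words missing from the dictionary (here "interaction") are silently dropped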
new_doc = "human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

# Convert every document in the corpus into a bag-of-words vector
bow_corpus = [dictionary.doc2bow(text) for text in precessed_corpus]
print(bow_corpus)
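
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of the same idea. The helpers `build_token2id` and `to_bow` are hypothetical names written for this illustration, not part of Gensim, and the ids are assigned in order of first appearance, which need not match the numbering that `corpora.Dictionary` produces.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Two already-tokenized documents (illustrative subset of the corpus above)
docs = [["human", "interface", "computer"],
        ["system", "human", "system", "eps"]]
token2id = build_token2id(docs)

# "interaction" is not in the vocabulary, so it does not appear in the result
bow = to_bow("human computer interaction".split(), token2id)
print(token2id)
print(bow)
```

Each pair holds a token id and the number of times that token occurs in the document, which is the same shape as the output of `dictionary.doc2bow`.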

TF-IDF model

Compute a weight for each word

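from gensim import models

# Train the TF-IDF model on the bag-of-words corpus
tfidf = models.TfidfModel(bow_corpus)
string = "system minors"
string_bow = dictionary.doc2bow(string.lower().split())
# Apply the model: returns (token id, tf-idf weight) pairs
string_tfidf = tfidf[string_bow]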
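
The weighting itself can be sketched in pure Python. This sketch assumes the default `TfidfModel` settings as I understand them: raw term counts, a base-2 logarithm for the inverse document frequency, and L2 normalization of the resulting vector. The function names and the toy corpus are hypothetical and only serve the illustration.

```python
import math

def train_idf(bow_corpus):
    """Compute idf = log2(N / df) for every token id seen in the corpus."""
    n_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _count in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return {tid: math.log2(n_docs / df) for tid, df in doc_freq.items()}

def tfidf_weights(bow, idf):
    """Multiply each count by its idf, then scale the vector to unit length."""
    weighted = [(tid, count * idf[tid]) for tid, count in bow
                if idf.get(tid, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _tid, w in weighted))
    if norm == 0.0:
        return []
    return [(tid, w / norm) for tid, w in weighted]

# Toy bag-of-words corpus of four documents (made up for this sketch)
toy_corpus = [[(0, 1), (1, 1)],
              [(1, 2), (2, 1)],
              [(2, 1)],
              [(2, 1), (3, 1)]]
idf = train_idf(toy_corpus)

# Token 0 occurs in one document, token 1 in two, so token 0 gets the larger weight
weights = tfidf_weights([(0, 1), (1, 1)], idf)
print(weights)
```

A word that appears in few documents receives a high weight, while a word that appears in every document has an idf of zero and drops out of the vector.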

Word2Vec model

Compute similarity

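from gensim.models import Word2Vec

# Train Word2Vec on the tokenized corpus; min_count=1 keeps every word
vec = Word2Vec(precessed_corpus, min_count=1)
# The trained word vectors live in vec.wv; list the words most similar to "human"
sim = vec.wv.most_similar("human")
print(sim)
print("ok")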
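
`most_similar` ranks words by the cosine similarity of their vectors. The sketch below shows that ranking in pure Python on hand-made two-dimensional vectors; the vectors and the helper names are hypothetical and are not what Word2Vec would learn. On a nine-sentence corpus like the one above, the learned vectors are close to random, so the real output should not be read as meaningful.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-made vectors, for illustration only
toy_vectors = {"human": [1.0, 0.0],
               "user": [0.8, 0.6],
               "trees": [0.0, 1.0]}
ranked = most_similar("human", toy_vectors)
print(ranked)
```

Here "user" ranks above "trees" because its vector points in nearly the same direction as the vector for "human".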