Word embedding

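Word embedding is the collective name for a set of language modeling and representation learning techniques in natural language processing (NLP). Conceptually, it is an embedding from a high-dimensional space, with one dimension per word in the vocabulary, into a continuous vector space of much lower dimension; each word or phrase is mapped to a vector of real numbers.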
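The mapping from the high-dimensional word space to a low-dimensional vector space can be sketched as a table lookup. The sketch below assumes a toy five-word vocabulary, a 2-dimensional target space, and randomly initialized (untrained) values; all names are illustrative. It shows that multiplying a one-hot vector by the embedding matrix simply selects one row of that matrix.

```python
import random

def one_hot(index, size):
    """Return the sparse high-dimensional representation of a word."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def embed(one_hot_vec, matrix):
    """Multiply a one-hot row vector by the embedding matrix."""
    dim = len(matrix[0])
    return [sum(x * row[d] for x, row in zip(one_hot_vec, matrix))
            for d in range(dim)]

# Hypothetical toy vocabulary: one dimension per word in the original space.
vocab = ["king", "queen", "apple", "orange", "car"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Untrained embedding matrix: one low-dimensional real vector per word.
rng = random.Random(0)
embedding_dim = 2
embedding_matrix = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
                    for _ in vocab]

def lookup(word):
    """Direct row lookup, equivalent to the one-hot multiplication."""
    return embedding_matrix[word_to_index[word]]

sparse = one_hot(word_to_index["apple"], len(vocab))
dense = embed(sparse, embedding_matrix)
```

In a real system the matrix values would be learned from a corpus rather than drawn at random; only the lookup mechanics are shown here.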

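Methods for generating word embeddings include artificial neural networks^[1], dimensionality reduction on the word co-occurrence matrix^[2]^[3]^[4], probabilistic models^[5], and explicit representation in terms of the contexts in which words appear.^[6]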
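As an illustration of the co-occurrence approach, the following sketch counts word pairs within a fixed window and then reduces the count matrix to k dimensions. The reduction used here, a truncated singular value decomposition computed by power iteration, is one possible choice and is not necessarily the method of the cited papers; the corpus, window size, and function names are assumptions made for the example.

```python
import math

def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words occurs within `window` positions."""
    vocab = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}
    counts = [[0.0] * len(vocab) for _ in vocab]
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[index[word]][index[sentence[j]]] += 1.0
    return vocab, counts

def reduce_dimensions(counts, k, iterations=200):
    """Truncated SVD by power iteration on the Gram matrix C C^T."""
    n = len(counts)
    gram = [[sum(counts[i][t] * counts[j][t] for t in range(n))
             for j in range(n)] for i in range(n)]
    directions, sigmas = [], []
    for _component in range(k):
        v = [1.0 + 0.01 * i for i in range(n)]  # fixed start, deterministic
        for _step in range(iterations):
            w = [sum(g * x for g, x in zip(row, v)) for row in gram]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        gv = [sum(g * x for g, x in zip(row, v)) for row in gram]
        eigenvalue = sum(a * b for a, b in zip(v, gv))
        sigmas.append(math.sqrt(max(eigenvalue, 0.0)))
        directions.append(v)
        # Deflate: remove the component just found before searching for the next.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    # Row i is the k-dimensional vector for word i, scaled by singular values.
    vectors = [[sigmas[c] * directions[c][i] for c in range(k)]
               for i in range(n)]
    return vectors, sigmas

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]
vocab, counts = cooccurrence_matrix(corpus, window=2)
vectors, sigmas = reduce_dimensions(counts, k=2)
word_vectors = dict(zip(vocab, vectors))
```

Each of the nine vocabulary words ends up with a 2-dimensional vector in `word_vectors`. Practical systems typically reweight the raw counts before factorizing and use an optimized linear algebra library instead of hand-written power iteration.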

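When used as the underlying input representation, word and phrase embeddings have substantially improved performance in NLP tasks such as syntactic parsing^[7] and sentiment analysis.^[8]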

History

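Word embedding techniques date back to 2000. In a series of papers, Yoshua Bengio and colleagues used neural probabilistic language models to have a machine "learn a distributed representation for words", thereby reducing the dimensionality of the word space.^[9] Roweis and Saul published in Science a method that uses locally linear embedding (LLE) to learn low-dimensional representations of high-dimensional data structures.^[10] The field developed steadily at first and took off after 2010, in part because of major advances during that period in the quality of the vectors and the training speed of the models.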

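Word embedding has many branches, and many researchers work on it. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit that trains vector space models faster than previous approaches.^[11] Many newer word embedding techniques are based on artificial neural networks rather than the earlier n-gram models and unsupervised learning.^[12]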

Applications to biological sequences: BioVectors

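Asgari and Mofrad proposed an n-gram-based word embedding technique for biological sequences (DNA, RNA, proteins, and so on) in bioinformatics.^[13] The name bio-vectors (BioVec) refers to biological sequences in general, protein-vectors (ProtVec) to proteins (amino-acid sequences), and gene-vectors (GeneVec) to gene sequences. BioVec is widely used in deep learning for proteomics and genomics. Their results suggest that BioVectors can characterize the underlying patterns of biological sequences in biochemical and biophysical terms.^[13]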
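To make the n-gram idea concrete, the sketch below turns a biological sequence into lists of n-gram "words" that a word embedding trainer could then treat as sentences. The choice of n = 3, the shifted non-overlapping split, the function name, and the example sequence are all illustrative assumptions rather than details given above.

```python
def split_ngrams(sequence, n=3):
    """Split a sequence into n shifted lists of non-overlapping n-grams.

    Offset 0 starts at the first residue, offset 1 at the second, and so on.
    Trailing residues that do not fill a complete n-gram are dropped.
    """
    splits = []
    for offset in range(n):
        ngrams = [sequence[i:i + n]
                  for i in range(offset, len(sequence) - n + 1, n)]
        splits.append(ngrams)
    return splits

# Hypothetical amino-acid sequence used only for illustration.
protein = "MKTAYIAKQR"
sentences = split_ngrams(protein, n=3)
```

For the ten-residue example, the three shifted lists begin with "MKT", "KTA", and "TAY" respectively, so every overlapping 3-gram of the sequence appears in exactly one list.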

Thought vectors

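Thought vectors are the result of extending word embeddings to entire sentences or whole documents. Some researchers hope that thought vectors can improve the quality of machine translation.^[14]^[15]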
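One very simple way to obtain a vector for a sentence from word vectors is to average them. This is only a baseline sketch and is not the skip-thought model cited above; the toy 2-dimensional vectors and the function name are assumptions for illustration.

```python
def average_vectors(tokens, word_vectors):
    """Return the mean of the vectors of all known tokens, or None if none are known."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

# Hypothetical pre-trained word vectors.
toy_vectors = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "movie": [0.0, 1.0],
}

sentence_vector = average_vectors(["good", "movie"], toy_vectors)
```

Averaging ignores word order, which is one reason learned sentence encoders are studied as an alternative.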

Software

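Software for training and using word embeddings includes Tomas Mikolov's Word2vec, Stanford University's GloVe^[16], and Deeplearning4j. Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) can also be used to reduce the dimensionality of the word space, and to visualize word embeddings and word-sense induction.^[17]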

References

[1] Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg; Dean, Jeffrey. Distributed Representations of Words and Phrases and their Compositionality. 2013. arXiv:1310.4546 [cs.CL].

[2] Lebret, Rémi; Collobert, Ronan. Word Emdeddings through Hellinger PCA. 2013. arXiv:1312.5542 [cs.CL].

[3] Levy, Omer; Goldberg, Yoav. Neural Word Embedding as Implicit Matrix Factorization (PDF). NIPS. 2014. Retrieved 2016-12-28. (Archived (PDF) from the original on 2016-11-14.)

[4] Li, Yitan; Xu, Linli. Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective (PDF). Int'l J. Conf. on Artificial Intelligence (IJCAI). 2015. Retrieved 2016-12-28. (Archived (PDF) from the original on 2015-09-06.)

[5] Globerson, Amir. Euclidean Embedding of Co-occurrence Data (PDF). Journal of Machine Learning Research. 2007. Retrieved 2016-12-28. (Archived (PDF) from the original on 2016-09-21.)

[6] Levy, Omer; Goldberg, Yoav. Linguistic Regularities in Sparse and Explicit Word Representations (PDF). CoNLL: 171–180. 2014. Retrieved 2016-12-28. (Archived (PDF) from the original on 2017-09-25.)

[7] Socher, Richard; Bauer, John; Manning, Christopher; Ng, Andrew. Parsing with compositional vector grammars (PDF). Proc. ACL Conf. 2013. Retrieved 2016-12-28. (Archived (PDF) from the original on 2016-08-11.)

[8] Socher, Richard; Perelygin, Alex; Wu, Jean; Chuang, Jason; Manning, Chris; Ng, Andrew; Potts, Chris. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank (PDF). EMNLP. 2013. Retrieved 2016-12-28. (Archived (PDF) from the original on 2016-12-28.)

[9] A Neural Probabilistic Language Model: 1. Retrieved 2016-12-28. doi:10.1007/3-540-33486-6_6. (Archived from the original on 2017-01-17.)

[10] Roweis, Sam T.; Saul, Lawrence K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science. 2000, 290 (5500): 2323. Retrieved 2016-12-28. Bibcode:2000Sci...290.2323R. PMID 11125150. doi:10.1126/science.290.5500.2323. (Archived from the original on 2016-12-06.)

[11] word2vec. Retrieved 2016-12-28. (Archived from the original on 2017-02-11.)

[12] A Scalable Hierarchical Distributed Language Model. Retrieved 2016-12-28. (Archived from the original on 2016-08-06.)

[13] Asgari, Ehsaneddin; Mofrad, Mohammad R.K. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE. 2015, 10 (11): e0141287. Retrieved 2020-09-19. Bibcode:2015PLoSO..1041287A. doi:10.1371/journal.pone.0141287. (Archived from the original on 2020-08-15.)

[14] Kiros, Ryan; Zhu, Yukun; Salakhutdinov, Ruslan; Zemel, Richard S.; Torralba, Antonio; Urtasun, Raquel; Fidler, Sanja. skip-thought vectors. 2015. arXiv:1506.06726 [cs.CL].

[15] thoughtvectors. Retrieved 2016-12-28. (Archived from the original on 2017-02-11.)

[16] GloVe. Retrieved 2016-12-28. (Archived from the original on 2016-12-19.)

[17] Ghassemi, Mohammad; Mark, Roger; Nemati, Shamim. A Visualization of Evolving Clinical Sentiment Using Vector Representations of Clinical Notes (PDF). Computing in Cardiology. 2015. Retrieved 2016-12-28. (Archived (PDF) from the original on 2016-05-31.)
