Word embedding

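Word embedding is the collective name for a set of language modeling and representation learning techniques in natural language processing (NLP). Conceptually, it embeds a high-dimensional space, with one dimension per word in the vocabulary, into a continuous vector space of much lower dimension, so that each word or phrase is mapped to a vector of real numbers.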
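As a rough illustration of this mapping, the sketch below contrasts a one-hot vector (one dimension per vocabulary word) with a dense vector looked up from an embedding table. The vocabulary, the table values, and the function names are all made up for illustration; in practice the table would be learned from data.

```python
# Toy vocabulary; the words and the 3-dimensional vectors are invented for illustration.
vocab = ["king", "queen", "apple", "orange"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Embedding table: one row of real numbers per word (|V| x d, here 4 x 3).
embedding_table = [
    [0.8, 0.1, 0.3],
    [0.7, 0.2, 0.4],
    [0.1, 0.9, 0.5],
    [0.2, 0.8, 0.6],
]

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for a word."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec

def embed(word):
    """Map a word to its d-dimensional dense vector by row lookup."""
    return embedding_table[word_to_index[word]]

def embed_via_matmul(word):
    """The same mapping written as one-hot vector times embedding table."""
    x = one_hot(word)
    dim = len(embedding_table[0])
    return [sum(x[i] * embedding_table[i][j] for i in range(len(vocab)))
            for j in range(dim)]

# The one-hot vector has |V| entries; the embedding has only d entries.
print(len(one_hot("queen")), len(embed("queen")))
```

With a realistic vocabulary the one-hot dimension would be in the tens or hundreds of thousands, while the embedding dimension typically stays in the tens to hundreds.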

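Methods for producing word embeddings include artificial neural networks^[1], dimensionality reduction on the word co-occurrence matrix^[2]^[3]^[4], probabilistic models^[5], and explicit representation in terms of the contexts in which words appear.^[6]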
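The co-occurrence route can be sketched as follows, assuming NumPy is available. The toy corpus, the window size of 2, the use of raw counts, and the choice of a plain truncated SVD are all assumptions made for illustration; the cited works use more refined weightings and factorizations.

```python
import numpy as np

# Toy corpus, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]
window = 2  # assumed context window size

tokens_per_sentence = [s.split() for s in corpus]
vocab = sorted({t for sent in tokens_per_sentence for t in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric word-by-word co-occurrence counts within the window.
cooc = np.zeros((len(vocab), len(vocab)))
for sent in tokens_per_sentence:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[index[w], index[sent[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors.
k = 2
U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
embeddings = U[:, :k] * S[:k]

print(embeddings.shape)  # one k-dimensional row per vocabulary word
```

Each row of `embeddings` is a low-dimensional vector for one word; words that occur in similar contexts tend to end up with similar rows.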

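When used as the underlying input representation, word and phrase embeddings have substantially improved performance on NLP tasks such as syntactic parsing^[7] and sentiment analysis.^[8]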

Development

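Word embedding techniques date back to 2000. In a series of papers, Yoshua Bengio and colleagues used neural probabilistic language models to have a machine "learn a distributed representation for words", thereby reducing the dimensionality of the word space.^[9] Roweis and Saul published in Science a method that uses locally linear embedding (LLE) to learn low-dimensional representations of high-dimensional data structures.^[10] The field developed steadily at first and then advanced rapidly after 2010, in part because of major breakthroughs during that period in the quality of the vectors and in the training speed of the models.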

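Word embedding has many branches, and many researchers work on it. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit that trains vector space models faster than previous approaches.^[11] Many newer word embedding techniques are based on artificial neural networks rather than on the earlier n-gram models and unsupervised learning.^[12]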

Applications to biological sequences: BioVectors

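Asgari and Mofrad proposed n-gram-based word embeddings for biological sequences (such as DNA, RNA, and proteins) for use in bioinformatics.^[13] The name bio-vectors (BioVec) refers to biological sequences in general, protein-vectors (ProtVec) to proteins (amino-acid sequences), and gene-vectors (GeneVec) to gene sequences. BioVec can be widely used in deep learning applications in proteomics and genomics. Their results suggest that BioVectors can characterize the underlying patterns of biological sequences in biochemical and biophysical terms.^[13]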
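To make the n-gram idea concrete, the sketch below splits a sequence into overlapping n-grams (k-mers) that can then be treated as "words" by an embedding method. The sequence is made up, and the overlapping sliding window with k = 3 is an assumption for illustration rather than a description of the exact procedure in the cited paper.

```python
def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (biological 'words')."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Made-up amino-acid string, for illustration only.
protein = "MKTAYIAK"
words = kmers(protein)
print(words)
```

The resulting list of k-mers plays the role that a tokenized sentence plays in ordinary word embedding training.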

Thought vectors

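Extending word embeddings to whole sentences or entire documents yields what are called thought vectors. Some researchers hope that thought vectors can improve the quality of machine translation.^[14]^[15]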

Software

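Software for training and using word embeddings includes Tomas Mikolov's Word2vec, Stanford University's GloVe^[16], and Deeplearning4j. Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) can also be used to reduce the dimensionality of the word vector space, to visualize word embeddings, and to perform word-sense induction.^[17]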
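A minimal sketch of the PCA step, assuming NumPy is available, is shown below. The word list and the random 50-dimensional vectors are placeholders standing in for embeddings that would normally come from a trained model; PCA is computed here directly from the SVD of the centered data rather than through any particular toolkit.

```python
import numpy as np

# Placeholder embeddings: random vectors standing in for trained ones.
rng = np.random.default_rng(0)
words = ["king", "queen", "apple", "orange", "car"]
vectors = rng.normal(size=(len(words), 50))

# PCA via SVD: center the data, then project onto the top two principal axes.
centered = vectors - vectors.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ Vt[:2].T

# Each word now has an (x, y) position that could be drawn on a scatter plot.
for word, (x, y) in zip(words, coords_2d):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

With real embeddings, words with related meanings would be expected to fall near each other in such a plot; with the random placeholders used here the positions carry no meaning.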

References

[1] Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg; Dean, Jeffrey. Distributed Representations of Words and Phrases and their Compositionality. 2013. arXiv:1310.4546 [cs.CL].
[2] Lebret, Rémi; Collobert, Ronan. Word Emdeddings through Hellinger PCA. 2013. arXiv:1312.5542 [cs.CL].
[3] Levy, Omer; Goldberg, Yoav. Neural Word Embedding as Implicit Matrix Factorization (PDF). NIPS. 2014. Retrieved 2016-12-28. (Archived (PDF) from the original on 2016-11-14.)
[4] Li, Yitan; Xu, Linli. Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective (PDF). Int'l J. Conf. on Artificial Intelligence (IJCAI). 2015. Retrieved 2016-12-28. (Archived (PDF) from the original on 2015-09-06.)
[5] Globerson, Amir. Euclidean Embedding of Co-occurrence Data (PDF). Journal of Machine Learning Research. 2007. Retrieved 2016-12-28. (Archived (PDF) from the original on 2016-09-21.)
[6] Levy, Omer; Goldberg, Yoav. Linguistic Regularities in Sparse and Explicit Word Representations (PDF). CoNLL: 171–180. 2014. Retrieved 2016-12-28. (Archived (PDF) from the original on 2017-09-25.)
[7] Socher, Richard; Bauer, John; Manning, Christopher; Ng, Andrew. Parsing with compositional vector grammars (PDF). Proc. ACL Conf. 2013. Retrieved 2016-12-28. (Archived (PDF) from the original on 2016-08-11.)
[8] Socher, Richard; Perelygin, Alex; Wu, Jean; Chuang, Jason; Manning, Chris; Ng, Andrew; Potts, Chris. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank (PDF). EMNLP. 2013. Retrieved 2016-12-28. (Archived (PDF) from the original on 2016-12-28.)
[9] A Neural Probabilistic Language Model: 1. Retrieved 2016-12-28. doi:10.1007/3-540-33486-6_6. (Archived from the original on 2017-01-17.)
[10] Roweis, Sam T.; Saul, Lawrence K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science. 2000, 290 (5500): 2323. Retrieved 2016-12-28. Bibcode:2000Sci...290.2323R. PMID 11125150. doi:10.1126/science.290.5500.2323. (Archived from the original on 2016-12-06.)
[11] word2vec. Retrieved 2016-12-28. (Archived from the original on 2017-02-11.)
[12] A Scalable Hierarchical Distributed Language Model. Retrieved 2016-12-28. (Archived from the original on 2016-08-06.)
[13] Asgari, Ehsaneddin; Mofrad, Mohammad R.K. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE. 2015, 10 (11): e0141287. Retrieved 2020-09-19. Bibcode:2015PLoSO..1041287A. doi:10.1371/journal.pone.0141287. (Archived from the original on 2020-08-15.)
[14] Kiros, Ryan; Zhu, Yukun; Salakhutdinov, Ruslan; Zemel, Richard S.; Torralba, Antonio; Urtasun, Raquel; Fidler, Sanja. skip-thought vectors. 2015. arXiv:1506.06726 [cs.CL].
[15] thoughtvectors. Retrieved 2016-12-28. (Archived from the original on 2017-02-11.)
[16] GloVe. Retrieved 2016-12-28. (Archived from the original on 2016-12-19.)
[17] Ghassemi, Mohammad; Mark, Roger; Nemati, Shamim. A Visualization of Evolving Clinical Sentiment Using Vector Representations of Clinical Notes (PDF). Computing in Cardiology. 2015. Retrieved 2016-12-28. (Archived (PDF) from the original on 2016-05-31.)

