Autoencoder

An autoencoder is a dimensionality-reduction algorithm in machine learning that is based on neural networks. The method was proposed by Geoffrey Hinton and R. R. Salakhutdinov in 2006.^[1]

Overview

An autoencoder is a three-layer neural network trained by unsupervised learning, using the same data for both the input layer and the output layer. When the training data are real-valued and not restricted to a bounded range, the activation function of the output layer is usually chosen to be the identity function (that is, the output layer is a linear transformation). If the identity function is also used as the activation function of the middle layer, the result is essentially equivalent to principal component analysis. In practice, the method can be used for anomaly detection by examining the difference between the input data and the output data.
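As a minimal sketch of the anomaly-detection use described above, the following example trains a linear autoencoder (2 inputs, 1 hidden unit, 2 outputs, identity activations) with plain gradient descent and then compares reconstruction errors. The function names, the tied encoder/decoder weights, the learning rate, and the synthetic data are all assumptions made for illustration; they do not come from the cited sources.

```python
def encode(w, x):
    """Project a 2-D point onto the 1-D code: z = w . x (identity activation)."""
    return w[0] * x[0] + w[1] * x[1]


def decode(w, z):
    """Map the 1-D code back to 2-D using the same (tied) weights."""
    return (w[0] * z, w[1] * z)


def reconstruction_error(w, x):
    """Squared distance between an input and its reconstruction."""
    x_hat = decode(w, encode(w, x))
    return (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2


def train_linear_autoencoder(data, lr=0.02, epochs=500):
    """Full-batch gradient descent on the mean squared reconstruction error."""
    w = [0.1, 0.1]  # small non-zero start (assumed initial value)
    n = len(data)
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x in data:
            z = encode(w, x)
            x_hat = decode(w, z)
            r = (x[0] - x_hat[0], x[1] - x_hat[1])  # residual x - x_hat
            rw = r[0] * w[0] + r[1] * w[1]
            for k in range(2):
                # Gradient of ||x - w (w . x)||^2 with respect to w[k].
                grad[k] += -2.0 * (z * r[k] + rw * x[k]) / n
        w = [w[k] - lr * grad[k] for k in range(2)]
    return w


# Synthetic training data: points lying close to the line y = 2x.
data = []
for i in range(50):
    t = -1.0 + 2.0 * i / 49
    data.append((t, 2.0 * t + 0.01 * (-1) ** i))

w = train_linear_autoencoder(data)
normal_score = reconstruction_error(w, (1.0, 2.0))    # follows the training pattern
anomaly_score = reconstruction_error(w, (1.0, -2.0))  # violates the training pattern
print(normal_score, anomaly_score)
```

With identity activations and tied weights, the learned vector `w` should approach the unit-length first principal direction of the data, which is the sense in which such a network behaves like principal component analysis. A point that does not follow the training pattern is reconstructed poorly, so its error can serve as an anomaly score.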

Characteristics and limitations

Autoencoders are designed to have the properties required for dimensionality reduction.

Internally, an autoencoder is built so that the size of the hidden layer, $d_{m}$, is smaller than the size of the input and output layers, $d_{i,o}$. The reason is that if $d_{i,o}\leq d_{m}$, the autoencoder can drive the reconstruction error to zero simply by learning the identity mapping.^[2]
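The degenerate case can be illustrated with a small sketch. The helper below is hypothetical and involves no learning: it only shows that when the hidden layer is at least as wide as the input, copying the input through the hidden layer already gives zero reconstruction error.

```python
def identity_autoencoder(x, d_m):
    """Hypothetical 'autoencoder' with d_m >= len(x) hidden units that copies its input."""
    d_io = len(x)
    if d_m < d_io:
        raise ValueError("this shortcut needs d_m >= d_io")
    z = list(x) + [0.0] * (d_m - d_io)  # encoder: copy the input, pad with zeros
    x_hat = z[:d_io]                    # decoder: read the copy back
    return z, x_hat


x = [0.3, -1.2, 4.0]
z, x_hat = identity_autoencoder(x, d_m=5)
error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
print(error)  # exactly zero: nothing was compressed or learned
```

Such a network reconstructs perfectly without extracting any structure from the data, which is why the hidden layer is normally made narrower than the input.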

An autoencoder can reduce the dimensionality of data, but this does not mean that it always learns a good representation.^[3] Decreasing $d_{m}$ forces the network to keep only those features of the input that carry the most information, which amounts to a compression that preserves the principal information.

Theory

Theoretical analyses have been given of why an autoencoder can learn to reconstruct its input while also reducing dimensionality.

The autoencoder network $AE_{\phi ,\theta }(x)$ is composed of an encoder network $NN_{\phi }(x)$ and a decoder network $NN_{\theta }(x)$. Under the deterministic interpretation, the AE maps an input directly to its reconstruction, that is, ${\hat {x}}=AE_{\phi ,\theta }(x)=NN_{\theta }(NN_{\phi }(x))$
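The composition can be sketched in a few lines. The weights below are hypothetical fixed values chosen only for illustration (they are not trained), and the choice of a tanh hidden layer with an identity output layer is an assumption.

```python
import math

# Hypothetical parameters phi (encoder) and theta (decoder), not trained.
W_ENC = [[0.5, -0.2, 0.1],
         [0.3, 0.8, -0.5]]   # 3 inputs -> 2 hidden units
B_ENC = [0.0, 0.1]
W_DEC = [[1.0, 0.2],
         [-0.4, 0.9],
         [0.3, -0.7]]        # 2 hidden units -> 3 outputs
B_DEC = [0.0, 0.0, 0.0]


def affine(W, b, v):
    """Return W v + b for a matrix given as a list of rows."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]


def nn_phi(x):
    """Encoder: affine map followed by tanh."""
    return [math.tanh(a) for a in affine(W_ENC, B_ENC, x)]


def nn_theta(z):
    """Decoder: affine map with identity activation."""
    return affine(W_DEC, B_DEC, z)


def autoencoder(x):
    """Deterministic reconstruction: x_hat = NN_theta(NN_phi(x))."""
    return nn_theta(nn_phi(x))


x = [1.0, 0.5, -1.0]
z = nn_phi(x)            # low-dimensional code
x_hat = autoencoder(x)   # reconstruction with the same dimension as x
print(z, x_hat)
```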

Probabilistic interpretation

From the viewpoint of probabilistic models, an autoencoder can be regarded as a kind of deep latent variable model and can be formulated as follows:

{\begin{aligned}z_{|x}\sim p_{\phi }(Z|X)&=p(Z|\lambda =NN_{\phi }(X))=\delta (Z-NN_{\phi }(X))\\{\hat {x}}_{|z}\sim p_{\theta }({\hat {X}}|Z)&=p({\hat {X}}|\mu =NN_{\theta }(Z))\end{aligned}}

In other words, $NN_{\phi }(x),NN_{\theta }(x)$ output the distribution parameters $\lambda ,\mu$, and the values $z,{\hat {x}}$ are obtained by sampling from the corresponding distributions.^[4]^[5] When $NN_{\phi }(x),NN_{\theta }(x)$ are combined within the autoencoder, this can be written as the following probabilistic expression:

{\hat {x}}_{|x}\sim p({\hat {X}}|\mu =AE_{\phi ,\theta }(X))
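A minimal sketch of this two-step view follows. The encoder and decoder here are toy stand-ins, and the Gaussian output distribution with standard deviation `sigma` is an assumption for illustration, since the expression above leaves the form of $p$ open.

```python
import random


def encoder(x):
    """Toy stand-in for NN_phi: 2-D input -> 1-D code."""
    return 0.5 * x[0] + 0.5 * x[1]


def decoder(z):
    """Toy stand-in for NN_theta: 1-D code -> mean of a 2-D output distribution."""
    return (z, z)


def sample_reconstruction(x, sigma, rng):
    """Draw x_hat ~ p(X_hat | mu = AE(x)), assuming an isotropic Gaussian."""
    z = encoder(x)   # z|x follows a delta distribution, so this step has no randomness
    mu = decoder(z)  # the decoder outputs the distribution parameter mu
    return tuple(rng.gauss(m, sigma) for m in mu)


rng = random.Random(0)
x = (1.0, 3.0)
samples = [sample_reconstruction(x, 0.1, rng) for _ in range(10000)]
mean = tuple(sum(s[k] for s in samples) / len(samples) for k in range(2))
print(mean)  # close to the deterministic reconstruction decoder(encoder(x))
```

Averaging many samples recovers the deterministic output, which is consistent with reading the network output as the mean parameter of a distribution rather than as the reconstruction itself.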

Various loss functions, including the mean squared error (MSE, L₂), have been used empirically (from the deterministic viewpoint) to train autoencoders. This practice is purely empirical and does not guarantee that training will always converge.

Fixed-variance normal distribution model

Consider a normal distribution with fixed variance, $N(X|\mu _{\theta },\sigma )$. Its negative log-likelihood $L_{n}(\theta )$ is:

L_{n}(\theta )={\frac {\|x-\mu _{\theta }\|^{2}}{2\sigma ^{2}}}+\log({\sqrt {2\pi \sigma ^{2}}})={\frac {1}{2\sigma ^{2}}}\|x-\mu _{\theta }\|^{2}+\mathrm {const.}

Up to a positive scale factor and an additive constant, this is the squared error between $x$ and $\mu _{\theta }$. That is, minimizing the negative log-likelihood of $N(X|\mu _{\theta }=AE_{\phi ,\theta }(x),\sigma )$ is equivalent to minimizing the squared error of ${\hat {x}}=AE_{\phi ,\theta }(x)$.^[6] In other words, an autoencoder trained with the squared error can be viewed as a model that returns the mode of a fixed-variance normal distribution $N(X|\mu _{\theta }=AE_{\phi ,\theta }(x),\sigma )$ fitted by maximum likelihood.
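The equivalence can be checked numerically. The sketch below uses hypothetical example values and an isotropic Gaussian with a fixed `sigma`. It evaluates the Gaussian negative log-likelihood and the squared error for several candidate means and confirms that the two differ only by the factor $1/(2\sigma ^{2})$ and a constant, so they share the same minimizer.

```python
import math


def gaussian_nll(x, mu, sigma):
    """Negative log-density of N(mu, sigma^2 I) at x, summed over components."""
    return sum(
        (xi - mi) ** 2 / (2.0 * sigma ** 2)
        + math.log(math.sqrt(2.0 * math.pi * sigma ** 2))
        for xi, mi in zip(x, mu)
    )


def squared_error(x, mu):
    """Squared Euclidean distance between x and mu."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mu))


x = [0.5, -1.0, 2.0]
sigma = 0.7  # fixed, not learned
candidates = [[0.0, 0.0, 0.0], [0.4, -0.8, 1.5], [0.5, -1.0, 2.0]]

# NLL minus the scaled squared error should be the same constant for every mu.
offsets = [
    gaussian_nll(x, mu, sigma) - squared_error(x, mu) / (2.0 * sigma ** 2)
    for mu in candidates
]
best_by_nll = min(candidates, key=lambda mu: gaussian_nll(x, mu, sigma))
best_by_mse = min(candidates, key=lambda mu: squared_error(x, mu))
print(offsets, best_by_nll == best_by_mse)
```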

References

[1] Geoffrey E. Hinton; R. R. Salakhutdinov (2006-07-28). "Reducing the Dimensionality of Data with Neural Networks" (PDF). Science. 313 (5786): 504–507.
[2] "autoencoder where Y is of the same dimensionality as X (or larger) can achieve perfect reconstruction simply by learning an identity mapping." Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
[3] "The criterion that representation Y should retain information about input X is not by itself sufficient to yield a useful representation." Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
[4] "a deterministic mapping from X to Y, that is, ... equivalently $q(Y|X;\theta )=\delta (Y-f_{\theta }(X))$ ... The deterministic mapping $f_{\theta }$ that transforms an input vector ${\boldsymbol {x}}$ into hidden representation ${\boldsymbol {y}}$ is called the encoder." Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
[5] "${\boldsymbol {z}}=g_{\theta ^{'}}({\boldsymbol {y}})$. This mapping $g_{\theta ^{'}}$ is called the decoder. ... In general ${\boldsymbol {z}}$ is not to be interpreted as an exact reconstruction of ${\boldsymbol {x}}$, but rather in probabilistic terms as the parameters (typically the mean) of a distribution $p(X|Z={\boldsymbol {z}})$" Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
[6] "$g_{\theta ^{'}}$ is called the decoder ... $Z=g_{\theta ^{'}}({\boldsymbol {y}})$ ... associated loss function $L({\boldsymbol {x}},{\boldsymbol {z}})$ ... $X|{\boldsymbol {z}}\sim N({\boldsymbol {z}},{\boldsymbol {\sigma }}^{2}{\boldsymbol {I}})$ ... This yields $L({\boldsymbol {x}},{\boldsymbol {z}})=L_{2}({\boldsymbol {x}},{\boldsymbol {z}})=C(\sigma ^{2})\|{\boldsymbol {x}}-{\boldsymbol {z}}\|^{2}$ ... This is the squared error objective found in most traditional autoencoders." Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
