Revision as of 16:04, 15 July 2024 editFyrael (talk | contribs)Extended confirmed users, New page reviewers39,398 edits →Regularized autoencoders: remove abbreviation from section title← Previous edit | Latest revision as of 19:38, 30 December 2024 edit undoLfstevens (talk | contribs)Extended confirmed users68,776 edits Filled in 0 bare reference(s) with reFill 2 | ||
(36 intermediate revisions by 14 users not shown) | |||
Line 2: | Line 2: | ||
{{Distinguish|Autocoder|Autocode}} | {{Distinguish|Autocoder|Autocode}} | ||
{{Use dmy dates|date=March 2020|cs1-dates=y}} | {{Use dmy dates|date=March 2020|cs1-dates=y}} | ||
{{Machine learning|Artificial neural network}} | {{Machine learning|Artificial neural network}} | ||
An '''autoencoder''' is a type of ] used to learn ] of unlabeled data (]).<ref name=":12">{{cite journal|doi=10.1002/aic.690370209|title=Nonlinear principal component analysis using autoassociative neural networks|journal=AIChE Journal|volume=37|issue=2|pages=233–243|date=1991|last1=Kramer|first1=Mark A.|bibcode=1991AIChE..37..233K |url= https://www.researchgate.net/profile/Abir_Alobaid/post/To_learn_a_probability_density_function_by_using_neural_network_can_we_first_estimate_density_using_nonparametric_methods_then_train_the_network/attachment/59d6450279197b80779a031e/AS:451263696510979@1484601057779/download/NL+PCA+by+using+ANN.pdf}}</ref><ref name=":13">{{Cite journal |last=Kramer |first=M. A. |date=1992-04-01 |title=Autoassociative neural networks |url=https://dx.doi.org/10.1016/0098-1354%2892%2980051-A |journal=Computers & Chemical Engineering |series=Neutral network applications in chemical engineering |language=en |volume=16 |issue=4 |pages=313–328 |doi=10.1016/0098-1354(92)80051-A |issn=0098-1354}}</ref> An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an ] (encoding) for a set of data, typically for ]. | |||
An '''autoencoder''' is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms.<ref>{{Cite book|last1=Bank |first1=Dor |last2=Koenigstein |first2=Noam |last3=Giryes |first3=Raja |year=2023 |chapter=Autoencoders |editor-last1=Rokach |editor-first1=Lior |editor-last2=Maimon |editor-first2=Oded |editor-last3=Shmueli |editor-first3=Erez |title=Machine learning for data science handbook |chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-24628-9_16 |language=en |pages=353–374 |doi=10.1007/978-3-031-24628-9_16|isbn=978-3-031-24627-2 }}</ref>
⚫ | Variants exist which aim to make the learned representations assume useful properties.<ref name=":0" /> Examples are regularized autoencoders (''sparse'', ''denoising'' and ''contractive'' autoencoders), which are effective in learning representations for subsequent ] tasks,<ref name=":4" /> and ], which can be used as ]s.<ref name=":11">{{cite journal |arxiv=1906.02691|doi=10.1561/2200000056|bibcode=2019arXiv190602691K|title=An Introduction to Variational Autoencoders|date=2019|last1=Welling|first1=Max|last2=Kingma|first2=Diederik P.|journal=Foundations and Trends in Machine Learning|volume=12|issue=4|pages=307–392|s2cid=174802445}}</ref> Autoencoders are applied to many problems, including ],<ref>Hinton GE, Krizhevsky A, Wang SD. In International Conference on Artificial Neural Networks 2011 Jun 14 (pp. 44-51). Springer, Berlin, Heidelberg.</ref> ],<ref name=":2">{{Cite book|last=Géron|first=Aurélien|title=Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow|publisher=O’Reilly Media, Inc.|year=2019|location=Canada|pages=739–740}}</ref> ], and ].<ref>{{cite journal|doi=10.1016/j.neucom.2008.04.030|title=Modeling word perception using the Elman network|journal=Neurocomputing|volume=71|issue=16–18|pages=3150|date=2008|last1=Liou|first1=Cheng-Yuan|last2=Huang|first2=Jau-Chi|last3=Yang|first3=Wen-Chie|url=http://ntur.lib.ntu.edu.tw//handle/246246/155195 }}</ref><ref>{{cite journal|doi=10.1016/j.neucom.2013.09.055|title=Autoencoder for words|journal=Neurocomputing|volume=139|pages=84–96|date=2014|last1=Liou|first1=Cheng-Yuan|last2=Cheng|first2=Wei-Chen|last3=Liou|first3=Jiun-Wei|last4=Liou|first4=Daw-Ran}}</ref> In terms of ], autoencoders can also be used to randomly generate new data that is similar to the input (training) data.<ref name=":2" /> | ||
{{Toclimit|3}} | {{Toclimit|3}} | ||
=== Definition === | === Definition === | ||
An autoencoder is defined by the following components: <blockquote>Two sets: the space of decoded messages <math>\mathcal X</math>; the space of encoded messages <math>\mathcal Z</math>. Typically <math>\mathcal X</math> and <math>\mathcal Z</math> are Euclidean spaces, that is, <math>\mathcal X = \R^m, \mathcal Z = \R^n</math> with <math>m > n.</math> </blockquote><blockquote>Two parametrized families of functions: the encoder family <math>E_\phi:\mathcal{X} \rightarrow \mathcal{Z}</math>, parametrized by <math>\phi</math>; the decoder family <math>D_\theta:\mathcal{Z} \rightarrow \mathcal{X}</math>, parametrized by <math>\theta</math>.</blockquote>For any <math>x\in \mathcal X</math>, we usually write <math>z = E_\phi(x)</math>, and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any <math>z\in \mathcal Z</math>, we usually write <math>x' = D_\theta(z)</math>, and refer to it as the (decoded) message.
Usually, both the encoder and the decoder are defined as multilayer perceptrons (MLPs). For example, a one-layer-MLP encoder <math>E_\phi</math> is:
:<math>E_\phi(\mathbf x) = \sigma(W\mathbf x+b)</math>
where <math>\sigma</math> is an element-wise activation function, <math>W</math> is a "weight" matrix, and <math>b</math> is a "bias" vector.
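The one-layer encoder above can be sketched in a few lines of NumPy. This is only an illustration: the logistic sigmoid as the choice of <math>\sigma</math>, the sizes <math>m = 4</math> and <math>n = 2</math>, and the function names are assumptions made for the example, not part of the definition.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic activation, one possible choice of sigma."""
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    """One-layer MLP encoder E_phi(x) = sigma(W x + b), with phi = (W, b)."""
    return sigmoid(W @ x + b)

# Hypothetical sizes: messages in R^4 (m = 4), codes in R^2 (n = 2).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))   # "weight" matrix, shape (n, m)
b = np.zeros(2)               # "bias" vector, shape (n,)
x = rng.normal(size=4)        # one message
z = encode(x, W, b)           # its code, shape (2,)
```

With a sigmoid activation every entry of the code lies strictly between 0 and 1; other activation functions give differently shaped code spaces.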
=== Training an autoencoder === | === Training an autoencoder === | ||
In most situations, the reference distribution is just the empirical distribution given by a dataset <math>\{x_1, ..., x_N\} \subset \mathcal X</math>, so that<math display="block">\mu_{ref} = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}</math>
where <math>\delta_{x_i}</math> is the Dirac measure, the quality function is just L2 loss: <math>d(x, x') = \|x - x'\|_2^2</math>, and <math>\|\cdot\|_2</math> is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization:<math display="block">\min_{\theta, \phi} L(\theta, \phi),\qquad \text{where } L(\theta, \phi) = \frac{1}{N}\sum_{i=1}^N \|x_i - D_\theta(E_\phi(x_i))\|_2^2</math>
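As a rough illustration of this objective, the following sketch evaluates <math>L(\theta, \phi)</math> for a fixed encoder and decoder on a tiny dataset. The linear encoder and decoder, the matrix <code>P</code>, and the function name <code>reconstruction_loss</code> are hypothetical choices for the example; no training is performed.

```python
import numpy as np

def reconstruction_loss(X, encoder, decoder):
    """Mean squared L2 reconstruction error over the rows of X."""
    X_rec = np.stack([decoder(encoder(x)) for x in X])
    return np.mean(np.sum((X - X_rec) ** 2, axis=1))

# Hypothetical linear encoder/decoder that keep only the first coordinate.
P = np.array([[1.0, 0.0]])           # encoder matrix, R^2 -> R^1
encoder = lambda x: P @ x
decoder = lambda z: P.T @ z

X = np.array([[1.0, 0.0], [0.0, 2.0]])
loss = reconstruction_loss(X, encoder, decoder)   # (0 + 4) / 2 = 2.0
```

The first message is reconstructed exactly, while the second loses its only nonzero coordinate, so the whole loss comes from the second message.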
=== Interpretation === | === Interpretation === | ||
An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function <math>d</math>. | An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function <math>d</math>. | ||
The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space <math>\mathcal Z</math> usually has fewer dimensions than the message space <math>\mathcal{X}</math>. | The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space <math>\mathcal Z</math> usually has fewer dimensions than the message space <math>\mathcal{X}</math>. | ||
Such an autoencoder is called ''undercomplete''. It can be interpreted as compressing the message, or reducing its dimensionality.<ref name=":12">{{cite journal |last1=Kramer |first1=Mark A. |date=1991 |title=Nonlinear principal component analysis using autoassociative neural networks |url=https://www.researchgate.net/profile/Abir_Alobaid/post/To_learn_a_probability_density_function_by_using_neural_network_can_we_first_estimate_density_using_nonparametric_methods_then_train_the_network/attachment/59d6450279197b80779a031e/AS:451263696510979@1484601057779/download/NL+PCA+by+using+ANN.pdf |journal=AIChE Journal |volume=37 |issue=2 |pages=233–243 |bibcode=1991AIChE..37..233K |doi=10.1002/aic.690370209}}</ref><ref name=":7" />
At the limit of an ideal undercomplete autoencoder, every possible code <math>z</math> in the code space is used to encode a message <math>x</math> that really appears in the distribution <math>\mu_{ref}</math>, and the decoder is also perfect: <math>D_\theta(E_\phi(x)) = x</math>. This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code <math>z</math> and obtaining <math>D_\theta(z)</math>, which is a message that really appears in the distribution <math>\mu_{ref}</math>. | At the limit of an ideal undercomplete autoencoder, every possible code <math>z</math> in the code space is used to encode a message <math>x</math> that really appears in the distribution <math>\mu_{ref}</math>, and the decoder is also perfect: <math>D_\theta(E_\phi(x)) = x</math>. This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code <math>z</math> and obtaining <math>D_\theta(z)</math>, which is a message that really appears in the distribution <math>\mu_{ref}</math>. | ||
In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below.<ref name=":0" /> | In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below.<ref name=":0" /> | ||
⚫ | ==Variations== | ||
⚫ | ===Variational autoencoder (VAE)=== | ||
⚫ | == History == | ||
The autoencoder was first proposed as a nonlinear generalization of ] (PCA) by Kramer.<ref name=":12" /> The autoencoder has also been called the autoassociator,<ref>{{Cite journal |last1=Japkowicz |first1=Nathalie |author-link=Nathalie Japkowicz |last2=Hanson |first2=Stephen José |author-link2=Stephen José Hanson |last3=Gluck |first3=Mark A. |date=2000-03-01 |title=Nonlinear Autoassociation Is Not Equivalent to PCA |journal=Neural Computation |volume=12 |issue=3 |pages=531–545 |doi=10.1162/089976600300015691 |issn=0899-7667 |pmid=10769321 |s2cid=18490972}}</ref> or Diabolo network.<ref>{{Cite journal |last1=Schwenk |first1=Holger |last2=Bengio |first2=Yoshua |date=1997 |title=Training Methods for Adaptive Boosting of Neural Networks |url=https://proceedings.neurips.cc/paper/1997/hash/9cb67ffb59554ab1dabb65bcb370ddd9-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=10}}</ref><ref name="bengio" /> Its first applications date to early 1990s.<ref name=":0" /><ref>{{Cite journal |last=Schmidhuber |first=Jürgen |date=January 2015 |title=Deep learning in neural networks: An overview |journal=Neural Networks |volume=61 |pages=85–117 |arxiv=1404.7828 |doi=10.1016/j.neunet.2014.09.003 |pmid=25462637 |s2cid=11715509}}</ref><ref>Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In ''Advances in neural information processing systems 6'' (pp. 3-10).</ref> Their most traditional application was ] or ], but the concept became widely used for learning ]s of data.<ref name="VAE">{{cite arXiv |eprint=1312.6114 |class=stat.ML |author1=Diederik P Kingma |first2=Max |last2=Welling |title=Auto-Encoding Variational Bayes |date=2013}}</ref><ref name="gan_faces">Generating Faces with Torch, Boesen A., Larsen L. 
and Sonderby S.K., 2015 {{url|http://torch.ch/blog/2015/11/13/gan.html}}</ref> Some of the most powerful ] in the 2010s involved autoencoders stacked inside ] neural networks.<ref name="domingos">{{cite book |last1=Domingos |first1=Pedro |title=The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World |title-link=The Master Algorithm |date=2015 |publisher=Basic Books |isbn=978-046506192-1 |at="Deeper into the Brain" subsection |chapter=4 |author-link=Pedro Domingos}}</ref> | |||
{{Main|Variational autoencoder}} | |||
⚫ | ]s (VAEs) belong to the families of ]. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors. | ||
⚫ | ==Variations== | ||
⚫ | Given an input dataset <math>x</math> characterized by an unknown probability function <math>P(x)</math> and a multivariate latent encoding vector <math>z</math>, the objective is to model the data as a distribution <math>p_\theta(x)</math>, with <math>\theta</math> defined as the set of the network parameters so that <math>p_\theta(x) = \int_{z}p_\theta(x,z)dz </math>. | ||
=== Regularized autoencoders === | |||
Various techniques exist to prevent autoencoders from learning the identity function and to improve their ability to capture important information and learn richer representations.
===Sparse autoencoder (SAE)=== | |||
Inspired by the sparse coding hypothesis in neuroscience, ''sparse autoencoders'' (SAE) are variants of autoencoders, such that the codes <math>E_\phi(x)</math> for messages tend to be ''sparse codes'', that is, <math>E_\phi(x)</math> is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time.<ref name="domingos">{{cite book |last1=Domingos |first1=Pedro |author-link=Pedro Domingos |title=The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World |title-link=The Master Algorithm |date=2015 |publisher=Basic Books |isbn=978-046506192-1 |at="Deeper into the Brain" subsection |chapter=4}}</ref> Encouraging sparsity improves performance on classification tasks.<ref name=":1" />
There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the '''k-sparse autoencoder'''.<ref name=":1">{{cite arXiv |eprint=1312.5663 |class=cs.LG |first1=Alireza |last1=Makhzani |first2=Brendan |last2=Frey |title=K-Sparse Autoencoders |date=2013}}</ref> | There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the '''k-sparse autoencoder'''.<ref name=":1">{{cite arXiv |eprint=1312.5663 |class=cs.LG |first1=Alireza |last1=Makhzani |first2=Brendan |last2=Frey |title=K-Sparse Autoencoders |date=2013}}</ref> | ||
Backpropagating through <math>f_k</math> is simple: set gradient to 0 for <math>b_i = 0</math> entries, and keep gradient for <math>b_i=1</math> entries. This is essentially a generalized ReLU function.<ref name=":1" />
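A minimal sketch of the k-sparse activation and its backward pass, assuming that <math>f_k</math> keeps the <math>k</math> largest activations by value and that <math>b</math> is the resulting 0/1 mask. The function names are hypothetical, and ties between equal activations are broken arbitrarily by the sort.

```python
import numpy as np

def k_sparse_forward(z, k):
    """Keep the k largest entries of z, zero the rest; also return the 0/1 mask.

    Assumes 1 <= k <= len(z).
    """
    mask = np.zeros_like(z)
    mask[np.argsort(z)[-k:]] = 1.0    # indices of the k highest activations
    return z * mask, mask

def k_sparse_backward(grad_out, mask):
    """Pass the gradient through kept entries and block it elsewhere."""
    return grad_out * mask

z = np.array([0.1, 2.0, -1.0, 0.7])
out, mask = k_sparse_forward(z, k=2)                 # out = [0, 2.0, 0, 0.7]
grad_in = k_sparse_backward(np.ones_like(z), mask)   # grad_in = [0, 1, 0, 1]
```

The backward pass reuses the mask computed in the forward pass, which is what makes the operation behave like a ReLU whose threshold depends on the input.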
The other way is a relaxed version of the k-sparse autoencoder. Instead of forcing sparsity, we add a '''sparsity regularization loss''', then optimize for<math display="block">\min_{\theta, \phi}L(\theta, \phi) + \lambda L_{\text{sparse}} (\theta, \phi)</math>where <math>\lambda > 0</math> measures how much sparsity we want to enforce.<ref name=":6" />
Let the autoencoder architecture have <math>K</math> layers. To define a sparsity regularization loss, we need a "desired" sparsity <math>\hat \rho_k</math> for each layer, a weight <math>w_k</math> for how much to enforce each sparsity, and a function <math>s: [0, 1]\times [0, 1] \to [0, \infty]</math> to measure how much two sparsities differ.
For each input <math>x</math>, let the actual sparsity of activation in each layer <math>k</math> be<math display="block">\rho_k(x) = \frac 1n \sum_{i=1}^n a_{k, i}(x)</math>where <math>a_{k, i}(x)</math> is the activation in the <math>i</math> -th neuron of the <math>k</math> -th layer upon input <math>x</math>. | For each input <math>x</math>, let the actual sparsity of activation in each layer <math>k</math> be<math display="block">\rho_k(x) = \frac 1n \sum_{i=1}^n a_{k, i}(x)</math>where <math>a_{k, i}(x)</math> is the activation in the <math>i</math> -th neuron of the <math>k</math> -th layer upon input <math>x</math>. | ||
The sparsity loss upon input <math>x</math> for one layer is <math>s(\hat\rho_k, \rho_k(x))</math>, and the sparsity regularization loss for the entire autoencoder is the expected weighted sum of sparsity losses:<math display="block">L_{\text{sparse}}(\theta, \phi) = \mathbb E_{x\sim\mu_X}\left[\sum_{k\in 1:K} w_k s(\hat\rho_k, \rho_k(x))\right]</math>Typically, the function <math>s</math> is either the Kullback-Leibler (KL) divergence, as<ref name=":1" /><ref name=":6">Ng, A. (2011). . ''CS294A Lecture notes'', ''72''(2011), 1-19.</ref><ref>{{Cite journal|last1=Nair|first1=Vinod|last2=Hinton|first2=Geoffrey E.|date=2009|title=3D Object Recognition with Deep Belief Nets|url=http://dl.acm.org/citation.cfm?id=2984093.2984244|journal=Proceedings of the 22nd International Conference on Neural Information Processing Systems|series=NIPS'09|location=USA|publisher=Curran Associates Inc.|pages=1339–1347|isbn=9781615679119}}</ref><ref>{{Cite journal|last1=Zeng|first1=Nianyin|last2=Zhang|first2=Hong|last3=Song|first3=Baoye|last4=Liu|first4=Weibo|last5=Li|first5=Yurong|last6=Dobaie|first6=Abdullah M.|date=2018-01-17|title=Facial expression recognition via learning deep sparse autoencoders|journal=Neurocomputing|volume=273|pages=643–649|doi=10.1016/j.neucom.2017.08.043|issn=0925-2312}}</ref>
::<math>s(\rho, \hat\rho) = KL(\rho || \hat{\rho}) = \rho \log \frac{\rho}{\hat{\rho}}+(1- \rho)\log \frac{1-\rho}{1-\hat{\rho}}</math> | ::<math>s(\rho, \hat\rho) = KL(\rho || \hat{\rho}) = \rho \log \frac{\rho}{\hat{\rho}}+(1- \rho)\log \frac{1-\rho}{1-\hat{\rho}}</math> | ||
or the L1 loss, as <math>s(\rho, \hat\rho) = |\rho- \hat\rho|</math>, or the L2 loss, as <math>s(\rho, \hat\rho) = |\rho- \hat\rho|^2</math>. | or the L1 loss, as <math>s(\rho, \hat\rho) = |\rho- \hat\rho|</math>, or the L2 loss, as <math>s(\rho, \hat\rho) = |\rho- \hat\rho|^2</math>. | ||
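The three choices of <math>s</math> can be written directly from the formulas above. The sketch below uses only the standard library; the function names are hypothetical, the argument order follows the displayed definitions <math>s(\rho, \hat\rho)</math>, and the KL version assumes both arguments lie strictly between 0 and 1 so that the logarithms are finite.

```python
import math

def kl_sparsity(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), both in (0, 1)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def l1_sparsity(rho, rho_hat):
    """L1 loss |rho - rho_hat|."""
    return abs(rho - rho_hat)

def l2_sparsity(rho, rho_hat):
    """L2 loss |rho - rho_hat|^2."""
    return abs(rho - rho_hat) ** 2

# Example: one sparsity level of 0.05 compared against another of 0.2.
kl = kl_sparsity(0.05, 0.2)
l1 = l1_sparsity(0.05, 0.2)
l2 = l2_sparsity(0.05, 0.2)
```

All three vanish when the two sparsities coincide; the KL divergence grows much faster than the L1 and L2 losses as either argument approaches 0 or 1.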
Alternatively, the sparsity regularization loss may be defined without reference to any "desired sparsity", and instead simply encourage as much sparsity as possible. In this case, one can define the sparsity regularization loss as <math display="block">L_{\text{sparse}}(\theta, \phi) = \mathbb E_{x\sim\mu_X}\left[\sum_{k\in 1:K} w_k \|h_k\|\right]</math>where <math>h_k</math> is the activation vector in the <math>k</math>-th layer of the autoencoder. The norm <math>\|\cdot\|</math> is usually the L1 norm (giving the L1 sparse autoencoder) or the L2 norm (giving the L2 sparse autoencoder).
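For the L1 case, the expectation can be estimated by averaging over a batch of inputs. The following sketch assumes the per-layer activations have already been computed and stored as arrays of shape (batch, units); the function name and the example values are hypothetical.

```python
import numpy as np

def l1_sparsity_loss(activations, weights):
    """Batch average of sum_k w_k * ||h_k||_1.

    activations: list of arrays, one per layer, each of shape (batch, units_k)
    weights:     list of per-layer weights w_k
    """
    per_example = sum(w * np.abs(h).sum(axis=1)
                      for w, h in zip(weights, activations))
    return per_example.mean()

h1 = np.array([[1.0, -1.0], [0.0, 2.0]])   # layer 1 activations for 2 inputs
h2 = np.array([[0.5], [0.0]])              # layer 2 activations for 2 inputs
loss = l1_sparsity_loss([h1, h2], weights=[1.0, 2.0])   # (3 + 2) / 2 = 2.5
```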
===Denoising autoencoder (DAE)=== | |||
Denoising autoencoders (DAE) try to achieve a ''good'' representation by changing the ''reconstruction criterion''.<ref name=":0" /><ref name=":4" /> | ''Denoising autoencoders'' (DAE) try to achieve a ''good'' representation by changing the ''reconstruction criterion''.<ref name=":0" /><ref name=":4" /> | ||
A DAE, originally called a "robust autoassociative network",<ref name=":13"/> is trained by intentionally corrupting the inputs of a standard autoencoder during training. A noise process is defined by a probability distribution <math>\mu_T</math> over functions <math>T:\mathcal X \to \mathcal X</math>. That is, the function <math>T</math> takes a message <math>x\in \mathcal X</math>, and corrupts it to a noisy version <math>T(x)</math>. The function <math>T</math> is selected randomly, with a probability distribution <math>\mu_T</math>. | A DAE, originally called a "robust autoassociative network" by Mark A. Kramer,<ref name=":13">{{Cite journal |last=Kramer |first=M. A. |date=1992-04-01 |title=Autoassociative neural networks |url=https://dx.doi.org/10.1016/0098-1354%2892%2980051-A |journal=Computers & Chemical Engineering |series=Neutral network applications in chemical engineering |language=en |volume=16 |issue=4 |pages=313–328 |doi=10.1016/0098-1354(92)80051-A |issn=0098-1354}}</ref> is trained by intentionally corrupting the inputs of a standard autoencoder during training. A noise process is defined by a probability distribution <math>\mu_T</math> over functions <math>T:\mathcal X \to \mathcal X</math>. That is, the function <math>T</math> takes a message <math>x\in \mathcal X</math>, and corrupts it to a noisy version <math>T(x)</math>. The function <math>T</math> is selected randomly, with a probability distribution <math>\mu_T</math>. | ||
Given a task <math>(\mu_{\text{ref}}, d)</math>, the problem of training a DAE is the optimization problem:<math display="block">\min_{\theta, \phi}L(\theta, \phi) = \mathbb E_{x\sim \mu_X, T\sim\mu_T}[d(x, (D_\theta\circ E_\phi\circ T)(x))]</math>That is, the optimal DAE should take any noisy message and attempt to recover the original message without noise, thus the name "denoising".
Usually, the noise process <math>T</math> is applied only during training and testing, not during downstream use. | Usually, the noise process <math>T</math> is applied only during training and testing, not during downstream use. | ||
* salt-and-pepper noise (a fraction of the input is randomly chosen and randomly set to its minimum or maximum value).<ref name=":4" /> | * salt-and-pepper noise (a fraction of the input is randomly chosen and randomly set to its minimum or maximum value).<ref name=":4" /> | ||
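As an illustration of one such noise process <math>T</math>, the sketch below implements salt-and-pepper corruption with NumPy. The function name, the default value range of 0 to 1, and the choice to corrupt an exact count of entries (rather than each entry independently) are assumptions made for the example.

```python
import numpy as np

def salt_and_pepper(x, fraction, rng, low=0.0, high=1.0):
    """Set a random `fraction` of entries to `low` or `high` with equal chance."""
    x_noisy = x.copy()                               # leave the clean message intact
    n_corrupt = int(round(fraction * x.size))
    idx = rng.choice(x.size, size=n_corrupt, replace=False)
    x_noisy.flat[idx] = rng.choice([low, high], size=n_corrupt)
    return x_noisy

rng = np.random.default_rng(0)
x = np.full(100, 0.5)                    # clean message with values in [0, 1]
x_noisy = salt_and_pepper(x, 0.2, rng)   # 20 entries become 0.0 or 1.0
```

During DAE training the network would receive <code>x_noisy</code> as input while the loss compares its output with the clean <code>x</code>.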
=== Contractive autoencoder (CAE) === | |||
A ''contractive autoencoder'' (CAE) adds the contractive regularization loss to the standard autoencoder loss:<math display="block">\min_{\theta, \phi}L(\theta, \phi) + \lambda L_{\text{cont}} (\theta, \phi)</math>where <math>\lambda > 0</math> measures how much contractive-ness we want to enforce. The contractive regularization loss itself is defined as the expected squared Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input:<math display="block">L_{\text{cont}}(\theta, \phi) = \mathbb E_{x\sim \mu_{ref}} \|\nabla_x E_\phi(x) \|_F^2</math>To understand what <math>L_{\text{cont}}</math> measures, note the fact<math display="block">\|E_\phi(x + \delta x) - E_\phi(x)\|_2 \leq \|\nabla_x E_\phi(x) \|_F \|\delta x\|_2</math>for any message <math>x\in \mathcal X</math>, and small variation <math>\delta x</math> in it. Thus, if <math>\|\nabla_x E_\phi(x) \|_F^2</math> is small, it means that a small neighborhood of the message maps to a small neighborhood of its code. This is a desired property, as it means small variation in the message leads to small, perhaps even zero, variation in its code, like how two pictures may look the same even if they are not exactly the same.
The DAE can be understood as an infinitesimal limit of CAE: in the limit of small Gaussian input noise, DAEs make the reconstruction function resist small but finite-sized input perturbations, while CAEs make the extracted features resist infinitesimal input perturbations. | The DAE can be understood as an infinitesimal limit of CAE: in the limit of small Gaussian input noise, DAEs make the reconstruction function resist small but finite-sized input perturbations, while CAEs make the extracted features resist infinitesimal input perturbations. | ||
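For a one-layer sigmoid encoder the contractive penalty has a closed form, which the following sketch computes for a single input. The sigmoid activation, the sizes, and the function names are assumptions for the example; for deeper encoders the Jacobian would typically be obtained by automatic differentiation instead.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the Jacobian of E(x) = sigmoid(W x + b).

    The Jacobian is diag(h * (1 - h)) @ W, so its squared Frobenius norm is
    sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
    """
    h = sigmoid(W @ x + b)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 5)), rng.normal(size=3), rng.normal(size=5)
penalty = contractive_penalty(x, W, b)
```

Averaging this quantity over the training inputs gives an estimate of <math>L_{\text{cont}}</math>. The penalty shrinks both when the weights are small and when the hidden units saturate, since <math>h_j(1 - h_j)</math> tends to zero in that case.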
=== Minimum description length autoencoder (MDL-AE) === | |||
A ''minimum description length autoencoder'' (MDL-AE) is an advanced variation of the traditional autoencoder, which leverages principles from information theory, specifically the minimum description length (MDL) principle. The MDL principle posits that the best model for a dataset is the one that provides the shortest combined encoding of the model and the data. In the context of autoencoders, this principle is applied to ensure that the learned representation is not only compact but also interpretable and efficient for reconstruction.
⚫ | <ref>{{Cite journal |last1=Hinton |first1=Geoffrey E |last2=Zemel |first2=Richard |date=1993 |title=Autoencoders, Minimum Description Length and Helmholtz Free Energy |url=https://proceedings.neurips.cc/paper/1993/hash/9e3cfc48eccf81a0d57663e129aef3cb-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=6}}</ref> | ||
{{Empty section|date=March 2024}} | |||
The MDL-AE seeks to minimize the total description length of the data, which includes the size of the latent representation (code length) and the error in reconstructing the original data. The objective can be expressed as
=== Concrete autoencoder === | |||
⚫ | <math>L_{\text{code}} + L_{\text{error}}</math>, where <math>L_{\text{code}}</math> represents the length of the compressed latent representation and <math>L_{\text{error}}</math> denotes the reconstruction error.<ref name=":5">{{Cite journal |last1=Hinton |first1=Geoffrey E |last2=Zemel |first2=Richard |date=1993 |title=Autoencoders, Minimum Description Length and Helmholtz Free Energy |url=https://proceedings.neurips.cc/paper/1993/hash/9e3cfc48eccf81a0d57663e129aef3cb-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=6}}</ref> | ||
⚫ | The concrete autoencoder is designed for discrete feature selection.<ref>{{cite arXiv|last1=Abid|first1=Abubakar|last2=Balin|first2=Muhammad Fatih|last3=Zou|first3=James|date=2019-01-27|title=Concrete Autoencoders for Differentiable Feature Selection and Reconstruction|eprint=1901.09346|class=cs.LG}}</ref> A concrete autoencoder forces the latent space to consist only of a user-specified number of features. The concrete autoencoder uses a continuous ] of the ] to allow gradients to pass through the feature selector layer, which makes it possible to use standard ] to learn an optimal subset of input features that minimize reconstruction loss. | ||
=== Concrete autoencoder (CAE) === | ||
The ''concrete autoencoder'' is designed for discrete feature selection.<ref>{{cite arXiv|last1=Abid|first1=Abubakar|last2=Balin|first2=Muhammad Fatih|last3=Zou|first3=James|date=2019-01-27|title=Concrete Autoencoders for Differentiable Feature Selection and Reconstruction|eprint=1901.09346|class=cs.LG}}</ref> A concrete autoencoder forces the latent space to consist only of a user-specified number of features. The concrete autoencoder uses a continuous relaxation of the categorical distribution to allow gradients to pass through the feature selector layer, which makes it possible to use standard backpropagation to learn an optimal subset of input features that minimizes reconstruction loss.
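The selector layer can be illustrated with a Gumbel-softmax style sample, one common way to relax a categorical choice. This is a sketch under stated assumptions rather than the cited authors' implementation: the function name, the <code>logits</code> parameterization, and the fixed temperature are all hypothetical, and in practice the temperature would usually be lowered gradually during training.

```python
import numpy as np

def concrete_select(x, logits, temperature, rng):
    """Relaxed feature selection: each row of `logits` softly picks one input feature.

    logits: array of shape (k, d), learnable selection scores (hypothetical name)
    Returns the k selected values and the (k, d) selection weights.
    """
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=1, keepdims=True)         # stabilise the softmax
    weights = np.exp(y) / np.exp(y).sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # d = 5 input features
logits = np.zeros((2, 5))                        # k = 2 selector units
logits[0, 0] = 20.0                              # unit 0 strongly prefers feature 0
logits[1, 3] = 20.0                              # unit 1 strongly prefers feature 3
selected, weights = concrete_select(x, logits, temperature=0.1, rng=rng)
```

Each row of <code>weights</code> sums to one, and at low temperature it approaches a one-hot vector, so the layer output approaches a plain subset of the input features while remaining differentiable with respect to <code>logits</code>.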
⚫ | ]s (VAEs) belong to the families of ]. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors. | ||
⚫ | Given an input dataset <math>x</math> characterized by an unknown probability function <math>P(x)</math> and a multivariate latent encoding vector <math>z</math>, the objective is to model the data as a distribution <math>p_\theta(x)</math>, with <math>\theta</math> defined as the set of the network parameters so that <math>p_\theta(x) = \int_{z}p_\theta(x,z)dz </math>. | ||
==Advantages of depth== | ==Advantages of depth== | ||
Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders.<ref name=":9">{{cite arXiv |eprint=1405.1380|last1=Zhou|first1=Yingbo|last2=Arpit|first2=Devansh|last3=Nwogu|first3=Ifeoma|last4=Govindaraju|first4=Venu|title=Is Joint Training Better for Deep Auto-Encoders?|class=stat.ML|date=2014}}</ref> A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method.<ref name=":9" /> However, their experiments showed that the success of joint training depends heavily on the regularization strategies adopted.<ref name=":9" /><ref>R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” in AISTATS, 2009, pp. 448–455.</ref> | Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders.<ref name=":9">{{cite arXiv |eprint=1405.1380|last1=Zhou|first1=Yingbo|last2=Arpit|first2=Devansh|last3=Nwogu|first3=Ifeoma|last4=Govindaraju|first4=Venu|title=Is Joint Training Better for Deep Auto-Encoders?|class=stat.ML|date=2014}}</ref> A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method.<ref name=":9" /> However, their experiments showed that the success of joint training depends heavily on the regularization strategies adopted.<ref name=":9" /><ref>R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” in AISTATS, 2009, pp. 448–455.</ref> | ||
== History ==
(Oja, 1982)<ref>{{Cite journal |last=Oja |first=Erkki |date=1982-11-01 |title=Simplified neuron model as a principal component analyzer |url=https://link.springer.com/article/10.1007/BF00275687 |journal=Journal of Mathematical Biology |language=en |volume=15 |issue=3 |pages=267–273 |doi=10.1007/BF00275687 |pmid=7153672 |issn=1432-1416}}</ref> noted that PCA is equivalent to a neural network with one hidden layer with identity activation function. In the language of autoencoding, the input-to-hidden module is the encoder, and the hidden-to-output module is the decoder. Subsequently, in (Baldi and Hornik, 1989)<ref name="auto">{{Cite journal |last1=Baldi |first1=Pierre |last2=Hornik |first2=Kurt |date=1989-01-01 |title=Neural networks and principal component analysis: Learning from examples without local minima |url=https://www.sciencedirect.com/science/article/abs/pii/0893608089900142 |journal=Neural Networks |volume=2 |issue=1 |pages=53–58 |doi=10.1016/0893-6080(89)90014-2 |issn=0893-6080}}</ref> and (Kramer, 1991)<ref name=":12" /> generalized PCA to autoencoders, which they termed as "nonlinear PCA". | |||
Immediately after the resurgence of neural networks in the 1980s, it was suggested in 1986<ref>{{Cite book |last1=Rumelhart |first1=David E. |url=https://direct.mit.edu/books/book/4424/Parallel-Distributed-ProcessingExplorations-in-the |title=Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations |last2=McClelland |first2=James L. |last3=AU |date=1986 |publisher=The MIT Press |isbn=978-0-262-29140-8 |language=en |chapter=2. A General Framework for Parallel Distributed Processing |doi=10.7551/mitpress/5236.001.0001}}</ref> that a neural network be put in "auto-association mode". This was then implemented in (Harrison, 1987)<ref>Harrison TD (1987) A Connectionist framework for continuous speech recognition. Cambridge University Ph. D. dissertation</ref> and (Elman, Zipser, 1988)<ref>{{Cite journal |last1=Elman |first1=Jeffrey L. |last2=Zipser |first2=David |date=1988-04-01 |title=Learning the hidden structure of speech |url=https://pubs.aip.org/jasa/article/83/4/1615/826094/Learning-the-hidden-structure-of-speechLearning |journal=The Journal of the Acoustical Society of America |language=en |volume=83 |issue=4 |pages=1615–1626 |doi=10.1121/1.395916 |pmid=3372872 |bibcode=1988ASAJ...83.1615E |issn=0001-4966}}</ref> for speech and in (Cottrell, Munro, Zipser, 1987)<ref>{{Cite journal |last1=Cottrell |first1=Garrison W. |last2=Munro |first2=Paul |last3=Zipser |first3=David |date=1987 |title=Learning Internal Representation From Gray-Scale Images: An Example of Extensional Programming |url=https://escholarship.org/uc/item/2zs7w6z8 |journal=Proceedings of the Annual Meeting of the Cognitive Science Society |language=en |volume=9 }}</ref> for images.<ref name=":14" /> In (Hinton, Salakhutdinov, 2006),<ref name=":72">{{cite journal |last1=Hinton |first1=G. E. |last2=Salakhutdinov |first2=R.R. 
|date=28 July 2006 |title=Reducing the Dimensionality of Data with Neural Networks |journal=Science |volume=313 |issue=5786 |pages=504–507 |bibcode=2006Sci...313..504H |doi=10.1126/science.1127647 |pmid=16873662 |s2cid=1658773}}</ref> ] were developed. These train a pair ] as encoder-decoder pairs, then train another pair on the latent representation of the first pair, and so on.<ref name="scholar">{{Cite journal |vauthors=Hinton G |year=2009 |title=Deep belief networks |journal=Scholarpedia |volume=4 |issue=5 |pages=5947 |bibcode=2009SchpJ...4.5947H |doi=10.4249/scholarpedia.5947 |doi-access=free}}</ref> | |||
The first applications of AE date to early 1990s.<ref name=":0" /><ref>{{Cite journal |last=Schmidhuber |first=Jürgen |date=January 2015 |title=Deep learning in neural networks: An overview |journal=Neural Networks |volume=61 |pages=85–117 |arxiv=1404.7828 |doi=10.1016/j.neunet.2014.09.003 |pmid=25462637 |s2cid=11715509}}</ref><ref name=":5" /> Their most traditional application was ] or ], but the concept became widely used for learning ]s of data.<ref name="VAE">{{cite arXiv |eprint=1312.6114 |class=stat.ML |author1=Diederik P Kingma |first2=Max |last2=Welling |title=Auto-Encoding Variational Bayes |date=2013}}</ref><ref name="gan_faces">Generating Faces with Torch, Boesen A., Larsen L. and Sonderby S.K., 2015 {{url|http://torch.ch/blog/2015/11/13/gan.html}}</ref> Some of the most powerful ] in the 2010s involved autoencoder modules as a component of larger AI systems, such as VAE in ], discrete VAE in Transformer-based image generators like ], etc. | |||
During the early days, when the terminology was uncertain, the autoencoder has also been called identity mapping,<ref name="auto"/><ref name=":12" /> auto-associating,<ref>{{Cite journal |last1=Ackley |first1=D |last2=Hinton |first2=G |last3=Sejnowski |first3=T |date=March 1985 |title=A learning algorithm for boltzmann machines |url=http://doi.wiley.com/10.1016/S0364-0213(85)80012-4 |journal=Cognitive Science |language=en |volume=9 |issue=1 |pages=147–169 |doi=10.1016/S0364-0213(85)80012-4}}</ref> ] ],<ref name=":12" /> or Diabolo network.<ref>{{Cite journal |last1=Schwenk |first1=Holger |last2=Bengio |first2=Yoshua |date=1997 |title=Training Methods for Adaptive Boosting of Neural Networks |url=https://proceedings.neurips.cc/paper/1997/hash/9cb67ffb59554ab1dabb65bcb370ddd9-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=10}}</ref><ref name="bengio" /> | |||
== Applications ==
The two main applications of autoencoders are dimensionality reduction and information retrieval (or associative memory),<ref name=":0">{{Cite book|url=http://www.deeplearningbook.org|title=Deep Learning|last1=Goodfellow|first1=Ian|last2=Bengio|first2=Yoshua|last3=Courville|first3=Aaron|publisher=MIT Press|date=2016|isbn=978-0262035613}}</ref> but modern variations have been applied to other tasks.
=== Dimensionality reduction ===
==== Principal component analysis ====
].<ref name=":10" />]] | ].<ref name=":10" />]] | ||
If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to principal component analysis (PCA).<ref name=":14">{{Cite journal|last1=Bourlard|first1=H.|last2=Kamp|first2=Y.|date=1988|title=Auto-association by multilayer perceptrons and singular value decomposition|journal=Biological Cybernetics|volume=59|issue=4–5|pages=291–294|doi=10.1007/BF00332918|pmid=3196773|s2cid=206775335|url=http://infoscience.epfl.ch/record/82601}}</ref><ref>{{cite book|title=Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '14|last1=Chicco|first1=Davide|last2=Sadowski|first2=Peter|last3=Baldi|first3=Pierre|date=2014|isbn=9781450328944|pages=533|chapter=Deep autoencoder neural networks for gene ontology annotation predictions|doi=10.1145/2649387.2649442|hdl=11311/964622|s2cid=207217210|url=http://dl.acm.org/citation.cfm?id=2649442}}</ref> The weights of an autoencoder with a single hidden layer of size <math>p</math> (where <math>p</math> is less than the size of the input) span the same vector subspace as the one spanned by the first <math>p</math> principal components, and the output of the autoencoder is an orthogonal projection onto this subspace. The autoencoder weights are not equal to the principal components, and are generally not orthogonal, yet the principal components may be recovered from them using the singular value decomposition.<ref>{{cite arXiv|last1=Plaut|first1=E|title=From Principal Subspaces to Principal Components with Linear Autoencoders|eprint=1804.10253|date=2018|class=stat.ML}}</ref>
However, the potential of autoencoders resides in their non-linearity, allowing the model to learn more powerful generalizations compared to PCA, and to reconstruct the input with significantly lower information loss.<ref name=":7" />
=== Machine translation ===
Autoencoders have been applied to ], which is usually referred to as ] (NMT).<ref>{{cite arXiv |eprint=1409.1259|last1=Cho|first1=Kyunghyun|author2=Bart van Merrienboer|last3=Bahdanau|first3=Dzmitry|last4=Bengio|first4=Yoshua|title=On the Properties of Neural Machine Translation: Encoder-Decoder Approaches|class=cs.CL|date=2014}}</ref><ref>{{cite arXiv |eprint=1409.3215|last1=Sutskever|first1=Ilya|last2=Vinyals|first2=Oriol|last3=Le|first3=Quoc V.|title=Sequence to Sequence Learning with Neural Networks|class=cs.CL|date=2014}}</ref> Unlike traditional autoencoders, the output does not match the input - it is in another language. In NMT, texts are treated as sequences to be encoded into the learning procedure, while on the decoder side sequences in the target language(s) are generated. ]-specific autoencoders incorporate further ] features into the learning procedure, such as Chinese decomposition features.<ref>{{cite arXiv |eprint=1805.01565|last1=Han|first1=Lifeng|last2=Kuang|first2=Shaohui|title=Incorporating Chinese Radicals into Neural Machine Translation: Deeper Than Character Level|class=cs.CL|date=2018}}</ref> Machine translation is rarely still done with autoencoders, due to the availability of more effective ] networks. | Autoencoders have been applied to ], which is usually referred to as ] (NMT).<ref>{{cite arXiv |eprint=1409.1259|last1=Cho|first1=Kyunghyun|author2=Bart van Merrienboer|last3=Bahdanau|first3=Dzmitry|last4=Bengio|first4=Yoshua|title=On the Properties of Neural Machine Translation: Encoder-Decoder Approaches|class=cs.CL|date=2014}}</ref><ref>{{cite arXiv |eprint=1409.3215|last1=Sutskever|first1=Ilya|last2=Vinyals|first2=Oriol|last3=Le|first3=Quoc V.|title=Sequence to Sequence Learning with Neural Networks|class=cs.CL|date=2014}}</ref> Unlike traditional autoencoders, the output does not match the input - it is in another language. 
In NMT, texts are treated as sequences to be encoded into the learning procedure, while on the decoder side sequences in the target language(s) are generated. ]-specific autoencoders incorporate further ] features into the learning procedure, such as Chinese decomposition features.<ref>{{cite arXiv |eprint=1805.01565|last1=Han|first1=Lifeng|last2=Kuang|first2=Shaohui|title=Incorporating Chinese Radicals into Neural Machine Translation: Deeper Than Character Level|class=cs.CL|date=2018}}</ref> Machine translation is rarely still done with autoencoders, due to the availability of more effective ] networks. | ||
=== Communication Systems ===
Autoencoders are useful in communication systems because they can encode data into a representation that is more resilient to channel impairments, which is crucial for transmitting information while minimizing errors. In addition, AE-based systems can optimize end-to-end communication performance. This approach can address several limitations in designing communication systems, such as the inherent difficulty of accurately modeling the complex behavior of real-world channels.<ref>{{cite arXiv |eprint=2412.13843|last1=Alnaseri|first1=Omar|last2=Alzubaidi|first2=Laith|last3=Himeur|first3=Yassine|last4=Timmermann|first4=Jens|title=A Review on Deep Learning Autoencoder in the Design of Next-Generation Communication Systems|class=eess.SP|date=2024}}</ref>
==See also==
* ] | * ] | ||
* ] | * ] | ||
== Further reading == | |||
* {{cite book |last1=Bank |first1=Dor |title=Machine Learning for Data Science Handbook |last2=Koenigstein |first2=Noam |last3=Giryes |first3=Raja |publisher=Springer International Publishing |year=2023 |isbn=978-3-031-24627-2 |publication-place=Cham |chapter=Autoencoders |doi=10.1007/978-3-031-24628-9_16}} | |||
* {{Cite book |last1=Goodfellow |first1=Ian |title=Deep learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |date=2016 |publisher=The MIT press |isbn=978-0-262-03561-3 |series=Adaptive computation and machine learning |location=Cambridge, Mass |chapter=14. Autoencoders |chapter-url=https://www.deeplearningbook.org/contents/autoencoders.html}} | |||
==References==
{{Reflist|30em}}
{{Artificial intelligence navbox}}
{{Differentiable computing}}
{{Noise}}
Latest revision as of 19:38, 30 December 2024
Neural network that learns efficient data encoding in an unsupervised manner
Not to be confused with Autocoder or Autocode.
An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms.
Variants exist which aim to make the learned representations assume useful properties. Examples are regularized autoencoders (sparse, denoising and contractive autoencoders), which are effective in learning representations for subsequent classification tasks, and variational autoencoders, which can be used as generative models. Autoencoders are applied to many problems, including facial recognition, feature detection, anomaly detection, and learning the meaning of words. In terms of data synthesis, autoencoders can also be used to randomly generate new data that is similar to the input (training) data.
Mathematical principles
Definition
An autoencoder is defined by the following components:
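Two sets: the space of decoded messages <math>\mathcal X</math>; the space of encoded messages <math>\mathcal Z</math>. Typically <math>\mathcal X</math> and <math>\mathcal Z</math> are Euclidean spaces, that is, <math>\mathcal X = \mathbb{R}^m, \mathcal Z = \mathbb{R}^n</math> with <math>m > n</math>.
Two parametrized families of functions: the encoder family <math>E_\phi:\mathcal{X} \rightarrow \mathcal{Z}</math>, parametrized by <math>\phi</math>; the decoder family <math>D_\theta:\mathcal{Z} \rightarrow \mathcal{X}</math>, parametrized by <math>\theta</math>.
For any <math>x\in \mathcal X</math>, we usually write <math>z = E_\phi(x)</math>, and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any <math>z\in \mathcal Z</math>, we usually write <math>x' = D_\theta(z)</math>, and refer to it as the (decoded) message.
Usually, both the encoder and the decoder are defined as multilayer perceptrons (MLPs). For example, a one-layer-MLP encoder <math>E_\phi</math> is:
<math>E_\phi(x) = \sigma(Wx+b)</math>
where <math>\sigma</math> is an element-wise activation function, <math>W</math> is a "weight" matrix, and <math>b</math> is a "bias" vector.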
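A one-layer encoder of the kind just described (an element-wise activation applied to a weight matrix times the input, plus a bias vector) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the sigmoid activation, the function names and the numeric parameters are all assumptions chosen for the example.

```python
import math

def sigmoid(t):
    """Element-wise activation used in this sketch (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-t))

def encode(W, b, x):
    """One-layer MLP encoder: apply the activation element-wise to W x + b."""
    return [sigmoid(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

# Hypothetical parameters mapping a 3-dimensional message to a 2-dimensional code.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
b = [0.0, 0.1]

z = encode(W, b, [1.0, 2.0, 3.0])  # the code (latent vector) for this message
```

A decoder can be written the same way with its own weight matrix and bias, mapping the 2-dimensional code back to 3 dimensions.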
Training an autoencoder
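An autoencoder, by itself, is simply a tuple of two functions. To judge its quality, we need a task. A task is defined by a reference probability distribution <math>\mu_{ref}</math> over <math>\mathcal X</math>, and a "reconstruction quality" function <math>d: \mathcal X \times \mathcal X \to [0, \infty]</math>, such that <math>d(x, x')</math> measures how much <math>x'</math> differs from <math>x</math>.
With those, we can define the loss function for the autoencoder as
<math>L(\theta, \phi) := \mathbb E_{x\sim \mu_{ref}}[d(x, D_\theta(E_\phi(x)))]</math>
The optimal autoencoder for the given task <math>(\mu_{ref}, d)</math> is then <math>\arg\min_{\theta, \phi}L(\theta, \phi)</math>. The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder".
In most situations, the reference distribution is just the empirical distribution given by a dataset <math>\{x_1, ..., x_N\} \subset \mathcal X</math>, so that
<math>\mu_{ref} = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}</math>
where <math>\delta_{x_i}</math> is the Dirac measure, the quality function is just L2 loss: <math>d(x, x') = \|x - x'\|_2^2</math>, and <math>\|\cdot\|_2</math> is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization:
<math>\min_{\theta, \phi} L(\theta, \phi), \qquad \text{where } L(\theta, \phi) = \frac{1}{N}\sum_{i=1}^N \|x_i - D_\theta(E_\phi(x_i))\|_2^2</math>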
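To make the least-squares training problem concrete, the sketch below trains a deliberately tiny linear autoencoder by plain gradient descent. Everything specific here is an assumption made for illustration: the 2-dimensional toy dataset, the 1-dimensional code, the linear encoder and decoder without biases, the initial weights, the learning rate and the step count.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, v, data):
    """Mean squared reconstruction error (1/N) * sum ||x - v * (w . x)||^2."""
    total = 0.0
    for x in data:
        z = dot(w, x)                                  # encoder: 1-D code
        total += sum((vi * z - xi) ** 2 for vi, xi in zip(v, x))
    return total / len(data)

def train(data, steps=500, lr=0.05):
    """Gradient descent on the least-squares loss for a linear autoencoder."""
    w = [0.1, 0.1]   # encoder weights (hypothetical initial values)
    v = [0.1, 0.1]   # decoder weights (hypothetical initial values)
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_v = [0.0, 0.0]
        for x in data:
            z = dot(w, x)
            r = [vi * z - xi for vi, xi in zip(v, x)]  # reconstruction residual
            vr = dot(v, r)
            for j in range(2):
                grad_v[j] += 2.0 * r[j] * z / n        # dL/dv_j
                grad_w[j] += 2.0 * vr * x[j] / n       # dL/dw_j
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        v = [vj - lr * g for vj, g in zip(v, grad_v)]
    return w, v

# Toy dataset: points on the line x2 = 2 * x1, so a 1-D code suffices.
data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
initial = loss([0.1, 0.1], [0.1, 0.1], data)
w, v = train(data)
final = loss(w, v, data)
```

Because the toy data lie on a line, a one-dimensional code can reconstruct them essentially exactly, so `final` ends up far below `initial`. Practical autoencoders use nonlinear layers and an automatic-differentiation library instead of hand-written gradients.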
Interpretation
An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function <math>d</math>.
The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space <math>\mathcal Z</math> usually has fewer dimensions than the message space <math>\mathcal{X}</math>.
Such an autoencoder is called undercomplete. It can be interpreted as compressing the message, or reducing its dimensionality.
At the limit of an ideal undercomplete autoencoder, every possible code <math>z</math> in the code space is used to encode a message <math>x</math> that really appears in the distribution <math>\mu_{ref}</math>, and the decoder is also perfect: <math>D_\theta(E_\phi(x)) = x</math>. This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code <math>z</math> and obtaining <math>D_\theta(z)</math>, which is a message that really appears in the distribution <math>\mu_{ref}</math>.
If the code space <math>\mathcal Z</math> has dimension larger than (overcomplete), or equal to, the message space <math>\mathcal{X}</math>, or the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features.
In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below.
Variations
Variational autoencoder (VAE)
Main article: Variational autoencoder
Variational autoencoders (VAEs) belong to the families of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors.
Given an input dataset <math>x</math> characterized by an unknown probability function <math>P(x)</math> and a multivariate latent encoding vector <math>z</math>, the objective is to model the data as a distribution <math>p_\theta(x)</math>, with <math>\theta</math> defined as the set of the network parameters so that <math>p_\theta(x) = \int_{z}p_\theta(x,z)dz </math>.
Sparse autoencoder (SAE)
Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders (SAE) are variants of autoencoders, such that the codes <math>E_\phi(x)</math> for messages tend to be sparse codes, that is, <math>E_\phi(x)</math> is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time. Encouraging sparsity improves performance on classification tasks.
There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder.
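The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder:
<math>f_k(x_1, ..., x_n) = (x_1 b_1, ..., x_n b_n)</math>
where <math>b_i = 1</math> if <math>|x_i|</math> ranks in the top k, and 0 otherwise.
Backpropagating through <math>f_k</math> is simple: set gradient to 0 for <math>b_i = 0</math> entries, and keep gradient for <math>b_i = 1</math> entries. This is essentially a generalized ReLU function.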
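The k-sparse function can be sketched in a few lines. This is an illustrative version only: the function name is hypothetical, entries are ranked by absolute value (an assumption about how "top k" is measured), and ties are broken by position.

```python
def k_sparse(z, k):
    """Keep the k entries of z with the largest absolute value; zero the rest."""
    ranked = sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(z)]

sparse_code = k_sparse([0.1, -2.0, 0.5, 1.5], k=2)
```

During backpropagation the same mask applies: gradients pass through the kept entries unchanged and are zero for the clamped ones.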
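The other way is a relaxed version of the k-sparse autoencoder. Instead of forcing sparsity, we add a sparsity regularization loss, then optimize for
<math>\min_{\theta, \phi}L(\theta, \phi) + \lambda L_{sparse} (\theta, \phi)</math>
where <math>\lambda > 0</math> measures how much sparsity we want to enforce.
Let the autoencoder architecture have <math>K</math> layers. To define a sparsity regularization loss, we need a "desired" sparsity <math>\hat \rho_k</math> for each layer, a weight <math>w_k</math> for how much to enforce each sparsity, and a function <math>s: [0, 1]\times [0, 1] \to [0, \infty]</math> to measure how much two sparsities differ.
For each input <math>x</math>, let the actual sparsity of activation in each layer <math>k</math> be
<math>\rho_k(x) = \frac 1n \sum_{i=1}^n a_{k, i}(x)</math>
where <math>a_{k, i}(x)</math> is the activation in the <math>i</math>-th neuron of the <math>k</math>-th layer upon input <math>x</math>, and <math>n</math> is the number of neurons in that layer.
The sparsity loss upon input <math>x</math> for one layer is <math>s(\hat\rho_k, \rho_k(x))</math>, and the sparsity regularization loss for the entire autoencoder is the expected weighted sum of sparsity losses:
<math>L_{sparse}(\theta, \phi) = \mathbb E_{x\sim\mu_{ref}}\left[\sum_{k\in 1:K} w_k s(\hat\rho_k, \rho_k(x))\right]</math>
Typically, the function <math>s</math> is either the Kullback-Leibler (KL) divergence, as
<math>s(\hat\rho, \rho) = KL(\hat\rho \| \rho) = \hat\rho \log \frac{\hat\rho}{\rho}+(1- \hat\rho)\log \frac{1-\hat\rho}{1-\rho}</math>
or the L1 loss, as <math>s(\hat\rho, \rho) = |\hat\rho- \rho|</math>, or the L2 loss, as <math>s(\hat\rho, \rho) = |\hat\rho- \rho|^2</math>.
Alternatively, the sparsity regularization loss may be defined without reference to any "desired sparsity", but simply force as much sparsity as possible. In this case, one can define the sparsity regularization loss as
<math>L_{sparse}(\theta, \phi) = \mathbb E_{x\sim\mu_{ref}}\left[\sum_{k\in 1:K} w_k \|h_k\|\right]</math>
where <math>h_k</math> is the activation vector in the <math>k</math>-th layer of the autoencoder. The norm <math>\|\cdot\|</math> is usually the L1 norm (giving the L1 sparse autoencoder) or the L2 norm (giving the L2 sparse autoencoder).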
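The weighted sparsity penalty for a single input might be computed as in the sketch below. It assumes activations in [0, 1] (for example sigmoid outputs), uses the KL divergence between the desired and the actual sparsity as the comparison function, and clips probabilities by a small `eps` to avoid taking the logarithm of zero. All function and parameter names are hypothetical.

```python
import math

def layer_sparsity(activations):
    """Actual sparsity of one layer: the mean activation over its neurons."""
    return sum(activations) / len(activations)

def kl_sparsity(rho_hat, rho, eps=1e-12):
    """KL divergence between Bernoulli(rho_hat) (desired) and Bernoulli(rho) (actual)."""
    rho = min(max(rho, eps), 1.0 - eps)
    rho_hat = min(max(rho_hat, eps), 1.0 - eps)
    return (rho_hat * math.log(rho_hat / rho)
            + (1.0 - rho_hat) * math.log((1.0 - rho_hat) / (1.0 - rho)))

def sparsity_loss(layers, desired, weights):
    """Weighted sum over layers of the sparsity loss for one input."""
    return sum(w * kl_sparsity(r_hat, layer_sparsity(a))
               for a, r_hat, w in zip(layers, desired, weights))

# Two hypothetical layers: the first matches its desired sparsity, the second does not.
penalty = sparsity_loss(layers=[[0.05, 0.05], [0.9, 0.1]],
                        desired=[0.05, 0.05],
                        weights=[1.0, 2.0])
```

Averaging this quantity over the training inputs gives an estimate of the regularization term that is added, scaled by its coefficient, to the reconstruction loss.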
Denoising autoencoder (DAE)
Denoising autoencoders (DAE) try to achieve a good representation by changing the reconstruction criterion.
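A DAE, originally called a "robust autoassociative network" by Mark A. Kramer, is trained by intentionally corrupting the inputs of a standard autoencoder during training. A noise process is defined by a probability distribution <math>\mu_T</math> over functions <math>T:\mathcal X \to \mathcal X</math>. That is, the function <math>T</math> takes a message <math>x\in \mathcal X</math>, and corrupts it to a noisy version <math>T(x)</math>. The function <math>T</math> is selected randomly, with a probability distribution <math>\mu_T</math>.
Given a task <math>(\mu_{ref}, d)</math>, the problem of training a DAE is the optimization problem:
<math>\min_{\theta, \phi}L(\theta, \phi) = \mathbb E_{x\sim \mu_{ref}, T\sim\mu_T}[d(x, (D_\theta\circ E_\phi \circ T)(x))]</math>
That is, the optimal DAE should take any noisy message and attempt to recover the original message without noise, thus the name "denoising".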
Usually, the noise process is applied only during training and testing, not during downstream use.
The use of DAE depends on two assumptions:
- There exist representations to the messages that are relatively stable and robust to the type of noise we are likely to encounter;
- The said representations capture structures in the input distribution that are useful for our purposes.
Example noise processes include:
- additive isotropic Gaussian noise,
- masking noise (a fraction of the input is randomly chosen and set to 0)
- salt-and-pepper noise (a fraction of the input is randomly chosen and randomly set to its minimum or maximum value).
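The three example noise processes listed above can be sketched as simple corruption functions. This is an illustration under stated assumptions: inputs are lists of floats, the function names are hypothetical, and the minimum and maximum values for salt-and-pepper noise default to 0 and 1.

```python
import random

def gaussian_noise(x, sigma, rng):
    """Additive isotropic Gaussian noise with standard deviation sigma."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def masking_noise(x, fraction, rng):
    """Set a randomly chosen fraction of the entries to 0."""
    chosen = set(rng.sample(range(len(x)), round(fraction * len(x))))
    return [0.0 if i in chosen else xi for i, xi in enumerate(x)]

def salt_and_pepper_noise(x, fraction, rng, lo=0.0, hi=1.0):
    """Set a randomly chosen fraction of the entries to lo or hi, picked at random."""
    chosen = set(rng.sample(range(len(x)), round(fraction * len(x))))
    return [rng.choice((lo, hi)) if i in chosen else xi
            for i, xi in enumerate(x)]

rng = random.Random(0)   # fixed seed so the sketch is reproducible
x = [0.2, 0.4, 0.6, 0.8]
noisy = masking_noise(x, 0.5, rng)
```

During training, the corrupted vector is fed to the encoder while the loss is still measured against the clean `x`.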
Contractive autoencoder (CAE)
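A contractive autoencoder (CAE) adds the contractive regularization loss to the standard autoencoder loss:
<math>\min_{\theta, \phi}L(\theta, \phi) + \lambda L_{cont} (\theta, \phi)</math>
where <math>\lambda > 0</math> measures how much contractive-ness we want to enforce. The contractive regularization loss itself is defined as the expected squared Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input:
<math>L_{cont}(\theta, \phi) = \mathbb E_{x\sim \mu_{ref}} \|\nabla_x E_\phi(x) \|_F^2</math>
To understand what <math>L_{cont}</math> measures, note the fact
<math>\|E_\phi(x + \delta x) - E_\phi(x)\|_2 \leq \|\nabla_x E_\phi(x) \|_F \|\delta x\|_2</math>
(to first order) for any message <math>x\in \mathcal X</math>, and small variation <math>\delta x</math> in it. Thus, if <math>\|\nabla_x E_\phi(x) \|_F^2</math> is small, it means that a small neighborhood of the message maps to a small neighborhood of its code. This is a desired property, as it means small variation in the message leads to small, perhaps even zero, variation in its code, like how two pictures may look the same even if they are not exactly the same.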
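For a one-layer sigmoid encoder the Jacobian has a closed form, so the contractive penalty at a single input can be computed without automatic differentiation. The sketch below assumes that particular encoder and the squared Frobenius norm. The function names and parameters are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def encode(W, b, x):
    """One-layer sigmoid encoder z = sigmoid(W x + b)."""
    return [sigmoid(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def contractive_penalty(W, b, x):
    """Squared Frobenius norm of the encoder Jacobian dz/dx at x.

    For z_i = sigmoid(W_i . x + b_i) the Jacobian entry is
    J_ij = z_i * (1 - z_i) * W_ij, which gives the closed form below.
    """
    z = encode(W, b, x)
    return sum((zi * (1.0 - zi)) ** 2 * sum(w * w for w in row)
               for zi, row in zip(z, W))

# Hypothetical parameters and input.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
b = [0.0, 0.1]
x = [1.0, 2.0, 3.0]
penalty = contractive_penalty(W, b, x)
```

Averaging this value over the training inputs estimates the contractive regularization term. Deeper encoders generally need an automatic-differentiation library to obtain the Jacobian.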
The DAE can be understood as an infinitesimal limit of CAE: in the limit of small Gaussian input noise, DAEs make the reconstruction function resist small but finite-sized input perturbations, while CAEs make the extracted features resist infinitesimal input perturbations.
Minimum description length autoencoder (MDL-AE)
A minimum description length autoencoder (MDL-AE) is an advanced variation of the traditional autoencoder, which leverages principles from information theory, specifically the Minimum Description Length (MDL) principle. The MDL principle posits that the best model for a dataset is the one that provides the shortest combined encoding of the model and the data. In the context of autoencoders, this principle is applied to ensure that the learned representation is not only compact but also interpretable and efficient for reconstruction.
The MDL-AE seeks to minimize the total description length of the data, which includes the size of the latent representation (code length) and the error in reconstructing the original data. The objective can be expressed as , where represents the length of the compressed latent representation and denotes the reconstruction error.
Concrete autoencoder (CAE)
The concrete autoencoder is designed for discrete feature selection. A concrete autoencoder forces the latent space to consist only of a user-specified number of features. The concrete autoencoder uses a continuous relaxation of the categorical distribution to allow gradients to pass through the feature selector layer, which makes it possible to use standard backpropagation to learn an optimal subset of input features that minimize reconstruction loss.
Advantages of depth
Autoencoders are often trained with a single-layer encoder and a single-layer decoder, but using many-layered (deep) encoders and decoders offers many advantages.
- Depth can exponentially reduce the computational cost of representing some functions.
- Depth can exponentially decrease the amount of training data needed to learn some functions.
- Experimentally, deep autoencoders yield better compression compared to shallow or linear autoencoders.
Training
Geoffrey Hinton developed the deep belief network technique for training many-layered deep autoencoders. His method involves treating each neighboring set of two layers as a restricted Boltzmann machine so that pretraining approximates a good solution, then using backpropagation to fine-tune the results.
Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders. A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method. However, their experiments showed that the success of joint training depends heavily on the regularization strategies adopted.
History
(Oja, 1982) noted that PCA is equivalent to a neural network with one hidden layer with identity activation function. In the language of autoencoding, the input-to-hidden module is the encoder, and the hidden-to-output module is the decoder. Subsequently, (Baldi and Hornik, 1989) and (Kramer, 1991) generalized PCA to autoencoders, which they termed "nonlinear PCA".
Immediately after the resurgence of neural networks in the 1980s, it was suggested in 1986 that a neural network be put in "auto-association mode". This was then implemented in (Harrison, 1987) and (Elman, Zipser, 1988) for speech and in (Cottrell, Munro, Zipser, 1987) for images. In (Hinton, Salakhutdinov, 2006), deep belief networks were developed. These train a pair of restricted Boltzmann machines as an encoder-decoder pair, then train another pair on the latent representation of the first pair, and so on.
The first applications of autoencoders date to the early 1990s. Their most traditional application was dimensionality reduction or feature learning, but the concept became widely used for learning generative models of data. Some of the most powerful AIs in the 2020s involved autoencoder modules as a component of larger AI systems, such as VAE in Stable Diffusion, discrete VAE in Transformer-based image generators like DALL-E 1, etc.
During the early days, when the terminology was uncertain, the autoencoder was also called identity mapping, auto-associating, self-supervised backpropagation, or Diabolo network.
Applications
The two main applications of autoencoders are dimensionality reduction and information retrieval (or associative memory), but modern variations have been applied to other tasks.
Dimensionality reduction
Dimensionality reduction was one of the first deep learning applications.
In his 2006 study, Hinton pretrained a multi-layer autoencoder with a stack of RBMs and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers until hitting a bottleneck of 30 neurons. The resulting 30 dimensions of the code yielded a smaller reconstruction error compared to the first 30 components of a principal component analysis (PCA), and learned a representation that was qualitatively easier to interpret, clearly separating data clusters.
Representing data in a lower-dimensional space can improve performance on tasks such as classification. Indeed, the hallmark of dimensionality reduction is to place semantically related examples near each other.
Principal component analysis
If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to principal component analysis (PCA). The weights of an autoencoder with a single hidden layer of size <math>p</math> (where <math>p</math> is less than the size of the input) span the same vector subspace as the one spanned by the first <math>p</math> principal components, and the output of the autoencoder is an orthogonal projection onto this subspace. The autoencoder weights are not equal to the principal components, and are generally not orthogonal, yet the principal components may be recovered from them using the singular value decomposition.
However, the potential of autoencoders resides in their non-linearity, allowing the model to learn more powerful generalizations compared to PCA, and to reconstruct the input with significantly lower information loss.
Information retrieval and Search engine optimization
Information retrieval benefits particularly from dimensionality reduction in that search can become more efficient in certain kinds of low dimensional spaces. Autoencoders were indeed applied to semantic hashing, proposed by Salakhutdinov and Hinton in 2007. By training the algorithm to produce a low-dimensional binary code, all database entries could be stored in a hash table mapping binary code vectors to entries. This table would then support information retrieval by returning all entries with the same binary code as the query, or slightly less similar entries by flipping some bits from the query encoding.
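The hash-table lookup described above can be sketched as follows. The binary codes here are made up and stand in for the output of a trained encoder. The function names, the document labels and the search radius are likewise assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def build_table(codes):
    """Map each binary code (a tuple of 0/1 bits) to the entries that share it."""
    table = defaultdict(list)
    for entry, code in codes.items():
        table[tuple(code)].append(entry)
    return table

def lookup(table, query, radius=1):
    """Return entries whose code differs from the query in at most `radius` bits."""
    query = tuple(query)
    results = []
    for r in range(radius + 1):
        # Probe every code obtained by flipping exactly r bits of the query.
        for flipped in combinations(range(len(query)), r):
            probe = tuple(bit ^ 1 if i in flipped else bit
                          for i, bit in enumerate(query))
            results.extend(table.get(probe, []))
    return results

# Hypothetical 4-bit codes standing in for encoder outputs.
codes = {"doc_a": (1, 0, 1, 1), "doc_b": (1, 0, 1, 0), "doc_c": (0, 1, 0, 0)}
table = build_table(codes)
exact = lookup(table, (1, 0, 1, 1), radius=0)
near = lookup(table, (1, 0, 1, 1), radius=1)
```

The number of probes grows quickly with the radius and the code length, so this style of lookup is practical only for short codes and small radii.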
The encoder-decoder architecture, often used in natural language processing and neural networks, can be scientifically applied in the field of SEO (Search Engine Optimization) in various ways:
- Text Processing: By using an autoencoder, it's possible to compress the text of web pages into a more compact vector representation. This can help reduce page loading times and improve indexing by search engines.
- Noise Reduction: Autoencoders can be used to remove noise from the textual data of web pages. This can lead to a better understanding of the content by search engines, thereby enhancing ranking in search engine result pages.
- Meta Tag and Snippet Generation: Autoencoders can be trained to automatically generate meta tags, snippets, and descriptions for web pages using the page content. This can optimize the presentation in search results, increasing the Click-Through Rate (CTR).
- Content Clustering: Using an autoencoder, web pages with similar content can be automatically grouped together. This can help organize the website logically and improve navigation, potentially positively affecting user experience and search engine rankings.
- Generation of Related Content: An autoencoder can be employed to generate content related to what is already present on the site. This can enhance the website's attractiveness to search engines and provide users with additional relevant information.
- Keyword Detection: Autoencoders can be trained to identify keywords and important concepts within the content of web pages. This can assist in optimizing keyword usage for better indexing.
- Semantic Search: By using autoencoder techniques, semantic representation models of content can be created. These models can be used to enhance search engines' understanding of the themes covered in web pages.
In essence, the encoder-decoder architecture or autoencoders can be leveraged in SEO to optimize web page content, improve their indexing, and enhance their appeal to both search engines and users.
Anomaly detection
Another application for autoencoders is anomaly detection. By learning to replicate the most salient features in the training data under some of the constraints described previously, the model is encouraged to learn to precisely reproduce the most frequently observed characteristics. When facing anomalies, the model's reconstruction performance should degrade. In most cases, only data with normal instances are used to train the autoencoder; in others, the frequency of anomalies is small compared to the observation set, so that their contribution to the learned representation can be ignored. After training, the autoencoder will accurately reconstruct "normal" data, while failing to do so with unfamiliar anomalous data. Reconstruction error (the error between the original data and its low-dimensional reconstruction) is used as an anomaly score to detect anomalies.
Recent literature has however shown that certain autoencoding models can, counterintuitively, be very good at reconstructing anomalous examples and consequently not able to reliably perform anomaly detection.
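Scoring by reconstruction error can be illustrated with a small sketch. It assumes NumPy and synthetic data, and it uses the optimal linear autoencoder (equivalent to PCA) as a stand-in for a trained nonlinear model. The 99th-percentile threshold and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" training data: points close to a 2-D subspace of a 6-D space.
basis = rng.normal(size=(2, 6))
X_train = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 6))
mean = X_train.mean(axis=0)

# Stand-in for a trained autoencoder: the optimal linear one with 2 hidden units.
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
V = Vt[:2].T

def reconstruction_error(X):
    """Anomaly score: squared error between each row and its reconstruction."""
    Z = (X - mean) @ V          # encode to 2 dimensions
    X_hat = Z @ V.T + mean      # decode back to 6 dimensions
    return np.sum((X - X_hat) ** 2, axis=1)

# Threshold taken from the training scores (the percentile is an assumption).
threshold = np.percentile(reconstruction_error(X_train), 99)

# Fresh normal samples and samples that are not confined to the subspace.
X_normal = rng.normal(size=(100, 2)) @ basis + 0.05 * rng.normal(size=(100, 6))
X_anomalous = 3.0 * rng.normal(size=(100, 6))

flag_normal = reconstruction_error(X_normal) > threshold
flag_anomalous = reconstruction_error(X_anomalous) > threshold
print(flag_normal.mean(), flag_anomalous.mean())
```

In this toy setting the anomalies lie far from the learned subspace, so the score separates them cleanly. As noted above, real models may reconstruct some anomalous inputs well, in which case this score is not reliable.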
Image processing
The characteristics of autoencoders are useful in image processing.
One example can be found in lossy image compression, where autoencoders outperformed other approaches and proved competitive against JPEG 2000.
Another useful application of autoencoders in image preprocessing is image denoising.
Autoencoders found use in more demanding contexts such as medical imaging where they have been used for image denoising as well as super-resolution. In image-assisted diagnosis, experiments have applied autoencoders for breast cancer detection and for modelling the relation between the cognitive decline of Alzheimer's disease and the latent features of an autoencoder trained with MRI.
Drug discovery
In 2019, molecules generated with variational autoencoders were validated experimentally in mice.
Popularity prediction
Recently, a stacked autoencoder framework produced promising results in predicting popularity of social media posts, which is helpful for online advertising strategies.
Machine translation
Autoencoders have been applied to machine translation, an approach usually referred to as neural machine translation (NMT). Unlike traditional autoencoders, the output does not match the input; it is in another language. In NMT, texts are treated as sequences to be encoded into the learning procedure, while on the decoder side sequences in the target language(s) are generated. Language-specific autoencoders incorporate further linguistic features into the learning procedure, such as Chinese decomposition features. Machine translation is now rarely done with autoencoders, due to the availability of more effective transformer networks.
Communication systems
Autoencoders are useful in communication systems because they can encode data into a representation that is more resilient to channel impairments, which is crucial for transmitting information while minimizing errors. In addition, autoencoder-based systems can optimize end-to-end communication performance. This approach can address several limitations of conventional communication system design, such as the inherent difficulty of accurately modeling the complex behavior of real-world channels.
See also
Further reading
- Bank, Dor; Koenigstein, Noam; Giryes, Raja (2023). "Autoencoders". Machine Learning for Data Science Handbook. Cham: Springer International Publishing. doi:10.1007/978-3-031-24628-9_16. ISBN 978-3-031-24627-2.
- Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "14. Autoencoders". Deep learning. Adaptive computation and machine learning. Cambridge, Mass: The MIT press. ISBN 978-0-262-03561-3.
References
- Bank, Dor; Koenigstein, Noam; Giryes, Raja (2023). "Autoencoders". In Rokach, Lior; Maimon, Oded; Shmueli, Erez (eds.). Machine learning for data science handbook. pp. 353–374. doi:10.1007/978-3-031-24628-9_16. ISBN 978-3-031-24627-2.
- ^ Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. MIT Press. ISBN 978-0262035613.
- ^ Vincent, Pascal; Larochelle, Hugo (2010). "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion". Journal of Machine Learning Research. 11: 3371–3408.
- Welling, Max; Kingma, Diederik P. (2019). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4): 307–392. arXiv:1906.02691. Bibcode:2019arXiv190602691K. doi:10.1561/2200000056. S2CID 174802445.
- Hinton GE, Krizhevsky A, Wang SD. Transforming auto-encoders. In International Conference on Artificial Neural Networks 2011 Jun 14 (pp. 44-51). Springer, Berlin, Heidelberg.
- ^ Géron, Aurélien (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. Canada: O’Reilly Media, Inc. pp. 739–740.
- Liou, Cheng-Yuan; Huang, Jau-Chi; Yang, Wen-Chie (2008). "Modeling word perception using the Elman network". Neurocomputing. 71 (16–18): 3150. doi:10.1016/j.neucom.2008.04.030.
- Liou, Cheng-Yuan; Cheng, Wei-Chen; Liou, Jiun-Wei; Liou, Daw-Ran (2014). "Autoencoder for words". Neurocomputing. 139: 84–96. doi:10.1016/j.neucom.2013.09.055.
- ^ Kramer, Mark A. (1991). "Nonlinear principal component analysis using autoassociative neural networks" (PDF). AIChE Journal. 37 (2): 233–243. Bibcode:1991AIChE..37..233K. doi:10.1002/aic.690370209.
- ^ Hinton, G. E.; Salakhutdinov, R.R. (2006-07-28). "Reducing the Dimensionality of Data with Neural Networks". Science. 313 (5786): 504–507. Bibcode:2006Sci...313..504H. doi:10.1126/science.1127647. PMID 16873662. S2CID 1658773.
- ^ Bengio, Y. (2009). "Learning Deep Architectures for AI" (PDF). Foundations and Trends in Machine Learning. 2 (8): 1795–7. CiteSeerX 10.1.1.701.9550. doi:10.1561/2200000006. PMID 23946944. S2CID 207178999.
- Domingos, Pedro (2015). "4". The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books. "Deeper into the Brain" subsection. ISBN 978-046506192-1.
- ^ Makhzani, Alireza; Frey, Brendan (2013). "K-Sparse Autoencoders". arXiv:1312.5663 .
- ^ Ng, A. (2011). Sparse autoencoder. CS294A Lecture notes, 72(2011), 1-19.
- Nair, Vinod; Hinton, Geoffrey E. (2009). "3D Object Recognition with Deep Belief Nets". Proceedings of the 22nd International Conference on Neural Information Processing Systems. NIPS'09. USA: Curran Associates Inc.: 1339–1347. ISBN 9781615679119.
- Zeng, Nianyin; Zhang, Hong; Song, Baoye; Liu, Weibo; Li, Yurong; Dobaie, Abdullah M. (2018-01-17). "Facial expression recognition via learning deep sparse autoencoders". Neurocomputing. 273: 643–649. doi:10.1016/j.neucom.2017.08.043. ISSN 0925-2312.
- ^ Kramer, M. A. (1992-04-01). "Autoassociative neural networks". Computers & Chemical Engineering. Neutral network applications in chemical engineering. 16 (4): 313–328. doi:10.1016/0098-1354(92)80051-A. ISSN 0098-1354.
- ^ Hinton, Geoffrey E; Zemel, Richard (1993). "Autoencoders, Minimum Description Length and Helmholtz Free Energy". Advances in Neural Information Processing Systems. 6. Morgan-Kaufmann.
- Abid, Abubakar; Balin, Muhammad Fatih; Zou, James (2019-01-27). "Concrete Autoencoders for Differentiable Feature Selection and Reconstruction". arXiv:1901.09346 .
- ^ Zhou, Yingbo; Arpit, Devansh; Nwogu, Ifeoma; Govindaraju, Venu (2014). "Is Joint Training Better for Deep Auto-Encoders?". arXiv:1405.1380 .
- R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” in AISTATS, 2009, pp. 448–455.
- Oja, Erkki (1982-11-01). "Simplified neuron model as a principal component analyzer". Journal of Mathematical Biology. 15 (3): 267–273. doi:10.1007/BF00275687. ISSN 1432-1416. PMID 7153672.
- ^ Baldi, Pierre; Hornik, Kurt (1989-01-01). "Neural networks and principal component analysis: Learning from examples without local minima". Neural Networks. 2 (1): 53–58. doi:10.1016/0893-6080(89)90014-2. ISSN 0893-6080.
- Rumelhart, David E.; McClelland, James L.; AU (1986). "2. A General Framework for Parallel Distributed Processing". Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations. The MIT Press. doi:10.7551/mitpress/5236.001.0001. ISBN 978-0-262-29140-8.
- Harrison TD (1987) A Connectionist framework for continuous speech recognition. Cambridge University Ph. D. dissertation
- Elman, Jeffrey L.; Zipser, David (1988-04-01). "Learning the hidden structure of speech". The Journal of the Acoustical Society of America. 83 (4): 1615–1626. Bibcode:1988ASAJ...83.1615E. doi:10.1121/1.395916. ISSN 0001-4966. PMID 3372872.
- Cottrell, Garrison W.; Munro, Paul; Zipser, David (1987). "Learning Internal Representation From Gray-Scale Images: An Example of Extensional Programming". Proceedings of the Annual Meeting of the Cognitive Science Society. 9.
- ^ Bourlard, H.; Kamp, Y. (1988). "Auto-association by multilayer perceptrons and singular value decomposition". Biological Cybernetics. 59 (4–5): 291–294. doi:10.1007/BF00332918. PMID 3196773. S2CID 206775335.
- Hinton, G. E.; Salakhutdinov, R.R. (2006-07-28). "Reducing the Dimensionality of Data with Neural Networks". Science. 313 (5786): 504–507. Bibcode:2006Sci...313..504H. doi:10.1126/science.1127647. PMID 16873662. S2CID 1658773.
- Hinton G (2009). "Deep belief networks". Scholarpedia. 4 (5): 5947. Bibcode:2009SchpJ...4.5947H. doi:10.4249/scholarpedia.5947.
- Schmidhuber, Jürgen (January 2015). "Deep learning in neural networks: An overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
- Diederik P Kingma; Welling, Max (2013). "Auto-Encoding Variational Bayes". arXiv:1312.6114 .
- Generating Faces with Torch, Boesen A., Larsen L. and Sonderby S.K., 2015 torch.ch/blog/2015/11/13/gan.html
- Ackley, D; Hinton, G; Sejnowski, T (March 1985). "A learning algorithm for boltzmann machines". Cognitive Science. 9 (1): 147–169. doi:10.1016/S0364-0213(85)80012-4.
- Schwenk, Holger; Bengio, Yoshua (1997). "Training Methods for Adaptive Boosting of Neural Networks". Advances in Neural Information Processing Systems. 10. MIT Press.
- ^ "Fashion MNIST". GitHub. 2019-07-12.
- ^ Salakhutdinov, Ruslan; Hinton, Geoffrey (2009-07-01). "Semantic hashing". International Journal of Approximate Reasoning. Special Section on Graphical Models and Information Retrieval. 50 (7): 969–978. doi:10.1016/j.ijar.2008.11.006. ISSN 0888-613X.
- Chicco, Davide; Sadowski, Peter; Baldi, Pierre (2014). "Deep autoencoder neural networks for gene ontology annotation predictions". Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '14. p. 533. doi:10.1145/2649387.2649442. hdl:11311/964622. ISBN 9781450328944. S2CID 207217210.
- Plaut, E (2018). "From Principal Subspaces to Principal Components with Linear Autoencoders". arXiv:1804.10253 .
- Morales-Forero, A.; Bassetto, S. (December 2019). "Case Study: A Semi-Supervised Methodology for Anomaly Detection and Diagnosis". 2019 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM). Macao, Macao: IEEE. pp. 1031–1037. doi:10.1109/IEEM44572.2019.8978509. ISBN 978-1-7281-3804-6. S2CID 211027131.
- Sakurada, Mayu; Yairi, Takehisa (December 2014). "Anomaly Detection Using Autoencoders with Nonlinear Dimensionality Reduction". Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis. Gold Coast, Australia QLD, Australia: ACM Press. pp. 4–11. doi:10.1145/2689746.2689747. ISBN 978-1-4503-3159-3. S2CID 14613395.
- ^ An, J., & Cho, S. (2015). Variational Autoencoder based Anomaly Detection using Reconstruction Probability. Special Lecture on IE, 2, 1-18.
- Zhou, Chong; Paffenroth, Randy C. (2017-08-04). "Anomaly Detection with Robust Deep Autoencoders". Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. pp. 665–674. doi:10.1145/3097983.3098052. ISBN 978-1-4503-4887-4. S2CID 207557733.
- Ribeiro, Manassés; Lazzaretti, André Eugênio; Lopes, Heitor Silvério (2018). "A study of deep convolutional auto-encoders for anomaly detection in videos". Pattern Recognition Letters. 105: 13–22. Bibcode:2018PaReL.105...13R. doi:10.1016/j.patrec.2017.07.016.
- Nalisnick, Eric; Matsukawa, Akihiro; Teh, Yee Whye; Gorur, Dilan; Lakshminarayanan, Balaji (2019-02-24). "Do Deep Generative Models Know What They Don't Know?". arXiv:1810.09136 .
- Xiao, Zhisheng; Yan, Qing; Amit, Yali (2020). "Likelihood Regret: An Out-of-Distribution Detection Score For Variational Auto-encoder". Advances in Neural Information Processing Systems. 33. arXiv:2003.02977.
- Theis, Lucas; Shi, Wenzhe; Cunningham, Andrew; Huszár, Ferenc (2017). "Lossy Image Compression with Compressive Autoencoders". arXiv:1703.00395 .
- Balle, J; Laparra, V; Simoncelli, EP (April 2017). "End-to-end optimized image compression". International Conference on Learning Representations. arXiv:1611.01704.
- Cho, K. (2013, February). Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In International Conference on Machine Learning (pp. 432-440).
- Cho, Kyunghyun (2013). "Boltzmann Machines and Denoising Autoencoders for Image Denoising". arXiv:1301.3468 .
- Buades, A.; Coll, B.; Morel, J. M. (2005). "A Review of Image Denoising Algorithms, with a New One". Multiscale Modeling & Simulation. 4 (2): 490–530. doi:10.1137/040616024. S2CID 218466166.
- Gondara, Lovedeep (December 2016). "Medical Image Denoising Using Convolutional Denoising Autoencoders". 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). Barcelona, Spain: IEEE. pp. 241–246. arXiv:1608.04667. Bibcode:2016arXiv160804667G. doi:10.1109/ICDMW.2016.0041. ISBN 9781509059102. S2CID 14354973.
- Zeng, Kun; Yu, Jun; Wang, Ruxin; Li, Cuihua; Tao, Dacheng (January 2017). "Coupled Deep Autoencoder for Single Image Super-Resolution". IEEE Transactions on Cybernetics. 47 (1): 27–37. doi:10.1109/TCYB.2015.2501373. ISSN 2168-2267. PMID 26625442. S2CID 20787612.
- Tzu-Hsi, Song; Sanchez, Victor; Hesham, EIDaly; Nasir M., Rajpoot (2017). "Hybrid deep autoencoder with Curvature Gaussian for detection of various types of cells in bone marrow trephine biopsy images". 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017). pp. 1040–1043. doi:10.1109/ISBI.2017.7950694. ISBN 978-1-5090-1172-8. S2CID 7433130.
- Xu, Jun; Xiang, Lei; Liu, Qingshan; Gilmore, Hannah; Wu, Jianzhong; Tang, Jinghai; Madabhushi, Anant (January 2016). "Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images". IEEE Transactions on Medical Imaging. 35 (1): 119–130. doi:10.1109/TMI.2015.2458702. PMC 4729702. PMID 26208307.
- Martinez-Murcia, Francisco J.; Ortiz, Andres; Gorriz, Juan M.; Ramirez, Javier; Castillo-Barnes, Diego (2020). "Studying the Manifold Structure of Alzheimer's Disease: A Deep Learning Approach Using Convolutional Autoencoders". IEEE Journal of Biomedical and Health Informatics. 24 (1): 17–26. doi:10.1109/JBHI.2019.2914970. hdl:10630/28806. PMID 31217131. S2CID 195187846.
- Zhavoronkov, Alex (2019). "Deep learning enables rapid identification of potent DDR1 kinase inhibitors". Nature Biotechnology. 37 (9): 1038–1040. doi:10.1038/s41587-019-0224-x. PMID 31477924. S2CID 201716327.
- Gregory, Barber. "A Molecule Designed By AI Exhibits 'Druglike' Qualities". Wired.
- De, Shaunak; Maity, Abhishek; Goel, Vritti; Shitole, Sanjay; Bhattacharya, Avik (2017). "Predicting the popularity of instagram posts for a lifestyle magazine using deep learning". 2017 2nd IEEE International Conference on Communication Systems, Computing and IT Applications (CSCITA). pp. 174–177. doi:10.1109/CSCITA.2017.8066548. ISBN 978-1-5090-4381-1. S2CID 35350962.
- Cho, Kyunghyun; Bart van Merrienboer; Bahdanau, Dzmitry; Bengio, Yoshua (2014). "On the Properties of Neural Machine Translation: Encoder-Decoder Approaches". arXiv:1409.1259 .
- Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V. (2014). "Sequence to Sequence Learning with Neural Networks". arXiv:1409.3215 .
- Han, Lifeng; Kuang, Shaohui (2018). "Incorporating Chinese Radicals into Neural Machine Translation: Deeper Than Character Level". arXiv:1805.01565 .
- Alnaseri, Omar; Alzubaidi, Laith; Himeur, Yassine; Timmermann, Jens (2024). "A Review on Deep Learning Autoencoder in the Design of Next-Generation Communication Systems". arXiv:2412.13843 .