Misplaced Pages

Restricted Boltzmann machine: Difference between revisions

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.
Browse history interactively← Previous editNext edit →Content deleted Content addedVisualWikitext
Revision as of 21:15, 30 April 2016 edit81.153.147.173 (talk)No edit summary← Previous edit Revision as of 21:17, 30 April 2016 edit undo81.153.147.173 (talk)No edit summaryNext edit →
Line 35: Line 35:
They can be trained in either ] or ] ways, depending on the task. They can be trained in either ] or ] ways, depending on the task.


As their name implies, RBMs are a variant of ]s, with the restriction that their neurons must form a ]: As their name implies, RBMs are a variant of ]s, with the restriction that their ] must form a ]:
a pair of nodes from each of the two groups of units, commonly referred to as the "visible" and "hidden" units respectively, may have a symmetric connection between them, and there are no connections between nodes within a group. By contrast, "unrestricted" Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the ] '''contrastive divergence''' algorithm.<ref name="oncd">Miguel Á. Carreira-Perpiñán and Geoffrey Hinton (2005). On contrastive divergence learning. ''Artificial Intelligence and Statistics''.</ref> a pair of nodes from each of the two groups of units, commonly referred to as the "visible" and "hidden" units respectively, may have a symmetric connection between them, and there are no connections between nodes within a group. By contrast, "unrestricted" Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the ] '''contrastive divergence''' algorithm.<ref name="oncd">Miguel Á. Carreira-Perpiñán and Geoffrey Hinton (2005). On contrastive divergence learning. ''Artificial Intelligence and Statistics''.</ref>



Revision as of 21:17, 30 April 2016

Part of a series on
Machine learning
and data mining
Paradigms
Problems
Supervised learning
(classification • regression)
Clustering
Dimensionality reduction
Structured prediction
Anomaly detection
Artificial neural network
Reinforcement learning
Learning with humans
Model diagnostics
Mathematical foundations
Journals and conferences
Related articles
Diagram of a restricted Boltzmann machine with three visible units and four hidden units (no bias units).

A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs.

RBMs were initially invented under the name Harmonium by Paul Smolensky in 1986, but only rose to prominence after Geoffrey Hinton and collaborators invented fast learning algorithms for them in the mid-2000s. RBMs have found applications in dimensionality reduction, classification, collaborative filtering, feature learning and topic modelling. They can be trained in either supervised or unsupervised ways, depending on the task.

As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes from each of the two groups of units, commonly referred to as the "visible" and "hidden" units respectively, may have a symmetric connection between them, and there are no connections between nodes within a group. By contrast, "unrestricted" Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.

Restricted Boltzmann machines can also be used in deep learning networks. In particular, deep belief networks can be formed by "stacking" RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.

Structure

The standard type of RBM has binary-valued (Boolean/Bernoulli) hidden and visible units, and consists of a matrix of weights W = ( w i , j ) {\displaystyle W=(w_{i,j})} (size m×n) associated with the connection between hidden unit h j {\displaystyle h_{j}} and visible unit v i {\displaystyle v_{i}} , as well as bias weights (offsets) a i {\displaystyle a_{i}} for the visible units and b j {\displaystyle b_{j}} for the hidden units. Given these, the energy of a configuration (pair of boolean vectors) (v,h) is defined as

E ( v , h ) = i a i v i j b j h j i j v i w i , j h j {\displaystyle E(v,h)=-\sum _{i}a_{i}v_{i}-\sum _{j}b_{j}h_{j}-\sum _{i}\sum _{j}v_{i}w_{i,j}h_{j}}

or, in matrix notation,

E ( v , h ) = a T v b T h v T W h {\displaystyle E(v,h)=-a^{\mathrm {T} }v-b^{\mathrm {T} }h-v^{\mathrm {T} }Wh}

This energy function is analogous to that of a Hopfield network. As in general Boltzmann machines, probability distributions over hidden and/or visible vectors are defined in terms of the energy function:

P ( v , h ) = 1 Z e E ( v , h ) {\displaystyle P(v,h)={\frac {1}{Z}}e^{-E(v,h)}}

where Z {\displaystyle Z} is a partition function defined as the sum of e E ( v , h ) {\displaystyle e^{-E(v,h)}} over all possible configurations (in other words, just a normalizing constant to ensure the probability distribution sums to 1). Similarly, the (marginal) probability of a visible (input) vector of booleans is the sum over all possible hidden layer configurations:

P ( v ) = 1 Z h e E ( v , h ) {\displaystyle P(v)={\frac {1}{Z}}\sum _{h}e^{-E(v,h)}}

Since the RBM has the shape of a bipartite graph, with no intra-layer connections, the hidden unit activations are mutually independent given the visible unit activations and conversely, the visible unit activations are mutually independent given the hidden unit activations. That is, for m {\displaystyle m} visible units and n {\displaystyle n} hidden units, the conditional probability of a configuration of the visible units v, given a configuration of the hidden units h, is

P ( v | h ) = i = 1 m P ( v i | h ) {\displaystyle P(v|h)=\prod _{i=1}^{m}P(v_{i}|h)} .

Conversely, the conditional probability of h given v is

P ( h | v ) = j = 1 n P ( h j | v ) {\displaystyle P(h|v)=\prod _{j=1}^{n}P(h_{j}|v)} .

The individual activation probabilities are given by

P ( h j = 1 | v ) = σ ( b j + i = 1 m w i , j v i ) {\displaystyle P(h_{j}=1|v)=\sigma \left(b_{j}+\sum _{i=1}^{m}w_{i,j}v_{i}\right)\,} and P ( v i = 1 | h ) = σ ( a i + j = 1 n w i , j h j ) {\displaystyle \,P(v_{i}=1|h)=\sigma \left(a_{i}+\sum _{j=1}^{n}w_{i,j}h_{j}\right)}

where σ {\displaystyle \sigma } denotes the logistic sigmoid.

The visible units of RBM can be multinomial, although the hidden units are Bernoulli. In this case, the logistic function for visible units is replaced by the softmax function

P ( v i k = 1 | h ) = exp ( a i k + Σ j W i j k h j ) Σ k = 1 K exp ( a i k + Σ j W i j k h j ) {\displaystyle P(v_{i}^{k}=1|h)={\frac {\exp(a_{i}^{k}+\Sigma _{j}W_{ij}^{k}h_{j})}{\Sigma _{k=1}^{K}\exp(a_{i}^{k}+\Sigma _{j}W_{ij}^{k}h_{j})}}}

where K is the number of discrete values that the visible values have. They are applied in topic modeling, and recommender systems.

Relation to other models

Restricted Boltzmann machines are a special case of Boltzmann machines and Markov random fields. Their graphical model corresponds to that of factor analysis.

Training algorithm

Restricted Boltzmann machines are trained to maximize the product of probabilities assigned to some training set V {\displaystyle V} (a matrix, each row of which is treated as a visible vector v {\displaystyle v} ),

arg max W v V P ( v ) {\displaystyle \arg \max _{W}\prod _{v\in V}P(v)}

or equivalently, to maximize the expected log probability of V {\displaystyle V} :

arg max W E [ v V log P ( v ) ] {\displaystyle \arg \max _{W}\mathbb {E} \left}

The algorithm most often used to train RBMs, that is, to optimize the weight vector W {\displaystyle W} , is the contrastive divergence (CD) algorithm due to Hinton, originally developed to train PoE (product of experts) models. The algorithm performs Gibbs sampling and is used inside a gradient descent procedure (similar to the way backpropagation is used inside such a procedure when training feedforward neural nets) to compute weight update.

The basic, single-step contrastive divergence (CD-1) procedure for a single sample can be summarized as follows:

  1. Take a training sample v, compute the probabilities of the hidden units and sample a hidden activation vector h from this probability distribution.
  2. Compute the outer product of v and h and call this the positive gradient.
  3. From h, sample a reconstruction v' of the visible units, then resample the hidden activations h' from this. (Gibbs sampling step)
  4. Compute the outer product of v' and h' and call this the negative gradient.
  5. Let the weight update to w i , j {\displaystyle w_{i,j}} be the positive gradient minus the negative gradient, times some learning rate: Δ w i , j = ϵ ( v h T v h T ) {\displaystyle \Delta w_{i,j}=\epsilon (vh^{\mathsf {T}}-v'h'^{\mathsf {T}})} .

The update rule for the biases a, b is defined analogously.

A Practical Guide to Training RBMs written by Hinton can be found in his homepage.

See also

References

  1. Smolensky, Paul (1986). "Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory" (PDF). In Rumelhart, David E.; McLelland, James L. (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press. pp. 194–281. ISBN 0-262-68053-X.
  2. Hinton, G. E.; Salakhutdinov, R. R. (2006). "Reducing the Dimensionality of Data with Neural Networks" (PDF). Science. 313 (5786): 504–507. doi:10.1126/science.1127647. PMID 16873662.
  3. Larochelle, H.; Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines (PDF). Proceedings of the 25th international conference on Machine learning - ICML '08. p. 536. doi:10.1145/1390156.1390224. ISBN 9781605582054.
  4. ^ Salakhutdinov, R.; Mnih, A.; Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. Proceedings of the 24th international conference on Machine learning - ICML '07. p. 791. doi:10.1145/1273496.1273596. ISBN 9781595937933.
  5. Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). An analysis of single-layer networks in unsupervised feature learning (PDF). International Conference on Artificial Intelligence and Statistics (AISTATS).
  6. ^ Ruslan Salakhutdinov and Geoffrey Hinton (2010). Replicated softmax: an undirected topic model. Neural Information Processing Systems 23.
  7. ^ Miguel Á. Carreira-Perpiñán and Geoffrey Hinton (2005). On contrastive divergence learning. Artificial Intelligence and Statistics.
  8. Hinton, G. (2009). "Deep belief networks". Scholarpedia. 4 (5): 5947. doi:10.4249/scholarpedia.5947.{{cite journal}}: CS1 maint: unflagged free DOI (link)
  9. ^ Geoffrey Hinton (2010). A Practical Guide to Training Restricted Boltzmann Machines. UTML TR 2010–003, University of Toronto.
  10. ^ Sutskever, Ilya; Tieleman, Tijmen (2010). "On the convergence properties of contrastive divergence" (PDF). Proc. 13th Int'l Conf. on AI and Statistics (AISTATS).
  11. ^ Asja Fischer and Christian Igel. Training Restricted Boltzmann Machines: An Introduction. Pattern Recognition 47, pp. 25-39, 2014
  12. María Angélica Cueto; Jason Morton; Bernd Sturmfels (2010). "Geometry of the restricted Boltzmann machine" (PDF). Algebraic Methods in Statistics and Probability. 516. American Mathematical Society. arXiv:0908.4425.
  13. Geoffrey Hinton (1999). Products of Experts. ICANN 1999.
  14. Hinton, G. E. (2002). "Training Products of Experts by Minimizing Contrastive Divergence" (PDF). Neural Computation. 14 (8): 1771–1800. doi:10.1162/089976602760128018. PMID 12180402.

External links

Categories:
Restricted Boltzmann machine: Difference between revisions Add topic