If you are familiar with artificial neural networks, this is going to be interesting. Neural networks are trained by loss-minimization algorithms such as gradient descent and its stochastic variant, SGD. The one I've used most, and the one most people seem to reach for by default, is the Adam optimizer. These algorithms all work by minimizing a loss function: the network compares the predicted output of its final layer against the actual target and adjusts its weights to bring the two closer together, with the loss measuring the gap between prediction and target. However, for certain applications such as image similarity detection, a different objective is often used: a function called "Triplet Loss".
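
As a quick illustration of what this standard setup looks like, here is a minimal sketch, using plain NumPy and made-up toy data, of gradient descent minimizing an ordinary loss (mean squared error) for a single weight. Everything here is illustrative rather than taken from any particular library:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # inputs
y = np.array([2.0, 4.0, 6.0])  # targets (the true outputs, y = 2x)
w = 0.0                        # initial weight of our one-weight "network"
lr = 0.05                      # learning rate

for step in range(100):
    pred = w * x                        # predicted output
    loss = np.mean((pred - y) ** 2)     # MSE: the gap between prediction and target
    grad = np.mean(2 * (pred - y) * x)  # dLoss/dw
    w -= lr * grad                      # gradient descent step

print(w)  # converges toward 2.0, the weight that minimizes the loss
```

The key point for what follows: this only works because we have a target `y` to compare each prediction against.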

Triplet loss was first proposed in the OASIS paper (Chechik et al., 2009) by a Google research team, and it later became the foundation upon which FaceNet was built. FaceNet is a deep convolutional network, also proposed by a Google research team, that maps facial images into a 128-dimensional Euclidean space such that different images of the same person are mapped very close to each other, while images of different people are mapped far apart. Networks that map inputs into a new space like this cannot be trained with the regular loss functions, because those require a target output to compare each prediction against. In the case of FaceNet, for instance, the mapping itself is the output layer of the network, and it consists of 128 neurons. Our data is just images of people's faces, plus information such as names or unique IDs (something that distinguishes one person's image files from another's). As you may notice, we don't have any target 128-dimensional vectors with which to train this network. That is where triplet loss comes in.

Triplet loss is based on selecting triplets of records, which can be done in a few different ways. The idea is to form triplets in which two records belong to the same class and one belongs to another. In the FaceNet case, each triplet would contain two images of the same person and one image of someone else. Within a triplet, one of the two same-class records is designated the anchor, the other the positive, and the remaining record the negative. Then, instead of processing records one by one, the network processes triplets, and the optimization minimizes the distance between the anchor and the positive while maximizing the distance between the anchor and the negative.
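
To make that objective concrete, the triplet loss is usually written as max(||a - p||^2 - ||a - n||^2 + margin, 0), where a, p, and n are the anchor, positive, and negative embeddings. Here is a minimal NumPy sketch of that formula; the function name and the margin value (0.2 is a common choice) are illustrative assumptions, not taken from any specific implementation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor toward the positive and push it away from the
    negative, until the negative is at least `margin` farther away.
    L = max(||a - p||^2 - ||a - n||^2 + margin, 0)
    """
    pos_dist = np.sum((anchor - positive) ** 2)  # squared distance to the positive
    neg_dist = np.sum((anchor - negative) ** 2)  # squared distance to the negative
    return max(pos_dist - neg_dist + margin, 0.0)

# Toy 128-dimensional embeddings (random here, just to show the shapes).
rng = np.random.default_rng(0)
a, p, n = rng.normal(size=(3, 128))
print(triplet_loss(a, p, n))
```

Once the negative is farther from the anchor than the positive by at least the margin, the loss hits zero and that triplet stops contributing gradient, which is exactly why the selection of triplets matters so much.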

The triplet selection process is itself an essential part of the algorithm. In most cases, the selection step tries to find, for each anchor, the positive that is farthest from it (a hard positive) and the negative that is closest to it (a hard negative), since those are the triplets the network can learn the most from.
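
Here is a naive sketch of that mining step, assuming the embeddings are stored in a NumPy array alongside integer class labels; the function name is made up for illustration. Note that FaceNet itself mines triplets within mini-batches and prefers "semi-hard" negatives, because always picking the very hardest ones can destabilize training:

```python
import numpy as np

def hardest_triplet(embeddings, labels, anchor_idx):
    """For a given anchor, pick the hardest positive (same label, farthest)
    and the hardest negative (different label, closest). Assumes at least
    one positive and one negative exist for the anchor's class.
    """
    anchor = embeddings[anchor_idx]
    dists = np.sum((embeddings - anchor) ** 2, axis=1)   # squared distance to every record
    same = labels == labels[anchor_idx]
    same[anchor_idx] = False                             # exclude the anchor itself
    pos_idx = np.argmax(np.where(same, dists, -np.inf))  # farthest same-class record
    neg_idx = np.argmin(np.where(~same, dists, np.inf))  # closest other-class record
    return pos_idx, neg_idx

# Example usage with toy 1-D embeddings: 6 records, two classes.
emb = np.array([[0.0], [0.1], [0.9], [0.2], [1.0], [1.1]])
lab = np.array([0, 0, 0, 1, 1, 1])
print(hardest_triplet(emb, lab, anchor_idx=0))  # -> (2, 3)
```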

What this essentially does is map image data into a Euclidean space that is very well suited to other learning methods, particularly clustering. If you think about it, this can be seen as a supervised data transformation that makes subsequent unsupervised learning far more effective. That is why, if you build and train a network such as FaceNet, or find a trained model online, it can map almost any facial image dataset into a 128-dimensional space that can then be clustered with great precision, even by the simplest of algorithms. The images it maps are usually ones it has never been trained on, yet its mappings generalize so well that it seems to have a real perception of the human face, much like human beings themselves!
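
As a rough illustration of that last step, here is a sketch that clusters stand-in embeddings with plain k-means from scikit-learn. The embeddings are synthetic toy data standing in for the 128-dimensional vectors a FaceNet-like model would actually produce:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in embeddings for 10 "identities", 30 images each: one center per
# identity plus a little noise, mimicking well-separated FaceNet outputs.
rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 128))
embeddings = np.repeat(centers, 30, axis=0) + 0.05 * rng.normal(size=(300, 128))

# With well-separated embeddings, even plain k-means recovers the identities.
kmeans = KMeans(n_clusters=10, n_init=10).fit(embeddings)
print(kmeans.labels_.reshape(10, 30))  # each row should be a single cluster id
```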

I came across FaceNet at the beginning of this summer, traced its idea of triplet loss back to OASIS, and found it absolutely fascinating. After a while, I thought I should share it with others, which is why I wrote this post.