
Unsupervised Learning

14.0 Overview

The rise of the internet has put a massive, exponentially growing amount of data online. Social network posts, online product reviews, and news articles provide large amounts of text, and the Internet of Things is providing data from internet-connected cameras and other sensors. Governmental entities are making more and more data available over the internet. Even medical data is now making its way to the cloud in the form of imaging, gene expression data, and even doctors’ notes.

To turn all this raw data into meaningful information is a considerable challenge. Supervised learning can handle some big problems like facial recognition and machine translation, but it cannot make sense of most of this raw data. We can create vast numbers of input columns from this giant data pool, but most of the data does not contain any labels that we can turn into output columns (i.e., usable information). While it is possible to take a table of input columns and manually create the output columns, this kind of effort is usually prohibitively expensive for large tables. A different set of algorithms is required to take all this raw, unlabeled data and make it usable.

Unsupervised learning derives insight from training datasets with no output columns (i.e., no labels). For example, unsupervised learning algorithms analyze the data to spot the types of credit card fraud that supervised learning does not detect. Similar algorithms produce product recommendations for e-commerce vendors like Amazon and streaming services like Netflix. Unsupervised learning also can be used to create fake videos of political candidates (so-called deepfakes) and fraudulent social media posts.

This chapter will first explain cluster analysis, dimensionality reduction, and time series analysis, which are older but still widely used unsupervised learning techniques. Two applications of these techniques will also be discussed in this chapter: anomaly detection and recommender systems.

The chapter will conclude with a discussion of unsupervised neural network techniques which now dominate AI research.  This section will start with generative modeling techniques including autoencoders and generative adversarial networks and continue on to the self-supervised learning techniques that form the basis of many of today’s most impressive and important AI systems.

14.1 Cluster analysis

One technique for analyzing data using unsupervised learning is cluster analysis.  Clustering refers to a set of statistical techniques that have been available for over 100 years.  In clustering, the goal is to analyze the data and find groups of observations that are “similar”.  For example, an e-commerce vendor might analyze their customer data to identify groups of customers that are similar to one another.  Then they can target those customers with tailor-made marketing campaigns.

Many years ago, a well-known consumer brands company used clustering to analyze the characteristics of all the breakfast cereals on the market to try and identify where there might be holes in the market that create an opportunity to develop a new breakfast cereal. 

Clustering techniques start with a set of data that has input variables but no output variable. The job is to find similarities in the data that define groups, called clusters, and to place each observation (e.g., each existing breakfast cereal) in a group (or at least compute its proximity to each cluster).

One specialized type of clustering, market basket analysis, attempts to identify associations between the various items that have been chosen by a particular shopper and placed in their market basket, be it real or virtual, and assigns support and confidence measures for comparison.  A retailer can make use of this information for shopping cart recommendations, bundle pricing, shelf placement in a physical store, and post-sale marketing.

Association analysis is a generalization of market basket analysis.  The input to an association analysis is transaction data, with each record showing which categories were present.  For example, for market basket analysis, the input is one record per transaction with an indication (e.g. 1 or 0) of whether or not each product was purchased.  The output of an association analysis includes:

  • A support value that measures how frequently a set of items appears together in the transactions.
  • A confidence value that is essentially the likelihood of product A being purchased if product B is purchased. This measure by itself is often not useful because it doesn’t distinguish genuinely related products from products that are simply popular on their own.
  • A lift value that provides a measure of the interestingness of an association. Grocery store shoppers might buy milk most of the time they buy butter, but this isn’t a very interesting association because they buy milk on most shopping trips, and the same is true for butter.
  • A conviction value that is a measure of the reliability of the lift value.
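To make these measures concrete, here is a minimal sketch of an association analysis in Python using the open-source mlxtend library (an illustrative choice, not one named above; the toy transactions are invented):

```python
# A minimal association analysis sketch using the mlxtend library.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One row per transaction, one column per product (1 = purchased).
transactions = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 1, 1, 0],
     [0, 1, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 1, 1]],
    columns=["milk", "butter", "bread", "eggs"],
).astype(bool)

# Find itemsets that appear in at least 40% of transactions (support).
frequent = apriori(transactions, min_support=0.4, use_colnames=True)

# Derive rules with support, confidence, lift, and conviction values.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support",
             "confidence", "lift", "conviction"]])
```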

Organizations use cluster analysis to forecast sales and discounts, analyze goods bought together, place products on store shelves, and analyze web surfing patterns, among many other use cases.

14.1.1  A cluster analysis example

Here is an example of how it works. In biology, researchers collect data on plants and animals that have yet to be categorized. They take the different observed features (e.g., size, number of eyes, etc.) and create a table that has one row for each observed animal (or plant), and one input column for each feature as shown below:

[Figure: Clustering example data]

Notice that there are no output columns!  None of the columns in this table contains the category of each animal, as the output columns did in the supervised learning examples in previous chapters.

In supervised learning, the training dataset that is input to the supervised learning algorithm has the correct answers in an output column, and these correct answers guide (or supervise) the process that computes the optimal function. In unsupervised learning, there are no output columns, and the goal is to make sense of the data in the training dataset without having any supervision.

In our biology example, the purpose of the cluster analysis is to use the input data to figure out how the observed animals should be grouped with known species, and to figure out when to classify a group as a new species or sub-species.

If you take the unsupervised learning data from the table above and plot it in three dimensions (one dimension for each column), the result will be a graph like the one below:

[Figure: Cluster analysis visualization]

The data points (i.e., dots) in Cluster 1 are ones we would find in humans. The data points in Cluster 2 are horses, Cluster 3 is cats, and Cluster 4 is spiders.

[Note: I have oversimplified here in order to make the example more understandable. Real-world biologists would be analyzing data on rare animals, not on well-known ones like humans, horses, cats, and spiders. Also, they would likely use technical features with Latin names rather than easily understandable features like weight and lifespan.]

Now, suppose our data contained observations of animal types other than humans, horses, cats, and spiders.

For example, suppose the data table also included dogs, foxes, and squirrels. If we plotted dog, fox, and squirrel observations, their traits would overlap with the cat cluster, and we would not be able to classify any of them.

To make the cat, dog, squirrel, and fox clusters separate, we would need to add more variables, such as color, tail size, and the type of sound the animal makes. To plot the observations with these three additional variables would require three more dimensions, for a total of six dimensions. The three-dimensional plot is complex, yet relatively easy to visualize compared to six dimensions, which we cannot imagine in any understandable fashion.

Worse, to include even a small fraction of the many animal types, we might need 50 or 100 dimensions or more. In a 100-dimensional space, even though we cannot visualize it, the clusters of observations for cats, dogs, foxes, and squirrels would be distinct and separate, as are the four clusters in our three-dimensional example.

If we had unlabeled observations of seven animal categories (humans, horses, spiders, cats, dogs, foxes, and squirrels), we would want the algorithm to determine that there are seven clusters in the data and to place each observation in the correct cluster.

14.1.2  How cluster analysis works

Cluster analysis is a set of mathematical algorithms that can group the observations in a dataset into separate and distinct clusters in a high-dimensional space. These algorithms find the set of clusters that both maximizes the distance between the clusters in the high-dimensional space and minimizes the distance between the observations within each cluster. Mixing numeric variables such as weight and lifespan with categorical variables such as color complicates the mathematics of cluster analysis.

However, mathematical techniques have been worked out over the years to accommodate mixed numeric and categorical variables. We can use cluster analysis algorithms to determine the number of clusters and to assign observations (rows) to the clusters for almost any training dataset.  Popular clustering algorithms include k-means clustering, affinity propagation, mean-shift, spectral clustering, hierarchical clustering, DBSCAN, OPTICS, BIRCH, and Gaussian mixture models.
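As an illustration, here is a minimal sketch of one of these algorithms, k-means, using scikit-learn; the animal observations and feature values are invented for this example:

```python
# A minimal k-means clustering sketch with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows are observations (animals); columns are numeric features:
# weight (kg), lifespan (years), number of legs.
X = np.array([
    [70.0, 79, 2], [62.0, 81, 2],    # human-like observations
    [450.0, 27, 4], [500.0, 25, 4],  # horse-like observations
    [4.5, 15, 4], [4.0, 14, 4],      # cat-like observations
    [0.02, 2, 8], [0.01, 1, 8],      # spider-like observations
])

# Scale the features so that weight does not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

# Ask for four clusters and assign each observation to one of them.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)           # cluster assignment for each row
print(kmeans.cluster_centers_)  # cluster centers in scaled space
```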

People in many fields other than biology use cluster analysis. In marketing, one can use it to identify homogeneous groups of customers who have similar needs and attitudes. Marketers then target the different groups with different campaigns. Golfers may see ads featuring Tiger Woods, while nature lovers might see ads featuring mountains. In medicine, researchers can apply cluster analysis to surveys of patient symptoms to identify groups of patients. They can then label some clusters as new diagnostic or disease categories. Insurance companies use cluster analysis to determine which types of customers are making which types of claims and which customers would be responsive to the marketing of their various insurance products. Geologists use cluster analysis to identify earthquake-prone regions. In epidemiology, researchers use it to find areas or neighborhoods with similar epidemiological profiles.

Statisticians created the first cluster analysis algorithms in the 1930s and had to implement them by hand. As a result, they could only find clusters in low dimensional spaces and for relatively small numbers of observations. Over the years, statisticians and, more recently, computer scientists have developed improved cluster analysis algorithms to the point where they can handle unimaginably high-dimensional problems and massive training datasets.


14.1.3  Anomaly detection

One application of cluster analysis is anomaly detection.  Anomaly detection has a wide variety of commercial uses, including fraud detection, intrusion detection in computer networks, and fault detection.

To better understand how this works, let’s revisit credit card fraud detection. Credit card issuers apply supervised learning classifiers to massive databases of past transactions to distinguish valid from fraudulent transactions based on past patterns of fraudulent activity.

However, supervised learning has difficulty identifying new patterns of fraudulent activity. Enter unsupervised learning. Unsupervised learning is the tool credit card companies use to find new patterns of fraudulent activity. It does so by finding the patterns of valid activity and then flagging any activities that do not fit any of the historical patterns as possible fraud.

Of course, these determinations are rarely definitive. If I typically buy my gas at the same gas station but find it is not open one day and go across the street, that would not be my usual pattern, yet it is not fraud, so I find it quite annoying when issuers decline such transactions.

With unsupervised learning, issuers can use the multi-dimensional clusters in a credit card transaction database to represent patterns of valid user transactions. When the credit card systems encounter new transactions (which happens millions of times a day), they can match most legitimate new transactions to an existing valid transaction cluster. However, if a transaction does not fit neatly into a cluster, the systems can flag it as potentially fraudulent. The system can flag groups of transactions as well as individual transactions.

Anomaly detection can supplement supervised learning and improve fraud detection. The idea is to find normal patterns in data so that we can identify abnormal patterns. In a cybersecurity context, IT teams use anomaly detection to uncover abnormal patterns of computer network traffic and server/database access caused by hackers. Hospitals use anomaly detection to identify life-threatening electrocardiogram patterns and abnormal CT scans for hospital patients. Workers in other areas use anomaly detection to identify insurance and accounting fraud, to predict weather patterns, and for many other application areas.

Anomaly detection methods compare new data points to past data points and look for ones that are very high or very low compared to the prior ones. One also can use the data history to develop supervised prediction functions. AI systems can learn these functions by taking each historical data point and trying to predict its next value using the data points immediately prior. Then, when new data points arrive that deviate significantly from the predictions, they are treated as possible anomalies.

There are numerous anomaly detection algorithms.  Some popular ones include COPOD, local outlier factor, isolation forest, elliptic envelope, one-class SVM, and autoencoders.
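As an illustration, here is a minimal anomaly-detection sketch using one of these algorithms, the isolation forest, via scikit-learn; the transaction features and values are invented:

```python
# A minimal anomaly-detection sketch using an isolation forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Historical "valid" transactions with two features, e.g. purchase
# amount and distance from the cardholder's home.
normal = rng.normal(loc=[50.0, 5.0], scale=[15.0, 2.0], size=(1000, 2))

model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# Score new transactions: -1 flags a potential anomaly.
new_transactions = np.array([[55.0, 6.0],      # fits the usual pattern
                             [900.0, 400.0]])  # far outside any cluster
print(model.predict(new_transactions))  # e.g. [ 1 -1]
```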

14.2  Dimensionality reduction

In the last section, we touched on the difficulty of visualizing, and more generally analyzing, high-dimensional data.  This section discusses older, but still widely used, techniques for reducing that dimensionality and making the data easier to analyze.

Principal components analysis (PCA) was invented in 1901 by Karl Pearson.  This technique finds a small number of new dimensions (smaller than the number of input variables), each of which is a weighted combination of all the input variables. The first principal component is the combination that defines the line with the best least squares fit to all the observations in the high-dimensional space, i.e. the direction along which the data varies the most. The second principal component is the combination that best fits the observations BUT is uncorrelated with the first principal component. And so on.

Often, most of the variability can be accounted for with 1, 2, or 3 principal components. Then these principal components are used in a regression or classification function to fit the observed values and predict the test set and real-world values. The primary disadvantage of this approach is that the resulting dimensions tend to be uninterpretable as they are different combinations of the input variables.
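Here is a minimal PCA sketch using scikit-learn; the dataset is synthetic, constructed so that two hidden factors drive ten observed variables:

```python
# A minimal PCA sketch with scikit-learn on synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))   # 2 hidden factors
mixing = rng.normal(size=(2, 10))    # how the factors mix into variables
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))  # observations

# Project the 10-dimensional data onto the first two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Most of the variability is captured by the first two components.
print(pca.explained_variance_ratio_)
```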

Factor analysis was invented in the 1880s.  It is similar to, but still different from, PCA.  Factor analysis attempts to find a smaller number of unobserved variables (also known as latent variables) that can explain the variability in the observed variables.  For example, researchers (Cattell and Saunders, 1954) asked subjects to listen to 120 pieces of music and rate each piece on a scale of 0 to 12.  Using factor analysis, they were able to determine that 11 latent variables accounted for most of the variability in the subjects’ ratings of the music.

Singular value decomposition (SVD) is an improvement on factor analysis.  SVD performs a similar function to factor analysis but can handle more complex relationships.  More recently, neural network-based computation techniques using gradient descent have been used to find the lower-dimensional factors.


14.3 Time series analysis

Time series analysis is used to make predictions about observations that occur over time, such as stock prices, retail sales, call center traffic, and weather patterns.  A time series is usually defined with respect to a frequency, e.g. hourly, daily, weekly, monthly, or yearly.

Holt-Winters (Holt, 1960) is a classic time series analysis technique. This technique, like most time series analysis techniques that followed, splits the analysis into three components: the level of the series, the trend, and the seasonality.  It’s possible for some time series data to have only a trend (and no seasonality) or only seasonality (and no trend).  Holt-Winters uses a statistical technique called exponential smoothing to forecast future values. It does this by computing a trend function that is weighted by seasonality data.
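Here is a minimal Holt-Winters sketch using the statsmodels library; the monthly series is synthetic, with an upward trend plus yearly seasonality:

```python
# A minimal Holt-Winters (exponential smoothing) forecasting sketch.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Four years of synthetic monthly data: trend plus yearly seasonality.
months = pd.date_range("2020-01-01", periods=48, freq="MS")
values = (100 + 2 * np.arange(48)
          + 10 * np.sin(2 * np.pi * np.arange(48) / 12))
series = pd.Series(values, index=months)

# "add" = additive trend and additive seasonality.
model = ExponentialSmoothing(series, trend="add", seasonal="add",
                             seasonal_periods=12).fit()

print(model.forecast(12))  # forecast the next 12 months
```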

ARIMA (Box and Jenkins, 1970) is a slightly more recent time series analysis technique.  ARIMA uses statistical techniques, primarily differencing, to make the time series data stationary, i.e. it compensates for changes over time in the underlying data distribution.
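And a minimal ARIMA sketch, also with statsmodels; the synthetic series is illustrative, and order=(1, 1, 1) means one autoregressive term, one level of differencing (to make the series stationary), and one moving-average term:

```python
# A minimal ARIMA forecasting sketch with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Four years of synthetic monthly data with a trend and some noise.
months = pd.date_range("2020-01-01", periods=48, freq="MS")
values = 100 + 2 * np.arange(48) + np.random.default_rng(0).normal(0, 3, 48)
series = pd.Series(values, index=months)

# Differencing (the middle "1") removes the trend before modeling.
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(12))  # forecast the next 12 months
```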

ARIMA, and to a lesser extent Holt-Winters, dominated time series forecasting for decades.  However, newer techniques are now providing a challenge.  Meta’s open source library Prophet (Letham and Taylor, 2017) significantly outperforms ARIMA when seasonality has a big influence.

Uber uses deep learning time series forecasting to predict demand, especially during holidays and other high-demand periods (Laptev et al, 2017), because tools like ARIMA didn’t perform adequately.  Uber has also open-sourced its time series library M3.  Amazon’s GluonTS (Gasthaus and Januschowski, 2019) and Google’s use of automated machine learning for time series forecasting (Liang and Li, 2020) are other examples of powerful deep learning-based time series forecasting tools.

Transformer networks would seem to provide a natural architecture for time series analysis, since positional encodings combined with the self-attention mechanism can capture temporal ordering.  While transformers have had some success in time series forecasting, they also have a tendency to lose temporal information (Zeng et al, 2022).

Time series analysis can be done solely on the basis of the raw time series data (i.e. univariate forecasting) or can take advantage of external data such as weather information (i.e. multivariate forecasting).

Time series analysis can be used for forecasting future values and for detection of anomalies.

14.4  Recommender systems

Recommender systems generate everyday e-commerce suggestions such as Amazon book and shopping recommendations, music recommendations on services like Pandora and Spotify, video recommendations on YouTube and TikTok, TV and movie recommendations on Netflix, job search recommendations on LinkedIn, news feed recommendations on Facebook, and advertisements on Google.

In 2015, Netflix’s two top product officers wrote an article describing the Netflix recommender system (Gomez-Uribe and Hunt, 2015), which it believes keeps subscribers from canceling. Their research shows that, on average, consumers will abandon a search for a movie or show to watch if they have not found one in 90 seconds. If consumers cannot find shows they want to watch, they are more likely to cancel their subscription, and that is why it is so crucial for Netflix to provide appealing recommendations for shows.

Netflix believes their recommendation system saves them one billion dollars per year that they would otherwise lose to customer churn.

The consulting firm McKinsey & Company reports that recommender systems are responsible for 35% of Amazon purchases and 75% of what people watch on Netflix (MacKenzie et al, 2013).

Recommender systems start with a table that has one row per user, one column per item, and cells that contain some type of interaction (e.g., a rating, a click, a like). Here is an example of a table of movie ratings:

[Figure: User-item table of movie ratings]

For example, hypothetical User 3 likes Casablanca and hates Pulp Fiction. I loved both movies, but you cannot see my ratings, because I am in the table somewhere between User 3 and User 17,632,592.

The items in the table could be movies, TV shows, or shows in a category, such as comedy or drama. They could also be books, e-commerce products, music, or almost anything else that people purchase, rate, listen to, or view. In general, the table will have far more empty cells than filled cells. For movie ratings, that is because few users will have seen more than a tiny percentage of the three million movies in the IMDB database.

The goal for a recommender system is to be able to predict user ratings for all the empty cells in the table (e.g., for all the movies each user has not rated). Recommender systems use two general types of techniques: collaborative filtering and content-based techniques. Many hybrids of the two also are in use.

14.4.1 Collaborative filtering

There are two types of collaborative filtering: user-user and item-item collaborative filtering.

14.4.1.1 User-user collaborative filtering

Suppose you have a friend named Sarah. You and Sarah have seen many of the same movies, and your ratings are always in agreement. If you see a new movie you like, there is a good chance that Sarah will like the film also.

Finding users with similar tastes is the basic idea behind user-user collaborative filtering.

If you need to recommend movies to Wally Moviebuff, you can find the users that have previously rated movies similarly to Wally. You can then recommend movies these users liked and that Wally has yet to see.

Most users give high ratings to some films and low ratings to others. Some users love Casablanca and hate Pulp Fiction. Some love Pulp Fiction and hate Casablanca. Some love both. There might even be people who hate both. The goal is to find groups of users that rate the same movies high or low.

For example, some users love horror movies, and others hate them. Some love foreign films, and others hate subtitles. However, if you look at the full set of movies, there will be groups of users that love (and hate and everything in between) all the same films. A user-user collaborative filtering system will find users with similar tastes and recommend movies that they liked, and the target user has yet to see.

The notion of similar users has a mathematical definition that is beyond the scope of this book (see Aggarwal, 2016). Spotify’s Discover Weekly playlists are an example of user-user collaborative filtering.
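Here is a minimal sketch of the core computation, using cosine similarity (a common choice) on a tiny, invented ratings matrix where 0 means unrated:

```python
# A minimal user-user collaborative filtering sketch.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are movies; 0 means the user has not rated it.
ratings = np.array([
    [5, 4, 0, 1],   # target user
    [5, 5, 4, 1],   # a user with similar tastes
    [1, 1, 5, 5],   # a user with opposite tastes
])

# Similarity of every user to every other user.
sim = cosine_similarity(ratings)

# Find the user most similar to the target (skipping the target itself,
# which always has similarity 1.0).
most_similar = np.argsort(sim[0])[-2]

# Recommend the unrated movie that the most similar user rated highest.
unseen = ratings[0] == 0
scores = ratings[most_similar] * unseen
print(np.argmax(scores))  # -> movie at column index 2
```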

14.4.1.2 Item-item collaborative filtering

Another collaborative filtering technique is item-item collaborative filtering. Item-item collaborative filtering looks at the target user’s ratings of similar movies (as opposed to finding similar users).

Movie similarity can be determined by finding movies that have a similar pattern of ratings among all users. This type of similarity is also defined mathematically.

While both user-user and item-item approaches can be helpful, they have several downsides:

(1) They do not support new users or new movies. New movies have no ratings, and new users have not created any ratings. This lack of ratings is called the cold start problem.

(2) Systems like these tend to favor more popular items because they have more ratings to analyze, so it becomes harder for newer items to be ‘found’ and thus become popular. As a result, the system may show a given user only a narrow range of recommendation categories, even though the user might prefer to explore new ones.

(3) For large user-item tables, these computations can become too computationally expensive to deliver results instantaneously to a user sitting on a webpage.

If we consider each column in the user-item table a dimension, in the case of IMDB movies, there will be three million dimensions. I mentioned earlier that it is impossible to visualize six dimensions. Millions of dimensions not only boggle the mind, but they slow collaborative filtering algorithms to a crawl. This difficulty in processing datasets with a large number of dimensions is an example of the curse of dimensionality.

When the user-item table is too large, data scientists often turn to dimensionality reduction techniques to effectively reduce the number of user dimensions, the number of item dimensions, or both. Cluster analysis (discussed above) is one technique that data scientists use to reduce the number of user dimensions. Cluster analysis sorts millions of users into a much smaller number of clusters. Instead of finding the most similar users among millions, the collaborative filtering algorithm only needs to find the most similar cluster to the user.

Another way to reduce the dimensionality is to use PCA, SVD, or neural networks.
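Here is a minimal sketch of SVD-based dimensionality reduction on a tiny, invented user-item matrix; reconstructing the matrix from a few latent dimensions fills the empty cells with predicted ratings:

```python
# A minimal SVD-based matrix factorization sketch.
import numpy as np
from scipy.sparse.linalg import svds

# Rows are users, columns are movies; 0.0 means unrated.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Factor the matrix into k latent dimensions (k = 2 here).
U, s, Vt = svds(ratings, k=2)

# Reconstructing the matrix produces predicted values for every cell,
# including the empty ones.
predicted = U @ np.diag(s) @ Vt
print(np.round(predicted, 1))
```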

Netflix started using dimensionality reduction algorithms as a result of a contest it ran in 2006. Netflix had a team of data scientists who had been able to continually improve the algorithm for predicting customer ratings of movies for years but had run into a wall in terms of further improvement. Since even small improvements in the recommendation system had a profound effect on customer retention, Netflix looked for out-of-the-box ideas.

In 2006, Netflix came up with a contest with a one-million-dollar prize to anyone who could produce a recommendation system that was ten percent more accurate than the then-existing Netflix Cinematch system. Netflix provided contestants with a table of 100 million ratings by 480,000 customers on 17,000 movies. They held back three million ratings that they used as the test. Contestants submitted the three million predicted ratings from their systems to Netflix, and Netflix told them their score.

In 2009, a team hit the ten percent goal, and the prize was awarded. The winning team turned over its code to Netflix, which started incorporating that team’s specific dimensionality reduction technique into its algorithms. Interestingly, Netflix was unable to make use of the full winning system.

In 2006, all of Netflix’s business was shipping DVDs to customers. Under that model, predicted customer ratings drove recommendations. However, in 2007, Netflix began offering streaming TV and movies. This enabled Netflix to observe user behavior much more closely. Staff could see when users started, stopped, or abandoned shows, and where on the homepage a show was discovered. Simply predicting customer ratings was no longer sufficient to optimize recommendations, although algorithms like the ones described above are still used.

In fact, two of the SVD-based collaborative filtering algorithms from Netflix Prize participants were adopted by Netflix and put to use on specific parts of the homepage. For a more in-depth description of the Netflix recommendation system see this article by the former product manager in charge of Netflix’s recommendation system.

The downside of dimensionality reduction techniques is that the resulting recommendations can be a bit coarse. For example, the customers in the most similar cluster are not necessarily the most similar. Dimensionality reduction of items often will eliminate low-frequency items, and they will never get recommended. Also, some users give generous ratings and some give stingy ratings. These issues all need to be taken into account when designing the algorithm.

14.4.2 Content-based techniques

Instead of breaking down the users into demographic features, content-based techniques break down the items. For example, Pandora hires musicians to rate every song based on 400 attributes. Then when a user likes a song, Pandora will find other songs with similar attributes and add them to the user’s playlist.

Another set of methods helps find items that people will like based on the words used to describe the item or in the synopsis. If a user likes a movie described as a “suspenseful thriller,” then we would want to find other films that use similar descriptors, so movies that have descriptions with more terms in common are labeled like one another. In computing similarity, these techniques typically weigh less common terms (e.g., “mountain-climbing”) higher than common terms (e.g., “movie”). The system will then use the similarity ratings to recommend movies that are similar to the ones the user previously rated highly.
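Here is a minimal sketch of this idea using scikit-learn's TF-IDF weighting, which automatically weighs rare descriptors higher than common ones; the movie descriptions are invented:

```python
# A minimal content-based similarity sketch using TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "suspenseful thriller with a mountain-climbing finale",
    "a suspenseful crime thriller",
    "lighthearted romantic comedy",
]

# Rare terms get higher weights than common ones.
tfidf = TfidfVectorizer().fit_transform(descriptions)
sim = cosine_similarity(tfidf)

# Movies 0 and 1 share distinctive descriptors, so they score as
# most similar to each other.
print(sim.round(2))
```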

One advantage of content-based approaches is that there is no cold start problem for items because the item attributes do not depend on historical data. One downside is that the system cannot personalize the results because they are an aggregate of all user preferences.

14.4.3   Neural network techniques

More recently, neural network embeddings have been used to improve on the techniques described above.  Amazon researchers (Hao et al, 2020) used an embedding technique named product2vec to represent the customer-item matrix.  This technique led to a 7% increase in purchases of recommended products. It also helped with the cold start problem because new items could still be placed in the embedding space which made recommendations possible before any customers purchased that product.

Neural network techniques are in use today at companies including Netflix (Das, 2019; Dye et al, 2020), Spotify (McInerny et al, 2018), Meta (Mudigere et al, 2022; Meta, 2023), Google (Cheng et al, 2016; Wortz and Totten, 2023), eBay (Wang et al, 2021), Uber (Ling et al, 2023), and Amazon (Ma et al, 2020).  The most common type of implementation is known as the two-tower model.

Generative AI techniques are now also being explored (Hou et al, 2023).

14.5  Autoencoders

An autoencoder is a neural network trained to convert its inputs to a much lower-dimensional representation than the number of input variables and then to use that representation to regenerate the inputs. A common task for an autoencoder is to learn an internal representation from a large set of images of a person or object. Then it can take as input a new image of that person or object and reproduce the image from the internal representation.  The high-level architecture of an autoencoder that learns to reproduce images is shown below:

[Figure: Basic autoencoder]

Why do this? We do not need AI to be able to reproduce an image. We can do that with copiers. The goal of an autoencoder is to create a compact internal representation of the input image that has fewer dimensions than the input.

More specifically, there will typically be many fewer neurons in the encoder output layer than in the encoder input layer. This encoder output layer contains a compact internal representation of the input image.

If the decoder can take this compact internal representation as input and still reproduce the images, then it is not merely memorizing the values of the pixels in the input images. The compact internal representation must capture the essential features of the input image to reproduce the image. In other words, the learned weights and neurons of the encoder output layer must capture details about what makes up a face—the distance between the eyes, the angularity of the nose, and so on.

14.5.1  How autoencoders work

Geoffrey Hinton and a colleague popularized the idea of autoencoders in a seminal 2006 paper (Hinton and Salakhutdinov, 2006). They trained an autoencoder on 28 × 28 images of digits (i.e. 28 × 28 = 784 input variables). In this architecture, there are 11 hidden layers. The first layer encodes the image information in the 784 input neurons into a representation that contains only 400 neurons. Each successive layer encodes the information into a smaller number of neurons until the information is encoded into a layer of only 6 neurons. That is the encoder portion of the system.

The decoder portion first learns to decode the 6 neurons to 25 and so on, up to the 784-neuron representation of the output image. One of the most important applications of autoencoders is using the weights of the trained encoder layer as the initial weights of the first hidden layer of a deep neural network. This produces both better performance and faster training than starting with a network with randomly initialized weights (Bengio et al, 2007).

For example, Google researchers (Dai and Le, 2015) created an autoencoder whose input was sequences of words and that learned to predict those same sequences. They then used the hidden layer as the initial layer of models that were trained on various text classification tasks using datasets such as IMDB, DBpedia, and 20 Newsgroups. After training, they found that the models with the pre-trained layer were able to be trained faster and had better performance.

Autoencoders are conceptually similar to dimensionality reduction techniques like PCA and SVD; however, autoencoder representations are non-linear.
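To make the architecture concrete, here is a minimal autoencoder sketch in PyTorch, loosely following the digit example above (with fewer layers than the 11-layer architecture described, and made-up batch data):

```python
# A minimal autoencoder sketch in PyTorch.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: compress 784 pixels down to a 6-dimensional code.
        self.encoder = nn.Sequential(
            nn.Linear(784, 400), nn.ReLU(),
            nn.Linear(400, 100), nn.ReLU(),
            nn.Linear(100, 6),
        )
        # Decoder: reconstruct the 784 pixels from the 6-dimensional code.
        self.decoder = nn.Sequential(
            nn.Linear(6, 100), nn.ReLU(),
            nn.Linear(100, 400), nn.ReLU(),
            nn.Linear(400, 784), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One training step: the target is the input itself, which is what
# makes the training unsupervised.
images = torch.rand(32, 784)  # stand-in for real 28x28 digit images
optimizer.zero_grad()
loss = loss_fn(model(images), images)
loss.backward()
optimizer.step()
```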

14.5.2  Creating deepfakes using autoencoders

Deepfakes are fake images or videos of people. Early on, deepfakes mostly showed up as images and videos of celebrities pasted into porn videos. More insidious is the prospect of fake videos portraying political figures making false or provocative statements.

You can imagine the fallout of a fake video of a U.S. President declaring nuclear war on another nation with nuclear capabilities. Deepfakes can be created with autoencoders like the one illustrated above. In that architecture, the inputs consisted of many different images of one person. The training dataset included images of her smiling, speaking, in different poses, and in different lighting. The encoder learned to produce a compact internal representation that was sufficient for the decoder to reproduce what she looked like in the input image: her facial expression, her pose, and the lighting.

In the architecture illustrated below, an autoencoder is used to reconstruct images of two people:

[Figure: Two-person autoencoder]

The encoder learns a compact internal representation that captures the key features of both individuals. However, a separate decoder is learned for each person. The two decoders learn to take the common internal representation of the facial expressions and reconstitute it as each person’s image. Once this network completes its learning phase, producing a deepfake is easy: we just switch the decoders, as illustrated below.

[Figure: Deepfake architecture]

When a new image of the woman is input to the network, the output will be an image of the man but with the facial expression, pose, and lighting that was in the woman’s input image. The next step is to make this substitution for each frame in an entire video. The result will be a video with frames of the man wearing the same facial expression as the woman at each point in the video.

If the woman is talking in the video, the fake video of the man will appear to be saying the same words (but still with the woman’s voice).

You can see a video of Jennifer Lawrence answering questions at the Golden Globe Awards with Steve Buscemi’s face created using an open source application named FaceSwap. The deepfake architecture discussed above is based on the architecture described in the Faceswap guide.

There are similar open source tools, such as FakeApp and DeepFaceLab. Research has shown it is possible to create a deepfake image using only eight training images of the target person plus one image of another person wearing the desired facial expression (Zakharov et al, 2019).

Deepfakes can also be developed using generative adversarial network technology.

Fortunately, MIT researchers (e.g. Chai et al, 2020) and others are developing AI-based methods of detecting deepfakes.

14.6  Generative models

The next set of techniques to be discussed are generative models.  Generative models create data that is similar to, but different from, the data on which they are trained.  If a generative model is trained on images, it can create novel images, i.e. images that have never been seen before.

So why is this useful? Ian Goodfellow (2017) from OpenAI offers several important use cases for generative models. They…

…can be used for planning and imagining possible futures and acting accordingly.

…are important for tasks where there may be more than one correct answer.

…can be used to create larger sets of input observations for a supervised learning task.

…can be used to synthesize a high-resolution image from a low-resolution image.

…can be used for image-to-image translation such as transforming a satellite image into a map.

…can be used to create art and have other creative applications.

14.6.1 Variational autoencoders

The autoencoders discussed above create a low-dimensional representation of an image that can then be used to reconstruct the image.  However, they cannot be used to generate images that were not part of the training set.  Variational autoencoders (Kingma and Welling, 2014) learn a representation of the distribution of the input observations. Because they represent the whole distribution, they can be used to generate not only the originals but also variants of the originals.

Instead of representing each image as a point in a latent space, variational autoencoders are trained to compute a distribution of the points in the latent space.  Then, the variational autoencoder can use this probabilistic representation to generate novel images.

For example, a variational autoencoder trained on a set of images of dogs can slightly vary the learned distribution to produce realistic-looking images of dogs that weren’t in the training set.
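Here is a minimal variational autoencoder sketch in PyTorch; the key difference from the plain autoencoder is that the encoder outputs a mean and a variance describing a distribution in the latent space, rather than a single point (layer sizes are illustrative):

```python
# A minimal variational autoencoder sketch in PyTorch.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=6):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 400), nn.ReLU())
        self.mu = nn.Linear(400, latent_dim)      # mean of the latent dist.
        self.logvar = nn.Linear(400, latent_dim)  # log-variance
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 400), nn.ReLU(),
            nn.Linear(400, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.hidden(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample from the latent distribution
        # in a way that lets gradients flow through mu and logvar.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

model = VAE()
x = torch.rand(32, 784)  # stand-in for real training images
recon, mu, logvar = model(x)

# Loss = reconstruction error + KL divergence; the KL term pulls the
# learned distribution toward a standard normal so we can sample from it.
recon_loss = nn.functional.mse_loss(recon, x, reduction="sum")
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon_loss + kl

# To generate a novel image, sample from the prior and decode.
novel_image = model.decoder(torch.randn(1, 6))
```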

The ability to produce realistic novel images means that the representation embodies the essential features of the domain of the training images. It is possible that the variational autoencoder has learned features like edges and corners, features like “the ears are generally found above the eyes”, or it might be just more surface-level information about which pixel colors are more likely to be present next to other pixel colors.

Images generated by variational autoencoders tend to have some interesting differences from the originals. However, they also tend to be blurry.

Variational autoencoders can also be used for natural language. A team of Google Brain researchers (Bowman et al, 2016) used an LSTM encoder and an LSTM decoder to create a variational autoencoder for sentences. The output of the encoder was a holistic representation of a sentence that presumably included holistic sentence properties such as style, topic, and syntax. They were able to use this representation to generate coherent novel sentences that were not part of the training set. However, variational autoencoders cannot string together multiple sentences to form a cohesive story.

A team of Google DeepMind and University of Oxford researchers (Miao et al, 2016) developed a variational autoencoder for a question-answering task. The system treated the question and answer inputs as bags-of-words and the system learned representations of both the question and answer. They then used a mathematical measure of distance between the representation of the question and the representations of multiple choice answers to determine the correct answer.

14.6.2 Generative adversarial networks

The image below contains a diagram of a generative adversarial network (e.g. Goodfellow et al, 2014; Radford et al, 2015):

[Figure: Generative adversarial network]

It is composed of two sub-networks: a generator and a discriminator. The generator initially creates random images. The discriminator takes an image as input and tries to determine whether it is a real sample or a generated one. The two networks are trained simultaneously so that the generator learns to produce more and more realistic images and the discriminator gets better at telling real from generated.

The generator is typically a deconvolutional network and the discriminator is a convolutional network (CNN). A deconvolutional network is essentially a CNN in reverse. Where a CNN compresses an input data set from perhaps billions of pixels to maybe 100 million weights, the deconvolutional network essentially decompresses the weights into pictures.
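Here is a minimal sketch of one GAN training step in PyTorch; for simplicity it uses fully connected layers rather than the convolutional and deconvolutional networks described above, and the data are invented:

```python
# A minimal GAN training-step sketch in PyTorch.
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 784

# Generator: maps random noise to an image.
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Sigmoid(),
)
# Discriminator: outputs the probability that an image is real.
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_images = torch.rand(32, image_dim)  # stand-in for training images
fake_images = generator(torch.randn(32, latent_dim))

# Discriminator step: label real images 1 and generated images 0.
d_opt.zero_grad()
d_loss = (bce(discriminator(real_images), torch.ones(32, 1)) +
          bce(discriminator(fake_images.detach()), torch.zeros(32, 1)))
d_loss.backward()
d_opt.step()

# Generator step: try to make the discriminator label fakes as real.
g_opt.zero_grad()
g_loss = bce(discriminator(fake_images), torch.ones(32, 1))
g_loss.backward()
g_opt.step()
```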

Generative adversarial networks can generate images that were not present in the training set, and the images generated are more realistic than those generated by variational autoencoders. They have been in the news because they are used to generate deepfakes, like the ones on the aptly-named website www.thispersondoesnotexist.com, as well as pictures and videos with one person’s head on another person’s body.

Generative adversarial networks can also be used to create art. An AI-produced painting recently sold for $432,500. Google’s Magenta Project is an open-source tool for exploring artistic endeavors through generative adversarial networks.

14.6.3  Diffusion models

As discussed in an earlier chapter, foundation models, and in particular, diffusion models, now generate the best images.

   
