The CERN data center collects more than 30 petabytes of data per year from Large Hadron Collider (LHC) experiments. It stores and shares data with institutions around the globe. This is precisely how LHC got a big data storage boost in 2002. Few people at that time could have predicted the Big Data boom in the future.
Today, Big Data is thriving. The fourth industrial revolution was the main topic of discussion at the World Economic Forum at Davos this year. The discussion focused on the digital revolution and how technologies merge the physical, digital, and biological worlds. At the core of the fourth industrial revolution lies the concept of billions of people connected by mobile devices, such as cell phones, with large processing power and storage capacity. While we experience great progress in artificial intelligence (AI), robotics, the Internet of Things, drones, autonomous vehicles and 3-D printing, everyone has become overwhelmed by data.
For example, in 2015 Facebook users sent on average 31.25 million messages and viewed 2.77 million videos per minute. And within five years, there will be over 50 billion connected smart devices in the world, all developed to collect, analyze and share data.
Data growth as digital information doubles every 18-24 months.
Every year, the amount of information around us grows so fast we need to create more sophisticated tools to analyze and structure it.
Deep learning approaches comprise the most promising research area of 2016. Most scientists in the field believe that this was the year when we saw a lot of progress in unsupervised deep learning (Figure 3).
The promise of deep learning is to discover rich, hierarchical models that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as symbols in natural language corpora. Neural networks are already a must-have for image analysis; this year, we believe they will also be applied to natural language processing (NLP).
Findo has taken part in this research and wants our team's expertise to contribute to the creation of models that will be able to generate text data after they understand it.
This “real understanding” is meant to become the main goal of creating artificial “human” cognition, as well as the goal of the whole Artificial Intelligence research. While still far ahead, this opportunity seems possible for different verticals of knowledge.
For Findo, this vertical is natural language processing in personal clouds of files, emails, notes, and contacts.
At the same time, we see not only great success in deep learning algorithms, but also big interest in the field of personal assistant and bot products (Figure 3). Most of them are rule-based, but their existence indicates a strong demand for such solutions.
Findo works to solve the problem of creating a smart personal assistant that helps you search through your personal cloud. We focus on the analysis of text data: email, files, and notes distributed across devices, cloud storage, tags, and folders.
The problem is to create not only such an assistant, but also the unsupervised learning network that we will be able to train in different languages with less effort.
Studying the patterns of the ideal search example, we understand that the information should be found not by exact keywords, but by description. That is what we call a “smart search.”
We are also trying to create a system that can recognize patterns in personal data and organize the data into folders by itself.
Our idea is to visually organize personal data like email and files and let our users experience not just the “search,” but what we call the “smart search” or the “knowledge-discovering” experience.
Strong analysis with many different features is hardly achievable with any kind of rule-based system: such systems are time-consuming to develop, especially given the different languages we need to scale to.
Deep statistical models that contain many layers of latent variables and millions of parameters can be learned efficiently, and the learned high-level feature representations reveal links, including hidden links, between data. They work well in domains where the data itself changes quickly and the links between data grow dynamically: whether you meet a new person, get hired by a new company, or start a new project, strict rules cannot control the data effectively.
Approach to solving the problem
Our main thought was to use the great results achieved in image analysis with deep statistical models and to apply them to text analysis. The main obstacle here is the sparsity of text data: the more discrete the data is, the more data you need to successfully train statistical models. We have overcome this obstacle using vector representations.
Producing language response to a natural language query is a key component of General Artificial Intelligence. Recent advances in generative variational text modeling, distributed vector space modeling of sentences and documents, and topic modeling make the problem more tractable. Our major goal is to create a statistical model of text data using an unsupervised and semi-supervised approach. Such a model must have two major components: a text reader and a text producer. The reader pretends to understand the text upon its input and converts it into a feature-rich artificial vector representation. This form of coding information contained in a text can later be used for various NLP tasks, such as unsupervised clustering, classification, and so on. Producer/generator models which have context and a query topic can match relevant user data, ask questions to refine and disambiguate a query if necessary, helping the user to get exactly what they want.
We plan to finally train the model to help us solve the following problems remotely:
The smart search function will allow a user to search by description. The user can find a document not only by exact keywords but also by a description, provided the system identifies the semantic space that corresponds to the search query and can interpret the results from this space as relevant.
Remember how difficult it is to take care of structuring your own data? Tags, folders, rules for moving items? The job of an assistant, as we call our model that understands the text really well, is to solve the problem of auto-tagging and classifying the data for the user. Findo either learns new patterns with unsupervised methods or learns to use the habits and structures a user already has to keep sorting the incoming feed with tags or folders.
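As a rough illustration of searching by description, the sketch below ranks documents by cosine similarity between a query vector and document vectors. The vectors, file names, and the `rank_by_description` helper are hypothetical; in practice the vectors would come from the trained text reader, which is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_description(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query vector."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


# Toy, hand-made vectors (assumption): a real system would use learned embeddings.
docs = {
    "invoice.pdf": [0.9, 0.1, 0.0],
    "holiday.jpg": [0.0, 0.2, 0.9],
    "contract.doc": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]
ranking = rank_by_description(query, docs)
```

The point of the sketch is only that relevance is decided by closeness in the vector space rather than by shared keywords.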
Controllable chat bot
We’re usually not sure how to find the data we are looking for. The “holy grail” of the generative model with text laws and pattern understanding is the ability to create a bot that will really speak about the problem – but not just using the parts of the analyzed text with the highest probability, but actually with real “cognition.”
Findo will be able to ask a correct question during the conversation and generate appropriate texts. The whole industry is searching for a way to implement the freedom of a non-rule-based approach into the conversational bots, keeping the ability to communicate under control. For example, the chatbot will be able to ask a user what contact they have in mind when they search for the email, or if they remember what is in the photo when they search for images. Currently, neural variational networks are already used for supervised question answering.
The Findo team conducts research on the model to understand the generative text process and the way to control it, so that we receive the texts we want for the topics we need.
We plan to choose the best model in terms of comparing the embeddings produced and the generative abilities.
Our experiments with natural text analysis start with checking different model approaches which were introduced lately:
- VAE model
- Ladder style model
- “Skip thoughts” vectors
- GAN style model
VAE model (Auto-Encoding Variational Bayes)
Variational autoencoders are interesting generative models, which combine ideas from deep learning with statistical inference. They can be used to learn a low dimensional representation Z of high dimensional data X such as images (of e.g. faces). In contrast to standard autoencoders, X and Z are random variables. They were first introduced in 2013.
When the VAE model is used, it is assumed that the data x is generated by pθ(x|z), where θ denotes the parameters of a deep neural network. Given the data x, we want to maximize the log-likelihood log p(x), and in order to solve this problem, we can use variational inference. The main principle of variational inference is to introduce an approximate distribution q(z|x) and maximize the lower bound instead of the log-likelihood.
In VAE, the approximate distribution q(z|x) can be parameterized by qφ(z|x), where φ stands for the parameters of a deep neural network.
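For reference, the lower bound referred to above is usually written as follows. This is the standard form from the cited Auto-Encoding Variational Bayes paper, restated with the symbols of this section; pθ(z) denotes the prior over the latent variable.

```latex
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z|x) \,\|\, p_\theta(z)\big)
```

The first term rewards accurate reconstruction of x from z, and the second keeps the approximate posterior close to the prior.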
VAE can generate samples from random values by using the decoder p(x|z); these samples do not actually exist in the original training data. VAE trains the probability distribution p(x|z) which is most likely to generate the original data; therefore, we can generate new samples which look like the original data.
In the fields where you can have access to a small amount of labeled and a large amount of unlabeled data, you can make great use of it with the semi-supervised approach.
The training process is seen as a game between two separate networks: a generator network and a second, discriminative network that tries to classify samples as coming either from the true distribution p(x) or from the model distribution p̂(x). Every time the discriminator notices a difference between the two distributions, the generator adjusts its parameters slightly to make it go away, until at the end (in theory) the generator exactly reproduces the true data distribution, and the discriminator guesses at random, unable to find a difference.
Our whole roadmap for deep learning includes creating and testing the models, comparing their embeddings and generative abilities, and examining the structures of global and local latent spaces, so that we can use the best model to control the generative process with the discovered structure.
We have started the research and implementation with the VAE model. We have chosen to use recurrent neural networks that incorporate a distributed latent representation of the whole sentence to generate sentences.
To eliminate the sensitivity to the sentence, we have introduced the symmetrical approach. We also use discrimination information (the Kullback-Leibler divergence) to measure the distance between probability distributions. Our plan is to continue experimenting with the loss function and to add smoothing to make the model less sensitive to the number of words in a sentence.
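The Kullback-Leibler divergence is not symmetric, which is easy to see for diagonal Gaussians. The sketch below computes it in closed form and also shows a symmetrized variant. Reading "the symmetrical approach" as the average of the two directions is an assumption made here for illustration; the text does not specify it, and the function names are hypothetical.

```python
import math


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for diagonal Gaussians given per-dimension means and variances."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total


def symmetric_kl(mu_p, var_p, mu_q, var_q):
    """Average of KL(p || q) and KL(q || p); one possible symmetrization (assumption)."""
    forward = kl_diag_gauss(mu_p, var_p, mu_q, var_q)
    backward = kl_diag_gauss(mu_q, var_q, mu_p, var_p)
    return 0.5 * (forward + backward)


# The two directions differ when the variances differ.
forward = kl_diag_gauss([0.0], [1.0], [0.0], [4.0])
backward = kl_diag_gauss([0.0], [4.0], [0.0], [1.0])
```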
The recurrent generation goes from one sentence to another by greedily decoding from points between the two sentence encodings; with a conventional autoencoder, this path passes through intermediate sentences that are not plausible.
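The points between two sentence encodings can be obtained by linear interpolation in the latent space. A minimal sketch, assuming the encodings are plain lists of floats; the `interpolate` helper is hypothetical, and the decoder that would turn each point back into a sentence is not shown.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced points on the line from z_start to z_end, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    points = []
    for i in range(steps):
        t = i / (steps - 1)  # goes from 0.0 to 1.0
        points.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return points


# Five points between two toy 2-dimensional encodings.
path = interpolate([0.0, 2.0], [1.0, 0.0], steps=5)
```

Each point in `path` would then be decoded greedily to inspect how the generated sentence changes along the way.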
The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.
The multivariate normal distribution of a k-dimensional random vector x = [X1, X2, …, Xk] can be written in the following notation: x ∼ N(μ, Σ), where μ is the k-dimensional mean vector and Σ is the k × k covariance matrix.
VAE for text is attractive for various reasons. First of all, it is an unsupervised model that we can train using a large amount of unlabeled data. The encoder (or recognizer) part of the model qφ(z|x) maps documents into a low-dimensional vector representation as a probability distribution over the latent space. Encoding a document as a probability distribution has good regularisation properties: because the model cannot encode a document into a single point, it has a natural built-in mechanism to avoid overfitting. The same property also supports our intuition that documents sharing the same global semantic information will probably be represented as similar distributions, and, having a measure of their similarity, we can estimate how close documents are to each other in their semantic sense. This can be beneficial for the task of unsupervised clustering of user data.
Another attractive property is how the generative part of the model pθ(z)pθ(x|z) allows us to mimic the process of natural text generation in a completely artificial way. This property can be considered a step towards the model of a controllable chatbot which can communicate with the users using natural language.
Our implementation of VAE for text uses the Torch machine learning framework. Both the encoder and decoder parts of the model are single-layer LSTM RNNs with hidden layer sizes between 512 and 1024 for different setups. We did not use any regularization in the model. We chose a completely inputless decoder variant: the RNN has to generate a distribution over a sequence of words using only the encoded representation of the document, sampled from z and supplied at its first state. Such a variant is beneficial for overcoming some difficulties in training and allowed us to get a high value of the KL part of the loss function, ensuring that the latent representation is utilized as much as possible, even at the cost of likelihood. We used diagonal Gaussians as the distribution over z in our experiments. Due to the large number of words in the vocabulary (from 180 000 to 450 000), we had to implement an approximated variant of softmax at the decoder output layer to speed up model fitting. We used Noise Contrastive Estimation as such a technique. The weights of the model were initialized from a normal distribution with a variance of 0.05. The initial learning rate was set to 0.001 and decayed linearly over training epochs. We also observed the need to sigmoidally anneal the weight of the KL divergence term to prevent the model from collapsing embeddings into points, as was noted in the paper.
We trained our model on an Nvidia Titan X GPU in a computer with 16 GB of memory and an Intel i7-6700 CPU @ 3.40GHz to speed up calculations. It ran our one-layer 1024 LSTM model with a 400 000-word vocabulary at 20 000 words per second with minibatches of 128 sentences of equal length. Having equal-length sentences within each minibatch allowed us to avoid padding documents of different lengths. Thus, the processing power of the GPU was utilized in the most efficient way.
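The two schedules mentioned above can be sketched in a few lines. This is an illustration in plain Python, not the Torch implementation: the `midpoint` and `steepness` constants are hypothetical, and decaying the learning rate all the way to zero at the last epoch is an assumption, since the text only states the initial value of 0.001 and that the decay is linear.

```python
import math


def kl_weight(step, midpoint=10000, steepness=0.001):
    """Sigmoid annealing: the weight rises smoothly from near 0 to near 1 around `midpoint`."""
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))


def learning_rate(epoch, total_epochs, initial=0.001):
    """Linear decay from `initial` at epoch 0 down to 0 at `total_epochs` (assumed endpoint)."""
    return initial * max(0.0, 1.0 - epoch / total_epochs)


def annealed_loss(neg_log_likelihood, kl_divergence, step):
    """Negative variational lower bound with the KL term scaled by the annealing weight."""
    return neg_log_likelihood + kl_weight(step) * kl_divergence
```

Early in training the KL term contributes almost nothing, so the model is free to use the latent code; the penalty is phased in only later.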
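One simple way to build minibatches of equal-length sentences is to bucket the sentences by length before batching. The sketch below assumes each sentence is already a list of tokens; the `equal_length_batches` helper is hypothetical and not taken from the original implementation.

```python
from collections import defaultdict


def equal_length_batches(sentences, batch_size=128):
    """Group tokenized sentences into batches in which every sentence has the same length."""
    buckets = defaultdict(list)
    for tokens in sentences:
        buckets[len(tokens)].append(tokens)

    batches = []
    for length in sorted(buckets):
        group = buckets[length]
        # Slice each bucket into chunks of at most batch_size sentences.
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches


toy = [["a"], ["b", "c"], ["d"], ["e", "f"], ["g"]]
toy_batches = equal_length_batches(toy, batch_size=2)
```

Because no batch mixes lengths, no padding tokens are needed; the trade-off is that the last batch of each bucket may be smaller than `batch_size`.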
The chart above represents a typical training session. The weight of KL term (red) is gradually annealed over model updates. At the beginning, the model is free to choose any geometry of embedding, and KL Divergence is growing rapidly; but as the weight of KL term becomes nonzero, the full variational lower bound objective pushes it down. However, it never becomes zero. The model likelihood part (green) also suffers from imposed geometry, but gradually becomes lower.
Below are the results observed in some of the best trained models:
First of all, it is interesting to test text generation properties. The Findo interface for generating and testing the model allows for the generation of an endless number of meaningful text sentences just by sampling from the distribution given by the encoder/recognizer part of the model. It is interesting to note that, unlike many other approaches to text generation, VAE with an inputless decoder is conditioned only on a sample from z and generates words sequentially, developing information from z through hidden layers along the sequence. It is more common to condition the next word generation on previously generated words.
This form of generator can be more reliable than the usual language model, especially because the decoder of the model is run in the same way during training and inference. The decoder also gives us a likelihood estimate for each generated sample, by which the samples can be weighted.
Current experiments show good results in applying generative models to our texts. We have managed to gather a dataset of 120 million emails from different kinds of users: corporate and private emails in English. Part of it was manually labeled for use in the supervised learning part.
The ladder network model with a semi-supervised approach:
We have a great set of data with artificially created links from each word to several semantic class meanings. For example, “cat” belongs to “animals, domestic animals, living beings” etc.
Our goal is to get the model which reads the text and links it to the hierarchy of classes. The model originally was developed as a denoising autoencoder; we plan to use the same architecture to map text to the high-quality hierarchy of classes through the hierarchy of latent representations. Using the knowledge of the hierarchy, we are able to provide the user with tags and suggestions in the search user experience.
This is an attempt to adapt ideas from a very promising semi-supervised ladder model to the case of text sequences.
The same model is optimised to solve two tasks. One of these is the unsupervised task of restoring corrupted words, which requires the network to understand context and subword structure. The other task is to predict a correct annotation above the word. We have a small amount of annotated data and a large amount of raw text; the model has shown better performance when solving the two tasks simultaneously than when solving them separately.
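A minimal sketch of how the two tasks might be combined into one training objective, assuming per-example loss values have already been computed. The `joint_loss` helper, the use of `None` to mark unlabeled examples, and the `annotation_weight` parameter are all hypothetical and only illustrate the idea of optimising both tasks at once.

```python
def joint_loss(denoising_losses, annotation_losses, annotation_weight=1.0):
    """Mean denoising loss over all examples plus a weighted mean annotation loss over labeled ones.

    annotation_losses holds None for examples that have no annotation.
    """
    unsupervised = sum(denoising_losses) / len(denoising_losses)
    labeled = [loss for loss in annotation_losses if loss is not None]
    supervised = sum(labeled) / len(labeled) if labeled else 0.0
    return unsupervised + annotation_weight * supervised
```

With this shape, raw text still contributes through the denoising term even when a batch contains no annotated examples.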
This model is only partially implemented for now. We will add more layers and a hierarchy of latent Gaussian-shaped variables to follow the ideas of the ladder model as closely as possible.
We are currently running tests on two machines for models analyzing a dataset at a speed of 106 sentences per second:
While still proceeding with the research on the unsupervised part of this model, we have already achieved great results using this network as a supervised one:
- global accuracy 0.92631767866
- weighted precision 0.924655068152
- weighted recall 0.923934965924
- weighted F1 0.924294876783
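As a quick consistency check, the reported weighted F1 agrees with the harmonic mean of the weighted precision and weighted recall listed above. The variable names are illustrative, and how the per-class weights were computed is not stated in the text.

```python
precision = 0.924655068152
recall = 0.923934965924
reported_f1 = 0.924294876783


def harmonic_mean(p, r):
    """F1 as the harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r)


f1 = harmonic_mean(precision, recall)
# f1 agrees with reported_f1 to within 1e-9.
```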
Our additional experiments include testing a Sequence-to-Sequence model with an attention mechanism, an approach that has shown great performance when applied to machine translation tasks.
We anticipate that our approach and the experiments of other teams will show good results and further progress.
Generative models are a rapidly advancing area of research. As we continue to advance these models and scale up the training and the datasets, we expect to eventually create a model that will be able to generate text.
We continue the work and plan to carry the great results that these generative models have shown for image analysis over to natural language processing, and to scale them to other languages, where a smaller amount of training data is available.
The development of this white paper benefited significantly from the input and support provided by our friends from the Abbyy Compreno project, who helped greatly by providing data for semi-supervised models and reviewing our results:
Anatoly Starostin, Konstantin Anisimovich, Tatyana Danielyan, Valery Novitsky.
- Sophie Curtis. What to expect from Deep Learning in 2016 and beyond. KDNuggets, 2016.
- The Rise Of Bots: A Timeline Of Major VC-Backed Bot Startups. CB Insights, 2016.
- Shivon Zilis. The Current State of Machine Intelligence 2.0. http://www.shivonzilis.com/machineintelligence
- Fei Tian, Bin Gao, Di He, Tie-Yan Liu. Sentence Level Recurrent Topic Model: Letting Topics Speak for Themselves, 2016.
- Yishu Miao, Lei Yu, Phil Blunsom. Neural Variational Inference for Text Processing, 2015.
- Oliver Dürr. Introduction to variational autoencoders. Datalab-Lunch Seminar Series Winterthur, 2016.
- Diederik P Kingma, Max Welling. Auto-Encoding Variational Bayes, 2013.
- Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, Tapani Raiko. Semi-Supervised Learning with Ladder Networks, 2015.
- Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler. Skip-Thought Vectors, 2015.
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. Generative Adversarial Networks, 2014.
- Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio. Generating Sentences from a Continuous Space, 2015.
- Ferenc Huszár. How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary? 2015.
- Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient, 2016.
- OpenAI. Generative models, 2016. https://openai.com/blog/generative-models/
- Abbyy Compreno. http://compreno.com/en