For this project we used a data set of 10 million job vacancies originating from the UK, Australia, New Zealand and Canada, covering the period 2014-2016. The total number of words in the data was 3 billion. This data set was split into a training set (8.5 million vacancies) and a test set (1.5 million). For each vacancy we have the job description and the corresponding job title. We found that the job titles were usually reliable, although the data did contain a certain amount of noise. The 10 million vacancies had 89,098 distinct job titles.
Minimal preprocessing of the data was done. We ran a Scala job on Apache Spark to tokenize the raw text of the job descriptions into words, and all words were transformed to lowercase.
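The actual preprocessing ran as a Scala job on Spark, but the tokenization step itself is simple. A minimal Python sketch of the equivalent logic (the exact token-splitting rule used in the Scala job is an assumption here):

```python
import re

def tokenize(text):
    """Lowercase raw text and split it into word tokens.

    The regex \\w+ keeps runs of alphanumeric characters and discards
    punctuation; this is an assumed approximation of the original tokenizer.
    """
    return re.findall(r"\w+", text.lower())

tokens = tokenize("Senior Data Scientist - London (3+ years' experience)")
# tokens == ['senior', 'data', 'scientist', 'london', '3', 'years', 'experience']
```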
A cluster (CDH 5.7.1) with 2 worker nodes, each having 16 cores, 125.9 GiB memory and 21.8 TiB disk space, was used for preprocessing the data.
All deep learning experiments were done on an NVIDIA GRID K520 GPU.
At the moment, the most popular approach in natural language processing (NLP) is arguably to process text as a sequence of words (or characters) using a recurrent neural network (RNN). This approach often relies on long short-term memory (LSTM) or gated recurrent unit (GRU) blocks. RNNs keep an internal state while processing a sequence, and researchers have shown that LSTMs can deal with long-range dependencies between words.
In this project, we decided to use a convolutional neural network (CNN) as a building block for our model. The use of CNNs for NLP tasks is debatable. While convolution and pooling layers intuitively make sense for processing images, this is less clear for text. However, just like an image, text also has a compositional structure (i.e. word, phrase, sentence, paragraph and document), and the filters in a CNN capture local information. In addition, CNNs are easier to design and train, and they are fast. These CNNs can typically also be made deeper than the rather shallow LSTM architectures that have been published.
Our model architecture is visualized below and consists of a CNN that generates an embedding for a job description and a lookup table with job title embeddings:
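The forward pass of the description side of this model can be sketched in a few lines of NumPy. The convolution hyperparameters (1,000 filters, kernel width 5, stride 1) and embedding sizes are given later in this article; the tanh nonlinearity, max-over-time pooling, and final linear layer mapping 1,000 pooled features to a 100-dimensional embedding are assumptions, although a 1000-to-100 linear layer is consistent with the stated parameter count of 52,095,400 (42,584,500 word embeddings + 8,909,800 title embeddings + 501,000 convolution weights and biases + 100,100 projection weights and biases). The parameters are randomly initialized here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, N_FILTERS, KERNEL = 100, 1000, 5

# Hypothetical parameters, randomly initialized for illustration only.
conv_w = rng.standard_normal((N_FILTERS, KERNEL * EMB_DIM)) * 0.01
conv_b = np.zeros(N_FILTERS)
proj_w = rng.standard_normal((EMB_DIM, N_FILTERS)) * 0.01  # maps 1000 -> 100

def embed_description(word_vectors):
    """Map a job description, given as an (n_words, 100) array of word
    embeddings, to a single 100-dimensional description embedding."""
    n = len(word_vectors) - KERNEL + 1
    # Temporal convolution with stride 1: each window of 5 consecutive
    # word vectors is flattened and passed through 1,000 filters.
    windows = np.stack([word_vectors[i:i + KERNEL].ravel() for i in range(n)])
    feature_maps = np.tanh(windows @ conv_w.T + conv_b)  # (n, 1000)
    pooled = feature_maps.max(axis=0)                    # max over time -> (1000,)
    return proj_w @ pooled                               # (100,)

desc = rng.standard_normal((30, EMB_DIM))  # a toy 30-word job description
print(embed_description(desc).shape)       # (100,)
```

Because of the max pooling over time, descriptions of any length (of at least 5 words) map to a fixed-size embedding that can be compared against the job title embeddings in the lookup table.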
The above model was inspired by the work of Collobert et al. and Weston et al. We also found the work of Denil et al., Huang et al., Kalchbrenner et al. and Zhang et al. particularly useful for our application.
The cosine similarity between the job title embedding and the job description embedding was used as a scoring function. We followed a learning to rank approach, using the pairwise hinge loss to train this model. The word lookup table contained 100-dimensional embeddings for the 425,845 most frequent words. The job title lookup table consisted of 89,098 100-dimensional job title embeddings. For the convolution layer we set the number of filters to 1,000, the kernel width to 5 and the stride to 1. This model has 52,095,400 parameters. Word embeddings were initialized using rescaled Word2Vec coefficients obtained by running Word2Vec on our training data (Mikolov, Sutskever et al.). Optimisation was done using stochastic gradient descent, and dropout was used for regularisation. The implementation was done in Torch 7.
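The scoring function and pairwise hinge loss described above can be sketched as follows. The idea is that a description embedding should score higher (by cosine similarity) against its correct job title than against a randomly sampled incorrect one, by at least some margin; the margin value of 1.0 here is an assumption, not the value used in training:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_hinge_loss(desc, pos_title, neg_title, margin=1.0):
    """Pairwise hinge loss: zero once the correct title out-scores the
    incorrect title by at least `margin`, linear in the violation otherwise."""
    return max(0.0, margin - cosine(desc, pos_title) + cosine(desc, neg_title))

# Toy example: the description embedding aligns with the correct title
# and is orthogonal to the incorrect one, so the margin is satisfied.
desc_emb = np.array([1.0, 0.0])
correct_title = np.array([1.0, 0.0])
wrong_title = np.array([0.0, 1.0])
loss = pairwise_hinge_loss(desc_emb, correct_title, wrong_title)  # 0.0
```

Training with this loss pushes description embeddings towards their correct title embeddings and away from sampled negatives, which is what makes the shared 100-dimensional space meaningful for ranking.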
This approach yielded a number of interesting and powerful results. The following are some of the insights gained.
This article is part of a series on Using Deep Learning to Extract Knowledge from Job Descriptions. For more information, head to Using Deep Learning to Extract Knowledge from Job Descriptions.