Introduction
Convolutional Neural Networks (CNNs) are typically used with image data, but they can also be applied to other types of input. In this code example, we use the structure of a CNN to classify text. Unlike 2D images, text is 1D input data, so we will be using 1D convolutional layers. The Keras framework makes it easy to pre-process the input data.
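To make the 1D idea concrete, the small sketch below (illustration only, using made-up toy shapes that are not part of this example) builds a single Conv1D layer on a dummy input of 10 time steps with 8 features each; with a kernel size of 3 and “valid” padding, the filters slide over 10 - 3 + 1 = 8 positions.
# Illustration only: a Conv1D layer slides its filters along the time dimension
# of a (batch, steps, features) input. Toy shapes, not part of this example.
from keras.models import Sequential
from keras.layers import Conv1D

toy = Sequential()
toy.add(Conv1D(filters=4, kernel_size=3, padding='valid', input_shape=(10, 8)))
print(toy.output_shape)  # (None, 8, 4): 10 - 3 + 1 = 8 window positions, 4 filters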
IMDB Movie Reviews Sentiment Classification
Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer “3” encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: “only consider the top 10,000 most common words, but eliminate the top 20 most common words”.
As a convention, “0” does not stand for a specific word; it is reserved as the padding character. The start of each review is marked with “1”, and words dropped by the frequency filters (out-of-vocabulary words) are encoded as “2”.
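To see what these indexes mean in practice, the snippet below is a hypothetical helper (not part of this notebook) that rebuilds the text of an encoded review using Keras’ built-in word index; the indexes are shifted by 3 because 0, 1 and 2 are reserved for padding, the start marker and out-of-vocabulary words. Note that imdb.get_word_index() may download the index file if it is not already cached locally.
# Hypothetical helper: decode an encoded review back into words
from keras.datasets import imdb

word_index = imdb.get_word_index()  # maps word -> frequency rank (1 = most frequent)
index_word = {rank + 3: word for word, rank in word_index.items()}  # shift by index_from=3
index_word[0], index_word[1], index_word[2] = '<pad>', '<start>', '<oov>'

def decode_review(encoded_review):
    return ' '.join(index_word.get(i, '<oov>') for i in encoded_review)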
Imports
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.callbacks import EarlyStopping
from keras.datasets import imdb
Load Data
import numpy as np
import json
import warnings

# `_remove_long_seq` is a private Keras helper used by `load_data` below
# to drop sequences longer than `maxlen`
from keras.preprocessing.sequence import _remove_long_seq
def load_data(path='../input/imdb.npz', num_words=None, skip_top=0,
              maxlen=None, seed=113,
              start_char=1, oov_char=2, index_from=3, **kwargs):
    """Loads the IMDB dataset.

    # Arguments
        path: where to cache the data (relative to `~/.keras/dataset`).
        num_words: max number of words to include. Words are ranked
            by how often they occur (in the training set) and only
            the most frequent words are kept.
        skip_top: skip the top N most frequently occurring words
            (which may not be informative).
        maxlen: sequences longer than this will be filtered out.
        seed: random seed for sample shuffling.
        start_char: The start of a sequence will be marked with this character.
            Set to 1 because 0 is usually the padding character.
        oov_char: words that were cut out because of the `num_words`
            or `skip_top` limit will be replaced with this character.
        index_from: index actual words with this index and higher.

    # Returns
        Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.

    # Raises
        ValueError: in case `maxlen` is so low
            that no input sequence could be kept.

    Note that the 'out of vocabulary' character is only used for
    words that were present in the training set but are not included
    because they're not making the `num_words` cut here.
    Words that were not seen in the training set but are in the test set
    have simply been skipped.
    """
    # Legacy support: `nb_words` was renamed to `num_words`
    if 'nb_words' in kwargs:
        warnings.warn('The `nb_words` argument in `load_data` '
                      'has been renamed `num_words`.')
        num_words = kwargs.pop('nb_words')
    if kwargs:
        raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))

    # allow_pickle is required for the object arrays in imdb.npz on newer NumPy versions
    with np.load(path, allow_pickle=True) as f:
        x_train, labels_train = f['x_train'], f['y_train']
        x_test, labels_test = f['x_test'], f['y_test']

    # Shuffle the training and test sets with a fixed seed for reproducibility
    np.random.seed(seed)
    indices = np.arange(len(x_train))
    np.random.shuffle(indices)
    x_train = x_train[indices]
    labels_train = labels_train[indices]

    indices = np.arange(len(x_test))
    np.random.shuffle(indices)
    x_test = x_test[indices]
    labels_test = labels_test[indices]

    xs = np.concatenate([x_train, x_test])
    labels = np.concatenate([labels_train, labels_test])

    # Shift all word indices up by `index_from` and optionally prepend the start character
    if start_char is not None:
        xs = [[start_char] + [w + index_from for w in x] for x in xs]
    elif index_from:
        xs = [[w + index_from for w in x] for x in xs]

    # Drop reviews longer than `maxlen`
    if maxlen:
        xs, labels = _remove_long_seq(maxlen, xs, labels)
        if not xs:
            raise ValueError('After filtering for sequences shorter than maxlen=' +
                             str(maxlen) + ', no sequence was kept. '
                             'Increase maxlen.')
    if not num_words:
        num_words = max([max(x) for x in xs])

    # by convention, use 2 as OOV word
    # reserve 'index_from' (=3 by default) characters:
    # 0 (padding), 1 (start), 2 (OOV)
    if oov_char is not None:
        xs = [[w if (skip_top <= w < num_words) else oov_char for w in x]
              for x in xs]
    else:
        xs = [[w for w in x if skip_top <= w < num_words]
              for x in xs]

    # Split back into the original train/test partition
    idx = len(x_train)
    x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
    x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])

    return (x_train, y_train), (x_test, y_test)
n_words = 1000
(X_train, y_train), (X_test, y_test) = load_data(num_words=n_words)
print('Train seq: {}'.format(len(X_train)))
print('Test seq: {}'.format(len(X_test)))
print('Train example: \n{}'.format(X_train[0]))
print('\nTest example: \n{}'.format(X_test[0]))
Padding Sequences
# Pad sequences with max_len
max_len = 200
X_train = sequence.pad_sequences(X_train, maxlen=max_len)
X_test = sequence.pad_sequences(X_test, maxlen=max_len)
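As a quick illustration of what pad_sequences does (assuming its default behaviour, where both padding and truncation happen at the front), a short sequence is left-padded with zeros and a long one loses its earliest entries:
# Illustration only: pad_sequences pads and truncates at the front by default
print(sequence.pad_sequences([[1, 2, 3]], maxlen=5))           # [[0 0 1 2 3]]
print(sequence.pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5))  # [[2 3 4 5 6]]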
Creating Network Architecture
# Define network architecture and compile
model = Sequential()
model.add(Embedding(n_words, 50, input_length=max_len))
model.add(Dropout(0.5))
model.add(Conv1D(128, 3, padding='valid', activation='relu', strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(250, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
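Worked out by hand from the layer settings above (so the exact summary printout may differ slightly), the tensor shapes flow through the network roughly as follows:
# Expected output shapes, derived from the layer settings above (batch size shown as None):
#   Embedding(1000, 50, input_length=200) -> (None, 200, 50)
#   Dropout(0.5)                          -> (None, 200, 50)
#   Conv1D(128, 3, padding='valid')       -> (None, 198, 128)   # 200 - 3 + 1 window positions
#   GlobalMaxPooling1D()                  -> (None, 128)
#   Dense(250, activation='relu')         -> (None, 250)
#   Dropout(0.5)                          -> (None, 250)
#   Dense(1, activation='sigmoid')        -> (None, 1)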
Training
callbacks = [EarlyStopping(monitor='val_acc', patience=3)]
batch_size = 128
n_epochs = 3
model.fit(X_train, y_train, batch_size=batch_size, epochs=n_epochs, validation_split=0.2, callbacks=callbacks)
Evaluation
print('\nAccuracy on test set: {}'.format(model.evaluate(X_test, y_test)[1]))