Introduction

Convolutional Neural Networks (CNNs) are typically used with image data; however, they can also be applied to other types of input. In this code example, we will apply a CNN to textual data. More specifically, we will use the structure of CNNs to classify text. Unlike 2D images, text is one-dimensional, so we will use 1D convolutional layers. The Keras framework makes it easy to pre-process the input data.
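To make the shape difference concrete, here is a minimal sketch (toy shapes only, not the model built later in this post). A Conv1D layer consumes batches of shape (samples, timesteps, features), whereas image convolutions consume (samples, height, width, channels):

import numpy as np
from keras.models import Sequential
from keras.layers import Conv1D

toy = Sequential()
toy.add(Conv1D(4, 3, input_shape=(10, 8)))  # 10 timesteps, 8 features per step
print(toy.predict(np.random.rand(2, 10, 8)).shape)  # (2, 8, 4): 'valid' padding shrinks 10 steps to 8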

IMDB Movie Reviews Sentiment Classification

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer “3” encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: “only consider the top 10,000 most common words, but eliminate the top 20 most common words”.

As a convention, “0” does not stand for a specific word, but instead is used to encode any unknown word.
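As an illustration of this encoding, the following snippet maps an encoded review back to words. imdb.get_word_index() is part of Keras; the +3 offset mirrors the index_from=3 convention used by the loader below, which reserves 0 for padding, 1 for start, and 2 for out-of-vocabulary:

from keras.datasets import imdb

word_index = imdb.get_word_index()
inv_index = {v + 3: k for k, v in word_index.items()}
inv_index.update({0: '<PAD>', 1: '<START>', 2: '<OOV>'})

def decode_review(encoded):
    # Indices without a known word fall back to <OOV>
    return ' '.join(inv_index.get(i, '<OOV>') for i in encoded)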

Imports

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.callbacks import EarlyStopping
from keras.datasets import imdb

Load Data

import numpy as np
import warnings

# _remove_long_seq is the private Keras helper the original loader relies on
from keras.preprocessing.sequence import _remove_long_seq

def load_data(path='../input/imdb.npz', num_words=None, skip_top=0,
              maxlen=None, seed=113,
              start_char=1, oov_char=2, index_from=3, **kwargs):
    """Loads the IMDB dataset.

    # Arguments
        path: where to cache the data (relative to `~/.keras/dataset`).
        num_words: max number of words to include. Words are ranked
            by how often they occur (in the training set) and only
            the most frequent words are kept.
        skip_top: skip the top N most frequently occurring words
            (which may not be informative).
        maxlen: sequences longer than this will be filtered out.
        seed: random seed for sample shuffling.
        start_char: the start of a sequence will be marked with this character.
            Set to 1 because 0 is usually the padding character.
        oov_char: words that were cut out because of the `num_words`
            or `skip_top` limit will be replaced with this character.
        index_from: index actual words with this index and higher.

    # Returns
        Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.

    # Raises
        ValueError: in case `maxlen` is so low
            that no input sequence could be kept.

    Note that the 'out of vocabulary' character is only used for
    words that were present in the training set but are not included
    because they're not making the `num_words` cut here.
    Words that were not seen in the training set but are in the test set
    have simply been skipped.
    """
    # Legacy support
    if 'nb_words' in kwargs:
        warnings.warn('The `nb_words` argument in `load_data` '
                      'has been renamed `num_words`.')
        num_words = kwargs.pop('nb_words')
    if kwargs:
        raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))

    # allow_pickle is required on recent NumPy because the arrays hold Python lists
    with np.load(path, allow_pickle=True) as f:
        x_train, labels_train = f['x_train'], f['y_train']
        x_test, labels_test = f['x_test'], f['y_test']

    # Shuffle both splits with a fixed seed for reproducibility
    np.random.seed(seed)
    indices = np.arange(len(x_train))
    np.random.shuffle(indices)
    x_train = x_train[indices]
    labels_train = labels_train[indices]

    indices = np.arange(len(x_test))
    np.random.shuffle(indices)
    x_test = x_test[indices]
    labels_test = labels_test[indices]

    xs = np.concatenate([x_train, x_test])
    labels = np.concatenate([labels_train, labels_test])

    if start_char is not None:
        xs = [[start_char] + [w + index_from for w in x] for x in xs]
    elif index_from:
        xs = [[w + index_from for w in x] for x in xs]

    if maxlen:
        xs, labels = _remove_long_seq(maxlen, xs, labels)
        if not xs:
            raise ValueError('After filtering for sequences shorter than maxlen=' +
                             str(maxlen) + ', no sequence was kept. '
                             'Increase maxlen.')
    if not num_words:
        num_words = max([max(x) for x in xs])

    # by convention, use 2 as OOV word
    # reserve 'index_from' (=3 by default) characters:
    # 0 (padding), 1 (start), 2 (OOV)
    if oov_char is not None:
        xs = [[w if (skip_top <= w < num_words) else oov_char for w in x]
              for x in xs]
    else:
        xs = [[w for w in x if skip_top <= w < num_words]
              for x in xs]

    idx = len(x_train)
    x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
    x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])

    return (x_train, y_train), (x_test, y_test)

n_words = 1000
(X_train, y_train), (X_test, y_test) = load_data(num_words=n_words)
print('Train seq: {}'.format(len(X_train)))
print('Test seq: {}'.format(len(X_test)))
print('Train example: \n{}'.format(X_train[0]))
print('\nTest example: \n{}'.format(X_test[0]))

Padding Sequences

# Pad sequences with max_len
max_len = 200
X_train = sequence.pad_sequences(X_train, maxlen=max_len)
X_test = sequence.pad_sequences(X_test, maxlen=max_len)
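A quick toy example of what pad_sequences does with its defaults (padding='pre', truncating='pre'): shorter sequences are left-padded with zeros and longer ones are truncated from the front, so every row ends up with length maxlen:

print(sequence.pad_sequences([[1, 2, 3], [4, 5]], maxlen=4))
# [[0 1 2 3]
#  [0 0 4 5]]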

Creating Network Architecture

# Define network architecture and compile
model = Sequential()
model.add(Embedding(n_words, 50, input_length=max_len))
model.add(Dropout(0.5))
model.add(Conv1D(128, 3, padding='valid', activation='relu', strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(250, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
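As a sanity check on the model.summary() output, the parameter counts can be derived by hand: the Embedding layer stores 1,000 × 50 = 50,000 weights; the Conv1D layer has 128 filters of size 3 × 50 plus one bias each, i.e. 128 × 150 + 128 = 19,328; the first Dense layer has 128 × 250 + 250 = 32,250 (GlobalMaxPooling1D reduces each of the 128 feature maps to a single value); and the output layer has 250 + 1 = 251, for 101,829 trainable parameters in total.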

Training

callbacks = [EarlyStopping(monitor='val_acc', patience=3)]
batch_size = 128
n_epochs = 3
model.fit(X_train, y_train, batch_size=batch_size, epochs=n_epochs, validation_split=0.2, callbacks=callbacks)
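One optional variant (a sketch, not part of the original post): add a ModelCheckpoint callback alongside EarlyStopping so the weights from the best validation-accuracy epoch are kept on disk; 'weights.best.h5' is a hypothetical filename.

from keras.callbacks import ModelCheckpoint

callbacks = [
    EarlyStopping(monitor='val_acc', patience=3),
    # 'weights.best.h5' is a hypothetical path; save_best_only keeps only the best epoch
    ModelCheckpoint('weights.best.h5', monitor='val_acc', save_best_only=True),
]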

Evaluation

print('\nAccuracy on test set: {}'.format(model.evaluate(X_test, y_test)[1]))

By Hassan Amin

Dr. Syed Hassan Amin holds a Ph.D. in Computer Science from Imperial College London, United Kingdom, and an MS in Computer System Engineering from GIKI, Pakistan. During his Ph.D., he worked on Image Processing, Computer Vision, and Machine Learning. He has done research and development in many areas, including Urdu and local language Optical Character Recognition, Retail Analysis, Affiliate Marketing, Fraud Prediction, 3D reconstruction of face images from 2D images, and Retinal Image analysis.