Introduction
Convolutional Neural Networks (CNNs) are typically used with image data, but they can also be applied to other types of input. In this code example, we use the structure of a CNN to classify text. Unlike 2D images, text is 1D input data, so we will be using 1D convolutional layers. The Keras framework makes it easy to pre-process the input data.
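To make the 1D idea concrete, the small sketch below (illustration only, using made-up toy shapes that are not part of this example) builds a single Conv1D layer on a dummy input of 10 time steps with 8 features each; with a kernel size of 3 and “valid” padding, the filters slide over 10 - 3 + 1 = 8 positions.
# Illustration only: a Conv1D layer slides its filters along the time dimension
# of a (batch, steps, features) input. Toy shapes, not part of this example.
from keras.models import Sequential
from keras.layers import Conv1D

toy = Sequential()
toy.add(Conv1D(filters=4, kernel_size=3, padding='valid', input_shape=(10, 8)))
print(toy.output_shape)  # (None, 8, 4): 10 - 3 + 1 = 8 window positions, 4 filters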
IMDB Movie Reviews Sentiment Classification
Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer “3” encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: “only consider the top 10,000 most common words, but eliminate the top 20 most common words”.
As a convention, “0” does not stand for a specific word; it is reserved as the padding character. The start of each review is marked with “1”, and words dropped by the frequency filters (out-of-vocabulary words) are encoded as “2”.
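To see what these indexes mean in practice, the snippet below is a hypothetical helper (not part of this notebook) that rebuilds the text of an encoded review using Keras’ built-in word index; the indexes are shifted by 3 because 0, 1 and 2 are reserved for padding, the start marker and out-of-vocabulary words. Note that imdb.get_word_index() may download the index file if it is not already cached locally.
# Hypothetical helper: decode an encoded review back into words
from keras.datasets import imdb

word_index = imdb.get_word_index()  # maps word -> frequency rank (1 = most frequent)
index_word = {rank + 3: word for word, rank in word_index.items()}  # shift by index_from=3
index_word[0], index_word[1], index_word[2] = '<pad>', '<start>', '<oov>'

def decode_review(encoded_review):
    return ' '.join(index_word.get(i, '<oov>') for i in encoded_review)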
Imports
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.callbacks import EarlyStopping
from keras.datasets import imdb
Load Data
import numpy as np
import json
import warnings

# `_remove_long_seq` is a private Keras helper used by `load_data` below
# to drop sequences longer than `maxlen`
from keras.preprocessing.sequence import _remove_long_seq
def load_data(path='../input/imdb.npz', num_words=None, skip_top=0,
              maxlen=None, seed=113,
              start_char=1, oov_char=2, index_from=3, **kwargs):
    """Loads the IMDB dataset.

    # Arguments
        path: where to cache the data (relative to `~/.keras/dataset`).
        num_words: max number of words to include. Words are ranked
            by how often they occur (in the training set) and only
            the most frequent words are kept.
        skip_top: skip the top N most frequently occurring words
            (which may not be informative).
        maxlen: sequences longer than this will be filtered out.
        seed: random seed for sample shuffling.
        start_char: The start of a sequence will be marked with this character.
            Set to 1 because 0 is usually the padding character.
        oov_char: words that were cut out because of the `num_words`
            or `skip_top` limit will be replaced with this character.
        index_from: index actual words with this index and higher.

    # Returns
        Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.

    # Raises
        ValueError: in case `maxlen` is so low
            that no input sequence could be kept.

    Note that the 'out of vocabulary' character is only used for
    words that were present in the training set but are not included
    because they're not making the `num_words` cut here.
    Words that were not seen in the training set but are in the test set
    have simply been skipped.
    """
    # Legacy support: `nb_words` was renamed to `num_words`
    if 'nb_words' in kwargs:
        warnings.warn('The `nb_words` argument in `load_data` '
                      'has been renamed `num_words`.')
        num_words = kwargs.pop('nb_words')
    if kwargs:
        raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))

    # allow_pickle is required for the object arrays in imdb.npz on newer NumPy versions
    with np.load(path, allow_pickle=True) as f:
        x_train, labels_train = f['x_train'], f['y_train']
        x_test, labels_test = f['x_test'], f['y_test']

    # Shuffle the training and test sets with a fixed seed for reproducibility
    np.random.seed(seed)
    indices = np.arange(len(x_train))
    np.random.shuffle(indices)
    x_train = x_train[indices]
    labels_train = labels_train[indices]

    indices = np.arange(len(x_test))
    np.random.shuffle(indices)
    x_test = x_test[indices]
    labels_test = labels_test[indices]

    xs = np.concatenate([x_train, x_test])
    labels = np.concatenate([labels_train, labels_test])

    # Shift all word indices up by `index_from` and optionally prepend the start character
    if start_char is not None:
        xs = [[start_char] + [w + index_from for w in x] for x in xs]
    elif index_from:
        xs = [[w + index_from for w in x] for x in xs]

    # Drop reviews longer than `maxlen`
    if maxlen:
        xs, labels = _remove_long_seq(maxlen, xs, labels)
        if not xs:
            raise ValueError('After filtering for sequences shorter than maxlen=' +
                             str(maxlen) + ', no sequence was kept. '
                             'Increase maxlen.')
    if not num_words:
        num_words = max([max(x) for x in xs])

    # by convention, use 2 as OOV word
    # reserve 'index_from' (=3 by default) characters:
    # 0 (padding), 1 (start), 2 (OOV)
    if oov_char is not None:
        xs = [[w if (skip_top <= w < num_words) else oov_char for w in x]
              for x in xs]
    else:
        xs = [[w for w in x if skip_top <= w < num_words]
              for x in xs]

    # Split back into the original train/test partition
    idx = len(x_train)
    x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
    x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])

    return (x_train, y_train), (x_test, y_test)
n_words = 1000
(X_train, y_train), (X_test, y_test) = load_data(num_words=n_words)
print('Train seq: {}'.format(len(X_train)))
print('Test seq: {}'.format(len(X_test)))
print('Train example: \n{}'.format(X_train[0]))
print('\nTest example: \n{}'.format(X_test[0]))
Padding Sequences
# Pad sequences with max_len
max_len = 200
X_train = sequence.pad_sequences(X_train, maxlen=max_len)
X_test = sequence.pad_sequences(X_test, maxlen=max_len)
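As a quick illustration of what pad_sequences does (assuming its default behaviour, where both padding and truncation happen at the front), a short sequence is left-padded with zeros and a long one loses its earliest entries:
# Illustration only: pad_sequences pads and truncates at the front by default
print(sequence.pad_sequences([[1, 2, 3]], maxlen=5))           # [[0 0 1 2 3]]
print(sequence.pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5))  # [[2 3 4 5 6]]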
Creating Network Architecture
# Define network architecture and compile
model = Sequential()
model.add(Embedding(n_words, 50, input_length=max_len))
model.add(Dropout(0.5))
model.add(Conv1D(128, 3, padding='valid', activation='relu', strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(250, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
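Worked out by hand from the layer settings above (so the exact summary printout may differ slightly), the tensor shapes flow through the network roughly as follows:
# Expected output shapes, derived from the layer settings above (batch size shown as None):
#   Embedding(1000, 50, input_length=200) -> (None, 200, 50)
#   Dropout(0.5)                          -> (None, 200, 50)
#   Conv1D(128, 3, padding='valid')       -> (None, 198, 128)   # 200 - 3 + 1 window positions
#   GlobalMaxPooling1D()                  -> (None, 128)
#   Dense(250, activation='relu')         -> (None, 250)
#   Dropout(0.5)                          -> (None, 250)
#   Dense(1, activation='sigmoid')        -> (None, 1)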
Training
callbacks = [EarlyStopping(monitor='val_acc', patience=3)]
batch_size = 128
n_epochs = 3
model.fit(X_train, y_train, batch_size=batch_size, epochs=n_epochs, validation_split=0.2, callbacks=callbacks)
Evaluation
print('\nAccuracy on test set: {}'.format(model.evaluate(X_test, y_test)[1]))