Image credit: learning beats blogpost

Multilabel Text Classification using CNN and Bi-LSTM

Convolutional Neural Networks, Bidirectional Long Short-Term Memory

Diwas Tiwari
7 min read · Feb 21, 2021


In recent years, a lot of innovation, advancement, and research has taken place in the field of Natural Language Processing. This research ranges from simple bag-of-words (BOW) approaches to adding context to word vectors.

Text classification is a modelling approach where we have a series of sequences as input and predict the class for each particular sequence. This predictive modelling approach poses a challenge because the input sequences do not have a constant length. This variable length of the sequences leads to a very large vocabulary size, and hence it usually requires the model to learn long-term context.

A lot of algorithms and solutions for binary and multi-class text classification already exist, but in real life a tweet, a sentence, or indeed most problems can be framed as a multi-label classification problem, where a single piece of text may belong to several labels at once (a minimal sketch of what such targets look like follows the list below). Hence, the need arises for a sound AI-driven approach for classifying sentences into multiple labels. This multi-label classification approach finds its use in many major areas, such as:

1- Categorizing genre for movies by OTT platforms.

2- Text Classifications by Banking and financial institutions.

3- Automatic caption generation.
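
To make the target format concrete, here is a minimal sketch (the sentences and label values below are made up purely for illustration, not taken from any dataset): in multi-label classification each sample gets a binary indicator per label, and several indicators can be 1 at the same time, unlike multi-class classification where exactly one class is chosen.

import numpy as np
## Hypothetical multi-label targets for two example sentences ##
example_labels = ["toxic", "obscene", "insult"]
y_example = np.array([[1, 0, 1],    ## sentence 1 is tagged both "toxic" and "insult"
                      [0, 0, 0]])   ## sentence 2 carries none of the labels
print(y_example.shape)              ## (n_samples, n_labels) = (2, 3)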

Here we present a deep learning framework for classifying sentences into various labels. The aim of this article is to familiarize the audience with how CNN and Bi-LSTM networks can be combined to build a multi-label classifier. The dataset we will use for this demonstration is the Toxic Comment Classification Challenge dataset featured in a Kaggle competition. The aim is to classify each comment into labels such as toxic, severe toxic, obscene, threat, and insult. More information on the dataset can be found here.

Dataset Used

We will take a sample of the full dataset for the framework demonstration. For best use of the proposed framework, it is highly recommended that you use the full dataset. The aim of this Kaggle problem is to predict the toxicity of comments: the model takes in a comment, pre-processes it, and gives out the labels to which the comment belongs. Let's load the data!

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
df = pd.read_csv(r"Dataset_toxic_comment\train.csv", encoding='iso-8859-1')
df.info()
Fig:1
df.head(n=5)
Fig:2

As the complete dataset is very large and this article focuses on how to use the designed framework, we will take a fraction of the complete data as a sample and use it for training.

Note: For best results, it is advised to use the complete data with a GPU.

## Taking sample to work on just for framework demonstration ##
df_data = df.sample(frac=0.2, replace=True, random_state=1)
df_data.isnull().sum()
Fig:3

Word Pre-processing

The text in the comments is very dirty and unclean and cannot be used for any classification task as it is. Thus we need to clean it and remove all the stop-words so that our input text is ready for modelling.

## Word Pre-Processing ##
import nltk
import string
wpt = nltk.WordPunctTokenizer()
stop_words_init = nltk.corpus.stopwords.words('english')
stop_words = [i for i in stop_words_init if i not in ('not','and','for')]
print(stop_words)
## Function to normalize text for pre-processing ##
def normalize_text(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', ' ', text)
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    text = re.sub(r'<.*?>+', ' ', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)
    return text
## Apply the written function ##
df_data['comment_text'] = df_data['comment_text'].apply(lambda x: normalize_text(x))
processed_list = []
for j in df_data['comment_text']:
    process = j.replace('...', '')
    processed_list.append(process)

df_processed = pd.DataFrame(processed_list)
df_processed.columns = ['comments']
df_processed.head(n=5)
Fig 4: Processed texts

Label Preparation

Now, once the data is cleaned and ready, it is time to consolidate the labels. After consolidating the labels, and before jumping into model building and classification, it is necessary to check what the various label types are and how many samples fall under each. This is important in multi-label classification to get a feel for the label distribution.

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
targets = df_data[labels].values
import matplotlib.pyplot as plt
val_counts = df_data[labels].sum()
plt.figure(figsize=(8,5))
ax = sns.barplot(x=val_counts.index, y=val_counts.values, alpha=0.8)
plt.title("Labels per Classes")
plt.xlabel("Various Label Type")
plt.ylabel("Counts of the Labels")
rects = ax.patches
labels = val_counts.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height+5, label, ha="center", va="bottom")
plt.show()
Fig:5

We also plot the counts to check for data imbalance within each toxicity label type.

import seaborn as sns
fig, axes = plt.subplots(2, 3, figsize=(10,10), constrained_layout=True)
sns.countplot(ax=axes[0,0],x='toxic',data=df_data )
sns.countplot(ax=axes[0,1],x='severe_toxic',data=df_data)
sns.countplot(ax=axes[0,2],x='obscene',data=df_data)
sns.countplot(ax = axes[1,0],x='threat',data=df_data)
sns.countplot(ax=axes[1,1],x='insult',data=df_data)
sns.countplot(ax=axes[1,2],x='identity_hate',data=df_data)
plt.suptitle('Number Of Labels of each Toxicity Type')
plt.show()
Fig:6

Converting the data to the format required by the framework

X = list(df_processed['comments'])
y_data = df_data[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]]
y = y_data.values

Train Test Split

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.15, train_size=0.85)

Yayyyyyy!!! We are done with train-test data creation and text pre-processing. Let's get to the stuff you came here for :)

CNN Bi-LSTM Modelling

  • Load all the necessary keras libraries.
  • Take the top 10,000 words as features to convert the texts into sequences of integers.
  • Set the maximum length of each sequence to 100.
  • Finally, pad the text sequences to make all the input texts the same length for modelling.
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.layers import Flatten, LSTM
from keras.layers import Bidirectional,GRU,concatenate,SpatialDropout1D
from keras.layers import GlobalMaxPooling1D,GlobalAveragePooling1D,Conv1D
from keras.models import Model
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.layers import Input
from keras.layers.merge import Concatenate
import matplotlib.pyplot as plt
from keras import layers
from keras.optimizers import Adam,SGD,RMSprop
######## Textual Features for Embedding ###################
max_len = 100
max_features = 10000
embed_size = 300
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(x_train)+list(x_test))
x_train = tokenizer.texts_to_sequences(x_train)
x_test= tokenizer.texts_to_sequences(x_test)
x_train = pad_sequences(x_train, padding='post', maxlen=max_len)
x_test = pad_sequences(x_test, padding='post', maxlen=max_len)

Now, the first input layer to the network is the embedding layer, and its weights are to be transferred from an embedding matrix.

So, we will prepare an embedding matrix now…

from numpy import array
from numpy import asarray
from numpy import zeros
embeddings_dictionary = dict()
glove_file = open(".....", encoding="utf8") ## pre-trained or self trained global vectors file ##
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
glove_file.close()
vocab_size = len(tokenizer.word_index) + 1 ## total distinct words is the Vocabulary ##
word_index = tokenizer.word_index
num_words = min(max_features, len(word_index)+1)
embedding_matrix = zeros((num_words, embed_size)) ## has to be similar to glove dimension ##
for word, index in tokenizer.word_index.items():
    if index >= max_features:
        continue
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

Once the embedding matrix is ready, we are all set to define the last part of the framework. Congratulations!!! You have made it to the final stage…

  • The first layer is the embedding layer, which uses 300-length vectors with weights derived from the designed embedding matrix.
  • Then comes the spatial dropout layer, which performs dropout variationally by dropping entire 1D feature maps.
  • The next layer is a convolutional layer for extracting higher-level 1D features from the input sequence.
  • The next layer is a Bi-LSTM layer with 128 units.
  • Then we have a global average pooling layer.
  • Finally, we have a dense layer followed by a simple dropout layer, closing with another dense layer whose number of units equals the number of labels. Each output unit uses a sigmoid activation so that the labels are predicted independently, which is why the model is compiled with binary cross-entropy loss.

The sample diagrammatic representation of the architecture is given below:

Fig 7: Architectural Flow of CNN Bi-LSTM Framework
sequence_input = Input(shape=(max_len, ))
x = Embedding(max_features, embed_size, weights=[embedding_matrix],trainable = False)(sequence_input)
x = SpatialDropout1D(0.2)(x) ## mostly drops entire 1D feature maps rather than individual elements ##
x = Conv1D(64, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform")(x)
x = Bidirectional(LSTM(128, return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)
avg_pool = GlobalAveragePooling1D()(x)
x = Dense(128, activation='relu')(avg_pool)
x = Dropout(0.1)(x)
preds = Dense(6, activation="sigmoid")(x)
model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',optimizer=RMSprop(lr=1e-3),metrics=['accuracy'])
print(model.summary())
Fig 8: Model Summary
history = model.fit(x_train, y_train, batch_size=128, epochs=5,
verbose=1, validation_split=0.2)
model.save_weights("./BiLSTM_ver1.h5")
Fig 9: Training for 5 epochs.
score = model.evaluate(x_test, y_test, verbose=1)
Fig 10: Test accuracy after only 5 epochs
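
Since each of the six output units is a sigmoid, the model produces one independent probability per label. As a minimal inference sketch (not part of the original code; the 0.5 threshold is an assumption and could be tuned per label), predictions for new comments could be obtained like this:

## Hypothetical inference sketch: threshold the per-label sigmoid outputs ##
pred_probs = model.predict(x_test)               ## shape (n_samples, 6), one probability per label
pred_labels = (pred_probs >= 0.5).astype(int)    ## assumed 0.5 threshold; tune per label if needed
print(pred_labels[:5])                           ## each row is a binary vector over the six labels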

Finally, plotting the curves for accuracy and loss of model training.

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()
Fig 11: Model accuracy curve after 5 epochs
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()
Fig 12: Model loss curve after 5 epochs

I hope you enjoyed the contents of the blog. This was an informative introduction to how multi-label classification can be handled using conventional AI-driven techniques with some modifications. For the complete code on the subject, please refer to my GitHub link below:

Do leave claps if you appreciate the content and efforts :) !!!!

