Unraveling the Mysteries of the AG Data Set Through NLP

Marcus Hilliard
20 min read · May 17, 2021


In this work, we examine several different neural network architectures for classifying news articles from the AG dataset. We show through a series of experiments that the optimum number of word tokens varies with the model's foundations (i.e., the structure of the input data). We then build and test several neural networks designed for natural language processing (NLP), choose the optimum model, and provide recommendations for implementation.

In this work, we explore the realm of multilayer perceptrons (MLP), convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory (LSTM) networks, and gated recurrent unit (GRU) networks, conducting various experiments to elucidate the classification of news articles through several different architectures and to compare the performance of several deep learning models.

Document Vectorization

The Term Frequency-Inverse Document Frequency (TF-IDF) approach gives a weight to each word in a text document based on how unique it is across the corpus as a whole. This helps to capture the relevancy among words, documents, and categories (Khan et al. 2010). TF-IDF is one of the most common approaches for document processing in NLP. It allows researchers to determine which words are significant within a document without being overused across the corpus as a whole. Researchers have used TF-IDF to classify hate speech on Twitter (Ayo et al. 2021) and to improve recall and precision in text classification compared to the traditional approach (Yun-tao, Ling, and Yong-cheng 2005).
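As an illustration, here is a minimal TF-IDF sketch using scikit-learn's TfidfVectorizer; the three-document toy corpus below is purely illustrative and is not drawn from the AG data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only; the real study vectorizes the AG articles.
docs = [
    "stocks rally as markets rebound",
    "the home team wins the championship in overtime",
    "new chip promises faster computing",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # unigrams only
tfidf = vectorizer.fit_transform(docs)            # sparse docs x terms matrix

# Print the highest-weighted term in each document.
terms = np.array(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn
for i, row in enumerate(tfidf.toarray()):
    print(f"doc {i}: {terms[row.argmax()]} ({row.max():.3f})")
```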

The bag-of-words model in NLP is one of the most popular ways of vectorizing documents (Brownlee 2017), and it also happens to be the vectorization of choice for the topic modeling algorithm at the heart of this study. Bag-of-words is used to extract features from text for modeling. For every word or n-gram in the corpus, the model records a measure of that word's presence. The model takes these words and essentially throws them into a “bag”, losing their order and structure in the process but retaining knowledge of their frequency. One famous use case for bag-of-words is in training robots to better understand rooms within an apartment (Filliat 2007).
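A minimal bag-of-words sketch with scikit-learn's CountVectorizer, again on an illustrative toy corpus: word order is discarded, but per-document term counts are kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog chased the cat", "the cat slept"]

bow = CountVectorizer()
counts = bow.fit_transform(docs)   # sparse documents x vocabulary matrix
print(bow.vocabulary_)             # term -> column index mapping
print(counts.toarray())            # raw term counts per document
```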

Clustering

K-means clustering is an algorithm for grouping similar data points together to discover patterns, guided by the desired number of clusters, k. Within NLP, the algorithm uses the mean of the vectorized data to determine which documents fall in which cluster. Researchers have developed a global k-means clustering algorithm that does not depend on initial parameter values and uses k-means as a local search procedure (Likas, Vlassis, and Verbeek 2003). Researchers have also used k-means clustering to understand how the sleep schedules of youth athletes differ between multi-sport and individual-sport athletes (Suppiah et al. 2021).

Topic Mapping

Latent Dirichlet Allocation (LDA) is a form of topic modeling and one of the most popular ways to build topic maps. It is also the focus of this research. The algorithm is a generative model for collections of discrete data, which is useful for NLP and text corpora because it allows the researcher to find underlying themes across a corpus (Kim et al. 2019). Once LDA has been trained, it yields two outputs: a term distribution per topic and a topic distribution per document. Both will be part of the analysis that follows. Researchers have used LDA as a spam filter (Bíró, Szabó, and Benczúr 2008) and to improve document classification using labels, even though LDA is unsupervised and does not traditionally use labels (Wang, Thint, and Al-Rubaie 2012).

Data

For this work, we used the AG database (Antonio Gulli), which includes more than 1 million news articles from more than 2,000 news sources. The AG dataset was constructed by choosing the four largest classes (i.e., world, sports, business, and science/technology) from the original corpus (Zhang, Zhao, and LeCun 2015); the resulting dataset includes 127,600 news articles (i.e., 31,900 articles per class).

Research Design and Modeling Methods

In this work, we use several artificial neural network architectures (MLP, CNN, RNN, LSTM, GRU, and bidirectional LSTM). All simulations were performed using Jupyter notebooks within Google Colaboratory running Python version 3.7.10, Tensorflow version 2.4.1, Keras version 2.4.3, various visualization libraries (i.e., Matplotlib version 3.2.2, Seaborn version 0.11.1, and Plotly version 4.4.1), data structure/analysis libraries (i.e., Pandas version 1.1.5 and Numpy version 1.19.5), and various natural language processing libraries (i.e., NLTK version 3.2.5, SKLearn version 0.22.2, and Gensim version 3.8.3).

Exploratory Data Analysis

Before conducting our analysis, we had to pre-process our data, relying heavily on Python and its NLTK library for these steps. Beginning with our document corpus, we used NLTK's tokenizer to separate the words in each document into tokens, removed punctuation and special characters, expanded contractions, removed non-alphabetic characters, lowercased all letters, and removed terms shorter than three characters. Figure 1 provides an example of the text cleaning process that was applied to the corpus. At each step, you can see how each method was applied to achieve the resulting text.
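A condensed sketch of this cleaning pipeline is shown below; the exact order of operations in the original notebooks may differ, and the contraction-expansion step is omitted here for brevity.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def clean_document(text, min_len=3):
    """Tokenize, lowercase, keep alphabetic characters only, and drop
    stop words and tokens shorter than min_len characters."""
    tokens = word_tokenize(text.lower())
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]
    return [t for t in tokens if len(t) >= min_len and t not in STOP_WORDS]

print(clean_document("Wall St. Bears Claw Back Into the Black (Reuters)"))
```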

Figure 1. Example of preprocessing from the AG Database

As a result of the cleaning process, the total number of tokens was dramatically affected. Table 1 shows the effect of each step on the maximum number of tokens, the average number of tokens, the total number of tokens, and the number of tokens gained or removed from the corpus. Figure 2 illustrates how each pre-processing step affected the distribution of token counts. The largest shift in the distributions occurred between the Nonsense and Stop word pre-processing steps.

Table 1. Sequential Corpus Preprocessing Results
Figure 2. Preprocessing statistics from the AG Database. Red vertical line equals 33 tokens.

Following the pre-processing of the corpus, we explored the corpus and the terms it contains using two algorithms: TF-IDF and bag-of-words.

The TF-IDF algorithm scores every term in each document by considering the term frequency and how uncommon the term is across the corpus; for a term to receive a high TF-IDF score in a given document, it needs to both appear with some frequency in that document and not be so common throughout the corpus that its appearance in the document becomes insignificant. After compiling a term list consisting of all unigrams in our corpus documents, we scored each term using the TF-IDF vectorizer in Python's sklearn library. We used this algorithm to get a sense of which terms were important in each document; the highest scoring terms are the ones that ought to drive a given document's topic mapping through the use of different clustering algorithms (i.e., agglomerative hierarchical or k-means).

Agglomerative hierarchical clustering is a technique in which each data point is initially considered an individual cluster. Through each iteration, similar clusters merge with other clusters until one cluster, or K clusters, are formed. Hierarchical clustering is an unsupervised clustering technique because the algorithm does not require the number of clusters a priori; the optimal number of clusters is determined with a dendrogram, which is used to visualize the candidate clusterings. Unfortunately, we were not able to perform agglomerative hierarchical clustering due to the amount of memory (roughly 121 gigabytes) that would be required to perform the calculation, given the size of the TF-IDF proximity matrix (127,600 rows by 31,825 columns).

The K-Means algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing the inertia, or within-cluster sum-of-squares. Unlike hierarchical clustering, K-Means requires the number of clusters to be specified in advance. During clustering, the K-means algorithm first randomly chooses initial centroids. The algorithm then loops between two steps. The first step assigns each sample to its nearest centroid. New centroids are then determined by taking the mean value of all of the samples assigned to each previous centroid. The difference between the old and the new centroids is computed, and the algorithm repeats until the new centroids are no longer moving significantly.

Due to the large size of the corpus, we used the Mini Batch K-Means algorithm. Mini Batch K-Means is a variation on the original K-Means algorithm that uses randomly sampled mini-batches to reduce the computation time. The Mini Batch K-Means algorithm iterates between two major steps, similar to the original k-means algorithm. The first step assigns each sample to its nearest centroid. New centroids are then determined by taking the mean value of the samples assigned to each previous centroid. In contrast to k-means, this is done on a per-sample basis: for each sample in the mini-batch, the assigned centroid is updated by taking the running average of that sample and all previous samples assigned to the centroid. These steps are performed until convergence or until a predetermined number of iterations is reached.
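A sketch of this clustering step is shown below, assuming `documents` holds the cleaned article strings; the vectorizer settings and batch size are assumptions, and k is set to four to match the dataset's ground-truth classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# `documents` is assumed to be the list of cleaned AG article strings.
vectorizer = TfidfVectorizer(min_df=5)
X = vectorizer.fit_transform(documents)     # sparse TF-IDF proximity matrix

km = MiniBatchKMeans(n_clusters=4, batch_size=1024, random_state=42)
labels = km.fit_predict(X)

# Top 10 terms per cluster, ranked by centroid weight.
terms = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
order = km.cluster_centers_.argsort()[:, ::-1]
for c in range(4):
    print(f"cluster {c}:", [terms[i] for i in order[c, :10]])
```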

Based on the dataset's ground truth, the number of clusters, one per class, equals four. Using the Mini Batch K-Means algorithm, we were able to classify the corpus based on the TF-IDF proximity matrix into four clusters. We then used t-SNE to reduce the dimensionality of the TF-IDF proximity matrix to two dimensions for easier visualization. The 2D t-SNE representation of the TF-IDF proximity matrix is shown in Figure 3 and illustrates very minimal separation of the respective clusters. This is further confirmed by examining the top 10 terms in each cluster identified by the K-Means algorithm, as shown in Tables 2, 3, and 4.
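One common way to build this 2D view, offered here as an assumption about the original workflow rather than a confirmed step, is to first compress the sparse TF-IDF matrix with truncated SVD and then run t-SNE on the reduced representation.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# X and labels are assumed from the Mini Batch K-Means sketch above.
svd = TruncatedSVD(n_components=50, random_state=42)
X_reduced = svd.fit_transform(X)            # dense 50-dimensional representation

tsne = TSNE(n_components=2, random_state=42)
coords = tsne.fit_transform(X_reduced)      # (n_documents, 2) coordinates

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=2, cmap="tab10")
plt.title("t-SNE projection of TF-IDF vectors by K-Means cluster")
plt.show()
```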

Figure 3. Corpus representation using dimensional reduction with K-Means cluster differentiation

By increasing the minimum character limit from two to three to four, we were able to improve the clustering.

Table 2. K-Means most common terms (minimum characters per token >= 2)
Table 3. K-Means most common terms (minimum characters per token >= 3)
Table 4. K-Means most common terms (minimum characters per token >= 4)

This is where the power of topic mapping with LDA, built on word vectors from the bag-of-words algorithm, comes into play for classifying text into particular topics. We can then perform a series of experiments, varying the minimum character limit and adding stop words, to refine our topic classes. For each of these experiments, we vectorized each document using TF-IDF vectorization and, to keep the vector dimensionality consistent with the bag-of-words vectors, we focused on unigrams. The bag-of-words vectorization, in turn, served as the input into our Latent Dirichlet Allocation (LDA) topic mapping algorithm.
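A sketch of the bag-of-words-to-LDA pipeline with Gensim is shown below, assuming `token_lists` holds one list of cleaned tokens per article; `extra_stops` is where each mini-experiment's additional stop words (e.g., “say”) would be appended, and the dictionary filtering thresholds are assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# `token_lists` is assumed: one list of cleaned tokens per AG article.
extra_stops = {"say"}                        # grows with each mini-experiment
token_lists = [[t for t in doc if t not in extra_stops] for doc in token_lists]

dictionary = Dictionary(token_lists)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # assumed frequency trim
corpus = [dictionary.doc2bow(doc) for doc in token_lists]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4,
               passes=5, random_state=42)

print(lda.show_topics(num_words=10))         # term distribution per topic
print(lda.get_document_topics(corpus[0]))    # topic distribution per document
```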

Based on the ground truth, we know that each class is balanced in the dataset, containing 31,900 articles per class. But after running our initial LDA experiment, the number of documents by cluster is unbalanced. Figure 4 illustrates that cluster 1 contains the smallest number of documents whereas cluster 3 contains the greatest number of documents. Figure 4 also shows word clouds for the top ten terms in each cluster. Based on the ground truth, we know that the four classes are science/technology, world, business, and sports. Based on the word clouds, we see that each topic appears to be aligning with one of the ground truth classes: Topic 1 appears to be associated with business, Topic 2 with science/technology, Topic 3 with world, and Topic 4 with sports. On the other hand, we see that the word “say” appears in topics 1, 2, and 3.

Figure 4. LDA Topic Probability Distribution and Word Cloud Associations Using Default Stop Words. Horizontal line equals 31,900 documents.

For our first mini-experiment, we added the word “say” as an additional stop word. As a result, the number of documents in topics 2 and 3 swapped places compared to the word clouds in Figure 4.

Figure 5. LDA Topic Probability Distribution and Word Cloud Associations Using Added Stop Words. Horizontal line equals 31,900 documents.

In our second mini-experiment, removing the additional stop words “say” and “reuters” from the corpus produced a dramatic shift in the topic distribution, as shown in Figure 6.

Figure 6. LDA Topic Probability Distribution and Word Cloud Associations Using Added Stop Words. Horizontal line equals 31,900 documents.

In our third mini-experiment, removing the additional stop words “say” and “monday” produced another dramatic shift in the topic distribution, as shown in Figure 7. We see that the word “wednesday” is prominent in topics 1 and 4.

Figure 7. LDA Topic Probability Distribution and Word Cloud Associations Using Added Stop Words. Horizontal line equals 31,900 documents.

In our fourth mini-experiment, removing the additional stop words “say”, “monday”, and “wednesday” produced another dramatic shift in the topic distribution, as shown in Figure 8.

Figure 8. LDA Topic Probability Distribution and Word Cloud Associations Using Added Stop Words. Horizontal line equals 31,900 documents.

In our fifth mini-experiment, removing the additional stop words “say”, “monday”, “wednesday”, and “sunday” produced another dramatic shift in the topic distribution, as shown in Figure 9.

Figure 9. LDA Topic Probability Distribution and Word Cloud Associations Using Added Stop Words. Horizontal line equals 31,900 documents.

Figure 10 shows that each cluster has a majority of its documents in a single topic.

Figure 10. LDA Topic Probability Distribution Using Added Stop Words. Horizontal line equals 31,900 documents.

As one can see, the LDA results are extremely sensitive to the choices made in preparing our reference term vectors. As we walked through the various mini-experiments, the LDA model results varied greatly, with dramatic fluctuations, as our incremental feature engineering steps produced document vectorizations more strongly suited to topic extraction. In particular, the biggest gains in model performance were realized by excluding certain words altogether from the vector of reference terms. Earlier, we discussed how TF-IDF penalizes terms that are too common throughout the corpus because that commonality makes their appearance in specific documents less significant. The LDA model, though, does not use TF-IDF vectors as its input; LDA instead uses the bag-of-words model for developing its input corpus. Thus, unlike TF-IDF-based models, LDA is susceptible to influence from common yet relatively empty tokens. While this hurts the model's ability to base its probabilistic allocations on the corpus's more information-rich terms, it also presents the modeler with a real opportunity to improve performance by taking the time to remove them manually, effectively applying a document frequency penalty to a bag-of-words-based topic model.

In addition, because the algorithm returns both a term distribution per topic and a topic distribution per document, each term does not simply point to a single topic but rather probabilistically points to multiple, forming a bridge between topics. And because the topics are not unique within documents, they in turn form bridges between the documents. Thus, when we change even one input term, by removing it, the effect ripples downstream to the other terms it is linked to through the topic distributions and to the documents those topics were linked to, as all of the probabilities are forced to redistribute and reallocate themselves.

Figure 11, a network graph of the four topics and the dominant terms in their term distributions, demonstrates this interconnected nature of our final LDA model. In doing so, it also helps to illustrate why each of the specific decisions made in input vector preparation is so important to the LDA model in particular, as this networked effect means that a change to any word necessarily impacts many others as well. This probabilistic interconnectedness has its advantages, too. While the network graph speaks to LDA's term distribution per topic, it is the model's topic distribution per document that can inform our interpretation of the relationships between topics.

Figure 11. This topic-word network map is meant to demonstrate the interconnected web that a topic map represents, where all the terms and topics are linked to one another.

LDA topic modeling is a powerful statistical approach for organizing and understanding the relationships between topics and documents in a corpus. We have explored possible driving forces behind what truly motivates this model and why the connection between data preparation and model performance is both so strong and so volatile.

Training, Validating, and Testing Datasets

Before we dive into building models, we first split the dataset into two main data sets (i.e., training and testing), keeping 80 percent of the data for training and holding out 20 percent for testing. Since we are going to be creating multiple models, we also split the full training dataset into two subsets (i.e., reduced training and holdout validation). The holdout validation dataset will be used to evaluate several candidate models and select the best one. After this holdout validation process, we will train the best model on the full training dataset to produce the final model. Lastly, we will evaluate this final model on the test dataset to get an estimate of the generalization error.
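A sketch of this split with scikit-learn is shown below, assuming `X` holds the vectorized articles and `y` the one-hot class labels; the 80/20 test split matches the description above, while the validation fraction is an assumption since it is not stated.

```python
from sklearn.model_selection import train_test_split

# X (vectorized articles) and y (one-hot class labels) are assumed to exist.
# Hold out 20 percent of the data for final testing.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Carve a holdout validation set out of the full training data
# (the exact fraction used in the original work is not stated).
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.20, random_state=42)
```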

Table 5. Observations for Training, Validating, and Testing

Results

We ran six principal experiments, with several optimization experiments embedded within each principal experiment. We began with a baseline model and incrementally added more sophisticated classification techniques in each successive model run, examining the model performance metrics for each one. To evaluate model performance in each experiment, we focused on predictive accuracy and loss, where the loss is the categorical cross-entropy between the true and predicted class distributions.

Experiment 1

In experiment 1, we use a sequential MLP with three dense hidden layers and one dense output layer. Each hidden layer contained 64 units with relu activation, and the output layer contained 4 units with softmax activation. The model was compiled with categorical cross-entropy loss and the Adam optimizer while monitoring accuracy as the metric. The model was fitted with a maximum of 500 iterations using an early stopping checkpoint with a patience value of three. After 4 iterations, the model converged with a validation loss and validation accuracy of 0.376 and 0.881, respectively. The model's inference performance was evaluated against the reduced training data, the holdout validation data, and the test data. Thanks to the early stopping checkpoint, the model is not overfitting; this trend is shown in Figure 12, which plots the accuracy for each iteration and illustrates that the validation accuracy scores (blue) do not diverge from the training accuracy scores.
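A sketch of the Experiment 1 architecture in Keras, matching the layer sizes, loss, optimizer, and early-stopping patience described above; the input dimensionality, batch size, and monitored quantity are assumptions.

```python
from tensorflow.keras import layers, models, callbacks

# input_dim (the dimensionality of the document vectors) is an assumption.
model = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(4, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True)
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=500, batch_size=128, callbacks=[early_stop])
```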

Figure 12. Model 1 Training and Validation Accuracy Score for Each Iteration

Figure 13 shows the confusion matrix and error rates associated with the training data. The model does the best classifying sports (99%) and science/technology (96%), but did the worst at predicting business (92%). By adding additional layers, we may be able to further increase accuracy.

Figure 13. Model 1 Training Confusion Matrix (left panel) Error Rates (right panel). Label Dictionary: 0, World; 1, Sports; 2, Business; 3, Science/Technology

Experiment 2

In experiment 2, we use a sequential one-dimensional CNN with three hidden layers and one dense output layer. Each hidden layer contained 256 filters with relu activation, and the output layer contained 4 units with softmax activation. The model was compiled with categorical cross-entropy loss and the Adam optimizer while monitoring accuracy as the metric. The model was fitted with a maximum of 500 iterations using an early stopping checkpoint with a patience value of three. After 5 iterations, the model converged with a validation loss and validation accuracy of 0.327 and 0.888, respectively. The model's inference performance was evaluated against the reduced training data, the holdout validation data, and the test data. Thanks to the early stopping checkpoint, the model is not overfitting; this trend is shown in Figure 14, which plots the accuracy for each iteration and illustrates that the validation accuracy scores (blue) do not diverge from the training accuracy scores.
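A corresponding sketch for the one-dimensional CNN; the embedding layer, sequence length, kernel size, and pooling choice are assumptions, since the article specifies only the filter counts and activations.

```python
from tensorflow.keras import layers, models

# vocab_size, embed_dim, and max_len are assumed tokenizer/padding settings.
model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim, input_length=max_len),
    layers.Conv1D(256, kernel_size=5, activation="relu"),
    layers.Conv1D(256, kernel_size=5, activation="relu"),
    layers.Conv1D(256, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(4, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```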

Figure 14. Model 2 Training and Validation Accuracy Score for Each Iteration

Figure 15 shows the confusion matrix and error rates associated with the training data. The model does the best classifying sports (99%) and science/technology (92%), but did the worst at predicting business (88%). By adding additional layers, we may be able to further increase accuracy.

Figure 15. Model 2 Training Confusion Matrix (left panel) Error Rates (right panel). Label Dictionary: 0, World; 1, Sports; 2, Business; 3, Science/Technology

Experiment 3

In experiment 3, we use a sequential recurrent neural network and one dense output layer. Recurrent neural networks (RNN) are a class of neural networks that are powerful for modeling sequence data such as time series or natural language. The RNN layer uses a for loop to iterate over the timesteps of a sequence, while maintaining an internal state that encodes information about the timesteps it has seen so far. The RNN layer contained 256 units, and the output layer contained 4 units with softmax activation. The model was compiled with categorical cross-entropy loss and the Adam optimizer while monitoring accuracy as the metric. The model was fitted with a maximum of 500 iterations using an early stopping checkpoint with a patience value of three. After 4 iterations, the model converged with a validation loss and validation accuracy of 0.353 and 0.881, respectively. The model's inference performance was evaluated against the reduced training data, the holdout validation data, and the test data. Thanks to the early stopping checkpoint, the model is not overfitting; this trend is shown in Figure 16, which plots the accuracy for each iteration and illustrates that the validation accuracy scores (blue) do not diverge from the training accuracy scores.
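A sketch of the recurrent family used in Experiments 3 through 6: swapping the single recurrent layer produces the later variants, while the rest of the pipeline stays the same. The embedding settings are assumptions.

```python
from tensorflow.keras import layers, models

# Experiment 3 uses a simple RNN layer; Experiments 4-6 would swap in
# layers.LSTM(256), layers.GRU(256), or layers.Bidirectional(layers.LSTM(256)).
# vocab_size, embed_dim, and max_len are assumed tokenizer/padding settings.
model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim, input_length=max_len),
    layers.SimpleRNN(256),
    layers.Dense(4, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```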

Figure 16. Model 3 Training and Validation Accuracy Score for Each Iteration

Figure 17 shows the confusion matrix and error rates associated with the training data. The model does the best classifying sports (98%) and science/technology (92%), but did the worst at predicting business (82%). By adding additional layers, we may be able to further increase accuracy.

Figure 17. Model 3 Training Confusion Matrix (left panel) Error Rates (right panel). Label Dictionary: 0, World; 1, Sports; 2, Business; 3, Science/Technology

Experiment 4

In experiment 4, we use a sequential long short-term memory (LSTM) network with one dense output layer. LSTMs have an edge over conventional feed-forward neural networks and RNNs because of their ability to selectively remember patterns over long durations. The LSTM layer contained 256 units, and the output layer contained 4 units with softmax activation. The model was compiled with categorical cross-entropy loss and the Adam optimizer while monitoring accuracy as the metric. The model was fitted with a maximum of 500 iterations using an early stopping checkpoint with a patience value of three. After 5 iterations, the model converged with a validation loss and validation accuracy of 0.319 and 0.892, respectively. The model's inference performance was evaluated against the reduced training data, the holdout validation data, and the test data. Thanks to the early stopping checkpoint, the model is not overfitting; this trend is shown in Figure 18, which plots the accuracy for each iteration and illustrates that the validation accuracy scores (blue) do not diverge from the training accuracy scores.

Figure 18. Model 4 Training and Validation Accuracy Score for Each Iteration

Figure 19 shows the confusion matrix and error rates associated with the training data. The model does the best classifying sports (98%) and science/technology (90%), but did the worst at predicting business (88%). By adding additional layers, we may be able to further increase accuracy.

Figure 19. Model 4 Training Confusion Matrix (left panel) Error Rates (right panel). Label Dictionary: 0, World; 1, Sports; 2, Business; 3, Science/Technology

Experiment 5

In experiment 5, we use a sequential gated recurrent unit (GRU) network and one dense output layer. GRUs are an improved version of the standard recurrent neural network designed to mitigate the vanishing gradient problem of standard RNNs. The GRU layer contained 256 units, and the output layer contained 4 units with softmax activation. The model was compiled with categorical cross-entropy loss and the Adam optimizer while monitoring accuracy as the metric. The model was fitted with a maximum of 500 iterations using an early stopping checkpoint with a patience value of three. After 4 iterations, the model converged with a validation loss and validation accuracy of 0.323 and 0.890, respectively. The model's inference performance was evaluated against the reduced training data, the holdout validation data, and the test data. Thanks to the early stopping checkpoint, the model is not overfitting; this trend is shown in Figure 20, which plots the accuracy for each iteration and illustrates that the validation accuracy scores (blue) do not diverge from the training accuracy scores.

Figure 20. Model 5 Training and Validation Accuracy Score for Each Iteration

Figure 21 shows the confusion matrix and error rates associated with the training data. The model does the best classifying sports (98%) and science/technology (90%), but did the worst at predicting business (85%). By adding additional layers, we may be able to further increase accuracy.

Figure 21. Model 5 Training Confusion Matrix (left panel) Error Rates (right panel). Label Dictionary: 0, World; 1, Sports; 2, Business; 3, Science/Technology

Experiment 6

In experiment 6, we use a sequential bidirectional LSTM and one dense output layer. The biLSTM layer contained 256 units, and the output layer contained 4 units with softmax activation. The model was compiled with categorical cross-entropy loss and the Adam optimizer while monitoring accuracy as the metric. The model was fitted with a maximum of 500 iterations using an early stopping checkpoint with a patience value of three. After 6 iterations, the model converged with a validation loss and validation accuracy of 0.325 and 0.890, respectively. The model's inference performance was evaluated against the reduced training data, the holdout validation data, and the test data. Thanks to the early stopping checkpoint, the model is not overfitting; this trend is shown in Figure 22, which plots the accuracy for each iteration and illustrates that the validation accuracy scores (blue) do not diverge from the training accuracy scores.

Figure 22. Model 6 Training and Validation Accuracy Score for Each Iteration

Figure 23 shows the confusion matrix and error rates associated with the training data. The model does the best classifying sports (97%) and science/technology (93%), but did the worst at predicting business (87%). By adding additional layers, we may be able to further increase accuracy.

Figure 23. Model 6 Training Confusion Matrix (left panel) Error Rates (right panel). Label Dictionary: 0, World; 1, Sports; 2, Business; 3, Science/Technology

Table 6 shows the overall experimental results for each model that was run previously.

Table 6. Overall Model Results

Conclusions

Based on the above analysis, we choose the model from Experiment 1 as the final model we recommend for implementation. By utilizing a dense neural network in combination with adequate regularization techniques, the model will require less time to retrain, which will reduce the working hours and project capital resources expended in the future. The model will also need to be reevaluated after a period of time in conjunction with the implementation team.

References

Ayo, Femi Emmanuel, Olusegun Folorunso, Friday Thomas Ibharalu, Idowu Ademola Osinuga, and Adebayo Abayomi-Alli. 2021. “A probabilistic clustering model for hate speech classification in twitter.” Expert Systems with Applications 173: 114762.

Bíró, István, Jácint Szabó, and András A Benczúr. 2008. “Latent dirichlet allocation in web spam filtering.” Proceedings of the 4th international workshop on Adversarial information retrieval on the web.

Brownlee, Jason. 2017. “A Gentle Introduction to the Bag-of-Words Model.” Accessed May 13, 2021. https://machinelearningmastery.com/gentle-introduction-bag-words-model/.

Filliat, David. 2007. “A visual bag of words method for interactive qualitative localization and mapping.” Proceedings 2007 IEEE International Conference on Robotics and Automation.

Khan, Aurangzeb, Baharum Baharudin, Lam Hong Lee, and Khairullah Khan. 2010. “A review of machine learning algorithms for text-documents classification.” Journal of advances in information technology 1 (1): 4–20.

Kim, Donghwa, Deokseong Seo, Suhyoun Cho, and Pilsung Kang. 2019. “Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec.” Information Sciences 477: 15–29.

Likas, Aristidis, Nikos Vlassis, and Jakob J Verbeek. 2003. “The global k-means clustering algorithm.” Pattern recognition 36 (2): 451–461.

Suppiah, Haresh T, Richard Swinbourne, Jericho Wee, Vanes Tay, and Paul Gastin. 2021. “Sleep Characteristics of Elite Youth Athletes: A Clustering Approach to Optimize Sleep Support Strategies.” International Journal of Sports Physiology and Performance 1 (aop): 1–9.

Wang, Di, Marcus Thint, and Ahmad Al-Rubaie. 2012. “Semi-supervised latent Dirichlet allocation and its application for document classification.” 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology.

Yun-tao, Zhang, Gong Ling, and Wang Yong-cheng. 2005. “An improved TF-IDF approach for text classification.” Journal of Zhejiang University-Science A 6 (1): 49–55.

Zhang, Xiang, Junbo Zhao, and Yann LeCun. 2015. “Character-level convolutional networks for text classification.” arXiv preprint arXiv:1509.01626.
