# Difference between revisions of "Bag of Tricks for Efficient Text Classification"

Line 12: | Line 12: | ||

− | Each N represents a seperate N-th gram features in the sentence. This feature will be explained in a coming section. | + | Each <math> N </math> represents a seperate <math> N-th </math> gram features in the sentence. This feature will be explained in a coming section. |

=== Softmax and Hierarchy Softmax === | === Softmax and Hierarchy Softmax === | ||

Line 19: | Line 19: | ||

*PLACE HOLDER for log-likelihood error function found in article* | *PLACE HOLDER for log-likelihood error function found in article* | ||

− | In this formula. A and B are weight matrix which will be calculated in the training set. <math> X_n </math> is the normalizefeature of the n-th documentation. | + | In this formula. A and B are weight matrix which will be calculated in the training set. <math> X_n </math> is the normalizefeature of the <math> n-th </math> documentation. <math> Y_n </math> is the label. |

## Revision as of 22:27, 5 March 2018

- WORK IN PROGRESS*

## Contents

## Introduction

Neural network have been utilized more recently for Text-Classifications and demosntated very good performances. However, it is slow at both training and testing time, therefore limiting their usage for very large dataset. The authors suggests that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performances. The basis of the analysis for this paper were the approach of fastText on the two tasks: tag predictions, and sentiment analysis. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”

## Model

### Model Architecture of fastText

- PLACE HOLDER FOR IMAGE FROM ARTICLE*

Linear classifier is limited by its inability to share parameters among features and classes. As a result, classes with very few examples (low frequency) will often get classified in a large output field. The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation.

Each [math] N [/math] represents a seperate [math] N-th [/math] gram features in the sentence. This feature will be explained in a coming section.

### Softmax and Hierarchy Softmax

Softmax function *f* is used to compute the probability density over the predefined classes. The softmax output layer with log-likelihood is given in the article as:

- PLACE HOLDER for log-likelihood error function found in article*

In this formula. A and B are weight matrix which will be calculated in the training set. [math] X_n [/math] is the normalizefeature of the [math] n-th [/math] documentation. [math] Y_n [/math] is the label.

Remark: Negatively log-likelihood is a multiclass cross-entropy. What this means is that for a binary problem (dog or not dog), it will output two values between [0,1] where the sum of the two values equates to 1. (Dog = 0.6, Cat = 0.4). This can further be expanded into larger dimensions. In contrast, sigmoid outputs one value and in the binary case, the other value can be derived via 1 - p.

Softmax will have a complexity of O(kh) where k is the number of classes and h is the number of dimensions of text representation. The function that the authors used for their model was a variation of the softmax function, known as **Hiearchy Softmax**. The hiearchy softmax is based on the Huffman Coding Tree and will reduce complexity to O(H*log2(k)).

### N-gram and Bag of Tricks

**Bag of Word Model** is a model for simplifying text representation by storing a set of words and their frequency count in a document. **Bag of word is invariant to word order (single word and dictionary based)**. An example of the model can be found in this wikipedia page [ https://en.wikipedia.org/wiki/Bag-of-words_model].

(1) John likes to watch movies. => BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1} (2) Mary likes movies too. => BoW2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

If a document contains a union of the (1), (2), and (3) then (3) John likes to watch football game {"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,"also":1,"football":1,"games":1};

The problem with the bag of word model is that it requires an extensive dictionary of words on file and would take a long time to search through. Additionally, **bag of word losses information due to it being single word and invariant to order.** Lastly, it will fail if the training set does not include the entire dictionary of the testing set.

**N Gram Model (Word Based)** is a model for simplifying text representation by storing n local words adjacent to the initial word. Compared to bag of words, any N over 1 (noted as Unigram) will contain more information than bag of words. An example of how N gram stores and identifies order will be demonstrated in a later section. An example of a document being stores in both Bag of Words, Unigram (N = 1) and Bigram (N = 2) can be found below:

- PLACE HOLDER FOR PICTURE*

Source: https://qph.fs.quoracdn.net/main-qimg-c47060d2f02439a44795e2fbcf2ca347-c

In the article, N gram model was used instead of Bag of Word because the authors wanted to capture the information about the local order.

### Feature Hashing

The authors utilized a feature hashing to map N-gram more efficiently. (SOURCE https://en.wikipedia.org/wiki/Feature_hashing) Feature hashing is a way to vectorize n-gram of a document. It is an effective tool when dealing with n-gram of higher dimension spaces. The algorithm creates a **hash table**, which is a special data structure that contains a hash function and a matrix. The hash function will map the appropriate n-gram into the matrix. An example of this is found in the below example

A = “I love apple”

B = “apple love I”

C = “I love sentence”

A | B | C | |

I | 1 | 1 | 1 |

love | 1 | 1 | 0 |

apple | 1 | 1 | 0 |

sentence | 0 | 0 | 1 |

Notice how A and B are the same vector. This is just like bag of word and the aforementioned problem of **order does not matter!**

A | B | C | |

I love | 1 | 0 | 1 |

love apple | 1 | 0 | 0 |

apple love | 0 | 1 | 0 |

love i | 0 | 1 | 0 |

love sentence | 0 | 0 | 1 |

Notice now, A and B are unique because bigram takes into consideration one space of local words. However, A and C also have similar elements, being I love. IF we were to further increase N in N-gram we will have an easier time in classifying the distinction between the two. Higher, the consequences of operating in higher dimension of N gram is that the run time will increase.

The author utilized this hashing trick from Mikolov et al. (2011) and 10M bins of only used bigrams, 100M otherwise. This suggests that there are a total combination of sqrt(10 mil) different words in the data base from the training set.