
Saturday 5 July 2014

Why Clipping Of Frequency Distribution Wouldn't Help With Stopword Generation [Part 2]


This is in continuation of the last post, where I conducted an experiment to find a pattern in the frequencies of stopwords.

I had earlier assumed that stopwords would be the most frequent words, or at least that a large number of them would be among the most frequent words. This assumption was evidently proved wrong in the last post. But I realized I hadn't done a thorough job, and I should have enough data to completely discard the assumption.

What if the cutoff limit is set even lower? What percentage of stopwords would be covered then? I decided to plot again.

Btw, I am using Tatoeba's English sentences corpus for the experiment.
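
For reference, here is a minimal sketch of this kind of setup. It assumes the Tatoeba export is a tab-separated file named eng_sentences.tsv with (id, language, text) columns and that simple lowercase word tokenization is good enough; the actual script behind the plots may differ.

```python
import csv
from collections import Counter

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # needs the "punkt" tokenizer data

# NLTK's English stopword list (the version used in this post has 127 entries).
STOPWORDS = set(stopwords.words("english"))

# Count lowercase alphabetic tokens over the whole corpus.
freq = Counter()
with open("eng_sentences.tsv", encoding="utf-8") as f:
    for _, _, text in csv.reader(f, delimiter="\t"):  # assumed 3-column export
        freq.update(w.lower() for w in word_tokenize(text) if w.isalpha())
```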


The cyan line represents the percentage of stopwords acquired by clipping at a particular frequency. For example, if you look at the plot at a frequency of 20000, it tells you that if you set the threshold at 20k and take all words with frequencies above 20k, you get around 13% of the total stopwords listed by NLTK.
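
Computing a point on that cyan line is straightforward; a small helper along these lines (the names are mine, reusing freq and STOPWORDS from the sketch above) would do it:

```python
def stopword_coverage(freq, stopword_set, threshold):
    """Fraction of the stopword list whose corpus frequency is above the threshold."""
    covered = sum(1 for w in stopword_set if freq[w] > threshold)
    return covered / len(stopword_set)

# Around 0.13 on the corpus used here, according to the plot.
print(stopword_coverage(freq, STOPWORDS, 20000))
```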

Naturally, if the cutoff is set to zero, you are taking all words with frequency above 0, and hence you end up getting all the stopwords.

Now, to get even 50% of the stopwords, the cutoff has to be set at 3540, which is far lower than the thresholds I used in the last experiment (20k and 10k). In other words, about 50% of the stopwords have frequencies below 3540.
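
One way to find that 50% cutoff is to sort the stopword frequencies and take the frequency of the middle one. A rough sketch, again reusing the names from above:

```python
# Sort stopword frequencies from highest to lowest; the cutoff that keeps
# half of the list is the frequency of the stopword sitting at the halfway point.
stopword_freqs = sorted((freq[w] for w in STOPWORDS), reverse=True)
half = len(STOPWORDS) // 2
cutoff_50 = stopword_freqs[half - 1]  # frequency of the 63rd most frequent stopword

print(cutoff_50)  # the post reports roughly 3540 for this corpus
```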

Interesting part:
Now, if we set the threshold at 3540, we end up getting 50% of the stopwords, but what about the number of non-stopwords that also make the cut? That is 45. Now, 45 seems small compared to the total number of words, which is more than 40k. But 45 is considerable noise in the stopword list we are generating: NLTK has a total of 127 stopwords, and 45 wrong entries is a lot of noise next to only 63 correctly selected stopwords.
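
Counting that noise is just a matter of looking at everything that crosses the same cutoff and checking membership in the NLTK list. Another rough sketch, with the numbers from the post quoted in the comments rather than guaranteed to reproduce exactly:

```python
# Everything at or above the cutoff is a candidate "generated stopword".
selected = [w for w, c in freq.items() if c >= cutoff_50]

noise = [w for w in selected if w not in STOPWORDS]  # false positives
hits = [w for w in selected if w in STOPWORDS]       # real stopwords caught

print(len(noise), len(hits))  # the post reports 45 and 63 respectively
```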


Also, as before, the purple dots are the frequencies of all words, and the red dots are the frequencies of the stopwords.
