
Wednesday 18 June 2014

Frequency Distribution in Tatoeba's Corpus



So, I was dabbling with the corpus for SentencePicker's sake (check the last post).

I decided to write a script for plotting the word frequency distribution of any language in the Tatoeba corpus.

So basically you give it the sentences.csv file (containing the sentences of all languages on Tatoeba), a stopword list if you have one, and the language. That's it.

It plots and prints out data for the language.

It basically prints how many times each word appears in the corpus. A lot can be learned from that. I will upload the code once I have polished it enough for general use. And yes, it is PEP 8 compliant.
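The original script is not published here, so the following is only a minimal sketch of the counting step, not the actual code. It assumes sentences.csv is tab-separated with the columns sentence id, language code and text; the function names `read_rows` and `word_frequencies` are made up for illustration, and the tokenizer is a plain regular expression rather than anything language-aware.

```python
import csv
import re
from collections import Counter


def read_rows(path):
    """Yield rows from a tab-separated file (layout is an assumption)."""
    with open(path, encoding="utf-8", newline="") as handle:
        yield from csv.reader(handle, delimiter="\t",
                              quoting=csv.QUOTE_NONE)


def word_frequencies(rows, lang, stopwords=frozenset()):
    """Count how often each word occurs in the sentences of one language.

    rows is an iterable of (sentence_id, lang_code, text) sequences.
    """
    counts = Counter()
    for row in rows:
        # Skip malformed rows and sentences in other languages.
        if len(row) < 3 or row[1] != lang:
            continue
        # Naive tokenization: lowercase, then runs of word characters.
        for word in re.findall(r"\w+", row[2].lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts
```

Something like `word_frequencies(read_rows("sentences.csv"), "eng").most_common(500)` would then give the top 500 words to print or plot.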

This is how it looks:

With stopwords removed.
Without removing the stopwords, top 500.

Observations:
Tom is the most used word in the English corpus of Tatoeba after removing the stopwords. It is probably used in a lot of example sentences. It is like a convention to use Tom for males and Mary for females. Also, Tom is way more popular than Mary.

More importantly, most words after Tom are below the 10k mark.

The second diagram shows the distribution of words without any filtering. No stopwords have been removed. Now words like "the", "is", "are" and "a" are on top. We can probably create an automated stopword list generator from such graphs. If you look closely, the first words in the second diagram have frequencies in the range 140,000 to 20,000. These can essentially be assumed to be the stopwords because, as we saw in the first diagram with the stopwords removed, the remaining words are below the 10,000 frequency mark.

We just need to pick out the words whose frequency lies between the top and 20k.
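As a rough sketch, picking by a fixed lower limit could look like this. The function name `words_above` is hypothetical, and the default of 20000 is just the figure read off the English graph above.

```python
def words_above(counts, lower_limit=20000):
    """Return words whose count is at least lower_limit, most frequent first.

    counts maps each word to its frequency in the corpus.
    """
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [word for word, count in ranked if count >= lower_limit]
```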

But each language will have a different lower limit, and the problem lies in determining that limit.

An algorithm that observes the change in slope of the curve, and the differences in frequency between sequentially ranked words, could be used to get the stopwords of any language, given a strong enough corpus.
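The algorithm is not worked out in this post, so here is only one possible way to sketch it, using a simple knee-finding heuristic that I am substituting for illustration: take the top-ranked frequencies, draw a straight line from the first to the last point, and treat the rank where the curve dips farthest below that line as the place where the slope flattens out. The names `knee_index` and `guess_stopwords` and the `top_n=500` default are assumptions, and whether this cut gives a sensible stopword list on real Tatoeba data is untested.

```python
def knee_index(freqs):
    """Return the rank where a descending curve bends the most.

    freqs is a list of frequencies sorted from highest to lowest. The
    knee is taken as the point farthest below the straight line joining
    the first and last points.
    """
    n = len(freqs)
    if n < 3:
        return 0
    first, last = freqs[0], freqs[-1]
    best_index, best_gap = 0, 0.0
    for i, value in enumerate(freqs):
        # Height of the straight line at rank i.
        chord = first + (last - first) * i / (n - 1)
        gap = chord - value
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index


def guess_stopwords(counts, top_n=500):
    """Guess stopwords as the words ranked above the knee of the curve."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    ranked = ranked[:top_n]
    cut = knee_index([count for _word, count in ranked])
    return [word for word, _count in ranked[:cut]]
```

The appeal of a cut like this is that it needs no per-language threshold, which is exactly the part that a fixed 20k limit cannot handle.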

This will probably be a good contribution to the interwebz, because I recently downloaded a zip of stopword lists that was supposed to cover all languages, and it contained only about 30. Tatoeba, on the other hand, has 130+ languages. Not all of them have a strong corpus yet, but one of these years they will all be powerful.

It's a pretty exciting problem, at least for me, though it may already have been solved by someone.

Will code it one of these days, stay tuned :)

P.S. Not proofreading, because too sleepy. Also, I am sure a lot of other analysis could have been done from the frequency distributions. I am just not in the mood to.

