Pages

Wednesday 18 June 2014

Frequency Distribution in Tatoeba's Corpus



So, I was dabbling with the corpus for SentencePicker's sake (check the last post).

I decided to write a script that plots the frequency distribution of languages in the Tatoeba corpus.

So basically you give it the sentences.csv file (containing all sentences on Tatoeba, in every language), a stopword list if you have one, and the language. That's it.

It plots and prints out data for the language.

It basically prints how many times each word appears in the corpus. A lot can be learned from that. I will upload the code once I have polished it enough for general use. And yes, it's pep8 compliant.
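The core of the counting step can be sketched like this. The column layout of sentences.csv and the crude tokenization are my simplifications here, not the actual script:

```python
from collections import Counter

def frequency_distribution(rows, lang, stopwords=frozenset()):
    """Count word frequencies for one language.

    `rows` yields (sentence_id, lang, text) tuples, roughly the layout
    of Tatoeba's sentences.csv export.
    """
    counts = Counter()
    for _, row_lang, text in rows:
        if row_lang != lang:
            continue
        for word in text.lower().split():
            word = word.strip('.,!?"\'')  # crude punctuation stripping
            if word and word not in stopwords:
                counts[word] += 1
    return counts

rows = [("1", "eng", "Tom is here."),
        ("2", "eng", "Tom saw Mary."),
        ("3", "fra", "Tom est ici.")]
freq = frequency_distribution(rows, "eng", stopwords={"is"})
print(freq.most_common(3))  # 'tom' leads with 2
```

Sorting the resulting counts by frequency gives exactly the ranked curve plotted below.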

This is how it looks :

With stopwords removed.
Without removing the stopwords, top 500

Observations :
Tom is the most used word in the English corpus of Tatoeba after removing the stopwords. It is probably the subject of a lot of example sentences. It's like a convention to use Tom for males and Mary for females. Also, Tom is way more popular than Mary.

More importantly, most words after Tom are below the 10k mark.

The second diagram shows the distribution of words without any filtering; no stopwords have been removed. Now words like the, is, are, and a are on top. We could probably create an automated stopword list generator from such graphs. If you look closely, the initial words in the second diagram fall in the range of 140,000 down to 20,000 occurrences. These can essentially be assumed to be the stopwords, because as we saw in the first diagram (with stopwords removed), the remaining words sit below the 10,000 frequency mark.

We just need to pick out the words whose frequency falls between the top and the 20k mark.

But each language will have a different lower limit, and the problem lies in determining that limit.

An algorithm that observes the change in slope of the curve, and the differences in frequency between sequentially ranked words, could extract the stopwords of any language given a strong enough corpus.
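Something like this is what I have in mind. A minimal sketch, assuming we cut at the single steepest relative drop between consecutive ranks; the heuristic and the sample numbers are assumptions for illustration, not a finished algorithm:

```python
def stopword_candidates(freq, max_rank=200):
    """freq: dict of word -> count. Return the words ranked above the
    steepest relative fall-off in the frequency curve."""
    ranked = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:max_rank]
    if len(ranked) < 2:
        return []
    # relative drop between rank i and rank i+1
    drops = [(ranked[i][1] - ranked[i + 1][1]) / ranked[i][1]
             for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1  # cut just after the biggest cliff
    return [word for word, _ in ranked[:cut]]

# Toy counts mimicking the shape of the second diagram:
freq = {"the": 140000, "is": 120000, "a": 110000,
        "tom": 9000, "house": 4000, "tree": 2000}
print(stopword_candidates(freq))  # ['the', 'is', 'a']
```

A real version would probably need smoothing, since noisy corpora have more than one cliff.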

This would probably be a good contribution to the interwebz, because I recently downloaded a zip of stopword lists for "all" languages and it covered only about 30. Tatoeba, on the other hand, has 130+ languages. Not all of them have a strong corpus yet, but one of these years they all will.

It's a pretty exciting problem, at least for me, though it may already have been solved by someone.

Will code it one of these days, stay tuned :)

P.S. Not proofreading, because too sleepy. Also, I am sure a lot of other analysis could be done with the frequency distributions. I am just not in the mood to.


Monday 16 June 2014

Sentence Picker : Picking Usable Context Free Sentences From Open Texts

Well, the title describes it all. It's part of my Google Summer of Code project for Tatoeba.

It's the heart of it.

Lately, the code was getting out of hand with many tweaks here and there. It was getting less and less systematic. Anyway, I needed to write everything down to keep track.

Beware, it's a long post. A lot of it piled up. Later additions will hopefully be smaller.

SentencePicker
What does it do?
Given sentences (previously tokenized using the Punkt sentence tokenizer, with minor tweaks), it scores and picks the ones that could be good for Tatoeba.

I am only working on English at this moment; I think I have already put too much effort into English alone. I should probably cover other languages that use the same script, as many rules are not English-specific but rather Latin-script-specific.

I just need to write down somewhere the checks that I am doing, because I have a feeling the code is getting out of my hands and untidy.

So, initially I take the text as input and pass it to the sentencesplitter module which I created earlier. It is working pretty well.

The sentences that come out are full of newlines, because the Gutenberg text that I am using is hard-wrapped. Argh. I needed to normalize those sentences.

Simple: I just need to replace the '\n's with ' ' and the work is done. But after trying for half an hour, nothing was moving. Why wasn't a simple replace call working? It turned out the Gutenberg Project created all its text files on Windows, so the newline sequence was \r\n and not \n. Lesson learned. I hate normalization.
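The fix, in miniature (assuming `raw` holds one wrapped sentence from a Gutenberg file):

```python
raw = "It was the best\r\nof times."

# Gutenberg files use Windows line endings, so replace \r\n first;
# chaining a plain \n replace also covers Unix-style files.
normalized = raw.replace("\r\n", " ").replace("\n", " ")
print(normalized)  # "It was the best of times."
```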

Now, not all the sentences that come out are perfect, because Punkt can only handle so much. But the results are more or less good. About 900 sentences extracted.

Now most of them are not useful for the corpus.

The first check was sentence length.

I passed the sentences through the wordtokenizers that I had created earlier on.
I removed all sentences shorter than length 4 or longer than length 12. Well, I got around 117 sentences. I had forgotten to make the comparisons inclusive, and was missing out on sentences which were exactly of length 4 or 12. Correcting that added around 30 more sentences.
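The length check, with the inclusive bounds I should have had from the start (`tokenize` here is a stand-in for my word tokenizer):

```python
def tokenize(sentence):
    return sentence.split()  # placeholder for the real word tokenizer

def length_filter(sentences, lo=4, hi=12):
    # Note the <= on both sides: length-4 and length-12 sentences pass.
    return [s for s in sentences if lo <= len(tokenize(s)) <= hi]

sents = ["No.",
         "I am four words.",
         "This sentence runs on and on and on and on and on and on and on."]
print(length_filter(sents))  # only the four-word sentence passes
```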

147/900

I went back over the results. Many sentences were junk and pointless. Also, I was losing out on 750 sentences. Not good efficiency at all.

Now, we are working on fiction, so it contains a lot of dialogue. A majority of the discarded 750 sentences were longer than 12 words, but they contained dialogues which were shorter than 12 words and useful for the corpus.

A simple regex was used to get everything between double quotes. The regex had to be lazy in case there were multiple dialogues in one sentence.

"(.*?)"

The dot matches any character.
The asterisk means 0 or more repetitions.
The question mark after the asterisk makes it lazy (non-greedy), so it matches as little as possible.

http://www.regular-expressions.info/repeat.html
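Here is the difference the lazy quantifier makes on a sentence with two dialogues:

```python
import re

text = 'He said, "Come in." She replied, "Not yet." And left.'

# Greedy: swallows everything between the first and last quote.
greedy = re.findall(r'"(.*)"', text)
# Lazy: stops at the nearest closing quote, so each dialogue is separate.
lazy = re.findall(r'"(.*?)"', text)

print(greedy)  # ['Come in." She replied, "Not yet.']
print(lazy)    # ['Come in.', 'Not yet.']
```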

Anyway, I added the dialogues extracted from the sentences longer than 12 words.

This added a lot of good sentences as well as bad ones, because Punkt screwed up a lot of dialogues. Hence not all the extracted dialogues made sense, or were actually part of a fully formed dialogue in the original text.

I started ignoring sentences which didn't start with an uppercase character. This helped a lot, as all those partial sentences were now skipped.

Next problem: some sentences were too archaic or too contextual. I only wanted to pick sentences which made sense out of context too.

I had a list of the top 2000 words of modern English fiction, from Wikipedia. So I tokenized each sentence into words and removed all the stopwords using set subtraction (pretty convenient; also, `not in` is faster on sets than on lists).

Later on I took the remaining words and subtracted the top-2000 word set from them too. I just love the set subtraction hack. I am not really sure it is the most efficient approach overall, but sets are faster than lists for membership checks because they are hash tables. Anyway, after the subtraction, if all the words in my sentence were also in the set of popular words, the resulting set would be empty, and according to the algorithm the sentence was relevant and free from context.
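The whole check, sketched; `COMMON` stands in for the top-2000 list, and `slack` is the number of off-list words tolerated (0 for the strict version):

```python
# Tiny stand-in for the top-2000 wordlist.
COMMON = {"the", "cat", "sat", "on", "mat", "dog", "ran"}

def is_context_free(sentence, common=COMMON, slack=0):
    """True if at most `slack` words of the sentence are off-list."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    leftover = words - common  # set subtraction: cheap membership checks
    return len(leftover) <= slack

print(is_context_free("The cat sat on the mat."))               # True
print(is_context_free("The cat sat on the ziggurat."))          # False
print(is_context_free("The cat sat on the ziggurat.", slack=1))  # True
```

The `slack` parameter is exactly the "allow one word not from the list" relaxation described below.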

This check worked wonderfully, but I ended up with only 50 sentences out of the 900. Just over 5% is very low.

I tried making the algorithm less stringent and allowed one word not from the list. This gave me 99 sentences, but yes, I ended up losing a little quality.

Of all the features discussed so far, most can be used for other languages, except that I would need top-2000 (or similar) wordlists. Some tweaks would also differ, like the sentence-length limits.

Also, contractions like they've get tokenized as the single token they've. And obviously, they've isn't on the top-2000 list, so a good sentence gets thrown out. I need to do something about this.

Possible solutions: I could add these contractions to the list. But I don't want to edit the lists manually, because that goes against the automated nature of the script and would require such list curation for each language. I want my script to be more or less language-agnostic.

I could use edit distance while matching and parametrize on the Levenshtein distance.

THEY and THEYVE (after removing the punctuation) would differ by two insertions, hence a distance of 2. But this would take time.
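A plain dynamic-programming Levenshtein, to illustrate; in practice a library implementation would be faster:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("they", "theyve"))  # 2
```

Matching a sentence word against the wordlist would then mean accepting any list word within distance 2 or so, which is exactly where the extra time goes.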

Let's see.

The other thing is, I could save a lot on comparisons if I applied a morphological stemmer to the words first and then compared. But again, I don't want to go in that direction, as not all languages have strong morphological stemmers.

Maybe I could use the existing Tatoeba corpus to make the wordlist. But then this would end up being a circle jerk, and I wouldn't be going after vocabulary which is missing from the corpus but should be there.

I'll update later on. This has been a long post.

 


Thursday 12 June 2014

Hello Coding Conventions. Using Pylint and pep8.

I feel like I lost my innocence. After ignoring my mentor's demand to make my code pep8 compliant for days, I finally buckled down.

I installed SublimeLinter, which acts as an interface to actual linters for different programming languages.

Note : If you are using pylint from sublime for the first time, you need to have pylint installed on your machine.

pip install pylint

Only then will the sublimelinter-pylint interface work (duh).

Or else you would get this,
pylint deactivated, cannot locate 'pylint@python'


What exactly is Pylint?
Pylint is a Python tool that checks a module for coding standards. According to the TurboGears project coding guidelines, PEP8 is the standard and pylint is a good mechanical test to help us in attaining that goal.
The range of checks run from Python errors, missing docstrings, unused imports, unintended redefinition of built-ins, to bad naming and more.

Yes. Bad naming. And what is demotivating is that it gives you a score. So I quickly ran pylint on one of my scripts.

Here are the results.

Yes, I was given a negative score: -1.05/10.

Most of my variables have invalid names. What exactly is an invalid name?
Well, most of my variables were outside any function or class (it was a quickly written script) and hence global, and by convention global variables should be uppercase.

So I am supposed to wrap everything up in functions, which is all right because it makes the code reusable. But not today.
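For reference, what pylint objects to (and the eventual fix) looks roughly like this:

```python
# Before: a bare module-level `count = 0` gets flagged with C0103
# (invalid-name), because pylint expects module-level names to be
# UPPER_CASE constants.

MAX_LENGTH = 12  # module-level constant: UPPER_CASE is fine

def main():
    count = 0  # local variable inside a function: lowercase is fine
    return count

if __name__ == "__main__":
    main()
```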

There are other things I never really noticed before, like trailing whitespace and spaces after commas. This is going to be a rough ride.

Let's see if my scores get better.

Wednesday 11 June 2014

Good Night World

Sorry to disappoint: not the clichéd hello world!

Anyway, I started this blog to keep track of the things I am coding and learning through Google Summer of Code.

I am pretty late I know, but will clear off the backlog.

Right now I am going to sleep; I've been coding through the night and it's 6:42 AM.


Why this name, hashcomment? Because I am too sleepy right now to think of something else. This is not my creative writing blog; go there for that. I already spend a lot of time thinking about variable names, ever since the day I was ridiculed on IRC about mine.

What is this, Java?!

Good Night!