
Monday 16 June 2014

Sentence Picker: Picking Usable Context-Free Sentences From Open Texts

Well, the title describes it all. It's part of my Google Summer of Code project for Tatoeba.

It's the heart of it.

Lately, the code was getting a bit out of hand with many tweaks here and there. It was getting less and less systematic. Anyway, I needed to write everything down to keep track.

Beware, it's a long post. A lot has piled up. Later additions will hopefully be smaller.

SentencePicker
What does it do?
Given sentences (previously split using the Punkt Sentence Tokenizer with minor tweaks), it scores and picks sentences which could be good for Tatoeba.

I'm only working on English at the moment; I think I have already put too much effort into English alone. I should probably cover other languages with the same script, as many rules are not English-specific but rather Latin-script-specific.

I just need to write down somewhere the checks that I am doing, because I have a feeling that the code is getting out of hand and untidy.

So, initially I take the text as input and pass it to the sentence splitter module which I created earlier. It is working pretty well.
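For reference, a minimal sketch of what the splitting step boils down to, assuming NLTK's pretrained Punkt model (my sentence splitter module adds its own tweaks on top of this, so treat it as an approximation):

import nltk  # needs the 'punkt' data: nltk.download('punkt')

def split_sentences(text):
    # sent_tokenize uses NLTK's pretrained Punkt model under the hood.
    return nltk.sent_tokenize(text)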

The sentences that come out are full of newlines because the Gutenberg text that I am using is hard-wrapped. Argh. I needed to normalize those sentences.

Simple: replace the '\n's with ' ' and the work is done. But after trying for half an hour nothing was moving. Why wasn't a simple replace call working? Turned out Project Gutenberg created all its text files on Windows, and hence the newline sequence was \r\n and not \n. Lesson learned. I hate normalization.
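Here's roughly what the fix ended up looking like (a sketch, not the exact code):

def normalize_whitespace(sentence):
    # Gutenberg files use Windows line endings, so handle \r\n first,
    # then any stray \n, and finally collapse leftover double spaces.
    sentence = sentence.replace('\r\n', ' ').replace('\n', ' ')
    return ' '.join(sentence.split())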

Now, not all the sentences I got are perfect, because Punkt can only handle so much. But the results are more or less good. About 900 sentences extracted.

Now most of them are not useful for the corpus.

The first check was sentence length.

I passed the sentences through the word tokenizers that I had created earlier on.
I removed all sentences shorter than 4 words or longer than 12 words. Well, I got around 117 sentences. At first I forgot to make the checks inclusive, and was missing out on sentences which were exactly 4 or 12 words long. Correcting that added around 30 more sentences.
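The check itself is tiny; here's a sketch with the inclusive bounds (tokenize_words and sentences are stand-ins for my own word tokenizer and the splitter output):

def is_good_length(words, min_len=4, max_len=12):
    # Inclusive bounds: the earlier bug used strict < and >, which
    # silently dropped sentences of exactly 4 or 12 words.
    return min_len <= len(words) <= max_len

kept = [s for s in sentences if is_good_length(tokenize_words(s))]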

147/900

I reported back with the results. Many sentences were junk and pointless. Also, I was losing out on about 750 sentences. Not good efficiency at all.

Now, we were working on fiction, so it contains a lot of dialogue. A majority of the roughly 750 discarded sentences were longer than 12 words, and those sentences contained dialogues which were shorter than 12 words and useful for the corpus.

A simple regex was used to get all the sentences between double quotes. The regex had to be lazy in case there were multiple dialogues in a sentence.

"(.*?)"

The dot matches any character,
the asterisk repeats it zero or more times,
and the question mark after the asterisk makes the repetition lazy (match as little as possible).

http://www.regular-expressions.info/repeat.html
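In Python this looks something like the following (the example sentence is made up):

import re

DIALOGUE = re.compile(r'"(.*?)"')

def extract_dialogues(sentence):
    # The lazy .*? stops at the next quote, so multiple quoted spans in
    # one sentence come out as separate matches; a greedy .* would grab
    # everything from the first quote to the last.
    return DIALOGUE.findall(sentence)

extract_dialogues('"Come here," she said, "and sit down."')
# -> ['Come here,', 'and sit down.']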

Anyway, I added the dialogues extracted from the sentences which were longer than 12 words.

This added a lot of good sentences as well as bad ones, because Punkt screwed up a lot of dialogues. Hence not all dialogues made sense or were actually part of a fully formed dialogue in the original text.

I started ignoring sentences which didn't start with an uppercase character. This helped a lot, as all those partial sentences were now skipped.
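The check is nothing more than this (a sketch):

def starts_like_a_sentence(sentence):
    # Fragments from bad splits usually begin mid-sentence, i.e. with a
    # lowercase letter, so require an uppercase first character.
    return bool(sentence) and sentence[0].isupper()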

Next, many sentences were too archaic or too dependent on context. I only wanted to pick sentences which made sense out of context too.

I had a list of the top 2000 words from modern English fiction, from Wikipedia. So I tokenized each sentence into words and removed all the stopwords using set subtraction (pretty convenient; membership checks are also faster on sets than on lists).

Later on I took these reduced word sets and subtracted the top-2000-word set from them too. I just love the set subtraction hack. I am not really sure how efficient it is, but sets are faster than lists for membership checks because they are hash tables. Anyway, after the subtraction, if all the words in my sentence were also in the set of popular words, the resulting set would be empty, and according to the algorithm the sentence was relevant and free from context.

This check worked wonderfully, but I ended up with only 50 sentences out of the 900. Just over 5% is very low.

I tried to make the algorithm less stringent and allowed one word not from the list. This gave me 99 sentences, but yes, I ended up losing a little on quality.
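Both the strict and the relaxed variant are the same set-subtraction trick with a slack parameter. A minimal sketch, where top_words stands in for the top-2000 set (assumed lowercase, with the stopwords folded in):

def is_context_free(words, top_words, max_unknown=0):
    # Whatever is left after subtracting the popular-word set is the
    # "unknown" vocabulary; an empty leftover (or at most max_unknown
    # words, for the relaxed variant) means the sentence passes.
    leftover = {w.lower() for w in words} - top_words
    return len(leftover) <= max_unknown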

Of all the features discussed so far, most can be used for other languages, except that I would need top-2000 (or similar) word lists. Also, some tweaks would be different, like the sentence length bounds.

Also, contractions like they've get tokenized as the single token they've. And obviously, they've isn't on the top-2000 list, so a good sentence gets thrown out. I need to do something about this.

Possible solutions: I could add these contractions to the list. But I don't want to edit these lists manually, because that goes against the automated nature of the script and would require generating such lists for each language. I want my script to be more or less language agnostic.

I could use edit distance while matching and parametrize on the Levenshtein distance.

THEY and THEYVE (after removing the punctuation) would mean two insertions, hence a distance of 2. But this would take time.
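For reference, this is the textbook dynamic-programming version of the distance (not something I've wired into the script yet):

def levenshtein(a, b):
    # Classic edit distance over insertions, deletions and substitutions;
    # levenshtein("THEY", "THEYVE") == 2 (two insertions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

The catch is that a fuzzy match means comparing each unknown word against the whole list, which is where the "this would take time" part comes in.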

Let's see.

The other option is that I could save a lot on comparisons if I applied a morphological stemmer to the words before comparing. But again, I don't want to go in this direction, as not all languages have strong morphological stemmers.

Maybe I could use the existing Tatoeba corpus to make the word list. But then this would end up being a circle jerk, and I wouldn't be going after vocabulary which is missing from the corpus but should be there.

I'll update later on. This has been a long post.

 

