Well, the title describes it all. It's part of my Google Summer of Code project for Tatoeba. It's the heart of it.
Lately, the code was getting out of hand with many tweaks here and there. It was getting less and less systematic. Anyway, I needed to write everything down to keep track.
Beware, it's a long post. A lot of it has piled up. Later additions will hopefully be smaller.
SentencePicker
What does it do?
Given sentences (previously tokenized using the Punkt Sentence Tokenizer with minor tweaks), it scores and picks sentences which could be good for Tatoeba.
I am only working on English at this moment; I think I have already put too much effort into English alone. I should probably cover other languages with the same script, as many rules are not English-specific but rather Latin-script-specific.
I just need to write down somewhere the checks that I am doing, because I have a feeling that the code is getting out of my hands and untidy.
So, initially I take the text as input and pass it to the sentencesplitter module which I created earlier. It is working pretty well.
The sentences that come out are full of newlines, because the Gutenberg text that I am using is hard-wrapped. Argh. I needed to normalize those sentences.
Simple: I need to replace the '\n's with ' ' and the work is done. But after trying for half an hour, nothing was moving. Why was a simple replace call not working? It turned out Project Gutenberg created all its text files on Windows, and hence the newline sequence was \r\n and not \n. Lesson learned. I hate normalization.
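A minimal sketch of the fix, assuming the raw Gutenberg text is held in a string (`raw` here is a made-up variable for illustration):

```python
# Normalize Windows line endings (CRLF) before un-wrapping the text.
raw = "It was the best\r\nof times."

normalized = raw.replace("\r\n", "\n")   # CRLF -> LF first
unwrapped = normalized.replace("\n", " ")  # then join the hard-wrapped lines
print(unwrapped)  # It was the best of times.
```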
Now, not all the sentences I got are perfect, because Punkt can only handle so much. But the results are more or less good. About 900 sentences extracted.
Now most of them are not useful for the corpus.
The first check was sentence length.
I passed the sentences through the word tokenizers that I had created earlier on.
I removed all sentences shorter than 4 tokens or longer than 12. Well, I got around 117 sentences. I had forgotten to make the comparisons inclusive, and was missing out on sentences which were exactly of length 4 or 12. Correcting that added around 30 more sentences.
147/900
I reported back with the results. Many sentences were junk and pointless. Also, I was losing out on 750 sentences. Not good efficiency at all.
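The corrected filter can be sketched roughly like this (`keep_by_length` and the whitespace tokenizer are made up for illustration; the real script uses its own word tokenizer):

```python
def keep_by_length(sentences, tokenize, min_len=4, max_len=12):
    # Inclusive bounds on both sides: the original bug was using strict
    # < and >, which dropped sentences of exactly 4 or 12 tokens.
    return [s for s in sentences if min_len <= len(tokenize(s)) <= max_len]

tokenize = str.split  # stand-in for the real word tokenizer
sents = [
    "Too short.",
    "This one has exactly five.",
    "A much longer sentence that clearly exceeds the twelve token upper bound here today.",
]
print(keep_by_length(sents, tokenize))  # only the middle sentence survives
```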
Now, we were working on fiction, so it contains a lot of dialogue. A majority of the 750 discarded sentences were longer than 12 tokens, but they contained dialogues which were shorter than 12 tokens and useful for the corpus.
A simple regex was used to get all the text between double quotes. The regex had to be lazy in case there were multiple dialogues in a sentence.
"(.*?)"
The dot matches any character.
The asterisk means 0 or more repetitions.
The question mark after the asterisk makes it lazy (non-greedy), so it stops at the first closing quote.
http://www.regular-expressions.info/repeat.html
Anyway, I added the dialogues extracted from the sentences which were longer than 12 tokens.
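A quick sketch of that extraction with Python's `re` module (the sample sentence is invented):

```python
import re

# Lazy quantifier: .*? stops at the first closing quote, so multiple
# dialogues in one sentence are captured separately.
DIALOGUE = re.compile(r'"(.*?)"')

sentence = '"Come here," she said, "and sit down."'
print(DIALOGUE.findall(sentence))  # ['Come here,', 'and sit down.']
```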
This added a lot of good sentences as well as bad ones, because Punkt screwed up a lot of dialogues. Hence not all dialogues made sense or were actually part of a fully formed dialogue in the original text.
I started ignoring sentences which didn't start with an uppercase character. This helped a lot, as all those partial sentences were now skipped.
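A rough sketch of that check (`starts_upper` is a made-up helper; stripping leading quotes first is my assumption, since sentences may begin with one):

```python
def starts_upper(sentence):
    # Skip fragments that Punkt split mid-sentence: a real sentence
    # (in this English-only pass) starts with an uppercase letter.
    s = sentence.lstrip('"\' ')  # ignore leading quotes and spaces
    return bool(s) and s[0].isupper()

print(starts_upper("He nodded."))     # True
print(starts_upper("and sat down."))  # False
```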
Next, some sentences were too archaic or contextual. I only wanted to pick sentences which made sense out of context too.
I had a list of the top 2000 words of modern English fiction from Wikipedia. So for each sentence, I tokenized it into words and removed all the stopwords using set subtraction (pretty convenient; also, membership tests are faster on sets than on lists).
Later on I took the resulting sentences and subtracted the top-2000-word set from them too. I just love the set subtraction hack. I am not really sure how efficient it is overall, but sets are faster than lists for membership checks because they are hash tables. Anyway, after the subtraction, if all the words in my sentence were also in the set of popular words, my new set would be empty, and according to the algorithm the sentence was relevant and free from context.
This check worked wonderfully, but I ended up with only 50 sentences out of the 900. Just over 5% is very low.
I tried to make the algorithm less stringent and allowed one word not from the list. This gave me 99 sentences, but yes, I ended up losing a little on quality.
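Both the strict and the relaxed versions can be sketched with one hypothetical helper (`is_context_free` and the tiny wordlist below are made up for illustration):

```python
def is_context_free(tokens, common_words, max_unknown=0):
    # Set subtraction: any token not in the top-2000 list survives the
    # subtraction. An empty result means every word is common enough.
    # max_unknown=1 reproduces the relaxed version described above.
    unknown = {w.lower() for w in tokens} - common_words
    return len(unknown) <= max_unknown

common = {"the", "cat", "sat", "on", "a", "mat"}
print(is_context_free(["The", "cat", "sat"], common))         # True
print(is_context_free(["The", "ocelot", "sat"], common))      # False
print(is_context_free(["The", "ocelot", "sat"], common, 1))   # True (relaxed)
```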
Of all the features discussed till now, most can be used for other languages, except that I would need top-2000 (or similar) wordlists. Also some tweaks would differ, like, say, the sentence length bounds.
Also, contractions like they've get tokenized as the single token they've. And obviously, they've isn't on the top 2000 list, so a good sentence gets thrown out. I need to do something about this.
Possible solutions: I could add these contractions to the list. But I don't want to edit these lists manually, because that goes against the automated nature of the script and would require such list generation for each language. I want my script to be more or less language-agnostic.
I could use edit distance while matching and parametrize on the Levenshtein distance.
THEY and THEYVE (after removing the punctuation) would differ by two insertions, hence a distance of 2. But this would take time.
Let's see.
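For reference, a standard dynamic-programming Levenshtein implementation (not the project's actual code):

```python
def levenshtein(a, b):
    # Classic edit distance: insertions, deletions and substitutions
    # each cost 1. Only the previous DP row is kept in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("THEY", "THEYVE"))  # 2 (two insertions)
```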
The other thing is, I could save a lot on comparisons if I applied a morphological stemmer to the words and then compared. But again, I don't want to go in this direction, as not all languages have strong morphological stemmers.
Maybe I could use the existing Tatoeba corpus to make the wordlist. But then this would end up being circular, and I wouldn't be going after vocabulary which is missing from the corpus but should be there.
I'll update later on. This has been a long post.