
Friday 18 July 2014

So this happened. Really embarrassed.

I am currently working on a Flask webapp to use the sentence picker script I am developing for Tatoeba as part of GSoC.

So, I needed to take the .txt file of an open text as input from the user. I made the form accept the file and send it in a POST request to my view. The view accepted the file, did a little validation, and tried to print the type of the object returned from the form to the console.

I tried testing it. The form appeared. Nothing wrong. I had to select a file to upload. Clicked browse, pressed f randomly, and found foo.txt on my Desktop. It was a pretty small file, worked for me. Pressed upload, and learned that the object was a FileStorage. Read the documentation and learned that I needed to call read() (similar to a file object) to get the text. Changed the code accordingly to print whatever was returned by read().
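For context, the view looked roughly like this (a minimal sketch; the route and field name are made up for illustration, not the actual GSoC code):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload():
    # request.files yields a werkzeug FileStorage object, not a plain file
    uploaded = request.files['textfile']   # 'textfile' is a hypothetical form field name
    text = uploaded.read()                 # read() returns the uploaded file's contents
    print(text)
    return 'ok'
```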

So, the console printed,
fd = 1
Should not see this either.



Hmm. Probably you couldn't just read a file like that without saving it first or something. My guess was that the call was returning the file descriptor's integer along with an internal warning saying I shouldn't be reading this.

I really don't know what I was thinking. But certainly this wasn't working.

After spending 3-4 hours on it, disturbing the org IRC and searching everywhere, I felt like giving up and looking into other methods of getting text as input. Another day gone to waste.

Out of curiosity, I ended up opening foo.txt to see what it really had.


So. Yes. That happened. I had been reading the file right from the beginning. The file contained the text,
fd = 1

Should not see this either.

Lesson learned. I feel completely asinine right now.

PS. In retrospect, even if I had been getting the fd of the newly opened file, it should have been 3, since 0, 1, and 2 are already taken by stdin, stdout, and stderr.
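For the curious, a quick throwaway check of that claim (not part of the webapp):

```python
import sys
import tempfile

# stdin, stdout and stderr already hold descriptors 0, 1 and 2
print(sys.stdin.fileno(), sys.stdout.fileno(), sys.stderr.fileno())   # 0 1 2

# so a freshly opened file usually gets the next free descriptor, i.e. 3
print(tempfile.TemporaryFile().fileno())
```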

Saturday 5 July 2014

Why Clipping Of Frequency Distribution Wouldn't Help With Stopword Generation [Part 2]


This is in continuation of the last post, where I tried to find a pattern in the frequencies of stopwords.

I had earlier assumed that stopwords would be the most frequent words, or at least that a large number of them would be among the most frequent words. That assumption was apparently proved wrong in the last post. But I realized I hadn't done a thorough job, and that I should have enough data before completely rejecting the assumption.

What if the cutoff limit is set even lower? What percentage of the stopwords would be covered then? I decided to plot again.

By the way, I am using Tatoeba's English sentence corpus for the experiment.


The cyan line represents the percentage of stopwords acquired by clipping at a particular frequency. For example, if you look at the plot at a frequency of 20,000, it tells you that if you set the threshold at 20k and take all words with frequencies above 20k, you get around 13% of the total stopwords listed by nltk.

Naturally, if the cutoff is set to zero, you are taking all words with frequency above 0, and hence end up getting all the stopwords.

Now, to get even 50% of the stopwords, the threshold has to be set at 3540, which is much lower than the thresholds I used in the last experiment (20k and 10k). In other words, about 50% of the stopwords have frequencies below 3540.

The interesting part: if we set the threshold at 3540, we end up getting 50% of the stopwords, but how many non-stopwords come with them? Only 45. Now, 45 seems small compared to the total number of words, which is more than 40k. But 45 is considerable noise in the stopword list we are generating: nltk has a total of 127 stopwords, so 45 wrong entries is a lot of noise against only 63 correctly selected stopwords.
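A rough sketch of how these numbers can be computed (the function and variable names are just illustrative; `words` stands in for the tokenised Tatoeba English corpus):

```python
from nltk.corpus import stopwords
from nltk.probability import FreqDist

def coverage_at(fdist, threshold, stop_set):
    """Fraction of stopwords captured, and count of non-stopwords picked up,
    when keeping only words with frequency above the threshold."""
    selected = {w for w, f in fdist.items() if f > threshold}
    return len(selected & stop_set) / len(stop_set), len(selected - stop_set)

# tiny stand-in corpus; `words` would really be the tokenised Tatoeba English sentences
words = "the cat sat on the mat and the dog sat on the cat".lower().split()
fdist = FreqDist(words)
stop_set = set(stopwords.words('english'))

print(coverage_at(fdist, 1, stop_set))   # (coverage fraction, noise count)
```

With the real corpus, a threshold of 3540 is where the coverage fraction first reaches about 0.5, with 45 noise words.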


Also, as before, the purple dots are the frequencies of all words, and the red dots represent the stopword frequencies.

Tuesday 1 July 2014

Why Clipping Of Frequency Distribution Wouldn't Help With Stopword Generation


In the last post I was toying with the idea that a list of stopwords could be generated automatically by clipping off the high-frequency part of a frequency distribution of words taken from the Tatoeba corpus of English sentences.


So if we clipped the graph at, say, around 20,000 or 10,000, we could get a list of stopwords. I investigated it further today and was proved wrong.

So, I took the frequency distribution and plotted the frequency of each word on the x-axis, simply fixing the y-axis at 1, to see if I could spot any clustering. Until then I was certain that I would find a cluster near the 10k mark and that the remaining words would form a different cluster. I had also assumed that the stopwords were far higher in frequency than the non-stopwords. All of this was proven wrong.

So, I initially plotted the stopwords to see if I could see any clusters.

So, anyway, the clustering came out a little different than I had expected. The words with frequencies below 20k end up forming a cluster while the remaining, higher-frequency words are scattered. Still, a separation could be made.

But it all made sense when I plotted the nltk stopwords on the same graph to see whether the stopwords really are the most frequent words in a corpus.



The dots in red are the words from the nltk stopword list while the blue dots are all the words in the corpus.
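Roughly how such a plot can be produced (a sketch; the corpus filename here is hypothetical, and I'm just splitting on whitespace for simplicity):

```python
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.probability import FreqDist

# build the frequency distribution over the corpus (hypothetical filename)
words = open('eng_sentences.txt').read().lower().split()
fdist = FreqDist(words)
stop_set = set(stopwords.words('english'))

all_freqs  = [f for w, f in fdist.items()]
stop_freqs = [f for w, f in fdist.items() if w in stop_set]

# frequency on the x-axis, y fixed at 1, as described above
plt.scatter(all_freqs,  [1] * len(all_freqs),  color='blue', label='all words')
plt.scatter(stop_freqs, [1] * len(stop_freqs), color='red',  label='nltk stopwords')
plt.xlabel('word frequency')
plt.legend()
plt.show()
```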

Now, I thought I had probably made a logical mistake in my code, because the stopword scatter looks almost the same as the all-word scatter.

But on zooming in on the highly dense part, I realized the two plots really did mirror each other. The highest-frequency words were part of the stopword list, but the remaining stopwords were distributed throughout the corpus, not just at the extreme end of the plot as I had assumed.

I printed the frequencies of all the stopwords and it made sense.

all 10362
just 6383
being 1981
over 3077
both 1241
through 1336
yourselves 66
its 1253
before 3131
herself 447
had 9268
should 4769
to 102287
only 3843
under 894
ours 87
has 10816
do 18263
them 2997
his 20218
very 7362
they 10419
not 17543
during 730
now 4686
him 10412
nor 274
did 6792
this 23848
she 19373
each 1506
further 177
where 3850
few 1445
because 2471
doing 1709
some 4852
are 18810
our 5011
ourselves 183
out 8799
what 14946
for 25174
while 1651
does 2897
above 364
between 971
t 13
be 18481
we 18294
who 5402
were 6786
here 6315
hers 62
by 8629
on 18708
about 8901
of 46076
against 1032
s 13
or 3614
own 1516
into 3706
yourself 1040
down 2823
your 14615
from 9115
her 12497
their 3665
there 9686
been 5945
whom 204
too 3855
themselves 314
was 28484
until 1122
more 5306
himself 1328
that 30412
but 7883
don 9
with 18232
than 5267
those 1489
he 40149
me 19989
myself 1034
these 2213
up 8576
will 11079
below 136
can 10195
theirs 39
my 22829
and 25155
then 1370
is 59895
am 4076
it 29943
an 8428
as 12119
itself 234
at 15178
have 23312
in 43745
any 3826
if 8176
again 2130
no 7407
when 7689
same 1552
how 8036
other 2842
which 2057
you 66957
after 3127
most 1814
such 1912
why 4478
a 76821
off 3071
i 90981
yours 339
so 7067
the 139284
having 1201
once 1477


Even if we exclude stopwords like s and t, there are other stopwords whose frequencies sit in the mid-5k range, which is way lower than the 10k limit.

Hence, my assumption that a stopword list can be generated automatically from a corpus of a particular language by clipping off the high-frequency words is wrong.