In the last post I was toying with the idea that a list of stopwords could be generated automatically by clipping the high-frequency end of a frequency distribution of words taken from the Tatoeba corpus of English sentences.

So if we clipped the graph at, say, around 20,000 or 10,000, we could get a list of stopwords. I investigated this further today and was proved wrong.
I took the frequency distribution and plotted the frequency of each word on the x-axis, simply setting the y-axis value to 1, to see whether any clustering would show up. Until then I was certain that I would find one cluster near the 10k mark, with the remaining words forming a separate cluster. I had also assumed that stopwords were far more frequent than non-stopwords. Both assumptions were proven wrong.
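As a rough sketch of that step, the snippet below builds a word frequency distribution and turns it into the points for such a one-dimensional plot. The three sample sentences and the name `word_frequencies` are made up for illustration; the real input would be the Tatoeba sentences, and the tokenizer here is a deliberately crude assumption.

```python
import re
from collections import Counter


def word_frequencies(sentences):
    """Count lowercase alphabetic tokens across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenizer: runs of letters only, so "don't" becomes "don" and "t".
        counts.update(re.findall(r"[a-z]+", sentence.lower()))
    return counts


# Hypothetical stand-in for the Tatoeba corpus.
sample_sentences = [
    "The cat sat on the mat.",
    "I don't like the rain.",
    "She is on the train.",
]

freqs = word_frequencies(sample_sentences)

# One point per word: frequency on the x-axis, a constant 1 on the y-axis.
xs = list(freqs.values())
ys = [1] * len(xs)
```

The `xs` and `ys` lists can then be handed to any scatter plot routine, for example matplotlib's `plt.scatter(xs, ys)`.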
I initially plotted the stopwords to see whether any clusters appeared.
The clusters came out a little different from what I had expected. The words with a frequency below 20k end up forming a cluster, and the remaining higher-frequency words are scattered. Still, the two groups could be told apart.
But it all made sense when I plotted the nltk stopwords along with this graph to see whether stopwords really are the most frequent words in a corpus.

The dots in red are the words from the nltk stopword list, while the blue dots are all the words in the corpus.
At first I thought I had made a logical mistake in my code, because the stopword scatter looks almost the same as the all-word scatter.
But on zooming in on the dense part, I realized the two plots really did mirror each other. The highest-frequency words were part of the stopword list, but the remaining stopwords were spread across the whole frequency range, not just at the extreme end of the plot as I had assumed.
I printed the frequencies of all the stopwords and it made sense.
all 10362
just 6383
being 1981
over 3077
both 1241
through 1336
yourselves 66
its 1253
before 3131
herself 447
had 9268
should 4769
to 102287
only 3843
under 894
ours 87
has 10816
do 18263
them 2997
his 20218
very 7362
they 10419
not 17543
during 730
now 4686
him 10412
nor 274
did 6792
this 23848
she 19373
each 1506
further 177
where 3850
few 1445
because 2471
doing 1709
some 4852
are 18810
our 5011
ourselves 183
out 8799
what 14946
for 25174
while 1651
does 2897
above 364
between 971
t 13
be 18481
we 18294
who 5402
were 6786
here 6315
hers 62
by 8629
on 18708
about 8901
of 46076
against 1032
s 13
or 3614
own 1516
into 3706
yourself 1040
down 2823
your 14615
from 9115
her 12497
their 3665
there 9686
been 5945
whom 204
too 3855
themselves 314
was 28484
until 1122
more 5306
himself 1328
that 30412
but 7883
don 9
with 18232
than 5267
those 1489
he 40149
me 19989
myself 1034
these 2213
up 8576
will 11079
below 136
can 10195
theirs 39
my 22829
and 25155
then 1370
is 59895
am 4076
it 29943
an 8428
as 12119
itself 234
at 15178
have 23312
in 43745
any 3826
if 8176
again 2130
no 7407
when 7689
same 1552
how 8036
other 2842
which 2057
you 66957
after 3127
most 1814
such 1912
why 4478
a 76821
off 3071
i 90981
yours 339
so 7067
the 139284
having 1201
once 1477
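A listing like the one above could be produced along these lines. This is only a sketch: the tiny corpus and the three-word stopword set are placeholders for the Tatoeba counts and the nltk stopword list, and `stopword_frequencies` is a name invented here.

```python
from collections import Counter


def stopword_frequencies(freqs, stopwords):
    """Look up the corpus frequency of every stopword (0 if it never occurs)."""
    return {word: freqs[word] for word in sorted(stopwords)}


# Placeholder corpus counts; the real ones would come from the Tatoeba sentences.
freqs = Counter("the cat sat on the mat".split())

# Placeholder stopword set; the post uses the nltk English stopword list.
stopwords = {"the", "on", "whom"}

stop_freqs = stopword_frequencies(freqs, stopwords)

# Print in the same "word count" layout as the listing.
for word, count in stop_freqs.items():
    print(word, count)
```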
Even if we ignore stopwords like s and t, there are other stopwords in the 5k range and below, which is well under the 10k limit.
Hence, my assumption that a stopword list can be formed automatically from a corpus of a particular language by clipping off the high-frequency words is wrong.
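To put a number on this, here is a small sketch that applies the 10k clip to a handful of the stopword counts printed above and checks how many survive. The twelve entries are copied from the listing; the subset itself is an arbitrary pick for illustration, and the names `clip_by_frequency` and `threshold` are assumptions of this sketch.

```python
def clip_by_frequency(freqs, threshold):
    """Return the set of words whose frequency is at or above the threshold."""
    return {word for word, count in freqs.items() if count >= threshold}


# A hand-picked subset of the stopword counts from the listing above.
stopword_counts = {
    "the": 139284, "to": 102287, "i": 90981, "a": 76821, "you": 66957,
    "all": 10362, "just": 6383, "our": 5011, "should": 4769,
    "between": 971, "whom": 204, "theirs": 39,
}

threshold = 10_000  # the clipping point considered in the post

kept = clip_by_frequency(stopword_counts, threshold)
missed = set(stopword_counts) - kept

# Fraction of these stopwords that the clip would actually recover.
recall = len(kept) / len(stopword_counts)
print(sorted(missed), recall)
```

On this subset, half of the stopwords fall below the clip, which is the same pattern the full listing shows.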


 