
Tuesday 17 February 2015

Mass Voting and Bypassing Captcha on College Fest's T-Shirt Poll (Requests and Fiddler)

Disclaimer: I didn't bypass reCaptcha, only the webmaster's own home-grown captcha. Phew.

Ethics
I am going to put this here, in case you do selective reading and skip over this part in the end.
Please be considerate and try not to cause harm.
I didn't want to jeopardize the poll, even though I did have a favorite (#1) and used that particular vote throughout my testing. (I later told the webmaster to adjust the votes accordingly, but the decision had already been made; #2 was winning by a margin. And thank god it did. T-shirt #1 was a terrible decision, and terrible decisions are not to be judged.) Also, I only voted every 5 seconds, so I didn't increase the load on the server.
Please use reCaptcha.

Okay.

Every year, before my university holds its cultural fest, Synapse, the students are polled on which T-shirt design becomes the official t-shirt representing the festival.

This year, instead of sending out a simple Google Form through the college webmail, the student webmaster decided to build a form of his own into the official Synapse website.

Was the decision made so that people from outside could also vote? Or to keep the whole Synapse experience in one place? I don't know.


But I have been away from college for my last semester, on an internship, and decided to have some fun.

The form fields are :
Preference
What is 100x10

I have been doing a lot of scraping work lately, both at work and at home, and wanted to see if I could get through this and get some practice.
At least the student webmaster tried to stop spam by adding his own version of a captcha.
After refreshing the page a few times, I could see the pattern in the questions.

They were either of the type:
Type xyz or What is x + y?

These captchas could be solved easily on the go by the script.

To get a fair idea of different types of questions, I wrote a scraper to scrape the questions and save them in a file. This part of the script was anyway going to be helpful later on when I would be solving the Captcha.


import requests

def fetch_question():
    # scrape the current captcha question off the poll page
    url = 'http://synapse.daiict.ac.in/poll.php'
    print 'Making request'
    req = requests.session()
    try:
        text = req.get(url, timeout=5).text.encode('utf-8')
    except requests.exceptions.RequestException:
        # too lazy to handle specific exceptions properly. Sin.
        print 'Request Timeout'
        return False

    text = text.replace('\t', '')

    # the question sits inside a <span> between the last <option> of the
    # preference dropdown and the spam (captcha answer) input field
    starter = '<option value="6">6</option>'
    ender = '<input id="name" type="text" name="spam"'

    text = text[text.find(starter):text.find(ender)]
    text = text[text.find('<span>') + 6:text.find('</span>')]

    question = text.strip()
    print question
    return question

I did write a regular expression initially, but it was failing for some reason and I didn't want to waste time debugging it. I could have used Beautiful Soup too, but I just wanted to be done with it quickly.
Hence the text.find hack. As long as the work gets done.

The questions:
Name of the state you are in (in lowercase)?
What is when  19+3?
What is when  100x10?
What is when  50x10?
What is when  9+3?
what is the first letter of your college's name?
What is when  9-3?
Type linux
Name of this planet (in lowercase)?
Name of our planet (in lowercase)?
what comes before b?
what comes after a?
What is when  500/10?
Type  pink
What is when  500x10?
What is when  10x10?
What is when  50/10?

So, the website wasn't generating questions on the fly, but had some 16-17 questions from which it picked randomly.
Bleh. I opened a CSV and manually wrote down the answers to all 16 questions.
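The lookup itself is then trivial; a minimal sketch, assuming the answers live in a two-column answers.csv (question, answer):

import csv

# answers.csv: one "question,answer" row per captcha question (hypothetical layout)
with open('answers.csv') as f:
    answers = dict((q.strip(), a.strip()) for q, a in csv.reader(f))

def solve(question):
    # returns None if the webmaster sneaks in a question I haven't seen yet
    return answers.get(question.strip())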

The POST call
Now, I just needed to send the appropriate POST data.
To see which fields were being sent, and to what URL, I used Firefox's console.



So the fields that were being sent were:
  1. prefempty
  2. option
  3. pref2
  4. spam
  5. pref
The value of prefempty stayed the same and was not dynamically generated. I later learned the webmaster had added it to thwart another attack on the main festival registration form.

Option represented my vote and spam was the answer to the captcha.

All I needed to do was a POST call.



import requests

def cast_vote(option, answer):
    # POST one vote along with the captcha answer
    url = 'http://synapse.daiict.ac.in/poll.php'
    payload = {'option': option, 'spam': answer, 'pref': 'vote',
               'pref2': 'http://', 'prefempty': ''}
    try:
        r = requests.post(url, data=payload, timeout=5)
    except requests.exceptions.RequestException:
        print 'Post time out'
        return False
    return r
 
 
 
Done! Done! Done!
But wait.
How do I check whether the post was successful or not?

Fiddler. This is a tool I recently started using at work. It lets you monitor all the POST and GET requests made by your computer, not just by your browser the way the browser console does.

Every successful vote returned a page containing 'Vote has been registered'.
Fiddler lets you inspect the response to your request, and I couldn't find the word 'registered' in the response. Hence my votes were not working.

Could it be that it was blocking the request because it didn't have appropriate headers? Or something to do with cookies?
 
So, a good way to go about it was to replicate a normal browser as closely as I could.
Instead of making the GET call and the POST call from separate sessions, I made them from the same requests session.

Why did it work?
In retrospect: every time I made a GET request, the server started a session for me and picked a question-answer pair from its pool. When I then made a POST call from a different session, a new session was created whose question and answer I didn't know; I was simply sending the answer for the previous session, and hence failing the captcha. The chance of passing it that way was about 1/16, i.e. guessing.
Making both calls from the same session ensured that the answer I POSTed matched the question scraped in that same session.
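Put together, the voting loop looked roughly like this. A sketch, not the exact script: solve() is the csv lookup from above, the list of designs is assumed from the dropdown going up to 6, and the success check is just the substring search described above.

import time
import requests

URL = 'http://synapse.daiict.ac.in/poll.php'
designs = ['1', '2', '3', '4', '5', '6']   # the dropdown in the scraped HTML goes up to 6

session = requests.session()
for i in range(1200):
    # GET first: the server ties a question to THIS session
    html = session.get(URL, timeout=5).text.replace('\t', '')
    html = html[html.find('<option value="6">6</option>'):]
    question = html[html.find('<span>') + 6:html.find('</span>')].strip()

    answer = solve(question)
    if answer is None:          # a question I haven't stored an answer for
        continue
    payload = {'option': designs[i % len(designs)],   # spread votes equally
               'spam': answer,
               'pref': 'vote', 'pref2': 'http://', 'prefempty': ''}
    # POST from the SAME session, so the answer matches the question just served
    r = session.post(URL, data=payload, timeout=5)
    if 'registered' not in r.text:
        print 'Captcha failed for:', question
    time.sleep(5)   # one vote every 5 seconds, to keep the load low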



So, this worked correctly. Yay :D

So, I ran the script to cast 1200 votes, split equally among all the shirts. Fun fact: my college's population is around that number!

And the webmaster intervened..

Somewhere around 2:30 AM, I wanted to check the count, feel a little good about myself,  and go to sleep. But the script was failing the captcha.

On first glance, I could see that the questions were still the same. Why was the script failing?

It turned out the webmaster had, obviously, noticed the votes and decided to change the questions but keep the answers the same!

Lazy webmaster, very lazy. Can't blame him, it was too late in the night.

So he had added one or two more questions and slightly edited the others. But mostly he had batch-edited the maths questions from:
What is X + Y? to What happens when X + Y?

I chuckled way too hard for that time of the night. So I batch-edited my questions in the CSV too and we were good to go! This was fun, it was turning into a 1v1. And it was harmless.

By 3 AM I was done with the changes, but what if he was still up and changed the questions again? Should I wait a little longer and then attack again? But I was really sleepy.

So, I started the script with a delay,


import time
time.sleep(1800)   # wait half an hour before starting

and both of us went to sleep together. Aww. :|

Next morning I called up the webmaster and told him what I had been doing, and not to fret about the poll results because I had voted equally. He was a good sport and took it well. We decided to keep playing the game: he wouldn't Google anything and would still figure out a way to stop my script.

He eventually added a random number of empty fields to each request, and multiple forms of which only one was active. Had I encountered this first, I wouldn't have tried to bypass it at all. It was a good trick. Dirty, but effective!

Now, this was a good hack. To beat it I would have had to parse the HTML and send a POST request to every form, making sure at least one was the real one. The empty fields could also be scraped from the forms. But it was a lot of grunt work and I didn't want to pursue it, also because I was at work.

But all in all, it was educational for both parties. Good sport, webmaster.

Between doing this and getting around to blogging it, I finally used Selenium for one of my projects. In retrospect, Selenium would have easily solved the problem of multiple forms and hidden fields.

What should the webmaster have used?
One should simply use reCaptcha, which is what I had suggested, but the webmaster didn't want to rely on external tools, and that didn't make sense to me at all. Yes, it is ugly, spoils the UX and could put some people off voting, but people would at least vote once, and that is what we wanted. This poll was a two-day affair, but the festival's main registration form should definitely have reCaptcha. Well, I haven't checked whether it has been updated.

Tools used : Python (Requests), Fiddler

Please use reCaptcha!

Friday 18 July 2014

So this happened. Really embarrassed.

I am currently working on a Flask webapp to use the sentence picker script I am developing for Tatoeba as part of GSoC.

So, I needed to take a txt file of an open text as input from the user. I made the form accept the file and send it in a POST request to my view. The view accepted the file, did a little validation, and tried to print the type of the object returned from the form to the console.

I tried testing it. The form appeared; nothing wrong. I had to select a file to upload, so I clicked browse, pressed f at random, and found foo.txt on my Desktop. It was a pretty small file, so it worked for me. I pressed upload and learned that it was a FileStorage object. I read the documentation and learned that I needed to call read() (similar to a File object) to get the text, and changed the code accordingly to print whatever read() returned.
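The view was more or less this; a stripped-down sketch, not the actual GSoC code, with made-up field and endpoint names:

from flask import Flask, request

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload():
    uploaded = request.files['opentext']   # a werkzeug FileStorage object
    print uploaded.read()                  # dump whatever the file contains to the console
    return 'ok'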

So, the console printed,
fd = 1
Should not see this either.



Hmm. Probably you couldn't just read a file like that without saving it first or something, and what I was getting back was the file descriptor's integer plus some internal warning saying I shouldn't be reading this.

I really don't know what I was thinking. But certainly this wasn't working.

After spending 3-4 hours on it, disturbing the org IRC and searching everywhere, I felt like giving up and looking into other methods of getting text as input. Another day gone to waste.

Out of curiosity, I ended up opening foo.txt to see what it really had.


So. Yes. That happened. I had been reading the file right since the beginning. The file contained the text,
fd = 1

Should not see this either.

Lesson learned. I feel completely asinine right now.

Ps. In retrospect, if I were expecting to get the fd of the newly opened file, it should have been 3. Here.

Saturday 5 July 2014

Why Clipping Of Frequency Distribution Wouldn't Help With Stopword Generation [Part 2]


In continuation of the last post, where I ran an experiment to find a pattern in the frequencies of stopwords.

I had earlier assumed that stopwords would be the most frequent words, or at least that a large number of them would be among the most frequent words. That assumption was evidently proved wrong in the last post. But I realized I hadn't done a thorough job, and I should have enough data to completely discard the assumption.

What if the cutoff limit is set even lower? What percentage of stopwords gets covered then? I decided to plot again.

Btw, I am using Tatoeba's English sentence corpus for the experiment.


The cyan line represents the percentage of stopwords acquired by clipping at a particular frequency. For example, if you look at the plot at a frequency of 20000, it tells you that if you set the threshold at 20k and take all words with frequencies above 20k, you get around 13% of the total stopwords listed by nltk.

Naturally, if the cutoff is set to zero, you take every word with non-zero frequency and hence end up getting all the stopwords.

Now, to get even 50% of the stopwords, the threshold has to be set at 3540, which is far lower than the thresholds I used in the last experiment (20k and 10k). In other words, about 50% of the stopwords have frequencies less than 3540.

Interesting part:
Now, if we set the threshold at 3540 we end up getting 50% of the stopwords, but what about the number of non-stopwords above that threshold? There are only 45. Now, 45 seems small compared to the total number of words (more than 40k), but it is considerable noise in the stopword list we are generating. Nltk has a total of 127 stopwords, and 45 wrong entries is a lot of noise when only 63 correct stopwords have been selected.
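Those numbers come from a computation along these lines; a sketch, where freqs is assumed to be a word-to-frequency dict built from the corpus:

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))   # the 127 nltk English stopwords

def coverage_and_noise(freqs, threshold):
    # which words survive the frequency cutoff, and how many of them are actual stopwords
    kept = [w for w, f in freqs.items() if f >= threshold]
    stop_kept = [w for w in kept if w in STOPWORDS]
    noise = len(kept) - len(stop_kept)
    coverage = 100.0 * len(stop_kept) / len(STOPWORDS)
    return coverage, noise

# coverage_and_noise(freqs, 3540) comes out to roughly (50.0, 45) on this corpus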


Also, as before, the purple dots are the frequencies of all words and the red dots are the stopword frequencies.

Tuesday 1 July 2014

Why Clipping Of Frequency Distribution Wouldn't Help With Stopword Generation


In the last post I was toying with the idea that a list of stopwords could automatically be generated by clipping the high-frequency part of a frequency distribution of words taken from the Tatoeba corpus of English sentences.


So if we clipped the graph at, say, around 20,000 or 10,000 we could get a list of stopwords. I investigated it further today and was proved wrong.

So, I took the frequency distribution and plotted the frequency of each word on the x-axis, simply setting the y-value to 1, to see if I could spot any clustering. Until then I was certain that I would find one cluster near the 10k mark and the remaining words would form a different cluster. I had also assumed that the stopwords were far more frequent than the non-stopwords. All of this was proven wrong.

So, I initially plotted the stopwords to see if I could see any clusters.

So, anyway, the clustering came out a little different than I had expected. The words with frequency less than 20k end up forming a cluster and the remaining, higher-frequency words are scattered. Still, the distinction could be made.

But it all made sense when I plotted the nltk stopwords on top of this graph, to see whether the stopwords really are the most frequent words in the corpus.



The dots in red are the words from the nltk stopword list while the blue dots are all the words in the corpus.
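The plot itself is simple to reproduce; a sketch with matplotlib, where freqs is again a word-to-frequency dict:

import matplotlib.pyplot as plt
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

def plot_word_frequencies(freqs):
    # every word becomes a dot at (frequency, 1); stopwords are drawn on top in red
    all_freqs = list(freqs.values())
    stop_freqs = [f for w, f in freqs.items() if w in STOPWORDS]
    plt.scatter(all_freqs, [1] * len(all_freqs), color='blue', label='all words')
    plt.scatter(stop_freqs, [1] * len(stop_freqs), color='red', label='nltk stopwords')
    plt.xlabel('word frequency')
    plt.legend()
    plt.show()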

At first I thought I had probably made a logical mistake in my code, because the stopword scatter looks almost the same as the all-word scatter.

But on zooming in on the highly dense part, I realized the two plots really did mirror each other. The high-frequency words were part of the stopword list, but the remaining stopwords were distributed throughout the corpus and not just at the extreme end of the plot as I had assumed.

I printed the frequencies of all the stopwords and it made sense.

all 10362
just 6383
being 1981
over 3077
both 1241
through 1336
yourselves 66
its 1253
before 3131
herself 447
had 9268
should 4769
to 102287
only 3843
under 894
ours 87
has 10816
do 18263
them 2997
his 20218
very 7362
they 10419
not 17543
during 730
now 4686
him 10412
nor 274
did 6792
this 23848
she 19373
each 1506
further 177
where 3850
few 1445
because 2471
doing 1709
some 4852
are 18810
our 5011
ourselves 183
out 8799
what 14946
for 25174
while 1651
does 2897
above 364
between 971
t 13
be 18481
we 18294
who 5402
were 6786
here 6315
hers 62
by 8629
on 18708
about 8901
of 46076
against 1032
s 13
or 3614
own 1516
into 3706
yourself 1040
down 2823
your 14615
from 9115
her 12497
their 3665
there 9686
been 5945
whom 204
too 3855
themselves 314
was 28484
until 1122
more 5306
himself 1328
that 30412
but 7883
don 9
with 18232
than 5267
those 1489
he 40149
me 19989
myself 1034
these 2213
up 8576
will 11079
below 136
can 10195
theirs 39
my 22829
and 25155
then 1370
is 59895
am 4076
it 29943
an 8428
as 12119
itself 234
at 15178
have 23312
in 43745
any 3826
if 8176
again 2130
no 7407
when 7689
same 1552
how 8036
other 2842
which 2057
you 66957
after 3127
most 1814
such 1912
why 4478
a 76821
off 3071
i 90981
yours 339
so 7067
the 139284
having 1201
once 1477


Even if we exclude stopwords like s and t, there are other stopwords in the mid-5k range, which is way lower than the 10k limit.

Hence, my assumption that a stopword list can be formed automatically, given a corpus of a particular language, by clipping off the high-frequency words is wrong.

Wednesday 18 June 2014

Frequency Distribution in Tatoeba's Corpus



So, I was dabbling with the corpus for SentencePicker's sake; check the last post.

I decided to write a script for plotting the frequency distribution of languages in the Tatoeba corpus.

So basically you give it the sentences.csv file (containing all the sentences on Tatoeba in every language), a stopword list if you have one, and the language. That's it.

It plots and prints out data for the language.

It basically prints how many times each word appears in the corpus. A lot can be learned from that. I will upload the code once I have polished it enough for general use. And yes, pep8 compliant.
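Until then, here is a rough sketch of the idea (not the polished script; sentences.csv is assumed to be tab-separated as id, language, text, which is how Tatoeba exports it):

import csv
from collections import Counter

def frequency_distribution(csv_path, lang, stopword_file=None):
    # count how often each word appears in one language's sentences
    stop = set()
    if stopword_file:
        stop = set(line.strip().lower() for line in open(stopword_file))

    counts = Counter()
    with open(csv_path) as f:
        for sent_id, sent_lang, text in csv.reader(f, delimiter='\t'):
            if sent_lang != lang:
                continue
            for word in text.lower().split():
                if word not in stop:
                    counts[word] += 1
    return counts

# frequency_distribution('sentences.csv', 'eng').most_common(500) is the data behind the plots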

This is how it looks :

With stopword removal.
Without stopword removal, top 500.

Observations :
Tom is the most used word in the English corpus of Tatoeba after the stopwords are removed. It is probably used in a lot of example sentences; it is almost a convention to use Tom for males and Mary for females. Also, Tom is way more popular than Mary.

More importantly, most words after Tom are below the 10k mark.

The second diagram shows the distribution of words without any filtering; no stopwords have been removed. Now words like the, is, are and a are on top. We could probably create an automated stopword list generator from such graphs. If you look closely, the initial words in the second diagram fall in the range of roughly 140,000 down to 20,000. These can essentially be assumed to be the stopwords because, as we saw in the first diagram with stopwords removed, the remaining words sit below the 10,000 frequency mark.

We just need to pick out the words whose frequency lies between the top and 20k.

But each language will have a different lower limit, and the problem lies in determining that lower limit.

An algorithm which observes the change in slope of the curve, and the differences in frequency between sequentially ranked words, could be used to get the stopwords of any language given a strong enough corpus.
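One ingredient for that, as a sketch: the relative drop in frequency between consecutively ranked words, which a cutoff-finding heuristic could then scan for an elbow (the top_n limit here is arbitrary).

def relative_drops(freqs, top_n=200):
    # ratio of each word's frequency to the next-ranked word's frequency;
    # a stretch of ratios close to 1.0 after an initial steep run marks the elbow
    ranked = sorted(freqs.items(), key=lambda x: x[1], reverse=True)[:top_n]
    return [(ranked[i][0], ranked[i - 1][1] / float(ranked[i][1]))
            for i in range(1, len(ranked))]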

This would probably be a good contribution to the interwebz, because I recently downloaded a zip with stopword lists of "all" languages and it contained only about 30. On the other hand, Tatoeba has 130+ languages. Not all of those languages have a strong corpus yet, but one of these years they will.

It's a pretty exciting problem, at least for me, though it may already have been solved by someone.

Will code it one of these days, stay tuned :)

Ps. Not proofreading, because too sleepy. Also, I am sure a lot of other analysis could have been done from the Frequency Distributions. I am just not in the mood to.


Monday 16 June 2014

Sentence Picker : Picking Usable Context Free Sentences From Open Texts

Well, the title describes it all. It's part of my Google Summer of Code project for Tatoeba.

It's the heart of it.

Lately, the code was getting out of hand with many tweaks here and there, and becoming less and less systematic. Anyway, I needed to write everything down to keep track.

Beware, it's a long post; a lot has piled up. Later additions will hopefully be smaller.

SentencePicker
What does it do?
Given sentences (previously tokenized using the Punkt sentence tokenizer with minor tweaks), it scores and picks the ones which could be good for Tatoeba.

I am only working on English at the moment; I think I have already put too much effort into English alone. I should probably cover other languages in the same script, as many of the rules are not English-specific but rather Latin-script-specific.

I just need to write down somewhere the checks that I am doing, because I have a feeling the code is getting out of hand and untidy.

So, initially I take the text as input and pass it to the sentencesplitter module which I created earlier. It is working pretty well.

The sentences that come out are full of newlines because the Gutenberg text I am using is hard-wrapped. Argh. I needed to normalize those sentences.

Simple: replace the '\n's with ' ' and the work is done. But after trying for half an hour nothing was moving. Why wasn't a simple replace call working? It turned out the Gutenberg Project created all its text files on Windows, so the newline sequence was \r\n and not \n. Lesson learned. I hate normalization.
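The fix is a one-liner once you know what the newline actually is:

# Gutenberg files carry Windows line endings, so strip '\r\n', not just '\n'
sentence = sentence.replace('\r\n', ' ').strip()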

Not all the sentences that come out are perfect, because Punkt can only handle so much, but the results are more or less good. About 900 sentences were extracted.

Now most of them are not useful for the corpus.

The first check was sentence length.

I passed the sentences through the word tokenizers that I had created earlier on.
I removed all sentences smaller than length 4 or greater than length 12, and got around 117 sentences. I had forgotten to keep the equality signs in the checks and was missing sentences which were exactly of length 4 or 12; correcting that added around 30 more.

147/900
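The check itself is a one-liner; a sketch, with word_tokenize standing in for my own tokenizer:

# 4 to 12 words, inclusive (note the <=; forgetting the equality was the bug above)
kept = [s for s in sentences if 4 <= len(word_tokenize(s)) <= 12]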

I reported back with the results. Many sentences were junk and pointless. Also, I was losing out on 750 sentences. Not good efficiency at all.

Now, we were working on fiction, so the text contains a lot of dialogue. A majority of the 750 discarded sentences were longer than 12 words, but they contained dialogues which were shorter than 12 words and useful for the corpus.

A simple regex was used to get all the sentences between double quotes. The regex had to be lazy in case there were multiple dialogues in a sentence.

"(.*?)"

dot is for any character,
asterisk for 0 or more,
and the question mark after the asterisk makes the repetition lazy (non-greedy).

http://www.regular-expressions.info/repeat.html
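In code, pulling the dialogues out of the long sentences is just (long_sentences being the ones above 12 words):

import re

dialogues = []
for sentence in long_sentences:
    # lazy match, so multiple quoted dialogues in one sentence come out separately
    dialogues.extend(re.findall(r'"(.*?)"', sentence))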

Anyway, I added the dialogues from the sentences which were greater than length 12.

This added a lot of good sentences as well as bad ones, because Punkt screwed up a lot of the dialogues; not all of them made sense or were actually complete dialogues from the original text.

I started ignoring sentences which didn't begin with an uppercase character. This helped a lot, as all those partial sentences were now skipped.

Next, some sentences were too archaic or too contextual. I only wanted to pick sentences which made sense out of context too.

I had a list of the top 2000 words from modern English fiction, taken from Wikipedia. So for each sentence I tokenized it into words and removed all the stopwords using set subtraction (pretty convenient; also, not in is faster on sets than on lists).

Later on I took the resulting sets and subtracted the top-2000 word set from them too. I just love the set subtraction hack. I am not really sure it is the most efficient approach, but sets are faster than lists for membership checks because they are hash tables. Anyway, after the subtraction, if all the words in my sentence were also in the set of popular words, the resulting set would be empty and, according to the algorithm, the sentence was relevant and free of context.

This check worked wonderfully, but I ended up with only 50 sentences out of the 900. Around 5% is very low.

I tried to make the algorithm less stringent and allowed one word not from the list. This gave me 99 sentences, though I did lose a little quality.
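Put together, the whole check is only a few lines; a sketch, where top2000 and my_stopwords are the word sets described above and word_tokenize again stands in for my own tokenizer:

def is_context_free(sentence, top2000, my_stopwords, tolerance=1):
    words = set(w.lower() for w in word_tokenize(sentence))
    # whatever is left after removing stopwords and popular words is "unfamiliar"
    unknown = words - my_stopwords - top2000
    # tolerance=0 is the strict version (50 sentences); 1 allows a single
    # unfamiliar word and gave 99 sentences, at a small cost in quality
    return len(unknown) <= tolerance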

Of all the features discussed so far, most can be used for other languages, except that I would need top-2000 (or similar) wordlists for each. Some tweaks would also differ, such as the sentence length limits.

Also, contractions like they've get tokenized as the single token they've. And obviously, they've isn't on the top-2000 list, so a good sentence gets thrown out. I need to do something about this.

Possible solutions: I could add these contractions to the list. But I don't want to edit the lists manually, because that moves away from the automated nature of the script and would require such list curation for each language; I want my script to be more or less language-agnostic.

I could use edit distance while matching and parametrize on the Levenshtein distance.

THEY and THEYVE (after removing the punctuation) differ by two insertions, hence a distance of 2. But this would take time.
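The distance itself is easy enough to compute; a standard dynamic-programming sketch:

def levenshtein(a, b):
    # minimum number of single-character edits turning a into b
    prev = range(len(b) + 1)
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# levenshtein('they', 'theyve') == 2, so a threshold of 2 would keep the match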

Let's see.

The other thing is that I could save a lot on comparisons if I applied a morphological stemmer to the words before comparing. But again, I don't want to go in this direction, as not all languages have strong morphological stemmers.

Maybe I could use the existing Tatoeba corpus to build the wordlist. But then this would end up being circular, and I wouldn't be reaching for vocabulary which is missing from the corpus but should be there.

I'll update later on. This has been a long post.

 


Thursday 12 June 2014

Hello Coding Conventions. Using Pylint and pep8.

I feel like I lost my innocence. After ignoring my mentor's demand to make my code pep8 compliant for days, I finally buckled down.

I installed SublimeLinter, which acts as an interface to the actual linters for different programming languages.

Note : If you are using pylint from sublime for the first time, you need to have pylint installed on your machine.

pip install pylint

Only then will the sublimelinter-pylint interface work (duh).

Or else you would get this,
pylint deactivated, cannot locate 'pylint@python'


What exactly is Pylinter?
Pylint is a Python tool that checks a module for coding standards. According to the TurboGears project coding guidelines, PEP8 is the standard and pylint is a good mechanical test to help us in attaining that goal.
The checks range from Python errors, missing docstrings, unused imports and unintended redefinition of built-ins, to bad naming and more.

Yes. Bad naming. And what is demotivating is that it gives you a score. So I quickly ran pylint on one of my scripts.

Here are the results.

Yes. I was given a negative score: -1.05/10.

Most of my variables have invalid names. What exactly is an invalid name?
Well, most of my variables were outside of any function or class (it was a quickly written script) and hence global, and by convention global, module-level names should be uppercase constants.
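For example, something like this at module level gets flagged by the invalid-name check (a made-up snippet, not from my script):

threshold = 3540      # pylint: invalid constant name (C0103); module-level names
                      # are expected to be UPPERCASE constants

THRESHOLD = 3540      # this one passes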

So I am supposed to wrap it all up in functions, which is fine because it makes the code reusable. But not today.

There are other things too which I never really noticed before, like trailing whitespace and spaces after commas. This is going to be a rough ride.

Let's see if my scores get better.