Disclaimer : Didn't bypass reCaptcha. But webmaster's own system of captcha. Phew.
Ethics
I am going to put this here, in case you do selective reading and would skip over this part at the end.
Please be considerate and try not to cause harm.
I didn't want to jeopardize the poll, even though I did have a favorite (#1) and used that particular vote throughout my testing. (I later told the webmaster to adjust the votes accordingly, but the decision had already been made: #2 was winning by a margin. And thank god it did. T-shirt #1 was a terrible choice, and terrible decisions are not to be judged.) Also, I only voted every 5 seconds, so as not to increase the load on the server.
Please use reCaptcha.
Okay.
Every year, before my university holds its cultural fest, Synapse, the students are polled on which design becomes the official T-shirt representing the festival.
This year, instead of sending a simple Google Form through the college webmail, the student webmaster decided to build his own form into the official Synapse website.
Was that decision made so that people from outside could also vote? Or to integrate the whole Synapse experience in one place? I don't know.
But I have been away from college for my last semester, on an internship, and decided to have some fun.
The form fields are:
- Preference
- What is 100x10
I have been doing a lot of scraping work lately, at work and at home, and wanted to see if I could get through this and get some practice.
At least the student had decided to fight spam and added his own version of a Captcha.
After refreshing the page a few times, I could see the pattern in the questions.
They were either of the type:
Type xyz or What is x + y?
These captchas could be solved easily on the go by the script.
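Both shapes are mechanical enough that a tiny solver can answer them on the fly. Here is a sketch of that idea (the function name and the integer-division assumption are my own, not the post's actual script):

```python
import re

def solve_captcha(question):
    """Answer the two captcha shapes: 'Type xyz' and 'What is x + y?'."""
    q = question.strip()
    if q.lower().startswith('type '):
        return q.split(None, 1)[1]           # just echo the word back
    m = re.search(r'(\d+)\s*([+x/-])\s*(\d+)', q)
    if m:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        return str({'+': a + b, '-': a - b,
                    'x': a * b, '/': a // b}[op])  # assuming integer division
    return None  # unknown shape: fall back to a hand-written answer
```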
To get a fair idea of different types of questions, I wrote a scraper to scrape the questions and save them in a file. This part of the script was anyway going to be helpful later on when I would be solving the Captcha.
```python
url = 'http://synapse.daiict.ac.in/poll.php'
print 'Making request'
req = requests.session()
try:
    text = req.get(url, timeout=5).text.encode('utf-8')
except:  # Too lazy to add Exception. Sin.
    print 'Request Timeout'
    return False
# print text
text = text.replace('\t', '')
starter = '<option value="6">6</option>'
ender = '<input id="name" type="text" name="spam"'
text = text[text.find(starter):text.find(ender)]
text = text[text.find('<span>') + 6:text.find('</span>')]
# print text
question = text.strip().strip('\n')
print question
```
I did write a regular expression initially, but it was failing for some reason and I didn't want to waste time debugging it. I could have used Beautiful Soup too, but I just wanted to get it over with quickly.
Hence the text.find hack. As long as you get your work done.
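For the record, the regex route could have been short. Something like this sketch (assuming the same `<span>` wrapper the text.find code targets) should do it:

```python
import re

# Grab the question text inside the <span> wrapper, whitespace trimmed.
QUESTION_RE = re.compile(r'<span>\s*(.*?)\s*</span>', re.S)

def find_question(html):
    m = QUESTION_RE.search(html)
    return m.group(1) if m else None
```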
The questions:
Name of the state you are in (in lowercase)?
What is when 19+3?
What is when 100x10?
What is when 50x10?
What is when 9+3?
what is the first letter of your college's name?
What is when 9-3?
Type linux
Name of this planet (in lowercase)?
Name of our planet (in lowercase)?
what comes before b?
what comes after a?
What is when 500/10?
Type pink
What is when 500x10?
What is when 10x10?
What is when 50/10?
So, the website wasn't generating questions on the fly; it had a pool of some 16-17 questions from which it picked randomly.
Bleh. I opened a CSV and manually wrote down the answers to all 16 questions.
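Reading that CSV back into a lookup dict is then trivial. A sketch (the `question,answer` row layout and the function name are my assumptions):

```python
import csv

def load_answers(path):
    """Map each scraped question to its hand-written answer."""
    with open(path, newline='') as f:
        return {row[0].strip(): row[1].strip()
                for row in csv.reader(f) if len(row) >= 2}
```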
The POST call
Now I just needed to send the appropriate POST data.
To see which fields were being sent and to what URL, I used Firefox's console.
So the fields that were being sent were:
- prefempty
- option
- pref2
- spam
- pref
Option represented my vote and spam was the answer to the captcha.
All I needed to do was a POST call.
```python
url = 'http://synapse.daiict.ac.in/poll.php'
payload = {'option': option,
           'spam': answer,
           'pref': 'vote',
           'pref2': 'http://',
           'prefempty': ''}
try:
    r = requests.post(url, data=payload, timeout=5)
except:
    print 'Post time out'
    return False
```
Done! Done! Done!
But wait.
How do I check whether the post was successful or not?
Fiddler. This is a tool I recently used at work. It lets you monitor all the POST and GET requests made by your computer, not just those from your browser, as the console does.
Every successful vote returned a page which had 'Vote has been registered' in it.
So, Fiddler lets you check the response to your request, and I couldn't find the word 'registered' in the response. Hence, my votes were not going through.
Could it be that it was blocking the request because it didn't have appropriate headers? Or something to do with cookies?
So, a good way to go about it was to replicate a normal browser as much as I could.
Instead of making the GET call and the POST call as separate requests, I made them from the same request session.
Why did it work?
In retrospect, every time I made a GET request, the server started a session for me and picked a question-answer pair from its pool. When I then made a POST call that was not part of the same session, a new session was created for me, with a question and answer I didn't know; I was simply sending the answer from the previous session, and hence failing the Captcha. My chances of passing it were about 1/16. It was like guessing.
Making both calls from the same session ensured my POST answered the very question scraped in that session.
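Put together, the fixed flow looks roughly like this (a sketch of the idea, not the original script; `extract_question`, `vote_succeeded`, and the `answers` dict are my own names):

```python
import requests

URL = 'http://synapse.daiict.ac.in/poll.php'

def extract_question(html):
    # Same <span> trick as the scraper.
    return html[html.find('<span>') + 6:html.find('</span>')].strip()

def vote_succeeded(html):
    # Every successful vote returned a page containing this phrase.
    return 'Vote has been registered' in html

def cast_vote(option, answers):
    # One Session carries the server's session cookie from the GET
    # (which fixes the question) into the POST (which answers it).
    s = requests.Session()
    question = extract_question(s.get(URL, timeout=5).text)
    payload = {'option': option, 'spam': answers[question],
               'pref': 'vote', 'pref2': 'http://', 'prefempty': ''}
    return vote_succeeded(s.post(URL, data=payload, timeout=5).text)
```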
So, this worked correctly. Yay :D
So, I ran the script to cast 1200 votes, voting equally for all the shirts. Fun fact: my college's population is around that number!
And the webmaster intervened...
Somewhere around 2:30 AM, I wanted to check the count, feel a little good about myself, and go to sleep. But the script was failing the captcha.
On first glance, I could see that the questions were still the same. Why was the script failing?
It turned out the webmaster had, of course, noticed the votes and decided to change the questions but keep the answers the same!
Lazy webmaster, very lazy. Can't blame him, it was too late in the night.
So he had added one or two more questions and slightly edited others. But mostly, he had batch-edited the maths questions from:
What is X + Y? to What happens when X + Y?
I chuckled way too hard for that time of the night. So I batch-edited my questions in the CSV too, and we were good to go! This was fun, it was turning into a 1v1. And it was harmless.
By 3 AM I was done with the changes, but what if he was still up and changed the questions again? Should I wait a little longer and then attack again? But I was really sleepy.
So, I started the script with a delay,
```python
time.sleep(1800)  # wait half an hour before resuming
```
and both of us went to sleep together. Aww. :|
Next morning, I called up the webmaster, told him what I was doing, and told him not to fret about the poll results because I had voted equally. He was a good sport and took it well. And we decided to keep playing the game: he wouldn't Google anything and would still figure out a way to stop my script.
He eventually added a random number of empty fields on each request, and multiple forms of which only one was active. Had I encountered this first, I wouldn't have tried to bypass it at all. It was a good trick. Dirty, but effective!
Now, this was a good hack: I would have had to parse the HTML and send a POST request to all the forms, making sure at least one was correct. The empty fields could also be taken from the forms. But it was a lot of grunt work and I didn't want to pursue it. Also because I was at work.
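Had I pursued it, the standard library's html.parser would have covered most of that grunt work: one pass collects every form with its field names (decoys and empty fields included), and the script could then POST to each form in turn, knowing at least one submission hits the active one. A sketch of the idea, not code I actually wrote:

```python
from html.parser import HTMLParser

class FormScraper(HTMLParser):
    """Collect every <form> and the names of its <input> fields."""
    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.forms.append({'action': attrs.get('action', ''),
                               'fields': []})
        elif tag == 'input' and self.forms:
            self.forms[-1]['fields'].append(attrs.get('name'))
```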
But all in all, it was educational for both parties. Good sport, webmaster.
Between doing all this and deciding to blog about it, I finally used Selenium for one of my projects. In retrospect, Selenium would have easily handled the multiple forms and hidden fields.
What should the webmaster have used?
One should simply use reCaptcha, and that is what I had suggested, but the webmaster didn't want to rely on external tools, which made no sense to me at all. Yes, it is ugly, spoils the UX, and could deter people from voting, but each person would at least vote only once, and that is what we wanted. The poll was a 2-day affair, but the festival's main registration should definitely have reCaptcha. Well, I haven't checked whether it has been updated.
Tools used: Python (Requests), Fiddler