Statement: The methods described in this article are for learning and communication purposes only.
The school has issued a competition notice calling on students to participate. This competition is organized by a large financial institution in China, and the preliminary competition system is clear: each person has three opportunities to take the online exam, and ultimately ranks according to their highest score. The test papers are all choice questions and provide free practice. Free practice scores are not counted towards the final grade.
It can be determined that each time you enter the practice interface, the system will automatically extract 10 questions from the question bank for practice. After several refreshes, I believed that the total number of question banks was not large because duplicate questions had already appeared within a relatively small number of refreshes. Therefore, my initial idea is that perhaps the question bank can be cracked.
Using the developer tool of Chrome browser, I found a key JavaScript file. Due to limited abilities, I am unable to proceed with subsequent work, but through exploration I can confirm that the idea is correct and feasible. Subsequently, I downloaded and used Fiddler, a free packet capturing tool, to retrieve the exercise page. At this point, the various files became clear, and I effortlessly discovered the JSON file and its URL for this free practice.
The format of the JSON file is as below:
1 | ... |
Among them, “id” is the question number or option number, “questionTitle” and “optionTitle” are the content of the question and option, respectively, and “questionSolution” is the answer option to the question. I don’t know if there is a way to directly obtain all the question banks, but next I just need to program in familiar computer languages such as Python, repeatedly request the URL, and finally organize and remove duplicates to obtain approximately all the question banks.
However, there is another issue that needs to be addressed when programming crawlers. Because when using the browser to refresh the free practice, the website has already recorded my account and I can directly request the URL. But in a crawler program in Python, the first step is to write an appropriate program to fill in the username and password. More complex is that the website does not have a fixed password, and each login requires a verification code to be sent through the phone number. These obstacles to some extent prevent question banks from being easily exploded.
I did not continue learning how to log in to my account in a crawler program. Without systematic learning, identifying web elements is my weaknesses. My alternative is to write an Tampermonkey script, directly request the URL in Chrome, and download the obtained information locally.
1 | // ==UserScript== |
This script adds a timestamp to the generate file name function so that the local files have unique file names. In addition, the script downloads all the text on the page, as data cleaning can be done in a more familiar language after the download is completed.
We can efficiently improve our grades with question banks. Of course, if the formal competition and free practice share the same question extracting system (which may indeed be the case), the story narrative can be more straightforward.