Analysis of An Online Question Bank | Aochen (Arsen) Sun's Blog

Statement: The methods described in this article are for learning and communication purposes only.

My school issued a notice encouraging students to enter a competition organized by a large financial institution in China. The rules of the preliminary round are clear: each participant may take the online exam up to three times, and the ranking is based on the highest score. The exam consists entirely of multiple-choice questions, and the site offers free practice sessions whose scores do not count toward the final result.

Each time you enter the practice interface, the system automatically draws 10 questions from the question bank. After several refreshes, I concluded that the question bank was not large, because duplicate questions had already appeared within a relatively small number of refreshes. My initial idea, therefore, was that the question bank could perhaps be cracked.
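The reasoning about duplicates can be made a little more concrete. The sketch below assumes, which the site does not confirm, that each practice session draws 10 distinct questions uniformly at random from a bank of N questions. Under that assumption, the expected number of distinct questions seen after n sessions is N(1 - (1 - 10/N)^n). The function names and the sample bank sizes are hypothetical.

```javascript
// Expected number of distinct questions seen after `sessions` practice
// sessions, ASSUMING each session draws `perSession` distinct questions
// uniformly at random from a bank of `bankSize` questions.
function expectedDistinct(bankSize, sessions, perSession = 10) {
    // A given question is missed by one session with probability 1 - perSession/bankSize
    const missOne = 1 - perSession / bankSize;
    return bankSize * (1 - Math.pow(missOne, sessions));
}

// Expected number of repeated question sightings after the same sessions
function expectedRepeats(bankSize, sessions, perSession = 10) {
    return sessions * perSession - expectedDistinct(bankSize, sessions, perSession);
}

// Compare a few hypothetical bank sizes after 5 refreshes
for (const bankSize of [100, 1000, 10000]) {
    console.log(bankSize, expectedRepeats(bankSize, 5).toFixed(2));
}
```

Under this model, a small bank produces several repeats within a handful of refreshes, while a very large bank produces almost none, which is why early duplicates suggest a bank of modest size.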

Using Chrome's developer tools, I found a key JavaScript file. My skills were not sufficient to analyze it further, but the exploration confirmed that the idea was feasible. I then downloaded Fiddler, a free packet-capture tool, and captured the traffic of the practice page. With that, the individual requests became clear, and I easily found the JSON response for the free practice and its URL.

The format of the JSON file is as follows:

...
{"id":1838,"questionTitle":"Question Content","questionType":1,"questionSolution":"A","etfOptionsDtoList":[{"id":7333,"optionTitle":"Option A","optionIndex":"A","questionId":1838,"createTime":"May 14, 2021 7:46:25 PM","createBy":"sys"},{"id":7334,"optionTitle":"Option B","optionIndex":"B","questionId":1838,"createTime":"May 14, 2021 7:46:25 PM","createBy":"sys"},{"id":7335,"optionTitle":"Option C","optionIndex":"C","questionId":1838,"createTime":"May 14, 2021 7:46:25 PM","createBy":"sys"},{"id":7336,"optionTitle":"Option D","optionIndex":"D","questionId":1838,"createTime":"May 14, 2021 7:46:25 PM","createBy":"sys"}],"questionStage":"1","createTime":"May 14, 2021 7:46:25 PM","createBy":"sys","extend1":"1","extend2":"5","extend4":"etf","difficult":2}
...

Here, “id” is the question or option number, “questionTitle” and “optionTitle” are the text of the question and of each option, and “questionSolution” is the correct option. I do not know whether there is a way to fetch the entire question bank directly, but it is enough to write a program in a familiar language such as Python that requests the URL repeatedly, then merges the results and removes duplicates to obtain approximately the whole question bank.
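The merge-and-deduplicate step might look roughly like the sketch below. It is written in JavaScript to match the userscript in this post rather than Python, and it assumes the question objects have already been gathered into one flat array; `buildAnswerKey` and the output field names are hypothetical. Only the fields shown in the sample above are used, and a solution that does not match a single option index (for example a multi-answer question) simply gets `null` as its text.

```javascript
// Merge question objects from many responses, keeping one entry per question id.
// ASSUMPTION: `questions` is a flat array of objects shaped like the sample
// above (id, questionTitle, questionSolution, etfOptionsDtoList).
function buildAnswerKey(questions) {
    const byId = new Map();
    for (const q of questions) {
        if (byId.has(q.id)) continue; // duplicate from an earlier request
        // Look up the text of the option whose index matches the solution
        const options = q.etfOptionsDtoList || [];
        const correct = options.find(o => o.optionIndex === q.questionSolution);
        byId.set(q.id, {
            title: q.questionTitle,
            solution: q.questionSolution,
            solutionText: correct ? correct.optionTitle : null,
        });
    }
    return byId;
}

// Hypothetical usage: two responses that both contain question 1838
const sampleQuestion = {
    id: 1838,
    questionTitle: 'Question Content',
    questionSolution: 'A',
    etfOptionsDtoList: [
        { optionIndex: 'A', optionTitle: 'Option A' },
        { optionIndex: 'B', optionTitle: 'Option B' },
    ],
};
const answerKey = buildAnswerKey([sampleQuestion, { ...sampleQuestion }]);
console.log(answerKey.size, answerKey.get(1838).solutionText);
```

Keying the map on “id” rather than on the question text avoids treating two differently formatted copies of the same question as distinct.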

However, a crawler raises another problem. When I refresh the free practice in the browser, the website already knows my account, so I can request the URL directly. A Python crawler, by contrast, would first have to log in. To complicate matters, the website has no fixed password: every login requires a verification code sent to the user's phone number. These obstacles make it harder, to some extent, for the question bank to be scraped.

I did not go on to learn how to log in from a crawler. Without systematic study, identifying web page elements is one of my weaknesses. My alternative was to write a Tampermonkey script that runs in Chrome, where I am already logged in, reloads the URL at a fixed interval, and downloads the content of each response to a local file.

// ==UserScript==
// @name         name
// @namespace    https://aochen-sun.github.io/
// @version      1.0
// @description  Automatically refresh and download the body of the page
// @author       Arsen Sun
// @match        https://1234567.com/1234567
// @grant        none
// ==/UserScript==

(function() {
    'use strict';

    // Set refresh interval in milliseconds
    var refreshInterval = 5000; 

    // Generate unique file names
    function generateFileName() {
        var timestamp = new Date().getTime(); // Use timestamp as part of file name
        return 'page_content_' + timestamp + '.txt';
    }

    // Define refresh function
    function refreshPage() {
        location.reload();
    }

    // Define download function
    function downloadContent(content) {
        var blob = new Blob([content], { type: 'text/plain' });
        var url = URL.createObjectURL(blob);
        var fileName = generateFileName();
        var a = document.createElement('a');
        a.href = url;
        a.download = fileName;
        a.style.display = 'none';
        document.body.appendChild(a);
        a.click();
        document.body.removeChild(a);
        URL.revokeObjectURL(url);
    }

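    // Save the page content, then schedule the next refresh.
    // The refresh is scheduled only after the content has been handled,
    // so a slow page is never reloaded before it is saved.
    function saveAndScheduleRefresh() {
        // Add a selector for the page content you need to download
        var contentElement = document.querySelector('body'); // Example: select the entire page body

        if (contentElement) {
            // Get the page content and download it
            downloadContent(contentElement.textContent);
        }

        setTimeout(refreshPage, refreshInterval);
    }

    // A userscript may start after the load event has already fired,
    // so handle both cases
    if (document.readyState === 'complete') {
        saveAndScheduleRefresh();
    } else {
        window.addEventListener('load', saveAndScheduleRefresh);
    }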
})();

The script puts a timestamp in the name returned by generateFileName so that each local file gets a unique name. It downloads all the text on the page without filtering, since data cleaning can be done afterwards in a more familiar language.
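As an illustration of that cleaning step, here is a minimal sketch in JavaScript (the same could be done in Python). The JSON sample earlier in this post does not show how the question objects are wrapped, so the sketch walks the whole parsed structure and keeps any object that has both “questionTitle” and “questionSolution”. The function names and the “data” wrapper in the usage example are hypothetical.

```javascript
// Recursively collect every object that looks like a question from a parsed
// JSON value. The wrapper around the question objects is not known here,
// so this walks the whole structure instead of assuming a fixed path.
function collectQuestions(value, found = []) {
    if (Array.isArray(value)) {
        for (const item of value) collectQuestions(item, found);
    } else if (value !== null && typeof value === 'object') {
        if ('questionTitle' in value && 'questionSolution' in value) {
            found.push(value);
        } else {
            for (const child of Object.values(value)) collectQuestions(child, found);
        }
    }
    return found;
}

// Parse the text of one downloaded file; return [] if it is not valid JSON
// (for example, if a login page was saved instead of the JSON response).
function parseDownloadedText(text) {
    try {
        return collectQuestions(JSON.parse(text.trim()));
    } catch (err) {
        return [];
    }
}

// Hypothetical file content; the "data" wrapper is invented for illustration
const fileText = '{"data":[{"id":1838,"questionTitle":"Question Content","questionSolution":"A"}]}';
console.log(parseDownloadedText(fileText).length);
console.log(parseDownloadedText('<html>login</html>').length);
```

Returning an empty list for unparseable files lets a batch run continue past the occasional bad download instead of stopping at the first one.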

With the question bank in hand, we can raise our scores efficiently. Of course, if the formal competition and the free practice draw from the same question bank (which may well be the case), the story becomes even more straightforward.