Usuario:RBSpamAnalyzerBot

From Biquipedia
This user has the bot flag.

Overview

This bot will post external link analysis, find probable spambot-created pages, and eventually tag them for speedy deletion. It will also generate a set of statistics that can be used by the community to determine whether some pages are being used as spam carriers.

The bot runs once per database dump. In the case of this Wikipedia, I expect it to run once every 2 weeks.
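
Running once per dump implies remembering which dump was processed last. A minimal sketch of such a check, assuming dump dates are yyyymmdd strings and the last processed date is stored on the first line of a plain text file; the function name `is_new_dump` and the state file are hypothetical and not taken from the bot's scripts:

```shell
# Hypothetical helper: succeeds (exit status 0) when the available dump
# is newer than the last one processed.
# Arguments: available dump date (yyyymmdd), path to the state file.
is_new_dump() {
    local available="$1"
    local state_file="$2"
    local last=0
    # If the state file exists and is not empty, read the stored date.
    if [ -s "$state_file" ]; then
        last="$(head -n 1 "$state_file")"
    fi
    # yyyymmdd dates can be compared as plain integers.
    [ "$available" -gt "$last" ]
}
```

A caller could then wrap the rest of the run in `if is_new_dump "$latest" last_dump.txt; then ...; fi`, where `latest` and `last_dump.txt` are likewise illustrative names.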

Tasks

The bot itself is composed of a set of bash shell script files, each doing a single task:

  • review.sh: The "bot" itself. The script just calls each of the following scripts in order, handling any problem they may have.
  • download.sh: Checks download.wikimedia.org for new database dumps by comparing the available ones with the last one it processed. If new ones are found, it can generate a list of URLs for page.sql.gz and externallinks.sql.gz, to be downloaded via wget.
  • process.sh: Executes the queries from page.sql.gz and externallinks.sql.gz in a local database, then executes several custom-made queries to gather statistics:
    SELECT COUNT(el_from) AS total, el_from, page_title
    FROM externallinks, page
    WHERE externallinks.el_from = page_id AND page_is_redirect = 0 AND page_namespace = 0
    GROUP BY el_from
    ORDER BY total DESC;
    Generates a list of articles sorted by the number of external links each one has.
    SELECT COUNT(el_to) AS total, SUBSTRING_INDEX(el_to, '/', 3) AS search
    FROM externallinks, page
    WHERE page_id = el_from AND page_namespace = 0
    GROUP BY search
    ORDER BY total DESC;
    Generates a list of externally linked sites (each URL cut down to its scheme and host), sorted by the number of links in descending order.
    SELECT page_id, page_title, page_namespace
    FROM page
    WHERE page_title LIKE '%index.php%'
    OR page_title LIKE '%/wiki/%'
    OR page_title LIKE '%/w/%'
    OR page_title LIKE '%/';
    Generates a list of pages with titles containing one of several patterns used by malfunctioning bots, like /wiki/, /w/, or ending with /.
    After executing the queries, the script trims each resulting list to a fixed number of items, so that the generated pages do not become too big. If a resulting listing has more than 500 items, the bot stops, because the dump result must then be analyzed manually.
  • upload.sh: This script handles the communication between the bot and the Wikipedia project. It logs the bot in and uploads the generated listings to a set location, currently User:ReyBrujo/Dumps. First, the script determines whether there is a current dump and, if so, archives it at User:ReyBrujo/Dumps/Archive. Then it uploads the listings and the dump page, using the following format:
    User:ReyBrujo/Dumps/yyyymmdd where yyyymmdd is the database dump date (and not the processing date)
    User:ReyBrujo/Dumps/yyyymmdd/Sites linked more than xxx times where xxx is usually 10 in the case of this Wikipedia
    User:ReyBrujo/Dumps/yyyymmdd/Sites linked between xxx and yyy times where xxx and yyy are delimiters when a single listing would have over 500 items.
    User:ReyBrujo/Dumps/yyyymmdd/Articles with more than xxx external links where xxx is usually 10.
    User:ReyBrujo/Dumps/yyyymmdd/Articles with between xxx and yyy external links where xxx and yyy are delimiters when a single listing would have over 500 items.
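
The page naming scheme and the 500-item limit described above can be sketched as follows. This is only an illustration under assumptions: the function names `sites_listing_title` and `listing_fits`, the `MAX_ITEMS` variable, and the one-item-per-line file format are hypothetical and not taken from the bot's scripts.

```shell
# Maximum number of items per listing page, as stated in the description.
MAX_ITEMS=500

# Hypothetical helper: print the title of a "sites" listing page.
# Arguments: dump date (yyyymmdd), lower bound, optional upper bound.
sites_listing_title() {
    local dump_date="$1"
    local low="$2"
    local high="${3:-}"
    if [ -n "$high" ]; then
        # Two bounds: the listing was split into ranges.
        printf 'User:ReyBrujo/Dumps/%s/Sites linked between %s and %s times\n' \
            "$dump_date" "$low" "$high"
    else
        # One bound: a single open-ended listing.
        printf 'User:ReyBrujo/Dumps/%s/Sites linked more than %s times\n' \
            "$dump_date" "$low"
    fi
}

# Hypothetical helper: succeeds only if the listing file (assumed to hold
# one item per line) has at most MAX_ITEMS lines.
listing_fits() {
    local file="$1"
    local count
    # Strip any padding that some wc implementations add around the number.
    count="$(wc -l < "$file" | tr -d '[:space:]')"
    [ "$count" -le "$MAX_ITEMS" ]
}
```

For example, `sites_listing_title 20240101 10` prints `User:ReyBrujo/Dumps/20240101/Sites linked more than 10 times`, and a script could stop with an error when `listing_fits` fails for any generated listing.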

Finally, the bot will also edit a global page, currently found at meta:User:ReyBrujo/Dump statistics table, to update the statistics on that page. Permission for the bot to run there will be requested once the bot has been approved in the individual Wikipedias.