User:Topbanana/Reports/Scripts/Create Link Analysis Database
|
Contents |
Overview
The downloadable database dumps (http://download.wikimedia.org/) of Wikipedia only contain details about current articles and possibly their history. The script below analyses the text of articles available in these database dumps to create a number of tables describing the Wikipedia:wikilinks they contain.
The links tables can then be used to carry out further analyses such as identifying Wikipedia:Orphaned articles.
Preparing a local Wikipedia database
Firstly, download and install a copy of the free, open-source MySQL database. Version 4.1 contains all of the features required to run the script below and can be downloaded from here (http://dev.mysql.com/downloads/mysql/4.1.html).
Now download a dump of the Wikipedia database from here (http://download.wikimedia.org/) - the 'cur' downloads contain all the information needed by the script below. Unzip the downloaded file - you should end up with a single file with a name ending in ".sql".
To load the contents of this file into MySQL, you will first need to create and use a database:
- mysql> create database wikipedia;
- Query OK, 1 row affected (0.06 sec)
- mysql> use wikipedia;
- Database changed
Now the downloaded and unzipped database dump can be entered into your database:
- mysql> source c:\20040613_cur_table.sql;
The file will take several hours to load. While waiting, log on to wikipedia and improve some articles.
Create required subscripts
The script below makes use of four other scripts to perform repetetive tasks. You will need to create and save these - instructions on how to do this can be found on each of these pages:
The script below expects to find these in the directory "C:\" - if you save them to a different location, you will need to update your copy of the script accordingly. The lines you will need to modify all begin with the word "SOURCE".
What the script does
Once you have completed these steps script below can be cut and paste into your mysql client window. It will take about 12 hours to run on a modern desktop PC and require about 2 GB of disk space.
The script first copies relevant articles from the 'cur' table (created by the database dump file you downloaded) into a new table 'art'. This cuts down the amount of work the script needs to do later, making it faster. It also ensures that only one article exists with any given title. Articles are marked as being 'stubs', 'redirects', 'disambiguation pages' or, if they are none of these, just plain 'articles'.
Next, the text of articles in the 'art' table is analysed. Sections of articles marked as being 'special' - for example being within nowiki tags - are identified and details stored in the 'special_text' table. Once done, the text of the articles is analysed once more and all wikilinks identified. These are stored in the 'raw_links' table.
The raw links are matched back to article titles where possible. A number of different transformations of the links is carried out - for example removing repeated spaces and some special character codes - to ensure as many links as possible are matched. Links for which a target is found are moved into the 'good_links' table. Unmatched links end up in 'bad_links' and any links which could not be processed - for example because they were badly formed or were interlanguage links - end up in 'unhandled_links'.
Once all this is done, summaries of the number and types of links are calculated. Summaries of good links are stored on the 'art' table. Unmatched links (those in the 'bad_links' table) are summarised to 'bad_link_summary'.
The script
------------------------------------------------------------------------------ -- Script to create WikiPedia reporting database -- ------------------------------------------------------------------------------ -- This script recreates a set of 'links' tables by analysing the text of -- -- articles stored in the 'cur' table. It only deals with articles in the -- -- main 'en' namespace (namespace 0), and so works best with a database -- -- contqaining only the 'en cur' wikipedia database dump -- ------------------------------------------------------------------------------ -- No warranty express or implied. Queries, complaints, and praise to -- -- User_talk:Topbanana please -- ------------------------------------------------------------------------------ -- Create a table containing just the articles we're interested in. This makes -- this slightly more efficient at the expense of disk space and allows us to -- precalculate a number of interesting properties of articles. CREATE TABLE art ( -- Basic article details art_id mediumint(7) unsigned NOT NULL, art_title varchar(255) binary NOT NULL, art_text mediumtext NOT NULL, -- Precalculated details about this article art_len mediumint(6) unsigned NOT NULL, art_is_stub tinyint(1) unsigned NOT NULL, art_is_redirect tinyint(1) unsigned NOT NULL, art_is_disambig tinyint(1) unsigned NOT NULL, art_is_copyvio tinyint(1) unsigned NOT NULL, art_is_list tinyint(1) unsigned NOT NULL, -- Summary counts of links links_from mediumint(6) unsigned NOT NULL default 0, good_links_from mediumint(6) unsigned NOT NULL default 0, bad_links_from mediumint(6) unsigned NOT NULL default 0, unhandled_links_from mediumint(6) unsigned NOT NULL default 0, links_to mediumint(6) unsigned NOT NULL default 0, PRIMARY KEY( art_id ), UNIQUE INDEX( art_title ), INDEX( art_len ), INDEX( art_is_stub ), INDEX( art_is_redirect ), INDEX( art_is_disambig ) ) ENGINE=MyISAM; -- Locking tables before large inserts is much faster LOCK TABLES art WRITE, cur READ; -- It has been the case that the 'cur' table contains a number of articles with -- different id numbers but identical titles. This would appear to be incorrect. -- We need to strip out the bogus records here to avoid getting spurious entries on -- reports later on. The 'REPLACE INTO' syntax works just like 'INSERT INTO' only -- it will overwrite existing records where there is a unique key violation. -- -- This statement should take ~10 mins to run on a desktop PC. REPLACE INTO art ( art_id, art_title, art_text, art_len, art_is_stub, art_is_redirect, art_is_disambig, art_is_copyvio, art_is_list ) SELECT cur_id, cur_title, cur_text, Length( cur_text ), if( locate( 'stub}}', cur_text ) > 0, 1, 0), cur_is_redirect, if( locate( '{{msg:disambig}}', cur_text ) > 0, 1, 0), if( locate( 'copyvio', cur_text ) > 0, 1, 0), if( cur_title LIKE 'List_of_%' OR cur_title LIKE 'Table_of_%', 1, 0 ) FROM cur WHERE cur_namespace = 0; -- Alternatives for the stub and disambig messages UPDATE art SET art_is_disambig = 1 WHERE locate( '{{disambig}}', art_text ) > 0; UPDATE art SET art_is_disambig = 1 WHERE art_title LIKE '% (disambiguation)'; UPDATE art SET art_is_disambig = 1 WHERE art_title LIKE '% (disambig)'; UNLOCK TABLES; -- Okay, on to links. This is again a multi-stage process. -- -- Links tables are: -- raw_links - links we are processing -- unhandled_links - links we are unable to process automatically -- good_links - links we have matched to a target article -- bad_links - links we can't find a target article for CREATE TABLE raw_links ( pos mediumint(5) unsigned NOT NULL, from_id mediumint(7) unsigned NOT NULL, to_id mediumint(7) unsigned NOT NULL default 0, lstart mediumint(6) unsigned NOT NULL, lend mediumint(6) unsigned NOT NULL default 0, link varchar(255) binary NOT NULL default '', PRIMARY KEY ( pos, from_id ) ) ENGINE=MyISAM; CREATE TABLE unhandled_links ( pos mediumint(5) unsigned NOT NULL, from_id mediumint(7) unsigned NOT NULL, to_id mediumint(7) unsigned NOT NULL, lstart mediumint(6) unsigned NOT NULL, lend mediumint(6) unsigned NOT NULL, link varchar(255) binary NOT NULL, reason varchar(255) binary NOT NULL, PRIMARY KEY ( pos, from_id ) ) ENGINE=MyISAM; CREATE TABLE good_links ( from_id mediumint(7) unsigned NOT NULL, to_id mediumint(7) unsigned NOT NULL, count smallint(5) NOT NULL, PRIMARY KEY ( from_id, to_id ) ) ENGINE=MyISAM; CREATE TABLE bad_links ( from_id mediumint(7) unsigned NOT NULL, link varchar(255) binary NOT NULL, count smallint(5) NOT NULL ) ENGINE=MyISAM; -- First the position of each link start (indicated by [[ in the article text) -- is noted. This is done using repeated queries and a 'position' column to -- note the number of the link within each page. To the best of my knowledge -- MySQL has no loop constructs available in 4.1 which allow this to be done -- more neatly, but I'm happy to be proven wrong if anyone knows better. -- This statement and the extract_links script should take about 30 minutes -- to run on a desktop PC. LOCK TABLES raw_links WRITE, art READ, raw_links as rl2 READ; INSERT INTO raw_links (pos, from_id, lstart) SELECT 1, art_id, locate( '[[', art_text, 1 ) + 2 FROM art WHERE locate( '[[', art_text, 1 ) > 0; -- Search for the second and subsequent links. This script handles up to 10,000 links -- per article. SOURCE c:\\extract_links.sql UNLOCK TABLES; ANALYZE TABLE raw_links; -- Okay, now find the end of each link ( the next ]] after each [[ ). This'll take -- about 20 minutes on a desktop PC. LOCK TABLES raw_links WRITE, art WRITE; UPDATE raw_links, art SET raw_links.lend = locate( ']]', art.art_text, raw_links.lstart ) WHERE raw_links.from_id = art.art_id; UNLOCK TABLES; -- Now carry out further analysis of text to identify blocks enclosed by important tags CREATE TABLE special_classes ( special_id tinyint(3) unsigned NOT NULL, link varchar(255) binary NOT NULL, PRIMARY KEY ( special_id ) ) ENGINE=MyISAM; INSERT INTO special_classes VALUES ( 0, 'nowiki' ); INSERT INTO special_classes VALUES ( 1, 'math' ); INSERT INTO special_classes VALUES ( 2, 'HTML comment' ); INSERT INTO special_classes VALUES ( 3, 'external link' ); INSERT INTO special_classes VALUES ( 4, 'HTML references' ); CREATE TABLE special_text ( special_id tinyint(3) unsigned NOT NULL, art_id mediumint(7) unsigned NOT NULL, pos smallint(5) unsigned NOT NULL, start mediumint(6) unsigned NOT NULL, end mediumint(6) unsigned NOT NULL default 0, PRIMARY KEY ( special_id, art_id, pos ) ) ENGINE=MyISAM; -- Text inside <nowiki> </nowiki> tag pairs INSERT INTO special_text ( special_id, art_id, pos, start ) SELECT 0, art_id, 1, locate( '<nowiki>', art_text, 1 ) + 8 FROM art WHERE locate( '<nowiki>', art_text, 1 ) > 0; SOURCE c:\\extract_nowiki.sql; UPDATE special_text, art SET special_text.end = locate( '</nowiki>', art.art_text, special_text.start ) WHERE special_text.art_id = art.art_id AND special_id = 0; -- Text inside <math> </math> tag pairs INSERT INTO special_text ( special_id, art_id, pos, start ) SELECT 1, art_id, 1, locate( '<math>', art_text, 1 ) + 6 FROM art WHERE locate( '<math>', art_text, 1 ) > 0; SOURCE c:\\extract_math.sql; UPDATE special_text, art SET special_text.end = locate( '</math>', art.art_text, special_text.start ) WHERE special_text.art_id = art.art_id AND special_id = 1; -- Text inside HTML comments <!-- --> INSERT INTO special_text ( special_id, art_id, pos, start ) SELECT 2, art_id, 1, locate( '<!--', art_text, 1 ) + 4 FROM art WHERE locate( '<!--', art_text, 1 ) > 0; SOURCE c:\\extract_HTML_comment.sql; UPDATE special_text, art SET special_text.end = locate( '-->', art.art_text, special_text.start ) WHERE special_text.art_id = art.art_id AND special_id = 2; -- Unclosed tags run to the end of the article select art_title, special_text.* from art, special_text where art.art_id = special_text.art_id and start > end; update special_text, art set end = art_len where art.art_id = special_text.art_id and start > end; ALTER TABLE special_text ADD INDEX( art_id ); -- Okay, back to the link analysis -- Detect obviously anomalous links and move them into the unhandled links table before -- proceeding LOCK TABLES raw_links WRITE, unhandled_links WRITE, special_text READ; INSERT INTO unhandled_links SELECT *, 'end <= start' FROM raw_links WHERE lend <= lstart; DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); INSERT INTO unhandled_links SELECT *, 'Length > 255' FROM raw_links WHERE ( lend - lstart ) > 255; DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); INSERT INTO unhandled_links SELECT *, 'All in HTML comment' FROM raw_links WHERE EXISTS ( SELECT 1 FROM special_text WHERE special_id = 2 AND from_id = art_id AND lstart > start AND lend <= end ); DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); INSERT INTO unhandled_links SELECT *, 'Starts in HTML comment' FROM raw_links WHERE EXISTS ( SELECT 1 FROM special_text WHERE special_id = 2 AND from_id = art_id AND lstart > start AND lstart <= end ); DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); INSERT INTO unhandled_links SELECT *, 'Ends in HTML comment' FROM raw_links WHERE EXISTS ( SELECT 1 FROM special_text WHERE special_id = 2 AND from_id = art_id AND lend > start AND lend <= end ); DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); INSERT INTO unhandled_links SELECT *, 'All in nowiki' FROM raw_links WHERE EXISTS ( SELECT 1 FROM special_text WHERE special_id = 0 AND from_id = art_id AND lstart > start AND lend <= end ); DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); INSERT INTO unhandled_links SELECT *, 'Starts in nowiki' FROM raw_links WHERE EXISTS ( SELECT 1 FROM special_text WHERE special_id = 0 AND from_id = art_id AND lstart > start AND lstart <= end ); DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); INSERT INTO unhandled_links SELECT *, 'Ends in nowiki' FROM raw_links WHERE EXISTS ( SELECT 1 FROM special_text WHERE special_id = 0 AND from_id = art_id AND lend > start AND lend <= end ); DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); INSERT INTO unhandled_links SELECT *, 'All in math' FROM raw_links WHERE EXISTS ( SELECT 1 FROM special_text WHERE special_id = 1 AND from_id = art_id AND lstart > start AND lend <= end ); DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); INSERT INTO unhandled_links SELECT *, 'Starts in math' FROM raw_links WHERE EXISTS ( SELECT 1 FROM special_text WHERE special_id = 1 AND from_id = art_id AND lstart > start AND lstart <= end ); DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); INSERT INTO unhandled_links SELECT *, 'Ends in math' FROM raw_links WHERE EXISTS ( SELECT 1 FROM special_text WHERE special_id = 1 AND from_id = art_id AND lend > start AND lend <= end ); DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id ); UNLOCK TABLES; -- Okay now extract the text for the target of each link. A number of manipulations -- need to be carried out during this process, including: -- -- Removing any descriptive text ( 'Greeting|Hello' -> 'Greeting' ) -- Trimming leading and trailing spaces ( ' Green ' -> 'Green' ); -- Compounding pairs of spaces ( 'hello there' -> 'hello there' ) -- Converting spaces to underscores -- -- A selection of other manipulations are carried out later on. -- This'll take about 20 minutes on a desktop PC LOCK TABLES raw_links WRITE, art WRITE; UPDATE raw_links, art SET LINK = replace( replace( trim( substring_index( mid( art_text, lstart, lend-lstart), '|', 1 )), ' ', '_' ), '__', '_' ) WHERE from_id = art_id; UNLOCK TABLES; -- Tidy up section links (links to locations within a certain page). Where a link is -- to a specific section within another article, we'll pretend it's a plain link, ie -- we'll transform 'human_being#anatomy' into just 'human_being'. Links starting with -- a hash (#) are to another section within the same article, we'll mark those as -- unhandleable for now. INSERT INTO unhandled_links SELECT *, 'Local section link' from raw_links WHERE link LIKE '#%'; DELETE FROM raw_links WHERE link LIKE '#%'; -- Articles cannot have square brackets in their titles INSERT INTO unhandled_links SELECT *, 'Contains a square bracket' from raw_links WHERE link LIKE '%[%'; DELETE FROM raw_links WHERE link LIKE '%[%'; INSERT INTO unhandled_links SELECT *, 'Contains a square bracket' from raw_links WHERE link LIKE '%]%'; DELETE FROM raw_links WHERE link LIKE '%]%'; -- Wikimedia seems to treat multiple spaces in links as one. We need to carry out -- extra processing to handle the few articles that contain more than two spaces here. UPDATE raw_links SET link = replace( link, '__', '_' ); UPDATE raw_links SET link = replace( link, '__', '_' ); UPDATE raw_links SET link = replace( link, '__', '_' ); -- Capitalize the first letter of each link UPDATE raw_links SET link = concat(upper(left( link, 1 )),mid( link, 2 )); -- The above operations ccould have caused the raw_links table to become quite fragmented. -- Optimising the table here is worth the cost. OPTIMIZE TABLE raw_links; -- Okay, now look for article titles matching the text of each link. This is just a -- first pass at the process, we'll carry out further manipulations of the text of those -- links we can't find a match for later. -- This takes quite a while to run - about 11 hours - on my own database server. No idea -- why, there's no data moving involved and a well-ordered full index on art.art_title. ANALYZE TABLE raw_links; LOCK TABLES raw_links WRITE, art READ; UPDATE raw_links SET to_id = ( select art_id from art where art_title = link ); UNLOCK TABLES; -- As the above should have handled the vast majority of links, we'll create indexes -- now and analyze to ensure stats are up to date ALTER TABLE raw_links ADD INDEX( link(8) ), ADD INDEX( to_id ), ADD INDEX( from_id ); ANALYZE TABLE raw_links; -- The next thing to do is convert character codes in links we couldn't match to thier -- ASCII equivalents. This isn't an exhaustive list by any means, but catches a few -- hundred links we would otherwise miss. UPDATE raw_links SET link = replace( link, '%27', '\'') WHERE to_id = 0 AND locate( '%27', link ) > 0; UPDATE raw_links SET link = replace( link, '%2C', ',') WHERE to_id = 0 AND locate( '%2C', link ) > 0; UPDATE raw_links SET link = replace( link, '%3F', '?') WHERE to_id = 0 AND locate( '%3F', link ) > 0; UPDATE raw_links SET link = replace( link, '%92', '\’') WHERE to_id = 0 AND locate( '%92', link ) > 0; UPDATE raw_links SET link = replace( link, '%28', '(') WHERE to_id = 0 AND locate( '%28', link ) > 0; UPDATE raw_links SET link = replace( link, '%29', ')') WHERE to_id = 0 AND locate( '%29', link ) > 0; UPDATE raw_links, art SET to_id = art_id WHERE link = art_title AND to_id = 0; -- Now remove interwiki links INSERT INTO unhandled_links SELECT *, 'Interwiki' FROM raw_links WHERE link LIKE '__:%' AND to_id = 0; DELETE FROM raw_links WHERE link LIKE '__:%' AND to_id = 0; -- HTML character codes UPDATE raw_links SET link = replace( link, 'Á', 'Á') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'á', 'á') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Â', 'Â') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'â', 'â') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Æ', 'Æ') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'æ', 'æ') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'À', 'À') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'à', 'à') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Å', 'Å') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'å', 'å') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ã', 'Ã') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ã', 'ã') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ä', 'Ä') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ä', 'ä') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ç', 'Ç') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ç', 'ç') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, '÷', '÷') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'É', 'É') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'é', 'é') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ê', 'Ê') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ê', 'ê') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'È', 'È') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'è', 'è') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ð', 'Ð') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ð', 'ð') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ë', 'Ë') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ë', 'ë') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, '½', '½') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, '¼', '¼') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, '¾', '¾') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Í', 'Í') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'í', 'í') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Î', 'Î') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'î', 'î') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ì', 'Ì') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ì', 'ì') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, '¿', '¿') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ï', 'Ï') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ï', 'ï') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ñ', 'Ñ') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ñ', 'ñ') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ó', 'Ó') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ó', 'ó') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ô', 'Ô') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ô', 'ô') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ò', 'Ò') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ò', 'ò') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ø', 'Ø') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ø', 'ø') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Õ', 'Õ') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'õ', 'õ') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ö', 'Ö') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ö', 'ö') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ß', 'ß') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Þ', 'Þ') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'þ', 'þ') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, '×', '×') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ú', 'Ú') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ú', 'ú') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Û', 'Û') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'û', 'û') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ù', 'Ù') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ù', 'ù') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ü', 'Ü') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ü', 'ü') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'Ý', 'Ý') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ý', 'ý') WHERE to_id = 0; UPDATE raw_links SET link = replace( link, 'ÿ', 'ÿ') WHERE to_id = 0; UPDATE raw_links, art SET to_id = art_id WHERE link = art_title AND to_id = 0; -- Show remaining HTML-escaped codes in links -- SELECT distinct mid( link, locate( '&', link ) + 1, locate( ';', link, locate( '&', link ) ) - locate( '&', link ) - 1 ) -- FROM raw_links -- WHERE to_id = 0 -- AND link LIKE '%&%;%' -- ORDER BY 1; -- Show remaining HTML-escaped codes in article titles -- SELECT distinct mid( art_title, locate( '&', art_title ) + 1, locate( ';', art_title, locate( '&', art_title ) ) - locate( '&', art_title ) - 1 ) -- FROM art -- WHERE art_title LIKE '%&%;%' -- ORDER BY 1; -- Mark any links still containing HTML-escaped characters as unhandleable INSERT INTO unhandled_links SELECT *, 'Link contains an HTML character code' FROM raw_links WHERE link like '%&%;%' AND to_id = 0; DELETE FROM raw_links WHERE link like '%&%;%' AND to_id = 0; -- Remaining links containing a colon (:) are probably to another namespace, for example -- 'Wikipedia:What links here'. Mark these as unhandleable for now. INSERT INTO unhandled_links SELECT *, 'Link contains a :' FROM raw_links WHERE link like '%:%' AND to_id = 0; DELETE FROM raw_links WHERE link like '%:%' AND to_id = 0; -- Remaining links containing a hash (#) are probably section links = flatten them UPDATE raw_links SET link = substring_index( link, '#', 1 ) WHERE link LIKE '%#%' and to_id = 0; UPDATE raw_links, art SET to_id = art_id WHERE link = art_title AND to_id = 0; -- Okay, we've matched all we can match. Pick out good and bad links into nice tidy -- tables to make reporting easy. LOCK TABLES good_links WRITE, raw_links READ; INSERT INTO good_links SELECT from_id, to_id, count(*) FROM raw_links WHERE to_id > 0 GROUP BY from_id, to_id; UNLOCK TABLES; ALTER TABLE good_links add INDEX( to_id ); ANALYZE TABLE good_links; LOCK TABLES bad_links WRITE, raw_links READ; INSERT INTO bad_links SELECT from_id, link, count(*) FROM raw_links WHERE to_id = 0 GROUP BY from_id, link; UNLOCK TABLES; ALTER TABLE bad_links ADD INDEX( from_id ), ADD INDEX( link(8) ); ANALYZE TABLE bad_links; -- Now fill in a few summary statistics to aid in the generation of reports ALTER TABLE unhandled_links ADD INDEX( from_id ); ANALYZE TABLE unhandled_links; LOCK TABLES good_links READ, art WRITE; UPDATE art SET art.links_to = ( SELECT count(*) FROM good_links WHERE to_id = art_id ); UPDATE art SET art.good_links_from = ( SELECT count(*) FROM good_links WHERE from_id = art_id ); UNLOCK TABLES; LOCK TABLES bad_links READ, art WRITE; UPDATE art SET art.bad_links_from = ( SELECT count(*) FROM bad_links WHERE from_id = art_id ); UNLOCK TABLES; LOCK TABLES unhandled_links READ, art WRITE; UPDATE art SET art.unhandled_links_from = ( SELECT count(*) FROM unhandled_links WHERE from_id = art_id ); UNLOCK TABLES; LOCK TABLES art WRITE; UPDATE art SET links_from = ( good_links_from + bad_links_from + unhandled_links_from ); UNLOCK TABLES; ALTER TABLE art ADD INDEX( links_from ), ADD INDEX( good_links_from ), ADD INDEX( bad_links_from ), ADD INDEX( unhandled_links_from ), ADD INDEX( links_to ); -- Summary table of bad links CREATE TABLE bad_link_summary ( link varchar(255) binary NOT NULL, count smallint(5) NOT NULL, PRIMARY KEY ( link ), INDEX( count ) ) ENGINE=MyISAM; LOCK TABLES bad_link_summary WRITE, bad_links READ; INSERT INTO bad_link_summary SELECT link, count(*) FROM bad_links GROUP BY link; UNLOCK TABLES; OPTIMIZE TABLE unhandled_links;