Archive for September, 2008

I’ve been building database applications for a long time, and building in particular on MySQL since 1998 now.  On MySQL I’ve built some pretty gnarly apps, with dozens of tables, millions of rows and nasty joins. Some, like the one for the education foundation do dozens of queries to build each webpage (pretty typical for a blog), and the internal accounting apps for the ed fund directly generates RTF for parts of the auction catalog from complex four and five table joins, some with nested subqueries. All of this runs quickly, with the most complicated queries taking longer to transfer the result from the server via HTTP than it does to run the query to completion.

So I’m shocked, shocked that queries I want to perform on my CrawlAnalysis database take hours, and in some cases days to run to completion.

I was investigating why this one particular query took so incredibly long (over 6 days) to run to completion:

SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM `crawl-20060820` AS ll
LEFT JOIN `sun-crawl-20060928172854` as rr ON rr.URL = ll.URL
WHERE (
ll.ReversedHost LIKE 'com.sun.%'
OR ll.ReversedHost LIKE 'net.java.%'
)
AND rr.URL IS NULL
AND ll.Status = 'DB_fetched'
GROUP BY ll.URL;

and found a reference High Performance MySQL (which I bought at the bookstore today) that explains in an appropriately named section called Stupid Query Tricks that under unfavorable circumstances doing an OR in a WHERE clause can force MySQL to rowscan the table.

In other words, MySQL is stupider than a dead cat.

The book suggests rewriting the query I have above as a UNION instead:

(
SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM `crawl-20060820` AS ll
LEFT JOIN `sun-crawl-20060928172854` as rr ON rr.URL = ll.URL
WHERE ll.ReversedHost LIKE 'com.sun.%'
AND rr.URL IS NULL
AND ll.Status = 'DB_fetched'
GROUP BY ll.URL
) UNION (
SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM `crawl-20060820` AS ll
LEFT JOIN `sun-crawl-20060928172854` as rr ON rr.URL = ll.URL
WHERE ll.ReversedHost LIKE 'net.java.%'
AND rr.URL IS NULL
AND ll.Status = 'DB_fetched'
GROUP BY ll.URL )

Thereby doubling the number of lines of code in the query and making it sooo much clearer.

Interpretation: the book is suggesting that because the optimizer is too dumb to properly optimize the query, you do it yourself, in a database-specific way.  Cthulu knows what you’ll get for a performance result when you run this query against a different SQL database. But that doesn’t matter anyway, because the ANSI SQL standard is mostly ignored in the breach by application and database developers alike.

Which begs the question of WTF is the MySQL query optimizer doing? Is it really a query pessimizer?  And what’s with the anecdotal story in the same chapter (page 91) where in the case of one particular join query the optimizer took 30 times as long to run as it did to actually execute the query?  Excuse me?

This whole thing with some (but only some) MySQL queries taking forever is pissing me off. It’s like we’ve gone back to bearskins and flint knives (aka toggling in your program on the front panel switches).  And, frankly, I already know from past experience that the alternative suspects (Oracle, Sybase, Postgres, Ingres, Illustra …) are no better – only different.

I was confused. Before today, I thought that the point of a high-performance and excellent modern database is that you don’t have to spend all your time doing a SQL EXPLAIN to figure out why the database is stupider than a dead cat.


Damn! Again! This query returns 650,000 rows, including ones that are NOT in the date range I’m specifying.

SELECT `URL`
 FROM `SquidAccessLog-20061006`
 WHERE `ResultCode` LIKE '%MISS'
   AND `TimeStamp` BETWEEN '2006-10-06 12:80:00'
                       AND '2006-10-06 15:00:00'
 GROUP BY `URL`;

But this query (which should be functionally equivalent) returns the right stuff (less than 10,000 rows)

SELECT *
FROM `SquidAccessLog-20061006`
WHERE `ResultCode` LIKE '%MISS'
   AND `TimeStamp` >= '2006-10-06 12:80:00'
   AND `TimeStamp` <= '2006-10-06 15:00:00'
GROUP BY `URL` ;

Actually, given my experience here it’s probably the case that I should rewrite

SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM `crawl-20060820` AS ll
LEFT JOIN `sun-crawl-20060928172854` as rr ON rr.URL = ll.URL
WHERE (
ll.ReversedHost LIKE 'com.sun.%'
OR ll.ReversedHost LIKE 'net.java.%'
)
AND rr.URL IS NULL
AND ll.Status = 'DB_fetched'
GROUP BY ll.URL;

as

CREATE TEMPORARY TABLE cs
SELECT *
FROM `crawl-20060820`
WHERE ReversedHost LIKE 'com.sun.%'
AND `Status` = 'DB_fetched' ;

CREATE TEMPORARY TABLE cj
SELECT *
FROM `crawl-20060820`
WHERE ReversedHost LIKE 'net.java.%'
AND `Status` = 'DB_fetched';

CREATE TEMPORARY TABLE ss
SELECT *
FROM `sun-crawl-20060928172854`
WHERE ReversedHost LIKE 'com.sun.%'
AND `Status` = 'DB_fetched' ;

CREATE TEMPORARY TABLE sj
SELECT *
FROM `sun-crawl-20060928172854`
WHERE ReversedHost LIKE 'net.java.%'
AND `Status` = 'DB_fetched' ;

CREATE TABLE `not-in-supplemental-20060820` ( SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM ss as ll
LEFT JOIN cs as rr ON rr.URL = ll.URL
WHERE rr.URL IS NULL
) UNION (
SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM sj as ll
LEFT JOIN cj as rr ON rr.URL = ll.URL
WHERE rr.URL IS NULL ) ;

Post to Twitter Post to Delicious Post to Digg Post to Facebook

Comments Off.