<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Barry&#039;s Point Of View</title>
	<atom:link href="http://www.barryspov.com/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.barryspov.com</link>
	<description>rants and mumblings about technology from Barry A Dobyns</description>
	<lastBuildDate>Sun, 04 Jul 2010 18:13:20 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Silly Crontab Tester</title>
		<link>http://www.barryspov.com/?p=148</link>
		<comments>http://www.barryspov.com/?p=148#comments</comments>
		<pubDate>Fri, 18 Jun 2010 16:10:16 +0000</pubDate>
		<dc:creator>Barry A Dobyns</dc:creator>
				<category><![CDATA[Unix]]></category>
		<category><![CDATA[bash]]></category>
		<category><![CDATA[crontab]]></category>

		<guid isPermaLink="false">http://www.barryspov.com/?p=148</guid>
		<description><![CDATA[We all write crontab jobs.   I tend to write really complicated ones, where each line in the crontab has multiple statements, for loops and while loops.  Which makes it moderately hard to figure out if it&#8217;s working.
So I created a simple little test script that reads the crontab, and then runs each line of the [...]]]></description>
			<content:encoded><![CDATA[<p>We all write crontab jobs.   I tend to write really complicated ones, where each line in the crontab has multiple statements, <strong>for</strong> loops and <strong>while</strong> loops.  Which makes it moderately hard to figure out if it&#8217;s working.</p>
<p>So I created a simple little test script that reads the crontab, and then runs each line of the crontab separately, showing you what happens inside.   Could you have written this yourself in less time than it took you to google for and find this post?  Sure.  But sometimes it&#8217;s too early in the morning, the coffee maker is broken, and you want to just grab a working script.</p>
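<pre class="brush: bash; light: true;">
#!/bin/bash
# $Id: crontab-test.sh 2771 2010-05-28 18:33:22Z bdobyns $
# Run each command in a crontab file under &quot;bash -ex&quot; so you can watch it work.

if [ -z &quot;$1&quot; ] ; then
	echo &quot;usage: $0 crontabname&quot;
	exit 1
fi

# basename keeps any slashes in $0 out of the temp file names
PREFIX=/tmp/$(basename &quot;$0&quot;).$$
idx=0

# crontab fields: minute hour day-of-month month day-of-week command
grep -v '^#' &quot;$1&quot; | grep -v '^MAILTO' | while read -r min hour dom mon dow cmd
do
	F=$PREFIX.$idx
	printf '%s' &quot;$cmd&quot; &gt;&quot;$F&quot;
	if [ -s &quot;$F&quot; ] ; then
		# stdin from /dev/null so the job cannot swallow the remaining crontab lines
		bash -ex &quot;$F&quot; &lt;/dev/null
	fi
	rm -f &quot;$F&quot;
	idx=$(( idx + 1 ))
done
</pre>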
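<p>The part of the script that does the real work is the <strong>read</strong> in the <strong>while</strong> loop: the first five words of each crontab line land in the schedule variables, and everything that is left, spaces and semicolons included, lands in the last variable.  Here is a minimal sketch of just that step.  The helper name <strong>split_cron_line</strong> and the sample crontab line are made up for the illustration; they are not part of the script above.</p>

```shell
#!/bin/bash
# Sketch: show how "read" splits one crontab line into fields.
# split_cron_line is a hypothetical helper used only for this example.
split_cron_line () {
    local minute hour dom month dow cmd
    # the here-document feeds the single line to read;
    # -r keeps backslashes in the command as they were written
    read -r minute hour dom month dow cmd <<EOF
$1
EOF
    # the five schedule fields are dropped; only the command is printed
    printf '%s\n' "$cmd"
}

# prints: for f in /tmp/a /tmp/b; do echo "$f"; done
split_cron_line '*/5 * * * * for f in /tmp/a /tmp/b; do echo "$f"; done'
```

<p>One thing a test like this cannot show you: cron itself treats an unescaped % in the command as a newline, and bash does not, so a line that uses % may behave differently under cron than it does here.</p>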
]]></content:encoded>
			<wfw:commentRss>http://www.barryspov.com/?feed=rss2&amp;p=148</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Printing in an all-Mac world to Windows-ready printers</title>
		<link>http://www.barryspov.com/?p=138</link>
		<comments>http://www.barryspov.com/?p=138#comments</comments>
		<pubDate>Wed, 16 Jun 2010 05:21:21 +0000</pubDate>
		<dc:creator>Barry A Dobyns</dc:creator>
				<category><![CDATA[Apple OS X]]></category>
		<category><![CDATA[Journal Notes]]></category>
		<category><![CDATA[Printing]]></category>

		<guid isPermaLink="false">http://www.barryspov.com/?p=138</guid>
		<description><![CDATA[It&#8217;s no secret that I recently upgraded my life to an Apple MacBook.   The older daughter upgraded last year to an Apple MacBook.  The younger daughter upgrades this week to an Apple MacBook Pro (she needed FireWire for video) which probably means I can retire the old G4 tower.  This meant I needed to finally [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s no secret that I recently upgraded my life to an <a href="http://apple.com/macbook">Apple MacBook</a>.   The older daughter upgraded last year to an <a href="http://apple.com/macbook">Apple MacBook</a>.  The younger daughter upgrades this week to an <a href="http://apple.com/macbookpro">Apple MacBook Pro</a> (she needed FireWire for video) which probably means I can retire the old G4 tower.  This meant I needed to finally sort out how to print to the networked Windows-only printers that I have in the house.  Or at least, they were Windows-only when I bought them.</p>
<p>You may have noticed that printing is always a challenge.  This is because printer makers are cheap bastards.  They are always looking for ways to squeeze cost out of the printer &#8211; and this usually means pushing functionality back into the software driver.  It&#8217;s a rare modern printer that&#8217;s even as capable as the old original <a href="http://en.wikipedia.org/wiki/LaserWriter">Apple LaserWriter</a>.</p>
<p>Both printers I own now have unnecessarily complicated drivers on Windows (and Mac too), partly because most of the rendering engine for the printer lives in the driver, not in the printer where it belongs.  What does unnecessarily complicated mean?  There are parts of the printer drivers that <span style="text-decoration: underline;">run all the time</span> even when you&#8217;re not printing.</p>
<p><strong>Canon MF4370dn</strong></p>
<p>First up, and slightly newer, was the <a href="http://www.usa.canon.com/consumer/controller?act=ModelInfoAct&amp;fcategoryid=124&amp;modelid=17405">Canon MF4370dn</a>, which was relatively easy to set up once I got a new enough version of the drivers.  Originally it shipped without OS X drivers at all.</p>
<p>Later, Canon posted some drivers on their support website, but the V180 drivers didn&#8217;t work right &#8211; they&#8217;d usually hang after a page or two.  The <a href="http://www.google.com/search?q=UFR_II_V206_MacOSX_us_EN.dmg">UFR_II_V206_MacOSX_us_EN.dmg</a> works great.  Install it, and Bonjour finds the printer easily and does the right thing.</p>
<p>I&#8217;ve not tried the fax or scan drivers.</p>
<p>Note that this printer tends to get easily confused &#8211; and I keep it plugged into a power strip so I can cycle the power on it.</p>
<p><strong>Konica Minolta 2300dl</strong></p>
<p>Next was the <a href="http://www.amazon.com/Konica-Minolta-magicolor-Color-Printer/dp/B00006LHS3">Minolta 2300dl</a>.   A quick google will turn up a lot of places that have complicated instructions involving <span style="text-decoration: underline;">foo2zjs</span> or the <span style="text-decoration: underline;">Zenographics SDK</span> or the like.  Don&#8217;t do that.</p>
<p>The simple instructions are to use the <a href="http://printer.konicaminolta.com/support/current_printers/mc2430dl_sup.htm">2430DL mac drivers</a>.  This is good advice, as far as it goes.   However, you MUST also update your firmware to the latest version.   Some old versions of the firmware in the 2300dl are certain to NOT work right.  You want firmware v2.86, which everyone who has bothered to write about it agrees is the only version that works.</p>
<p>Finding the firmware is a pain in the butt.  It took me several months of searching off and on to find it, even knowing what I was looking for.   This printer is now old enough that it&#8217;s not at the top of the support tree, if you know what I mean.</p>
<p>The file you want is <a href="http://download6.konicaminolta.eu/konmin/public/&amp;&amp;BEU&amp;EN&amp;sw&amp;&amp;&amp;&amp;&amp;0&amp;&amp;&amp;0&amp;&amp;0&amp;&amp;0">v286_Update via network.zip</a> and sadly, it will only run on a PC.   I used an old pc with XP and it worked fine.   Some folks have success running on <a href="http://www.codeweavers.com/products/cxmac/">Wine</a>.   Worse, you have to manually edit a .BAT file with a text editor before you run it.   If this frightens you, get the nephew to do it.</p>
<ol>
<li>Download the <a href="http://printer.konicaminolta.com/support/current_printers/mc2430dl_sup.htm">2430DL mac driver</a></li>
<li>Download the v2.86 firmware update <a href="http://download6.konicaminolta.eu/konmin/public/&amp;&amp;BEU&amp;EN&amp;sw&amp;&amp;&amp;&amp;&amp;0&amp;&amp;&amp;0&amp;&amp;0&amp;&amp;0">v286_Update via network.zip</a>.  If this link does not take you to the right file directly, then enter &#8220;magicolor 2300dl&#8221; in the <strong>Product</strong> listbox and <strong>firmware</strong> in the <strong>Document Type / Sub Category</strong> listbox.</li>
<li>Make sure your 2300dl printer has a static IP address</li>
<li>Edit the firmware update.bat file, as described in the included pdf</li>
<li>Run the update.  When it wants you to pause in the middle, wait a few minutes to be sure.   A bricked printer is a bad thing.</li>
<li>Use a browser to go to your printer and print the config page.</li>
<li>Install the driver</li>
<li>Install the printer from the system preferences panel
<ol>
<li>Click on the &#8220;IP&#8221; icon at the top, and make sure you pick the <strong>HP JetDirect</strong> protocol in the top listbox.  Other settings may work.  These are the ones that work for me.</li>
<li>Enter the IP address of your printer.</li>
<li>In <strong>Print Using:</strong> choose <strong>Select Printer Software&#8230;</strong>
<ul>
<li>In the listbox that comes up, scroll down to<br />
<strong>KONICA MINOLTA magicolor 2430 DL</strong></li>
</ul>
</li>
<li>For me, if I tried to type a human-readable name, it changed the IP address and vice versa, and it also wiped out the driver setting each time.  Check carefully &#8211; the <strong>Print Using:</strong> MUST be set to the <strong><br />
KONICA MINOLTA magicolor 2430 DL</strong>, and the <strong>Address:</strong> must be correct before you click add.  These cannot be changed later.</li>
</ol>
</li>
<li>Try to print a page.</li>
<li>If the print queue monitor stops and shows the printer <em>paused</em> after 20% or so, then you did not update the firmware correctly.  Go back to the top and try again.</li>
<li>Afterwards, you can go back to the Print &amp; Fax System Preferences to rename this printer so it&#8217;s got a suitable name instead of the IP address.   Again, this may have just been an unhappy coincidence for me, but every time I tried to change the name when I was setting it up the first time, the IP address or the driver or both would be wiped out.  The name is not as important as the other two.</li>
<li>It&#8217;s a small irony that the little icon of the printer is the icon of the 2430dl and not the printer you actually have, the 2300dl.</li>
</ol>
<p><img class="aligncenter size-full wp-image-157" title="Screen shot 2010-06-22 at 7.28.22 AM" src="http://www.barryspov.com/wp-content/uploads/2010/06/Screen-shot-2010-06-22-at-7.28.22-AM.png" alt="Screen shot 2010-06-22 at 7.28.22 AM" width="748" height="607" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.barryspov.com/?feed=rss2&amp;p=138</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Roman Unix Tricks</title>
		<link>http://www.barryspov.com/?p=116</link>
		<comments>http://www.barryspov.com/?p=116#comments</comments>
		<pubDate>Thu, 10 Jun 2010 18:54:23 +0000</pubDate>
		<dc:creator>Barry A Dobyns</dc:creator>
				<category><![CDATA[Apple OS X]]></category>
		<category><![CDATA[Journal Notes]]></category>
		<category><![CDATA[Unix]]></category>

		<guid isPermaLink="false">http://www.barryspov.com/?p=116</guid>
		<description><![CDATA[It&#8217;s a true fact that I&#8217;m a Unix greybeard, and invariably reach for a handful of obscure command line tools to do what needs to be done most days, lashing it together with bash, mysql and maybe Python or PHP.  It used to be that I would do so on a traditional Unix, or later [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s a <a href="http://www.team.net/mjb/hawg.html" target="_blank">true fact</a> that I&#8217;m a <a title="Unix Greybeards" href="http://tomayko.com/writings/that-dilbert-cartoon" target="_blank">Unix greybeard</a>, and invariably reach for a handful of obscure command line tools to do what needs to be done most days, lashing it together with <a href="http://www.gnu.org/software/bash/" target="_blank">bash</a>, <a href="http://www.mysql.com/" target="_self">mysql</a> and maybe <a href="http://www.python.org/" target="_blank">Python</a> or <a href="http://php.net/index.php" target="_blank">PHP</a>.  It used to be that I would do so on a <a href="http://en.wikipedia.org/wiki/Version_7_Unix" target="_blank">traditional</a> <a href="http://en.wikipedia.org/wiki/Berkeley_Software_Distribution" target="_blank">Unix</a>, or later on, a <a href="http://en.wikipedia.org/wiki/Solaris_%28operating_system%29" target="_blank">commercial Unix</a>, and then one of the <a href="http://www.freebsd.org/" target="_blank">free</a> <a href="http://www.openbsd.org/">BSD</a> <a href="http://www.netbsd.org/" target="_blank">derivatives</a> (especially on obscure hardware) or <a href="http://en.wikipedia.org/wiki/Linux" target="_blank">Linux</a>.  These days, <a href="http://www.apple.com/macosx/" target="_blank">OSX</a> is the obvious place to get it done, perhaps with the help of <a href="http://www.macports.org/" target="_blank">ports</a> or <a href="http://www.finkproject.org/" target="_blank">fink</a> to fetch the tool first.</p>
<p>More often than not, the tool I want already exists.  Over 30 years of continuous development generally puts all the tools you need somewhere in the tool box.   But shockingly there&#8217;s a tool missing from time to time.   I can&#8217;t recall how many times I&#8217;ve had to reimplement a tool like <a href="http://www.linux-faqs.com/man/htmlman1/shuf.1.html" target="_blank">shuf(1)</a> as a few lines of bash before it became a part of the standard <a href="http://www.gnu.org/software/coreutils/" target="_blank">coreutils</a>.</p>
<p>This time it happened that I needed (the explanation <em>why</em> borders on silly, so it&#8217;s not worth repeating) a simple conversion of Arabic numbers to Roman numerals.   You know, <strong>MCDXLIV</strong> instead of <strong>1444</strong>.  Anyway, I expected to find that there was already some tool that did this.  Or that printf(1) would do it.  Or bc(1) or dc(1) would do it.</p>
<p><a href="http://google.com" target="_blank">Google</a> located a <a href="http://www.faqs.org/docs/abs/HTML/functions.html" target="_blank">simple little script</a> that worked for numbers up to 255, but that was insufficient for my needs today.   So I ended up tweaking that script into an improved version that works up to 3999 (which is the largest number you can represent in ASCII roman numerals).</p>
<pre class="brush: bash; light: true;">
#!/bin/bash
# Arabic number to Roman numeral conversion
#
# conforms to http://en.wikipedia.org/wiki/Roman_numerals
#    rather than &quot;motion picture usage&quot; so 1999 is MCMXCIX not MCMLXXXXIX
#    3999 is the largest number we can represent without an overbar
# also see http://www.faqs.org/docs/abs/HTML/functions.html
#
# Range: 1 - 3999 (0 is accepted but prints an empty line)
#
# Usage: roman number-to-convert

LIMIT=3999
E_ARG_ERR=65
E_OUT_OF_RANGE=66
REMAINDER=/tmp/$$.$RANDOM

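# reject a missing or non-numeric argument before doing any arithmetic
case &quot;$1&quot; in
 ''|*[!0-9]*)
 echo &quot;Usage: `basename $0` number-to-convert&quot;
 exit $E_ARG_ERR
 ;;
esac

if [ &quot;$1&quot; -gt $LIMIT ]
then
 echo &quot;Out of range!&quot;
 exit $E_OUT_OF_RANGE
fi
# store the starting value only after the checks pass, so no temp file is left behind on error
echo &quot;$1&quot; &gt;$REMAINDER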

to_roman ()
{
 number=$1
 factor=$2
 rchar=$3
 let &quot;remainder = number - factor&quot;
 while [ &quot;$remainder&quot; -ge 0 ]
 do
 echo -n $rchar
 let &quot;number -= factor&quot;
 let &quot;remainder = number - factor&quot;
 done

 echo $number &gt;$REMAINDER
}

to_roman `cat $REMAINDER` 1000 M
to_roman `cat $REMAINDER` 900 CM
to_roman `cat $REMAINDER` 500 D
to_roman `cat $REMAINDER` 400 CD
to_roman `cat $REMAINDER` 100 C
# for 'motion picture usage' then use this instead of XC
# to_roman `cat $REMAINDER` 90 LXXXX
to_roman `cat $REMAINDER` 90 XC
to_roman `cat $REMAINDER` 50 L
to_roman `cat $REMAINDER` 40 XL
to_roman `cat $REMAINDER` 10 X
to_roman `cat $REMAINDER` 9 IX
to_roman `cat $REMAINDER` 5 V
to_roman `cat $REMAINDER` 4 IV
to_roman `cat $REMAINDER` 1 I

rm -rf $REMAINDER

echo

exit 0</pre>
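<p>The script above hands the remainder from one <strong>to_roman</strong> call to the next through a temp file.  Since the function runs in the same shell as its caller, an ordinary variable can carry the remainder too.  What follows is only a sketch of that variation, not a replacement for the script: the function name <strong>roman</strong> and the colon-separated value:numeral table are choices made for this example, and it rejects numbers written with leading zeros so that shell arithmetic never reads them as octal.</p>

```shell
#!/bin/bash
# Sketch: greedy Arabic-to-Roman conversion with the remainder kept in a
# variable instead of a temp file. "roman" is a name picked for this example.
roman () {
    local n="$1" out="" pair value glyph
    case "$n" in
        # empty, non-digit, or leading-zero input is refused
        ''|*[!0-9]*|0?*) echo "roman: not a plain decimal number: $n" >&2; return 65 ;;
    esac
    if [ "$n" -gt 3999 ]; then
        echo "roman: out of range: $n" >&2
        return 66
    fi
    # largest value first; the subtractive pairs (CM, CD, XC, XL, IX, IV)
    # sit in the table just like the single letters
    for pair in 1000:M 900:CM 500:D 400:CD 100:C 90:XC 50:L 40:XL 10:X 9:IX 5:V 4:IV 1:I
    do
        value=${pair%%:*}
        glyph=${pair#*:}
        # emit this numeral for as long as its value still fits
        while [ "$n" -ge "$value" ]; do
            out="$out$glyph"
            n=$((n - value))
        done
    done
    echo "$out"
}

roman 1444    # prints MCDXLIV, the example from the text
```

<p>The return codes 65 and 66 mirror E_ARG_ERR and E_OUT_OF_RANGE in the script above.</p>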
]]></content:encoded>
			<wfw:commentRss>http://www.barryspov.com/?feed=rss2&amp;p=116</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Evaluating Search Index Relevance</title>
		<link>http://www.barryspov.com/?p=92</link>
		<comments>http://www.barryspov.com/?p=92#comments</comments>
		<pubDate>Sun, 20 Sep 2009 23:33:58 +0000</pubDate>
		<dc:creator>Barry A Dobyns</dc:creator>
				<category><![CDATA[Indexing]]></category>

		<guid isPermaLink="false">http://www.barryspov.com/?p=92</guid>
		<description><![CDATA[I was evaluating a particular search index recently and trying to compare the relevance of results  as compared to a different search engine using the same corpus.    This is the same problem you have when you&#8217;re tuning the relevance functions in a single index, and trying to  see if it&#8217;s &#8220;better&#8221;.  Some sort of external [...]]]></description>
			<content:encoded><![CDATA[<p>I was evaluating a particular search index recently and trying to compare the relevance of its results with those of a different search engine using the same corpus.    This is the same problem you have when you&#8217;re tuning the relevance functions in a single index, and trying to  see if it&#8217;s &#8220;better&#8221;.  Some sort of external scoring methodology is needed to compare the different result sets.</p>
<p>First, you need a corpus.  A collection of representative documents.</p>
<p>Second, you need a selection of representative queries.  Best if they&#8217;re taken from actual logfiles of real user queries against the actual collection.  Failing that, use the <a href="http://en.wikipedia.org/wiki/AOL_search_data_scandal">AOL Search Data</a>.</p>
<p>Anyway, I was working today with a real non-trivial corpus, a good set of representative queries, and a buddy&#8217;s index and search engine (not <a href="http://en.wikipedia.org/wiki/Lucene">Lucene</a>).  As I performed sample searches and scored the results, I realized the naive scoring method I started with wasn&#8217;t sufficient.</p>
<p>As a user, I don&#8217;t care much about the relative ranking in the first page; I just want to make sure that the best results are not languishing on page seven or nineteen of the results.  In other words, how do I know the specific ten that search engine A picks for its first page are better than the specific ten that search engine B picks for the first page?</p>
<p>This is because it&#8217;s orders of magnitude more important to get the right result on the first page at all than it is to get the right results ordered properly on the first page.  Relative ordering of the first page is irrelevant to me as a user.</p>
<p>This also corresponds with how experts in SEO behave.  Being the first result on Google is mostly an ego-inflating party trick.  The difference between being the second result and the tenth (still on the first page) is negligible.   The difference between being the tenth (on the first page) and the eleventh (not on the first page) is huge.</p>
<p>Finally, there&#8217;s a &#8220;diversity&#8221; measure.  Users don&#8217;t want substantially identical pages to appear on the first results page.</p>
<p>This thinking led me to a very different notion of how to measure &#8220;quality&#8221; of the results that probably corresponds more naturally to a user&#8217;s intuition, and certainly corresponds to my intuition.</p>
<p>Here&#8217;s how I&#8217;d score, if I had to.</p>
<ol>
<li>Rank all possible results for a query (or at least a very large number, say the top 100 possible results), using some external &#8220;true&#8221; ranking methodology &#8211; (<a href="https://www.mturk.com/mturk/welcome">Amazon Mechanical Turk</a>, or the <a href="http://trec.nist.gov/data/t9_filtering.html" target="_blank">TREC</a> data?) to discover the &#8220;true top ten.&#8221;  It&#8217;s important that this be external to the relevance mechanism of the search engine itself &#8211; otherwise you&#8217;ve just created a feedback loop, and everything else is meaningless.</li>
<li>Flag substantially identical pages in the &#8220;true top ten&#8221; so that you only count one, but not both.  For example, if your &#8220;top ten&#8221; has two substantially identical ones, you&#8217;ll include the eleventh as part of the &#8220;true top ten&#8221;.</li>
<li>Ignore order on first page.</li>
<li>The base score is a count of the &#8220;true top ten&#8221; that appear in the first page of the result set. (scores from 0 to 10)</li>
<li>Add in a +1 &#8220;bonus&#8221; for getting the &#8220;true best one&#8221; in the first position.</li>
<li>If more than one of the &#8220;substantially identical&#8221; pages flagged in rule two appears, only count one, and score the others as zero.</li>
</ol>
<p>Let&#8217;s call this NQM (Naive Quality Measure).  While other more complicated measures might be more defensible academically (and there&#8217;s no shortage of academic ways to compare search results), I believe that NQM is going to correspond much more closely to actual user perception.  In addition, it&#8217;s a lot easier to explain than more complicated academic measures, since most of the score comes from rule 4.</p>
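<p>For the curious, the six rules are small enough to sketch in a few lines of bash.  Everything about the input format here is an assumption made for the example: the external ranking is a best-first list of docid:group tokens, where pages that are substantially identical share a group label, and the engine&#8217;s first page is a plain list of docids in the order shown.  The function name <strong>nqm_score</strong> is made up too.  One more interpretation to flag: this sketch awards the rule 5 bonus when the first result belongs to the best group, so a near-duplicate of the best page earns it as well.</p>

```shell
#!/bin/bash
# Sketch of NQM scoring under an assumed input format (see the text above).
#   $1  external "true" ranking, best first, as "docid:group" tokens
#   $2  the engine's first page, as docids in displayed order
nqm_score () {
    local truth="$1" page="$2"
    local top_groups="" best_group="" count=0
    local entry group doc seen="" score=0 position=0

    # rules 1 and 2: walk the true ranking and keep the first ten distinct
    # groups, so a near-duplicate is skipped and the eleventh page moves up
    for entry in $truth; do
        [ "$count" -lt 10 ] || break
        group=${entry#*:}
        case " $top_groups " in
            *" $group "*) continue ;;
        esac
        [ -n "$best_group" ] || best_group=$group
        top_groups="$top_groups $group"
        count=$((count + 1))
    done

    # rules 3 to 6: score the first page, ignoring order except for the bonus
    for doc in $page; do
        position=$((position + 1))
        group=""
        for entry in $truth; do
            if [ "${entry%%:*}" = "$doc" ]; then
                group=${entry#*:}
                break
            fi
        done
        [ -n "$group" ] || continue          # page is not in the external ranking
        case " $top_groups " in
            *" $group "*) ;;
            *) continue ;;                   # ranked, but outside the true top ten
        esac
        case " $seen " in
            *" $group "*) continue ;;        # rule 6: a repeat of a counted group scores zero
        esac
        seen="$seen $group"
        score=$((score + 1))                 # rule 4: one point per true-top-ten hit
        if [ "$position" -eq 1 ] && [ "$group" = "$best_group" ]; then
            score=$((score + 1))             # rule 5: bonus for the best page in first place
        fi
    done
    echo "$score"
}

# b and c are near-duplicates (group 2), so k (group 10) joins the true top ten
truth="a:1 b:2 c:2 d:3 e:4 f:5 g:6 h:7 i:8 j:9 k:10 l:11"
nqm_score "$truth" "a b c x y l k d e f"    # prints 7
```

<p>In the sample run, six distinct top-ten groups show up on the first page (a, b, k, d, e, f), the duplicate c and the out-of-range l score nothing, and the best page in first position adds the bonus point.</p>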
]]></content:encoded>
			<wfw:commentRss>http://www.barryspov.com/?feed=rss2&amp;p=92</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Corporate Data, History, Future</title>
		<link>http://www.barryspov.com/?p=87</link>
		<comments>http://www.barryspov.com/?p=87#comments</comments>
		<pubDate>Thu, 10 Sep 2009 20:20:30 +0000</pubDate>
		<dc:creator>Barry A Dobyns</dc:creator>
				<category><![CDATA[Indexing]]></category>

		<guid isPermaLink="false">http://www.barryspov.com/?p=87</guid>
		<description><![CDATA[Before 1989, databases were big, specialized things that you did with minicomputers.  Very large companies could afford them. Small companies did not, and mostly could not afford them.  There were cool little desktop products, and DBase from Ashton-Tate was among the best &#8211; but there was little to no standardization.  Whether SQL was or should [...]]]></description>
			<content:encoded><![CDATA[<p>Before 1989, databases were big, specialized things that you did with minicomputers.  Very large companies could afford them. Small companies did not, and mostly could not afford them.  There were cool little desktop products, and <a href="http://en.wikipedia.org/wiki/Dbase" target="_blank">DBase</a> from <a title="If you're going to comment, please don't pick this to argue about" href="http://en.wikipedia.org/wiki/Ashton-Tate" target="_blank">Ashton-Tate</a> was among the best &#8211; but there was little to no standardization.  Whether SQL was or should be a standard that everyone followed was open to debate, and for the most part, <em>jus&#8217; reg&#8217;lar folk</em> ignored databases.</p>
<p>The years from about 1989 to about 1999 were when everyone began to realize that SQL relational databases were not just for airlines and insurance companies.  Starting with the first release of Microsoft SQL Server (yes, yes, with <a href="http://www.therandomblog.org/?p=181" target="_blank">Sybase and Ashton Tate</a>) relational database technology started to enter the mainstream of personal and corporate computing.   During the nineties, just about every company that didn&#8217;t have their tabular data in relational databases got it there.</p>
<p>Starting in 1999, we were able to actually do more with that tabular data than just run the business in the same old ways as before.  Most people realized that the data they had collected could be mined for useful business intelligence.  Looking at aggregate trends in a lot of data, rather than just searching for a single record became possible, allowing savvy companies to build meaningful &#8220;dashboards&#8221; for management.  Business Intelligence was enabled by the fact that we all had our data in  SQL databases.</p>
<p>Also beginning at the end of the 90&#8217;s everyone realized that since we already had not only our data, but all our operational systems wired up to these databases, it was (to varying degrees) easy to wire it directly to the internet to connect to customers, business partners, and potential customers.  E-Commerce and the dot-com boom were enabled by the fact that we already had all our data in SQL databases.  Before the internet, <a href="http://en.wikipedia.org/wiki/Electronic_Data_Interchange" target="_blank">EDI</a> was hard and expensive to build.  The internet, and most especially XML carried over HTTP, made it easy for everyone else.  But before widespread adoption of databases, neither internet e-commerce nor EDI was possible.</p>
<p>SQL gave us all a common way to talk to our data, and as we realized how useful it was, business leaders ran around inside companies looking for more data to add to our SQL databases, recognizing that the more was in there, the more we could take advantage of it.  Just about everything that fits the rectangular table format is now in a database, in just about every company, large or small, everywhere in the developed world.  While there are many database products &#8211; from free and open source, to expensive and commercial, they all share SQL as the common way to get at the data &#8211; which allows us to leverage that data in many, many ways.</p>
<p>Along the way, lots of people asked a reasonable question: what about all the rest of my data?  Things that don&#8217;t really fit into a table, because they&#8217;re not &#8220;records&#8221; of a uniform type.  Documents.  Letters.  Memos. Companies are full of plain old documents, for instance &#8211; and these documents don&#8217;t fit well into databases.</p>
<p>Yeah.  There have been some notable attempts to do something interesting with that data, <a href="http://en.wikipedia.org/wiki/IBM_Lotus_Notes" target="_blank">Lotus Notes</a> among the most extravagant (or bizarrely interesting), and <a href="http://en.wikipedia.org/wiki/Index_(search_engine)" target="_blank">indexing</a> products like Verity (now <a href="http://en.wikipedia.org/wiki/Autonomy_Corporation" target="_blank">Autonomy</a>).  But for the most part, these attempts have been large and expensive, and mostly confined to the very largest companies, or to trivial <a href="http://en.wikipedia.org/wiki/Google_desktop_search" target="_blank">desktop toys</a>.  In many ways, the state of the indexing market is much like the state of the database market of 1989 &#8211; there&#8217;s a play at the top for big spenders, and something to toy with at the bottom, but nothing for every-man, or more importantly every-company.</p>
<p>Today, most companies realize that the SQL databases don&#8217;t hold all their data.  In some industries (healthcare in America, anyone?), it&#8217;s arguable that <em>most</em> of the interesting data is not in any sort of usable electronic form.</p>
<p>We&#8217;re now at a tipping point for indexing.  Managers in every company are now looking at all this data that&#8217;s not in a database, and saying to themselves, &#8220;Self, why can&#8217;t I just search my stuff the way I <em>Google</em> for stuff on the internet?&#8221;</p>
<p><a href="http://en.wikipedia.org/wiki/Lucene" target="_blank">Lucene</a>, which has been around for a while, has been a specialist tool for Java nerds until very recently.  It&#8217;s an indexing and search library, but a library is far from a complete usable product.  The vast majority of <a href="http://wiki.apache.org/lucene-java/PoweredBy" target="_blank">vertical search engines</a> you see on the web have Lucene behind them.  But building an indexing product with Lucene is hard work.   Lucene is <em>not </em>for companies whose primary business is something <em>other </em>than search.</p>
<p><a href="http://en.wikipedia.org/wiki/Solr" target="_blank">Solr</a> is a framework around Lucene which allows rapid application development.  By providing a framework where you can plug in parsers, query analyzers, response writers and such, and providing a collection of these to begin with, Solr gives developers a complete toolkit that goes way beyond Lucene.   Corporate developers don&#8217;t need to begin with reinventing the entire workflow each time, since Solr gives them a kickstart on a workable process model and component set.</p>
<p>Solr itself, while opensource, is also <a href="http://www.lucidimagination.com/" target="_blank">commercially supported</a>.  So much in the same way that <a href="http://en.wikipedia.org/wiki/MySQL_AB" target="_blank">MySQL AB</a> made it &#8220;safe&#8221; for companies to use <a href="http://en.wikipedia.org/wiki/MySQL" target="_blank">MySQL</a> (someplace reliable to buy support and get help), <a href="http://lucidimagination.com" target="_blank">Lucid</a> is making it safe for companies to use Solr and Lucene.</p>
<p>I predict that the next decade will show a rapid rise of &#8220;everyone else&#8221; starting to stuff their non-tabular data into indexes, and starting to leverage those indexes in new and interesting ways. Today, 2009, indexing is where databases were in 1989.</p>
<p>What we don&#8217;t yet have are a couple of key enablers that made the widespread adoption of relational technology possible and rapid. It remains to be seen whether these enablers will turn out to be required for indexing to become broadly deployed.</p>
<p>Rock solid tools.   Errors developers see and users see <em>must</em> correlate to errors they committed, not flaws in the tool.  Is Solr+Lucene as stable and bulletproof as your favorite database tools (Oracle, SqlServer, Postgres, MySQL)? No. At the moment, it&#8217;s still brittle and prone to fall over dead, mostly during development, and thankfully not during deployment.  Changing the schema requires that you restart the search engine.   Errors in your schema specification appear in a Java stack backtrace, which may not even include the line number of the error, and may or may not prevent the engine from starting up.   Syntax for configuration is inscrutable and poorly documented.  This immaturity represents a barrier to entry, but we should expect rapid improvement in both the opensource toolset as it rushes to catch up with the commercial ones, and in the commercial ones as they rush to stay ahead.</p>
<p>A common query language, SQL, made it possible for all the database technologies to offer common functionality and common models.  SQL also implied a common workflow near the database.  It was painful to move from Oracle to Microsoft, but there weren&#8217;t any radical new concepts to learn, just ugly details.  There&#8217;s no commonality like SQL between the various indexing libraries, products and tools.</p>
<p>Another key enabler is the set of reporting, mining and analysis tools.   Today, there are a few specialized tools like <a href="http://splunk.com">Splunk</a> and <a href="http://paglo.com">Paglo</a> that analyze server logfiles.  Server logfiles may be &#8220;unstructured&#8221; data, but they&#8217;re a narrow subset of the sorts of data that need to be analyzed, and these tools are a far cry from the general tools that will need to be built in the coming years.  I&#8217;ve had a few people suggest to me (even a few otherwise smart venture investors) that these are general purpose analysis tools for unstructured data, which even the vendors realize they are not.</p>
<p>Anyway, this is an area I&#8217;ve been thinking about for a long time, and am going to continue to talk more about as we go forward.</p>
<hr /><p>My apologies to those of you who have slaved for years in the <a href="http://en.wikipedia.org/wiki/Information_retrieval" target="_blank">IR</a> space: this is a rich field with a rich academic history, and a rich commercial history in a few industries where it was needed early (government intelligence communities and the legal profession are just two).</p>
<p>I&#8217;ve slaved in this area too, building an inverted keyword indexing product from scratch in the mid 80&#8217;s, working again in IR in the mid 90&#8217;s for the intelligence communities, and then working in vertical web search in the middle of this decade.  What I&#8217;m talking about here is widespread adoption by everyone, not just narrow adoption by a few industries, and papers at <a href="http://en.wikipedia.org/wiki/Text_Retrieval_Conference" target="_blank">TREC</a> and <a href="http://en.wikipedia.org/wiki/Special_Interest_Group_on_Information_Retrieval" target="_blank">SigIR</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.barryspov.com/?feed=rss2&amp;p=87</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Moving to WordPress</title>
		<link>http://www.barryspov.com/?p=56</link>
		<comments>http://www.barryspov.com/?p=56#comments</comments>
		<pubDate>Wed, 19 Aug 2009 06:48:37 +0000</pubDate>
		<dc:creator>Barry A Dobyns</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.barryspov.com/?p=56</guid>
		<description><![CDATA[I finally decided to abandon phpWebLog which I&#8217;ve been using for nearly a decade now as the primary blogging/content management tool not only for my personal blogs, but also for my wife&#8217;s consulting firm, and several nonprofit organizations that I set up a website for.
Why did I move?  phpWebLog was a great choice when I [...]]]></description>
			<content:encoded><![CDATA[<p>I finally decided to abandon <a href="http://phpweblog.org" target="_blank">phpWebLog</a> which I&#8217;ve been using for nearly a decade now as the primary blogging/content management tool not only for my personal blogs, but also for my wife&#8217;s consulting firm, and several nonprofit organizations that I set up a website for.</p>
<p>Why did I move?  <span style="text-decoration: underline;">phpWebLog</span> was a great choice when I started using it, in April of 2001.  At that time, the popular alternative seemed to be <a href="http://phpnuke.org" target="_blank">phpNuke</a> but its community was busy exploding into a bunch of subprojects, and all of them seemed like they were far too heavyweight for the simple things I had in mind.</p>
<p>By contrast, <span style="text-decoration: underline;">phpWebLog</span> was small and lightweight, and it was clear that I could understand all of the <span style="text-decoration: underline;">phpWebLog</span> code pretty quickly, and bend it to my needs.  As it turns out, I modified it heavily along the way, and now have an incompatible variant that is forked from that ancient release I started with in 2001.  For example, when podcasting caught my fancy in 2006, I added code to automatically cross-post show notes to the blog when adding them to the podcast directory.</p>
<p>Today, the heavyweight tool that&#8217;s too complicated to use is now <a href="http://drupal.org/" target="_blank">Drupal</a>, while <span style="text-decoration: underline;">phpNuke</span> and <span style="text-decoration: underline;">phpWebLog</span> are mostly forgotten.  I&#8217;m still trying to avoid overkill tools that cost me far more effort to learn than they&#8217;re worth.</p>
<p>But what&#8217;s wrong is that now the look of the old <span style="text-decoration: underline;">phpWebLog</span> is starting to feel ugly and dated, and it&#8217;s hard to add new functionality in, while modern systems have a useful plugin architecture.  Too many simple things are hard to change in <span style="text-decoration: underline;">phpWebLog</span> or are brittle.  So it&#8217;s time to migrate to something else, where I can spend my time writing the blog post, not writing the code on the server to support a post.</p>
<p>On the other hand, I am willing to invest a little in moving content between two systems.</p>
<p><a href="http://wordpress.org" target="_blank">WordPress</a> is well supported, has a vibrant ecosystem around it for plugins and themes, is <a href="http://www.gnu.org/copyleft/gpl.html" target="_blank">GPL</a>, and is written in <a href="http://php.net" target="_blank">PHP</a>, a language I know well.</p>
<p>So I installed <span style="text-decoration: underline;">WordPress</span> and began to puzzle out how to insert my old content from <span style="text-decoration: underline;">phpWebLog</span> into <span style="text-decoration: underline;">WordPress</span> directly (mucking around with sql queries in the background).   I don&#8217;t recall which parts of <span style="text-decoration: underline;">phpWebLog</span> I&#8217;ve extended and updated.  It&#8217;s possible my <span style="text-decoration: underline;">phpWebLog</span> schema is no longer similar enough to what you get now that this is useful to anyone.  The trickiest part of all of this is that the &#8220;category&#8221; id is right in the post table for <span style="text-decoration: underline;">phpWebLog</span> but for <span style="text-decoration: underline;">WordPress</span> category is indirectly bound to the post via the taxonomy table in <span style="text-decoration: underline;">WordPress</span>.</p>
<p>This blog will focus just on high tech and software development.  So, while I was at it, I also split some of the old categories and their posts off into a separate wordpress blog (which will remain at <a href="http://www.nothingtodeclare.com" target="_blank">nothingtodeclare.com</a>), and only brought a few categories to this one.  And certain categories of content and their posts were just discarded altogether.  You can find the missing content at <a href="http://archive.org" target="_blank">archive.org</a> if you need it.</p>
<p>I thought about building a complete http referrer map in .htaccess, which I have done in the past, but decided against it, partly because of the namespace collisions with some of the php files between <span style="text-decoration: underline;">phpWebLog</span> and <span style="text-decoration: underline;">WordPress</span>.  Besides, now that my parents have both passed away, there&#8217;s no one alive with bookmarks into any of my content.</p>
<p>Here&#8217;s the basic SQL that does the bulk migration.  I&#8217;ve left out the additional SQL that trimmed out certain categories and posts for each destination.  Comments inline, of course.</p>
<pre class="brush: sql;">
-- the phpWebLog tables are prefixed with T_ (table names are case-sensitive on most Unix installs)
-- the wordpress tables are prefixed with wp_

-- copy over the table which contains the &quot;categories&quot;  (topics in phpWebLog).
-- note that we preserve the numerical value of the topic ids.
-- there's a lowercase version of the topic name that wordpress wants.
TRUNCATE TABLE `wp_terms` ;
INSERT INTO `wp_terms` (`term_id`, `name`, `slug`)
SELECT `Rid`, `Topic`, REPLACE(LOWER(`Topic`),' ','-')
FROM `T_Topics` ;

-- the taxonomy table points to the terms, and is pointed to by the relations table later.
TRUNCATE `wp_term_taxonomy`;
INSERT INTO `wp_term_taxonomy` ( `term_taxonomy_id`, `term_id`, `taxonomy`, `parent`, `count` )
SELECT `Rid`, `Rid`, 'category', 0, 0
FROM `T_Topics`;

-- we're going to copy the stories over, into the posts table.
-- phpWebLog required some quoting with the backslash that WordPress does not.
-- we try to preserve the original post dates and the last edit dates.
-- use phpWebLog Rid as the guid which is necessary later.
-- assumes that all the imported posts are from the admin post_author=1
TRUNCATE `wp_posts`;
TRUNCATE `wp_postmeta`;
INSERT INTO `wp_posts` ( `post_author`, `post_date`, `post_date_gmt`,
 `post_content`, `post_title`,
 `post_excerpt`, `post_name`,
 `post_modified`, `post_modified_gmt`, `guid` )
SELECT 1, `Birthstamp`, `Birthstamp`,
 REPLACE(`Content`,'\\','') , REPLACE(`Heading`,'\\',''),
 REPLACE(`Summary`,'\\','') , REPLACE(`Heading`,'\\',''),
 `Timestamp`, `Timestamp`, `Rid`
FROM `T_Stories`;

-- in order to create the relation table which points into the taxonomy and the posts respectively.
-- we need to join the posts and stories on the Rid/Guid so we can get out the new post id and old topic id.
TRUNCATE TABLE `wp_term_relationships`;
INSERT INTO `wp_term_relationships` ( `object_id`, `term_taxonomy_id`, `term_order` )
SELECT w.ID, t.Topic, 0
FROM `T_Stories` AS t, `wp_posts` AS w
WHERE t.Rid = w.guid ;

-- now we need to update the counts in the taxonomy table for the number of articles
-- that belong to each topic.  We do a simple temporary table of id's and counts first
CREATE TEMPORARY TABLE p
SELECT `Topic`, COUNT(`Topic`) AS Count
FROM `T_Stories`
GROUP BY `Topic`;

-- now update the taxonomy table with the counts
UPDATE `wp_term_taxonomy` AS w , p
 SET w.count = p.Count
WHERE w.term_id = p.Topic ;

-- at this point, if you are lucky, you should be able to go see your content
-- in wordpress, and with the categories still hooked up to the articles
</pre>
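<p>A quick sanity check after the migration, as a minimal sketch: it assumes the stock <span style="text-decoration: underline;">WordPress</span> column names used above and that nothing else has touched the tables yet.  It lists any category whose stored count disagrees with the number of posts actually attached to it, so an empty result is the good outcome.</p>
<pre class="brush: sql;">
-- compare the count we stored in the taxonomy table with the
-- number of rows really present in the relationships table
SELECT tt.`term_id`, t.`name`, tt.`count` AS stored_count,
 COUNT(tr.`object_id`) AS actual_count
FROM `wp_term_taxonomy` AS tt
JOIN `wp_terms` AS t ON t.`term_id` = tt.`term_id`
LEFT JOIN `wp_term_relationships` AS tr
 ON tr.`term_taxonomy_id` = tt.`term_taxonomy_id`
WHERE tt.`taxonomy` = 'category'
GROUP BY tt.`term_id`, t.`name`, tt.`count`
HAVING stored_count &lt;&gt; actual_count ;
</pre>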
]]></content:encoded>
			<wfw:commentRss>http://www.barryspov.com/?feed=rss2&amp;p=56</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bear Skins and Flint Knives</title>
		<link>http://www.barryspov.com/?p=67</link>
		<comments>http://www.barryspov.com/?p=67#comments</comments>
		<pubDate>Wed, 10 Sep 2008 07:44:31 +0000</pubDate>
		<dc:creator>Barry A Dobyns</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.barryspov.com/?p=67</guid>
		<description><![CDATA[I&#8217;ve been building database applications for a long time, and building in particular on MySQL since 1998 now.  On MySQL I&#8217;ve built some pretty gnarly apps, with dozens of tables, millions of rows and nasty joins. Some, like the one for the education foundation do dozens of queries to build each webpage (pretty typical for [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been building database applications for a long time, and building in particular on <a href="http://mysql.org" target="_blank">MySQL</a> since 1998 now.  On MySQL I&#8217;ve built some pretty gnarly apps, with dozens of tables, millions of rows and nasty joins. Some, like the one for the <span><a href="http://www.lpef.org/" target="_blank">education foundation</a></span>, do dozens of queries to build each webpage (pretty typical for a blog), and the internal accounting apps for the ed fund directly generate <a href="http://en.wikipedia.org/wiki/Rich_Text_Format" target="_blank">RTF</a> for parts of the auction catalog from complex four and five table joins, some with nested subqueries. All of this runs quickly, with the most complicated queries taking longer to transfer the result from the server via HTTP than to run to completion.</p>
<p>So I&#8217;m shocked, shocked that queries I want to perform on my <span style="text-decoration: underline;">CrawlAnalysis</span> database take hours, and in some cases days to run to completion.</p>
<p>I was investigating why this one particular query took so incredibly long <span style="color: #cc3300;"><strong>(over 6 days)</strong></span> to run to completion:</p>
<pre class="brush: sql; light: true;">
SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM `crawl-20060820` AS ll
LEFT JOIN `sun-crawl-20060928172854` as rr ON rr.URL = ll.URL
WHERE (
ll.ReversedHost LIKE 'com.sun.%'
OR ll.ReversedHost LIKE 'net.java.%'
)
AND rr.URL IS NULL
AND ll.Status = 'DB_fetched'
GROUP BY ll.URL;
</pre>
<p>and found a reference in <span><a href="http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064" target="_blank">High Performance MySQL</a></span> (which I bought at the bookstore today) that explains, in an appropriately named section called <strong>Stupid Query Tricks</strong>, that under unfavorable circumstances doing an OR in a WHERE clause can force MySQL to <strong>rowscan </strong>the table.</p>
<p>In other words, MySQL is stupider than a dead cat.</p>
<p>The book suggests rewriting the query I have above as a UNION instead:</p>
<pre class="brush: sql; light: true;">
(
SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM `crawl-20060820` AS ll
LEFT JOIN `sun-crawl-20060928172854` as rr ON rr.URL = ll.URL
WHERE ll.ReversedHost LIKE 'com.sun.%'
AND rr.URL IS NULL
AND ll.Status = 'DB_fetched'
GROUP BY ll.URL
) UNION (
SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM `crawl-20060820` AS ll
LEFT JOIN `sun-crawl-20060928172854` as rr ON rr.URL = ll.URL
WHERE ll.ReversedHost LIKE 'net.java.%'
AND rr.URL IS NULL
AND ll.Status = 'DB_fetched'
GROUP BY ll.URL )
</pre>
<p>Thereby doubling the number of lines of code in  the query and making it sooo much clearer.</p>
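<p>One small refinement is possible, sketched here under the assumption that <strong>ReversedHost</strong> is derived from the URL, so that no URL can show up in both branches.  In that case the two halves are disjoint, and UNION ALL returns the same rows while skipping the de-duplication pass that plain UNION performs.  The column list is trimmed to the URL only to keep the sketch short.</p>
<pre class="brush: sql; light: true;">
(
SELECT ll.URL
FROM `crawl-20060820` AS ll
LEFT JOIN `sun-crawl-20060928172854` AS rr ON rr.URL = ll.URL
WHERE ll.ReversedHost LIKE 'com.sun.%'
AND rr.URL IS NULL
AND ll.Status = 'DB_fetched'
GROUP BY ll.URL
) UNION ALL (
-- second branch: same shape, different host prefix
SELECT ll.URL
FROM `crawl-20060820` AS ll
LEFT JOIN `sun-crawl-20060928172854` AS rr ON rr.URL = ll.URL
WHERE ll.ReversedHost LIKE 'net.java.%'
AND rr.URL IS NULL
AND ll.Status = 'DB_fetched'
GROUP BY ll.URL )
</pre>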
<p>Interpretation: the book is suggesting that because the optimizer is too dumb to properly optimize the query, you do it yourself, in a database-specific way.  <span><a href="http://en.wikipedia.org/wiki/Cthulhu">Cthulhu</a></span> knows what you&#8217;ll get for a performance result when you run this query against a different SQL database. But that doesn&#8217;t matter anyway, because the ANSI SQL standard is mostly ignored in the breach by application and database developers alike.</p>
<p>Which begs the question of <strong>WTF</strong> is the MySQL query <strong>optimizer</strong> doing?  Is it really a query <strong>pessimizer</strong>?  And what&#8217;s with the anecdotal story in the same chapter (page 91) where in the case of one particular join query the optimizer took 30 times as long to run as it did to actually execute the query?  Excuse me?</p>
<p>This whole thing with some (but only some) MySQL queries taking forever is pissing me off.  It&#8217;s like we&#8217;ve gone back to <span style="color: #cc9900;"><strong>bearskins and flint knives</strong></span> (aka toggling in your program on the front panel switches).  And, frankly, I already know from past experience that the alternative suspects (Oracle, Sybase, Postgres, Ingres, Illustra &#8230;) are no better &#8211; only different.</p>
<p>I was confused. Before today, I thought that the point of a high-performance and excellent modern database is that you don&#8217;t have to spend all your time doing a SQL EXPLAIN to figure out why the database is stupider than a dead cat.</p>
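<p>For the record, the EXPLAIN check itself is short.  A minimal sketch, assuming nothing about which indexes the table actually has (that is what the first statement is for); the column list is trimmed and the join left out so that only the OR condition is under test:</p>
<pre class="brush: sql; light: true;">
-- which indexes does the table actually have?
SHOW INDEX FROM `crawl-20060820`;

-- ask the optimizer how it plans to evaluate the WHERE clause
EXPLAIN
SELECT ll.URL
FROM `crawl-20060820` AS ll
WHERE (
ll.ReversedHost LIKE 'com.sun.%'
OR ll.ReversedHost LIKE 'net.java.%'
)
AND ll.Status = 'DB_fetched';
</pre>
<p>In the EXPLAIN output, a <strong>type</strong> of ALL with a NULL <strong>key</strong> means MySQL intends to scan every row; a <strong>type</strong> of range with an index named in <strong>key</strong> means the index is being used.</p>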
<hr /><p>Damn!  Again!  This query returns 650,000 rows, including ones that are NOT in the date range I&#8217;m specifying.</p>
<pre class="brush: sql; light: true;">
SELECT `URL`
 FROM `SquidAccessLog-20061006`
 WHERE `ResultCode` LIKE '%MISS'
   AND `TimeStamp` BETWEEN '2006-10-06 12:80:00'
                       AND '2006-10-06 15:00:00'
 GROUP BY `URL`;
</pre>
<p>But this query (which should be functionally equivalent) returns the right stuff (fewer than 10,000 rows):</p>
<pre class="brush: sql; light: true;">
SELECT *
FROM `SquidAccessLog-20061006`
WHERE `ResultCode` LIKE '%MISS'
   AND `TimeStamp` &gt;= '2006-10-06 12:80:00'
   AND `TimeStamp` &lt;= '2006-10-06 15:00:00'
GROUP BY `URL` ;
</pre>
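<p>One thing worth checking in both queries: <strong>12:80:00</strong> is not a valid time of day (minutes stop at 59).  How MySQL coerces a bad datetime literal can depend on the server version, the SQL mode and the operator involved, so that may be part of why the two forms disagree.  A minimal sketch for testing this in isolation, using a throwaway table (<strong>ts_check</strong> is a made-up name); no expected output is shown because it will vary by server:</p>
<pre class="brush: sql; light: true;">
-- three rows: one before, one inside and one after the intended range
CREATE TEMPORARY TABLE ts_check (`TimeStamp` DATETIME NOT NULL);
INSERT INTO ts_check VALUES
 ('2006-10-06 09:00:00'),
 ('2006-10-06 13:00:00'),
 ('2006-10-06 16:00:00');

-- same bad literal, BETWEEN form
SELECT COUNT(*) FROM ts_check
 WHERE `TimeStamp` BETWEEN '2006-10-06 12:80:00'
                       AND '2006-10-06 15:00:00';
SHOW WARNINGS;

-- same bad literal, explicit comparison form
SELECT COUNT(*) FROM ts_check
 WHERE `TimeStamp` &gt;= '2006-10-06 12:80:00'
   AND `TimeStamp` &lt;= '2006-10-06 15:00:00';
SHOW WARNINGS;
</pre>
<p>If the two counts differ, or SHOW WARNINGS complains about an incorrect datetime value, the literal is the culprit rather than BETWEEN itself.</p>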
<hr /><p>Actually, given my experience here, it&#8217;s probably the case that I should rewrite</p>
<pre class="brush: sql; light: true;">
SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM `crawl-20060820` AS ll
LEFT JOIN `sun-crawl-20060928172854` as rr ON rr.URL = ll.URL
WHERE (
ll.ReversedHost LIKE 'com.sun.%'
OR ll.ReversedHost LIKE 'net.java.%'
)
AND rr.URL IS NULL
AND ll.Status = 'DB_fetched'
GROUP BY ll.URL;
</pre>
<p>as</p>
<pre class="brush: sql; light: true;">
CREATE TEMPORARY TABLE cs
SELECT *
FROM `crawl-20060820`
WHERE ReversedHost LIKE 'com.sun.%'
AND `Status` = 'DB_fetched' ;

CREATE TEMPORARY TABLE cj
SELECT *
FROM `crawl-20060820`
WHERE ReversedHost LIKE 'net.java.%'
AND `Status` = 'DB_fetched';

-- the supplemental tables are filtered by host only: the original query
-- put no Status condition on the right-hand side of the join
CREATE TEMPORARY TABLE ss
SELECT *
FROM `sun-crawl-20060928172854`
WHERE ReversedHost LIKE 'com.sun.%' ;

CREATE TEMPORARY TABLE sj
SELECT *
FROM `sun-crawl-20060928172854`
WHERE ReversedHost LIKE 'net.java.%' ;

-- crawl rows (cs, cj) go on the left and supplemental rows (ss, sj) on the right,
-- so we keep what is in the crawl but NOT in the supplemental crawl
CREATE TABLE `not-in-supplemental-20060820` ( SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM cs as ll
LEFT JOIN ss as rr ON rr.URL = ll.URL
WHERE rr.URL IS NULL
) UNION (
SELECT ll.URL, ll.Status, ll.FetchTime, ll.ModifiedTime, ll.RetriesSinceFetch,
ll.FetchInterval, ll.Score, ll.Signature, ll.Metadata, ll.CrawlID, ll.Host,
ll.ReversedHost, ll.uto, ll.prk, ll.pas, ll.pts
FROM cj as ll
LEFT JOIN sj as rr ON rr.URL = ll.URL
WHERE rr.URL IS NULL ) ;
</pre>
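<p>A caveat on the temporary tables: CREATE TABLE ... SELECT does not copy indexes, so the joins between them run without any index on URL unless you add one.  A minimal sketch, to be run after the four temporary tables exist and before the final CREATE TABLE.  The index name <strong>idx_url</strong> is made up, and the 255-character prefix assumes URL is a string column at least that long; adjust or drop the prefix to match the real column.</p>
<pre class="brush: sql; light: true;">
-- index the join column on every temporary table;
-- URL(255) indexes only the first 255 characters
ALTER TABLE cs ADD INDEX idx_url (URL(255));
ALTER TABLE cj ADD INDEX idx_url (URL(255));
ALTER TABLE ss ADD INDEX idx_url (URL(255));
ALTER TABLE sj ADD INDEX idx_url (URL(255));
</pre>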
]]></content:encoded>
			<wfw:commentRss>http://www.barryspov.com/?feed=rss2&amp;p=67</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ISO Mounter, ISO Creator, ISO Recorder</title>
		<link>http://www.barryspov.com/?p=20</link>
		<comments>http://www.barryspov.com/?p=20#comments</comments>
		<pubDate>Thu, 29 May 2008 13:06:12 +0000</pubDate>
		<dc:creator>Barry A Dobyns</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">08/05/29/0772950</guid>
		<description><![CDATA[pointers to my favorite ISO/DVD/CD manipulation tools on winderz]]></description>
			<content:encoded><![CDATA[<p>I recently found the <a href="http://download.microsoft.com/download/7/b/6/7b6abd84-7841-4978-96f5-bd58df02efa2/winxpvirtualcdcontrolpanel_21.exe">Microsoft Virtual CD Control Panel</a> which is an iso-mounter for Windows. This has worked well for me on both W2k and WinXP on several different computers.</p>
<p>Like the <a href="http://blog.godshell.com/blog/index.php?url=archives/26-Windows-XP-ISO-Mount-Utility.html#feedback">other commenters</a>, I have found some ISOs that won&#8217;t mount with this utility but do mount with (for instance) Alcohol120%.</p>
<p>On the other hand, this is free, Alcohol120% is not, and the Alcohol guys have recently made their licensing much more onerous (the license breaks on my Sony notebook every time I go to use it. grrr).</p>
<p>On the third hand, this is a great free companion to the free  <a href="http://www.lucersoft.com/freeware.php">ISOCreator</a>  and  free <a href="http://isorecorder.alexfeinman.com/isorecorder.htm">ISORecorder</a></p>
<p>Who doesn&#8217;t like free? </p>
<p>On the topic of free, and making things that are not-free into free, we like <a href="http://www.google.com/search?q=dvd.decrypter">DVD Decrypter</a> for making a CSS-free ISO from your protected DVD.</p>
<p>Then you turn that unprotected DVD into a movie you can copy to your <a href="http://apple.com/ipod">iPod</a> with <a href="http://handbrake.fr/">handbrake</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.barryspov.com/?feed=rss2&amp;p=20</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Instant Notification Mashup</title>
		<link>http://www.barryspov.com/?p=19</link>
		<comments>http://www.barryspov.com/?p=19#comments</comments>
		<pubDate>Thu, 15 Jun 2006 16:14:06 +0000</pubDate>
		<dc:creator>Barry A Dobyns</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">06/06/15/2446638</guid>
		<description><![CDATA[ I was tired of hitting the refresh button in my browser to see if the story I knew was going to break]]></description>
			<content:encoded><![CDATA[<p>I was tired of hitting the refresh button in my browser to see if the story I knew was going to break soon had made it to the slashdot front page.  Oh! the agony!</p>
<p>Grant and I figured out a way this morning to do a trivial technology mashup that gives us instant alerts when something happens.  </p>
<ol>
<li> create a gmail account http://gmail.google.com (not your usual one, but one just for this purpose)
<li> forward the gmail account to your cell phone by using http://teleflip.com which turns yur-phn-nmbr@teleflip.com into an sms message to your phone
<li> create a google alert http://www.google.com/alerts with your search string, and set it to &#8220;as it happens&#8221;
</ol>
<p>wait for the love!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.barryspov.com/?feed=rss2&amp;p=19</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>dircaster.php mods</title>
		<link>http://www.barryspov.com/?p=75</link>
		<comments>http://www.barryspov.com/?p=75#comments</comments>
		<pubDate>Tue, 19 Apr 2005 16:28:41 +0000</pubDate>
		<dc:creator>Barry A Dobyns</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.barryspov.com/?p=75</guid>
		<description><![CDATA[I&#8217;ve made a few minor changes to both Ryan King&#8217;s original dircaster.php and Chris Curtis&#8217;s index.php.  But due to the anti-html-code filter of the website of the original author Ryan King, and his refusal to put an email address anywhere on his website, the best I can do is post the [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve made a few minor changes to both <a href="http://www.shadydentist.com/wordpress/software/dircaster">Ryan King&#8217;s original dircaster.php</a> and <a href="http://www.curtisfamily.org.uk/index.php/2005/01/13/77">Chris Curtis&#8217;s index.php</a>.  But due to the anti-html-code filter of the <a href="http://www.shadydentist.com">website</a> of the original author Ryan King, and his refusal to put an email address anywhere on his website, the best I can do is post the changes here.</p>
<p>The key change is to recognize a &#8220;file.txt&#8221; with the same base name as the &#8220;file.mp3&#8221;, use its contents as the description tag in the RSS feed, and include its contents inside the table in index.php.  This implements shownotes, if you will.</p>
<p>I&#8217;ve also implemented separately (outside of dircaster.php and index.php) some code that posts the shownotes from this text file and the mp3 tags into my companion weblog when I post a file, but this (1) does not use the <a href="http://www.xmlrpc.com/metaWeblogApi">metaweblog API</a> and (2) is therefore highly specific to my blog software.  Oh well.</p>
<p>The bulk of the action in dircaster.php is simply this (it replaces what used to be just the target of the else):</p>
<pre class="brush: php; light: true;">
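// Look for show notes in a .txt file with the same base name as the mp3.
// Matching only the trailing extension leaves any other mp3 in the path alone.
$notesfile = preg_replace('/\.mp3$/i', '.txt', $filename);
if (file_exists($notesfile)) {
    // Use the notes as the description: strip markup, then escape for XML.
    $notes = file_get_contents($notesfile);
    echo (&quot;&lt;description&gt;&quot; . htmlspecialchars(strip_tags($notes)) . &quot;&lt;/description&gt;\n&quot;);
} else {
    // No notes file: fall back to the description built from the mp3 tags.
    echo (&quot;&lt;description&gt;$mp3file-&gt;title - $mp3file-&gt;album - $mp3file-&gt;artist&lt;/description&gt;\n&quot;);
}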
</pre>
<p>and the bulk of the action in index.php is similar (replacing the obvious line with this):</p>
<pre class="brush: php; light: true;">
// Build the link once; spaces in the filename become %20 so the URL stays valid.
$mp3url = $rootMP3URL . &quot;/&quot; . htmlentities(str_replace(&quot; &quot;, &quot;%20&quot;, $filename));
echo (&quot;&lt;td&gt;&lt;a href=\&quot;&quot; . $mp3url . &quot;\&quot;&gt;&quot; . $mp3url . &quot;&lt;/a&gt;&quot;);
// Show notes live in a .txt file with the same base name as the mp3.
$notesfile = preg_replace('/\.mp3$/i', '.txt', $filename);
if (file_exists($notesfile)) {
    echo (&quot; &lt;br&gt;\n&quot;);
    readfile($notesfile);  // emit the notes verbatim inside the table cell
}
echo (&quot;&lt;/td&gt;\n&quot;);
</pre>
<p>I also abstracted the common variables into a config.php that the two scripts share. You can fetch my version here: <a href="/podcast/dircaster_0_5.zip">dircaster_0_5.zip</a>.</p>
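<p>For a rough idea of the shape, here is a minimal sketch of what a shared config.php could look like. Only <strong>$rootMP3URL</strong> appears in the code above; the value given to it and the <strong>$mp3Dir</strong> name are assumptions made up for illustration, not the contents of the zip:</p>
<pre class="brush: php; light: true;">
&lt;?php
// config.php: settings shared by dircaster.php and index.php (hypothetical sketch)

// Base URL that the mp3 links are built from (placeholder value).
$rootMP3URL = &quot;http://www.example.com/podcast&quot;;

// Directory scanned for the mp3 files and their .txt notes (assumed name).
$mp3Dir = &quot;.&quot;;
</pre>
<p>Each script would then pull it in near the top with <strong>require_once(&quot;config.php&quot;);</strong> so the two cannot drift apart.</p>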
]]></content:encoded>
			<wfw:commentRss>http://www.barryspov.com/?feed=rss2&amp;p=75</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
