Friday, April 25, 2008

Improve the performance of NOT IN using MySQL

On occasions you have to do something that is bad for MySQL performance... very bad indeed for any data processing performance:

Ask a system to tell you everything that is in one list, that is (or isn't) also in another list.

Most of this is based on the usual rules of searching and sorting. From previous work I have a few links on searching and sorting amongst my links on algorithms. This is a nice application of theory.

The problem

I have had on a number of occasions issues with queries on databases that have not been normalized (I didn't make the databases). The difficulty is if you want to produce a list of names and email addresses (for whatever purpose) from a MySQL database with numerous tables containing email addresses complete with duplications with simple rules like:

  1. I want all emails from "MainTable"where the "SendEmail" field is set to true.
  2. I dont want any of the emails that appear in the "NoList" table.
  3. I dont want any emails that appear in the MixList table where the "WhatHappened" field is set to 'ThisAction' or 'ThatAction'.
  4. ... I could go on but from this you can see its a database that isn't normalised and contains many email addresses and I want to send emails to a particular subset.

So our basic query is like this:

SELECT UserName, EmailAdd FROM MainTable WHERE SendEmail = '1'
AND EmailAdd NOT IN (
SELECT EmailAdd FROM MixList WHERE WhatHappened IN ('ThisAction','ThatAction')
) AS NoListOfEmails
ORDER BY UserName;

As you can see the query isn't particularly complex however if the number of emails in the system is large this will take some time... For example some hours!

How does one improve the performance of this? For a start I realized that the "NoListOfEmails" was larger in many cases than the eventual list. So that is a LOT of comparing. Going over 400,000 email addresses and asking if each one is in a list of say 300,000 is going to be a painful task.

The solution

Here is the clever part:

To improve performance by making the comparisons faster (You will be surprised how much faster) the second (In this case the "no") list is placed in a separate table ordered by email address and indexed on email. So if the email address exists in the new table it will be found instantly.

Simple eh? once you know what you are doing piece of cake.

In short:

  1. Make a new table (in memory) that is ordered and indexed (for super fast access and searching).
  2. Fill it with the "No" list of emails.
  3. Run the query to produce the list of compared emails.
  4. Destroy the temprary table to save the memory it is using.

Here is how I did it:

  1. First given that I had a large number of emails to compare I needed to increase the maximum size of the virtual table (be careful if your system does not have LOTS of memory). The default is "16777216" about 16M. I figured what the hell I'll make it 500M "524288000". For this (as super user) :
    SET GLOBAL max_heap_table_size=524288000;
  2. Next onto making the email table (with email as a Btree indexed primary key):
    CREATE TABLE mails
    email CHAR(200) NOT NULL,
  3. Populate the "no" list:
    INSERT into mails (email)
    SELECT email FROM(
    SELECT EmailAdd FROM NoList
    SELECT EmailAdd FROM MixList WHERE WhatHappened IN ('ThisAction','ThatAction')
    ) AS EmailList
  4. Now to do the comparison:
    SELECT UserName, EmailAdd FROM MainTable WHERE SendEmail = '1'
    AND NOT IN(SELECT email FROM mails) GROUP BY EmailAdd
    ORDER BY UserName;
  5. Last part is to reclaim memory by removing the temporary table:
    DROP TABLE mails;
  6. Then I decided in my case to reduce the max memory table size back down:
    SET GLOBAL max_heap_table_size=16777216;


You can see the limit of memory table with this command:
show variables like 'max_heap_table_size';


Post a Comment

<< Home