Wednesday, January 06, 2010

Weekly CD 2010

So here it is 2010 and my new resolution is "I want to acquire and listen to a new album each week of 2010". And it deserves its own blog (so I can track what I do): http://weeklycd.blogspot.com/

Friday, March 13, 2009

My column had it's integrity constraints violated by a value

This was the line that appeared after the error log entry on SSIS saying "unknown error":
The value violated the integrity constraints for the column.
For some reason it did tickle me. Simple things simple minds.

Thursday, September 04, 2008

How to restore a single table from a mysql dump of a whole database

The problem

I found that I had made a mistake with an SQL change and wanted to roll a table back to the state it was in before. I didn't make an SQL dump of the table before my change (how clever am I) so there I was with a table in a mess and wishing there was an "undo" button on it.

Having the nightly database dump (this table gets changed only once or twice a week) I figured "fantastic, no problem to restore". Realising that the database dump would restore the whole DB (removing today's orders etc...) I searched around and didn't see anything that provided a simple solution (I'm aware that some software has very full features bud didn't want to install anything that big) or one that was easy to do and I felt safe doing (there were scripts but I didn't want to spend too much time fiddling).

The solution

After some thought I went to the database backup and ran grep on it to see if I could extract the bits I wanted.

First I need to find in the 4G worth of dump where my table starts and finishes.

  • -n causes the line no. to be printed
  • the "." is "any character" this allows me to get round obscure characters like "`" being used to separate names.
  • the <tablename> is the name of the table I'm interested in.
grep -n "Table structure for table ." database.dmp grep -n "ALTER TABLE .<tablename>" database.dmp

Now I know where my table begins and ends in the code. I can subtract the start line no. from the end line no. and have the number of lines it takes:

  • -A xx says to print xx (in this case 27) lines after the line has been found. xx is the difference between the two line nos found previously.
  • We still search for the same start string.
  • ">" redirects the output to a text file
  • The name "out-$(date +%F).sql" produces a file with todays date eg: out-2008-09-04.sql
grep -A 27 "DROP TABLE IF EXISTS .<tablename>" database.dmp > out-$(date +%F).sql

Now we (should) have a dump of the table we want. I took it and checked it in a text editor before running the single table dump on the database server (backing up each step of the way and testing on a dev environment first).

Friday, April 25, 2008

Improve the performance of NOT IN using MySQL

On occasions you have to do something that is bad for MySQL performance... very bad indeed for any data processing performance:

Ask a system to tell you everything that is in one list, that is (or isn't) also in another list.

Most of this is based on the usual rules of searching and sorting. From previous work I have a few links on searching and sorting amongst my links on algorithms. This is a nice application of theory.

The problem

I have had on a number of occasions issues with queries on databases that have not been normalized (I didn't make the databases). The difficulty is if you want to produce a list of names and email addresses (for whatever purpose) from a MySQL database with numerous tables containing email addresses complete with duplications with simple rules like:

  1. I want all emails from "MainTable"where the "SendEmail" field is set to true.
  2. I dont want any of the emails that appear in the "NoList" table.
  3. I dont want any emails that appear in the MixList table where the "WhatHappened" field is set to 'ThisAction' or 'ThatAction'.
  4. ... I could go on but from this you can see its a database that isn't normalised and contains many email addresses and I want to send emails to a particular subset.

So our basic query is like this:

SELECT UserName, EmailAdd FROM MainTable WHERE SendEmail = '1'
AND EmailAdd NOT IN (
SELECT EmailAdd FROM NoList
UNION
SELECT EmailAdd FROM MixList WHERE WhatHappened IN ('ThisAction','ThatAction')
) AS NoListOfEmails
GROUP BY EmailAdd
ORDER BY UserName;

As you can see the query isn't particularly complex however if the number of emails in the system is large this will take some time... For example some hours!

How does one improve the performance of this? For a start I realized that the "NoListOfEmails" was larger in many cases than the eventual list. So that is a LOT of comparing. Going over 400,000 email addresses and asking if each one is in a list of say 300,000 is going to be a painful task.

The solution

Here is the clever part:

To improve performance by making the comparisons faster (You will be surprised how much faster) the second (In this case the "no") list is placed in a separate table ordered by email address and indexed on email. So if the email address exists in the new table it will be found instantly.

Simple eh? once you know what you are doing piece of cake.

In short:

  1. Make a new table (in memory) that is ordered and indexed (for super fast access and searching).
  2. Fill it with the "No" list of emails.
  3. Run the query to produce the list of compared emails.
  4. Destroy the temprary table to save the memory it is using.

Here is how I did it:

  1. First given that I had a large number of emails to compare I needed to increase the maximum size of the virtual table (be careful if your system does not have LOTS of memory). The default is "16777216" about 16M. I figured what the hell I'll make it 500M "524288000". For this (as super user) :
    SET GLOBAL max_heap_table_size=524288000;
  2. Next onto making the email table (with email as a Btree indexed primary key):
    CREATE TABLE mails
    (
    email CHAR(200) NOT NULL,
    PRIMARY KEY USING BTREE (email)
    )
    ENGINE = MEMORY;
  3. Populate the "no" list:
    INSERT into mails (email)
    SELECT email FROM(
    SELECT EmailAdd FROM NoList
    UNION
    SELECT EmailAdd FROM MixList WHERE WhatHappened IN ('ThisAction','ThatAction')
    ) AS EmailList
    GROUP BY EmailList.email
    ORDER BY EmailList.email;
  4. Now to do the comparison:
    SELECT UserName, EmailAdd FROM MainTable WHERE SendEmail = '1'
    AND medion.email NOT IN(SELECT email FROM mails) GROUP BY EmailAdd
    ORDER BY UserName;
  5. Last part is to reclaim memory by removing the temporary table:
    DROP TABLE mails;
  6. Then I decided in my case to reduce the max memory table size back down:
    SET GLOBAL max_heap_table_size=16777216;

Notes

You can see the limit of memory table with this command:
show variables like 'max_heap_table_size';

Wednesday, March 05, 2008

You are using a non-compliant browser. Use IE7 instead

Thursday, December 06, 2007

Fast broadband goes underground

OK so Today I saw this article on the BBC site. Detailing H20 networks which plans to:
H20 networks has been in negotiations with water firms for the last five years and began rolling out its fibre-via-sewers network - known as Focus (Fibre Optical Cable Underground Sewer) in 2003.
Now I read that, checked the date and thought "this idea sounds familiar" In fact I had read a similar article on a similar technology called Tisp from Google. Google having a reputation made many people think for a moment... till the coffee kicked in and they noticed the publication date of the Google press release on Tisp.

Monday, October 22, 2007

What makes JSP shit

JSP I'm aware has many good points... Just as a developer new to the language it has one overriding factor that makes it shit. The FM is FS!
  • There is no neat central documentation & reference (like the php quick reference).
  • One example of a common problem in web dev is uploading a file and storing it somewhere. If I were using PHP a quick google for "php upload file mysql" returned me this match "uploading files to mysql database". If i were using PHP this explains all I need to know and I'd be sorted in minuets. and now for JSP "jsp upload file mysql" and the results? Nothing as useful. Several web searches later and cobbling code together (painfully) and I'm still not quite there but will be in an hour.
  • Not to mention the time taken to build and upload .war files.