-
Interesting challenge
We're having to work with some legacy data. The tables in the so-called database seem to have way more NULLs than actual data. One table has 100 or more columns and close to 40,000 rows.
This Db has pretty much 0 normalization present.
IOW, your worst nightmare.
Is there a way we could run a query that returns the total number of cells that contain NULL, and another that returns the total number of data-bearing cells, so we could come up with a percentage or ratio?
-
When you say legacy data, what does the format look like?
You can import it into a SQL database and run all the queries there. It has only 40,000 rows, so the import won't take long if the source file is in a standard, consistent format.
-
If you get the data into MS SQL Server, you can use a combination of sysobjects and syscolumns to write a dynamic query that finds out how many NULLs there are. It's not a trivial thing, but it's not too hard either.
-
Simple solution
Hi Chaz. Over the years I've converted many databases to SQL Server. Often it's better to think simpler. I use Access as my primary data conversion tool for tables of less than say 500,000 rows.
It is easier and faster to create queries and software to manipulate and correct data. You can transfer legacy data to new normalised tables, then upload the tables to SQL Server - or just use link tables. The only thing you have to watch is the use of primary keys when matching old to new (identity) keys.
Anyway, your problem will take about 10 minutes to solve in Access. Simply import/link the data, create an Access query, select each column into the analyser, then use the "Totals/Count" option for each column.
I just tested it for 60+ columns and it works fine (about 3 minutes for me to dummy up and test the solution - took longer to register and submit this comment.)
You take the results, export to Excel (which is of course even simpler for stats analysis), then divide the total by (number of rows * number of columns) - or give a total against each column, etc.
Probably no more than 10 minutes to get the result.
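The arithmetic David describes (total NULLs divided by rows times columns) can be sketched in a few lines of Python. This is just an illustration; the function name and the sample counts are made up for the example, not taken from the actual table.

```python
# Sketch of the "divide the total by (rows * columns)" formula.
# null_counts_per_column: one NULL count per column of the table.
def null_percentage(null_counts_per_column, num_rows):
    total_cells = num_rows * len(null_counts_per_column)
    total_nulls = sum(null_counts_per_column)
    return 100.0 * total_nulls / total_cells

# e.g. 4 columns, 10 rows: 20 NULL cells out of 40 total
print(null_percentage([10, 7, 0, 3], 10))  # -> 50.0
```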
Main point is that SQL Server is not the only tool available to assist in data conversions.
Cheers & Good luck,
David Maxwell
Last edited by davidms; 07-12-2005 at 10:18 PM.
-
Does the so-called DB have a so-called query tool? If so, you should be able to analyze the data where it is.
-
Thanx for all the responses.
The DB in question is on SQL Server 2000. It has to be the worst, gnarliest, most cocked-up DB I've ever seen. I'm amazed that the goofy thing works at all. It would be funny were it not for the importance and sensitivity of the data it contains.
The one table I mentioned has around 38k rows and around 100 columns. The ratio of NULL fields to fields that actually contain data runs over 10:1 (which is why I originally posted asking for a formula). This DB is full of similar tables.
We'd like to make a case for blocking sufficient time and resources for a complete, ground-up rebuild of the Db to properly normalize it, and hopefully improve the performance while reducing the system resources currently required to support it.
-
-- This will generate a list of SQL statements. When those statements are executed, look for results with zero. Those are the columns that have no data.
select 'Select "'+name+'",count(*) as rows from authors where '+name+' is not null' from syscolumns where id=object_id('authors')
-
Please change 'authors' to your table name.
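The code-generation pattern in Mak's script (build one counting SELECT per column name, then run the generated statements) can be mimicked in any language. A hedged Python sketch, with the table and column names invented for the example:

```python
# Build one "count the non-NULL rows" statement per column,
# mirroring the string Mak's script concatenates from syscolumns.
def make_count_statements(table, columns):
    template = 'Select "{col}",count(*) as rows from {tbl} where {col} is not null'
    return [template.format(col=c, tbl=table) for c in columns]

for stmt in make_count_statements("authors", ["au_id", "phone"]):
    print(stmt)
# Select "au_id",count(*) as rows from authors where au_id is not null
# Select "phone",count(*) as rows from authors where phone is not null
```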
-
Expanding Mak's script, you could do:
select 'Select "'+name+'",count(*) as total,count('+name+') as nonNull,count(*)-count('+name+') as Nulls from authors' from syscolumns where id=object_id('authors')
G
-
... or, expanding it (too) much more, to do it all in just one operation:
Code:
declare @ColumnName sysname, @string nvarchar(4000), @TableName sysname

-- working table to collect one row of counts per column
create table #xxx (ColumnName sysname, total int, nonNull int, Nulls int, Ratio decimal(5,2))

set @TableName = 'publishers'

-- cursor over every column of the target table
declare ccc cursor for
    select name from syscolumns where id = object_id(@TableName)
open ccc
fetch next from ccc into @ColumnName
while @@fetch_status = 0
begin
    -- quotename() guards against column names with spaces or reserved words;
    -- decimal(12,2) avoids arithmetic overflow on tables with more than 999 rows
    set @string = 'insert into #xxx
        select ''' + @ColumnName + ''' as ColumnName,
               count(*) as total,
               count(' + quotename(@ColumnName) + ') as nonNull,
               count(*) - count(' + quotename(@ColumnName) + ') as Nulls,
               cast(count(' + quotename(@ColumnName) + ') as decimal(12,2)) /
                   cast(count(*) as decimal(12,2)) as Ratio
        from ' + quotename(@TableName)
    exec sp_executesql @string
    fetch next from ccc into @ColumnName
end
close ccc
deallocate ccc

-- per-column results plus a grand-total row via ROLLUP
select case when grouping(ColumnName) = 1 then 'TOTAL.............'
            else ColumnName
       end as ColumnName,
       sum(total) as total,
       sum(nonNull) as nonNull,
       sum(Nulls) as Nulls,
       cast(avg(Ratio) as decimal(5,2)) as Ratio
from #xxx
group by ColumnName with rollup

drop table #xxx
The results will be:
Code:
ColumnName          total  nonNull  Nulls  Ratio
city                    8        8      0   1.00
country                 8        8      0   1.00
pub_id                  8        8      0   1.00
pub_name                8        8      0   1.00
state                   8        6      2    .75
TOTAL.............     40       38      2    .95
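As a sanity check, the TOTAL row of that ROLLUP output can be reproduced outside SQL. The per-column (nonNull, total) pairs below are copied from the publishers result set above; note the TOTAL ratio is the average of the per-column ratios, matching avg(Ratio):

```python
# Per-column (nonNull, total) pairs from the publishers example above.
columns = {
    "city":     (8, 8),
    "country":  (8, 8),
    "pub_id":   (8, 8),
    "pub_name": (8, 8),
    "state":    (6, 8),
}

total    = sum(t for _, t in columns.values())   # sum(total)
non_null = sum(n for n, _ in columns.values())   # sum(nonNull)
nulls    = total - non_null                      # sum(Nulls)
# avg of the per-column ratios, matching avg(Ratio) in the ROLLUP row
ratio = round(sum(round(n / t, 2) for n, t in columns.values()) / len(columns), 2)

print(total, non_null, nulls, ratio)  # -> 40 38 2 0.95
```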