We have a list of vendors that is 100,000 records long. We only need the vendor names, so I eliminated the duplicates caused by multiple vendor addresses, which brought the list down to 50,000 records.

But there is still duplication. For example, there is a vendor named "John's Business Inc." with its own vendor number, but there is also a "Johns Business Inc" record that lacks the apostrophe and the period after "Inc". Because the name was entered with these differences, a new vendor number was created. But again, all we need are the names.
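To make the pattern concrete, here is a minimal Python sketch of the kind of matching I have in mind. It assumes the names can be exported as plain strings. The function names and the three-item sample list are made up for illustration. It would only catch names that differ by case, punctuation, or spacing, not misspellings or abbreviations.

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Build a comparison key: lowercase, drop punctuation, collapse spaces."""
    key = name.lower()
    key = re.sub(r"[^\w\s]", "", key)  # remove apostrophes, periods, commas, etc.
    key = re.sub(r"\s+", " ", key)     # collapse runs of whitespace
    return key.strip()


def group_by_key(names):
    """Map each comparison key to the original spellings that produce it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)


# Hypothetical sample data, not taken from the real vendor list.
vendors = ["John's Business Inc.", "Johns Business Inc", "Acme Supply"]
groups = group_by_key(vendors)

for key, spellings in groups.items():
    print(key, "<-", spellings)
```

With something like this, the number of distinct keys would be an estimate of how many names actually need searching, and any key with more than one spelling would be a candidate duplicate for a person to confirm.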

So at this point, I think the only way to eliminate these duplicates is for a person to go through the list by hand. But the list is 50,000 records!

Is there any other way to find similar records and eliminate them?

The end use for this list is running searches on every vendor name. We are trying to quantify how many vendor names we actually need to search. Since there are duplicates, the real list may be much shorter than 50,000 records, which would reduce the time we spend on the searches.

Any ideas?