Sorting: the algorithms are good; knowing when to sort is the key question
Many years ago, Bruce ran the data processing at a carpet mill on Dyer Road in Santa Ana. He had many kinds of tape drives, and I used to visit him with tapes so that he could convert them to some other density. Then I would drive to my customer's site and deliver the tape. (We had no internet, and FTP was too primitive.) In return for his service, I gave him free consulting. I gained a service that cost him little more than a few tape mounts; he gained knowledge, which he treasured.
On that particular day, he was concerned about his billing program. For each customer, the program sorted many lines of data. Each line represented a spool of yarn for the carpet on the order. (Later we consolidated the lines.) As the carpet makers put spools into the weaver, the makers entered the code for the spool into the computer. For small orders, his program whizzed through the process. For big orders, like several thousand spools, the program took hours. We looked at his sort routine; it was a good bubble sort.
A bubble sort makes passes through the data and finds a maximum item on each pass. The sort might also put adjacent pairs in order. Its time varies with the square of the number of items: t = k * N * N. Typically, k is some small number that represents the average time per data item; the value might be about one thousandth of a second. Thus, when N = 100, the time would be 100 * 100 / 1000, which yields about ten seconds. A thousand spools took about seventeen minutes, and ten thousand would take nearly twenty-eight hours.
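A minimal sketch of the bubble sort described above, with the early-exit refinement that stops once a pass makes no swaps:

```python
def bubble_sort(items):
    """Sort a list in place with a bubble sort.

    Each pass bubbles the largest remaining item to the end,
    so pass i needs only n - 1 - i comparisons.
    """
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:          # already ordered: stop early
            break
    return items
```

The double loop is where the N * N cost comes from: roughly N passes of roughly N comparisons each.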
In contrast, the quick sort has a time estimate of t = K * N * log N, using base 2. For N = 100, log(100) is about 6.6, so the work is about 660 K, where K covers a more complicated inner step than the bubble sort's, so K may be ten or more times k. For N = 100 the two sorts are therefore close, and the simpler sort can even win. However, with N equal to ten thousand, the time becomes K * 10,000 * log(10,000), which is about 133,000 K; even with K ten times k, that is about 1,330 seconds, or roughly twenty-two minutes, against the bubble sort's twenty-eight hours.
<<chart goes here>>
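The estimates above can be checked with a short script. The constants are the assumed values from the text, not measured costs: k at one thousandth of a second per bubble-sort item, and K at one hundredth of a second, i.e. ten times k.

```python
import math

k = 0.001   # assumed seconds per item, bubble sort (from the text)
K = 0.010   # assumed seconds per item, quick sort (ten times k)

for n in (100, 1_000, 10_000):
    bubble = k * n * n                 # t = k * N * N
    quick = K * n * math.log2(n)       # t = K * N * log2(N)
    print(f"N={n:>6}: bubble ~ {bubble:>9.0f} s, quick ~ {quick:>7.0f} s")
```

At N = 100 the two are within a factor of two; at N = 10,000 the quadratic term has pulled the bubble sort roughly 75 times behind.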
Deltas and Differences
Current sort routines optimize automatically. If the array is small, the routine might choose a simple bubble sort. Somewhere upwards of 200 records, it probably chooses a quick sort. Beyond 100,000 keys, it might choose any of several exotic sorts. The point is that sort routines generally are efficient.
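A toy sketch of that size-based dispatch; the 200-record threshold mirrors the rough figure above, and the function names are illustrative, not drawn from any real library (production sorts, such as CPython's Timsort, make comparable choices internally):

```python
def insertion_sort(items):
    """Simple quadratic sort; cheap for short lists."""
    result = list(items)
    for i in range(1, len(result)):
        j = i
        while j > 0 and result[j - 1] > result[j]:
            result[j - 1], result[j] = result[j], result[j - 1]
            j -= 1
    return result

def adaptive_sort(items):
    """Pick a sort by size, as library routines do internally."""
    if len(items) < 200:          # small: a quadratic sort is fine
        return insertion_sort(items)
    return sorted(items)          # large: the library's N log N sort
```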
When loading an array for a screen, the sort occurs in the SQL that loads the array. These little sorts usually present no load on the system. Nevertheless, a crafty programmer tends to store data in human presentation order; a screen read might then avoid any sort at all.
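A minimal sketch of letting the loading query do the ordering, using an in-memory SQLite table; the table and column names are invented for illustration:

```python
import sqlite3

# The sort happens in the ORDER BY of the loading query,
# so the screen code never sorts at all.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE spools (code TEXT, yards REAL)")
con.executemany("INSERT INTO spools VALUES (?, ?)",
                [("B02", 1200.0), ("A17", 250.0), ("C44", 550.0)])

rows = con.execute(
    "SELECT code, yards FROM spools ORDER BY code").fetchall()
# rows arrive already in presentation order
```

If the table were stored in presentation order in the first place, even this small ORDER BY could be skipped.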
The big challenge comes with massive amounts of data and frequent sorts. One analyst wanted a ranking with each record showing its percent of the grand total. The analyst used one sort-and-report pass to grab the grand total and then a second sort-and-report pass for the actual report. Some deeper analysis revealed that the load before the sort could total a column, so the better implementation grabbed the total during the load and thus sorted only one time.
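A sketch of that improved approach, accumulating the grand total during the load so only one sort remains; the spool records here are invented sample data:

```python
# Hypothetical spool records: (spool_code, yardage).
records = [("A17", 250.0), ("B02", 1200.0), ("C44", 550.0)]

grand_total = 0.0
loaded = []
for code, yards in records:      # one pass: load and total together
    grand_total += yards
    loaded.append((code, yards))

loaded.sort(key=lambda r: r[1], reverse=True)   # the single sort

for code, yards in loaded:
    print(f"{code}: {yards:8.1f}  ({100 * yards / grand_total:5.1f}%)")
```

The first version sorted the data twice for one report; this one pays for the sort exactly once.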
Similarly, when comparing two very large files, using delta files might be faster, especially for on-going comparisons that occur daily. Generate the first ....
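One way to sketch the delta idea, assuming each file holds one key per line: instead of re-comparing two full files every day, keep yesterday's key set and compute only what was added or removed.

```python
def load_keys(lines):
    """Build a set of keys, assuming one key per line."""
    return set(line.strip() for line in lines if line.strip())

# Stand-ins for yesterday's and today's files.
yesterday = load_keys(["order-1", "order-2", "order-3"])
today     = load_keys(["order-2", "order-3", "order-4"])

added   = today - yesterday    # new since yesterday
removed = yesterday - today    # gone since yesterday

print("added:", sorted(added))
print("removed:", sorted(removed))
```

The daily work then scales with the size of the delta rather than the size of the full files.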