Format a unix file on Teradata

I have a requirement to format a file, and I am noticing a bug and an inefficiency in our current approach. Can somebody give some suggestions?

Input data (column headers and rows)
-------------

current_licl_nbr| policy_id | plociyhold_id|mail_allowed_id|email_address_txt
--------------------
701000002990.| 200000000175.| 200000000175.| 2|xyz@abcd.ATT.NET
 

The output data should look like this:

701000002990|200000000175|200000000175|2|XYZ@ABCD.COM

The commands we currently use have a bug and are very inefficient. They generate the output below,
in which there is no dot between ABCD and COM.

Current output generated
------------------------
701000002990|200000000175|200000000175|2|XYZ@ABCDCOM

The code we have in the script is

csplit -ks -f ${DWH_OUT}/other/a1prefix ${DWH_OUT}/other/a1_xxxx.tmp 3
cat ${DWH_OUT}/other/a1prefix01|sed -e 's/ //g' -e 's/\.//g' >${DWH_OUT}/other/a1_xxxx.tmp

The amount of data being formatted is around 6,000,000 rows.
Does anybody have a suggestion for fixing the bug in an efficient manner?
 

Have you thought about using Perl?
I noticed that your desired output has 2 additional changes made that your code doesn't show.
1) changed the line to uppercase
2) changed @abcd.ATT.NET to @ABCD.COM
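
As for the missing dot itself: it comes from the second sed expression, s/\.//g, which deletes every dot on the line, including the one inside the email domain. Here is a minimal reproduction on your sample row, plus a narrower expression that only deletes a dot when a pipe follows it. The variable names (row, buggy, fixed) are just for illustration and are not part of your script.

```shell
# Sample row copied from the question.
row='701000002990.| 200000000175.| 200000000175.| 2|xyz@abcd.ATT.NET'

# Original expressions: strip spaces, then strip EVERY dot on the line.
buggy=$(printf '%s\n' "$row" | sed -e 's/ //g' -e 's/\.//g')

# Narrower second expression: strip a dot only when a pipe follows it.
fixed=$(printf '%s\n' "$row" | sed -e 's/ //g' -e 's/\.|/|/g')

# Print both results so the difference in the email field is visible.
printf '%s\n' "$buggy" "$fixed"
```

The first line printed loses the dots in the email address; the second keeps them.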

Here are 2 variations of 1 Perl solution (with Perl, "there is always more than 1 way to do anything"). They do not make the two changes I noted above, but those could easily be added.

from command line:

perl -pe "s/\.\|\s*/|/g" input.txt > output.txt

or

perl -pi -e "s/\.\|\s*/|/g" input.txt

The second one edits the original file in place.

I ran a benchmark test on a 6,000,000 line file and it took between 120 and 130 seconds to complete on a slow Windows PII 550 machine.

----------------------------------

For your reference, here are the complete scripts that I used to test/benchmark.

script to create the source file:
#!/usr/bin/perl -w

open OUT, ">braveking.txt" or die $!;

# Write 6,000,000 copies of the sample row.
for (1..6000000) {
    print OUT "701000002990.| 200000000175.| 200000000175.| 2|xyz\@abcd.ATT.NET\n";
}
close OUT;
 

benchmark script:
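#!/usr/bin/perl -w

use Time::HiRes 'time';

# Run the substitution 50 times over the whole file and time each pass.
for $i (1..50) {
    open IN, "<braveking.txt" or die $!;
    open OUT, ">reformat.txt" or die $!;
    $start = time;
    while (<IN>) {
        s/\.\|\s*/|/g;   # drop the dot before each pipe and any whitespace after it
        print OUT;
    }
    $delta = time - $start;
    printf "Loop %d took %.2f seconds\n", $i, $delta;
    close IN;
    close OUT;
}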
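
If you would rather stay with sed, the csplit and cat steps can probably be folded into a single pass. The sketch below assumes the export always starts with exactly two header lines (the column names and the dashed separator, which is what csplit ... 3 was splitting off) and that no field legitimately contains a space. format_rows is a made-up name for illustration, and this variant has not been benchmarked here. It also uppercases the line, which is one of the two extra changes noted earlier; it leaves the email domain alone.

```shell
# format_rows is a hypothetical helper name, not part of the original script.
# It reads the raw export on stdin and writes the cleaned rows to stdout.
format_rows() {
    # 1,2d       drop the header line and the dashed separator line
    # s/ //g     remove all spaces (assumes no field needs them)
    # s/\.|/|/g  remove a dot only when it sits directly before a pipe
    sed -e '1,2d' -e 's/ //g' -e 's/\.|/|/g' |
        tr '[:lower:]' '[:upper:]'
}

# Small demonstration using the sample from the question.
printf '%s\n' \
    'current_licl_nbr| policy_id | plociyhold_id|mail_allowed_id|email_address_txt' \
    '--------------------' \
    '701000002990.| 200000000175.| 200000000175.| 2|xyz@abcd.ATT.NET' |
    format_rows
```

In the real script this would be called with the input redirected from the raw file and the output sent to a different file, since reading and writing the same file in one pipeline would truncate it before sed reads it.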
