I have a requirement to format a file and am noticing a bug and an inefficiency. Can somebody give some suggestions?
Input data (column headers and rows)
-------------
current_licl_nbr| policy_id | plociyhold_id|mail_allowed_id|email_address_txt
--------------------
701000002990.| 200000000175.| 200000000175.| 2|xyz@abcd.ATT.NET
output data should look like
701000002990|200000000175|200000000175|2|XYZ@ABCD.COM
The commands we are currently using have a bug: they generate the
output shown below, and they are very inefficient.
There is no dot between ABCD and COM.
Current output generated
------------------------
701000002990|200000000175|200000000175|2|XYZ@ABCDCOM
The code we have in the script is
csplit -ks -f ${DWH_OUT}/other/a1prefix ${DWH_OUT}/other/a1_xxxx.tmp 3
cat ${DWH_OUT}/other/a1prefix01|sed -e 's/ //g' -e 's/\.//g' >${DWH_OUT}/other/a1_xxxx.tmp
The amount of data that is being formatted would be around 6,000,000.
Does anybody have a suggestion to fix the bug in an efficient manner?
Have you thought about using Perl?
I noticed that your desired output has 2 additional changes made that
your code doesn't show.
1) changed the line to uppercase
2) changed @abcd.ATT.NET to @ABCD.COM
Here are 2 variations of 1 Perl solution (with Perl "there is always more than 1 way to do anything"). These will not make the 2 changes I noted above, but those could easily be added.
from command line:
perl -pe "s/\.\|\s*/|/g" input.txt > output.txt
or
perl -pi -e "s/\.\|\s*/|/g" input.txt
The second one does an inline edit of the original file.
I ran a benchmark test on a 6,000,000 line file and it took between 120 and 130 seconds to complete on a slow Windows PII 550 machine.
----------------------------------
For your reference, here are the complete scripts that I used to test/benchmark.
script to create the source file:
#!/usr/bin/perl -w
# Write 6,000,000 copies of the sample row to braveking.txt.
open OUT, ">braveking.txt" or die $!;
for (1..6000000) {
    print OUT "701000002990.| 200000000175.| 200000000175.| 2|xyz\@abcd.ATT.NET\n";
}
close OUT or die $!;
benchmark script:
#!/usr/bin/perl -w
use Time::HiRes 'time';
# Run the substitution over the whole file 50 times, timing each pass.
for $i (1..50) {
    open IN, "<braveking.txt" or die $!;
    open OUT, ">reformat.txt" or die $!;
    $start = time;
    while (<IN>) {
        s/\.\|\s*/|/g;   # replace ".|" plus any following whitespace with "|"
        print OUT;
    }
    $delta = time - $start;
    printf "Loop $i took %.2f seconds\n", $delta;
}