I am looking for a script/command to identify duplicate records by
certain columns in a given file and write them to an other file.
I would use the unix sort command and use the -u option to eliminate
duplicates.
Can you specify columns in sort -u ? Could u please let me know the
syntax for the following example
123456|55555|678
234567|55555|654
123456|66666|745
345678|77777|654
So when if i want to find out duplicates in column one and want the
following results in the output file listing the duplicates.
123456|55555|678
123456|66666|745
That would eliminate duplicates based on entire line. I think he wants
to output all duplicate lines, with the duplication based not on the entire
line, but only on certain columns.
Is that correct? How many and which columns, and how are the columns
delimited, etc.
They are delimited by pipe charcters. The above example suits my requirements.
To find out duplicates, let's go with field 1. But here is what I plan
to do seperately
1. Check duplicates on field 1
2. Check duplicates on field 3
3. Check duplicates for combined field for feild1 and field3
Lot of ways to do this. My solution involves sorting by desired key(s), then feed the sorted file into awk. awk will hold each line by one cycle so that it will know if it is dealing with single or multiple lines (per key), and print the group or not. First solution below finds duplicates based on field 3. Change all five 3's in that script to 1's to find duplicates on field 1:
#!/bin/sh
sort -t\| -k3,3 in.txt |
awk -F\| 'BEGIN {\
getline
holdline=$0
holdkey=$3
k=1}
function flushhold() {\
if (k>1)
????print holdline
#else
# ???print "UNIQUE: " holdline
}
{
if ($3==holdkey)
??{k++
???flushhold()}
else
# closeout current key, start new key
??{flushhold()
???holdkey=$3
???k=1}
holdline=$0
}
END {flushhold()}' > duplicates.txt
Use same logic to check for duplicates based upon
multiple fields. ?Change 4 lines in the script above:
sort -t\| -k3,3 -k1,1 in.txt |
holdkey=$3 $1
if ($3 $1==holdkey)
???holdkey=$3 $1
For awk, expressions separated by space means concatenation.
You will notice in the flushhold function that this script could just as easily output only the unique lines instead of the duplicates, or both. ?And if outputting both, you could output unique lines to one file and duplicate lines to another.
Quick Links:
Do you have
a UNIX Question?
Unix Home: Unix System Administration
Hints and Tips