Identify Duplicate Records in UNIX

Identify Duplicate Records in UNIX

I am looking for a script/command to identify duplicate records by certain columns in a given file and write them to an other file.

I would use the unix sort command and use the -u option to eliminate duplicates.

Can you specify columns in sort -u ? Could u please let me know the syntax for the following example
123456|55555|678
234567|55555|654
123456|66666|745
345678|77777|654

So when if i want to find out duplicates in column one and want the following results in the output file listing the duplicates.
123456|55555|678
123456|66666|745

That would eliminate duplicates based on entire line. I think he wants to output all duplicate lines, with the duplication based not on the entire line, but only on certain columns.
Is that correct? How many and which columns, and how are the columns delimited, etc.

They are delimited by pipe charcters. The above example suits my requirements.
To find out duplicates, let's go with field 1. But here is what I plan to do seperately

1. Check duplicates on field 1
2. Check duplicates on field 3
3. Check duplicates for combined field for feild1 and field3

Lot of ways to do this. My solution involves sorting by desired key(s), then feed the sorted file into awk. awk will hold each line by one cycle so that it will know if it is dealing with single or multiple lines (per key), and print the group or not. First solution below finds duplicates based on field 3. Change all five 3's in that script to 1's to find duplicates on field 1:

#!/bin/sh

sort -t\| -k3,3 in.txt |
awk -F\| 'BEGIN {\
getline
holdline=$0
holdkey=$3
k=1}

function flushhold() {\
if (k>1)
????print holdline
#else
# ???print "UNIQUE: " holdline
}

{
if ($3==holdkey)
??{k++
???flushhold()}
else
# closeout current key, start new key
??{flushhold()
???holdkey=$3
???k=1}
holdline=$0
}
END {flushhold()}' > duplicates.txt

Use same logic to check for duplicates based upon
multiple fields. Change 4 lines in the script above:

sort -t\| -k3,3 -k1,1 in.txt |

holdkey=$3 $1

if ($3 $1==holdkey)

???holdkey=$3 $1

For awk, expressions separated by space means concatenation.

You will notice in the flushhold function that this script could just as easily output only the unique lines instead of the duplicates, or both. And if outputting both, you could output unique lines to one file and duplicate lines to another.

Have a Unix Problem
Do you have a UNIX Question?

Books About Unix
UNIX Programming, Certification, System Administration, Performance Tuning Reference Books

Unix Home: Unix System Administration Hints and Tips

(c) www.gotothings.com All material on this site is Copyright.
Every effort is made to ensure the content integrity. Information used on this site is at your own risk.
All product names are trademarks of their respective companies. The site www.gotothings.com is in no way affiliated with SAP AG.
Any unauthorised copying or mirroring is prohibited.