Friday, July 11, 2014

ElasticSearch De-duplicator Script

So I've been dabbling in ElasticSearch. Very cool software. More stuff to come.

I did run across an issue where I accidentally imported a bunch of duplicates. Since it was a re-import, the duplicates all had different ES IDs, and there was no rhyme or reason to their order, so there was no easy way to remove them. I figured surely ES had a built-in way to remove them, maybe based on a field. Every document in my ES index has a unique GUID, so essentially I needed to delete duplicates based on a GUID stored in a field. After a lot of researching and cursing the internet godz, I made a little shell script that did it for me. Is it fast? Absolutely...not. It's very slow. But I made it so it can just run in the background against a given day's index from Logstash. My importing produces the occasional duplicate, so this has helped me manage it while I track down that issue.

So the main requirement is that your documents have a field you can search by that is unique to each document. For example, if you fed it a field that holds email addresses, this script would delete every document with a matching email address except one.
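
If you want to spot-check a value before letting the script loose, a quick count query will tell you how many documents share it. This is just a sketch against my setup (the 10.60.0.82 host, a daily Logstash index, and the serial.raw field); the GUID is a placeholder:

curl -s 'http://10.60.0.82:9200/logstash-2014.07.11/_count?q=serial.raw:YOUR-GUID-HERE'

Anything higher than 1 in the returned count means that GUID has duplicates.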


It's quick and dirty but it works.


#! /bin/bash
# elasticsearch de-duplicator

# Get today's date for the Logstash index name
index=$(date +%Y.%m.%d)

# Or specify a date manually
if [[ -n $1 ]]
then
index=$1
fi

# Loop through deleting duplicates
while true
do
    # List the top 50 serials by document count. The serial.raw field is what it searches on.
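    # The grep keeps only serials that appear more than once, and the sed strips the JSON down to just the serial value.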
    curl --silent -XGET 'http://10.60.0.82:9200/logstash-'$index'/_search' -d '{"facets": {"terms": {"terms": {"field": "serial.raw","size": 50,"order": "count","exclude": []}}}}'|grep -o "\"term\":\"[a-f0-9-]\+\",\"count\":\([^1]\|[1-9][0-9]\)"|sed 's/\"term\":\"\([a-f0-9-]\+\)\",\"count\":\([0-9]\+\)/\1/' > /tmp/permadedupe.serials
    # If no serials over 1 count were found, sleep
    serialdupes=$(wc -l /tmp/permadedupe.serials|awk '{print $1}')
    if [[ $serialdupes -eq 0 ]]
    then
        echo "`date`__________No duplicate serials found in $index" >> /home/logstash/dedupe.log
        sleep 300
        # The index has to be re-specified in case the date rolls over into another day
        index=$(date +%Y.%m.%d)
        if [[ -n $1 ]]
        then
            index=$1
        fi
      
    else
        # For serials with a count greater than 1, delete the duplicates but leave one copy
        for serial in `cat /tmp/permadedupe.serials`
        do
            # Get the IDs of all the duplicated serials
            curl --silent -XGET 'http://10.60.0.82:9200/logstash-'$index'/_search?q=serial.raw:'$serial''|grep -o "\",\"_id\":\"[A-Za-z0-9_-]\+\""|awk -F\" '{print $5}' > /tmp/permadedupelist
            # Delete the top line since you want to keep 1 copy
            sed -i '1d' /tmp/permadedupelist
            # IDs that start with a hyphen have to be escaped or the delete query chokes on them
            sed -i 's/^-/\\-/' /tmp/permadedupelist
            # If duplicates do exist delete them
            for line in `cat /tmp/permadedupelist`
            do
                curl --silent -XDELETE 'http://10.60.0.82:9200/logstash-'$index'/_query?q=_id:'$line'' &> /dev/null
                echo "`date`__________ID:$line serial:$serial index:$index" >> /home/logstash/dedupe.log
            done
        done
    fi
done
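
To keep an index clean I just let it run in the background. Something like this, assuming you saved the script as dedupe.sh (the name is up to you):

./dedupe.sh &              # dedupe today's Logstash index
./dedupe.sh 2014.07.10 &   # or pass a specific day's index

Everything it deletes gets logged to /home/logstash/dedupe.log, so you can tail that to keep an eye on it.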

