Friday, July 11, 2014

ElasticSearch De-duplicator Script

So I've been dabbling in ElasticSearch. Very cool software. More stuff to come.

I did run across an issue where I accidentally imported a bunch of duplicates. Since it was a re-import, the duplicates all had different ES IDs, and there was no rhyme or reason to their order, so there was no easy way to remove them. I figured surely ES had a built-in way to remove them, maybe based on a field. Every document in my ES index has a unique GUID, so essentially I needed to delete duplicates based on a GUID stored in a field. After a lot of researching and cursing the internet godz, I made a little shell script that did it for me. Is it fast? Absolutely...not. It's very slow. But I made it so it can just run in the background against a given day's index from Logstash. My importing produces the occasional duplicate, so this has helped me manage it while I track down that issue.

So the main requirement is that your documents have a field you can search by that is unique to each document. For example, if you fed it a field that holds email addresses, this script would delete every document with a matching email address except one.
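
If you want to spot-check a value before letting the script loose, a quick count query will tell you how many documents share it. This is just a sketch against my setup (the 10.60.0.82 host, a daily Logstash index, and the serial.raw field); the GUID is a placeholder:

curl -s 'http://10.60.0.82:9200/logstash-2014.07.11/_count?q=serial.raw:YOUR-GUID-HERE'

Anything higher than 1 in the returned count means that GUID has duplicates.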


It's quick and dirty but it works.


#! /bin/bash
# elasticsearch de-duplicator

# Get today's date for the Logstash index name
index=$(date +%Y.%m.%d)

# Or specify a date manually
if [[ -n $1 ]]
then
index=$1
fi

# Loop through deleting duplicates
while true
do
    # List the top 50 serials by document count. The serial.raw field is what it searches on.
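    # The grep keeps only serials that appear more than once, and the sed strips the JSON down to just the serial value.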
    curl --silent -XGET 'http://10.60.0.82:9200/logstash-'$index'/_search' -d '{"facets": {"terms": {"terms": {"field": "serial.raw","size": 50,"order": "count","exclude": []}}}}'|grep -o "\"term\":\"[a-f0-9-]\+\",\"count\":\([^1]\|[1-9][0-9]\)"|sed 's/\"term\":\"\([a-f0-9-]\+\)\",\"count\":\([0-9]\+\)/\1/' > /tmp/permadedupe.serials
    # If no serials over 1 count were found, sleep
    serialdupes=$(wc -l /tmp/permadedupe.serials|awk '{print $1}')
    if [[ $serialdupes -eq 0 ]]
    then
        echo "`date`__________No duplicate serials found in $index" >> /home/logstash/dedupe.log
        sleep 300
        # The index has to be re-specified in case the date rolls over into another day
        index=$(date +%Y.%m.%d)
        if [[ -n $1 ]]
        then
            index=$1
        fi
      
    else
        # For serials with a count greater than 1, delete the duplicates but leave one copy
        for serial in `cat /tmp/permadedupe.serials`
        do
            # Get the IDs of all the duplicated serials
            curl --silent -XGET 'http://10.60.0.82:9200/logstash-'$index'/_search?q=serial.raw:'$serial''|grep -o "\",\"_id\":\"[A-Za-z0-9_-]\+\""|awk -F\" '{print $5}' > /tmp/permadedupelist
            # Delete the top line since you want to keep 1 copy
            sed -i '1d' /tmp/permadedupelist
            # IDs that start with a hyphen have to be escaped or the delete query chokes on them
            sed -i 's/^-/\\-/' /tmp/permadedupelist
            # If duplicates do exist delete them
            for line in `cat /tmp/permadedupelist`
            do
                curl --silent -XDELETE 'http://10.60.0.82:9200/logstash-'$index'/_query?q=_id:'$line'' &> /dev/null
                echo "`date`__________ID:$line serial:$serial index:$index" >> /home/logstash/dedupe.log
            done
        done
    fi
done
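
To keep an index clean I just let it run in the background. Something like this, assuming you saved the script as dedupe.sh (the name is up to you):

./dedupe.sh &              # dedupe today's Logstash index
./dedupe.sh 2014.07.10 &   # or pass a specific day's index

Everything it deletes gets logged to /home/logstash/dedupe.log, so you can tail that to keep an eye on it.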

