So I've been dabbling in ElasticSearch. Very cool software. More stuff to come.
I did run across an issue where I accidentally imported a bunch of duplicates. Since it was a re-import, these duplicates had different ES IDs, and there was no rhyme or reason to their order, so there was no easy way to remove them. I thought surely ES had a built-in way to remove them, maybe based on a field. Every document in my ES index has a unique GUID, so essentially I needed to delete duplicates based on a GUID in a field. After a lot of researching and cursing the internet godz, I made a little shell script that did it for me. Is it fast? Absolutely...not. Very slow. But I made it so it can just run in the background against a given day's index from Logstash. My importing produces the occasional duplicate, so this has helped me manage it while I track down that issue.
So the main requirement is that there's a field in your documents that you can search by and that is unique to each document. For example, if you fed it a field that holds email addresses, this script would delete all documents sharing the same email except one.
It's quick and dirty but it works.
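Just for context, here's roughly what one of my documents looks like when it's indexed. The index name, type, and serial value here are made-up examples, but the serial field is the real thing the script keys on:

curl -XPOST 'http://10.60.0.82:9200/logstash-2014.01.15/logs' -d '{
  "@timestamp": "2014-01-15T12:00:00Z",
  "serial": "3f2b8c1e-0000-1111-2222-333344445555",
  "message": "some event"
}'

If that same event gets re-imported, ES assigns it a new _id but the serial stays the same, which is what makes it possible to find and delete the duplicates.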
#! /bin/bash
# elasticsearch de-duplicator
# Get today's date for the Logstash index name
index=$(date +%Y.%m.%d)
# Or specify a date manually
if [[ -n $1 ]]
then
index=$1
fi
# Loop through deleting duplicates
while true
do
# This lists 50 unique serials by count. serial.raw is the field it looks in.
curl -s -XGET 'http://10.60.0.82:9200/logstash-'$index'/_search' -d '{"facets": {"terms": {"terms": {"field": "serial.raw","size": 50,"order": "count","exclude": []}}}}'|grep -o "\"term\":\"[a-f0-9-]\+\",\"count\":\([^1]\|[1-9][0-9]\)"|sed 's/\"term\":\"\([a-f0-9-]\+\)\",\"count\":\([0-9]\+\)/\1/' > /tmp/permadedupe.serials
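# For reference, the facet response from that query looks roughly like this (example values):
#   {"facets":{"terms":{"_type":"terms","terms":[{"term":"3f2b8c1e-0000","count":3},{"term":"9a1c2d3e-ffff","count":1}]}}}
# so the grep/sed pair above pulls out only the terms whose count is greater than 1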
# If no serials with a count over 1 were found, sleep
serialdupes=$(wc -l /tmp/permadedupe.serials|awk '{print $1}')
if [[ $serialdupes -eq 0 ]]
then
echo "`date`__________No duplicate serials found in $index" >> /home/logstash/dedupe.log
sleep 300
# The index has to be respecified in case it rolls over to another day
index=$(date +%Y.%m.%d)
if [[ -n $1 ]]
then
index=$1
fi
else
# For serials with a count greater than 1, delete the duplicates but leave 1
for serial in `cat /tmp/permadedupe.serials`
do
# Get the IDs of all documents with the duplicated serial
curl -s -XGET 'http://10.60.0.82:9200/logstash-'$index'/_search?q=serial.raw:'$serial''|grep -o "\",\"_id\":\"[A-Za-z0-9_-]\+\""|awk -F\" '{print $5}' > /tmp/permadedupelist
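# Each grep match looks something like ","_id":"AUabc123XYZ" (made-up ID),
# and the awk grabs the ID itself as the 5th double-quote-separated field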
# Delete the top line since you want to keep 1 copy
sed -i '1d' /tmp/permadedupelist
# The query string doesn't like an ID starting with a hyphen unless it is escaped
sed -i 's/^-/\\-/' /tmp/permadedupelist
# If duplicates do exist, delete them by ID
for line in `cat /tmp/permadedupelist`
do
curl -s -XDELETE 'http://10.60.0.82:9200/logstash-'$index'/_query?q=_id:'$line'' &> /dev/null
echo "`date`__________ID:$line serial:$serial index:$index" >> /home/logstash/dedupe.log
done
done
fi
done
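To use it, I drop the script somewhere like /home/logstash/dedupe.sh (that path is just an example), make it executable, and let it run in the background, either against today's index or a specific day:

chmod +x /home/logstash/dedupe.sh
# today's logstash-YYYY.MM.DD index
nohup /home/logstash/dedupe.sh &
# or a specific day's index, e.g. logstash-2014.01.15
nohup /home/logstash/dedupe.sh 2014.01.15 &

Everything it deletes gets logged to /home/logstash/dedupe.log so you can see what it's been up to.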