Last updated: July 27, 2024
Elasticsearch, a powerful distributed search and analytics engine, excels at ingesting and querying vast amounts of data. However, there comes a time when data needs to be removed, whether for compliance, storage optimization, or data accuracy reasons.
In this tutorial, we explore various methods for removing data from Elasticsearch, ranging from deleting individual documents to managing large-scale deletions in production environments.
To begin with, Elasticsearch provides several ways to remove individual documents from an index.
Perhaps the simplest way to remove a single document from Elasticsearch is by using the Delete API. This method is ideal when we know the exact document ID and index name:
$ curl -X DELETE "localhost:9200/customers/_doc/1"
{"_index":"customers","_id":"1","_version":3,"result":"deleted","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":20,"_primary_term":1}
In this example, customers is the name of the index and 1 is the ID of the document we want to delete.
When we execute this command, Elasticsearch attempts to delete the document with ID 1 from the customers index. If the document exists and is successfully deleted, Elasticsearch returns a JSON response whose result field is deleted, as shown above. If no such document exists, the result is not_found and the HTTP status is 404.
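As a minimal sketch, the same call could be prepared from Python with only the standard library. The helper names build_delete_request and was_deleted are illustrative assumptions, as is the local base URL taken from the curl example; the check relies only on the result field of the response shown above.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:9200"  # assumed local node, as in the curl example


def build_delete_request(index, doc_id, base_url=BASE_URL):
    """Build (but do not send) a DELETE request for a single document."""
    # Percent-encode the ID so characters such as '/' cannot change the path.
    url = f"{base_url}/{index}/_doc/{quote(str(doc_id), safe='')}"
    return urllib.request.Request(url, method="DELETE")


def was_deleted(response_body):
    """Return True if a Delete API response body reports result == 'deleted'."""
    return json.loads(response_body).get("result") == "deleted"


if __name__ == "__main__":
    request = build_delete_request("customers", "1")
    print(request.get_method(), request.full_url)
    # Sending the request needs a running cluster:
    # with urllib.request.urlopen(request) as resp:
    #     print(was_deleted(resp.read()))
```

Note that urllib.request.urlopen raises an HTTPError for a 404, so a missing document would surface as an exception rather than as a not_found result.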
On the other hand, when we need to delete multiple documents that match certain criteria, the Delete By Query API is more efficient. This method enables the removal of documents based on a query, similar to how we would search for documents:
$ curl -X POST "localhost:9200/customers/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "last_purchase_date": {
        "lt": "now-1y"
      }
    }
  }
}'
{"took":258,"timed_out":false,"total":4,"deleted":4,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}
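Let’s break down this example: the range query matches every document whose last_purchase_date is earlier than (lt) now-1y, that is, one year before the current time. In the response, total and deleted show that four documents matched and all four were removed.
In short, this query deletes all customer documents where the last purchase was more than a year ago, which makes it an efficient way to remove outdated or irrelevant data based on specific criteria.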
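However, there are some things we should note when using Delete By Query. The operation works on a snapshot of the index taken when the request starts, so documents that change in the meantime cause version conflicts, and on a large index it can put significant load on the cluster.
To manage that load, we can limit the number of documents deleted in a single operation with the max_docs parameter (called size in older releases) and throttle the request with requests_per_second.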
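The following sketch shows how such a capped and throttled request might be assembled, assuming a recent Elasticsearch version where the cap is named max_docs. The function delete_by_query_request and the values 1000 and 500 are hypothetical, chosen only for illustration.

```python
import json
from urllib.parse import urlencode


def delete_by_query_request(index, query, max_docs=None, requests_per_second=None):
    """Return (path, body) for a Delete By Query call with optional limits."""
    body = {"query": query}
    if max_docs is not None:
        # Cap on how many matching documents this single call may delete.
        body["max_docs"] = max_docs
    params = {}
    if requests_per_second is not None:
        # Throttle, passed as a URL query parameter.
        params["requests_per_second"] = requests_per_second
    path = f"/{index}/_delete_by_query"
    if params:
        path += "?" + urlencode(params)
    return path, json.dumps(body)


if __name__ == "__main__":
    old_purchases = {"range": {"last_purchase_date": {"lt": "now-1y"}}}
    path, body = delete_by_query_request(
        "customers", old_purchases, max_docs=1000, requests_per_second=500
    )
    print("POST", path)
    print(body)
```

Running the call repeatedly until the deleted count in the response drops to zero would be one way to work through a large backlog in small batches.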
When we already know the IDs of a large number of documents to delete, bulk operations can significantly improve performance. The Bulk API performs multiple delete operations in a single request, thus reducing network overhead and improving overall efficiency.
Let’s see an example of how to use the Bulk API for deletions using Python with the Elasticsearch client:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def generate_actions(inactive_customer_ids):
    for customer_id in inactive_customer_ids:
        yield {
            "_op_type": "delete",
            "_index": "customers",
            "_id": customer_id
        }

inactive_customer_ids = ["3", "5", "8"]
response = helpers.bulk(es, generate_actions(inactive_customer_ids))
print(f"Deleted {response[0]} documents")
First, we create an Elasticsearch client instance, connecting to the local Elasticsearch server. Then, we define a generator function generate_actions that yields delete actions for each customer ID. After that, we create a list of inactive customer IDs. In a real scenario, such a list might come from a database query or another data source.
Subsequently, we use the helpers.bulk() function to perform the bulk delete operation. It returns a tuple whose first element is the number of successfully executed actions, so finally we print response[0] as the number of documents deleted.
Now, let’s run the script:
$ python3 bulk-removal.py
Deleted 3 documents
The Bulk API is more efficient than sending individual delete requests for each document because it reduces the number of network round trips to the Elasticsearch cluster as well as the per-request processing overhead inside the cluster.
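To make the single-request idea more concrete, here is a small sketch of the newline-delimited JSON body that a bulk delete request carries: one action line per document and no source line, since deletes have no payload. The function name bulk_delete_body is an assumption for illustration and isn't part of the client library.

```python
import json


def bulk_delete_body(index, doc_ids):
    """Build the NDJSON payload for a _bulk request that only deletes documents."""
    # Each delete action is a single JSON line; every line, including the
    # last one, must end with a newline character.
    return "".join(
        json.dumps({"delete": {"_index": index, "_id": doc_id}}) + "\n"
        for doc_id in doc_ids
    )


if __name__ == "__main__":
    print(bulk_delete_body("customers", ["3", "5", "8"]), end="")
```

A payload like this would be sent with POST to the _bulk endpoint using the Content-Type application/x-ndjson, which is roughly what the helper does on our behalf.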
In addition to document-level operations, sometimes we might need to remove larger chunks of data. In such cases, index-level operations can be more efficient.
If we need to remove all data from an index, deleting the entire index is the fastest approach:
$ curl -X DELETE "localhost:9200/customers"
{"acknowledged":true}
This command deletes the customers index and all its data. Notably, it’s an extremely fast operation but it’s also irreversible.
This method is useful when we manage time-based indices and want to remove old data. For example, this is a common way to delete the log index from last month.
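As a rough illustration of the time-based case, the sketch below derives the name of last month's index so that it can be passed to the delete call. The naming scheme prefix-YYYY.MM and the function previous_month_index are assumptions; real clusters may name their indices differently.

```python
from datetime import date


def previous_month_index(prefix, today):
    """Name of last month's index, assuming a '<prefix>-YYYY.MM' naming scheme."""
    year, month = today.year, today.month - 1
    if month == 0:
        # January rolls back to December of the previous year.
        year, month = year - 1, 12
    return f"{prefix}-{year:04d}.{month:02d}"


if __name__ == "__main__":
    index_name = previous_month_index("logs", date.today())
    # The index would then be removed with: DELETE localhost:9200/<index_name>
    print(index_name)
```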
Alternatively, for a more nuanced approach that enables the removal of data while maintaining availability, we can use index aliases. This method is particularly useful when we want to remove a subset of data from an index without any downtime.
To start, we create the alias for the existing index:
$ curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "customers", "alias": "current_customers" }}
  ]
}'
{"acknowledged":true,"errors":false}
Then, we create a new index with updated mappings:
$ curl -X PUT "localhost:9200/customers_v2" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "email": { "type": "keyword" },
      "name": { "type": "text" }
    }
  }
}'
{"acknowledged":true,"shards_acknowledged":true,"index":"customers_v2"}
Next, we reindex the data, excluding inactive customers:
$ curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "customers",
    "query": {
      "bool": {
        "must_not": {
          "term": { "status": "inactive" }
        }
      }
    }
  },
  "dest": {
    "index": "customers_v2"
  }
}'
{"took":251,"timed_out":false,"total":7,"updated":0,"created":7,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}
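Before switching the alias, it may be worth checking that the reindex finished cleanly. The sketch below inspects only fields that appear in the response above; the function name reindex_is_clean and the choice of checks are assumptions rather than an official recipe.

```python
import json


def reindex_is_clean(response_body):
    """Return True if a _reindex response shows no timeout, failures or conflicts."""
    resp = json.loads(response_body)
    return (
        not resp.get("timed_out", False)
        and not resp.get("failures")
        and resp.get("version_conflicts", 0) == 0
    )


if __name__ == "__main__":
    sample = (
        '{"took":251,"timed_out":false,"total":7,"updated":0,"created":7,'
        '"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"failures":[]}'
    )
    print(reindex_is_clean(sample))
```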
Finally, we switch the alias to the new index:
$ curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "remove": { "index": "customers", "alias": "current_customers" }},
    { "add": { "index": "customers_v2", "alias": "current_customers" }}
  ]
}'
{"acknowledged":true,"errors":false}
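Using this method, applications can continue to query the current_customers alias throughout the process, since both alias actions are applied atomically in a single request. Documents written to the old index after the reindex starts aren't copied automatically, so writes should be paused or replayed before the switch. Once the reindexing is complete and the alias is switched, the old index can be deleted.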
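The sequence above, including the final cleanup of the old index, can be sketched as an ordered list of requests. This assumes the new index already exists; the function removal_plan and its (method, path, body) tuples are hypothetical conventions for illustration, not part of any client library.

```python
import json


def removal_plan(alias, old_index, new_index, keep_query):
    """Return the ordered (method, path, body) steps for an alias-based removal."""
    reindex = {
        # Copy only the documents matching keep_query into the new index.
        "source": {"index": old_index, "query": keep_query},
        "dest": {"index": new_index},
    }
    swap = {
        # Both actions go in one request so the alias switch is atomic.
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
    return [
        ("POST", "/_reindex", json.dumps(reindex)),
        ("POST", "/_aliases", json.dumps(swap)),
        ("DELETE", f"/{old_index}", None),  # drop the old index last
    ]


if __name__ == "__main__":
    keep_active = {"bool": {"must_not": {"term": {"status": "inactive"}}}}
    steps = removal_plan("current_customers", "customers", "customers_v2", keep_active)
    for method, path, body in steps:
        print(method, path, body or "")
```

Keeping the delete as the last step means the old index remains available as a fallback until the alias already points at the new one.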
In this article, we explored various methods for removing data from Elasticsearch, ranging from deleting individual documents to managing large-scale deletions in production environments. We covered the use of the Delete API, the Delete By Query API, the Bulk API, and index-level operations.
With these techniques, we can effectively manage data in Elasticsearch clusters, ensuring optimal performance, compliance with data retention policies, and efficient use of storage resources.