Think of it as a super-fast, scalable database designed for full-text search and complex queries.
It's built on top of Apache Lucene, a high-performance, Java-based search library created by Doug Cutting (the same person behind Hadoop).
It’s not a standalone server, but a library that provides:
Text analysis (tokenization, stemming, stop-word filtering)
Inverted indexes
Scoring and ranking of search results
Think of Lucene as the core search brain — Elasticsearch is the distributed system wrapper around it.
Elasticsearch = Lucene + REST API + Clustering + Sharding + Replication
Searching for text within documents, rather than matching exact field values.
Instead of SQL’s WHERE column = 'value', you search analyzed text — and with the right analysis chain, even conceptually similar text.
For example:
Query: "fast car"
Matches: "quick cars" (via stemming), and even "speedy vehicle" if a synonym filter is configured.
How it works:
Documents are analyzed → broken into tokens.
Lucene builds an inverted index (word → list of docs containing that word).
At query time, Elasticsearch finds matching documents and ranks them with a relevance model: BM25 by default (since version 5.0), or classic TF-IDF (term frequency–inverse document frequency).
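The inverted-index idea above can be sketched in a few lines of Python. This is a toy illustration, not Lucene's actual implementation — the ranking here is a naive term-frequency count, whereas real scoring uses BM25 with document-length normalization:

```python
from collections import defaultdict

docs = {
    1: "the quick car drives fast",
    2: "a fast car is a quick car",
    3: "slow trucks carry cargo",
}

# Build an inverted index: term -> set of doc IDs containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query):
    """Intersect posting lists, then rank by a naive term-frequency score."""
    terms = query.lower().split()
    postings = [index[t] for t in terms if t in index]
    if len(postings) < len(terms):
        return []  # some query term appears in no document
    candidates = set.intersection(*postings)
    scores = {d: sum(docs[d].split().count(t) for t in terms) for d in candidates}
    return sorted(scores, key=scores.get, reverse=True)

print(search("fast car"))  # doc 2 outranks doc 1 ("car" appears twice in it)
```

Notice that the query never scans the documents themselves — it only walks the posting lists, which is what makes inverted-index lookups fast.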
Search Engine: Full-text search on websites or apps (e.g., e-commerce, blogs)
Logging & Monitoring: Centralized log analytics with Kibana visualization
Metrics & Observability: Time-series data from servers or containers
Data Analytics: Aggregations and dashboards
Autocomplete & Suggesters: Real-time suggestions and recommendations
Beats: Lightweight data shippers (send logs/metrics)
Logstash: Ingest, filter, and transform data before indexing
Elasticsearch: Store and index data
Kibana: Visualization and dashboard layer
Cluster: A collection of one or more nodes (servers) that hold all data and provide search capabilities across indices.
Node: A single instance of Elasticsearch running on a machine. Each node stores data and participates in indexing/searching.
Nodes can serve different roles:
Master node → manages cluster state, shard allocation, and index creation.
Data node → stores data and executes queries/aggregations.
Coordinating node → routes client requests to the right nodes and merges results.
Shard: An index is divided into pieces called shards for horizontal scalability.
Each shard is a Lucene instance — a self-contained search engine.
Replica: A copy of a shard for high availability and load balancing.
Each index is split into primary shards → Sharding
Each primary shard can have replica shards (copies) → Replication
Elasticsearch automatically distributes shards across nodes → Scalability/Horizontal Scaling
If one node fails, data can still be served from replicas → High Availability / Fault Tolerance
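Under the hood, Elasticsearch routes each document to a primary shard by hashing its routing value (the document ID by default) modulo the number of primary shards — which is also why the primary shard count cannot be changed after index creation. A toy sketch of that routing rule (Elasticsearch actually uses a murmur3 hash, not the simple hash below):

```python
NUM_PRIMARY_SHARDS = 3  # fixed at index creation time

def route(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a primary shard for a document (toy stand-in for ES's murmur3 routing)."""
    # A small deterministic string hash so results are stable across runs.
    h = 0
    for ch in doc_id:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h % num_shards

for doc_id in ["1", "2", "42"]:
    print(doc_id, "-> shard", route(doc_id))
```

Because the mapping is purely a function of (doc ID, shard count), any coordinating node can compute the target shard without a lookup — but changing the shard count would reshuffle every document, hence the restriction.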
Scalable (horizontal scaling via shards)
Real-time search and analytics
Schemaless JSON storage (flexible)
High memory and storage usage
Complex cluster management at scale
Updates/deletes are costly (immutable segments)
Not an OLTP database — better for search/analytics
Index: A collection of documents that share similar characteristics (e.g., logs, users, products) — similar to a table in RDBMS.
Document: The basic unit of information, stored as JSON. Like a row in an RDBMS.
Field: An attribute (a key-value pair) in a document. Like a column in an RDBMS.
Mapping: Defines the schema for documents within an index:
Field names
Data types (text, keyword, integer, date, etc.)
How each field is indexed or analyzed
Analyzer: Used when indexing and searching text fields.
Breaks down text into tokens and normalizes them (lowercasing, stemming, stop-word removal).
Improves the quality of full-text search.
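A minimal analyzer can be imitated in Python. This is only a sketch: real Elasticsearch analyzers are configurable chains of character filters, a tokenizer, and token filters, and use proper stemmers (e.g. Porter) rather than the crude suffix-stripping below:

```python
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "or"}

def naive_stem(token):
    # Crude suffix stripping; real analyzers use e.g. the Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = text.lower().split()                        # tokenize + lowercase
    tokens = [t.strip(".,!?") for t in tokens]           # strip punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [naive_stem(t) for t in tokens]               # stemming

print(analyze("The user searched indexed documents."))
# -> ['user', 'search', 'index', 'document']
```

The same analyzer must run at both index time and query time — otherwise the query's tokens would never line up with the terms stored in the inverted index.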
Client sends data to Elasticsearch via REST API.
The coordinating node receives the request and routes it to the appropriate primary shard.
The shard indexes the document using Lucene (with analyzers and mappings).
The data is asynchronously replicated to replica shards on other nodes.
Search queries are broadcast to all shards (scatter), results are merged (gather) and returned.
When you index (store) a document:
ES assigns it to a primary shard.
The shard indexes the data using Lucene.
Any replica shards get updated asynchronously.
Example (REST API):
PUT /books/_doc/1
{
  "title": "Elasticsearch Deep Dive",
  "author": "Bob A",
  "year": 2025
}
When you search:
The query is broadcast to all shards (scatter phase).
Each shard returns its results (gather phase).
ES merges and ranks the results by relevance score.
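The scatter/gather flow above can be sketched in Python: each shard returns its local top hits, and the coordinating node merges them into a global ranking. This is a simplified illustration — real Elasticsearch also handles pagination (from/size) and fetches full documents in a second phase:

```python
import heapq

# Hypothetical per-shard results: lists of (score, doc_id) pairs.
shard_hits = [
    [(2.3, "doc-1"), (1.1, "doc-4")],  # shard 0's local top hits
    [(3.0, "doc-7"), (0.9, "doc-2")],  # shard 1's local top hits
    [(1.8, "doc-5")],                  # shard 2's local top hits
]

def gather(shard_results, size=3):
    """Merge per-shard hits and keep the global top-`size` docs by score."""
    merged = heapq.merge(
        *[sorted(hits, reverse=True) for hits in shard_results],
        reverse=True,
    )
    return [doc for _score, doc in list(merged)[:size]]

print(gather(shard_hits))  # global top-3: doc-7, doc-1, doc-5
```

Because each shard only ships its local top-k (not every match), the coordinating node's merge stays cheap even when the index holds millions of documents.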
Example (search API):
GET /books/_search
{
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  }
}
Under the hood (Lucene):
Text is analyzed using an analyzer → tokenized, normalized, filtered.
A term index (inverted index) is created mapping terms → document IDs.
Queries use these inverted indexes for super fast lookups.
Query Type | Purpose
---------- | ----------------------------------------------------------
`match` | Full-text search
`term` | Exact value search
`range` | Numeric or date range
`bool` | Combine multiple conditions (`must`, `should`, `must_not`)
`aggs` | Aggregations (e.g., counts, averages, histograms)
Example:
GET /books/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "author": "Bob" }},
        { "range": { "year": { "gte": 2020 }}}
      ]
    }
  },
  "aggs": {
    "by_year": { "terms": { "field": "year" }}
  }
}
PUT = Create a resource with a specified ID, or replace it if it exists
The general pattern is: PUT /index_name/_doc/document_id
Example: PUT /books/_doc/1
POST = Create without specifying ID (auto-generated) or trigger an operation
POST /books/_doc/
GET = Retrieve a resource or perform a search
GET /books/_doc/1 or GET /books/_search
DELETE = Remove a resource (document, index, etc.)
DELETE /books/_doc/1
HEAD = Check if a resource exists (returns 200/404)
HEAD /books/_doc/1
Endpoint | Purpose | Example
-----------| ----------------------------------------------- | -----------------------
/_search | Search API | GET /books/_search
/_doc | Document API (create/read/update/delete) | GET /books/_doc/1
/_mapping | View or define index mappings | GET /books/_mapping
/_settings | View or update index settings | GET /books/_settings
/_cat | Human-readable info APIs (for debugging) | GET /_cat/indices?v
/_cluster | Cluster-wide info | GET /_cluster/health
/_nodes | Node-level info | GET /_nodes/stats
/_count | Count matching documents | GET /books/_count
/_bulk | Perform multiple index/update/delete operations | POST /_bulk
/_update | Update part of a document | POST /books/_update/1
User-level resources (data endpoints): /books/, /books/_doc/
System-level resources (cluster management): /_cluster/health, /_cat/indices