=================
Resiliency Status
=================

Overview
========

The team at Elasticsearch is committed to continuously improving both
Elasticsearch and Apache Lucene to protect your data. As with any
distributed system, Elasticsearch is complex and has many moving parts,
each of which can encounter edge cases that require proper handling. Our
resiliency project is an ongoing effort to find and fix these edge
cases. If you want to keep up with this project on GitHub, see our
issues list under the tag
`resiliency <https://github.com/elasticsearch/elasticsearch/issues?q=label%3Aresiliency>`__.

While GitHub is great for sharing our work, it can be difficult to get
an overview of the current state of affairs and the previous work that
has been done from an issues list. This page provides an overview of all
the resiliency-related issues that we are aware of, improvements that
have already been made and current in-progress work. We’ve also listed
some historical improvements throughout this page to provide the full
context.

To learn more about how we approach resiliency in Elasticsearch, see
Igor Motov’s recent talk `Improving Elasticsearch
Resiliency <http://www.elasticsearch.org/videos/improving-elasticsearch-resiliency/>`__.

You may also be interested in our blog post `Resiliency in
Elasticsearch <http://www.elasticsearch.org/blog/resiliency-elasticsearch/>`__,
which details our thought processes when addressing resiliency in both
Elasticsearch and the work our developers do upstream in Apache Lucene.

Data Store Recommendations
==========================

Some customers use Elasticsearch as a primary datastore, some set up
comprehensive backup solutions using features such as Snapshot and
Restore, while others use Elasticsearch in conjunction with a data
storage system like Hadoop or even flat files. Because Elasticsearch
serves so many different use cases, we have created this page to make
sure you are fully informed when you are architecting your system.

Work in Progress
================

**Known Unknowns (STATUS: ONGOING)**

We consider this topic to be the most important in our quest for
resiliency. We put a tremendous amount of effort into testing
Elasticsearch to simulate failures and randomize configuration to
produce extreme conditions. In addition, our users are an important
source of information on unexpected edge cases and your bug reports help
us make fixes that ensure that our system continues to be resilient.

If you encounter an issue, `please report
it <https://github.com/elasticsearch/elasticsearch/issues>`__!

We are committed to tracking down and fixing all the issues that are
posted.

**Loss of documents during network partition (STATUS: ONGOING)**

If a network partition separates a node from the master, there is some
window of time before the node detects it. The length of the window is
dependent on the type of the partition. This window is extremely small
if a socket is broken. More adversarial partitions, for example those
that silently drop requests without breaking the socket, can take longer
to detect (up to 3x30s using current defaults).
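
The worst-case figure above is simple arithmetic: the number of ping
retries multiplied by the ping timeout. The sketch below assumes 3
retries and a 30-second timeout, which is what the "3x30s" figure
implies. The function and parameter names are illustrative and are not
Elasticsearch settings.

```python
def worst_case_detection_window(ping_retries: int, ping_timeout_s: float) -> float:
    """Upper bound on how long a silently dropped connection can go unnoticed.

    Each ping has to time out before it counts as a failure, and the peer is
    only declared unreachable after `ping_retries` consecutive failures.
    """
    return ping_retries * ping_timeout_s


# Assumed defaults, taken from the "3x30s" figure in the text.
print(worst_case_detection_window(ping_retries=3, ping_timeout_s=30.0))  # 90.0
```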

If the node hosts a primary shard at the moment of partition, and ends
up being isolated from the cluster (which could have resulted in
`split-brain <https://github.com/elasticsearch/elasticsearch/issues/2488>`__
before), some documents that are being indexed into the primary may be
lost if they fail to reach one of the allocated replicas (due to the
partition) and that replica is later promoted to primary by the master.
`#7572 <https://github.com/elasticsearch/elasticsearch/issues/7572>`__

A test to replicate this condition was added in
`#7493 <https://github.com/elasticsearch/elasticsearch/issues/7493>`__.

**Prevent use of known-bad Java versions (STATUS: ONGOING)**

Certain versions of the JVM are known to have bugs which can cause index
corruption.
`#7580 <https://github.com/elasticsearch/elasticsearch/issues/7580>`__
prevents Elasticsearch startup if known bad versions are in use.

**Lucene checksums phase 2 (STATUS: ONGOING)**

When Lucene opens a segment for reading, it validates the checksum on
the smaller segment files — those which it reads entirely into
memory — but not the large files like term frequencies and positions, as
this would be very expensive. During merges, term vectors and stored
fields are validated, as long as the segments being merged come from the
same version of Lucene. Checksumming for term vectors and stored fields
is important because merging consists of performing optimized byte
copies. Term frequencies, term positions, payloads, doc values, and
norms are currently not checked during merges, although Lucene provides
the option to do so. These files are less prone to silent corruption as
they are actively decoded during merge, and so are more likely to throw
exceptions if there is any corruption.

There are a few ongoing efforts to improve coverage:

-  `#7360 <https://github.com/elasticsearch/elasticsearch/issues/7360>`__
   validates checksums on all segment files during merges. (STATUS:
   DONE, fixed in 1.4.0.Beta1)

-  `LUCENE-5842 <https://issues.apache.org/jira/browse/LUCENE-5842>`__
   validates the structure of the checksum footer of the postings lists,
   doc values, stored fields and term vectors when opening a new
   segment, to ensure that these files have not been truncated. (STATUS:
   DONE, fixed in Lucene 4.10 and 1.4.0.Beta1)

-  `LUCENE-5894 <https://issues.apache.org/jira/browse/LUCENE-5894>`__
   lays the groundwork for extending more efficient checksum validation
   to all files during optimized bulk merges, if possible. (STATUS:
   ONGOING, fixed in Lucene 5.0)

-  `#7586 <https://github.com/elasticsearch/elasticsearch/issues/7586>`__
   adds checksums for cluster and index state files. (STATUS: ONGOING,
   fixed in 1.5.0)
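
The difference between checking only the structure of a checksum footer
and verifying the full checksum can be sketched as follows. This is a
simplified illustration, not Lucene's file format: the footer layout,
the magic value and all function names are assumptions made for the
example.

```python
import struct
import zlib

# Hypothetical footer layout: a 4-byte magic marker followed by a 4-byte CRC32.
FOOTER_MAGIC = 0xC0DEF007
FOOTER = struct.Struct(">II")


def write_with_footer(payload: bytes) -> bytes:
    """Return the payload followed by the magic marker and its CRC32."""
    return payload + FOOTER.pack(FOOTER_MAGIC, zlib.crc32(payload))


def footer_looks_valid(data: bytes) -> bool:
    """Cheap check on open: reads only the last few bytes, catches truncation."""
    if len(data) < FOOTER.size:
        return False
    magic, _ = FOOTER.unpack(data[-FOOTER.size:])
    return magic == FOOTER_MAGIC


def checksum_matches(data: bytes) -> bool:
    """Expensive check: re-reads the whole payload, catches altered bytes."""
    if not footer_looks_valid(data):
        return False
    _, stored = FOOTER.unpack(data[-FOOTER.size:])
    return zlib.crc32(data[:-FOOTER.size]) == stored


segment = write_with_footer(b"postings data")
truncated = segment[:-3]                               # file cut short
altered = b"Postings data" + segment[-FOOTER.size:]    # one byte changed
```

In this sketch the truncated file fails the cheap structural check,
while the altered file passes it and is only caught by the full
checksum, which is why both kinds of validation are useful.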

**Add per-segment and per-commit ID to help replication (STATUS:
ONGOING)**

`LUCENE-5895 <https://issues.apache.org/jira/browse/LUCENE-5895>`__ adds
a unique ID for each segment and each commit point. File-based
replication (as performed by snapshot/restore) can use this ID to know
whether the segment/commit on the source and destination machines are
the same. Fixed in Lucene 5.0.

**Improving Zen Discovery (STATUS: ONGOING)**

Recovery from failure is a complicated process, especially in an
asynchronous distributed system like Elasticsearch. With several
processes happening in parallel, it is important to ensure that recovery
proceeds swiftly and safely. While fixing the `split-brain
issue <https://github.com/elasticsearch/elasticsearch/issues/2488>`__ we
have been hunting down corner cases that were not handled optimally,
adding tests to demonstrate the issues, and working on fixes:

-  Faster and better detection of master and node failures. This
   includes not trying to reconnect upon disconnect, failing on a
   disconnect error during a ping, and verifying cluster names in pings.
   Previously, Elasticsearch had to wait a bit for the node to complete
   the process required to join the cluster. Recent changes guarantee
   that a node has fully joined the cluster before we start the fault
   detection process. We can therefore run an immediate check, which
   gives faster detection of errors and validation of the cluster state
   after a minimum master node breach.
   `#6706 <https://github.com/elasticsearch/elasticsearch/issues/6706>`__,
   `#7399 <https://github.com/elasticsearch/elasticsearch/issues/7399>`__
   (STATUS: DONE, v1.4.0.Beta1)

-  Broaden Unicast pinging when master fails: When a node loses its
   current master, it will start pinging to find a new one. Previously,
   when using unicast based pinging, the node would ping a set of
   predefined nodes asking them whether the master had really
   disappeared or whether there was a network hiccup. Now, we ping all
   nodes in the cluster to increase coverage. In the case that all
   unicast hosts are disconnected from the current master during a
   network failure, this improvement is essential to allow the cluster
   to reform once the partition is healed.
   `#7336 <https://github.com/elasticsearch/elasticsearch/issues/7336>`__
   (STATUS: DONE, v1.4.0.Beta1)

-  After joining a cluster, validate that the join was successful and
   that the master has been set in the local cluster state.
   `#6969 <https://github.com/elasticsearch/elasticsearch/issues/6969>`__.
   (STATUS: DONE, v1.4.0.Beta1)

-  Write additional tests that use the test infrastructure to verify
   proper behavior during network disconnections and garbage
   collections.
   `#7082 <https://github.com/elasticsearch/elasticsearch/issues/7082>`__
   (STATUS: ONGOING)

-  Make write calls return the number of total/successful/missing shards
   in the same way that we do in search, which ensures transparency in
   the consistency of write operations.
   `#7572 <https://github.com/elasticsearch/elasticsearch/issues/7572>`__.
   (STATUS: ONGOING)

**Validate quorum before accepting a write request (STATUS: ONGOING)**

Today, when a node holding a primary shard receives an index request, it
checks the local cluster state to see whether a quorum of shards is
available before it accepts the request. However, it can take some time
before an unresponsive node is removed from the cluster state. We are
adding an optional live check, where the primary node tries to contact
its replicas to confirm that they are still responding before accepting
any changes. See
`#6937 <https://github.com/elasticsearch/elasticsearch/issues/6937>`__.

While this work is ongoing, we have tightened the current checks by
bringing them closer to the index code. See
`#7873 <https://github.com/elasticsearch/elasticsearch/issues/7873>`__
(STATUS: DONE, fixed in 1.4.0)
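
The idea of a live check can be sketched as follows. The example uses a
generic strict-majority rule and a caller-supplied probe function; the
exact quorum rule in Elasticsearch may differ, and every name here is
hypothetical.

```python
from typing import Callable, Iterable


def quorum_size(total_copies: int) -> int:
    """Strict majority of all shard copies (primary plus replicas)."""
    return total_copies // 2 + 1


def accept_write(replica_nodes: Iterable[str],
                 total_copies: int,
                 is_responding: Callable[[str], bool]) -> bool:
    """Accept a write only if the primary plus live replicas form a quorum.

    `replica_nodes` is what the local cluster state lists; `is_responding`
    is the live probe that confirms each node still answers.
    """
    live_copies = 1 + sum(1 for node in replica_nodes if is_responding(node))
    return live_copies >= quorum_size(total_copies)
```

With three copies in total, the cluster state alone would report a
quorum as long as both replicas are still listed, whereas
``accept_write`` rejects the request if neither replica answers the
probe.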

**Jepsen Test Failures (STATUS: ONGOING)**

We have increased our test coverage to include scenarios tested by
Jepsen. We make heavy use of randomization to expand on the scenarios
that can be tested and to introduce new error conditions. You can follow
the work on the master branch of the
`DiscoveryWithServiceDisruptions
class <https://github.com/elasticsearch/elasticsearch/blob/master/src/test/java/org/elasticsearch/discovery/DiscoveryWithServiceDisruptions.java>`__,
where we will add more tests as time progresses.

**Document guarantees and handling of failure (STATUS: ONGOING)**

This status page is a start, but we can do a better job of explicitly
documenting the processes at work in Elasticsearch, and what happens in
the case of each type of failure. The plan is to have a test case that
validates each behavior under simulated conditions. Every test will
document the expected results, the associated test code and an explicit
PASS or FAIL status for each simulated case.

**Take filter cache key size into account (STATUS: ONGOING)**

Commonly used filters are cached in Elasticsearch. That cache is limited
in size (10% of the node’s memory by default) and entries are evicted
based on a least recently used policy. The amount of memory used by the
cache depends on two primary components: the values it stores and the
keys associated with them. Calculating the memory footprint of the
values is easy enough, but accounting for the keys is trickier as they
are, by default, raw Lucene objects. This was largely not a problem
while the values dominated the keys in size. However, recent
optimizations in Lucene have changed the balance, causing the filter
cache to grow beyond its configured size.

While we are working on a longer term solution, we introduced a minimum
weight of 1k for each cache entry. This puts an effective limit on the
number of entries in the cache. See
`#8304 <https://github.com/elasticsearch/elasticsearch/issues/8304>`__
(STATUS: DONE, fixed in 1.4.0)
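
The effect of a minimum weight is easy to see with a little arithmetic.
In this sketch "1k" is assumed to mean 1024 bytes, and the function
names are made up for illustration.

```python
MIN_ENTRY_WEIGHT = 1024  # the "1k" floor from the text, assumed to be 1024 bytes


def entry_weight(value_size_bytes: int) -> int:
    """Weight charged to the cache for one entry, never below the floor."""
    return max(MIN_ENTRY_WEIGHT, value_size_bytes)


def max_entries(cache_size_bytes: int) -> int:
    """Upper bound on the entry count: every entry costs at least the floor."""
    return cache_size_bytes // MIN_ENTRY_WEIGHT


print(entry_weight(16))                  # a tiny value is still charged 1024
print(max_entries(100 * 1024 * 1024))    # a 100 MiB cache holds at most 102400 entries
```

Because even a tiny value is charged the floor, the number of keys, and
therefore the unaccounted key memory, is bounded by the cache size.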

Completed
=========

**Don’t allow unsupported codecs (STATUS: DONE, v1.4.0.Beta1)**

Lucene 4 added a number of alternative codecs for experimentation
purposes, and Elasticsearch exposed the ability to change codecs. Since
then, Lucene has settled on the best choice of codec and provides
backwards compatibility only for the default codec.
`#7566 <https://github.com/elasticsearch/elasticsearch/issues/7566>`__
removes the ability to set alternate codecs.

**Use checksums to identify entire segments (STATUS: DONE,
v1.4.0.Beta1)**

A hash collision makes it possible for two different files to have the
same length and the same checksum. Rather than trusting the checksum of
a single file, a segment’s identity should rely on checksums from all of
the files in the segment, which greatly reduces the chance of a
collision. This change has been merged
(`#7351 <https://github.com/elasticsearch/elasticsearch/issues/7351>`__).
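
A minimal sketch of the idea, assuming each file is described by its
name, length and checksum: the segment identity is derived from all of
the files together, so a collision on one file is not enough to make two
segments look identical. The use of SHA-256 and the function name are
choices made for this example, not a description of the Elasticsearch
implementation.

```python
import hashlib
from typing import Mapping, Tuple


def segment_identity(files: Mapping[str, Tuple[int, int]]) -> str:
    """Combine (name, length, checksum) of every file into one digest.

    `files` maps a file name to a (length, checksum) pair.
    """
    digest = hashlib.sha256()
    for name in sorted(files):  # sorted so the result ignores listing order
        length, checksum = files[name]
        digest.update(f"{name}:{length}:{checksum}\n".encode("utf-8"))
    return digest.hexdigest()


# Two segments whose "_0.fdt" files collide on length and checksum,
# but whose "_0.tim" files differ.
segment_a = {"_0.fdt": (1024, 0xDEADBEEF), "_0.tim": (2048, 0x1111)}
segment_b = {"_0.fdt": (1024, 0xDEADBEEF), "_0.tim": (2048, 0x2222)}
```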

**Fix “Split Brain can occur even with minimum\_master\_nodes”
(STATUS: DONE, v1.4.0.Beta1)**

Even when minimum master nodes is set, split brain can still occur under
certain conditions, e.g. disconnection between master eligible nodes,
which can lead to data loss. The scenario is described in detail in
`issue
2488 <https://github.com/elasticsearch/elasticsearch/issues/2488>`__:

-  Introduce a new testing infrastructure to simulate different types of
   node disconnections, including loss of network connection, lost
   messages, message delays, etc. See
   `MockTransportService <https://github.com/elasticsearch/elasticsearch/issues/5631>`__
   support and `service
   disruption <https://github.com/elasticsearch/elasticsearch/issues/6505>`__
   for more details. (STATUS: DONE, v1.4.0.Beta1).

-  Added tests that simulated the bug described in issue 2488. You can
   take a look at the `original
   commit <https://github.com/elasticsearch/elasticsearch/commit/7bf3ffe73c44f1208d1f7a78b0629eb48836e726>`__
   of a reproduction on master. (STATUS: DONE, v1.2.0)

-  The bug described in `issue
   2488 <https://github.com/elasticsearch/elasticsearch/issues/2488>`__
   is caused by an issue in our zen discovery gossip protocol. This
   specific issue has been fixed, and work has been done to make the
   algorithm more resilient. (STATUS: DONE, v1.4.0.Beta1)

**Translog Entry Checksum (STATUS: DONE, v1.4.0.Beta1)**

Each translog entry in Elasticsearch should have its own checksum, and
potentially additional information, so that we can properly detect
corrupted translog entries and act accordingly. You can find more detail
in issue
`#6554 <https://github.com/elasticsearch/elasticsearch/issues/6554>`__.

We will begin by adding checksums to the translog to detect
corrupt entries. Once this work has been completed, we will add translog
entry markers so that corrupt entries can be skipped in the translog
if/when desired.
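
Per-entry checksums can be sketched as a simple framing scheme: each
entry carries its length and a CRC32 of its payload, so a reader can
tell exactly which entry is damaged. The layout below is hypothetical
and is not the actual translog format.

```python
import struct
import zlib
from typing import Iterator

HEADER = struct.Struct(">I")   # payload length
TRAILER = struct.Struct(">I")  # CRC32 of the payload


class CorruptEntry(Exception):
    """Raised when an entry is truncated or fails its checksum."""


def encode_entry(payload: bytes) -> bytes:
    """Frame one entry as length, payload, checksum."""
    return HEADER.pack(len(payload)) + payload + TRAILER.pack(zlib.crc32(payload))


def read_entries(log: bytes) -> Iterator[bytes]:
    """Yield payloads in order, raising CorruptEntry at the first bad entry."""
    offset = 0
    while offset < len(log):
        if offset + HEADER.size > len(log):
            raise CorruptEntry(f"truncated header at offset {offset}")
        (length,) = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        end = start + length
        if end + TRAILER.size > len(log):
            raise CorruptEntry(f"truncated entry at offset {offset}")
        payload = log[start:end]
        (stored,) = TRAILER.unpack_from(log, end)
        if zlib.crc32(payload) != stored:
            raise CorruptEntry(f"checksum mismatch at offset {offset}")
        yield payload
        offset = end + TRAILER.size
```

The length prefix plays the role of a simple entry marker here: once
each entry's boundaries are known, skipping a corrupt entry becomes
possible in principle, although this sketch stops at the first failure.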

**Request-Level Memory Circuit Breaker (STATUS: DONE, v1.4.0.Beta1)**

We are in the process of introducing multiple circuit breakers in
Elasticsearch, which can “borrow” space from each other in the event
that one runs out of memory. This architecture will allow limits for
certain parts of memory, but still allow flexibility in the event that
another reserve like field data is not being used. This change includes
adding a breaker for the BigArrays internal object used for some
aggregations. See issue
`#6739 <https://github.com/elasticsearch/elasticsearch/issues/6739>`__
for more details.
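
One way to picture breakers that share headroom is a set of child
breakers under a common parent limit, where the child limits add up to
more than the parent allows. The sketch below illustrates that reading
only; the class names and limits are hypothetical and do not mirror the
Elasticsearch classes.

```python
from typing import List


class CircuitBreakingError(Exception):
    """Raised when a reservation would exceed a breaker's limit."""


class ParentBreaker:
    """Caps the combined usage of all child breakers."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.children: List["ChildBreaker"] = []

    def used_bytes(self) -> int:
        return sum(child.used_bytes for child in self.children)


class ChildBreaker:
    """Tracks one kind of memory, such as field data or request structures."""

    def __init__(self, name: str, limit_bytes: int, parent: ParentBreaker) -> None:
        self.name = name
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        self.parent = parent
        parent.children.append(self)

    def reserve(self, num_bytes: int) -> None:
        # Check the child's own limit first, then the shared parent limit.
        if self.used_bytes + num_bytes > self.limit_bytes:
            raise CircuitBreakingError(f"{self.name} breaker limit exceeded")
        if self.parent.used_bytes() + num_bytes > self.parent.limit_bytes:
            raise CircuitBreakingError("parent breaker limit exceeded")
        self.used_bytes += num_bytes

    def release(self, num_bytes: int) -> None:
        self.used_bytes = max(0, self.used_bytes - num_bytes)
```

With a parent limit of 100 and child limits of 80 and 60, either child
can use more than half of the total while the other is idle, but the two
together can never exceed the parent limit.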

**Doc Values (STATUS: DONE, v1.4.0.Beta1)**

Fielddata is one of the largest consumers of heap memory, and thus one
of the primary reasons for running out of memory and causing node
instability. Elasticsearch has had the “doc values” option for a while,
which allows you to build these structures at index time so that they
live on disk instead of in memory. Up until recently, doc values were
significantly slower than in-memory fielddata.

By benchmarking and profiling both Lucene and Elasticsearch, we
identified the bottlenecks and made a series of changes to improve the
performance of doc values. They are now almost as fast as
the in-memory option.

See
`#6967 <https://github.com/elasticsearch/elasticsearch/issues/6967>`__,
`#6908 <https://github.com/elasticsearch/elasticsearch/issues/6908>`__,
`#4548 <https://github.com/elasticsearch/elasticsearch/issues/4548>`__,
`#3829 <https://github.com/elasticsearch/elasticsearch/issues/3829>`__,
`#4518 <https://github.com/elasticsearch/elasticsearch/issues/4518>`__,
`#5669 <https://github.com/elasticsearch/elasticsearch/issues/5669>`__,
`LUCENE-5748 <https://issues.apache.org/jira/browse/LUCENE-5748>`__,
`LUCENE-5703 <https://issues.apache.org/jira/browse/LUCENE-5703>`__,
`LUCENE-5750 <https://issues.apache.org/jira/browse/LUCENE-5750>`__,
`LUCENE-5721 <https://issues.apache.org/jira/browse/LUCENE-5721>`__,
`LUCENE-5799 <https://issues.apache.org/jira/browse/LUCENE-5799>`__.

**Index corruption when upgrading Lucene 3.x indices (STATUS: DONE,
v1.4.0.Beta1)**

Upgrading indices created with Lucene 3.x (Elasticsearch v0.20 and
before) to Lucene 4.7 - 4.9 (Elasticsearch 1.1.0 to 1.3.x) could result
in index corruption.
`LUCENE-5907 <https://issues.apache.org/jira/browse/LUCENE-5907>`__
fixes this issue in Lucene 4.10.

**Improve error handling when deleting files (STATUS: DONE,
v1.4.0.Beta1)**

Lucene uses reference counting to prevent files that are still in use
from being deleted. Lucene testing discovered a bug
(`LUCENE-5919 <https://issues.apache.org/jira/browse/LUCENE-5919>`__)
when decrementing the ref count on a batch of files. If deleting some of
the files resulted in an exception (e.g. due to interference from a
virus scanner), the files that had had their ref counts decremented
successfully could later have their ref counts decremented again,
incorrectly, resulting in files being physically deleted before their
time. This is fixed in Lucene 4.10.

**Using Lucene Checksums to verify shards during snapshot/restore
(STATUS: DONE, v1.3.3)**

The snapshot process should verify checksums for each file that is being
snapshotted to make sure that the created snapshot doesn’t contain corrupted
files. If a corrupted file is detected, the snapshot should fail with an
error. In order to implement this feature we need to have correct and
verifiable checksums stored with segment files, which is only possible
for files that were written by the officially supported append-only
codecs. See
`#7159 <https://github.com/elasticsearch/elasticsearch/issues/7159>`__.

**Rare compression corruption during shard recovery (STATUS: DONE,
v1.3.2)**

During recovery, the primary shard is copied over the network to become
a new replica shard. In rare cases, it was possible for a hash collision
to trigger a bug in the compression library, producing corruption in
the replica shard. This bug was exposed by the change to validate
checksums during recovery. We tracked down the bug in the compression
library and submitted a patch, which was accepted and merged
by the upstream project. See
`#7210 <https://github.com/elasticsearch/elasticsearch/issues/7210>`__.

**Safer recovery of replica shards (STATUS: DONE, v1.3.0)**

If a primary shard fails or is closed while a replica is using it for
recovery, we need to ensure that the replica is properly failed as well,
and allow recovery to start from the new primary. Also check that an
active copy of a shard is available on another node before physically
removing an inactive shard from disk.
`#6825 <https://github.com/elasticsearch/elasticsearch/issues/6825>`__,
`#6645 <https://github.com/elasticsearch/elasticsearch/issues/6645>`__,
`#6995 <https://github.com/elasticsearch/elasticsearch/issues/6995>`__.

**Using Lucene Checksums to verify shards during recovery (STATUS: DONE,
v1.3.0)**

Elasticsearch can use Lucene checksums to validate files while
`recovering a replica shard from a
primary <https://github.com/elasticsearch/elasticsearch/issues/6776>`__.

This issue exposed a bug in Elasticsearch’s handling of primary shard
failure when having more than 2 replicas, causing the second replica to
not be properly unassigned if it is in the middle of recovery. It was
fixed with the merge of issue
`#6808 <https://github.com/elasticsearch/elasticsearch/issues/6808>`__.

In order to verify the checksumming mechanism, we added functionality to
our testing infrastructure that can corrupt an arbitrary index file at
any point, such as while it’s traveling over the wire or residing on
disk. The tests utilizing this feature expect full or partial recovery
from the failure while neither losing data nor spreading the corruption.

**Detect File Corruption (STATUS: DONE, v1.3.0)**

If a checksum failure is detected during merging or refresh,
Elasticsearch will fail the shard. You can read the full details in
pull request
`#6776 <https://github.com/elasticsearch/elasticsearch/issues/6776>`__.

**Network disconnect events could be lost, causing a zombie node to stay
in the cluster state (STATUS: DONE, v1.3.0)**

Previously, there was a very short window in which we could lose a node
disconnect event. To prevent this from occurring, we added extra
handling of connection errors to our nodes & master fault detection
pinging to make sure the node disconnect event is detected. See issue
`#6686 <https://github.com/elasticsearch/elasticsearch/issues/6686>`__.

**Other fixes to Lucene to address resiliency (STATUS: DONE, v1.3.0)**

-  NativeLock is released if Lock is closed after failing on obtain
   `LUCENE-5738 <https://issues.apache.org/jira/browse/LUCENE-5738>`__.

-  NRT Reader close can wipe an index it doesn’t own.
   `LUCENE-5574 <https://issues.apache.org/jira/browse/LUCENE-5574>`__

-  FSDirectory’s fsync() was lenient; it now throws exceptions when
   errors occur
   `LUCENE-5570 <https://issues.apache.org/jira/browse/LUCENE-5570>`__

-  fsync() directory when committing
   `LUCENE-5588 <https://issues.apache.org/jira/browse/LUCENE-5588>`__

**Backwards Compatibility Testing (STATUS: DONE, v1.3.0)**

Since founding Elasticsearch Inc, we grew our test base from ~1k tests
to about 4k in just over a year. We invested massively into our
testing infrastructure, running our tests continuously on different
operating systems, bare metal hardware and cloud environments, all while
randomizing JVMs and their settings.

Yet, backwards compatibility testing was a very manual thing until we
released a pretty `insane
bug <https://github.com/elasticsearch/elasticsearch/issues/6393>`__ with
Elasticsearch 1.2. We tried to fix places where the absolute value of a
number was negative (a documented behavior of Math.abs(int) in Java) and
missed that the fix for this also changed the result of our routing
function. No matter how much randomization we applied to the tests, we
didn’t catch this particular failure. We always had backwards
compatibility tests on our list of things to do, but didn’t have them in
place back then.
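
The pitfall mentioned above can be reproduced outside the JVM. The
sketch below emulates Java's 32-bit ``Math.abs(int)`` in Python and then
compares two hypothetical ways of deriving a non-negative shard number
from a hash. Neither routing function is taken from the Elasticsearch
code base; they only show how an innocent-looking fix can change where a
document is routed.

```python
INT_MIN = -2**31


def to_int32(value: int) -> int:
    """Wrap a Python int into Java's 32-bit two's-complement range."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def java_abs(value: int) -> int:
    """Mimic Java's Math.abs(int), including its overflow for INT_MIN."""
    return to_int32(-value) if value < 0 else value


def java_rem(a: int, b: int) -> int:
    """Java's % operator: the result takes the sign of the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def shard_via_abs(hash_code: int, shards: int) -> int:
    """Take the remainder first, then its absolute value."""
    return java_abs(java_rem(hash_code, shards))


def shard_via_floor_mod(hash_code: int, shards: int) -> int:
    """Use a floor modulo, which Python's % already is."""
    return hash_code % shards


print(java_abs(INT_MIN) < 0)         # True: the "absolute value" is negative
print(shard_via_abs(-7, 5))          # 2
print(shard_via_floor_mod(-7, 5))    # 3
```

Both routing functions return a valid shard number, yet they disagree
for negative hashes, so documents indexed with one cannot be found with
the other.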

We recently tweaked our testing infrastructure to be able to run tests
against a hybrid cluster composed of a released version of Elasticsearch
and our current stable branch. This test pattern allowed us to mimic
typical upgrade scenarios like rolling upgrades, index backwards
compatibility and recovering from old to new nodes.

Now, even the simplest test that relies on routing fails against 1.2.0,
which is exactly what we were aiming for. The test would not have caught the
aforementioned `routing
bug <https://github.com/elasticsearch/elasticsearch/issues/6393>`__
before releasing 1.2.0, but it immediately saved us from `another
problem <https://github.com/elasticsearch/elasticsearch/issues/6660>`__
in the stable branch.

The work on our testing infrastructure is more than just issue
prevention; it allows us to develop and test upgrade paths, introduce
new features and evolve indexing over time. It isn’t enough to introduce
more resilient implementations; we also have to ensure that users take
advantage of them when they upgrade.

You can read more about backwards compatibility tests in issue
`#6497 <https://github.com/elasticsearch/elasticsearch/issues/6497>`__.

**Full Translog Writes on all Platforms (STATUS: DONE, v1.2.2 and
v1.3.0)**

We have recently received bug reports of transaction log corruption that
can occur when indexing very large documents (in the area of 300 KB).
Although some Linux users reported this behavior, it appears the problem
occurs more frequently when running Windows. We traced the source of the
problem to the fact that when serializing documents to the transaction
log, the Operating System can actually write only part of the document
before returning from the write call. We can now detect this situation
and make sure that the entire document is properly written. You can read
the full details in pull request
`#6576 <https://github.com/elasticsearch/elasticsearch/issues/6576>`__.
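
The general remedy for short writes is to loop until every byte has
been accepted. The sketch below shows that pattern in Python using
``os.write``, which returns the number of bytes actually written. It is
an analogy only: Elasticsearch is written in Java, and ``write_fully``
and its ``write`` parameter are names invented for this example.

```python
import os
from typing import Callable


def write_fully(fd: int, data: bytes,
                write: Callable[[int, bytes], int] = os.write) -> None:
    """Keep writing until the whole buffer has been handed to the OS.

    A single write call may accept only part of the buffer, so the
    remainder is retried until nothing is left.
    """
    view = memoryview(data)
    while len(view) > 0:
        written = write(fd, view)
        view = view[written:]
```

The ``write`` parameter exists so that a test can substitute a function
that deliberately accepts only a few bytes per call.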

**Lucene Checksums (STATUS: DONE, v1.2.0)**

Before Apache Lucene version 4.8, checksums were not computed on
generated index files. The result was that it was difficult to identify
when or if a Lucene index got corrupted, whether by hardware failure,
JVM bug or for an entirely different reason.

For an idea of the checksum efforts in progress in Apache Lucene, see
issues
`LUCENE-2446 <https://issues.apache.org/jira/browse/LUCENE-2446>`__,
`LUCENE-5580 <https://issues.apache.org/jira/browse/LUCENE-5580>`__ and
`LUCENE-5602 <https://issues.apache.org/jira/browse/LUCENE-5602>`__. The
gist is that Lucene 4.8+ now computes full checksums on all index files
and it verifies them when opening metadata or other smaller files as
well as other files during merges.

**Detect errors faster by locally failing a shard upon an indexing error
(STATUS: DONE, v1.2.0)**

Previously, Elasticsearch notified the master of the shard failure and
waited for the master to close the local copy of the shard, thus
assigning it to other nodes. This architecture caused delays in failure
detection, potentially causing unneeded failures of other incoming
requests. In rare cases, such as race conditions or certain network
partition configurations, we could lose these failure
notifications. We solved this issue by locally failing shards upon
indexing errors. See issue
`#5847 <https://github.com/elasticsearch/elasticsearch/issues/5847>`__.

**Snapshot/Restore API (STATUS: DONE, v1.0.0)**

In Elasticsearch version 1.0, we significantly improved the backup
process by introducing the Snapshot/Restore API. While it was always
possible to make backups of Elasticsearch, the Snapshot/Restore API made
the backup process much easier.

The backup process is incremental, making it very efficient since only
files changed since the last backup are copied. Even with this
efficiency introduced, each snapshot contains a full picture of the
cluster at the moment when backup started. The restore API allows speedy
recovery of a full cluster as well as selected indices.

Since that first release in version 1.0, the API has continued to
evolve. In version 1.1.0, we added a new snapshot status API that allows
users to monitor the snapshot process. In 1.3.0 we added the ability to
`restore indices without their
aliases <https://github.com/elasticsearch/elasticsearch/issues/6457>`__
and in 1.4 we are planning to add the ability to `restore partial
snapshots <https://github.com/elasticsearch/elasticsearch/issues/6368>`__.

The Snapshot/Restore API supports a number of different repository types
for storing backups. Currently, it’s possible to make backups to a
shared file system, Amazon S3, HDFS, and Azure storage. We are
continuing to work on adding other types of storage systems, as well as
improving the robustness of the snapshot/restore process.

**Circuit Breaker: Fielddata (STATUS: DONE, v1.0.0)**

Currently, the `circuit
breaker <http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html>`__
protects against loading too much field data by estimating how much
memory the field data will take to load, then aborting the request if
the memory requirements are too high. This feature was added in
Elasticsearch version 1.0.0.

**Use of Paginated Data Structures to Ease Garbage Collection (STATUS:
DONE, v1.0.0 & v1.2.0)**

Elasticsearch has moved from an object-based cache to a page-based cache
recycler as described in issue
`#4557 <https://github.com/elasticsearch/elasticsearch/issues/4557>`__.
This change makes garbage collection easier by limiting fragmentation,
since all pages have the same size and are recycled. It also allows
managing the size of the cache not based on the number of objects it
contains, but on the memory that it uses.

These pages are used for two main purposes: implementing higher level
data structures such as hash tables, which aggregations use internally
to map terms to counts (for example), and reusing memory in the
translog/transport layer as detailed in issue
`#5691 <https://github.com/elasticsearch/elasticsearch/issues/5691>`__.
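
A page-based recycler can be sketched in a few lines. Every page has the
same size, released pages are kept for reuse, and the pool is bounded by
the memory it retains rather than by an object count. The class and
method names are hypothetical.

```python
from collections import deque
from typing import Deque


class PageRecycler:
    """Hands out fixed-size pages and keeps a bounded pool for reuse."""

    def __init__(self, page_size: int, max_pages: int) -> None:
        self.page_size = page_size
        self.max_pages = max_pages
        self._free: Deque[bytearray] = deque()

    def obtain(self) -> bytearray:
        """Reuse a pooled page if one is available, otherwise allocate."""
        if self._free:
            return self._free.pop()
        return bytearray(self.page_size)

    def release(self, page: bytearray) -> None:
        """Return a page to the pool, or drop it if the pool is full."""
        if len(self._free) < self.max_pages:
            page[:] = bytes(self.page_size)  # zero the page before reuse
            self._free.append(page)

    def retained_bytes(self) -> int:
        """Memory held by the pool, bounded by page_size * max_pages."""
        return len(self._free) * self.page_size
```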

**Dedicated Master Nodes Resiliency (STATUS: DONE, v1.0.0)**

In order to run a more resilient cluster, we recommend running dedicated
master nodes to ensure master nodes are not affected by resources
consumed by data nodes. We have also made master nodes more resilient to
heavy resource usage, mainly associated with large clusters / cluster
states.

These changes include:

-  Improve the balancing algorithm to execute faster across large
   clusters / many indices. (See issue
   `#4458 <https://github.com/elasticsearch/elasticsearch/issues/4458>`__
   and
   `#4459 <https://github.com/elasticsearch/elasticsearch/issues/4459>`__)

-  Improve cluster state publishing to not create an additional network
   buffer per node. (More in `this
   commit <https://github.com/elasticsearch/elasticsearch/commit/a9e259d438c3cb1d3bef757db2d2a91cf85be609>`__.)

-  Improve master handling of large scale mapping updates from data
   nodes by batching them into a single cluster event. (See issue
   `#4373 <https://github.com/elasticsearch/elasticsearch/issues/4373>`__.)

-  Add an ack mechanism where next phase cluster updates are processed
   only when nodes acknowledged they received the previous cluster
   state. (See issues
   `#3736 <https://github.com/elasticsearch/elasticsearch/issues/3736>`__,
   `#3786 <https://github.com/elasticsearch/elasticsearch/issues/3786>`__,
   `#4114 <https://github.com/elasticsearch/elasticsearch/issues/4114>`__,
   `#4169 <https://github.com/elasticsearch/elasticsearch/issues/4169>`__,
   `#4228 <https://github.com/elasticsearch/elasticsearch/issues/4228>`__
   and
   `#4421 <https://github.com/elasticsearch/elasticsearch/issues/4421>`__,
   which also include enhancements to the ack mechanism implementation.)

**Multi Data Paths May Falsely Report Corrupt Index (STATUS: DONE,
v1.0.0)**

When using multiple data paths, an index could be falsely reported as
corrupted. This has been fixed with pull request
`#4674 <https://github.com/elasticsearch/elasticsearch/issues/4674>`__.

**Randomized Testing (STATUS: DONE, v1.0.0)**

In order to best validate for resiliency in Elasticsearch, we rewrote
the Elasticsearch test infrastructure to introduce the concept of
`randomized
testing <http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/dawidweiss-randomizedtesting-pub.pdf>`__.
Randomized testing allows us to easily enhance the Elasticsearch testing
infrastructure with predictably irrational conditions, making the
resulting code base more resilient.

Each of our integration tests runs against a cluster with a random
number of nodes, and indices have a random number of shards and
replicas. Merge settings change for every run, indexing is done in
serial or async fashion or even wrapped in a bulk operation and thread
pool sizes vary to ensure that we don’t produce a deadlock no matter
what happens. The list of places we use this randomization
infrastructure is long, and growing every day, and has saved us
headaches several times before we shipped a particular feature.
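
The key property of this kind of randomization is that it is driven by
a seed, so a failing configuration can be reproduced exactly. The
following sketch illustrates the idea with made-up ranges and names; it
is not the Elasticsearch test framework.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    nodes: int
    shards: int
    replicas: int
    bulk_indexing: bool


def random_cluster_config(seed: int) -> ClusterConfig:
    """Derive a test cluster layout from a seed; same seed, same layout."""
    rng = random.Random(seed)
    nodes = rng.randint(1, 6)
    return ClusterConfig(
        nodes=nodes,
        shards=rng.randint(1, 10),
        replicas=rng.randint(0, nodes - 1),  # never more replicas than other nodes
        bulk_indexing=rng.random() < 0.5,
    )
```

A test run would log the seed it used, and rerunning with that seed
recreates the same cluster size, shard layout and indexing mode.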

At Elasticsearch, we live the philosophy that we can miss a bug once,
but never a second time. We make our tests more evil as we go,
introducing randomness in all the areas where we discovered bugs. We
figure if our tests don’t fail, we are not trying hard enough! If you are
interested in how we have evolved our test infrastructure over time,
check out `issues labeled with “test” on
GitHub <https://github.com/elasticsearch/elasticsearch/issues?q=label%3Atest>`__.

**Lucene Loses Data On File Descriptors Failure (STATUS: DONE,
v0.90.0)**

When a process runs out of file descriptors, Lucene can cause an index
to be completely deleted. This issue was fixed in Lucene (`version
4.2.1 <https://issues.apache.org/jira/browse/LUCENE-4870>`__) and fixed
in an early version of Elasticsearch. See issue
`#2812 <https://github.com/elasticsearch/elasticsearch/issues/2812>`__.
