Franck Pachot for MongoDB

Strong consistency 👉🏻 MongoDB highly available durable writes (Replication)

In the previous post, I used strace to display all calls to write and sync to disk from any MongoDB server thread:

strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0

Adding replicas for High Availability

I did this with a single server, started with the Atlas CLI. Let's do the same on a replica set with three servers, started with the following Docker Compose file:

services:  

  mongo-1:  
    image: mongo:8.0.10  
    ports:  
      - "27017:27017"  
    volumes:  
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro  
      - mongo-data-1:/data/db  
    command: mongod --bind_ip_all --replSet rs0  
    networks:  
      - mongoha

  mongo-2:  
    image: mongo:8.0.10  
    ports:  
      - "27018:27017"  
    volumes:  
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro  
      - mongo-data-2:/data/db  
    command: mongod --bind_ip_all --replSet rs0 
    networks:  
      - mongoha

  mongo-3:  
    image: mongodb/mongodb-community-server:latest  
    ports:  
      - "27019:27017"  
    volumes:  
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro  
      - mongo-data-3:/data/db  
    command: mongod --bind_ip_all --replSet rs0   
    networks:  
      - mongoha

  init-replica-set:  
    image: mongodb/mongodb-community-server:latest  
    depends_on:  
      - mongo-1  
      - mongo-2  
      - mongo-3  
    entrypoint: |  
      bash -xc '  
        sleep 10  
        mongosh --host mongo-1 --eval "  
         rs.initiate( {_id: \"rs0\", members: [  
          {_id: 0, priority: 3, host: \"mongo-1:27017\"},  
          {_id: 1, priority: 2, host: \"mongo-2:27017\"},  
          {_id: 2, priority: 1, host: \"mongo-3:27017\"}]  
         });  
        "  
      '     
    networks:  
      - mongoha

volumes:  
  mongo-data-1:  
  mongo-data-2:  
  mongo-data-3:  

networks:  
  mongoha:  
    driver: bridge  

I started this with docker compose up -d ; sleep 10, then ran the strace command again and connected to the primary node with docker compose exec -it mongo-1 mongosh.
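
Before running the workload, a quick check that the replica set is healthy and that mongo-1, which has the highest priority, was elected primary. This is just a small helper over the standard rs.status() output:

rs.status().members.map(m => ({ name: m.name, state: m.stateStr }))
// Expect mongo-1:27017 as PRIMARY and the two others as SECONDARY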

Run some transactions

I executed the same workload as in the previous post, with ten writes to a collection:

db.mycollection.drop();
db.mycollection.insert( { _id: 1, num:0 });

for (let i = 1; i <= 10; i++) {
 print(` ${i} ${new Date()}`)
 db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
 print(` ${i} ${new Date()}`)
}

 1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
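
All ten updates were acknowledged within the same second. Since no write concern was specified, the cluster-wide default applies; it can be checked with the getDefaultRWConcern admin command (the exact output may vary, but on a plain replica set the implicit default write concern is w: "majority"):

db.adminCommand({ getDefaultRWConcern: 1 })
// Expect something like: { defaultWriteConcern: { w: "majority", wtimeout: 0 }, ... }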

Here is the strace output during this:

[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61184) = 512 <0.000086>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55808) = 384 <0.000097>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000656>
[pid  8786] 10:05:38 <... fdatasync resumed>) = 0 <0.002739>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54528) = 384 <0.000129>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61696) = 512 <0.000094>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001070>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56192) = 384 <0.000118>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000927>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54912) = 384 <0.000112>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000687>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62208) = 512 <0.000066>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000717>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56576) = 384 <0.000095>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000745>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55296) = 384 <0.000063>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000782>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62720) = 512 <0.000084>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56960) = 384 <0.000080>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000814>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55680) = 384 <0.000365>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000747>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63232) = 512 <0.000096>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000724>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57344) = 384 <0.000108>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001432>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56064) = 384 <0.000118>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000737>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63744) = 512 <0.000061>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000636>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57728) = 384 <0.000070>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000944>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56448) = 384 <0.000105>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64256) = 512 <0.000092>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000742>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58112) = 384 <0.000067>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000704>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56832) = 384 <0.000152>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000732>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64768) = 512 <0.000061>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58496) = 384 <0.000062>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000653>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57216) = 384 <0.000102>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001502>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65280) = 512 <0.000072>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58880) = 384 <0.000123>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid  8786] 10:05:38 <... fdatasync resumed>) = 0 <0.001538>
[pid  8736] 10:05:38 <... fdatasync resumed>) = 0 <0.000625>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57600) = 384 <0.000084>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000847>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65792) = 512 <0.000060>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000661>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 59264) = 384 <0.000074>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000779>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57984) = 384 <0.000077>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000816>

I can see writes and syncs from three processes. Let's check which process belongs to which container:

for pid in 8736 8786 8889; do
  cid=$(grep -ao 'docker[-/][0-9a-f]\{64\}' /proc/$pid/cgroup | head -1 | grep -o '[0-9a-f]\{64\}')
  svc=$(docker inspect --format '{{ index .Config.Labels "com.docker.compose.service"}}' "$cid" 2>/dev/null)
  echo "PID: $pid -> Container ID: $cid -> Compose Service: ${svc:-<not-found>}"
done

PID: 8736 -> Container ID: 93e3ebd715867f1cd885d4c6191064ba0eb93b02c0884a549eec66026c459ac2 -> Compose Service: mongo-3
PID: 8786 -> Container ID: cf52ad45d25801ef1f66a7905fa0fb4e83f23376e4478b99dbdad03456cead9e -> Compose Service: mongo-1
PID: 8889 -> Container ID: c28f835a1e7dc121f9a91c25af1adfb1d823b667c8cca237a33697b4683ca883 -> Compose Service: mongo-2

This confirms that, by default, the write-ahead log (the WiredTiger journal) is synced to disk at commit on each replica, not only on the primary.
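
This behavior is controlled by the write concern. As a sketch, the defaults can be made explicit and contrasted with a relaxed setting (w: "majority" with journaling is the implicit default on a plain replica set; the wtimeout value here is arbitrary):

// Explicit equivalent of the default: wait for a majority of members
// to have journaled the write before acknowledging it.
db.mycollection.updateOne(
  { _id: 1 },
  { $inc: { num: 1 } },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
);

// Relaxed alternative: acknowledged once applied in memory on the primary,
// without waiting for replication or for fdatasync on the journal.
db.mycollection.updateOne(
  { _id: 1 },
  { $inc: { num: 1 } },
  { writeConcern: { w: 1, j: false } }
);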

Simulate one node failure

[pid 8786] is mongo-1 and it is my primary:

rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
... 
mongo-1:27017

I pause one replica to simulate a failure:

docker compose pause mongo-3

[+] Pausing 1/0
 ✔ Container pgbench-mongo-mongo-3-1  Paused

I run my updates again; they are not impacted by one replica being down:

rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
mongo-1:27017

rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
...  print(` ${i} ${new Date()}`)
...  db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
...  print(` ${i} ${new Date()}`)
... }
...
 1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
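
The default majority write concern needs only two of the three members to acknowledge, which is why the latency is unchanged. Asking for all three members cannot be satisfied while mongo-3 is paused; here is a hypothetical sketch where wtimeout turns the indefinite wait into an error (the update itself may still be applied and replicate later):

db.mycollection.updateOne(
  { _id: 1 },
  { $inc: { num: 1 } },
  { writeConcern: { w: 3, wtimeout: 1000 } }
);
// Expect a write concern error such as "waiting for replication timed out"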

Simulate two node failures

I pause another replica:

docker compose pause mongo-2

[+] Pausing 1/0
 ✔ Container demo-mongo-2-1  Paused    

As there's no quorum anymore, with only one member of the three-member replica set still running, the primary steps down and can no longer accept writes:

rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
...  print(` ${i} ${new Date()}`)
...  db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
...  print(` ${i} ${new Date()}`)
... }
 1 Mon Jun 30 2025 09:28:36 GMT+0000 (Coordinated Universal Time)
MongoServerError[NotWritablePrimary]: not primary

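The same connection confirms the step-down; db.hello() now reports this node as a secondary:

const h = db.hello();
printjson({ isWritablePrimary: h.isWritablePrimary, secondary: h.secondary });
// Expect { isWritablePrimary: false, secondary: true }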

Reads from secondary

The node that remains is now a secondary and exposes the last writes acknowledged by the majority:

rs0 [direct: secondary] test> db.mycollection.find()

[ { _id: 1, num: 20 } ]

rs0 [direct: secondary] test> db.mycollection.find().readConcern("majority")  

[ { _id: 1, num: 20 } ]
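
Stronger guarantees are not available here: a linearizable read is only served by a primary that can reach a majority, so on this node it should be rejected or time out (maxTimeMS avoids waiting indefinitely):

db.mycollection.find({ _id: 1 })
  .readConcern("linearizable")
  .maxTimeMS(2000);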

If the other nodes restart but are isolated from this secondary, the secondary still shows the same state: timeline-consistent, but possibly stale, reads.

I simulate that by disconnecting this node from the network and restarting the others:

docker network disconnect demo_mongoha demo-mongo-1-1
docker unpause demo-mongo-2-1
docker unpause demo-mongo-3-1

As the two other nodes form a quorum, they elect a primary that accepts the writes:

-bash-4.2# docker compose exec -it mongo-2 mongosh
Current Mongosh Log ID: 686264bd3e0326801369e327
Connecting to:          mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.5.2
Using MongoDB:          8.0.10
Using Mongosh:          2.5.2

rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
...  print(` ${i} ${new Date()}`)
...  db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
...  print(` ${i} ${new Date()}`)
... }
 1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 5 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 5 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 6 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 6 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 7 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 7 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 8 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 8 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 9 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 9 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 10 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 10 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)

rs0 [direct: primary] test> db.mycollection.find().readConcern("majority")

[ { _id: 1, num: 30 } ]

rs0 [direct: primary] test> 


Still, the disconnected secondary serves the last state it knows:

rs0 [direct: primary] test> exit
-bash-4.2# docker compose exec -it mongo-1 mongosh
Current Mongosh Log ID: 6862654959d1c9bbde69e327
Connecting to:          mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.5.2
Using MongoDB:          8.0.10
Using Mongosh:          2.5.2

rs0 [direct: secondary] test> db.mycollection.find().readConcern("majority")  
[ { _id: 1, num: 20 } ]
rs0 [direct: secondary] test> 

To avoid reading a stale state during a network failure, set the read preference to primary. This ensures you read from the primary, which holds the most current data. Guaranteeing the latest state on all secondary nodes would require halting writes on the primary during network failures, hurting availability.
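
In mongosh, the read preference can be set on the connection or per operation; here is a minimal sketch using the standard helpers:

// Read from the primary only: fails during a failover rather than returning stale data
db.getMongo().setReadPref("primary");
db.mycollection.find({ _id: 1 });

// Or per operation: prefer the primary, but fall back to a secondary if none is reachable
db.mycollection.find({ _id: 1 }).readPref("primaryPreferred");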

Most distributed databases use a similar replication strategy where write operations are synchronized to a quorum to ensure durability and resilience against failures. Waiting for acknowledgment from the quorum allows the system to maintain high availability, while ensuring that data is safely stored on multiple nodes and committed changes are synchronized to persistent storage.
