In the previous post, I used strace
to display all calls to write and sync to disk from any MongoDB server thread:
strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0
Adding replicas for High Availability
I did this with a single server, started with Atlas CLI. Let's do the same on a replicaset with three servers. I start it with the following Docker Compose:
services:
  mongo-1:
    image: mongo:8.0.10
    ports:
      - "27017:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-1:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  mongo-2:
    image: mongo:8.0.10
    ports:
      - "27018:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-2:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  mongo-3:
    image: mongodb/mongodb-community-server:latest
    ports:
      - "27019:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-3:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  init-replica-set:
    image: mongodb/mongodb-community-server:latest
    depends_on:
      - mongo-1
      - mongo-2
      - mongo-3
    entrypoint: |
      bash -xc '
      sleep 10
      mongosh --host mongo-1 --eval "
        rs.initiate( {_id: \"rs0\", members: [
          {_id: 0, priority: 3, host: \"mongo-1:27017\"},
          {_id: 1, priority: 2, host: \"mongo-2:27017\"},
          {_id: 2, priority: 1, host: \"mongo-3:27017\"}]
        });
      "
      '
    networks:
      - mongoha

volumes:
  mongo-data-1:
  mongo-data-2:
  mongo-data-3:

networks:
  mongoha:
    driver: bridge
I started this with docker compose up -d ; sleep 10 and then ran the strace command above. I connected to the primary node with docker compose exec -it mongo-1 mongosh.
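Once connected, a quick sanity check that the replica set was initiated with the intended members and priorities, as a minimal sketch reading rs.conf():
// list each member's host and election priority from the replica set configuration
rs.conf().members.map(m => ({ host: m.host, priority: m.priority }))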
Run some transactions
I executed the same workload as in the previous post, with ten writes to a collection:
db.mycollection.drop();
db.mycollection.insert( { _id: 1, num:0 });
for (let i = 1; i <= 10; i++) {
print(` ${i} ${new Date()}`)
db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
print(` ${i} ${new Date()}`)
}
1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
Here is the strace output during this:
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61184) = 512 <0.000086>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55808) = 384 <0.000097>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000656>
[pid 8786] 10:05:38 <... fdatasync resumed>) = 0 <0.002739>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54528) = 384 <0.000129>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61696) = 512 <0.000094>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001070>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56192) = 384 <0.000118>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000927>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54912) = 384 <0.000112>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000687>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62208) = 512 <0.000066>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000717>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56576) = 384 <0.000095>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000745>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55296) = 384 <0.000063>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000782>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62720) = 512 <0.000084>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56960) = 384 <0.000080>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000814>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55680) = 384 <0.000365>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000747>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63232) = 512 <0.000096>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000724>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57344) = 384 <0.000108>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001432>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56064) = 384 <0.000118>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000737>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63744) = 512 <0.000061>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000636>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57728) = 384 <0.000070>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000944>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56448) = 384 <0.000105>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64256) = 512 <0.000092>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000742>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58112) = 384 <0.000067>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000704>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56832) = 384 <0.000152>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000732>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64768) = 512 <0.000061>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58496) = 384 <0.000062>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000653>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57216) = 384 <0.000102>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001502>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65280) = 512 <0.000072>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58880) = 384 <0.000123>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8786] 10:05:38 <... fdatasync resumed>) = 0 <0.001538>
[pid 8736] 10:05:38 <... fdatasync resumed>) = 0 <0.000625>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57600) = 384 <0.000084>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000847>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65792) = 512 <0.000060>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000661>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 59264) = 384 <0.000074>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000779>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57984) = 384 <0.000077>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000816>
I can see writes and syncs from three processes. Let's check which process belongs to which container:
for pid in 8736 8786 8889; do
cid=$(grep -ao 'docker[-/][0-9a-f]\{64\}' /proc/$pid/cgroup | head -1 | grep -o '[0-9a-f]\{64\}')
svc=$(docker inspect --format '{{ index .Config.Labels "com.docker.compose.service"}}' "$cid" 2>/dev/null)
echo "PID: $pid -> Container ID: $cid -> Compose Service: ${svc:-<not-found>}"
done
PID: 8736 -> Container ID: 93e3ebd715867f1cd885d4c6191064ba0eb93b02c0884a549eec66026c459ac2 -> Compose Service: mongo-3
PID: 8786 -> Container ID: cf52ad45d25801ef1f66a7905fa0fb4e83f23376e4478b99dbdad03456cead9e -> Compose Service: mongo-1
PID: 8889 -> Container ID: c28f835a1e7dc121f9a91c25af1adfb1d823b667c8cca237a33697b4683ca883 -> Compose Service: mongo-2
This confirms that by default, the WAL is synced to disk at commit on each replica and not only on the primary.
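Since MongoDB 5.0 the default write concern is { w: "majority" }, and with writeConcernMajorityJournalDefault left at true a majority-acknowledged write also waits for the journal. To make the requirement explicit instead of relying on those defaults, a write can request journaled acknowledgment from every member. This is a minimal sketch, not part of the workload above, and the wtimeout value is an arbitrary choice:
db.mycollection.updateOne(
  { _id: 1 },
  { $inc: { num: 1 } },
  // wait until the update is journaled (j: true) on all three members (w: 3),
  // or fail after 5 seconds instead of blocking indefinitely
  { writeConcern: { w: 3, j: true, wtimeout: 5000 } }
);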
Simulate one node failure
[pid 8786] is mongo-1, and it is my primary:
rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
...
mongo-1:27017
I pause one replica:
docker compose pause mongo-3
[+] Pausing 1/0
✔ Container pgbench-mongo-mongo-3-1 Paused
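Before running the workload again, a quick way to see how the replica set views the paused member is to list each member's reported state from mongosh, a sketch using rs.status():
// name, state and health as reported by the primary for each member
rs.status().members.map(m => ({ name: m.name, state: m.stateStr, health: m.health }))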
I run my updates again; they are not impacted by one replica being down, as the two remaining members still form a majority:
rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
mongo-1:27017
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
...
1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
Simulate a two-node failure
I pause another replica:
docker compose pause mongo-2
[+] Pausing 1/0
✔ Container demo-mongo-2-1 Paused
As there is no quorum anymore, with only one member reachable out of a three-member replica set, the primary steps down and no longer accepts writes:
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
1 Mon Jun 30 2025 09:28:36 GMT+0000 (Coordinated Universal Time)
MongoServerError[NotWritablePrimary]: not primary
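The connection is still open, but the node now reports itself as not writable. A quick check from the same mongosh session, as a sketch using the hello command:
// db.hello() describes the node's current role; after the step-down,
// isWritablePrimary should be false and secondary should be true
const h = db.hello();
print(h.isWritablePrimary, h.secondary);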
Reads from secondary
The node that remains is now a secondary and exposes the last writes acknowledged by the majority:
rs0 [direct: secondary] test> db.mycollection.find()
[ { _id: 1, num: 20 } ]
rs0 [direct: secondary] test> db.mycollection.find().readConcern("majority")
[ { _id: 1, num: 20 } ]
If the other nodes restart but are isolated from this secondary, the secondary still serves the same state: consistent within its own timeline, but potentially stale. I simulate that by disconnecting this node from the network and restarting the others:
docker network disconnect demo_mongoha demo-mongo-1-1
docker unpause demo-mongo-2-1
docker unpause demo-mongo-3-1
As the other two form a quorum, they elect a primary that accepts writes:
-bash-4.2# docker compose exec -it mongo-2 mongosh
Current Mongosh Log ID: 686264bd3e0326801369e327
Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.5.2
Using MongoDB: 8.0.10
Using Mongosh: 2.5.2
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
rs0 [direct: primary] test> db.mycollection.find().readConcern("majority")
[ { _id: 1, num: 30 } ]
rs0 [direct: primary] test>
Still, the disconnected secondary serves the last state it knows:
rs0 [direct: primary] test> exit
-bash-4.2# docker compose exec -it mongo-1 mongosh
Current Mongosh Log ID: 6862654959d1c9bbde69e327
Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.5.2
Using MongoDB: 8.0.10
Using Mongosh: 2.5.2
rs0 [direct: secondary] test> db.mycollection.find().readConcern("majority")
[ { _id: 1, num: 20 } ]
rs0 [direct: secondary] test>
To avoid reading a stale state during a network failure, set the read preference to primary. This ensures reads go to the primary, which has the most current data. Guaranteeing that every secondary always serves the latest state would require halting writes on the primary during network failures, hurting availability.
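A minimal sketch of both ways to pin reads to the primary; the connection string is hypothetical, using the host names from this compose file:
// per-query: route this read to the primary
db.mycollection.find().readPref("primary")
// per-connection: readPreference=primary (the default) with replica-set discovery,
// so reads fail rather than return stale data when no primary is reachable
// mongosh "mongodb://mongo-1,mongo-2,mongo-3/?replicaSet=rs0&readPreference=primary"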
Most distributed databases use a similar replication strategy where write operations are synchronized to a quorum to ensure durability and resilience against failures. Waiting for acknowledgment from the quorum allows the system to maintain high availability, while ensuring that data is safely stored on multiple nodes and committed changes are synchronized to persistent storage.
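As a sketch of that trade-off in MongoDB terms, a majority-acknowledged write with a timeout surfaces the loss of quorum to the application instead of blocking indefinitely; the 2-second timeout is an arbitrary choice:
try {
  db.mycollection.updateOne(
    { _id: 1 },
    { $inc: { num: 1 } },
    // acknowledged only once a majority of members have the change
    { writeConcern: { w: "majority", wtimeout: 2000 } }
  );
} catch (e) {
  // on a primary that cannot reach a majority, this reports a write concern error;
  // if the node has already stepped down, it fails with NotWritablePrimary instead
  print(e.message);
}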