How does ZooKeeper watch for nodes going down?
My question might seem silly, but I cannot figure out how ZooKeeper watches for changes.
One of the things ZooKeeper handles is leader election. Assume I have a scenario with three Redis instances, one of which is the master and two of which are slaves, and I don't want to send commands to the master when it's down.
My first question is: how can ZooKeeper figure out that the Redis master is down?
The second scenario is that one of the ZooKeeper instances is down. Assume we have 5 instances of ZooKeeper and node 1 is down. What happens if the application tries to connect to node 1?
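A minimal sketch of the usual mechanism, assuming the plain ZooKeeper Java client (the paths, hosts, and timeout below are illustrative assumptions, not from the question): ZooKeeper doesn't probe Redis itself; instead, a process on the master's side holds a ZooKeeper session and creates an ephemeral znode, which ZooKeeper deletes when that session dies, firing any watches set on it. And because the connection string lists every ensemble member, the client simply tries the next host when node 1 is down.

import org.apache.zookeeper.*;

public class MasterWatch {
    public static void main(String[] args) throws Exception {
        // All five ensemble members are listed, so connecting is never
        // hard-wired to node 1; the client skips unreachable hosts.
        ZooKeeper zk = new ZooKeeper(
                "node1:2181,node2:2181,node3:2181,node4:2181,node5:2181",
                15000, event -> {});

        // Watch the ephemeral znode that the master's session keeps alive
        // (hypothetical path). Watches are one-shot: re-register after
        // each event.
        Watcher onDelete = event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                System.out.println("Redis master gone - trigger failover");
            }
        };
        zk.exists("/redis/master", onDelete);
    }
}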
See also questions close to this topic
NiFi / ZooKeeper - Containerization and rolling deploys
Hoping someone can throw a small lifeline here, as I'm deep into a rabbit hole searching for the proper method / best practice.
I'm attempting to stand up a NiFi / ZooKeeper service on AWS ECS (Fargate). The basic setup is pretty straightforward; the thing tripping me up is rolling deploys and maintaining state.
If we have shared state (EFS/other) between the NiFi instances for a single task, when the new deploy starts up, will NiFi run duplicate jobs, since the old and new instances are identical services looking at the same state until the original falls over? If so, this sounds bad. What are the possible solutions?
Same type of question with ZooKeeper. If we have a 3-node cluster, on a deploy each node will have a new and an existing instance running until the existing one is fully terminated. Since ECS controls which instance takes traffic, this seems less problematic than the NiFi concern above.
If anyone has ideas or feedback on this, it would be most appreciated.
How to transactionally update a node only if another one is absent?
I understand that the transactional operations are only:
- Create a node
- Delete a node
- Update a node
- Check if a node exists
So I can start a transaction composed of one or more of the operations I mentioned, and they must all succeed for the transaction to commit.
But what I need to do is different.
In the event of an ephemeral node's removal (i.e. a client goes down), I need to update the data of another node, but only if the removed node is still removed. The reason is that I need to make sure the client didn't reconnect to ZooKeeper in the meantime (a race condition).
So practically, I need an extra transactional operation that checks if a node is absent. But that operation doesn't exist.
Is there a transactional way to achieve the same goal (i.e. update a node if another node doesn't exist)?
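For what it's worth, one known workaround (a sketch, with hypothetical paths and data): assert absence inside a multi() by creating the node and immediately deleting it again in the same transaction. If the node exists at commit time, the create fails with NodeExists and the whole transaction aborts, so the setData never runs.

import java.util.Arrays;
import java.util.List;
import org.apache.zookeeper.*;

public class UpdateIfAbsent {
    // Update /election/state only if /election/member-1 is absent.
    static void updateIfAbsent(ZooKeeper zk, byte[] newData, int expectedVersion)
            throws KeeperException, InterruptedException {
        List<Op> ops = Arrays.asList(
            // Fails if the client reconnected and recreated its node,
            // aborting the entire transaction.
            Op.create("/election/member-1", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
            // Clean up the marker created above.
            Op.delete("/election/member-1", -1),
            // The actual update, atomic with the absence check.
            Op.setData("/election/state", newData, expectedVersion)
        );
        zk.multi(ops);
    }
}

One caveat: this only guarantees absence at commit time; if the client reconnects right after the multi() succeeds, the update has still happened.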
Java Curator/ZooKeeper client hanging indefinitely when ZooKeeper is not available on some executions
While repeatedly testing Curator service discovery startup, on some executions where the ZooKeeper server is not available at start time, it hangs indefinitely, with the thread dump stack trace shown below. An example scenario: start while the ZooKeeper server is unavailable, after which ZooKeeper becomes available - but the Java thread remains hung. This is also reproducible with the Curator test server, org.apache.curator.test.TestingServer.
ZooKeeper version is 3.6.
As this does not reproduce consistently, it looks like a race condition in the Curator/ZooKeeper client.
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1561)
org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1533)
org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1834)
org.apache.curator.framework.imps.CreateBuilderImpl$16.call(CreateBuilderImpl.java:1131)
org.apache.curator.framework.imps.CreateBuilderImpl$16.call(CreateBuilderImpl.java:1113)
org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1110)
org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:593)
org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:583)
org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48)
org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.internalRegisterService(ServiceDiscoveryImpl.java:237)
org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.reRegisterServices(ServiceDiscoveryImpl.java:456)
org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.start(ServiceDiscoveryImpl.java:135)
...
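Not a fix for the underlying race, but a common way to bound the blocking (a sketch assuming a standard CuratorFramework setup; the connection string and timeout are illustrative): wait for a connection with a timeout before starting service discovery, rather than letting the foreground create() block forever.

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BoundedStart {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Bound the wait instead of hanging inside submitRequest().
        if (!client.blockUntilConnected(30, TimeUnit.SECONDS)) {
            client.close();
            throw new IllegalStateException("ZooKeeper unreachable; not starting discovery");
        }
        // ... safe to call serviceDiscovery.start() from here ...
    }
}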
Connect to HDFS HA (High Availability) from Scala
I have Scala code that can currently connect to HDFS through a single namenode (non-HA). The namenode, location, conf.location and Kerberos parameters are specified in a .conf file inside the Scala project. However, there is now a new cluster with HA (a standby and a primary namenode). Do you know how to configure the client in Scala to support both environments, non-HA and HA (with automatic switching of namenodes)?
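A sketch of the client-side settings that enable HA with automatic failover, using the standard Hadoop Configuration API (the calls are identical from Scala); the nameservice id and hostnames are hypothetical placeholders you would load from your .conf file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHaClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A logical nameservice replaces the single namenode address.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // Proxy provider that fails over to the other namenode automatically.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/")));
    }
}

In the non-HA environment, fs.defaultFS simply points at the single namenode and the HA keys are omitted, so both setups can be driven from the same .conf file.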
RabbitMQ Fetch from Closest Replica
In a cluster scenario with mirrored queues, is there a way for consumers to consume/fetch data from a mirrored queue/Slave node instead of always reaching out to the master node?
If you think about scalability, having all consumers call the single node responsible for being the master of a specific queue means all traffic goes to one node.
Kafka allows consumers to fetch data from the closest node if that node contains a replica of the leader, is there something similar on RabbitMQ?
Kafka scalability if consuming from slave node
In a cluster scenario with a replication factor > 1, why must we always consume from the master/leader of a partition instead of being able to consume from a slave/follower node that holds a replica of it?
I understand that Kafka will always route the request to the leader node (of that particular partition/topic), but doesn't this affect scalability, since all requests go to a single node? Wouldn't it be better if we could read from any node containing the replica and not necessarily the leader?
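Worth noting: since Kafka 2.4 (KIP-392), consumers can fetch from followers if the brokers are configured with a replica selector and the consumer declares its rack. A sketch, with hypothetical broker addresses and rack names:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RackAwareConsumer {
    public static void main(String[] args) {
        // Broker side (server.properties), shown here only as comments:
        //   broker.rack=us-east-1a
        //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        // Tells the broker which replica is closest to this consumer.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // ... subscribe and poll as usual; fetches may now hit a follower ...
    }
}

Writes still go through the leader; follower fetching only applies to consumption.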
System Design, Low level Design, High Level Design - never ending questions
I have gone through System Design, Low Level Design & High Level Design questions. There are so many that I wonder how to prepare them thoroughly in the limited time available for interview preparation. I know the systematic process to follow for quality preparation, but again, it takes time.
Please suggest your approach or idea on this.
When does one have to call share_memory_() in Pytorch when using distributed training?
I want to train things in parallel/distributed on 1 machine using multiple CPUs or GPUs. Thus my question is: when does one ever run:
# I DON'T THINK THIS IS NEEDED, since you don't need to share data...
# each process loads its own stuff and uses it
x, y = x.share_memory_(), y.share_memory_()
As my comment says, my belief is that .share_memory_() is never actually needed in PyTorch for data, because:
- when using GPUs you have to move the data to the correct GPU anyway, so you would do that move in each process regardless
- when using CPUs, each process has a copy of its own data (sending data is expensive in distributed training as far as I understand), so there is no need to have it "shared" with .share_memory_(), since each process is already reading the same data from disk (or something similar to that, I assume)
Are these assumptions correct? Is it true that for data we never need .share_memory_()? Btw the docs say:
Moves the underlying storage to shared memory.
This is a no-op if the underlying storage is already in shared memory and for CUDA tensors. Tensors in shared memory cannot be resized.
Note, however, that if one is using CPUs only, then for the model I think we do need:
else:
    # if we want multiple cpus, just make sure the model is shared properly
    # across the cpus with share_memory(); note that this op is a no-op if
    # it's already in shared memory
    model = model.share_memory()
    ddp_model = DDP(model)  # I think removing the device ids should be fine...?
Calling external services (Rest api)/database CRUD operations within Orleans Grain
I am using the Orleans framework in one of my projects. This is the pattern of calls from our grains:
Domain Service -> calls CustomerGrain.Save()
  Call 1: CustomerService.Save (external service) -> saves customer in SQL Server
  Call 2: ProfileService.Save (external service) -> saves customer recent activity in SQL Server
  Call 3: Repo.Save
    Call 1 from Repo.Save -> calls stored proc 1 to save in SQL Server
    Call 2 from Repo.Save -> calls stored proc 2 to save in SQL Server
- All the calls are using asynchronous code.
- Usually external service (using HttpClient) or db operation calls take <1s for all operations, but a few calls may take longer (<30s) under load.
- SiloMessagingOptions.ResponseTimeout has been increased as needed.
Is it OK to call external services/database operations within the grain? Would this call pattern cause any issues during high load, such as:
- Orleans task scheduler / thread pool starvation?
- Silo messaging timeouts?