Server!/Horror! I have a magnet and I don't mind using it!

Monitoring Thoughts

How would you scale monitoring, and how would you ensure that, with hundreds of thousands of events per minute, you’ll still get the important ones?

A lot of stuff is missing here. This is merely a note on what I think a scalable architecture for monitoring should look like. Also, one should be able to do math on the events!

On Agents

RULE: events generated by Agents are stateless

  1. Run a monitoring agent on each node!
  2. Each agent performs a number of tasks. These are specifically called tasks because they are not necessarily checks. Also, I associate checks with Nagios checks, and that is not what we want to do!

    A task does one thing, and one thing only:

    • do not create tasks that do what NimSoft does (CMD – CPU/Memory/Disk)
  3. Each task generates an event
    • Everything is an event!
    • A successful task is just a task that ran without programmatic errors
    • A failed task is one where a programmatic error occurred!
  4. Create a JSON string from the event
  5. Submit the JSON string to some messaging middleware (preferably RabbitMQ)
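
The agent steps above can be sketched roughly like this. It is only a sketch under my own assumptions: the field names (`host`, `task`, `status`, ...), the `root_disk_free` task and the `publish` callable are all made up for illustration. The real thing would hand the string to the middleware; here `publish` is just any function that accepts a string.

```python
import json
import shutil
import socket
import time
from typing import Callable


def run_task(name: str, task: Callable[[], dict]) -> dict:
    """Run one task and wrap its outcome in a self-contained (stateless) event."""
    event = {
        "host": socket.gethostname(),
        "task": name,
        "timestamp": time.time(),
    }
    try:
        event["data"] = task()
        event["status"] = "success"  # the task ran without a programmatic error
    except Exception as exc:
        event["status"] = "failed"   # a programmatic error occurred
        event["error"] = repr(exc)
    return event


def root_disk_free() -> dict:
    """Hypothetical task: it reports one thing only, free bytes on '/'."""
    return {"free_bytes": shutil.disk_usage("/").free}


def emit(event: dict, publish: Callable[[str], None]) -> str:
    """Serialize the event to a JSON string and hand it to the middleware."""
    payload = json.dumps(event, sort_keys=True)
    publish(payload)
    return payload


if __name__ == "__main__":
    # print stands in for the real submit-to-middleware call.
    emit(run_task("root_disk_free", root_disk_free), publish=print)
```

Note that the event carries everything needed to interpret it (host, task, time, outcome), so the agent does not have to remember anything between runs.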

On Middleware

  1. Messages must be persistent
    • It is safe to restart the server!
  2. What are just messages to the middleware are the guts of the system. Those are the events generated by agents
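
To make "persistent" a bit more concrete, here is a toy queue backed by a SQLite file. It is purely a stand-in for the broker, not how RabbitMQ works internally; the class name `DurableQueue` and the table layout are my own invention. The point it illustrates: a message that was accepted before a restart is still there afterwards. With RabbitMQ itself, I would expect this to come down to declaring queues as durable and publishing messages as persistent.

```python
import os
import sqlite3
import tempfile
from typing import Optional


class DurableQueue:
    """Toy stand-in for a persistent broker queue, backed by a SQLite file."""

    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self._db.commit()

    def put(self, body: str) -> None:
        self._db.execute("INSERT INTO messages (body) VALUES (?)", (body,))
        self._db.commit()  # the message is committed to the file before put() returns

    def get(self) -> Optional[str]:
        """Remove and return the oldest message, or None if the queue is empty."""
        row = self._db.execute(
            "SELECT id, body FROM messages ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self._db.execute("DELETE FROM messages WHERE id = ?", (row[0],))
        self._db.commit()
        return row[1]

    def close(self) -> None:
        self._db.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "queue.db")
    queue = DurableQueue(path)
    queue.put('{"task": "root_disk_free", "status": "success"}')
    queue.close()               # simulate the middleware going down

    queue = DurableQueue(path)  # ... and coming back up
    print(queue.get())          # the event survived the restart
    queue.close()
```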

On Servers

There are 2 kinds of servers:

  • Persistence Servers: These run somewhere in a rack. They grab one event after another from a queue and store it in a safe place for later reference.

    Once a Persistence Server has grabbed an event from the queue, it is no longer visible to other servers. Each event will reside on one, and exactly one, Persistence Server.

  • Notification Servers: These run on physical servers in a rack and just grab one notification at a time from the messaging middleware.
    • All Notification Servers can retrieve all events.
    • Notification Servers can subscribe to a certain subset of topics.
  • There may be a lot of servers. We don’t want our monitoring to fail.
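
The two kinds of servers differ in how events are delivered to them, and a small in-memory sketch makes the difference visible. Everything here is assumed for illustration: `ToyBroker`, the `topic` field on events, and the method names. Persistence Servers share one work queue (each event is taken exactly once), while every Notification Server gets its own copy of each event it subscribed to.

```python
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque, Dict, List, Optional

Event = Dict[str, str]
Handler = Callable[[Event], None]


class ToyBroker:
    """In-memory stand-in for the middleware; it shows delivery semantics only."""

    def __init__(self) -> None:
        self._work: Deque[Event] = deque()  # shared by all Persistence Servers
        self._all: List[Handler] = []       # Notification Servers that want everything
        self._by_topic: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, handler: Handler, topic: Optional[str] = None) -> None:
        """Register a Notification Server, optionally for a single topic."""
        if topic is None:
            self._all.append(handler)
        else:
            self._by_topic[topic].append(handler)

    def publish(self, event: Event) -> None:
        self._work.append(event)  # one copy, to be persisted exactly once
        for handler in self._all + self._by_topic.get(event["topic"], []):
            handler(event)        # one copy per interested Notification Server

    def take(self) -> Optional[Event]:
        """Called by a Persistence Server; a taken event is gone for all others."""
        return self._work.popleft() if self._work else None


if __name__ == "__main__":
    broker = ToyBroker()
    everything, disk_only = [], []
    broker.subscribe(everything.append)               # sees all events
    broker.subscribe(disk_only.append, topic="disk")  # sees a subset of topics
    broker.publish({"topic": "disk", "status": "failed"})
    broker.publish({"topic": "load", "status": "success"})
    print(len(everything), len(disk_only))
    print(broker.take(), broker.take(), broker.take())
```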

On Persistence Servers

  1. Subscribe to the global queue
  2. Start grabbing events
  3. Store the event on disk. What exactly storing means is yet to be determined!
  4. Start over again
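
A minimal sketch of that loop, with two loud assumptions: "storing" is taken to mean appending one JSON line per event to a file (the note leaves this open), and Python's in-process `queue.Queue` stands in for the global queue. The function name `persist_events` is made up.

```python
import json
import queue
import tempfile
from pathlib import Path


def persist_events(events: "queue.Queue[str]", log_path: Path) -> int:
    """Drain the queue, appending each event as one JSON line; return the count."""
    stored = 0
    with log_path.open("a", encoding="utf-8") as log:
        while True:
            try:
                raw = events.get_nowait()        # grab the next event
            except queue.Empty:
                break                            # a real server would block and wait
            event = json.loads(raw)              # raises if the payload is not valid JSON
            log.write(json.dumps(event) + "\n")  # "store": one JSON line per event (assumed)
            log.flush()
            stored += 1                          # then start over
    return stored


if __name__ == "__main__":
    pending: "queue.Queue[str]" = queue.Queue()
    pending.put('{"task": "root_disk_free", "status": "success"}')
    target = Path(tempfile.mkdtemp()) / "events.jsonl"
    print(persist_events(pending, target), "event(s) written to", target)
```

Because `get_nowait()` removes the item from the queue, two workers draining the same queue never store the same event, which matches the "one, and exactly one" idea above.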

On Notification Servers

  1. Subscribe to the notification queue or a topic queue
  2. Start grabbing events
  3. Display the event. What exactly displaying means is yet to be determined!
  4. Start over again

Generated: 2017-11-02 10:20:47 +0100