
65 posts tagged with "architecture"


Improved back end processing

· One min read

It was hard work, but I analysed all the logs (warnings, errors) from the backend processing.

I then solved a few bugs, corrected a concurrency issue on the pollers, and reached a point where:

  • There is less processing :)
  • There are no more warnings / errors in the logs under standard load
  • And: the non-regression tests still pass ;)
  • I also fixed some UI bugs along the way that were preventing the timeline from refreshing (there are surely some left ;) )

I'll check tomorrow, on a day with standard load, how it behaves.

Architecture improvements

· 8 min read

Many things have evolved since the last time I wrote here.

Spider went from monitoring 5 000 packets per minute to 45 000 packets per minute, with more than a month of system stability. It was quite a challenge, since one goal was to keep the required server size as small as possible.

The architecture changes I put in place are as follows:

Queues on Whisperers

At first, Whisperers were sending packets on the fly, asynchronously, without caring much about the results (except for the circuit breaker).

Now they have a limited number of simultaneous connections to the back office, with a queue to hold packets to send later, similar to a DB connection pool. When packets overflow the queue, this is tracked in the monitoring.

This allows smoothing spikes of network traffic on the client side to reduce the impact on the server.
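
A minimal sketch of the idea (names are illustrative, not the actual Whisperer code): a sender that caps the number of in-flight requests and queues the rest, reporting overflow to the monitoring.

```typescript
// Bounded sender: at most maxInFlight concurrent sends; extra packets wait in a queue.
// sendToBackOffice and reportOverflow are hypothetical helpers.
class PacketSender {
  private queue: Buffer[] = [];
  private inFlight = 0;

  constructor(
    private maxInFlight: number,
    private maxQueued: number,
    private sendToBackOffice: (packet: Buffer) => Promise<void>,
    private reportOverflow: (dropped: number) => void,
  ) {}

  enqueue(packet: Buffer): void {
    if (this.queue.length >= this.maxQueued) {
      this.reportOverflow(1); // overflow is tracked in the monitoring, not silently lost
      return;
    }
    this.queue.push(packet);
    this.drain();
  }

  private drain(): void {
    while (this.inFlight < this.maxInFlight && this.queue.length > 0) {
      const packet = this.queue.shift()!;
      this.inFlight++;
      this.sendToBackOffice(packet)
        .catch(() => this.reportOverflow(1)) // simplistic: count failed sends as losses
        .finally(() => {
          this.inFlight--;
          this.drain();
        });
    }
  }
}
```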

Avoid retry pattern

The retry pattern is bad in real-time systems.

I had several retries implemented throughout the system to handle occasional 503s or timeouts, and to handle the synchronisation delay between packets and sessions.

The issue is that when something goes bad, retrying helps get the current process done, but newcomers stack up, and the issues accumulate. Especially if retrying doesn't solve the previous problem and you have been delaying processing for many seconds.

I removed the retries done during processing, keeping only those for configuration and synchronisation (polling).

Processing errors are now marked and reprocessed once, by another process, ten minutes later. So I may get errors on the fly, but most of them are solved later. If not, then they are definite errors.
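
A minimal sketch of this mark-and-reprocess approach, assuming an ioredis client (the key name and the reprocessItem helper are illustrative): failed items are recorded with a timestamp, and a separate process retries them once, ten minutes later.

```typescript
import Redis from "ioredis";

const redis = new Redis();
const RETRY_DELAY_MS = 10 * 60 * 1000; // reprocess ten minutes later

// On processing failure: mark the item instead of retrying inline.
async function markFailed(itemId: string): Promise<void> {
  await redis.zadd("processing:errors", Date.now(), itemId);
}

// Separate process: pick up items that failed more than ten minutes ago,
// reprocess them once, then drop them from the error set either way.
async function reprocessOldErrors(reprocessItem: (id: string) => Promise<void>): Promise<void> {
  const cutoff = Date.now() - RETRY_DELAY_MS;
  const ids = await redis.zrangebyscore("processing:errors", 0, cutoff);
  for (const id of ids) {
    try {
      await reprocessItem(id);
    } finally {
      await redis.zrem("processing:errors", id); // one retry only; still failing means a real error
    }
  }
}
```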

Don't forget Node.js event loop

Node is single-threaded. For it to answer connection requests, you have to give the hand back to the event loop as soon as you can.

I added setImmediate calls in the processing loops; this got rid of most of my 503s.
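
A minimal sketch of the pattern (not the actual processing code): yield back to the event loop between chunks of work so that incoming requests keep being served.

```typescript
// Process a large array in chunks, yielding to the event loop between chunks
// so that pending I/O and incoming HTTP requests are not starved.
async function processAll<T>(items: T[], handle: (item: T) => void, chunkSize = 100): Promise<void> {
  for (let i = 0; i < items.length; i += chunkSize) {
    for (const item of items.slice(i, i + chunkSize)) {
      handle(item);
    }
    // Give the hand back to the event loop before the next chunk.
    await new Promise<void>((resolve) => setImmediate(resolve));
  }
}
```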

Limit Redis volatile keys

Redis is a great in-memory database. It gives building blocks for many, many architecture patterns.

However, things that look easy out of the box need to be handled with care. There is no magic.

Redis can automatically remove volatile (expiring) keys. However, to find keys to remove, Redis picks 20 keys at random every 100 ms, and if more than 25% of them are expired, it loops!

As I am using Redis as a write-ahead cache, I was using volatile keys to clean the DB automatically. But in the end, this kills Redis performance, and I had timeouts when calling Redis! Keys should be removed explicitly whenever possible.

I made many changes in the data flows to remove processed data from Redis as soon as the data was synchronised in ES and no longer needed for real-time processing. This adds complexity to the data flow, but about 75% of it is handled by setting a lifecycle on resources and checking it in the pollers. The only exception is packets, whose removal from Redis is requested by the parsers.
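
A minimal sketch of the idea, assuming an ioredis client and a hypothetical indexBatchInES helper: once a batch is acknowledged by ES, the corresponding keys are deleted explicitly instead of waiting for TTL-based expiry.

```typescript
import Redis from "ioredis";

const redis = new Redis();

// indexBatchInES is a hypothetical function that bulk-indexes the documents in Elasticsearch
// and only resolves once ES has acknowledged the batch.
async function syncAndCleanup(
  keys: string[],
  indexBatchInES: (docs: string[]) => Promise<void>,
): Promise<void> {
  if (keys.length === 0) return;
  const docs = (await redis.mget(...keys)).filter((d): d is string => d !== null);
  await indexBatchInES(docs);

  // Explicit cleanup: remove the keys in one pipeline instead of relying on volatile-key expiry.
  const pipeline = redis.pipeline();
  for (const key of keys) pipeline.del(key);
  await pipeline.exec();
}
```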

All this made Spider go from 50 000 items stored in Redis in continuous flow to... 400 ;-) And instead of Redis timeouts, I now get 20 ms at the 99th percentile!!

RTFM they say =)

I also moved Redis outside the Swarm, since I found stability issues with Redis in Alpine containers.

Use Redis Lua scripting

Manual removal of items from Redis implied pipelining lots of instructions and complex preparation on the client side.

The Redis documentation now advises against it, in favour of server-side Lua scripting. Lua is integrated in Redis and allows scripting Redis calls, manipulating and checking Redis responses while still on the server. It also allows parsing and modifying JSON objects, and can return values, arrays or hashes to the client.

The scripts do not need to be sent to Redis for each request: they are cached on the server, and the client can trigger a script execution using the script's SHA digest. Simply enough, my microservices register the scripts in Redis at startup, or on Redis reconnection.
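
A minimal sketch with ioredis (the script and names are illustrative, not the actual Spider scripts): the script is registered once and then invoked through its SHA, so its body is not resent on every call.

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Register a Lua script once; ioredis loads it on the server and then calls it
// through EVALSHA, so the script body is not resent on every request.
// Illustrative script: delete a key only if its JSON body says it is synchronised.
redis.defineCommand("removeIfSynced", {
  numberOfKeys: 1,
  lua: `
    local raw = redis.call('GET', KEYS[1])
    if not raw then return 0 end
    local doc = cjson.decode(raw)
    if doc.status == ARGV[1] then
      redis.call('DEL', KEYS[1])
      return 1
    end
    return 0
  `,
});

// Usage: remove the packet only if it was marked as synchronised.
async function removePacketIfSynced(packetKey: string): Promise<boolean> {
  const removed = await (redis as any).removeIfSynced(packetKey, "synced");
  return removed === 1;
}
```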

My main difficulty was understanding enough of Lua to write the scripts I needed, knowing that there are few examples of 'complex' scripts available in web resources. And there is no debugger embedded in the current Redis version (but it is coming!).

The performance improvement is huge, and it simplifies the client processing a lot. But keep in mind that, as the processing is done on the Redis server side, it is less scalable than when performed in the application layer.

Use automatic Id generation in ES

When you index in Elasticsearch, you may provide the id of the document, or let ES generate it.

Choosing a good id strategy has a significant performance impact on the indexing process.

Providing the id from the application layer is easier for direct access and updates. However, it makes ES look for an existing document with that id on every indexing request, to decide whether to update it or create it.

This slows down indexing.

If you don't provide the id, ES generates a globally unique id for each document on the fly, and doesn't check for previous existence. Much faster.

I moved packet indexing to automatic ids, since I never update packets.
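
A minimal sketch with the Elasticsearch JavaScript client (7.x-style bulk body; the index name is illustrative): omitting the id in the bulk action lets ES generate one and skip the existence lookup.

```typescript
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

// Bulk-index packets without providing _id: ES generates a unique id on the fly
// and does not look up any existing document, which speeds up indexing.
async function indexPackets(packets: object[]): Promise<void> {
  const body = packets.flatMap((doc) => [{ index: { _index: "packets" } }, doc]);
  const response = await es.bulk({ body });
  if (response.body.errors) {
    throw new Error("Some packets failed to index");
  }
}
```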

Adapt ES refresh rate

Another quick indexing performance improvement on ES concerns the refresh rate (I am already doing bulk requests all over the place). The default refresh interval is 1s: it is the time ES waits before closing the current working segment and making it available for search.

As ES is used for display and no longer for real-time processing, I slowed the refresh rate down to 5s for packets, and 2 or 3s for the web and tcp indices.
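
A minimal sketch of the settings change (again 7.x-style client; index names and intervals follow the values above):

```typescript
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

// Slow down the refresh interval: indexing gets cheaper, documents just take a
// little longer to become searchable, which is fine for display-only usage.
async function tuneRefreshIntervals(): Promise<void> {
  await es.indices.putSettings({
    index: "packets",
    body: { index: { refresh_interval: "5s" } },
  });
  await es.indices.putSettings({
    index: "web,tcp",
    body: { index: { refresh_interval: "3s" } },
  });
}
```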

Scale your Pollers

ES is multithreaded. Rather than sending big chunks serially, it is best to send them in parallel. ES has as many indexing threads as there are cores on the server, and when sharding is in place, it can index on all servers.

The best setting is found by tuning the size of the indexing batches, the number of application indexing threads, and so on...

When the load increased in Spider, I reached a point where new packets were piling up faster than they were consumed... one command in Docker Swarm to scale the pollers, and it was settled. I love this :-)
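
For illustration, scaling a service in Docker Swarm is a single command (the service name here is hypothetical):

```bash
docker service scale spider_packet-poller=4
```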

Change NGINX settings: persistent connections and timeouts

NGINX default settings are not meant for a system under load. They need changes to handle hundreds or thousands of requests per second to upstream servers. NGINX's own documentation is the basis for the first tuning actions.

Main changes applied:

  • Use HTTP 1.1 between NGINX and upstreams to keep connections open (keepalive).
  • Increase the timeouts for connections to upstreams (needed to be compatible with Node's single-threaded architecture): the proxy_*_timeout directives. A sketch is shown below.
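
A minimal sketch of the kind of configuration involved (upstream name and timeout values are illustrative, not the actual Spider settings):

```nginx
upstream spider_backend {
    server api:3000;
    keepalive 32;                  # keep idle connections to the upstream open
}

server {
    location / {
        proxy_pass http://spider_backend;
        proxy_http_version 1.1;    # required for upstream keepalive
        proxy_set_header Connection "";

        proxy_connect_timeout 10s; # give the single-threaded Node upstream some slack
        proxy_send_timeout    60s;
        proxy_read_timeout    60s;
    }
}
```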

Handle concurrent requests

When the load increases, concurrency issues show up! ;-)

  • I implemented concurrency checks in Redis thanks to Lua.
  • ES offers a great version-control mechanism for implementing custom concurrency checks (see the sketch below). That helped me get out of trouble!
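
A minimal sketch of the ES part (7.x-style JavaScript client; index name and logic are illustrative): with external versioning, ES rejects a write whose version is not newer than the stored one, so concurrent updates surface as 409 conflicts.

```typescript
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

// Optimistic concurrency with external versioning: the write is accepted only if the
// provided version is greater than the stored one, otherwise ES answers with a
// 409 version conflict and we know another writer got there first.
async function saveSession(id: string, doc: object, version: number): Promise<boolean> {
  try {
    await es.index({
      index: "tcp-sessions",
      id,
      body: doc,
      version,
      version_type: "external",
    });
    return true;
  } catch (err: any) {
    if (err?.meta?.statusCode === 409) return false; // concurrent update detected
    throw err;
  }
}
```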

Think about the load spread in the cluster and avoid scaling complexity

Microservices are great for development and maintenance. However, I often impact many services at once when developing new features. You need good team communication to work it out. No issue in my case :)

When scaling, being able to scale services independently is definitely great. But sometimes, finding the right tuning when you need to scale several dependent microservices is difficult. Polling instead of pushing is a good solution, but sometimes you need to scale B for more processing load, and B depends on A to get its data. So you need to scale A as well.

If A is doing processing (and not only offering data) this can prove a challenge.

I had the issue in Spider:

  1. The Web parsing microservice was calling the Packet read service
  2. The reassembly of TCP packets was done in Packet read

So both microservices were doing heavy processing, and scaling them was a difficult tuning job.

I moved the packet reassembly logic to a library and included it in the Web parsing microservice. This relocated the scaling need: the heavy load is now only in the parsing module, and Packet read is focused only on fetching and formatting packets.

This simplified the configuration of the cluster. For a single heavily loaded business process, it is easier if the high-CPU processing is concentrated in one place.

A distributed system needs integrated monitoring.

· One min read

Elasticsearch, Redis, Polling Queues, Circuit breakers, REST microservices...

Got an error, a time out in logs? Who generated it, what was the load, how was the system behaving at that time? How did it recover?

If you don't have integrated monitoring, you're driving blind!

I was starting to feel this way. I dedicated this week's work to developing a new app in the system that monitors all the others. The other applications already expose circuit breaker stats and queue stats (and soon REST stats), and I access ElasticSearch and Redis stats through their APIs.

Everything is collected continuously, sent to another ES index, and Kibana helped me build quick and easy reports on Spider's health and speed over time. Really easy when all your pieces are already in place and easily extendable!

Architecture upgrade

· One min read

Yesterday, Spider underwent a major architecture upgrade:

  • I moved to one index per resource type
    • To get ready for ES 6
    • To improve shard balance in the cluster
    • To reduce IO access, because smaller-volume resources are separated from the big ones
    • To improve ES aggregation speed
  • I introduced a poller between Redis and ElasticSearch for the four main resources
    • The load constraint is now focused on only two microservices, and only on Redis
    • The load on ES is smoothed
  • I added an in-memory cache for Whisperer configuration access from the microservices, to drastically reduce unnecessary calls (see the sketch below)
  • I now have 20 microservices in my cluster, many of them running multiple instances 😎
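
A minimal sketch of the kind of cache involved (names, TTL and the fetchConfig helper are illustrative, not the actual Spider code):

```typescript
// Tiny TTL cache to avoid re-fetching the Whisperer configuration on every call.
type Entry<T> = { value: T; expiresAt: number };

class TtlCache<T> {
  private entries = new Map<string, Entry<T>>();
  constructor(private ttlMs: number) {}

  async get(key: string, load: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    const value = await load();
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}

// Usage: cache each Whisperer's configuration for 30 seconds.
// fetchConfig is a hypothetical call to the configuration service.
const configCache = new TtlCache<object>(30_000);
// const config = await configCache.get(whispererId, () => fetchConfig(whispererId));
```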

The Streetsmart instance has been upgraded, with a complete data purge (sorry).

Let's see how it behaves. It should be much more stable!

Move to cluster.

· One min read

To be able to capture more environments, Spider has moved to Cluster mode:

  • Docker Swarm using Docker stacks
  • ElasticSearch in cluster

The next step will be the Redis architecture: ES is not fast enough at indexing.

So I'll direct all the input load to Redis only, and decouple the performance needs by adding synchronisation pollers, like we do on Streetsmart. I thought I could live without them for longer by using bulks... but if I don't want to increase the hardware, I need to improve the architecture!