
65 posts tagged with "architecture"


Redis reconnection on failure

· One min read

A recent failure in AWS storage revealed that Spider wasn't resilient to Redis failures.

  • I then upgraded all services to a better Redis reconnection pattern, with automatic re-registration of Lua scripts on reconnection. It works much better :) !
  • I also added fail-fast checks on the services in front of the Whisperers that only store in Redis: if Redis is not available, they answer straight away with a 502 error :) A sketch of both ideas follows.
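
Here is a minimal sketch of both patterns, assuming the ioredis client and an Express service; the Lua script, route and key names are illustrative, not Spider's actual code:

```js
const Redis = require('ioredis');
const express = require('express');

// Illustrative Lua script; the real scripts are not shown in the post.
const touchScript = 'return redis.call("SET", KEYS[1], ARGV[1])';

const redis = new Redis({
  host: process.env.REDIS_HOST || 'localhost',
  // Keep retrying forever with a capped back-off instead of giving up.
  retryStrategy: (times) => Math.min(times * 200, 2000),
});

// Define the script-based command once...
redis.defineCommand('touchKey', { numberOfKeys: 1, lua: touchScript });

redis.on('ready', async () => {
  // ...and re-load the script after every (re)connection, so script-based
  // calls keep working against a restarted or flushed Redis.
  await redis.script('load', touchScript);
});

const app = express();

// Fail fast: a service that only stores in Redis answers 502 straight away
// when the connection is not ready.
app.use((req, res, next) => {
  if (redis.status !== 'ready') return res.status(502).send('Redis unavailable');
  next();
});

app.post('/status', async (req, res) => {
  await redis.touchKey('whisperer:status', Date.now());
  res.sendStatus(204);
});

app.listen(3000);
```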

Metricbeat integration in Cluster

· One min read

I added Elastic Metricbeat to the Docker Swarm cluster to gather metrics and performance information for all containers in the cluster.

Nothing but containers runs on the Spider cluster now.

This allowed adding monitoring graphs of container CPU and RAM usage.

It also made it possible to assess the load of Traefik and the Beats.

Spider is cookie free :)

· One min read

Thanks to my latest JavaScript and HTML5 knowledge, I managed to remove the need for cookies in the Spider UI!!

So Spider is low fat now: no more cookies needed, the JWT token is handled explicitly in all communications with the server.
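
A minimal sketch of the cookie-less pattern from the browser side; the storage key, endpoint path and response shape are assumptions, not Spider's actual API:

```js
let jwtToken = null;

async function login(user, password) {
  const res = await fetch('/api/login', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ user, password }),
  });
  ({ token: jwtToken } = await res.json());
  sessionStorage.setItem('spider-jwt', jwtToken); // survive page reloads
}

function apiFetch(url, options = {}) {
  // The token travels in a header on every call: no cookie involved.
  const token = jwtToken || sessionStorage.getItem('spider-jwt');
  return fetch(url, {
    ...options,
    headers: { ...options.headers, Authorization: `Bearer ${token}` },
  });
}
```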

API updates

· One min read

Hi,

Over the last couple of weeks, I took on the - not so interesting - mission of updating the API specification. It is available here: https://spider-analyzer.io/home/api/

Changes:

  • Specification updated to OpenAPI 3.0
  • New objects diagram
  • All APIs are described with:
    • Structure of inputs
    • Structure of responses / resources
    • Examples
    • Quick start guide
  • I improved the usability of the API with:
    • Date parameters on top of timestamp parameters: startDate can be used instead of startTime (see the sketch after this list)
    • Hostnames added to the TcpSessions and HttpCommunications resources for easier searching by FQDN
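
A quick sketch of the two parameter styles; the base URL, endpoint path and anything beyond startDate/startTime are assumptions for illustration only:

```js
const base = 'https://spider-analyzer.io/api';

// Before: epoch-millisecond timestamp parameter.
fetch(`${base}/httpCommunications?startTime=${Date.now() - 3600 * 1000}`);

// Now: an ISO date can be passed instead, easier to read and to script.
fetch(`${base}/httpCommunications?startDate=${encodeURIComponent('2019-01-15T10:00:00Z')}`);
```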

During the API review, I found some security issues... All are fixed now ;-) In doing so, I added:

  • The possibility to purge shared Whisperers, for those who have the right to. (Before, it was free for all ;))
  • Protection against failed logins: after 5 wrong attempts, the account is blocked for some time (see the sketch after this list)
  • More security around access to customer details
  • The name of the customer owning the Whisperer in the Whisperer details view
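
A minimal sketch of the blocking logic, assuming a Redis counter via ioredis; the key names, threshold wiring and block duration are illustrative only:

```js
const Redis = require('ioredis');
const redis = new Redis();

const MAX_ATTEMPTS = 5;
const BLOCK_SECONDS = 15 * 60; // illustrative "some time"

async function isLoginAllowed(login) {
  const attempts = Number(await redis.get(`login:fail:${login}`)) || 0;
  return attempts < MAX_ATTEMPTS;
}

async function recordLoginFailure(login) {
  const key = `login:fail:${login}`;
  const attempts = await redis.incr(key);
  // (Re)arm the expiry so the block lifts itself after a while.
  await redis.expire(key, BLOCK_SECONDS);
  return attempts;
}

async function recordLoginSuccess(login) {
  await redis.del(`login:fail:${login}`);
}
```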

Wilfried already tested it, and... was happy to integrate it into his test result checks ;-)

Cheers

Technical migrations... and how fast Spider is now!!

· 2 min read

December was a month of migrations for Spider. And I'm so glad I did them! Read on.

Migration path

  • From NGINX to Traefik on 6/12
  • From Node 7 to Node 8
  • From JavaScript generators to the async/await programming pattern (see the sketch after this list)
  • From eTag generation over the full resource to eTags based on id + lastUpdate date (CPU saving)
  • From Node 8 to Node 10 (actually done in January)
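
The generator-to-async/await move is the one with the biggest impact. A before/after sketch, assuming the co library that drove generators on Node 7; the functions are illustrative stubs, not Spider's actual code:

```js
const co = require('co');

// Illustrative async stubs standing in for the real persistence calls.
const findTcpSession = async (packet) => ({ id: packet.sessionId });
const storePackets = async (session, packets) => packets.length;

// Before (Node 7 era): generator-based flow, driven by the co library.
function savePacketsWithGenerators(packets) {
  return co(function* () {
    const session = yield findTcpSession(packets[0]);
    yield storePackets(session, packets);
    return session;
  });
}

// After (Node 8+): plain async/await, no wrapper library needed, and much
// faster on the V8 engine shipped with Node 10.
async function savePackets(packets) {
  const session = await findTcpSession(packets[0]);
  await storePackets(session, packets);
  return session;
}
```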

The result of all this?

  1. Microservice CPU usage divided by 2 to 5!
  2. Response times as seen from the Whisperers divided by 4 to 6!!
  3. Internal response times divided by 5 to 12!!!

This was amazing!

I could hardly believe it, but the numbers prove it. The Node.js team had already said that async/await was faster than generators, and Google then improved async/await processing speed by 8x in the V8 version embedded in Node 10.

Examples

  • Packet processing improved from a 484 ms 90th percentile for a 100-packet bulk request to... 69 ms!!
  • Patching TcpSessions improved from a 266 ms 90th percentile to 13 ms!!! Excuse me!
  • CPU usage of the parsing job improved from 172% to 80%, and for packet decoding, from 295% to... 55%!! Amazing :-)
  • Previously Spider needed 4 servers to process 2,500 packets/s; now... I could almost do it on two, and much, much faster! :)

Conclusion

Yes, Node.js is blazingly fast. This was true in callback mode, and it is true again with async/await! :-)

Figures

Source: Google spreadsheet

And Streetsmart?

And you know what? Streetsmart is still roughly in the state Spider was in before all these migrations. Imagine if the migrations have the same effect on Streetsmart. It would be awesome! :-)

Well, that's part of my plan indeed!!

Change of API gateway / reverse proxy / ingress controller

· 2 min read

NGINX

Until this week, Spider's internal cluster gateway was NGINX. However, NGINX presented various issues in the current setup:

  • To absorb the scaling up and down of replicas, I was asking NGINX to resolve the services' VIP on every call, with a 30 s DNS resolver cache.
  • The main issue is that NGINX cannot keep persistent connections to the upstreams in this setup.
  • This made NGINX create a new TCP socket for every request. Soon enough, once all TCP sockets were in use, response times increased by 1 or 2 s while Linux recycled sockets.

Change was needed!

Traefik

Traefik is increasingly used as a gateway for Docker clusters (and others). It is designed to integrate with Docker and to stay in sync with the cluster state.

So I switched Spider to Traefik this week. And the results are... astonishing!!

Although response times as seen from the clients have not changed much, response times internal to the cluster have improved by 80%!!

Note: the only thing I struggled with in the Traefik configuration was path rewriting. It has fewer options than NGINX in this area, and I had to add some custom rerouting inside the UIs themselves.

Docker 18.06 and Swarm load balancer change

· 2 min read

Spider was working really fine with Docker Swarm until... Docker version 18.06.

Impact of Docker 18.06's load-balancing scalability improvement

Docker 18.06 includes a change to improve the scalability of Swarm load balancing: https://github.com/docker/engine/pull/16. The impact on Spider is as follows:

  • Previously, when sniffing communications between services, the IPs seen were those of the replicas that sent/received the request (DSR mode).
  • Now, the IPs seen in the packets are the VIPs (NAT mode).
  • The main issue is that the VIPs have no PTR record in Swarm DNS, so the Whisperers cannot reverse-resolve their names... which makes Spider much less usable.

Workaround: host preresolving in the Whisperers

To overcome the problem, I added the possibility to give the Whisperers a list of hostnames that are resolved regularly against the DNS and preloaded into the DNS resolving mechanism.

This has many advantages:

  • You can define a list of hosts without PTR records.
  • Docker name resolving works better than reverse resolving (it is more stable: you don't face the bug).
  • The list can be given in the Whisperer config (UI) or through the Whisperer's HOSTS_TO_RESOLVE environment variable.
    • Thus, you can script the Whisperer launch and fetch the list of services in the Swarm before starting it.

This has one main drawback: you lose the power of service discovery, as the list is static. The alternative would be to get the information by connecting the Whisperer to the Docker socket... but that is a security risk and would tie it too much to Docker.
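
A minimal sketch of the preresolving idea, assuming Node's dns module inside the Whisperer; variable names and the refresh period are illustrative:

```js
const { Resolver } = require('dns').promises;

const resolver = new Resolver();
const hosts = (process.env.HOSTS_TO_RESOLVE || '').split(',').filter(Boolean);

// ip -> hostname map, used instead of the (missing) PTR lookups on Swarm VIPs.
const ipToHost = new Map();

async function refreshHosts() {
  for (const host of hosts) {
    try {
      const ips = await resolver.resolve4(host);
      ips.forEach((ip) => ipToHost.set(ip, host));
    } catch (err) {
      // Host may not exist yet; keep the previous entries.
    }
  }
}

// Preload once at start, then refresh regularly.
refreshHosts();
setInterval(refreshHosts, 30 * 1000);

function resolveName(ip) {
  // Used when labelling sniffed packets with service names.
  return ipToHost.get(ip) || ip;
}
```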

Localhost IPs preloading

While at it, I added another environment variable: CONTAINER_NAME. When present:

  • The container's own local IPs are preloaded into the DNS resolving mechanism, with the CONTAINER_NAME value as the hostname.

Docker 18.09

Docker 18.09 includes a parameter to deactivate this new NAT feature when creating the overlay network and get back the previous behavior: --opt dsr

With this parameter active, Swarm behavior is back to what it was before 18.06, and Spider works like a charm. But at the cost of scalability.

If scalability matters while using Spider, the best option is to move the services' endpoint mode from VIP to DNSRR and use a load balancer like Traefik ;) See my other post from today.

Architecture upgrade: splitting a monolith

· 2 min read

One service in the Spider back end had been growing too much. It included:

  • Whisperer configurations
  • Users rights on this Whisperer
  • Whisperer current status
  • Whisperer status history
  • Whisperer hosts resolving

The last two were in different indices, but the first three 'data aggregates' were inside the same resource/document.

This resulted in a service that was complex to update, conflicts in optimistic concurrency management, and slow response times due to the size of the resources.

It needed splitting.

I first tried to split it logically, from the resource perspective, extracting the configuration since it is the most stable data... But this was a bad idea: splitting configuration and rights made accessing and using the resources much more complex for the UI and the other services that needed the information!

So I figured out I had to split the monolith from the client perspective.

As a result, I extracted from the first module:

  • An operational service to process status input and store both the status history and the current status
  • An operational service to process hosts input and store it
  • A configuration service to manage configuration and rights

This was much better. But there was still slowness, because all those modules were reading from and writing to ES directly. So I switched to saving in Redis and configured pollers to serialize the data to ES. Everything needed to do this easily was already available from the saving processes of Packets, Sessions and HTTP communications. I also added a pure cache for the Whisperer configs resources (see the sketch after this list):

  • On save, save in Redis and ES
  • On read, read from Redis, and if not, read from ES and save in Redis
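
A minimal sketch of that cache, assuming ioredis and the legacy elasticsearch client; the index, key names and TTL are illustrative:

```js
const Redis = require('ioredis');
const elasticsearch = require('elasticsearch');

const redis = new Redis();
const es = new elasticsearch.Client({ host: 'http://elasticsearch:9200' });

const INDEX = 'whisperers'; // illustrative index name
const TTL = 60;             // illustrative cache TTL, in seconds

async function saveWhispererConfig(id, config) {
  // On save: write to both Redis (cache) and ES (source of truth).
  await Promise.all([
    redis.set(`whisperer:${id}`, JSON.stringify(config), 'EX', TTL),
    es.index({ index: INDEX, type: '_doc', id, body: config }),
  ]);
}

async function getWhispererConfig(id) {
  // On read: try the cache first...
  const cached = await redis.get(`whisperer:${id}`);
  if (cached) return JSON.parse(cached);

  // ...otherwise fall back to ES and repopulate the cache.
  const { _source } = await es.get({ index: INDEX, type: '_doc', id });
  await redis.set(`whisperer:${id}`, JSON.stringify(_source), 'EX', TTL);
  return _source;
}
```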

All in all, Whisperer client requests to save Status or Hosts went from 200 ms+ to... 50 and 15 ms ;-) Yeah!!

Upgrade to ES 6.4.1

· One min read

Spider has been upgraded to Elasticsearch 6.4.1!!

It now benefits from APM access, SQL queries and so on :-) I'll tell you more later.

Fixed concurrency issue

· One min read

Last week I fixed what I hope is the last concurrency issue in the back-end real-time processing.

Sometimes (fewer than 100 times per day), a first version of an HTTP communication was serialized in ES after the complete version had been. This led, for instance, to HTTP communications without a response... even though the TCP session was parsed correctly and included all packets.

I added version control in Elasticsearch, and those days are over :-)
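
A minimal sketch of the fix, assuming external versioning with the legacy elasticsearch client; the index name and the way the version number is derived are illustrative:

```js
const elasticsearch = require('elasticsearch');
const es = new elasticsearch.Client({ host: 'http://elasticsearch:9200' });

async function saveHttpCommunication(com) {
  try {
    await es.index({
      index: 'http_communications',
      type: '_doc',
      id: com.id,
      // Monotonic version, e.g. derived from the parsing step or lastUpdate:
      // ES rejects the write if a newer version is already stored, so an
      // early, incomplete version can no longer overwrite the complete one.
      version: com.version,
      versionType: 'external',
      body: com,
    });
  } catch (err) {
    if (err.status === 409) return; // stale version: safely ignored
    throw err;
  }
}
```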