Helping the customer to understand what he did a long time ago...

9 mars 2021

In a galaxy far, far away…

EV-9D9: “How many languages do you speak?”

C-3PO: “I am fluent in over six million forms of communication and can readily…”

EV-9D9: “Splendid. We have been without an interpreter since our master got angry with our last protocol droid and disintegrated him.”

C-3PO: “Disintegrated?”

EV-9D9: “Yes, Disintegrated …. Can you read and write a dialect of SQL, MySQL and generate AST to translate to another database language ? We have 70.000 SQL statements that we don’t really understand …”

C-3PO: “Of course I can ! I have a built-in ANTLR4 device that can do that!”

EV-9D9: “Perfect! You’re in then! Follow me…

When I started to work for one of my customers, they asked me to change their monolithic platform to a full distributed one with HA. If it was so monolithic and full of  SPoF ’s, it was not because of a real lack of skills or vision but it was more a history made by events, people and… turn over. The lines of code I found and the design of the platform were telling me a real story implying machines, code and human beings spread over more than 4 years of hotfixes, maintenance and development.

Who is my secret customer ?

Of course, I can’t say any name of company (or I will have to hire Boba Fett …) but I need to describe the flow of data which is the key to designing a new architecture.

First things first, Auto !

The first thing to know is : we’re gonna listen to clusters of machines. These are industrial machines producing a huge amount of data every second. All these expensive bunches of steel and iron are grouped by site and will send all that data through an ftp connection (yes, you read it correctly…) using a 4G or landline connection. Sometimes those sites are really in the middle of nowhere (it sounds like Tatooine, I know…).

Second things… Second !

These data going through the FTP pipe are then processed and inserted in a MySQL database. All this data will provide dashboards and visual tools to monitor customer sites.

What’s up doc ?

My customer needed to build a brand new platform that will be able to scale-out, to be easy to maintain and, of course, that will be iso-functional to the existing one.


That’s when the tricky part comes in : all the computation logic for aggregations are stored in the SQL database as SQL strings (e.g. : “AVG(FIELD1)”) in a table. When the computation needs to be done, all requests are concatenated in one to be executed later.


The caveat is : it’s really hard to maintain a system where you have strings that contain something written in a language (In that case SQL) that you manipulate like simple pieces you put together. Without knowing if the concatenated query is valid or makes sense, running it is at your own risk. If you have a few hundreds of it, then you can have human eyes to check it … but when it’s 70.000 request blocks then… you don’t want to hear about “testing it all by hand » !

Goal of my mission (you said impossible ?)


The main goal of my mission was to build the new platform and migrate the current SQL logic to something else. The chosen database to store all messages from the different sites was Elasticsearch. The problem was : the SQL computation was not translatable to an Elasticsearch query or aggregation.


So we decided to explore all these SQL requests to try to simplify and maybe afterwards to translate to Elasticsearch “jsonish” queries (with painless script of course) or to run a dockerized SQL engine for some of them in a streaming way. I mean whenever we receive a message from FTP it will trigger computation using a SQL engine on a limited time window. These aggregations will then consolidate Elasticsearch database. Helping the customer to understand these queries would maybe bring questions or a different point of view of the running platform.

How to evaluate complexity of SQL queries stored as strings ?

The answer is : use a lexer and a parser that understand SQL grammar. This, maybe, looks tricky but if you use ANTLR4 … it’s not !

My idea was to use Zeppelin and therefore Spark to load and work on these bunch of queries. The advantage of doing this is that you can try and fail, changing your code quickly in the Zeppelin notebook and find exactly what you were looking for.

So the first thing to do was to package a lexer/parser in order to add it as a spark dependency and use it directly from the notebook.

Find the grammar for your parser

To find the right grammar for my parser, I checked  this  repository. It’s a compilation of a lot of grammars organized by folder. I found the  droids  ones I was looking for (even if a weird old bearded man told me : “these aren’t the ones you’re looking for”). There were 2 files : one for the lexer ( MySqlLexer.g4  : tokens allowed) and one for the parser ( MySqlParser.g4 : rules). These are the 2 files I downloaded to prepare my lexer/parser project.

Compile and package the SQL lexer/parser

To compile and package the lexer/parser, I wrote a small project using maven and its really useful plugin  antlr4-maven-plugin . First I wrote a  pom.xml  to describe the project ( groupId artifactId , …) and some version stuff. Here is the  pom.xml  (I removed some verbose xml to make it more readable) :
<project ...>     <modelVersion>4.0.0</modelVersion>     <groupId>com.agaetis.antlr4.mysql</groupId>     <artifactId>parslex</artifactId>     <version>0.0.1-SNAPSHOT</version>     <name>MySQL Antlr 4 Lexer/Parser</name>     <properties>         …         <antlr4.visitor>true</antlr4.visitor>         <antlr4.listener>true</antlr4.listener>     </properties>     <dependencies>         <dependency>             <groupId>org.antlr</groupId>             <artifactId>antlr4-runtime</artifactId>             <version>4.7.2</version>         </dependency>     </dependencies>     <build>         <plugins>             <plugin>                 <groupId>org.antlr</groupId>                 <artifactId>antlr4-maven-plugin</artifactId>                 <version>4.7.2</version>                 <executions>                     <execution>                         <goals>                             <goal>antlr4</goal>                         </goals>                     </execution>                 </executions>             </plugin>             <plugin>                 <groupId>org.apache.maven.plugins</groupId>                 <artifactId>maven-compiler-plugin</artifactId>                 <version>3.6.1</version>                 <configuration>                     <source>1.8</source>                     <target>1.8</target>                 </configuration>             </plugin>         </plugins>     </build> </project>

pom.xml

Then in the root folder of the project (next to the  pom.xml ) I created one folder to put all the grammar files :  src/main/antlr4/com/agaetis/antlr4/mysql . Here is the structure of the project :

File structure

When you have your project setup, the next step is to package the jar. For that purpose, of course, we use the antlr4 maven plugin like this :

mvn clean install


After that you should have in your local maven repository the jar that you will use with Zeppelin/Spark.


~/.m2/repository/com/agaetis/antlr4/mysql/parslex/0.0.1-SNAPSHOT/parslex-0.0.1-SNAPSHOT.jar

Add the parser package as a dependency in the Zeppelin Spark Interpreter

When you have your parser packaged and ready to be used, you need to tell Zeppelin and Spark to make it available from your notebook. In the interpreter section you just have to search for spark interpreter, edit it and add in the end of the page the dependency to the package located in your maven repository.

All requests that will be parsed are in a MySQL database, therefore I added a dependency to the connector to access a MySQL database and because we’re going to use ANTLR4, obviously, I added as well the ANTLR4 runtime dependency.

Load queries from the database

Here is the snippet to create a dataframe from a MySQL table :
val options = Map(    "url"       -> dbUrl,     "dbtable"   -> tableName,     "user"      -> userName,     "password"  -> userPassword,     "driver"    -> "com.mysql.jdbc.Driver" ) val queries = spark.read.format("jdbc").options(options).load queries.cache()
queries.count

jdbc-spark-connect.scala

That means, you’ll be able to use Spark dataframe operators to manipulate (filter, map …) your queries. How to compute the complexity of a SQL statement ?

The most intuitive way to compute the complexity of a SQL query (the one that came in my mind first), was, IMHO, the computation of the complexity of the AST (Abstract Syntax Tree) that represents a given SQL query (that’s why I used a library like ANTLR).

Let’s take an example : here is a query that is just a  SELECT  statement without any  FROM  part or  WHERE  statement.

 SELECT IF(AVG(age) > 10, (AVG(age) * 20), 0)

ANTLR4 will compute an AST for that query that looks like this :

Example of an AST (simplified) for a SQL query

In that case, the purpose of the query is : if the average of the field  age  is greater than 10, then the value returned is the average of the field  age  multiplied by 20, otherwise the result is 0. This query is not a complicated one but it shows what the AST looks like.

We can see that we have five layers (I don’t count the first one  AST  in the picture). My first idea was to compute the number of elements for each layer times the layer index (first level has index equals to 1) and then to multiply them. In practice, the number returned in that case is too high. So I computed the natural logarithm of each number of nodes per layer times the level of the layer (plus 1 because it’s a logarithm) and then multiplied them to apply finally a last natural logarithm.

So in this example the complexity would be :

ln((ln(level*N1)+1)*(ln(level*N2)+1)*(ln(level*N3)+1)*(ln(level*N4)+1)*(ln(level*N5)+1))

which equals to :

ln((ln(1*2)+1)*(ln(2*4)+1)*(ln(3*5)+1)*(ln(4*4)+1)*(ln(5*2)+1))

That gives 5.48. When we use this query in the function implemented in the notebook we don’t have the same result because I simplified the abstract syntax tree to illustrate the complexity factor computation. ANTLR4 generates a more complex tree from this same query and therefore, we get a higher complexity factor when it is executed in the notebook.

Here is the code to compute the complexity of a query :


import com.agaetis.antlr4.mysql.MySqlParserBaseListener import com.agaetis.antlr4.mysql.MySql Lexer import com.agaetis.antlr4.mysql.MySqlParser import org.antlr.v4.runtime.ParserRuleContext import org.antlr.v4.runtime.CharStream import org.antlr.v4.runtime.CharStreams import org.antlr.v4.runtime.CommonTokenStream import org.antlr.v4.runtime.tree.ParseTree import org.antlr.v4.runtime.tree.ParseTreeWalker import scala.collection.mutable.Map
class MyListener extends MySqlParserBaseListener {     val nodeCountPerLevel = LinkedHashMap[Int, Int]();     override def enterEveryRule(ctx:ParserRuleContext) {         nodeCountPerLevel(ctx.depth()) = nodeCountPerLevel.getOrElse(ctx.depth(),0) + 1     } }
class MyErrorListener extends BaseErrorListener {     @throws(classOf[ParseCancellationException])     override def syntaxError( recognizer: Recognizer[_, _],                               offendingSymbol:Object, line:Int ,                               charPositionInLine:Int ,                               msg:String ,                               e:RecognitionException) = {                                   println(s"ERROR $msg")                                   throw new ParseCancellationException(msg);                               } }
def evalComplexity(input:String, exprType:String) : Double = {     if(input == null    input.isEmpty)         0     else {         try {             val inputCharStream = CharStreams.fromString(input)             val sqlLexer:MySqlLexer = new MySqlLexer(inputCharStream)             val tokenStream = new CommonTokenStream(sqlLexer)             val parser = new MySqlParser(tokenStream)             parser.removeErrorListeners();             parser.addErrorListener(new MyErrorListener());             val tree:ParseTree = exprType match {                 case "SELECT" => parser.selectStatement()                 case "EXPR" => parser.expression()                 case _ => parser.root()             }             val walker = new ParseTreeWalker()             val listener = new MyListener()             walker.walk(listener, tree)             val nodeCountPerLevel = listener.nodeCountPerLevel             println(nodeCountPerLevel.size)             val score = nodeCountPerLevel.map {                 case (k,v) => {                     Math.log(k * v) + 1.0                 }             }.reduce( (a,b) => a * b )             Math.log(score)         } catch {             case e:ParseCancellationException => -1.0         }     } }

evcplx.scala

Here is an example of using this function :

Using dataframe (where the queries are) and udf’s (user defined functions), you can add a  complexity  column to your dataframe like this :
def evalComplexityUdf = udf((input:String) => {     evalComplexity(input, "EXPR") })
val allQueriesWithComplexity = allQueries.withColumn("valCplx", evalComplexityUdf($"query"))

df-evcp lx.scala

Then you can display and see the complexity of any query in the dataset :

And so what?

Then we were able to identify complex queries, to analyze why it was so complex and to check if it was still in use in the system. There was also an advantage in using this method : we found few invalid statements in the MySQL database (missing parenthesis, invalid variable names …). These queries have since been corrected or removed if it was not used anymore.

The customer was able to understand a bit more what was in the database, and it was a good start to rethink the way these queries should be kept or mapped to Elasticsearch.

Conclusion

When you have a SQL-based system that contains SQL queries which will be concatenated or manipulated as strings then you can assume that it’s not the best practice you can find. Indeed, it’s a trick that hides a deep misconception problem of your platform. In that case you can either run away and say you can’t do anything for them or you can pluck up the courage to give sense to customer’s data and make them understand that there is always a better way : rethink the data modeling and use a database that fits better the use case.

Space Odyssey – Warner Bros. Entertainment Inc.

Even if the project looks like a space odyssey to the customer (like fighting against an “already in place” platform to change the way it thinks), there’s always a way or a workaround to ignite a seismic shift.

Ressources Agaetis

4 septembre 2025
Le contexte du projet : Un prototype de stylo connecté destiné au secteur de la santé avait rencontré un vif succès auprès du marché. Face à une demande croissante, le client devait passer à une phase d’ industrialisation afin de répondre aux attentes tout en respectant les réglementations strictes en matière de données de santé ( Hébergement de Données de Santé – HDS ). L’objectifs : L’objectif principal était de transformer un prototype en solution industrialisée en : définissant les critères de sélection et les options technologiques, garantissant la conformité aux réglementations de santé, et assurant la montée en charge (scale-up) pour répondre à la demande croissante. Durée de mission : Mission en plusieurs phases : cadrage, tests techniques, mise en conformité et accompagnement au scale-up industriel. Mise en œuvre : Agaetis a déployé une approche complète combinant expertise IoT et réglementaire : Définition des critères de sélection : cadrage des besoins fonctionnels et techniques. Évaluation technologique : étude des solutions potentielles et tests de leur adéquation. Mise en conformité HDS : accompagnement dans la sélection de l’hébergement et structuration du modèle de données. Développement et industrialisation : assistance dans l’implémentation des composants techniques et préparation à la montée en charge. Résultats obtenus : Accélération de la production : industrialisation réussie permettant de répondre rapidement à la demande. Conformité assurée : solution alignée sur les exigences HDS et réglementations de santé. Innovation valorisée : passage du prototype au produit commercialisable sur le marché santé. Flexibilité opérationnelle : architecture et modèle de données prêts à évoluer avec les usages. Facteurs clés de succès : Expertise pointue en IoT santé et données réglementées . Approche sur mesure intégrant la dimension technique et humaine. Collaboration rapprochée avec les équipes du client. Vision orientée impact concret et mise sur le marché rapide. Et vous ? Vous vous interrogez sur : l’industrialisation de vos prototypes IoT santé, la conformité réglementaire (HDS, ISO, etc.) de vos solutions, ou la préparation de vos innovations pour passer du prototype au scale-up industriel ? 👉 Contactez nos experts pour transformer vos prototypes IoT en solutions santé industrialisées et conformes.
Aérospacial Ariane Space
par David Walter 4 septembre 2025
Le contexte du projet : Groupe Aérospatial souhaitait optimiser le temps de contrôle dimensionnel des réservoirs de son lanceur spatial. Les méthodes traditionnelles, longues et peu satisfaisantes, ralentissaient la production et augmentaient les risques d’erreurs. Le besoin était de développer une application de contrôle qualité et dimensionnel intégrant de nouveaux moyens de mesure plus rapides et précis. L’objectifs : L’objectif principal était de concevoir et déployer une application installée sur un PC concentrateur capable de : lancer différents programmes de contrôle dimensionnel, intégrer des technologies de mesure avancées (profilomètres lasers, trackers laser), et améliorer la précision et la répétabilité des contrôles. Durée de mission : Mission de plusieurs mois, de la conception logicielle à la formation des équipes, en passant par l’intégration et les tests. Mise en œuvre : Agaetis a déployé une approche technique et collaborative : Développement de l’application : architecture logicielle adaptée aux besoins d’intégration industrielle. Collecte et traitement des données : intégration des mesures issues des machines à commande numérique, trackers laser et profilomètres. Optimisation des processus : automatisation des contrôles pour gagner en rapidité et réduire les erreurs. Accompagnement & formation : transfert de compétences aux équipes internes pour assurer la continuité. Résultats obtenus : Temps de contrôle réduit : amélioration notable de la productivité. Précision accrue : fiabilisation des mesures grâce à l’intégration de nouvelles technologies. Réduction des erreurs : contrôles plus rapides et répétables. Compétences préservées : maintien de la connaissance technique dans l’organisation. Facteurs clés de succès : Expertise technique d’Agaetis en développement industriel et IoT . Grande flexibilité dans la collaboration avec le client. Intégration fluide des données issues de différents équipements. Approche orientée impact et résultats mesurables. Et vous ? Vous vous interrogez sur : l’optimisation de vos processus de contrôle industriel, l’intégration de nouvelles technologies de mesure, ou la digitalisation de vos applications qualité ? 👉 Contactez nos experts pour moderniser vos contrôles industriels et accroître votre performance opérationnelle.
Show More