In a galaxy far, far away…
EV-9D9: “How many languages do you speak?”
C-3PO: “I am fluent in over six million forms of communication and can readily…”
EV-9D9: “Splendid. We have been without an interpreter since our master got angry with our last protocol droid and disintegrated him.”
C-3PO: “Disintegrated?”
EV-9D9: “Yes, disintegrated… Can you read and write the MySQL dialect of SQL and generate an AST to translate it to another database language? We have 70,000 SQL statements that we don’t really understand…”
C-3PO: “Of course I can! I have a built-in ANTLR4 device that can do that!”
EV-9D9: “Perfect! You’re in then! Follow me…”
When I started to work for one of my customers, they asked me to turn their monolithic platform into a fully distributed, highly available one. If it was so monolithic and so full of SPoFs, it was not because of a real lack of skills or vision; it was more the result of a history made of events, people and… turnover. The lines of code I found and the design of the platform told a real story involving machines, code and human beings, spread over more than 4 years of hotfixes, maintenance and development.
Of course, I can’t name the company (or I would have to hire Boba Fett…), but I do need to describe the flow of data, which is the key to designing a new architecture.
The first thing to know is: we’re listening to clusters of machines. These are industrial machines producing a huge amount of data every second. All these expensive bunches of steel and iron are grouped by site and send all that data over an FTP connection (yes, you read that correctly…) using a 4G or landline link. Sometimes those sites are really in the middle of nowhere (it sounds like Tatooine, I know…).
The data going through the FTP pipe is then processed and inserted into a MySQL database. This data feeds dashboards and visual tools used to monitor customer sites.
My customer needed to build a brand new platform able to scale out, easy to maintain and, of course, iso-functional with the existing one.
That’s when the tricky part comes in: all the computation logic for aggregations is stored in the SQL database as SQL strings (e.g. “AVG(FIELD1)”) in a table. When a computation needs to be done, all those fragments are concatenated into a single request to be executed later.
The caveat is: it’s really hard to maintain a system where strings containing a language (in this case SQL) are manipulated like simple pieces you glue together. Without knowing whether the concatenated query is valid or even makes sense, running it is at your own risk. If you have a few hundred of them, human eyes can check them… but when it’s 70,000 request blocks, you don’t want to hear about “testing it all by hand”!
The main goal of my mission was to build the new platform and migrate the current SQL logic to something else. The database chosen to store all the messages coming from the different sites was Elasticsearch. The problem was: the SQL computations could not be translated directly into Elasticsearch queries or aggregations.
So we decided to explore all these SQL requests to try to simplify them and, maybe afterwards, to translate them into Elasticsearch “jsonish” queries (with Painless scripts, of course) or to run a dockerized SQL engine for some of them in a streaming way; that is, whenever a message arrives over FTP, it triggers a computation with a SQL engine over a limited time window, and these aggregations then consolidate the Elasticsearch database. Helping the customer understand these queries would perhaps also raise questions or bring a different point of view on the running platform.
So how do you explore and make sense of 70,000 SQL statements without reading them one by one? The answer is: use a lexer and a parser that understand the SQL grammar. This may look tricky, but if you use ANTLR4… it’s not!
My idea was to use Zeppelin, and therefore Spark, to load and work on this bunch of queries. The advantage is that you can try and fail, change your code quickly in the Zeppelin notebook and find exactly what you are looking for.
So the first thing to do was to package a lexer/parser in order to add it as a Spark dependency and use it directly from the notebook.
To find the right grammar for my parser, I checked the grammars-v4 repository, a compilation of many ANTLR4 grammars organized by folder. I found the droids I was looking for (even if a weird old bearded man told me: “these aren’t the droids you’re looking for”). There were 2 files: one for the lexer (MySqlLexer.g4: the allowed tokens) and one for the parser (MySqlParser.g4: the rules). These are the 2 files I downloaded to prepare my lexer/parser project.
To compile and package the lexer/parser, I wrote a small project using Maven and its really useful antlr4-maven-plugin. First I wrote a pom.xml to describe the project (groupId, artifactId, …) and some version stuff. Here is the pom.xml (I removed some verbose XML to make it more readable):
<project ...>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.agaetis.antlr4.mysql</groupId>
  <artifactId>parslex</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>MySQL Antlr 4 Lexer/Parser</name>
  <properties>
    …
    <antlr4.visitor>true</antlr4.visitor>
    <antlr4.listener>true</antlr4.listener>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.antlr</groupId>
      <artifactId>antlr4-runtime</artifactId>
      <version>4.7.2</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.antlr</groupId>
        <artifactId>antlr4-maven-plugin</artifactId>
        <version>4.7.2</version>
        <executions>
          <execution>
            <goals>
              <goal>antlr4</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.6.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
Then, in the root folder of the project (next to the pom.xml), I created one folder to hold all the grammar files: src/main/antlr4/com/agaetis/antlr4/mysql. Here is the structure of the project:
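As a rough sketch, assuming nothing more than the pom.xml and the two grammar files downloaded earlier, the layout looks like this:
parslex/
├── pom.xml
└── src/
    └── main/
        └── antlr4/
            └── com/agaetis/antlr4/mysql/
                ├── MySqlLexer.g4
                └── MySqlParser.g4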
When you have your project set up, the next step is to package the jar. For that purpose, of course, we use the antlr4 maven plugin like this:
mvn clean install
After that, you should find in your local Maven repository the jar that you will use with Zeppelin/Spark:
~/.m2/repository/com/agaetis/antlr4/mysql/parslex/0.0.1-SNAPSHOT/parslex-0.0.1-SNAPSHOT.jar
When you have your parser packaged and ready to be used, you need to tell Zeppelin and Spark to make it available from your notebook. In the interpreter section, just search for the Spark interpreter, edit it and add, at the end of the page, a dependency on the package located in your local Maven repository.
All the requests to be parsed are stored in a MySQL database, so I also added a dependency on the MySQL connector to access that database and, since we’re going to use ANTLR4, I obviously added the ANTLR4 runtime dependency as well.
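For reference, the dependency coordinates added to the Spark interpreter would look something like the list below; the parslex and antlr4-runtime coordinates come from the pom.xml above, while the MySQL connector version shown here is only an example (use whichever matches your server):
com.agaetis.antlr4.mysql:parslex:0.0.1-SNAPSHOT
mysql:mysql-connector-java:5.1.47
org.antlr:antlr4-runtime:4.7.2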
Here is the snippet to create a dataframe from a MySQL table:
val options = Map(
  "url"      -> dbUrl,
  "dbtable"  -> tableName,
  "user"     -> userName,
  "password" -> userPassword,
  "driver"   -> "com.mysql.jdbc.Driver"
)
val queries = spark.read.format("jdbc").options(options).load
queries.cache()
queries.count
This means you’ll be able to use Spark DataFrame operators to manipulate (filter, map…) your queries.
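For instance, here is a minimal sketch of such a manipulation; it assumes the column holding the SQL text is called "query" (the name used later with the UDF) and that the notebook provides the $ column syntax:
// Keep only the statements that contain an aggregation, just as an illustrative filter
val aggQueries = queries.filter($"query".contains("AVG"))
aggQueries.show(10, truncate = false)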
Now, how do you compute the complexity of a SQL statement? The most intuitive way (the one that came to mind first) was, IMHO, to compute the complexity of the AST (Abstract Syntax Tree) representing a given SQL query; that’s why I used a library like ANTLR.
Let’s take an example: here is a query that is just a SELECT statement without any FROM or WHERE part.
SELECT IF(AVG(age) > 10, (AVG(age) * 20), 0)
ANTLR4 will compute an AST for that query that looks like this:
In that case, the purpose of the query is: if the average of the field age is greater than 10, then the value returned is the average of age multiplied by 20, otherwise the result is 0. This query is not a complicated one, but it shows what the AST looks like.
We can see that we have five layers (not counting the AST root in the picture). My first idea was to compute, for each layer, the number of elements times the layer index (the first level has index 1) and then to multiply these products together. In practice, the number returned that way is too high. So instead, for each layer I took the natural logarithm of the number of nodes times the level of the layer (plus 1, because it’s a logarithm), multiplied these terms together, and finally applied one last natural logarithm.
So in this example, with Ni the number of nodes at level i, the complexity is:
ln((ln(1*N1)+1) * (ln(2*N2)+1) * (ln(3*N3)+1) * (ln(4*N4)+1) * (ln(5*N5)+1))
which, with the node counts 2, 4, 5, 4 and 2, equals:
ln((ln(1*2)+1) * (ln(2*4)+1) * (ln(3*5)+1) * (ln(4*4)+1) * (ln(5*2)+1))
That gives 5.48. When we run this query through the function implemented in the notebook, we don’t get the same result, because I simplified the abstract syntax tree here to illustrate the computation of the complexity factor. ANTLR4 generates a more complex tree for this same query, and therefore we get a higher complexity factor when it is executed in the notebook.
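A quick way to sanity-check that arithmetic (the layer sizes 2, 4, 5, 4, 2 come from the simplified tree above):
// (level, node count) pairs taken from the simplified AST of the example
val layers = Seq(1 -> 2, 2 -> 4, 3 -> 5, 4 -> 4, 5 -> 2)
val complexity = math.log(layers.map { case (level, n) => math.log(level * n) + 1.0 }.product)
println(f"$complexity%.2f") // prints 5.48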
Here is the code to compute the complexity of a query:
import com.agaetis.antlr4.mysql.MySqlParserBaseListener
import com.agaetis.antlr4.mysql.MySqlLexer
import com.agaetis.antlr4.mysql.MySqlParser
import org.antlr.v4.runtime.{BaseErrorListener, CharStream, CharStreams, CommonTokenStream, ParserRuleContext, RecognitionException, Recognizer}
import org.antlr.v4.runtime.misc.ParseCancellationException
import org.antlr.v4.runtime.tree.{ParseTree, ParseTreeWalker}
import scala.collection.mutable.LinkedHashMap

// Counts the number of nodes per depth level while walking the tree
class MyListener extends MySqlParserBaseListener {
  val nodeCountPerLevel = LinkedHashMap[Int, Int]()
  override def enterEveryRule(ctx: ParserRuleContext): Unit = {
    nodeCountPerLevel(ctx.depth()) = nodeCountPerLevel.getOrElse(ctx.depth(), 0) + 1
  }
}

// Turns any syntax error into an exception so invalid statements can be flagged
class MyErrorListener extends BaseErrorListener {
  @throws(classOf[ParseCancellationException])
  override def syntaxError(recognizer: Recognizer[_, _],
                           offendingSymbol: Object,
                           line: Int,
                           charPositionInLine: Int,
                           msg: String,
                           e: RecognitionException) = {
    println(s"ERROR $msg")
    throw new ParseCancellationException(msg)
  }
}

// Returns the complexity score of a statement, or -1.0 if it does not parse
def evalComplexity(input: String, exprType: String): Double = {
  if (input == null || input.isEmpty)
    0
  else {
    try {
      val inputCharStream = CharStreams.fromString(input)
      val sqlLexer: MySqlLexer = new MySqlLexer(inputCharStream)
      val tokenStream = new CommonTokenStream(sqlLexer)
      val parser = new MySqlParser(tokenStream)
      parser.removeErrorListeners()
      parser.addErrorListener(new MyErrorListener())
      // Choose the start rule depending on the kind of snippet we are parsing
      val tree: ParseTree = exprType match {
        case "SELECT" => parser.selectStatement()
        case "EXPR"   => parser.expression()
        case _        => parser.root()
      }
      val walker = new ParseTreeWalker()
      val listener = new MyListener()
      walker.walk(listener, tree)
      val nodeCountPerLevel = listener.nodeCountPerLevel
      println(nodeCountPerLevel.size)
      // ln of the product over levels of (ln(level * nodeCount) + 1)
      val score = nodeCountPerLevel.map {
        case (k, v) => Math.log(k * v) + 1.0
      }.reduce((a, b) => a * b)
      Math.log(score)
    } catch {
      case e: ParseCancellationException => -1.0
    }
  }
}
Here is an example of using this function:
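A minimal sketch of such a call, using the simplified query from earlier (the exact score depends on the full tree ANTLR4 builds, so it will be higher than the hand-computed 5.48):
val sample = "SELECT IF(AVG(age) > 10, (AVG(age) * 20), 0)"
val score = evalComplexity(sample, "SELECT") // parsed with the selectStatement rule
println(score)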
Using the dataframe (where the queries are) and a UDF (user defined function), you can add a complexity column to your dataframe like this:
import org.apache.spark.sql.functions.udf // may already be imported in your notebook

// Wrap evalComplexity in a UDF; the stored fragments are expressions, hence "EXPR"
def evalComplexityUdf = udf((input: String) => {
  evalComplexity(input, "EXPR")
})

// allQueries is the dataframe of statements loaded from MySQL
val allQueriesWithComplexity = allQueries.withColumn("valCplx", evalComplexityUdf($"query"))
Then you can display and inspect the complexity of any query in the dataset:
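For instance, one possible way to surface the most complex statements first:
// Show the 20 most complex statements; failed parses carry a score of -1.0
allQueriesWithComplexity
  .select($"query", $"valCplx")
  .orderBy($"valCplx".desc)
  .show(20, truncate = false)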
We were then able to identify the complex queries, to analyse why they were so complex and to check whether they were still in use in the system. There was another advantage to this method: we found a few invalid statements in the MySQL database (missing parentheses, invalid variable names…). These queries have since been corrected, or removed when they were not used anymore.
The customer was able to understand a bit better what was in the database, and it was a good start for rethinking the way these queries should be kept or mapped to Elasticsearch.
When you have a SQL-based system in which SQL queries are concatenated or manipulated as strings, you can assume it is not a best practice. It is a trick that hides a deeper design problem in the platform. In that case, you can either run away and say you can’t do anything for them, or you can pluck up the courage to give meaning to the customer’s data and make them understand that there is always a better way: rethink the data modeling and use a database that better fits the use case.