2021-03-03

Elastic Wildcard ECS Whirlwind

Common Elastic Issues Regarding Cost & Search that Still Exist in 2023
An unposted blog from 2020 that is still true in 2023 and will be beyond.

Elastic Common Schema is rolling back the wildcard data type (the savior of security use case searching) from ECS 1.8...

Reference 1: https://github.com/elastic/ecs/issues/1233

Reference 2: https://github.com/elastic/ecs/pull/1237



For prior reading on the wildcard data type, the keyword data type, text/analyzed fields, and case-sensitive & case-insensitive searching on cyber security related data/logs, all with/around/using Elastic Common Schema and logging use cases:


I noticed in the GitHub comments on reference 2 that Elastic discovered "some notable performance issues related to storage size and indexing throughput that we must have time to review and address in a comprehensive way".


Right..... indexing things increases storage versus storing the thing as is. It's roughly an extra 1 x $IndexTerms on disk.. UNLESS you get good compression ratios. Usually good compression comes at the cost of CPU somewhere, whether client, server, indexing, or somewhere else (more on that later ;)
However, compression, and ultimately a reduction in storage, was a huge thing that Elastic touted in their big announcement of the wildcard data type..
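To make that CPU-vs-storage trade-off concrete: Elasticsearch has long had an index-level knob for exactly this. A minimal sketch, Kibana Dev Tools style (the index name is made up), swapping stored-field compression from the default LZ4 to DEFLATE, i.e. smaller on disk in exchange for more CPU at write/merge time:

// index.codec is a static setting - set it at index creation (or on a closed index)
PUT logs-example
{
  "settings": {
    "index.codec": "best_compression"
  }
}

Same data, two very different cost profiles.. which is the whole point: compression is never free.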


As I started digging further down the rabbit hole of why Elastic decided to roll back wildcard in 1.8, given that case-sensitive log/search bypasses have been well documented and communicated for almost two years now....


I noticed a very peculiar comment on the PR to Lucene that added the sauce (code) for making the compression for the wildcard data type better... The comment reads: "There's a trade-off here between efficient compression (more docs-per-block = better compression) and fast retrieval times (fewer docs-per-block = faster read access for single values)"


OK... It should be pretty clear, but look also at the wording of many of the PRs... You will see things like "most cases", or "if" the data is similar or different.
In short, IT DEPENDS...
There are trade-offs in databases as a whole, let alone in their subcomponents. Whether it is Elasticsearch, some SQL DB, you name it..

I just want to know how we got here. How did we mess up the ability to search for "does a value contain XYZ", regardless of upper/lower case, spacing, etc.?
Elasticsearch could always do that before. Side bar... Yes, the analyzed (text) field was not perfect for security use cases, but it was there, and it was easier to work around its shortcomings than the situation the cyber security community is in now (mostly that nobody knows their searches are not returning the results they expect)..
The company could have just created a community analyzer, like that neu5ron person.. I think he even worked there at one point ;)
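For the curious, the gist of what I mean by such an analyzer, as my own rough sketch (not anyone's published work): treat the whole value as a single token and just lowercase it. You can kick the tires on the idea with the _analyze API before ever touching a mapping:

// inline analysis chain: keep the value whole, lowercase it
POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "C:\\Windows\\System32\\cmd.exe /C WhOaMi /ALL"
}

One token comes back, all lowercase, so a lowercased "contains XYZ" style search (e.g. a wildcard pattern you lowercase yourself) stops caring about however the attacker felt like casing their command line.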

Even if the wildcard data type had fixed everything by now, you still lose other powerful aspects of searching in Lucene (the Elasticsearch backend).

Such as fuzzy/Levenshtein-distance queries, term/ordering queries... so on and so forth... The things that are still useful for security use cases. The things Elasticsearch as a whole is useful for in most (if not all) use cases, let alone cyber.
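As a quick example of the Levenshtein side of that, a sketch of a fuzzy query (made-up index name, ECS-style process.name field) that would catch a typo-squatted binary name:

GET logs-example/_search
{
  "query": {
    "fuzzy": {
      "process.name": {
        "value": "powershel.exe",
        "fuzziness": "AUTO"
      }
    }
  }
}

That is the kind of search that makes Elasticsearch genuinely useful for hunting, and it is exactly what you do not want to trade away for a storage optimization.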

Can somebody at Elastic tell me what was wrong with the keyword data type.. raising ignore_above to 10,000+ (so long values still end up indexed with doc values).. global ordinals.. and creating a custom text analyzer? What does the wildcard data type get us that is so special it needs its own brand new data type.. and needed to be licensed before the great big license change even happened?
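Roughly what that keyword-first setup could look like, as a sketch (the index name, the field choice, and the 10,000 ignore_above value are all illustrative assumptions on my part):

PUT logs-example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_exact": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "process": {
        "properties": {
          "command_line": {
            "type": "keyword",
            "ignore_above": 10000,
            "eager_global_ordinals": true,
            "fields": {
              "insensitive": {
                "type": "text",
                "analyzer": "lowercase_exact"
              }
            }
          }
        }
      }
    }
  }
}

Exact matches, aggregations, and global ordinals come from the keyword side; the lowercased process.command_line.insensitive multi-field covers the case-insensitive "contains" searches; and none of it needs a brand new data type.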


We would have solved the vast majority of the issues by now (free text search :), kept the other search functionality.. fewer template/mapping changes... everybody roasting marshmallows and hunting for bad folks on their networks.

The only explanation(s) I can think of for why the wildcard data type fiasco occurred:
It was to decrease "storage" for licensing/purchase cost...
or perhaps some Amazon debacle - because the wildcard data type had become licensed (before the big license change situation even happened).
It also does not help.. if there is nobody within, or empowered within, Elastic's organization who is (what I like to call) a "glue person". That would be somebody who transcends multiple aspects of the business and use case.. In this example: somebody who knows the security use cases, knows the backend/Lucene (even a small amount would have been enough), has actively or recently deployed and maintained a production environment, AND, most importantly, uses the data like an analyst would.. and works with the cyber community.

But let's think for a second.. Storage is one of the cheapest computing resources there is (vs CPU/RAM)..

So then what..?!
This is where it all gets muddy... Perhaps the increase in storage was such a big deal because there is a bigger pricing issue.. A catch-22 where they shoot themselves in the foot and come in at a higher cost than anybody would expect, because of having to license more nodes (based on that additional storage)..
NOT TO MENTION... shooting themselves in the foot again by moving a lot of the parsing/ECS work to Elasticsearch "ingest" nodes, which are licensed nodes... Compression overhead = more compute.. more compute = more licensed nodes.. more licensed nodes = more license cost...
or this is a genius evil business model :)

However, I don't think the storage increase is the real cost factor if it is handled realistically. I think this is a cloud storage licensing model issue.. combined with what I think is the biggest thing: some religious (sales) document out there that says X amount of TBs per X amount of (licensed) nodes, "NO MORE NO LESS"... and those numbers are pretty unrealistic, I would assume.
Because after X amount of days of immediately available (HOT architecture) data, where writes and reads overlap at the same time, it is not a huge concern to have much larger disks on a single server/resource-unit.......


As it still stands, I am completely uncertain what the need for the wildcard data type was.