Ideal catalog, performance measurements
I am developing a library to work with Entity Attribute Value (repository), abbreviated as EAV (database model for storing arbitrary data). At the end of the last article, I asked you what else I need to write about, you asked to show an example of use and make performance measurements.
What is important to us when working with data? Write speed (adding or updating) and reading speed (specifically – filtering by models of one entity). And the search speed is a priority, because we write down once every ten days, but we read every minute / second and even more than once, and maybe more than one hundred times.
The trick of the library is that it would not work with submission built on basic EAV tables, and work with a small part of this data recorded in a separate materialized view or in a separate table…
During the New Year holidays, I made performance measurements and I want to share the results with you.
What are we going to measure?
Subtraction time for all items in the category
Time of formation of filtration parameters
Time of adding a new characteristic (attribute)
Time of adding a new heading
Time of updating a commodity item
The history of the development of the library began with the development of a market place, so consider that we are working with a catalog of commodity items.
Data set requirements
For testing the reading time from MATERIALIZED VIEW and from TABLE, it is not so important how many records were in the original tables of the EAV model, so several hundred categories will be enough to test the hypothesis.
There are no more than 100 characteristics for a category (and, accordingly, for a commodity item) (for an ordinary online store, this is even a lot, usually the characteristics of a product are no more than a couple of dozen).
Data set characteristics
Taking into account such input, using the script
data-generator/generate.php, the dataset was created:
number of entities (categories) – 320
number of attributes (characteristics) – 144
number of models (items in all categories) – 29,000
numerical values - 800,000
string values - 600,000
The total volume of EAV tables (with indexes) is 310 MB.
Accounting for data variability
The amount of data in which you need to search (filter) is of great importance for the search time.
We divide all entities into three types: many attributes, an average amount and a small amount, select one entity from each type:
Many – 238 models, 112 attributes, total values 26656
Average – 75 models, 34 attributes, 2550 values
Small – 4 models, 5 attributes, 20 values
All measurements in milliseconds
Entity with many models and attributes
Many – 238 models, 112 attributes, total values 26656.
Get all records, no filtering
It is impossible to get all records from VIEW without filtering, because we must have at least one filter – by category – already. Why such a monstrously long time turned out, I cannot understand.
For the “Filter” dimension, which has many models and attributes by entity, a selection was made by 46 columns (out of 112).
When it comes to reading, filtering by MATERIALIZED VIEW is 10% faster than TABLE.
Adding an attribute to a view is re-creating it.
Adding an attribute to the MATERIALIZED VIEW is a complete update.
Adding an attribute to a table is ALTER TABLE, so only 3ms.
Adding a model means creating a record in a table thing and creating empty values in tables word and number… Accordingly, 155 ms for presentation, this is the insertion of 112 records into the table of values (numeric – number, string – word).
For MATERIALIZED VIEW, we again need to recalculate it completely – update it entirely.
For the table, we add a record that will have 112 empty columns, this is only 40.9 ms, but the record is added based on the view, so 155.6 + 40.9 = 196.5.
Updating a record (updating all 112 columns) is done by updating the data in the base EAV tables and then:
This is enough for the presentation;
For MATERIALIZED VIEW, we need to recalculate everything completely again;
For the table, the update occurs independently of the base tables, just one row is updated, but this row is updated only after a successful update of the data in the base tables, so we see 217 ms, in fact, the actual update of the record is 30 ms.
As we can see, the use of MATERIALIZED VIEW gives the maximum gain in read time (twice as fast as from the view and 10% faster than from the table), but at the same time updating the data takes an order of magnitude more time (loss is 10 times).
Using the table gives a gain in reading twice, and a loss in writing by 15%, and this loss is due to the fact that we are waiting for a successful update of the master data, if we do not wait, then the gain will be 6 times.
Of course, in order not to wait, you need to modify the library with a file, add asynchrony or the logic of creating and executing jobs there. Of course, write logic for eventual consistency…
If we update the data in our “universal” catalog once an hour, then the loss of one second for updating is insignificant, and it is definitely worth it to get an additional 10% gain compared to using the table.
If we update data several times per hour, then I would use a table.
An entity with an average number of models and attributes
Average – 75 models, 34 attributes, 2550 values.
Get all records, no filtering
When the record is updated, we again see about 30 ms for the actual update of the record in the table, but compared to 65 ms for the update of the record in the base tables, this is already 50%.
Entity with a small number of models and attributes
Small – 4 models, 5 attributes, 20 values.
Get all records, no filtering
With the decrease in the amount of data, the difference between the way to access this data disappears. For medium categories, there is no difference in reading, the difference in writing is only two times.
Nevertheless, the general trend persists – searching for a materialized view and a table is twice as fast as for a view created on the basis of EAV tables.
If the number of entities is increased by an order of magnitude, then the search time in the “table” will not change, and the search time in the view will increase (it seems to me), therefore, of course, when working with EAV, you should use “tables”.
The question of whether this will be a materialized view, or whether it will be a table itself – must be decided based on the balance between reading and writing.
What’s next ?
In the next article I will give an example of using the library.
I hope you were interested.