Tracking object usage

bvincent · April 17, 2019, 3:22pm

We have a MemSQL cluster that we will be using a data lake. To avoid a data swamp, we would like to monitor object usage so that we can evaluate what is still needed and what can be removed as the data lake grows. What is the best way to achieve this requirement?

It appears that comprehensive audit logging is licensed separately and can impact performance, which might be overkill anyway. Trace logging can log queries and has a “partial mode” to limit overhead (which may miss some queries, which is acceptable for this requirement). My thought is to enable trace logging in partial mode and aggregate the results from the log files into a database table, possibly using a pipeline. Would partial logging still impact performance? Do the trace logs include the user name such that we can see who is using the objects? Are there any other solutions now or in the future with the product roadmap?

jack · April 17, 2019, 5:40pm

What types of database objects are you interested in? For tables and columns, MemSQL exposes information about which table columns are being used in which queries in information_schema.mv_aggregated_column_usage and query_column_usage. See SingleStoreDB Self-Managed 6.7 Performance Improvements under section “New information_schema views for tuning”. This allows you to see which have been queried recently and which haven’t, as well as enabling more advanced analytics about which are involved in the most expensive queries, etc.

Yes, you can use query logging as you described. There should not be a large performance impact from PARTIAL mode. These query logs do not include the user name - for that you would need audit logs, but you are correct that those would impact performance - as they are a security feature they have a much larger performance impact.

bvincent · April 17, 2019, 10:56pm

I’d be interested in identifying all types of objects that are no longer being used to keep the database clean. However, I’m primarily interested in unused tables and their columns, as their data consumes resources that could be used elsewhere.

Thanks for pointing me to these management views. To know whether a table is really unused, we need to capture this data over a period of time. When you say that these views capture which tables have been queried “recently”, do you know the specific timeframe? Is there a standard approach to keeping more history than that, perhaps a scheduled process that periodically copies the contents to another table or file? Also, do they capture indirect usage? For example, a table that accessed through a view or procedure. I don’t have my work computer at the moment, but I’ll be sure to look at the views and test them tomorrow.

jack · April 18, 2019, 12:56pm

It captures uses by all queries in the in-memory plancache (i.e. those shown in information_schema.plancache). Queries will start to expire out of the plancache after they are unused for 12 hours (by default). When that happens, they also no longer show up in these management views.

Yes, you can periodically run a query such as an insert select to copy the data from there to another location. Yes, it captures indirect usage via views.