Added mem-dbg across all structs of the crate as optional feature #5
Conversation
@LucaCappelletti94 thanks for adding optional mem-dbg support! We'd love to have this change in; however, it seems a few size-calculation inconsistencies should be addressed first. For example, if I apply a diff that compares the two estimates and then run the tests, I can see quite a few unexplained discrepancies between the crate's `size_of` results and the mem-dbg ones.
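The diff and test command themselves aren't reproduced above. As an editorial sketch, a consistency check of this kind might look as follows; this is hypothetical: it assumes the crate's public `size_of()` method, mem-dbg's `MemSize` trait, and `CardinalityEstimator::<P, W>` generics, all of which may differ between versions.

```rust
use cardinality_estimator::CardinalityEstimator;
use mem_dbg::{MemSize, SizeFlags};

#[test]
fn size_of_matches_mem_size() {
    let mut estimator = CardinalityEstimator::<12, 6>::new();
    for i in 0..10_000u64 {
        estimator.insert(&i);
        // The two estimates should agree in every representation
        // (Small, Array, HyperLogLog) the estimator passes through.
        assert_eq!(
            estimator.size_of(),
            estimator.mem_size(SizeFlags::default())
        );
    }
}
```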
Basically, in most cases the derive is enough to cover everything, and honestly, when I opened the PR, I had just done that, thinking it would be enough. Afterwards, as I started benchmarking, I realized that, due to the pointer trick used to handle dynamic size, the derive told us the struct was only 8 bytes or so. Therefore, I quickly wrote up the trait implementations, but I only covered the inner representation. I will try to fix it shortly.
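To illustrate the pointer-trick problem described here (hypothetical struct, not the crate's actual layout): a derived `MemSize` only sees a plain `usize` field and cannot know that it sometimes holds a pointer to heap data.

```rust
use mem_dbg::{MemSize, SizeFlags};

#[derive(MemSize)]
struct Tagged {
    // Secretly either an inline value or a tagged pointer to a heap
    // allocation; the derive just sees a plain usize.
    data: usize,
}

fn main() {
    let v = vec![0u32; 1024];
    let t = Tagged { data: v.as_ptr() as usize };
    // Reports only 8 bytes on a 64-bit target, missing v's 4 KiB buffer,
    // which is why manual trait implementations were needed.
    assert_eq!(t.mem_size(SizeFlags::default()), 8);
}
```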
Small update: I fixed the size-estimation errors on your crate's side, but also identified an error on the mem-dbg side. Working on that now.
Now, if you rerun the same script as above, you will find that all estimates match.
```diff
@@ -34,7 +35,7 @@ name = "cardinality_estimator"
 harness = false

 [features]
-default = []
+default = ["mem_dbg"]
```
Is it really necessary to enable the `mem_dbg` feature by default?
Only if by the end of this PR we were to replace `size_of` with `mem_dbg`: the two methods are by definition very overlapping in scope. Possibly, reading the other comments, you intended `size_of` to only include the size of the stored object, while `mem_dbg` has the goal of computing the size of the object in memory.
If `mem_dbg` aims to show real heap memory usage similarly to `size_of`, then it could be a good replacement for, or alternative to, the current `size_of` implementation.
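The distinction between the two measurements is easy to demonstrate in isolation (standard-library types only, assuming mem-dbg's `MemSize` trait): `std::mem::size_of_val` reports just the inline part of a value, while a deep measurement also counts the heap allocation behind it.

```rust
use mem_dbg::{MemSize, SizeFlags};

fn main() {
    let v: Vec<u32> = vec![0; 1024];
    // Shallow size: pointer + length + capacity, i.e. 24 bytes on 64-bit targets.
    println!("size_of_val: {}", std::mem::size_of_val(&v));
    // Deep size: also counts the 4096-byte heap buffer behind the Vec.
    println!("mem_size:    {}", v.mem_size(SizeFlags::default()));
}
```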
src/array.rs
```diff
@@ -110,7 +111,7 @@ impl<'a, const P: usize, const W: usize> RepresentationTrait for Array<'a, P, W>
     /// Return memory size of `Array` representation
     #[inline]
     fn size_of(&self) -> usize {
-        size_of::<usize>() + size_of_val(self.arr)
+        size_of::<usize>() + size_of::<usize>() + size_of::<&[u32]>() + size_of_val(self.arr)
```
I'm not sure about this `size_of` calculation logic. Originally, I was counting it as the sum of:
- `data: usize`, which stores either the data or a pointer to the data
- an optional `u32` slice
Let me comment it better
Done that in commit 20f3d0d; let me know if all is clear.
src/hyperloglog.rs
```diff
@@ -151,7 +152,8 @@ impl<'a, const P: usize, const W: usize> RepresentationTrait for HyperLogLog<'a,
     /// Return memory size of `HyperLogLog`
     #[inline]
     fn size_of(&self) -> usize {
-        size_of::<usize>() + size_of_val(self.data)
+        // The length of the slice, plus the pointer of the slice reference, and the size of the slice itself.
+        size_of::<usize>() + size_of::<usize>() + size_of_val(self.data)
```
We technically don't store the pointer of the slice reference in memory, as it's stored in the `data` field of the `CardinalityEstimator` struct.
I'd suggest measuring allocations of a `Vec<CardinalityEstimator>` of, let's say, 1,000 elements of the same size.
Heap allocations can also be measured with https://docs.rs/dhat/latest/dhat/ as it's done in https://github.com/cloudflare/cardinality-estimator/blob/main/benches/cardinality_estimator.rs#L148.
We could try to compare `mem_dbg` to `size_of` to dhat-measured allocations.
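A sketch of such a dhat measurement (hypothetical wiring: the exact `CardinalityEstimator` generics and the availability of `size_of()` on the public type are assumptions here; the dhat setup follows the crate's documented global-allocator pattern):

```rust
use cardinality_estimator::CardinalityEstimator;

#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    let _profiler = dhat::Profiler::new_heap();

    // Build 1,000 estimators of the same size.
    let estimators: Vec<CardinalityEstimator<12, 6>> = (0..1_000u64)
        .map(|i| {
            let mut e = CardinalityEstimator::new();
            for j in 0..100u64 {
                e.insert(&(i * 100 + j));
            }
            e
        })
        .collect();

    // dhat's view of the live heap at this point versus the crate's own estimate.
    let stats = dhat::HeapStats::get();
    let size_of_total: usize = estimators.iter().map(|e| e.size_of()).sum();
    println!("dhat curr_bytes:  {}", stats.curr_bytes);
    println!("sum of size_of(): {}", size_of_total);
}
```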
Added more comments regarding this in commit 953cea8. While I am all for comparing the results of mem-dbg with those of other crates (I think the authors have already done some of that), in this case I am pretty sure about the correctness of the estimation of the size of a slice: len (usize) + pointer (usize) + size of the values therein.
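The slice arithmetic itself is easy to verify in isolation: a `&[u32]` is a fat pointer (data pointer plus length), and `size_of_val` returns the payload size.

```rust
fn main() {
    let arr: &[u32] = &[1, 2, 3, 4];
    // Fat pointer: one usize for the data pointer, one usize for the length.
    assert_eq!(
        std::mem::size_of::<&[u32]>(),
        2 * std::mem::size_of::<usize>()
    );
    // Payload: 4 elements of 4 bytes each.
    assert_eq!(std::mem::size_of_val(arr), 16);
}
```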
```rust
fn size_of(&self) -> usize {
    // A priori, we do not know which of the child structs is the largest and
    // therefore composes the bulk of the enum's memory usage. Therefore, we
    // subtract the size of the variants, and then add the size of the enum itself.
```
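For background on why variant sizes cannot simply be summed (illustrative enum, not the crate's actual `Representation`): an enum occupies roughly the size of its largest variant plus discriminant and padding, modulo niche optimizations.

```rust
use std::mem::size_of;

#[allow(dead_code)]
enum Repr<'a> {
    Small(usize),
    Array(&'a [u32]),
}

fn main() {
    println!("Small payload: {}", size_of::<usize>()); // 8 on 64-bit
    println!("Array payload: {}", size_of::<&[u32]>()); // 16 on 64-bit
    // The whole enum: largest variant plus whatever tag/padding the compiler adds.
    println!("whole enum:    {}", size_of::<Repr<'static>>());
}
```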
Usually the `Representation` enum is not stored directly and has `pub(crate)` visibility. It's only created when accessing `CardinalityEstimator` methods and is usually optimized away by the compiler anyway. What's stored is the `CardinalityEstimator` struct and its data.
Ok, I see: so basically what I am currently measuring would be the 'runtime peak size' rather than the storage size. I suppose the question then is which one we are more interested in measuring. Which one would you rather go for?
We're primarily interested in measuring the heap-allocated memory usage of the overall `CardinalityEstimator`, and not of the temporarily used individual sub-structs `Small`, `Array`, and `HyperLogLog`, so that we can reliably and accurately measure how much memory a `Vec` or `HashMap` of 1M or 100M `CardinalityEstimator`s uses.
The `Representation` enum and its sub-structs are never actually stored on the heap; these are stack-allocated and quickly disposed structs. I think we should only report how much memory `CardinalityEstimator` uses.
```rust
let cap_size = size_of::<usize>();
// The size of the arr attribute, which is a slice
// and therefore composed of a pointer (usize) and a length (usize)
let arr_size = size_of::<usize>() * 2;
```
Not sure I understand the "multiply by 2" aspect of this logic. However, I don't think we should measure memory usage of `Array` and the other structs except `CardinalityEstimator`, since these are stack-allocated and not actually stored or exposed beyond `pub(crate)` visibility.
mem-dbg is a crate that computes the in-memory size of a struct. I have added the derives throughout the crate as an optional feature, so that it can easily be used to compare this implementation with others.
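A minimal sketch of how the optional feature is meant to be used (the struct below is a stand-in, not one of the crate's actual types; it assumes mem-dbg's `MemSize`/`MemDbg` derives and flags):

```rust
use mem_dbg::{DbgFlags, MemDbg, MemSize, SizeFlags};

#[derive(MemSize, MemDbg)]
struct Example {
    data: usize,
    extra: Vec<u32>,
}

fn main() {
    let e = Example { data: 42, extra: vec![1, 2, 3] };
    // Total size in bytes, following owned heap allocations.
    println!("{}", e.mem_size(SizeFlags::default()));
    // Hierarchical, per-field breakdown of the size.
    e.mem_dbg(DbgFlags::default()).unwrap();
}
```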
Cheers!