Skip to content

Added mem-dbg across all structs of the crate as optional feature #5

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

LucaCappelletti94
Copy link

Mem-dbg is a crate that allows to compute the size of a struct. I have added the derives through the crate as an optional feature, so as to use it to compare this implementation with others easily.

Cheers!

@bocharov
Copy link
Contributor

bocharov commented Aug 8, 2024

@LucaCappelletti94 thanks for adding optional mem-dbg feature.

We'd love to have this change in, however it seems few size calculation inconsistencies should be addressed first.

For example, if I apply the following diff:

diff --git a/src/estimator.rs b/src/estimator.rs
index 0c7c7fb..cca6bc0 100644
--- a/src/estimator.rs
+++ b/src/estimator.rs
@@ -181,7 +181,9 @@ where
     H: Hasher + Default,
 {
     fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
-        write!(f, "{:?}", self.representation())
+        use mem_dbg::{MemSize, SizeFlags};
+
+        write!(f, "{:?} mem_dbg = {:?}", self.representation(), self.mem_size(SizeFlags::default()))
     }
 }

and then run tests with:

cargo test --features mem_dbg test_estimator_p12_w6

I can see quite a few unexplained discrepancies between mem_dbg and the actual size:

---- estimator::tests::test_estimator_p12_w6::_0_expects_representation_small_estimate_0_size_8_avg_err_0_0000_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_0_expects_representation_small_estimate_0_size_8_avg_err_0_0000_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Small(estimate: 0, size: 8), avg_err: 0.0000"
 right: "representation: Small(estimate: 0, size: 8) mem_dbg = 40, avg_err: 0.0000"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- estimator::tests::test_estimator_p12_w6::_128_expects_representation_array_estimate_128_size_520_avg_err_0_0000_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_128_expects_representation_array_estimate_128_size_520_avg_err_0_0000_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Array(estimate: 128, size: 520), avg_err: 0.0000"
 right: "representation: Array(estimate: 128, size: 520) mem_dbg = 40, avg_err: 0.0000"

---- estimator::tests::test_estimator_p12_w6::_1_expects_representation_small_estimate_1_size_8_avg_err_0_0000_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_1_expects_representation_small_estimate_1_size_8_avg_err_0_0000_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Small(estimate: 1, size: 8), avg_err: 0.0000"
 right: "representation: Small(estimate: 1, size: 8) mem_dbg = 40, avg_err: 0.0000"

---- estimator::tests::test_estimator_p12_w6::_16_expects_representation_array_estimate_16_size_72_avg_err_0_0000_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_16_expects_representation_array_estimate_16_size_72_avg_err_0_0000_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Array(estimate: 16, size: 72), avg_err: 0.0000"
 right: "representation: Array(estimate: 16, size: 72) mem_dbg = 40, avg_err: 0.0000"

---- estimator::tests::test_estimator_p12_w6::_1024_expects_representation_hll_estimate_1012_size_3092_avg_err_0_0130_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_1024_expects_representation_hll_estimate_1012_size_3092_avg_err_0_0130_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Hll(estimate: 1012, size: 3092), avg_err: 0.0130"
 right: "representation: Hll(estimate: 1012, size: 3092) mem_dbg = 40, avg_err: 0.0130"

---- estimator::tests::test_estimator_p12_w6::_129_expects_representation_hll_estimate_130_size_3092_avg_err_0_0001_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_129_expects_representation_hll_estimate_130_size_3092_avg_err_0_0001_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Hll(estimate: 130, size: 3092), avg_err: 0.0001"
 right: "representation: Hll(estimate: 130, size: 3092) mem_dbg = 40, avg_err: 0.0001"

---- estimator::tests::test_estimator_p12_w6::_256_expects_representation_hll_estimate_254_size_3092_avg_err_0_0029_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_256_expects_representation_hll_estimate_254_size_3092_avg_err_0_0029_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Hll(estimate: 254, size: 3092), avg_err: 0.0029"
 right: "representation: Hll(estimate: 254, size: 3092) mem_dbg = 40, avg_err: 0.0029"

---- estimator::tests::test_estimator_p12_w6::_2_expects_representation_small_estimate_2_size_8_avg_err_0_0000_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_2_expects_representation_small_estimate_2_size_8_avg_err_0_0000_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Small(estimate: 2, size: 8), avg_err: 0.0000"
 right: "representation: Small(estimate: 2, size: 8) mem_dbg = 40, avg_err: 0.0000"

---- estimator::tests::test_estimator_p12_w6::_3_expects_representation_array_estimate_3_size_24_avg_err_0_0000_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_3_expects_representation_array_estimate_3_size_24_avg_err_0_0000_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Array(estimate: 3, size: 24), avg_err: 0.0000"
 right: "representation: Array(estimate: 3, size: 24) mem_dbg = 40, avg_err: 0.0000"

---- estimator::tests::test_estimator_p12_w6::_32_expects_representation_array_estimate_32_size_136_avg_err_0_0000_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_32_expects_representation_array_estimate_32_size_136_avg_err_0_0000_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Array(estimate: 32, size: 136), avg_err: 0.0000"
 right: "representation: Array(estimate: 32, size: 136) mem_dbg = 40, avg_err: 0.0000"

---- estimator::tests::test_estimator_p12_w6::_4_expects_representation_array_estimate_4_size_24_avg_err_0_0000_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_4_expects_representation_array_estimate_4_size_24_avg_err_0_0000_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Array(estimate: 4, size: 24), avg_err: 0.0000"
 right: "representation: Array(estimate: 4, size: 24) mem_dbg = 40, avg_err: 0.0000"

---- estimator::tests::test_estimator_p12_w6::_8_expects_representation_array_estimate_8_size_40_avg_err_0_0000_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_8_expects_representation_array_estimate_8_size_40_avg_err_0_0000_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Array(estimate: 8, size: 40), avg_err: 0.0000"
 right: "representation: Array(estimate: 8, size: 40) mem_dbg = 40, avg_err: 0.0000"

---- estimator::tests::test_estimator_p12_w6::_64_expects_representation_array_estimate_64_size_264_avg_err_0_0000_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_64_expects_representation_array_estimate_64_size_264_avg_err_0_0000_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Array(estimate: 64, size: 264), avg_err: 0.0000"
 right: "representation: Array(estimate: 64, size: 264) mem_dbg = 40, avg_err: 0.0000"

---- estimator::tests::test_estimator_p12_w6::_512_expects_representation_hll_estimate_498_size_3092_avg_err_0_0068_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_512_expects_representation_hll_estimate_498_size_3092_avg_err_0_0068_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Hll(estimate: 498, size: 3092), avg_err: 0.0068"
 right: "representation: Hll(estimate: 498, size: 3092) mem_dbg = 40, avg_err: 0.0068"

---- estimator::tests::test_estimator_p12_w6::_4096_expects_representation_hll_estimate_4105_size_3092_avg_err_0_0089_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_4096_expects_representation_hll_estimate_4105_size_3092_avg_err_0_0089_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Hll(estimate: 4105, size: 3092), avg_err: 0.0089"
 right: "representation: Hll(estimate: 4105, size: 3092) mem_dbg = 40, avg_err: 0.0089"

---- estimator::tests::test_estimator_p12_w6::_10_000_expects_representation_hll_estimate_10068_size_3092_avg_err_0_0087_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_10_000_expects_representation_hll_estimate_10068_size_3092_avg_err_0_0087_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Hll(estimate: 10068, size: 3092), avg_err: 0.0087"
 right: "representation: Hll(estimate: 10068, size: 3092) mem_dbg = 40, avg_err: 0.0087"

---- estimator::tests::test_estimator_p12_w6::_100_000_expects_representation_hll_estimate_95628_size_3092_avg_err_0_0182_ stdout ----
thread 'estimator::tests::test_estimator_p12_w6::_100_000_expects_representation_hll_estimate_95628_size_3092_avg_err_0_0182_' panicked at src/estimator.rs:237:5:
assertion `left == right` failed
  left: "representation: Hll(estimate: 95628, size: 3092), avg_err: 0.0182"
 right: "representation: Hll(estimate: 95628, size: 3092) mem_dbg = 40, avg_err: 0.0182"

Perhaps, cardinality-estimator crate logic should be adjusted on how mem_size is computed or something inside mem-dbg crate should be changed.

@LucaCappelletti94
Copy link
Author

Basically, in most cases the derive is enough to cover everything, and honestly when I opened the PR I had just done that thinking it should be it. Afterwards, as I started benchmarking, I realized that due to the use of the pointer trick to handle dynamic size, the derive told us the struct was only 8 bits or so. Therefore, I quickly wrote up the traits implementations, but I only covered the inner representation. I will try and fix it shortly.

@LucaCappelletti94
Copy link
Author

Small update: fixed errors estimating size on your crate side, but also identified an error on the mem dbg side. Working on that now.

@LucaCappelletti94
Copy link
Author

Now if you rerun the same script as above, you will find that all estimates are matching.

@@ -34,7 +35,7 @@ name = "cardinality_estimator"
harness = false

[features]
default = []
default = ["mem_dbg"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really necessary to enable mem_dbg feature by default?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if by the end of this PR we were to replace size_of with mem_dbg - the two methods are by definition very overlapping in scope. Possibly, reading the other comments, you intended size_of to only include the size of the stored object, while mem_dbg has the goal of computing the size of the object in memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If mem_dbg aims to show real heap memory usage similarly to size_of then it could be a good replacement / alternative to current size_of implementation.

src/array.rs Outdated
@@ -110,7 +111,7 @@ impl<'a, const P: usize, const W: usize> RepresentationTrait for Array<'a, P, W>
/// Return memory size of `Array` representation
#[inline]
fn size_of(&self) -> usize {
size_of::<usize>() + size_of_val(self.arr)
size_of::<usize>() + size_of::<usize>() + size_of::<&[u32]>() + size_of_val(self.arr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this size_of calculation logic. Originally, I was counting it as sum of:

  • data: usize - which stores either data or pointer to the data
  • optional u32 slice

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me comment it better

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done that in commit 20f3d0d, let me know if all is clear.

@@ -151,7 +152,8 @@ impl<'a, const P: usize, const W: usize> RepresentationTrait for HyperLogLog<'a,
/// Return memory size of `HyperLogLog`
#[inline]
fn size_of(&self) -> usize {
size_of::<usize>() + size_of_val(self.data)
// The length of the slice, plus the pointer of the slice reference, and the size of the slice itself.
size_of::<usize>() + size_of::<usize>() + size_of_val(self.data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We technically don't store pointer of the slice reference in memory, as it's stored in "data" field of CardinalityEstimator struct.

I'd suggest to measure allocations of Vec<CardinalityEstimator> let's say of 1,000 elements of same size.
Also heap allocations can be measured with https://docs.rs/dhat/latest/dhat/ as it's done in https://github.com/cloudflare/cardinality-estimator/blob/main/benches/cardinality_estimator.rs#L148
We could try to compare mem_dbg to size_of to dhat measured allocations.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more comments regarding this in commit 953cea8. While I am all for comparing the results of mem-dbg with those of other crates (I think the authors already have done some of that), in this case I am pretty sure about the correctness of the estimation of the size of a slice (len (usize) + pointer (usize) + size of values therein)

fn size_of(&self) -> usize {
// A priori, we do not know which of the child structs is the largest and therefore
// composing the bulk of the memory usage by the enum. Therefore, we subtract the size of
// the variants, and then we sum the size of the enum itself.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually Representation enum is not stored directly and has pub(crate) visibility. It's only created when accessing CardinalityEstimator methods and usually optimized by compiler anyway. What's stored is CardinalityEstimator struct and its data.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see, so basically what I am currently measuring would be the 'runtime peak size' in place of the storage size. I suppose the question then is which one we are more interested in measuring. Which one would you rather go for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're primarily interested in measuring heap allocated memory usage of overall CardinalityEstimator and not the temporary used individual sub-structs Small, Array and HyperLogLog, so that we can reliably and accurately measure how much memory Vec or HashMap of 1M or 100M of CardinalityEstimator uses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Representation enum and its sub-structs are never actually stored on the heap, these are stack allocated and quickly disposed structs. I think we should only report how much memory CardinalityEstimator uses.

fn size_of(&self) -> usize {
// A priori, we do not know which of the child structs is the largest and therefore
// composing the bulk of the memory usage by the enum. Therefore, we subtract the size of
// the variants, and then we sum the size of the enum itself.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Representation enum and its sub-structs are never actually stored on the heap, these are stack allocated and quickly disposed structs. I think we should only report how much memory CardinalityEstimator uses.

let cap_size = size_of::<usize>();
// The size of the arr attribute, which is a slice
// and therefore composed of a pointer (usize) and a length (usize)
let arr_size = size_of::<usize>() * 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the "multiply by 2" aspect of this logic, however I don't think we should measure memory usage of Array and other structs except CardinalityEstimator since these are stack allocated and not actually stored / exposed beyond pub(crate) visibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants