-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat add concurrency support for DuckDBConnectionManager #1700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested. Reviewed everything up to 0ef7b5a in 2 minutes and 0 seconds
More details
- Looked at
321
lines of code in2
files - Skipped
0
files when reviewing. - Skipped posting
5
drafted comments based on config settings.
1. pandasai/data_loader/duck_db_connection_manager.py:107
- Draft comment:
Potential SQL injection risk: the table name is directly interpolated in the SQL statement. Consider proper quoting or parameterization for table identifiers. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
Looking at the context: 1) This is internal code handling DataFrame registration 2) The name parameter is used earlier with conn.register() which is DuckDB's official API 3) DuckDB itself would likely handle table name validation 4) This is not taking user SQL input - it's just for internal table management 5) The name is already used safely with register() before this query.
I could be wrong about DuckDB's internal safety mechanisms. Even internal APIs should follow best practices for SQL safety.
While best practices are important, this is a controlled internal API where the name parameter is already safely used with DuckDB's register() method. The risk is minimal since this isn't processing arbitrary user SQL.
The comment should be removed as it raises a theoretical concern that isn't a practical risk in this controlled internal context where the table name is already validated by DuckDB's API.
2. pandasai/data_loader/duck_db_connection_manager.py:62
- Draft comment:
Using a classmethod for _close_connections that relies on the singleton instance logic works, but consider documenting why this approach (vs an instance method) is chosen for clarity. - Reason this comment was not posted:
Confidence changes required:50%
<= threshold50%
None
3. tests/unit_tests/data_loader/test_duckdbmanager.py:37
- Draft comment:
Directly using internal attributes (_pool_size, _get_connection, etc.) in tests couples them tightly to implementation details. Consider testing only via public interfaces if possible. - Reason this comment was not posted:
Confidence changes required:50%
<= threshold50%
None
4. pandasai/data_loader/duck_db_connection_manager.py:108
- Draft comment:
Potential SQL injection risk: the table name is directly injected into the SQL string. Consider sanitizing or quoting the table name to ensure safety. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
Looking at the context: 1) This is DuckDB, an embedded database 2) The 'name' parameter comes from internal code, not user input 3) The register() method is an internal method used to register DataFrames 4) The SQL query is a simple CREATE TABLE statement 5) DuckDB's execute() method may already handle table name escaping.
I might be underestimating the risk - even internal parameters should be properly escaped. DuckDB's behavior with malicious table names isn't immediately clear.
While proper escaping is good practice, this is an embedded database with no network access, and the table names come from internal code, not user input. The risk is minimal.
Given the context and usage, this is not a significant security risk that requires immediate attention. The comment should be removed.
5. tests/unit_tests/data_loader/test_duckdbmanager.py:42
- Draft comment:
The connection pool exhaustion test waits approximately 60 seconds due to _max_wait_time. Consider reducing this value in the test environment to speed up tests. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
The comment makes an assumption about _max_wait_time being 60 seconds, but we can't verify this from the code shown. The test itself looks well-structured and the timeout verification is intentional. Changing the timeout would affect the test's ability to verify the actual production behavior.
I might be missing configuration code elsewhere that sets _max_wait_time. The suggestion to reduce timeout could be valid if tests are actually taking too long.
Without seeing the actual _max_wait_time value or evidence of slow tests, this comment is speculative. The test is intentionally verifying timeout behavior which is important for production code.
Delete the comment as it makes assumptions about the timeout value and suggests changing intentional test behavior without clear evidence of a problem.
Workflow ID: wflow_pHF0gouAE4Efizud
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
feat: exponential backoff for TransactionException Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
fix: add a timeout to thread.join calls Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
Pandas-ai is a very amazing project! But when we used it in http server, we've met some concurrency problems. And then we found that the problem is about the connections of DuckDB. So we modified this class to support concurrency and add unit tests for it.
Hope that it's helpful :)
Important
Adds concurrency support to
DuckDBConnectionManager
with connection pooling and thread-safe operations, including comprehensive unit tests.DuckDBConnectionManager
with_init_connection_pool()
,_get_connection()
, and_release_connection()
methods.register()
with retry logic for concurrent scenarios.sql()
._init_connection_pool()
and_close_connections()
._pool_size
and_max_wait_time
.test_duckdbmanager.py
for connection pool exhaustion, concurrent access, and table registration thread safety.This description was created by
for 0ef7b5a. It will automatically update as commits are pushed.