reduce chunkstore memory footprint #747

Open · wants to merge 1 commit into master
2 changes: 2 additions & 0 deletions CHANGES.md
@@ -3,6 +3,8 @@
### 1.75
* Bugfix: pypandoc not rendering README correctly for PYPI
* Bugfix: #744 get_info on an empty dataframe raises an exception
* Feature: Chunkstore: Removed duplication error when filtering by columns
* Feature: Chunkstore: Reduced memory footprint when reading data

### 1.74 (2019-02-28)
* Bugfix: #712 Pandas deprecation warning in chunkstore serializer
1 change: 1 addition & 0 deletions arctic/chunkstore/chunkstore.py
@@ -280,6 +280,7 @@ def read(self, symbol, chunk_range=None, filter_data=True, **kwargs):
chunks[segments[0][SYMBOL]].append({DATA: chunk_data, METADATA: mdata})

skip_filter = not filter_data or chunk_range is None
kwargs['inplace'] = kwargs.get('inplace', True)

if len(symbol) > 1:
return {sym: deser(chunks[sym], **kwargs) if skip_filter else chunker.filter(deser(chunks[sym], **kwargs), chunk_range) for sym in symbol}
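For context on the one-line change above: read() now defaults the serializer's inplace flag to True unless the caller overrides it, so the fetched chunk documents can be consumed while the result is built. A minimal sketch of that default handling, assuming nothing beyond what the diff shows (the helper name is illustrative, not part of arctic):

# Hypothetical helper; only the kwargs line itself comes from the diff.
def apply_read_defaults(**kwargs):
    kwargs['inplace'] = kwargs.get('inplace', True)  # the line added in read()
    return kwargs

assert apply_read_defaults() == {'inplace': True}                # new default: deserialize in place
assert apply_read_defaults(inplace=False) == {'inplace': False}  # callers can still opt out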
17 changes: 12 additions & 5 deletions arctic/serialization/numpy_arrays.py
@@ -193,7 +193,7 @@ def serialize(self, df):
ret[METADATA][TYPE] = dtype
return ret

def deserialize(self, data, columns=None):
def deserialize(self, data, columns=None, inplace=False):
"""
Deserializes SON to a DataFrame
@@ -203,13 +203,17 @@ def deserialize(self, data, columns=None):
columns: None, or list of strings
optionally you can deserialize a subset of the data in the SON. Index
columns are ALWAYS deserialized, and should not be specified
inplace: Convert and remove items from data in-place
this will modify data
Returns
-------
pandas dataframe or series
"""
if not data:
return pd.DataFrame()
if not inplace:
data = data[:]
bmoscon marked this conversation as resolved.

Collaborator:
There are some errors in the tests, so I'm thinking this will need to be tweaked a bit more.


meta = data[0][METADATA] if isinstance(data, list) else data[METADATA]
index = INDEX in meta
@@ -218,16 +222,19 @@
if index:
columns = columns[:]
columns.extend(meta[INDEX])
if len(columns) > len(set(columns)):
raise Exception("Duplicate columns specified, cannot de-serialize")
columns = list(set(columns))
Collaborator:
I'm not sure I see this as a win. It seems like the caller may have a bug if they are specifying duplicate columns; we're just hiding the error now.

Contributor Author:
The current logic is confusing when subsetting data frames with indexes. For example, suppose you have a data frame with:
index: date, security
columns: price, volume

The logic works if the user passes ['price'], but raises a duplicate-columns error when passing ['date', 'security', 'price'].

I don't see the value in the check; it should just do the right thing.
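A runnable sketch of the scenario described above (the date/security/price names come from this comment; the real deserializer and its meta[INDEX] handling are only stood in for here):

# Old behavior: explicitly requesting index columns made them appear twice once
# the deserializer extended the list with meta[INDEX], which tripped the check.
requested = ['date', 'security', 'price']
index_cols = ['date', 'security']            # stands in for meta[INDEX]

columns = requested[:]
columns.extend(index_cols)                   # ['date', 'security', 'price', 'date', 'security']
assert len(columns) > len(set(columns))      # the removed check raised an Exception here

columns = list(set(columns))                 # this PR: silently de-duplicate instead
assert sorted(columns) == ['date', 'price', 'security']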

Collaborator:
Using pandas nomenclature, the columns and the index are separate. If there is an index, you always get it back, even if you specify a subset of columns (and even if they do not include the index columns). Maybe the documentation should be improved. If, for example, you specify price and security, you'll still get date as well as price and security, so your fix would only introduce more weirdness (in my opinion).
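A quick pandas illustration of that point (the data values are made up; only the date/security/price/volume names come from this thread): selecting a subset of columns never drops the index, which mirrors how the deserializer always returns meta[INDEX].

import pandas as pd

df = pd.DataFrame(
    {'price': [1.0, 2.0], 'volume': [10, 20]},
    index=pd.MultiIndex.from_tuples(
        [('2019-03-01', 'AAPL'), ('2019-03-01', 'MSFT')],
        names=['date', 'security'],
    ),
)
print(df[['price']])   # the (date, security) index comes back even though only 'price' was requested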

Contributor Author:
We could remove index columns from columns and then check for duplicates. This keeps the nomenclature but holds the user interface to 'minimum surprise'. Or we could raise an error saying they have included index columns in the column list.

Collaborator:
The result would be the same though, no? You'd supply index columns and it won't complain. I foresee someone opening a bug complaining they only specified 1 of 3 index columns but still got all 3 back.

Contributor Author:
Sure.

Contributor Author:
Actually, in retrospect, that means breaking the API for clients. How about we keep the fuzziness for clients and simply output a warning instead?

Collaborator:
OK, let's see a warning :) But I still think that get_info should change; otherwise how would you ever know how to rid yourself of the warning?

Contributor Author:
Agree with the get_info change.

Collaborator:
So it sounds like you just need to fix the broken tests and add the log and we're all set :D


if not isinstance(data, list):
df = self.converter.objify(data, columns)
else:
df = pd.concat([self.converter.objify(d, columns) for d in data], ignore_index=not index)
dfs = []
while len(data):
dfs.append(self.converter.objify(data.pop(0), columns))
df = pd.concat(dfs, ignore_index=not index)
del dfs

if index:
df = df.set_index(meta[INDEX])
df.set_index(meta[INDEX], inplace=True)
if meta[TYPE] == 'series':
return df[df.columns[0]]
return df
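For readers skimming the diff, here is a self-contained sketch of the memory-saving pattern used in the loop above, with stand-ins for the library's objify and metadata handling (illustrative only, not the real serializer):

import pandas as pd

def deserialize_chunks(chunks, inplace=False):
    """Convert a list of raw chunk dicts into a single DataFrame.

    With inplace=True the source list is consumed as it is converted, so the
    raw chunks and the intermediate frames are never all held in memory at once.
    """
    if not chunks:
        return pd.DataFrame()
    if not inplace:
        chunks = chunks[:]                            # keep the caller's list intact
    frames = []
    while len(chunks):
        frames.append(pd.DataFrame(chunks.pop(0)))    # raw chunk is released after conversion
    df = pd.concat(frames, ignore_index=True)
    del frames
    return df

raw = [{'price': [1.0, 2.0]}, {'price': [3.0, 4.0]}]
print(deserialize_chunks(raw, inplace=True))
print(raw)   # [] -- consumed in place, matching the new default in read()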