Integration with geopandas #588 #818
New file (a VS Code settings file):

```diff
@@ -0,0 +1,3 @@
+{
+    "python.pythonPath": "/Users/tim/anaconda/anaconda/envs/altair_test/bin/python"
+}
```
New file (a VS Code tasks.json):

```diff
@@ -0,0 +1,15 @@
+{
+    // See https://go.microsoft.com/fwlink/?LinkId=733558
+    // for the documentation about the tasks.json format
+    "version": "2.0.0",
+    "tasks": [
+        {
+            "label": "pip install",
+            "type": "shell",
+            "command": "pip install -e .",
+            "problemMatcher": [
+                "$go"
+            ]
+        }
+    ]
+}
```
**Reviewer:** please remove this file

**Author:** done
```diff
@@ -227,9 +227,13 @@ def parse_shorthand(shorthand, data=None):
     {'aggregate': 'count', 'type': 'quantitative'}
     """
     attrs = _parse_shorthand(shorthand)
-    if isinstance(data, pd.DataFrame) and 'type' not in attrs:
+    if isinstance(data, pd.DataFrame):
         if 'field' in attrs and attrs['field'] in data.columns:
-            attrs['type'] = infer_vegalite_type(data[attrs['field']])
+            if 'type' not in attrs:
+                attrs['type'] = infer_vegalite_type(data[attrs['field']])
+            if hasattr(data, '__geo_interface__'):  # TODO: Add description in shorthand
+                attrs['field'] = 'properties.' + attrs['field']
     return attrs
```
**Reviewer:** Something strange is going on here: … is not checking for …

**Author:** Here is the shape of the data block in question:

```jsonc
{ /* vega-lite data */
  "format": {
    "property": "features", /* generate geoshape for each row */
    "type": "json"
  },
  "values": { /* valid geojson for all rows */
    "type": "FeatureCollection",
    "features": [
      { /* valid geojson for each row */
        "type": "Feature",
        "properties": { /* column values */
          "pop_est": 12799293.0,
          "continent": "Africa"
        },
        "geometry": { /* geometry of the row */
          "type": "MultiPolygon",
          "coordinates": [ /* a lot of numbers */ ]
        }
      }
    ]
  }
}
```

So the first step is to add `"property": "features"` to the Vega-Lite data format description. That splits the valid GeoJSON stored in `values` back into the rows of the GeoDataFrame (it is possible to replace this step by storing the content of `"features"` directly in `"values"`, but that would make `"values"` invalid GeoJSON). Next is access to the column values of the GeoDataFrame: values are accessible from the chart as `"properties.column_name"`. I hoped to simplify the user experience by adding `"properties."` in the shorthand. Now, if a user uses shorthands, they get the same behavior for a GeoDataFrame as for a DataFrame (take a look at the updated description of the PR). Maybe I should check whether the field name already starts with `"properties."` to avoid doubling it?

**Reviewer:** OK, I see. We can think about that. In the meantime, can you fix the indentation? 😄

**Author:** indentation fixed

**Author:** Not needed anymore due to the new save format.
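For illustration, a minimal sketch of the field expansion described above, including the `startswith` guard the author proposes at the end; the helper name is invented, not part of the PR:

```python
import pandas as pd

def _expand_geo_field(attrs, data):
    # Hypothetical helper: when the DataFrame also exposes a geo interface
    # (e.g. a GeoDataFrame), each row is serialized as a GeoJSON Feature,
    # so column values live under "properties." in the generated spec.
    if isinstance(data, pd.DataFrame) and hasattr(data, '__geo_interface__'):
        if 'field' in attrs and not attrs['field'].startswith('properties.'):
            attrs['field'] = 'properties.' + attrs['field']
    return attrs

# e.g. {'field': 'pop_est', 'type': 'quantitative'}
#  ->  {'field': 'properties.pop_est', 'type': 'quantitative'}
```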
```diff
@@ -33,7 +33,7 @@ class DataTransformerRegistry(PluginRegistry[DataTransformerType]):
 # form.
 #
 # A data model transformer has the following type signature:
-# DataModelType = Union[dict, pd.DataFrame]
+# DataModelType = Union[dict, pd.DataFrame, gpd.GeoDataFrame, geojson interface object]
 # DataModelTransformerType = Callable[[DataModelType, KwArgs], DataModelType]
 # ==============================================================================
```
**Reviewer:** Is it true that any Python object can expose a geojson interface? In that case, I don't think the typing-style specification is all that useful anymore (or it just needs to have …).

**Author:** According to Sean Gillies, there is a growing community of adopters :). In fact, there is some list of classes unknown to me. In my opinion there is no crucial value in supporting the general interface; it is just a fun feature. If it breaks some concept, maybe we should remove interface support. Or we can declare support for geojson only, so all the others will be supported accidentally.

**Reviewer:** Hmm... I'm not certain. I don't know the geo space well enough.

**Author:** Well, I think …
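For context, the geo interface is a duck-typed protocol: any object qualifies simply by exposing a `__geo_interface__` attribute. A minimal sketch with a hypothetical class, following Sean Gillies' protocol:

```python
class Point:
    """Hypothetical example: an arbitrary object adopting the geo interface."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    @property
    def __geo_interface__(self):
        # A GeoJSON-style mapping describing this object's geometry
        return {'type': 'Point', 'coordinates': (self.x, self.y)}

# Detection is duck-typed, exactly as in the diffs below:
assert hasattr(Point(1.0, 2.0), '__geo_interface__')
```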
```diff
@@ -57,6 +57,8 @@ def limit_rows(data, max_rows=5000):
             values = data['values']
         else:
             return data
+    else:
+        return data
```
**Reviewer:** Now that there are three return statements in this code, the logic is pretty opaque (it took me a bit to read this and figure out what it was doing). I think the function should be refactored for clarity.

**Author:** done

**Reviewer:** As currently written, the function will never progress beyond this line, and the max_rows check will never happen.

**Author:** If so, how could it pass …

**Author:** Or do you mean that it bypasses unknown data types? Then yes, like #887.
```diff
     if max_rows is not None and len(values) > max_rows:
         raise MaxRowsError('The number of rows in your dataset is greater '
                            'than the maximum allowed ({0}). '
```
```diff
@@ -85,18 +87,29 @@ def to_json(data, prefix='altair-data'):
     check_data_type(data)
     ext = '.json'
     filename = _compute_filename(prefix=prefix, ext=ext)
+    data_format = {'type': 'json'}
     if isinstance(data, pd.DataFrame):
         data = sanitize_dataframe(data)
-        data.to_json(filename, orient='records')
+        if not hasattr(data, '__geo_interface__'):
+            data.to_json(filename, orient='records')
+        else:  # GeoPandas
+            with open(filename, 'w') as f:
+                json.dump(data.__geo_interface__, f)
+            data_format['property'] = 'features'
+    elif hasattr(data, '__geo_interface__'):  # geojson object
+        with open(filename, 'w') as f:
+            json.dump(data.__geo_interface__, f)
```
**Reviewer:** I would put all the geo_interface logic in one block... e.g.

```python
if hasattr(data, '__geo_interface__'):
    # do geo stuff
elif isinstance(data, pd.DataFrame):
    # do dataframe stuff
elif isinstance(data, dict):
    # do dict stuff
```

**Author:** Well,

```python
if isinstance(data, pd.DataFrame) and hasattr(data, '__geo_interface__'):
    # do GeoPandas stuff
elif hasattr(data, '__geo_interface__'):
    # do geo_interface stuff
elif isinstance(data, pd.DataFrame):
    # do dataframe stuff
elif isinstance(data, dict):
    # do dict stuff
```

That is simple and clear. But I supposed it could provoke errors if the pandas block were modified without mirroring the change in the GeoPandas block. That is why the logic in the final version is so strange. Should I rewrite it back?

**Reviewer:** Ah, I see: there's overlap between geo and dataframe. I think it would be best to put all the geo logic under one if statement, so something like:

```python
if hasattr(data, '__geo_interface__'):
    if dataframe:
        # stuff
    else:
        # other stuff
elif isinstance(data, pd.DataFrame):
    # do dataframe stuff
elif isinstance(data, dict):
    # do dict stuff
```

**Author:** Done
```diff
     elif isinstance(data, dict):
         if 'values' not in data:
             raise KeyError('values expected in data dict, but not present.')
         values = data['values']
-        with open(filename) as f:
+        with open(filename, 'w') as f:
             json.dump(values, f)
     return {
         'url': filename,
-        'format': {'type': 'json'}
+        'format': data_format
     }
```
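Assembled from the hunks above, a hedged sketch of `to_json` with the geo logic grouped under one branch, as the reviewer suggests. `check_data_type`, `_compute_filename`, and `sanitize_dataframe` come from the surrounding module; this is a sketch of the suggested structure, not the final commit:

```python
import json
import pandas as pd

def to_json(data, prefix='altair-data'):
    """Write the data model to a .json file and return a url-based data model."""
    check_data_type(data)
    filename = _compute_filename(prefix=prefix, ext='.json')
    data_format = {'type': 'json'}
    if hasattr(data, '__geo_interface__'):
        if isinstance(data, pd.DataFrame):  # GeoDataFrame
            data = sanitize_dataframe(data)
            # split the FeatureCollection back into one row per feature
            data_format['property'] = 'features'
        with open(filename, 'w') as f:
            json.dump(data.__geo_interface__, f)
    elif isinstance(data, pd.DataFrame):
        data = sanitize_dataframe(data)
        data.to_json(filename, orient='records')
    elif isinstance(data, dict):
        if 'values' not in data:
            raise KeyError('values expected in data dict, but not present.')
        with open(filename, 'w') as f:
            json.dump(data['values'], f)
    return {'url': filename, 'format': data_format}
```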
```diff
@@ -107,12 +120,17 @@ def to_csv(data, prefix='altair-data'):
     ext = '.csv'
     filename = _compute_filename(prefix=prefix, ext=ext)
     if isinstance(data, pd.DataFrame):
+        if hasattr(data, '__geo_interface__'):  # GeoPandas
+            raise NotImplementedError('use to_json or to_values with GeoDataFrame objects.')
         data = sanitize_dataframe(data)
         data.to_csv(filename)
         return {
             'url': filename,
             'format': {'type': 'csv'}
         }
+    elif hasattr(data, '__geo_interface__'):  # GeoJSON
+        raise NotImplementedError('to_csv only works with Pandas DataFrame objects.')
```
**Reviewer:** Same comment here: please do the geo_interface logic all in one place, rather than checking it twice. Also, four spaces for indentation, please.

**Author:** Done
```diff
     elif isinstance(data, dict):
         raise NotImplementedError('to_csv only works with Pandas DataFrame objects.')
```
```diff
@@ -123,7 +141,19 @@ def to_values(data):
     check_data_type(data)
     if isinstance(data, pd.DataFrame):
         data = sanitize_dataframe(data)
+        if hasattr(data, '__geo_interface__'):  # GeoPandas
+            return {
+                'values': data.__geo_interface__,
+                'format': {'type': 'json', 'property': 'features'}
+            }
         return {'values': data.to_dict(orient='records')}
+    elif hasattr(data, '__geo_interface__'):  # GeoJSON
+        return {
+            'values': data.__geo_interface__,
+            'format': {'type': 'json'}
+        }
     elif isinstance(data, dict):
         if 'values' not in data:
             raise KeyError('values expected in data dict, but not present.')
```
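As a quick illustration of the two geo return shapes above, a hypothetical object exposing `__geo_interface__` passed through `to_values` (the class name is invented for the example):

```python
class _FakePoint:
    # hypothetical geo interface object; not a DataFrame
    __geo_interface__ = {'type': 'Point', 'coordinates': (0.0, 0.0)}

to_values(_FakePoint())
# -> {'values': {'type': 'Point', 'coordinates': (0.0, 0.0)},
#     'format': {'type': 'json'}}
# A GeoDataFrame instead takes the first branch and additionally gets
# 'property': 'features' in its format block.
```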
```diff
@@ -132,8 +162,8 @@ def to_values(data):

 def check_data_type(data):
     """Raise if the data is not a dict or DataFrame."""
-    if not isinstance(data, (dict, pd.DataFrame)):
-        raise TypeError('Expected dict or DataFrame, got: {}'.format(type(data)))
+    if not (isinstance(data, (dict, pd.DataFrame)) or hasattr(data, '__geo_interface__')):
+        raise TypeError('Expected dict, DataFrame, GeoDataFrame or geojson interface object, got: {}'.format(type(data)))


 # ==============================================================================
```
New test file (`@@ -0,0 +1,153 @@`):

```python
import pytest
import pandas as pd
import altair.vegalite.v2 as alt

from ..data import pipe, to_values, to_csv
from .. import parse_shorthand


def _create_geojson():
    return {
        "type": "FeatureCollection",
        "bbox": [
            -161.30174569731454,
            -60.39157788643298,
            172.67580002536624,
            42.438347020953984
        ],
        "features": [
            {
                "type": "Feature",
                "properties": {"prop": 1},
                "geometry": {
                    "type": "LineString",
                    "coordinates": [
                        [-69.2980008004234, 23.18780298146116],
                        [-161.30174569731454, -60.39157788643298],
                        [172.67580002536624, 24.151450472748962]
                    ]
                },
                "id": "0",
                "bbox": [
                    -161.30174569731454,
                    -60.39157788643298,
                    172.67580002536624,
                    24.151450472748962
                ]
            },
            {
                "type": "Feature",
                "properties": {"prop": 2},
                "geometry": {
                    "type": "LineString",
                    "coordinates": [
                        [156.03047546751765, 42.438347020953984],
                        [35.46296546950265, -18.185542212943375],
                        [152.53211600051463, 23.471406463455793]
                    ]
                },
                "id": "1",
                "bbox": [
                    35.46296546950265,
                    -18.185542212943375,
                    156.03047546751765,
                    42.438347020953984
                ]
            },
            {
                "type": "Feature",
                "properties": {"prop": 3},
                "geometry": {
                    "type": "LineString",
                    "coordinates": [
                        [-133.98414913936503, 25.39468871174894],
                        [145.04376601680605, 13.058626381790845],
                        [170.30576801294046, 38.67128737163435]
                    ]
                },
                "id": "2",
                "bbox": [
                    -133.98414913936503,
                    13.058626381790845,
                    170.30576801294046,
                    38.67128737163435
                ]
            }
        ]
    }


def _create_fake_geo_interface():
    class FakeGeoJSON:
        __geo_interface__ = _create_geojson()
    return FakeGeoJSON()


def _create_fake_geodataframe():
    class FakeGeoDataFrame(pd.DataFrame):
        __geo_interface__ = _create_geojson()

        def copy(self, deep=True):
            data = self._data
            if deep:
                data = data.copy()
            return FakeGeoDataFrame(data).__finalize__(self)

    return FakeGeoDataFrame({'prop': [1, 2, 3]})


def test_to_values_geo():
    """Test the to_values data transformer with geo data."""
    data = _create_fake_geodataframe()
    result = pipe(data, to_values)
    assert result['format'] == {'type': 'json', 'property': 'features'}
    assert result['values'] == data.__geo_interface__

    data = _create_fake_geo_interface()
    result = pipe(data, to_values)
    assert result['format'] == {'type': 'json'}
    assert result['values'] == data.__geo_interface__


def test_chart_data_geotypes():
    Chart = lambda data, **arg: alt.Chart(data).mark_geoshape().project().encode(**arg)

    # Fake GeoPandas
    data = _create_fake_geodataframe()
    dct = Chart(data, fill='prop').to_dict()
    assert dct['data']['format'] == {'type': 'json', 'property': 'features'}
    assert dct['data']['values'] == data.__geo_interface__

    # Fake GeoInterface
    data = _create_fake_geo_interface()
    dct = Chart(data).to_dict()
    assert dct['data']['format'] == {'type': 'json'}
    assert dct['data']['values'] == data.__geo_interface__


def test_parse_shorthand_with_geodata():
    def check(s, data, **kwargs):
        assert parse_shorthand(s, data) == kwargs

    data = _create_fake_geodataframe()

    check('prop', data, field='properties.prop', type='quantitative')
    check('prop:N', data, field='properties.prop', type='nominal')
    check('count(prop)', data, field='properties.prop', aggregate='count', type='quantitative')

    data = _create_fake_geo_interface()

    check('properties.prop:Q', data, field='properties.prop', type='quantitative')
    check('prop', data, field='prop')


def test_to_csv_geo():
    """Test that to_csv raises an error with geopandas."""
    data = _create_fake_geodataframe()
    with pytest.raises(NotImplementedError):
        pipe(data, to_csv)


def test_geo_pandas():
    gpd = pytest.importorskip('geopandas')

    data = gpd.GeoDataFrame.from_features(_create_geojson())
    dct = alt.Chart(data).mark_geoshape().project().encode(fill='prop').to_dict()

    assert dct['data']['format'] == {'type': 'json', 'property': 'features'}
    assert (gpd.GeoDataFrame.from_features(dct['data']['values']) == data).all().all()
    assert dct['encoding'] == {'fill': {'field': 'properties.prop', 'type': 'quantitative'}}
```
**Reviewer:** please remove this file

**Author:** done
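Putting the pieces together, a hypothetical end-to-end usage of this branch (assumes geopandas is installed and that its bundled naturalearth_lowres sample dataset is available):

```python
import geopandas as gpd
import altair.vegalite.v2 as alt

# Load a GeoDataFrame; each row will become a GeoJSON Feature in the spec.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# The shorthand 'pop_est' is expanded to 'properties.pop_est' by
# parse_shorthand, and the data is emitted with
# {'type': 'json', 'property': 'features'} as its format block.
chart = alt.Chart(world).mark_geoshape().project().encode(fill='pop_est')
spec = chart.to_dict()
```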