Integration with geopandas #588 #818

Closed (wants to merge 24 commits)

Changes from 11 commits

Commits (24)
d317c98  Merge remote-tracking branch 'altair-viz/master' (iliatimofeev, Mar 27, 2018)
e297f20  Merge remote-tracking branch 'altair-viz/master' (iliatimofeev, Mar 27, 2018)
ea65ca1  Merge remote-tracking branch 'altair-viz/master' (iliatimofeev, May 4, 2018)
f7d66b7  __geo_interface__ in to_geojson_values (iliatimofeev, May 5, 2018)
e9d28a5  GeoDataFrame support without dependency of GeoPandas (iliatimofeev, May 5, 2018)
985d3f6  Merge remote-tracking branch 'altair-viz/master' into it-#588-geopandas (iliatimofeev, May 5, 2018)
b627c6c  Unused test file (iliatimofeev, May 6, 2018)
2cd9fde  Full __geo_interface__ support, test_geojson (iliatimofeev, May 7, 2018)
7db9ff8  test update (iliatimofeev, May 7, 2018)
950eb72  Mistakenly added .vscode files removed (iliatimofeev, May 8, 2018)
9f91c00  limit_rows two returns, to_* one if "geo" statement, four spaces for… (iliatimofeev, May 8, 2018)
76a2af8  geojson_feature() (iliatimofeev, May 9, 2018)
094a2e7  test_geopandas_examples (hacker version) (iliatimofeev, May 13, 2018)
df84294  travis-ci: move finalized locals outside try (iliatimofeev, May 13, 2018)
27a3df8  remove python 3 code (iliatimofeev, May 13, 2018)
301ea23  flat version (iliatimofeev, May 16, 2018)
c114acc  flat version (iliatimofeev, May 16, 2018)
143ad04  Merge remote-tracking branch 'altair-viz/master' into it-#588-geopandas (iliatimofeev, May 16, 2018)
661447d  Merge remote-tracking branch 'altair-viz/master' into it-#588-geopandas (iliatimofeev, May 16, 2018)
d2b46e0  GeoPandas ref (iliatimofeev, May 17, 2018)
649fa21  Merge remote-tracking branch 'altair-viz/master' into it-#588-geopandas (iliatimofeev, Jun 10, 2018)
89a999e  Merge remote-tracking branch 'altair-viz/master' into it-#588-geopandas (iliatimofeev, Jun 10, 2018)
80c56d6  merge (iliatimofeev, Jun 10, 2018)
1bb8192  flake8 fix (iliatimofeev, Jun 10, 2018)
1 change: 1 addition & 0 deletions .gitignore
@@ -62,6 +62,7 @@ target/

 .ipynb_checkpoints
 .idea/*
+.vscode/*
 tools/_build
 Untitled*.ipynb
 .mypy*
8 changes: 6 additions & 2 deletions altair/utils/core.py
@@ -227,9 +227,13 @@ def parse_shorthand(shorthand, data=None):
     {'aggregate': 'count', 'type': 'quantitative'}
     """
     attrs = _parse_shorthand(shorthand)
-    if isinstance(data, pd.DataFrame) and 'type' not in attrs:
+    if isinstance(data, pd.DataFrame):
         if 'field' in attrs and attrs['field'] in data.columns:
-            attrs['type'] = infer_vegalite_type(data[attrs['field']])
+                if 'type' not in attrs:
+                    attrs['type'] = infer_vegalite_type(data[attrs['field']])
+                if hasattr(data,'__geo_interface__'):
+                    attrs['field']='properties.'+attrs['field']
+
Collaborator (@jakevdp):
Something strange is going on here: is dropping the 'type' not in attrs check from this line necessary for this PR? And why is this block doubly indented?

Contributor Author (@iliatimofeev, May 8, 2018):
'type' not in attrs is still checked, but on the line below (232).
The idea behind this: let the user work with a GeoDataFrame just like a regular DataFrame, while the geoshapes attached to its rows are shown.
Implementation details:
First of all, GeoDataFrame is a subclass of pd.DataFrame that carries __geo_interface__, so it is an instance of pd.DataFrame that also has the __geo_interface__ attribute. A GeoDataFrame is serialized as a GeoJSON FeatureCollection, with each row a Feature object whose column values are all placed in the "properties" object. A sample is more informative:

{ /* Vega-Lite data */
    "format": {
        "property": "features", /* generate a geoshape for each row */
        "type": "json"
    },
    "values": { /* valid GeoJSON for all rows */
        "type": "FeatureCollection",
        "features": [
            { /* valid GeoJSON for each row */
                "type": "Feature",
                "properties": { /* column values */
                    "pop_est": 12799293.0,
                    "continent": "Africa"
                },
                "geometry": { /* geometry of the row */
                    "type": "MultiPolygon",
                    "coordinates": [ /* a lot of numbers */ ]
                }
            }
        ]
    }
}

So the first step is to add "property": "features" to the Vega-Lite data format description. That splits the valid GeoJSON stored in "values" back into the rows of the GeoDataFrame. (It would be possible to replace this step by storing the content of "features" directly in "values", but that would make "values" invalid GeoJSON.)

Next is access to the column values of the GeoDataFrame. Values are accessible from the chart as "properties.column_name". I hoped to simplify the user experience by adding the "properties." prefix in the shorthand parser. Now, if users use shorthands, they get the same behavior for a GeoDataFrame as for a DataFrame (take a look at the updated description of the PR); see the usage sketch below.
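For illustration, the user-facing behavior then looks like this (a sketch, not code from this PR; the 'naturalearth_lowres' sample dataset shipped with GeoPandas is an assumption of the example):

import altair as alt
import geopandas as gpd

# Load GeoPandas' bundled world dataset (assumed available for this sketch).
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# The user writes the bare column name 'pop_est'; with this PR the shorthand
# parser expands it to 'properties.pop_est', which is where GeoDataFrame
# columns land in the exported GeoJSON.
chart = alt.Chart(world).mark_geoshape().project().encode(fill='pop_est')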

Maybe I should check whether the field name already starts with "properties." to avoid doubling it?
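A minimal sketch of such a guard, written against this PR's parse_shorthand logic (illustrative only, not part of the diff):

# Inside parse_shorthand, once attrs['field'] has been resolved:
if hasattr(data, '__geo_interface__'):
    # Hypothetical guard: only add the prefix if the user has not
    # already written the field as 'properties.<column>' themselves.
    if not attrs['field'].startswith('properties.'):
        attrs['field'] = 'properties.' + attrs['field']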

Collaborator (@jakevdp):
OK, I see.

We can think about that. In the meantime can you fix the indentation? 😄

Contributor Author (@iliatimofeev):
Indentation fixed.

Contributor Author (@iliatimofeev):
Not needed anymore due to the new save format.

     return attrs
51 changes: 38 additions & 13 deletions altair/utils/data.py
@@ -33,7 +33,7 @@ class DataTransformerRegistry(PluginRegistry[DataTransformerType]):
 # form.
 #
 # A data model transformer has the following type signature:
-# DataModelType = Union[dict, pd.DataFrame]
+# DataModelType = Union[dict, pd.DataFrame, gpd.GeoDataFrame, geojson.GeoJSON]
 # DataModelTransformerType = Callable[[DataModelType, KwArgs], DataModelType]
 # ==============================================================================
@@ -52,11 +52,10 @@ def limit_rows(data, max_rows=5000):
     check_data_type(data)
     if isinstance(data, pd.DataFrame):
         values = data
-    elif isinstance(data, dict):
-        if 'values' in data:
-            values = data['values']
-        else:
-            return data
+    elif isinstance(data, dict) and ('values' in data):
+        values = data['values']
+    else:
+        return data
Collaborator (@jakevdp):
Now that there are three return statements in this code, the logic is pretty opaque (it took me a bit to read this and figure out what it was doing).

I think the function should be refactored for clarity.

Contributor Author (@iliatimofeev):
Done.

Collaborator (@jakevdp):
As currently written, the function will never progress beyond this line, and the max_rows check will never happen.

Contributor Author (@iliatimofeev, May 17, 2018):
If so, how could it pass test_limit_rows()?

Contributor Author (@iliatimofeev):
Or do you mean that it bypasses unknown data types? Then yes, as in #887.
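To make the bypass concrete, here is the control flow of limit_rows reconstructed from the diff above (an annotated sketch; the trailing return is assumed):

def limit_rows(data, max_rows=5000):
    """Raise MaxRowsError if the data model has more rows than max_rows."""
    check_data_type(data)                # geo-interface objects pass this check
    if isinstance(data, pd.DataFrame):
        values = data                    # includes GeoDataFrame, a DataFrame subclass
    elif isinstance(data, dict) and ('values' in data):
        values = data['values']
    else:
        return data                      # plain __geo_interface__ objects exit here,
                                         # so the max_rows check below never runs for them
    if max_rows is not None and len(values) > max_rows:
        raise MaxRowsError('The number of rows in your dataset is greater '
                           'than the maximum allowed ({0}).'.format(max_rows))
    return data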

     if max_rows is not None and len(values) > max_rows:
         raise MaxRowsError('The number of rows in your dataset is greater '
                            'than the maximum allowed ({0}). '
@@ -85,18 +84,29 @@ def to_json(data, prefix='altair-data'):
     check_data_type(data)
     ext = '.json'
     filename = _compute_filename(prefix=prefix, ext=ext)
-    if isinstance(data, pd.DataFrame):
+    data_format = {'type': 'json'}
+
+    if hasattr(data,'__geo_interface__'):
+        if isinstance(data, pd.DataFrame): #GeoPandas
+            data = sanitize_dataframe(data)
+            data_format['property']='features'
+
+        with open(filename,'w') as f:
+            json.dump(data.__geo_interface__, f)
Collaborator (@jakevdp):
I would put all the geo_interface logic in one block... e.g.

if hasattr(data, '__geo_interface__'):
    # do geo stuff
elif isinstance(data, pd.DataFrame):
    # do dataframe stuff
elif isinstance(data, dict):
    # do dict stuff

Contributor Author (@iliatimofeev):
Well, GeoPandas needs an extra line of code compared to a plain __geo_interface__ object. I had something like:

if isinstance(data, pd.DataFrame) and hasattr(data, '__geo_interface__'):
    # do GeoPandas stuff
elif hasattr(data, '__geo_interface__'):
    # do geo_interface stuff
elif isinstance(data, pd.DataFrame):
    # do dataframe stuff
elif isinstance(data, dict):
    # do dict stuff

That is simple and clear. But I supposed it could provoke errors if someone modified the Pandas block without mirroring the change in the GeoPandas block. That is why the logic in the final version looks so strange.

Should I rewrite it back?

Collaborator (@jakevdp, May 8, 2018):
Ah, I see – there's overlap between geo and dataframe.

I think it would be best to put all the geo logic under one if statement, so something like:

if hasattr(data, '__geo_interface__'):
    if dataframe:
        # stuff
    else:
        # other stuff
elif isinstance(data, pd.DataFrame):
    # do dataframe stuff
elif isinstance(data, dict):
    # do dict stuff

Contributor Author (@iliatimofeev):
Done


+    elif isinstance(data, pd.DataFrame):
         data = sanitize_dataframe(data)
         data.to_json(filename, orient='records')
+
     elif isinstance(data, dict):
         if 'values' not in data:
             raise KeyError('values expected in data dict, but not present.')
         values = data['values']
-        with open(filename) as f:
+        with open(filename,'w') as f:
             json.dump(values, f)
     return {
         'url': filename,
-        'format': {'type': 'json'}
+        'format': data_format
     }


@@ -106,7 +116,10 @@ def to_csv(data, prefix='altair-data'):
     check_data_type(data)
     ext = '.csv'
     filename = _compute_filename(prefix=prefix, ext=ext)
-    if isinstance(data, pd.DataFrame):
+    if hasattr(data,'__geo_interface__'):
+        raise NotImplementedError('use to_json or to_values with GeoJSON objects.')
+
+    elif isinstance(data, pd.DataFrame):
         data = sanitize_dataframe(data)
         data.to_csv(filename)
     return {
@@ -121,9 +134,21 @@ def to_csv(data, prefix='altair-data'):

 def to_values(data):
     """Replace a DataFrame by a data model with values."""
     check_data_type(data)
-    if isinstance(data, pd.DataFrame):
+
+    if hasattr(data,'__geo_interface__'):
+        data_format = {'type': 'json'}
+        if isinstance(data, pd.DataFrame): #GeoPandas
+            data = sanitize_dataframe(data)
+            data_format['property']='features'
+        return {
+            'values': data.__geo_interface__,
+            'format': data_format
+        }
+
+    elif isinstance(data, pd.DataFrame):
         data = sanitize_dataframe(data)
         return {'values': data.to_dict(orient='records')}
+
     elif isinstance(data, dict):
         if 'values' not in data:
             raise KeyError('values expected in data dict, but not present.')

@@ -132,8 +157,8 @@ def to_values(data):

 def check_data_type(data):
     """Raise if the data is not a dict or DataFrame."""
-    if not isinstance(data, (dict, pd.DataFrame)):
-        raise TypeError('Expected dict or DataFrame, got: {}'.format(type(data)))
+    if not (isinstance(data, (dict, pd.DataFrame)) or hasattr(data,'__geo_interface__')):
+        raise TypeError('Expected dict, DataFrame, GeoDataFrame or geojson, got: {}'.format(type(data)))


 # ==============================================================================
2 changes: 1 addition & 1 deletion altair/utils/tests/test_data.py
@@ -59,7 +59,7 @@ def test_to_values():


 def test_type_error():
-    """Ensure that TypeError is raised for types other than dict/DataFrame."""
+    """Ensure that TypeError is raised for types other than dict/DataFrame/GeoDataFrame/__geo_interface__."""
     for f in (sample, limit_rows, to_values):
         with pytest.raises(TypeError):
             pipe(0, f)
153 changes: 153 additions & 0 deletions altair/utils/tests/test_geojson.py
@@ -0,0 +1,153 @@
import pytest
import pandas as pd
import altair.vegalite.v2 as alt

from ..data import pipe, to_values, to_csv
from .. import parse_shorthand


def _create_geojson():
    return {
        "type": "FeatureCollection",
        "bbox": [
            -161.30174569731454,
            -60.39157788643298,
            172.67580002536624,
            42.438347020953984
        ],
        "features": [
            {
                "type": "Feature",
                "properties": {"prop": 1},
                "geometry": {
                    "type": "LineString",
                    "coordinates": [
                        [-69.2980008004234, 23.18780298146116],
                        [-161.30174569731454, -60.39157788643298],
                        [172.67580002536624, 24.151450472748962]
                    ]
                },
                "id": "0",
                "bbox": [
                    -161.30174569731454,
                    -60.39157788643298,
                    172.67580002536624,
                    24.151450472748962
                ]
            },
            {
                "type": "Feature",
                "properties": {"prop": 2},
                "geometry": {
                    "type": "LineString",
                    "coordinates": [
                        [156.03047546751765, 42.438347020953984],
                        [35.46296546950265, -18.185542212943375],
                        [152.53211600051463, 23.471406463455793]
                    ]
                },
                "id": "1",
                "bbox": [
                    35.46296546950265,
                    -18.185542212943375,
                    156.03047546751765,
                    42.438347020953984
                ]
            },
            {
                "type": "Feature",
                "properties": {"prop": 3},
                "geometry": {
                    "type": "LineString",
                    "coordinates": [
                        [-133.98414913936503, 25.39468871174894],
                        [145.04376601680605, 13.058626381790845],
                        [170.30576801294046, 38.67128737163435]
                    ]
                },
                "id": "2",
                "bbox": [
                    -133.98414913936503,
                    13.058626381790845,
                    170.30576801294046,
                    38.67128737163435
                ]
            }
        ]
    }


def _create_fake_geo_interface():
    class FakeGeoJSON:
        __geo_interface__ = _create_geojson()
    return FakeGeoJSON()


def _create_fake_geodataframe():
    class FakeGeoDataFrame(pd.DataFrame):
        __geo_interface__ = _create_geojson()

        def copy(self, deep=True):
            data = self._data
            if deep:
                data = data.copy()
            return FakeGeoDataFrame(data).__finalize__(self)

    return FakeGeoDataFrame({'prop': [1, 2, 3]})


def test_to_values_geo():
    """Test the to_values data transformer."""
    data = _create_fake_geodataframe()
    result = pipe(data, to_values)
    assert result['format'] == {'type': 'json', 'property': 'features'}
    assert result['values'] == data.__geo_interface__

    data = _create_fake_geo_interface()
    result = pipe(data, to_values)
    assert result['format'] == {'type': 'json'}
    assert result['values'] == data.__geo_interface__


def test_chart_data_geotypes():
    Chart = lambda data, **arg: alt.Chart(data).mark_geoshape().project().encode(**arg)

    # Fake GeoPandas
    data = _create_fake_geodataframe()
    dct = Chart(data, fill='prop').to_dict()
    assert dct['data']['format'] == {'type': 'json', 'property': 'features'}
    assert dct['data']['values'] == data.__geo_interface__

    # Fake GeoInterface
    data = _create_fake_geo_interface()
    dct = Chart(data).to_dict()
    assert dct['data']['format'] == {'type': 'json'}
    assert dct['data']['values'] == data.__geo_interface__


def test_parse_shorthand_with_geodata():
    def check(s, data, **kwargs):
        assert parse_shorthand(s, data) == kwargs

    data = _create_fake_geodataframe()

    check('prop', data, field='properties.prop', type='quantitative')
    check('prop:N', data, field='properties.prop', type='nominal')
    check('count(prop)', data, field='properties.prop', aggregate='count', type='quantitative')

    data = _create_fake_geo_interface()

    check('properties.prop:Q', data, field='properties.prop', type='quantitative')
    check('prop', data, field='prop')


def test_to_csv_geo():
    """Test that to_csv raises an error with geopandas."""
    data = _create_fake_geodataframe()
    with pytest.raises(NotImplementedError):
        pipe(data, to_csv)


def test_geo_pandas():
    gpd = pytest.importorskip('geopandas')

    data = gpd.GeoDataFrame.from_features(_create_geojson())
    dct = alt.Chart(data).mark_geoshape().project().encode(fill='prop').to_dict()

    assert dct['data']['format'] == {'type': 'json', 'property': 'features'}
    assert (gpd.GeoDataFrame.from_features(dct['data']['values']) == data).all().all()
    assert dct['encoding'] == {'fill': {'field': 'properties.prop', 'type': 'quantitative'}}
4 changes: 2 additions & 2 deletions altair/vegalite/v2/api.py
@@ -19,11 +19,11 @@ def _prepare_data(data):
     """Convert input data to data for use within schema"""
     if data is Undefined:
         return data
+    elif isinstance(data, pd.DataFrame) or hasattr(data, '__geo_interface__'):
+        return pipe(data, data_transformers.get())
     elif isinstance(data, (dict, core.Data, core.InlineData,
                            core.UrlData, core.NamedData)):
         return data
-    elif isinstance(data, pd.DataFrame):
-        return pipe(data, data_transformers.get())
     elif isinstance(data, six.string_types):
         return core.UrlData(data)
     else: