[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908
@@ -113,18 +113,22 @@ def array(

     Currently, pandas will infer an extension dtype for sequences of

-    ============================== =====================================
+    ============================== =======================================
     Scalar Type                    Array Type
-    ============================== =====================================
+    ============================== =======================================
     :class:`pandas.Interval`       :class:`pandas.arrays.IntervalArray`
     :class:`pandas.Period`         :class:`pandas.arrays.PeriodArray`
     :class:`datetime.datetime`     :class:`pandas.arrays.DatetimeArray`
     :class:`datetime.timedelta`    :class:`pandas.arrays.TimedeltaArray`
     :class:`int`                   :class:`pandas.arrays.IntegerArray`
     :class:`float`                 :class:`pandas.arrays.FloatingArray`
-    :class:`str`                   :class:`pandas.arrays.StringArray`
+    :class:`str`                   :class:`pandas.arrays.StringArray` or
+                                   :class:`pandas.arrays.ArrowStringArray`

maybe should merge #40962 first (or cut down alternative) bit of chicken and egg, before this is merged naming can change, and vice-versa, I guess having both in the same PR is the only good solution even if more to review.

I think we are all agreed with the name ArrowStringArray. so perhaps nothing stopping us making it public now.

done in 4a37470

this is an implementation detail. hopefully we can revert

     :class:`bool`                  :class:`pandas.arrays.BooleanArray`
-    ============================== =====================================
+    ============================== =======================================

+    The ExtensionArray created when the scalar type is :class:`str` is determined by
+    pd.options.mode.string_storage if the dtype is not explicitly given.
simonjayhawkins marked this conversation as resolved.
+
     For all other cases, NumPy's usual inference rules will be used.

@@ -240,6 +244,14 @@ def array(
     ['a', <NA>, 'c']
     Length: 3, dtype: string[python]

+    >>> with pd.option_context("string_storage", "pyarrow"):
+    ...     arr = pd.array(["a", None, "c"])
+    ...
+    >>> arr
+    <ArrowStringArray>
+    ['a', <NA>, 'c']
+    Length: 3, dtype: string[pyarrow]
+
     >>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")])
     <PeriodArray>
     ['2000-01-01', '2000-01-01']
@@ -292,10 +304,10 @@ def array(
         IntegerArray,
         IntervalArray,
         PandasArray,
-        StringArray,
         TimedeltaArray,
         period_array,
     )
+    from pandas.core.arrays.string_ import StringDtype

     if lib.is_scalar(data):
         msg = f"Cannot pass scalar '{data}' to 'pandas.array'."
@@ -345,7 +357,8 @@ def array(
             return TimedeltaArray._from_sequence(data, copy=copy)

         elif inferred_dtype == "string":
-            return StringArray._from_sequence(data, copy=copy)
+            # StringArray/ArrowStringArray depending on pd.options.mode.string_storage
+            return StringDtype().construct_array_type()._from_sequence(data, copy=copy)

         elif inferred_dtype == "integer":
             return IntegerArray._from_sequence(data, copy=copy)
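
The dispatch introduced in the last hunk can be pictured with a small stand-alone model. This is only a sketch, not pandas code: `OPTIONS`, `option_context`, `ToyStringDtype`, `ToyStringArray`, `ToyArrowStringArray`, and `toy_array` are all hypothetical stand-ins, and it assumes the storage option is read at the moment a dtype is constructed without an explicit `storage`.

```python
# Toy model of storage-parameterized dispatch. None of these names are pandas APIs.
from contextlib import contextmanager

OPTIONS = {"string_storage": "python"}  # stand-in for pd.options.mode.string_storage


@contextmanager
def option_context(key, value):
    """Temporarily override one entry in OPTIONS, restoring it on exit."""
    old = OPTIONS[key]
    OPTIONS[key] = value
    try:
        yield
    finally:
        OPTIONS[key] = old


class _ToyBaseArray:
    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_sequence(cls, data):
        return cls(data)


class ToyStringArray(_ToyBaseArray):
    storage = "python"


class ToyArrowStringArray(_ToyBaseArray):
    storage = "pyarrow"


class ToyStringDtype:
    def __init__(self, storage=None):
        # Assumption: with no explicit storage, fall back to the global option.
        if storage is None:
            storage = OPTIONS["string_storage"]
        if storage not in {"python", "pyarrow"}:
            raise ValueError(f"unknown storage: {storage!r}")
        self.storage = storage

    def construct_array_type(self):
        # The dtype, not the caller, decides which array class backs the data.
        return ToyArrowStringArray if self.storage == "pyarrow" else ToyStringArray


def toy_array(data):
    """Mirror of the 'string' branch: let the dtype pick the array class."""
    return ToyStringDtype().construct_array_type()._from_sequence(data)


default_arr = toy_array(["a", None, "c"])
with option_context("string_storage", "pyarrow"):
    arrow_arr = toy_array(["a", None, "c"])
```

In this model the caller never names an array class; changing the option is enough to change the result, which is the behaviour the new docstring example describes.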
can we add some introspection functions
is_string_python_dtype
is_string_arrow_dtype
is_string_dtype
to encompass operations
a follow-on is ok if you can create an issue
We can discuss this once we have merged what we have, in case things change. This conditional is slightly different from what would be in the introspection functions: it allows "string" whatever the global storage, whereas StringDtype() without the storage given uses the global option.
I'm probably overthinking this, or have completely misunderstood @jorisvandenbossche's suggestion that "StringDtype()" should defer the lookup.
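
To make the distinction in this comment concrete, here is a minimal sketch using a toy dtype. Everything in it is hypothetical: `ToyStringDtype` and `OPTIONS` are stand-ins, and the three helper functions only borrow the names proposed above (they are defined here against the toy class and are unrelated to any existing pandas function). It assumes that the bare alias "string" matches any storage, while a dtype built without a storage argument takes the global option.

```python
# Toy model only: not pandas code, and the helpers are illustrative definitions.
OPTIONS = {"string_storage": "python"}  # stand-in for the global storage option


class ToyStringDtype:
    def __init__(self, storage=None):
        # Assumption: no explicit storage means "use the global option".
        self.storage = OPTIONS["string_storage"] if storage is None else storage

    def __eq__(self, other):
        if isinstance(other, str):
            # The bare alias "string" matches whatever the storage is.
            return other in ("string", f"string[{self.storage}]")
        if isinstance(other, ToyStringDtype):
            return self.storage == other.storage
        return NotImplemented

    def __hash__(self):
        return hash(("string", self.storage))


def is_string_dtype(dtype):
    """True for any toy string dtype, regardless of storage."""
    return isinstance(dtype, ToyStringDtype)


def is_string_python_dtype(dtype):
    return is_string_dtype(dtype) and dtype.storage == "python"


def is_string_arrow_dtype(dtype):
    return is_string_dtype(dtype) and dtype.storage == "pyarrow"


arrow = ToyStringDtype("pyarrow")
matches_alias = arrow == "string"            # alias ignores storage
matches_default = arrow == ToyStringDtype()  # default follows the global option
```

With the global option left at "python", the alias comparison succeeds and the comparison against a default-constructed dtype does not, which is why a check written against "string" is not interchangeable with one written against `StringDtype()`.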