pharo-users@lists.pharo.org

Any question about pharo is welcome

Re: Implementing DataFrame>>dtypes feature in Pharo PolyMath Project

Balaji G
Sat, Jun 19, 2021 2:38 PM

Thank you sir for the clarification.

On Sat, Jun 19, 2021 at 7:24 PM Konrad Hinsen <konrad.hinsen@cnrs.fr> wrote:

Dear Balaji,

> I am working on implementing the DataFrame>>dtypes feature, which
> checks the datatypes of columns in a DataFrame, as part of my GSoC
> project. I have tried to explain my theoretical work done so far on
> this blog post. Please kindly go through it, as I need advice on what
> could be the optimal way to implement this feature. Any kind of input
> and discussion is most welcome.

Your post looks like an overall accurate description of the current
state of everything - with one exception, and that is Pandas. You say
you didn't look at the Pandas code yet, so that's not surprising.

You seem to assume that Pandas stores Python objects as elements of
DataFrames, but that isn't true. Pandas uses NumPy arrays instead. And
NumPy arrays are very different from standard Python objects, because
their internal data layout is by design the same as used in C or
Fortran. For a full description, see

https://numpy.org/doc/stable/user/basics.rec.html

However, I am not sure you need to understand this in all detail, as I
am pretty sure that you do not want to copy this approach in Pharo.

The one point that does matter for you is where NumPy and Pandas take a
column's dtype from. The answer is that it's defined when a DataFrame is
created, and the underlying NumPy array keeps that dtype for its whole
lifetime. If a column is "integer", assigning to its elements will not
turn the array into anything else; getting a different dtype means
building a new array through an explicit conversion. If you try to
assign a non-numeric string to an element of an integer NumPy array,
you get an error message (Pandas itself, depending on the version, may
instead convert the whole column to the generic "object" dtype). When
you create a DataFrame from existing data, e.g. by reading a CSV file,
Pandas scans the data and determines a suitable dtype, much in the same
way as V1.0 in Pharo/PolyMath did. But since this scan happens only
once, at creation time, there is no serious performance issue.
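
To make the "scan once at creation" idea concrete, here is a minimal
sketch in plain Python. It is not Pandas code and does not claim to
mirror how Pandas infers types; the function name infer_dtype and the
three dtype labels are assumptions made up for this illustration.

```python
def infer_dtype(raw_values):
    """Return 'integer', 'float' or 'string' for a column of text cells.

    Hypothetical helper: tries the narrowest type first and falls back
    to a wider one as soon as a cell fails to parse.
    """
    candidates = [("integer", int), ("float", float)]
    for name, parse in candidates:
        try:
            for cell in raw_values:
                parse(cell)
        except ValueError:
            continue  # at least one cell does not fit, try the next type
        return name
    return "string"


# Cells as they would arrive from a CSV file: everything is text.
columns = {
    "age": ["31", "45", "27"],
    "height": ["1.80", "1.65", "1.72"],
    "name": ["Ada", "Grace", "Alan"],
}

# The scan runs once per column, when the table is built.
dtypes = {label: infer_dtype(cells) for label, cells in columns.items()}
print(dtypes)
```

The cost of the scan is paid a single time; afterwards the stored
result can be returned without looking at the data again.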

So that's an option you can add to your list: define the dtypes once and
for all when the DataFrame is created. The main drawback is that you
would have to change the API for DataFrame creation to make this work.
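
As a rough sketch of what this option could look like, here is a tiny
column store in plain Python whose dtypes are fixed by the constructor.
Everything in it (the class name FixedDtypeFrame, the dtypes argument,
the set and get methods) is hypothetical and only meant to show the
shape of the idea, not an actual Pharo or Pandas API.

```python
class FixedDtypeFrame:
    """Hypothetical column store whose dtypes are declared at creation."""

    def __init__(self, columns, dtypes):
        # dtypes maps column name -> Python type, e.g. {"age": int}.
        # Every initial value is validated once, here.
        for name, values in columns.items():
            for value in values:
                self._check(name, value, dtypes[name])
        self._columns = {name: list(values) for name, values in columns.items()}
        self._dtypes = dict(dtypes)

    @staticmethod
    def _check(name, value, expected):
        # Exact type match, so that e.g. True is not accepted as an int.
        if type(value) is not expected:
            raise TypeError(
                f"column {name!r} holds {expected.__name__}, "
                f"got {type(value).__name__}"
            )

    @property
    def dtypes(self):
        # No scan of the data: the answer was recorded at creation.
        return dict(self._dtypes)

    def get(self, name, row):
        return self._columns[name][row]

    def set(self, name, row, value):
        # Assignments that would break the declared dtype are rejected.
        self._check(name, value, self._dtypes[name])
        self._columns[name][row] = value


frame = FixedDtypeFrame(
    {"age": [31, 45], "name": ["Ada", "Grace"]},
    dtypes={"age": int, "name": str},
)
frame.set("age", 0, 32)          # fine: still an int
try:
    frame.set("age", 1, "n/a")   # rejected: would change the dtype
except TypeError as error:
    print(error)
print(frame.dtypes)
```

The extra dtypes argument in the constructor is exactly the kind of API
change mentioned above: callers have to state (or let the constructor
infer) the column types up front.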

Cheers,
Konrad


Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: konrad DOT hinsen AT cnrs DOT fr
http://dirac.cnrs-orleans.fr/~hinsen/
ORCID: https://orcid.org/0000-0003-0330-9428
Twitter: @khinsen
