Thank you, sir, for the clarification.
On Sat, Jun 19, 2021 at 7:24 PM Konrad Hinsen email@example.com wrote:
I am working on implementing the DataFrame>>dtypes feature which
checks the datatypes of columns in a DataFrame, as part of my GSOC
project. I have tried to explain my theoretical work so far in
this blog post. Please kindly go through it, as I need advice on what
could be the optimal way to implement this feature. Any kind of input
and discussion is most welcome.
Your post looks like an overall accurate description of the current
state of everything, with one exception: Pandas. You say
you didn't look at the Pandas code yet, so that's not surprising.
You seem to assume that Pandas stores Python objects as elements of
DataFrames, but that isn't true. Pandas uses NumPy arrays instead. And
NumPy arrays are very different from standard Python objects, because
their internal data layout is by design the same as used in C or
Fortran. For a full description, see
However, I am not sure you need to understand this in full detail, as I
am pretty sure that you do not want to copy this approach in Pharo.
The one point that does matter for you is where NumPy and Pandas take a
column's dtype from. The answer is that it's defined when a DataFrame is
created, and it cannot be changed afterwards. If a column is "integer",
it will remain "integer" forever. If you try to assign a string to an
element of such a column, you get an error message. When you create a
DataFrame from existing data, e.g. by reading a CSV file, Pandas scans
the data and determines a suitable dtype, much in the same way as V1.0
in Pharo/PolyMath did. But since Pandas doesn't allow any later change,
there is no serious performance issue.
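To make this concrete, here is a small illustration of both points, assuming a standard NumPy/Pandas installation (the column names and values are made up for the example):

```python
import io

import numpy as np
import pandas as pd

# A NumPy array's dtype is fixed when the array is created.
a = np.array([1, 2, 3])
print(a.dtype)  # an integer dtype, e.g. int64 on most platforms

# Assigning a value that cannot be converted to that dtype fails.
try:
    a[0] = "hello"
except ValueError as e:
    print("rejected:", e)

# Pandas scans the data once, at creation time, to pick each
# column's dtype - here from an in-memory CSV.
df = pd.read_csv(io.StringIO("x,y\n1,a\n2,b\n"))
print(df.dtypes)  # x: int64, y: object
```

Because the scan happens only once, at creation, the cost of dtype inference is paid a single time rather than on every element access.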
So that's an option you can add to your list: define the dtypes once and
for all when the DataFrame is created. The main drawback is that you
would have to change the API for DataFrame creation to make this work.
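For comparison, Pandas exposes exactly this kind of creation-time dtype declaration in its own API, e.g. the `dtype` argument of `read_csv`; a sketch (column names and values are invented for the example):

```python
import io

import pandas as pd

# dtypes declared once, at DataFrame creation, instead of inferred:
csv = io.StringIO("id,score\n1,10\n2,20\n")
df = pd.read_csv(csv, dtype={"id": "int64", "score": "float64"})
print(df.dtypes)  # id: int64, score: float64
```

An analogous Pharo API could accept a column-name-to-type mapping as part of the DataFrame creation message, which is the API change referred to above.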