Dangerous DBA A blog for those DBAs who live on the edge

Position Tracker – The Stub – Pandas:

May 18, 2020 11:45 am / Leave a Comment / dangerousDBA

As noted in my last post, the file-based stub was horrendously slow: it took over 30 minutes to generate 10 mobile numbers with just 100 points each.

As part of my continuing education with Python, I decided this could be sped up considerably by doing the work in memory, as the dataset is not that large, so I started to look at Pandas. I am not going to go into what Pandas does here, as that has been done to death all over the internet, but needless to say it worked tremendously well.

What have I changed:

In the GitHub repo for this little project you will now find two implementations: the first, create-send-data-files.py, and the second, create-send-data-pandas.py. These respectively use similarly named modules sourced from the generate-data folder.

Reading the file – It's a one-liner:

In the file-based approach, reading the postcode file in was a call to a function:

read_input_file_to_dict(postcodes_file, processed_postcodes,0)

But this actually called off to a multi-line function that did some string manipulation to maintain the types. Now:

postcodes_df = pd.read_csv(postcodes_file)

Just the one line above reads the file and loads it into a Pandas DataFrame, with the correct types (checked using <DataFrame name>.dtypes):

id int64
postcode object
latitude float64
longitude float64

And as an aside, you can get some statistics on the data (using <DataFrame name>.describe()):

             id      latitude     longitude
count 1.766510e+06 1.766428e+06 1.766428e+06
mean 9.734494e+05 5.309039e+01 -1.725632e+00
std 6.091245e+05 4.247189e+00 1.657252e+00
min 1.000000e+00 4.918194e+01 -8.163139e+00
25% 4.738062e+05 5.150653e+01 -2.725936e+00
50% 9.440715e+05 5.247490e+01 -1.585693e+00
75% 1.413107e+06 5.368498e+01 -3.608118e-01
max 2.660538e+06 1.000000e+02 1.760443e+00

Something that is not easily possible using just vanilla Python.
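The two calls above can be sketched together. The snippet below stands in for the real postcode file with a tiny in-memory CSV (the column names id, postcode, latitude and longitude match the dtypes output above; the rows themselves are made up for illustration):

```python
import io

import pandas as pd

# A tiny stand-in for the real postcode CSV, with the same four columns
csv_data = io.StringIO(
    "id,postcode,latitude,longitude\n"
    "1,AB1 0AA,57.101,-2.242\n"
    "2,AB1 0AB,57.102,-2.246\n"
    "3,YO1 7HH,53.958,-1.080\n"
)

# One line reads the file and infers the types
postcodes_df = pd.read_csv(csv_data)

print(postcodes_df.dtypes)     # id: int64, postcode: object, latitude/longitude: float64
print(postcodes_df.describe()) # count/mean/std/min/quartiles/max for the numeric columns
```

With the real file, `csv_data` would simply be the path to the postcode CSV.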

Manipulating the file – Changing mindset!:

With the file-based approach, a lot of the time was spent reading through the various dictionaries line by line and discarding most of the data read.

Using Pandas you have to change how you think about doing these things, to a more "SQL-like" way. In the stub we randomly choose a postcode letter (e.g. A) and then get the min and max of the IDs for postcodes starting with that letter, for a later random selection. This is easily done in Pandas by first adding a new column:

# Create a column of the first letter:
postcodes_df['postcode_first_letter'] = postcodes_df['postcode'].str[0]

Then carrying out an SQL-like min/max query:

min_pcode_id_sr = postcodes_df.groupby('postcode_first_letter', sort=True)['id'].min()
max_pcode_id_sr = postcodes_df.groupby('postcode_first_letter', sort=True)['id'].max()

This creates two Series (one value per first letter) that could be manipulated further if needed, but can be accessed easily later to get the lower and upper bounds of the IDs for the postcode letter (where chosen_letter is the one that is required):

min_postcode_id = min_pcode_id_sr.get(key=chosen_letter)
max_postcode_id = max_pcode_id_sr.get(key=chosen_letter)
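Putting those pieces together, here is a minimal runnable sketch. The four-row DataFrame and the fixed chosen_letter are made up for illustration; in the stub the letter is chosen at random:

```python
import pandas as pd

# Stand-in postcode data; the real stub loads this with pd.read_csv
postcodes_df = pd.DataFrame({
    'id':       [1, 2, 3, 4],
    'postcode': ['AB1 0AA', 'AB1 0AB', 'YO1 7HH', 'YO1 7HT'],
})

# Create a column of the first letter of each postcode
postcodes_df['postcode_first_letter'] = postcodes_df['postcode'].str[0]

# groupby().min()/.max() each return a Series indexed by the first letter
min_pcode_id_sr = postcodes_df.groupby('postcode_first_letter', sort=True)['id'].min()
max_pcode_id_sr = postcodes_df.groupby('postcode_first_letter', sort=True)['id'].max()

chosen_letter = 'A'  # hypothetical fixed choice for this sketch
min_postcode_id = min_pcode_id_sr.get(key=chosen_letter)  # 1
max_postcode_id = max_pcode_id_sr.get(key=chosen_letter)  # 2
```

The pair (min_postcode_id, max_postcode_id) then bounds the later random ID selection.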

Flourishing the data – as easy as a dict:

As in the file-based implementation, it is easy to add additional columns for the made-up data, and in fact it required "no code changes"; for the Pandas version:

chosen_data_mod['temperature'] = get_temperature()

VS (for the file version):

chosen_data['temperature'] = get_temperature()

The object names differ out of necessity here, but you can see that the creation is very similar.
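As a self-contained sketch of that dict-style assignment: both get_temperature and the single-row chosen_data_mod below are hypothetical stand-ins for the stub's own helper and chosen-postcode object, assumed here to be a one-row DataFrame:

```python
import random

import pandas as pd

def get_temperature():
    # Hypothetical generator standing in for the stub's helper
    return round(random.uniform(-5.0, 30.0), 1)

# A single chosen postcode row (made up for this sketch)
chosen_data_mod = pd.DataFrame({'id': [1], 'postcode': ['AB1 0AA']})

# Adding the made-up reading is the same dict-style assignment in both versions
chosen_data_mod['temperature'] = get_temperature()
```

Any further invented fields (humidity, speed, and so on) would be added the same way, one assignment per column.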

I feel the need … the need for speed

As I stated at the top of the article, running (to produce 20 devices with 100 data points each):

time python3 generate_data_stub_pandas.py

Got me results of:

real 0m13.461s
user 0m13.539s
sys 0m0.338s

And, obviously, an output file. Doing the same with the files version generated a lot of load on the machine and still no file after 5 minutes!

Conclusions:

Therefore, going forward I will be using the Pandas version of this code, as it will let me either generate a vast amount of data quickly or run in a continuous loop generating a steady stream of data.

