So, as noted in my last post, the file-based stub was horrendously slow: it took over 30 minutes to generate 10 mobile numbers with just 100 data points each.
As part of my continuing education with Python, I decided this could be sped up considerably by doing the work in memory, as the dataset is not that large, and started to look at Pandas. I am not going to go into what Pandas does here as it has been done to death all over the internet, but needless to say it worked tremendously well.
What have I changed:
In the GitHub repo for this little project you will now find a couple of implementations: the first is create-send-data-files.py and the second is create-send-data-pandas.py. These respectively use similarly named modules sourced from the generate-data folder.
Reading the file – It’s a one-liner:
In the file-based approach, reading the postcode file in was a call to a function:
read_input_file_to_dict(postcodes_file, processed_postcodes, 0)
But this actually called off to a multi-line function that did some manipulation of the strings to maintain the types. Now:
postcodes_df = pd.read_csv(postcodes_file)
Just the one line above reads the file and loads it into a Pandas DataFrame with the correct types (using <DataFrame name>.dtypes):
id int64
postcode object
latitude float64
longitude float64
And as an aside you can get some statistics on the data (using <DataFrame name>.describe()):
                 id      latitude     longitude
count  1.766510e+06  1.766428e+06  1.766428e+06
mean   9.734494e+05  5.309039e+01 -1.725632e+00
std    6.091245e+05  4.247189e+00  1.657252e+00
min    1.000000e+00  4.918194e+01 -8.163139e+00
25%    4.738062e+05  5.150653e+01 -2.725936e+00
50%    9.440715e+05  5.247490e+01 -1.585693e+00
75%    1.413107e+06  5.368498e+01 -3.608118e-01
max    2.660538e+06  1.000000e+02  1.760443e+00
Something that is not easily possible using just vanilla Python.
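Putting that together, here is a minimal sketch of loading and inspecting the file (the postcodes.csv filename is just an assumption here; use whatever file the repo points at):

import pandas as pd

# Assumed path to the postcode CSV used by the generator
postcodes_file = "postcodes.csv"

# One line to read the file into a DataFrame with the types inferred
postcodes_df = pd.read_csv(postcodes_file)

# Inspect the inferred column types and some summary statistics
print(postcodes_df.dtypes)
print(postcodes_df.describe())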
Manipulating the file – Changing mindset!:
With the file-based approach, a lot of the time was spent reading through the various dictionaries line by line and discarding most of the data that had been read.
Using Pandas you have to change how you think about doing these things to a more “SQL-like” way. In the stub we randomly choose a postcode letter (e.g. A) and then get the minimum and maximum of the ids associated with those postcodes, for a later random selection. This is easily done in Pandas by first adding a new column:
# Create a column of the first letter:
postcodes_df['postcode_first_letter'] = postcodes_df['postcode'].str[0]
Then carrying out an SQL-like min/max query:
min_pcode_id_sr = postcodes_df.groupby('postcode_first_letter', sort=True)['id'].min()
max_pcode_id_sr = postcodes_df.groupby('postcode_first_letter', sort=True)['id'].max()
This creates two additional Series (rather than DataFrames) that could be manipulated further if needed, but can be accessed easily later to get the lower and upper bounds of the ids for a postcode letter (where chosen_letter is the one that is required):
min_postcode_id = min_pcode_id_sr.get(key=chosen_letter)
max_postcode_id = max_pcode_id_sr.get(key=chosen_letter)
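For completeness, here is a sketch of how those bounds might then feed into the random selection; this is my reading of the approach rather than the repo’s exact code, and because the ids within a letter are not necessarily contiguous it simply retries until it hits a real row:

import random

# Pick a postcode letter at random from the grouped index
chosen_letter = random.choice(list(min_pcode_id_sr.index))

# Bounds for that letter, as shown above
min_postcode_id = min_pcode_id_sr.get(key=chosen_letter)
max_postcode_id = max_pcode_id_sr.get(key=chosen_letter)

# Pick a random id in the range and pull the matching row; if the ids are not
# contiguous this can come back empty, so keep trying until it isn't
chosen_row = postcodes_df.iloc[0:0]
while chosen_row.empty:
    chosen_id = random.randint(min_postcode_id, max_postcode_id)
    chosen_row = postcodes_df[postcodes_df['id'] == chosen_id]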
Flourishing the data – as easy as a dict:
As in the file-based implementation, it is easy to add additional columns for the made-up data, and it actually required “no code changes”; so for the Pandas version:
chosen_data_mod['temperature'] = get_temperature()
VS (for the file version):
chosen_data['temperature'] = get_temperature()
The object names are different out of necessity here but you can see that the creation is very similar.
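For illustration, here is a minimal, assumed version of that flourishing step; get_temperature here is just a stand-in for the real generator in the repo, and picking the row with sample(1) is a shortcut rather than the min/max selection described above:

import random

def get_temperature():
    # Stand-in for the repo's real generator: a plausible ambient reading
    return round(random.uniform(-5.0, 35.0), 1)

# Take one postcode row as a plain dict and flourish it with made-up data
chosen_data_mod = postcodes_df.sample(1).iloc[0].to_dict()
chosen_data_mod['temperature'] = get_temperature()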
I feel the need … the need for speed:
As I stated at the top of the article, running the following (to produce 20 devices with 100 data points each):
time python3 generate_data_stub_pandas.py
Got me results of:
real 0m13.461s
user 0m13.539s
sys 0m0.338s
And, of course, an output file. Doing the same for the files version generated a lot of load on the machine, and still no file after 5 minutes!
Conclusions:
Therefore, going forward, I will be using the Pandas version of this code, as it will allow me either to generate a vast amount of data quickly OR to have it running in a continuous loop, generating a steady stream of data.
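And a rough, purely hypothetical sketch of what that continuous-loop mode might look like, reusing postcodes_df and get_temperature from the snippets above (the real selection and send logic lives in create-send-data-pandas.py):

import time

def generate_reading(device_id):
    # Stand-in for the selection + flourishing steps shown earlier
    reading = postcodes_df.sample(1).iloc[0].to_dict()
    reading['device_id'] = device_id
    reading['temperature'] = get_temperature()
    return reading

while True:
    for device_id in range(20):
        print(generate_reading(device_id))
    time.sleep(1)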