Dangerous DBA – A blog for those DBAs who live on the edge

Category Archives: Position Tracker

Position Tracker – The Stub – Pandas:

May 18, 2020 11:45 am / dangerousDBA

So, as was noted in my last post, the stub using files was horrendously slow. It took over 30 minutes to generate 10 mobile numbers with just 100 points each.

As part of my continuing education with Python, I decided this could be sped up a lot by doing the work in memory, as the dataset is not that large, and started to look at Pandas. I am not going to go into what Pandas does here as it is done to death all over the internet, but needless to say it worked tremendously well.

What have I changed:

In the GitHub repo for this little project you will now find a couple of implementations, the first: create-send-data-files.py and the second: create-send-data-pandas.py. These respectively use similarly named modules sourced from the generate-data folder.

Reading the file – It's a one-liner:

In the file based approach then reading the postcode file in was a call to a function:

read_input_file_to_dict(postcodes_file, processed_postcodes,0)

But this actually called off to a multi-line function that did some string manipulation to maintain the types.
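A minimal sketch of what that kind of file-based reader involves (a hypothetical reconstruction, not the exact function from the repo):

import csv

def read_input_file_to_dict(input_file, output_list, skip_lines=0):
    # Read the postcode CSV line by line, cast each field back to the
    # type it should be, and collect the rows as dictionaries.
    with open(input_file, newline='') as f:
        for i, row in enumerate(csv.reader(f)):
            if i < skip_lines:
                continue  # e.g. skip a header row
            output_list.append({
                'id': int(row[0]),
                'postcode': row[1],
                'latitude': float(row[2]) if row[2] else None,
                'longitude': float(row[3]) if row[3] else None,
            })

Now, with Pandas: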

postcodes_df = pd.read_csv(postcodes_file)

Just the one line above reads the file and loads it into a Pandas DataFrame, with the correct types (using: <Data Frame Name>.dtypes):

id int64
postcode object
latitude float64
longitude float64
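If you would rather pin the types than rely on inference, read_csv also accepts an explicit dtype mapping (a sketch using the column names above; columns with missing values still need a float type):

postcodes_df = pd.read_csv(
    postcodes_file,
    dtype={'id': 'int64', 'postcode': str, 'latitude': 'float64', 'longitude': 'float64'},
)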

And as an aside you can get some statistics on the data (using: <Data Frame Name>.describe()):

                  id      latitude     longitude
count   1.766510e+06  1.766428e+06  1.766428e+06
mean    9.734494e+05  5.309039e+01 -1.725632e+00
std     6.091245e+05  4.247189e+00  1.657252e+00
min     1.000000e+00  4.918194e+01 -8.163139e+00
25%     4.738062e+05  5.150653e+01 -2.725936e+00
50%     9.440715e+05  5.247490e+01 -1.585693e+00
75%     1.413107e+06  5.368498e+01 -3.608118e-01
max     2.660538e+06  1.000000e+02  1.760443e+00

Something that is not easily possible using just vanilla Python.

Manipulating the file – Changing mindset!:

With the file-based approach, a lot of the time was spent reading through the various dictionaries line by line and discarding most of the read data.

Using Pandas you have to change how you think about doing these things to a more “SQL like” way. In the stub we randomly choose a postcode first letter (e.g. A) and then get the min and max of the ids for postcodes starting with that letter, for later random selection. This is easily done in Pandas by first adding a new column:

# Create a column of the first letter:
postcodes_df['postcode_first_letter'] = postcodes_df['postcode'].str[0]

Then carrying out an SQL-like min/max query:

min_pcode_id_sr = postcodes_df.groupby('postcode_first_letter', sort=True)['id'].min()
max_pcode_id_sr = postcodes_df.groupby('postcode_first_letter', sort=True)['id'].max()

This creates two additional Series (hence the _sr suffix) that could be manipulated further if needed, but can be accessed easily later to get the lower and upper bounds of the ids for the postcode letter (where chosen_letter is the one required):

min_postcode_id = min_pcode_id_sr.get(key=chosen_letter)
max_postcode_id = max_pcode_id_sr.get(key=chosen_letter)
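Putting those pieces together, here is a runnable sketch (assuming the four-column CSV layout shown earlier; the file name and variable names are mine, not necessarily the repo's):

import random
import pandas as pd

postcodes_df = pd.read_csv('ukpostcodes.csv')  # assumed file name
postcodes_df['postcode_first_letter'] = postcodes_df['postcode'].str[0]

min_pcode_id_sr = postcodes_df.groupby('postcode_first_letter', sort=True)['id'].min()
max_pcode_id_sr = postcodes_df.groupby('postcode_first_letter', sort=True)['id'].max()

# Choose a letter that actually occurs in the data, then an id in its range
chosen_letter = random.choice(min_pcode_id_sr.index.tolist())
chosen_id = random.randint(min_pcode_id_sr.get(key=chosen_letter),
                           max_pcode_id_sr.get(key=chosen_letter))

# As in the post, this assumes ids in that range belong to the chosen letter;
# the id may also fall in a gap, in which case the frame below comes back empty.
chosen_row = postcodes_df[postcodes_df['id'] == chosen_id]
print(chosen_row)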

Flourishing the data – as easy as a dict:

As in the file-based implementation, it is easy to add additional columns for the made-up data, and it actually required “no code changes”; so for the Pandas version:

chosen_data_mod['temperature'] = get_temperature()

VS (for the file version):

chosen_data['temperature'] = get_temperature()

The object names are different out of necessity here, but you can see that the creation is very similar.
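For illustration, get_temperature() can be as simple as a bounded random draw (a plausible stand-in, not necessarily what the repo does):

import random

def get_temperature(low=-5.0, high=35.0):
    # Make up a plausible UK temperature reading in degrees Celsius
    return round(random.uniform(low, high), 1)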

I feel the need … the need for speed

As I stated at the top of the article, running this (to produce 20 devices with 100 data points each):

time python3 generate_data_stub_pandas.py

Got me results of:

real 0m13.461s
user 0m13.539s
sys 0m0.338s

And, of course, an output file. Doing the same for the files version generated a lot of load on the machine, and no file after 5 minutes!

Conclusions:

Therefore, going forward I will be using the Pandas version of this code, as it will allow me either to generate a large amount of data quickly OR to have it running in a continuous loop generating a steady stream of data.

Posted in: 2020, GCP, Google, Pandas, Position Tracker, Python / Tagged: Improvements, Learning, pandas, Position Tracking, python

Position Tracker – The Stub

May 15, 2020 11:05 am / dangerousDBA

Continuing on from what I have been blogging about, this is the start of flourishing the IoT data pipeline that is created in the Qwiklabs tutorial.

What did the original do:

The original file (<– link just there) makes up some data using the Python random module. It generates a number of readings for a sensor based on the arguments that you pass in!
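Roughly in this style (an illustrative sketch, not the original script's exact code):

import random

def generate_sensor_readings(device_id, count):
    # Make up `count` readings for one simulated device
    return [{'device_id': device_id, 'reading': round(random.uniform(0.0, 100.0), 2)}
            for _ in range(count)]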

Why did I want to change it:

I felt the dataset it produced was a little small and not very realistic. I wanted locations so that you could do cooler visualisations on it, and so it would be a more real-world example.

How did I change it:

The code can be found here: Position Tracker

TL;DR:

I created an additional script and renamed the original. The new script reads a file and, driven by three additional parameters, generates additional devices and data and sends them off to the Google IoT device registry, then on through Google PubSub and Google Dataflow, and finally into Google BigQuery to be visualised (crudely) in Google Data Studio.
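In that pipeline the device registry forwards telemetry into PubSub. For a feel of the PubSub leg on its own, a minimal sketch of publishing one record directly (project and topic names are placeholders, and this bypasses the IoT Core registry):

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-gcp-project', 'position-tracker')  # placeholders

record = {'mobile_no': '07000000000', 'latitude': 53.0, 'longitude': -1.5, 'temperature': 21.3}
future = publisher.publish(topic_path, json.dumps(record).encode('utf-8'))
print(future.result())  # message id once PubSub has accepted the message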

[Screenshot: Google Data Studio visualisation of my initial data]

Issues: VERY slow to generate the data; use dictionaries better, or another library such as Pandas?

Otherwise:

The first thing I did was create a new module called generate_data. This was going to hold the stub and any associated files and data that got produced. I also ran cp cloudiot_mqtt_example_json.py create-send-data.py so that I could have free rein to change what I needed to!

Next I got ambitious and thought about where I wanted to generate locations for, and came up with the UK. Looking through a Google search, people suggested many ways to do this in Python. They all had flaws though, so I decided on using actual postcode locations; the question was where to source them from.

I found a site, freemaptools.com, which had UK postcode data to download; it also seems to be refreshed frequently! So I got this file and inspected it: good data in the format id,postcode,latitude,longitude.

I created a time monster:

In the new generate_data module I created a generate_data_stub.py taking parameters for:

  1. The number of devices you wanted to generate data for
  2. The number of datapoints per device
  3. The filename of the output data

This in turn (sketched in code after this list):

  1. Read the UK postcodes file and processed it into a new file of Python dictionary lines.
  2. Read that new file and, for each first letter of the postcode, built a dictionary of the min and max id (for random selection later).
  3. Created a list of random device numbers.
  4. For each of the devices:
     1. Chose a random letter.
     2. Got the min and max ids from the step 2 dictionary.
     3. Chose a random set of ids in that range.
     4. Found each id and its associated data in the processed file.
     5. Flourished the data with a temperature and the mobile number.
     6. Wrote it to the output file.
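A condensed sketch of that flow (my reconstruction, not the repo's exact code: read_input_file_to_dict and get_temperature stand in for the real helpers, and postcodes_file, num_devices, points_per_device and output_file are assumed to come from the script's arguments):

import random

processed_postcodes = []
read_input_file_to_dict(postcodes_file, processed_postcodes, 0)        # step 1

# Step 2: min and max id per postcode first letter
min_max_ids = {}
for row in processed_postcodes:
    letter = row['postcode'][0]
    lo, hi = min_max_ids.get(letter, (row['id'], row['id']))
    min_max_ids[letter] = (min(lo, row['id']), max(hi, row['id']))

# Step 3: random mobile numbers as device ids
devices = [random.randint(10**10, 10**11 - 1) for _ in range(num_devices)]

with open(output_file, 'w') as out:
    for device in devices:                                             # step 4
        for _ in range(points_per_device):
            letter = random.choice(list(min_max_ids))                  # 4.1
            lo, hi = min_max_ids[letter]                               # 4.2
            chosen_id = random.randint(lo, hi)                         # 4.3
            # 4.4: linear scan for the matching row (the slow part!)
            data = next((r for r in processed_postcodes if r['id'] == chosen_id), None)
            if data is None:
                continue  # ids are not contiguous, so some picks miss
            data = dict(data, temperature=get_temperature(), mobile_no=device)  # 4.5
            out.write(str(data) + '\n')                                # 4.6

That inner linear scan over 1.7 million rows for every data point is a big part of why this version is so slow.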

Issue:

This works, but it is very time consuming and seems to be CPU bound.

What to try next:

I think there is very little need to keep writing all the data out to files; it could be done more in memory, better utilising dictionaries or using a library such as Pandas.

Posted in: 2020, Big Data, BigQuery, GCP, Google, Position Tracker, Python / Tagged: bigdata, Google, IOT, python

Position Tracker – In the beginning

May 13, 2020 9:00 am / dangerousDBA

As mentioned in the previous post starting this all up again, I have been looking at expanding and improving my skills around Python and Google Cloud.

What did I do:

To this end I took up Google's “free courses during COVID” offer and followed through the Qwiklabs modules. One of the quests contained an IoT device simulator that passed a fairly simple amount of data all the way through to Google BigQuery. I am going to take this and completely plagiarise it as a learning example to build upon, as I think it is a good basis for a lot of things:

  1. Improving my Python – The first iteration I publish will replace the data files the code ingests with something that generates “random” data. There is a lot of scope to use different methods to improve this.
  2. Extending the pipeline – This pipeline can go all the way through to visualisation in Google Data Studio.
  3. Looking at Google BigQuery – This is a very interesting area; we can look at functions, GIS and all things GBQ.
  4. Other Google services – There are many services used in this example, and I feel that we can add more as we need them, such as Google Cloud Composer.

Where can this wonder code be found:

I have a GitHub account where I have various white whales that I have started, and this particular one can be found: here

How is this going to work:

I am going to start by creating a pipeline that is not much of a departure from what is offered by the current quest. I will then iterate on that to produce proofs of concept, give appraisals of what I have done, and try to critique myself. You will be able to find the work in my GitHub repo and we can see where we go from here, depending mainly on when I get time to do these things!

Who can help:

You all can! If you think there is a better way to do literally anything here, I would love to know and investigate. I am pretty certain there is a better way for my Python location-data-generating stub after the first rushed iteration.

I look forward to hearing from you all!

Posted in: 2020, Big Data, BigQuery, Cloud, Dataflow, GCP, Google, IOT, Position Tracker, PubSub / Tagged: BigQuery, Dataflow, Devices, GCP, Google, Google Cloud, IOT, Position Tracking, PubSub
