Dangerous DBA – A blog for those DBAs who live on the edge

Category Archives: Big Data

Position Tracker – The Stub

May 15, 2020 11:05 am / Leave a Comment / dangerousDBA

Continuing on from what I have been blogging about, this is the start of flourishing the IoT data pipeline that is created in the Qwiklabs tutorial.

What did the original do:

The original file (linked just there) makes up some data using the Python random module, generating a number of readings for a sensor based on the arguments you pass in!
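Roughly, that approach looks like the sketch below. This is my own illustration rather than the tutorial's actual code, and the function and field names are made up:

```python
import datetime
import json
import random

def generate_readings(device_id, num_readings):
    """Emit a list of fake sensor readings for one device (illustrative only)."""
    readings = []
    for _ in range(num_readings):
        readings.append({
            "device_id": device_id,
            "timestamp": datetime.datetime.utcnow().isoformat(),
            "temperature": round(random.uniform(15.0, 30.0), 2),
        })
    return readings

if __name__ == "__main__":
    # e.g. three readings for one made-up sensor
    print(json.dumps(generate_readings("sensor-001", 3), indent=2))
```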

Why did I want to change it:

I felt the dataset it produced was a little small and not very realistic. I wanted locations so that you could do cooler visualisations with it, and so that it made a more real-world example.

How did I change it:

The code can be found here: Position Tracker

TL;DR:

I renamed the original and created an additional script that reads a file and, based on three additional parameters, generates extra devices and data. These are sent to the Google IoT device registry, then on through Google Pub/Sub and Google Dataflow, and finally into Google BigQuery to be visualised (crudely) in Google Data Studio.

[Screenshot: Google Data Studio visualisation of my initial data]

Issues: VERY slow to generate the data; use dictionaries better, or another library such as Pandas?

Otherwise:

The first thing I did was create a new module called generate_data. This was going to hold the stub and any associated files and data that got produced. I also cp'd cloudiot_mqtt_example_json.py to create-send-data.py so that I could have free rein to change whatever I needed to!

Next I got ambitious and thought about where I wanted to generate locations for, and I came up with the UK. A Google search turned up many suggested ways to do this in Python, but they all had flaws, so I decided on using actual postcode locations; the question was where to source them from.

I found a site, freemaptools.com, that has UK postcode data to download; it also seems to be refreshed frequently! So I got the file and inspected it: good data in the format id,postcode,latitude,longitude.
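For illustration, parsing that file into a dictionary keyed on the id might look something like this (a sketch that assumes the download is saved as ukpostcodes.csv with a header row; rows without coordinates are skipped):

```python
import csv

def load_postcodes(path="ukpostcodes.csv"):
    """Read the postcode file into a dict keyed on id (illustrative sketch)."""
    postcodes = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # columns: id,postcode,latitude,longitude
            try:
                lat, lon = float(row["latitude"]), float(row["longitude"])
            except (TypeError, ValueError):
                continue  # skip rows without usable coordinates
            postcodes[int(row["id"])] = {
                "postcode": row["postcode"],
                "latitude": lat,
                "longitude": lon,
            }
    return postcodes
```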

I created a time monster:

In the new generate_data module I created a generate_data_stub.py taking parameters for:

  1. The number of devices you wanted to generate data for
  2. The number of datapoints per device
  3. The filename of the output data

This in turn (see the sketch after this list):

  1. Read the UK postcodes file and turned it into a file that could be processed into Python dictionary lines.
  2. Read the new file and, for each first letter of the postcode, built a dictionary of the min and max id (for randomness later).
  3. Created a list of random device numbers.
  4. For each of those devices, and for each datapoint:
     1. Chose a random letter.
     2. Got the min and max ids for that letter from the dictionary built in step 2.
     3. Chose a random number in that range.
     4. Found the id and its associated data in the processed file.
     5. Flourished the data with a temperature and the mobile number.
     6. Wrote it to the output file.
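To make that flow concrete, here is a minimal sketch of the same steps done in memory rather than via the intermediate file. The function, parameters, column names, temperature range and mobile number format are all my own illustrative choices, not the actual generate_data_stub.py:

```python
import csv
import random

def generate_stub_data(postcode_csv, output_csv, num_devices, points_per_device):
    """Sketch of the stub's flow, done in memory (names are illustrative)."""
    # Steps 1-2: read the postcode file, keep rows by id and track the
    # min/max id seen for each first letter of the postcode.
    rows, id_range_by_letter = {}, {}
    with open(postcode_csv, newline="") as f:
        for row in csv.DictReader(f):  # id,postcode,latitude,longitude
            if not row["postcode"]:
                continue
            pid = int(row["id"])
            rows[pid] = row
            letter = row["postcode"][0].upper()
            lo, hi = id_range_by_letter.get(letter, (pid, pid))
            id_range_by_letter[letter] = (min(lo, pid), max(hi, pid))

    # Step 3: a list of random device numbers.
    devices = [random.randint(100000, 999999) for _ in range(num_devices)]

    # Step 4: for each device, pick random postcode rows, flourish them
    # with a temperature and a mobile number, and write them out.
    with open(output_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["device_id", "postcode", "latitude", "longitude",
                         "temperature", "mobile_no"])
        for device in devices:
            for _ in range(points_per_device):
                letter = random.choice(list(id_range_by_letter))
                lo, hi = id_range_by_letter[letter]
                row = None
                while row is None:  # ids are not contiguous, so retry misses
                    row = rows.get(random.randint(lo, hi))
                writer.writerow([device, row["postcode"], row["latitude"],
                                 row["longitude"],
                                 round(random.uniform(-5.0, 35.0), 1),
                                 "07" + str(random.randint(100000000, 999999999))])
```

Calling generate_stub_data("ukpostcodes.csv", "output.csv", 10, 100), for example, would produce 1,000 flourished rows.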

Issue:

This works, but it is very time consuming and seems to be CPU bound.

What to try next:

I think there is very little need to keep writing all the data out and reading it back in; it could be done more in memory, better utilising dictionaries or using a library such as Pandas.
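As a rough idea of what the Pandas route might look like (an untested sketch, reusing the hypothetical ukpostcodes.csv from earlier), the sampling and flourishing can be done with a handful of vectorised calls instead of a per-row Python loop:

```python
import numpy as np
import pandas as pd

# Load once, dropping rows with no coordinates (same hypothetical ukpostcodes.csv).
postcodes = pd.read_csv("ukpostcodes.csv").dropna(subset=["latitude", "longitude"])

num_devices, points_per_device = 100, 50
total = num_devices * points_per_device

# Sample one postcode row per datapoint in a single vectorised call,
# then bolt on the flourished columns.
sample = postcodes.sample(n=total, replace=True).reset_index(drop=True)
sample["device_id"] = np.repeat(np.arange(1, num_devices + 1), points_per_device)
sample["temperature"] = np.round(np.random.uniform(-5.0, 35.0, total), 1)
sample.to_csv("generated_data.csv", index=False)
```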

Posted in: 2020, Big Data, BigQuery, GCP, Google, Position Tracker, Python / Tagged: bigdata, Google, IOT, python

Position Tracker – In the beginning

May 13, 2020 9:00 am / Leave a Comment / dangerousDBA

As mentioned in the previous post starting this all up again, I have been looking at expanding and improving my skills around Python and Google Cloud.

What did I do:

To this end I took Google up on its “Free courses during COVID” offer and worked through the Qwiklabs modules. One of the quests contained an IoT device simulator; this passed a fairly simple amount of data all the way through to Google BigQuery. I am going to take this and completely plagiarise it as a learning example to build upon, as I think it is a good basis for a lot of things:

  1. Improving my Python – The first iteration I will publish replaces the data files the code ingests with something that generates “random” data. There is a lot of scope to use different methods to improve this.
  2. Extending the pipeline – This pipeline can go all the way through to visualisation in Google Data Studio.
  3. Looking at Google BigQuery – This is a very interesting area; we can look at functions, GIS and all things GBQ.
  4. Other Google services – There are many services used in this example, and I feel we can add more as we need them, such as Google Cloud Composer.

Where can this wonder code be found:

I have a GitHub account where I keep the various white whales that I have started, and this particular one can be found: here

How is this going to work:

I am going to start by creating a pipeline that is not much of a departure from what is offered by the current Quest. I will then iterate on that to produce proofs of concept and give appraisals of what I have done, trying to critique myself. You will be able to find the work in my GitHub repo and we can see where we go from here, depending mainly on when I get time to do these things!

Who can help:

You all can: if you think there is a better way to do literally anything, I would love to know and investigate. I am pretty certain there is for my Python location data generating stub after its first rushed iteration.

I look forward to hearing from you all!

Posted in: 2020, Big Data, BigQuery, Cloud, Dataflow, GCP, Google, IOT, Position Tracker, PubSub / Tagged: BigQuery, Dataflow, Devices, GCP, Google, Google Cloud, IOT, Position Tracking, PubSub

What's been going on in the world of the Dangerous DBA:

May 11, 2020 3:19 pm / Leave a Comment / dangerousDBA

It’s been a while ….. again:

So since I last posted, way back in 2016 about QCon, there has been a lot of water under the bridge in terms of technologies used and abused, and either taken into the stack or discarded onto the ummm-it-nearly-worked pile.

Things that have changed in that time:

Where I work has seen a massive increase in the number of people in the data team, in response to the business becoming far more data driven. The team has gone from two or three of us to three distinct sections, each with a minimum of four people.

We now embrace streaming data, which was just emerging as a widespread phenomenon in 2016, and make good use of it for microservices and data products.

I have now progressed to a “senior” role in the business, with juniors and plain (no prefix or suffix) data engineers in the data products team.

Things that have NOT changed in that time:

We are still moving off DB2, and due to several large-scale events over the years and the increasing size of the team this is rapidly gathering pace; as a business we still need the old beast, but it has been on somewhat of a diet!

Where are things going:

Well, for us, away from the data centre and Amazon Web Services (AWS) and to Google Cloud. We are making extensive use of everything it has to offer; because of its nature we face far fewer of the size issues that we did in the data centre, or even on AWS with some of its offerings.

What are my interests (white whales) now:

Well I am going to be looking at a few things now:

  1. Google Cloud Management – is it possible to create a tool that will aid this from a greenfield perspective and also a brownfield one? It seems (not that I can Google it, so they need to make a better effort at SEO) that there is not a product out there that does this in an obvious way.
  2. IOT – This is going to be something that is “easily implemented” but then has individual elements that can be taken WAY further. I am currently working on something, so look out for upcoming posts.

Posted in: 2020, Big Data, Blogging, General Blog / Tagged: AWS, GCP

Back to dangerous blogging

May 20, 2015 9:51 pm / Leave a Comment / dangerousDBA

Looking at this, I have not written a blog post on here in over two years! I think it is time for a change. First of all, to let you guys know what I have been up to, and then to let you in on what I will be blogging about in the future:

What I have been up to

The past two years have been very interesting. The company that I work for has seen a data explosion, and unfortunately this meant that the DB2 estate needed to support it is quite expensive in terms of licensing and hardware costs compared to other offerings out there. We were also struggling to find people with the requisite skills to support our architecture of DB2 plus in-house built ETL tools. It was a rather unique solution, in an area of the country where London sucks up a lot of the people with DB2 skills.

Therefore we started looking at other platforms to house our data warehouse and power the business's data future. We started with SQL Server and all its offerings. This seemed like a winner for a time; we started building out the platform with a home-built auto runner and other interesting features that I may blog about later. This solution was far cheaper in terms of licensing than DB2 (sorry IBM), came with a lot of features such as SSRS and SSIS, and had a more readily available talent pool in the local area. This architecture still had one major problem: hardware. We fill it up and stress it out far quicker than our normal five-year planning caters for.

Then Amazon brought out Redshift. This is a winner: it has features that will support our business's data growth and an ecosystem of products around it that will support the business's data hunger. It does not come with an ETL tool like SSIS, so in an effort to standardise, as opposed to our current scripted batch system, we found Talend. As a reporting solution we found an awesome product called Looker. We are also looking at other big data solutions such as Hadoop, EMR, Databricks and many other things data related.

What I will be blogging about in the future

So I now intend to get back to blogging about my solutions to the technical issues we face and solve, and the cool features we use and that you can use too, around Amazon Redshift, Looker, Talend and maybe occasionally DB2 as we wind down our usage. I will also be blogging more about the conferences I have attended and the big data solutions we try out. So watch out for more content and hopefully interesting articles and findings.



Posted in: Amazon, Big Data, Redshift / Tagged: Amazon, Big Data, databricks, Hadoop, Looker, Redshift, Talend

Just finished reading: Customer Experience Analytics

January 29, 2012 9:36 pm / Leave a Comment / dangerousDBA

This is another excellent free ebook from IBM that renders well on a Kindle. The book can be downloaded from the IBM Information Management Book Store or via the direct link here.

Overall, like the Understanding Big Data book I reviewed last time, I think it is a good introduction to the subject matter, giving you a quick way to get up to speed with some of the concepts involved and the evolution going on around the social sphere and customer experience. The book is again split up into sections, this time three: Part One: The CEA Opportunity, Part Two: The Customer Experience Analytics [sic] Solution and Part Three: How to Package a Customer Experience Analytics [sic] Program.

Part One: The CEA Opportunity covers a few case studies of how various industries use customer experience to fuel decisions that affect the business and the customer. It then moves on to how our societies' move toward increasingly automated ways of interacting during the sales and marketing processes makes collecting the data for CEA a lot easier and quicker to act upon. The third chapter in this part looks at the evolution of the customer decision-making process, and how a single customer's influence on the wider world can (should) affect how a business deals with them. This raised some interesting thoughts, in that basically people who are “listened” to (Facebook, Twitter, text messages in a social group) should be treated differently when they have a complaint than those who “listen” and do not contribute back, pushing “stardom” down onto those that are not famous, but are popular in a social group. The final chapter in this section looks at the “bazaar” of data that exists for CEA and touches on big data concepts again.

Part Two: The Customer Experience Analytics [sic] Solution is a slightly technical, but more theoretical, look at how you would go about creating your CEA solution, without pushing any particular products. It covers Master Data Management (MDM), stream computing, predictive modelling and a couple of other topics, though not to a depth that would make you a master of these areas; at least enough to let you in on the conversation.

Part Three: How to Package a Customer Experience Analytics [sic] Program is basically how you would put together a business case for CEA, plus the conclusion of the book. The business case for CEA varies from needed to stay in business (mobile phone companies) to currently only done on an ad-hoc basis and needing to be built up in the company or the industry. It would be hard to place the company that I currently work for on this scale, as I am unaware of what, if anything, anyone else does in the sector that we are in, but I think it has legs and should be something we should be pushing; I would definitely like to get involved in the technical side. I also think what we do have in place is too rigid in the way it carries out its current matching, and we really need to be pulling in or getting at the social sphere of the customer somehow.

Posted in: Big Data, BigData Case studies, Book read, IBM, Information Managment, InfoSphere Streams, Kindle, MapReduce, MDM / Tagged: customer decision, customer experience, free ebook, management book, social sphere, stream computing

