Yellow Elephants

Google Cloud Platform Big Data and Machine Learning Fundamentals

Today I finished my first class in the Data Engineering on the Google Cloud Platform Specialization, “Google Cloud Platform Big Data and Machine Learning Fundamentals.” The first task of the day was Lab #6, “Carry out Machine Learning with TensorFlow v1.3,” followed by a quiz. Frankly, Lab #7, “Demand forecasting with BigQuery and TensorFlow,” was more interesting. It used taxi cab data for New York City to try to predict demand.

While GCP Datalab takes a while to provision, it is very convenient because it brings together many of the tools needed to do ML. The data was stored in BigQuery, a GCP data warehouse that uses SQL. What’s cool about Datalab is that the tools and libraries are already there, so it’s just a matter of importing the Python client for BigQuery, along with the other libraries needed for this lab (pandas, numpy and TensorFlow), with no installation necessary. The lab then used regular Python, e.g. the run_line_magic() method on the IPython instance, to run SQL queries in BigQuery:*

get_ipython().run_line_magic('bq', 'query -n taxiquery')

WITH trips AS (
  SELECT EXTRACT(DAYOFYEAR FROM pickup_datetime) AS daynumber
  FROM `bigquery-public-data.new_york.tlc_yellow_trips_*`
  WHERE _TABLE_SUFFIX = @YEAR
)
SELECT daynumber, COUNT(1) AS numtrips FROM trips
GROUP BY daynumber ORDER BY daynumber

Then run the query and store the results in a pandas dataframe:

query_parameters = [
  {
    'name': 'YEAR',
    'parameterType': {'type': 'STRING'},
    'parameterValue': {'value': 2015}
  }
]

trips = taxiquery.execute(query_params=query_parameters).result().to_dataframe()
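To make the query more concrete, here is a minimal sketch of the same "trips per day of the year" count done locally in pandas. This is not code from the lab: the function name count_trips_per_day and the sample timestamps are made up for illustration, and the real data lives in BigQuery rather than in a Python list.

```python
import pandas as pd


def count_trips_per_day(pickups):
    """Count trips per day of year, mirroring the SQL GROUP BY daynumber."""
    # Equivalent of EXTRACT(DAYOFYEAR FROM pickup_datetime)
    daynumber = pd.to_datetime(pd.Series(pickups)).dt.dayofyear
    # Equivalent of COUNT(1) ... GROUP BY daynumber ORDER BY daynumber
    counts = daynumber.value_counts().sort_index()
    return pd.DataFrame({'daynumber': counts.index, 'numtrips': counts.values})


# Hypothetical pickup times, not taken from the taxi dataset
sample_pickups = [
    '2015-01-01 00:15:00',
    '2015-01-01 08:30:00',
    '2015-01-02 12:00:00',
    '2015-02-01 09:45:00',
]

trips_sample = count_trips_per_day(sample_pickups)
print(trips_sample)
```

The result has the same two columns as the query, daynumber and numtrips, so it can be plotted the same way as the DataFrame that comes back from BigQuery.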

Then the data was plotted, which really helps to visualize whether there are any relationships in the data. The weather, day of the week and more were all examined for relationships. The final step was using TensorFlow, with 80% of the data used for training and 20% used for testing the model. This part of the lab made the least sense to me in terms of what was being done to the data, because I am not a data scientist and I don’t yet know what the methods within TensorFlow do. I did understand the Python, though, e.g. lambda functions, regular functions, and using numpy to do math on large amounts of data in an orderly way.
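For what an 80/20 split looks like in plain numpy, here is a minimal sketch. It is my own illustration and not the lab's code: the function name split_train_test, the shuffling step and the toy arrays are all assumptions, and the lab may well have split its data differently (for example, without shuffling).

```python
import numpy as np


def split_train_test(features, targets, train_fraction=0.8, seed=0):
    """Shuffle the rows, then keep train_fraction of them for training."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(features))
    n_train = int(len(features) * train_fraction)
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return (features[train_idx], targets[train_idx],
            features[test_idx], targets[test_idx])


# Toy data: 10 rows with 2 feature columns each, and one target per row
features = np.arange(20, dtype=float).reshape(10, 2)
targets = np.arange(10, dtype=float)

x_train, y_train, x_test, y_test = split_train_test(features, targets)
print(x_train.shape, x_test.shape)
```

The same shuffled indices are applied to both arrays, so each row of features stays paired with its target, and the held-out 20% is never seen during training.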

Eventually, I finished the class and received my certificate!

Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform

So, I started the second class in the series, “Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform.” After an overview of Hadoop (which has a yellow elephant mascot, not to be confused with my favorite elephant mascot, Slonik of PostgreSQL!), the first lab involved creating a Hadoop cluster. The directions for the lab were more rudimentary than in the last class. After configuring local environment variables, I set up a Hadoop cluster with one master and three worker nodes on standard machines. Since the point of the lab was just to set up the Hadoop cluster and then examine it (logs, Hadoop admin sites, etc.), it didn’t need to be powerful; it just needed to work, which it did. Hopefully in the next lab we’ll do something interesting with it!

* This code was taken from the lab and is copyrighted by Google.
