Today I did several more lessons in my class (Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform) along with lab #2. In essence, there was a dataset in a text file with information about animals that included type, name, breed, age and more. The goal was to group the data so that all the dogs were together, all the cats were together, and so on. The Hadoop cluster had already been set up for the lab, and the setup was similar to the first lab: one master node and two worker nodes. One new thing this time was creating a firewall rule and configuring it so that access to the master node was restricted to my computer’s IP. The data was then staged into the Hadoop Distributed File System (HDFS). The next part was pretty cool: using Hive to create a schema and connecting it to the data. Since Hive uses a flavor of SQL, it was a good refresher. The actual processing was done using Pig. Unfortunately, the lab supplied a Pig script with a bug, which thankfully did not prevent the job from running but was still annoying, as the error kept logging and filling the terminal. Once the job completed successfully, I copied the results out of HDFS into a local directory and viewed them. The pigs, dogs and cats were correctly grouped by type.
Next up for me was a return to Go. I have been learning Go on and off for a while, but without a real sense of what I wanted to do with it. I started off with a small refresher that involved downloading a CSV file of plant data and ingesting it by reading each row, splitting the row into a slice of strings, and appending each row’s slice to an outer slice, producing a nested structure. Next, I went searching and found a book on Safari (“Go Machine Learning Projects” by Xuanyi Chew) that will help me learn the ETL process through Go projects, and it includes bonus challenges, so I am excited to dive into it.