
Performing Data analytics with DQL

Created: 30 May 2014 • Updated: 02 Jun 2014
By Mike Hart

As a data scientist at Symantec and collaborator with the Data Insight engineering team, I have been leveraging DQL to make it easier to conduct my research.  When I explore data, I use Python to rapidly write code to validate ideas.  I prefer Python for its expressive syntax, ease of use and the wealth of machine learning, statistical and scientific computing libraries.  This post will show how easy it is to use Python for processing DQL output to gain insight into user behavior with an example aimed at discovering “weekend warriors”.

A typical DQL query I run will look something like this:

from activity get timestamp, user.sid, path.msu.id, path.id, path.parent.id, path.name, path.type, opcode, count

You will notice that I do not perform any aggregation in the query. This is intentional: I may need to look at the data in several different ways, and that flexibility would be lost if I applied GROUP BY clauses up front.
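Keeping the raw rows means the same result set can be tallied several different ways after the fact. As a small illustration (with made-up rows standing in for real DQL output), `collections.Counter` can aggregate the same data by user or by operation:

```python
from collections import Counter

# Made-up (user_sid, opcode) rows standing in for un-aggregated DQL output
rows = [('S-1', 'READ'), ('S-1', 'WRITE'), ('S-2', 'READ'), ('S-1', 'READ')]

by_user = Counter(user for user, _ in rows)   # events per user
by_opcode = Counter(op for _, op in rows)     # events per operation type
print(by_user['S-1'], by_opcode['READ'])      # 3 3
```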

Opening and reading SQLite data in Python is extremely easy:

import sqlite3
db = sqlite3.connect('q1.db')
cursor = db.cursor()
for row in cursor.execute('SELECT * FROM activity'):
  pass # Insert the magic here
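As a convenience (not required by the example above), sqlite3 can return rows that support access by column name via its `Row` factory. Here is a sketch using a tiny in-memory table standing in for the real q1.db; the column set is illustrative:

```python
import sqlite3

# Tiny in-memory stand-in for q1.db; the columns are illustrative
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE activity (timestamp INTEGER, user_sid TEXT, opcode TEXT)')
db.execute("INSERT INTO activity VALUES (1401494400, 'S-1-5-21-1', 'READ')")

db.row_factory = sqlite3.Row  # rows now support access by column name
cursor = db.cursor()
for row in cursor.execute('SELECT * FROM activity'):
    print(row['user_sid'], row['opcode'])
```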

The value of being able to collect data across all indexers and serialize it in one place is fully appreciated when you can write custom analytics using DQL.  For example, imagine we wanted to keep tabs on “weekend warriors”, employees who access a significant amount of data on the weekend.  Let’s imagine we execute the above query and save it as q1.db.  To find which employees are working on the weekend, we must filter on the file operations that occurred on the weekend and aggregate for each user and weekend day.  This cannot be easily done in SQLite alone, so we need to code it.  Fortunately, Python makes this simple.  First, we can use the datetime module to determine whether an activity occurred on a Saturday or Sunday:

from datetime import date

def is_weekend_day(timestamp):
   day_of_week = date.fromtimestamp(timestamp).weekday()
   return day_of_week == 5 or day_of_week == 6
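A quick sanity check (the function is repeated here so the snippet is self-contained): 31 May 2014 fell on a Saturday and 2 June 2014 on a Monday; noon is used so the local date is stable regardless of time zone.

```python
from datetime import date, datetime

def is_weekend_day(timestamp):
    day_of_week = date.fromtimestamp(timestamp).weekday()
    return day_of_week == 5 or day_of_week == 6

saturday = datetime(2014, 5, 31, 12, 0).timestamp()  # a Saturday
monday = datetime(2014, 6, 2, 12, 0).timestamp()     # a Monday
print(is_weekend_day(saturday), is_weekend_day(monday))  # True False
```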

Now let’s see which files users touched on the weekends.  We can use two of Python’s rich and fast built-in data structures: sets and dictionaries.

user_day_files_map = {}
for row in cursor.execute("SELECT timestamp, user_sid, path_msu_id, path_id FROM activity"):
  if not is_weekend_day(row[0]):
    continue # Skip weekday activity
  activity_date = date.fromtimestamp(row[0])
  user = row[1]
  # setdefault creates the set the first time this (day, user) pair is seen
  user_day_files_map.setdefault((activity_date, user), set()).add((row[2], row[3]))

We can now create a spreadsheet recording how many files each user accessed on each weekend day, like so:

with open('weekend.warriors.csv', 'w') as fh:
  for (dt, user), files in user_day_files_map.items():
    fh.write("%s,%s,%i\n" % (dt.strftime("%m/%d/%y"), user, len(files)))
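If any field could ever contain a comma (path names, for instance), Python’s csv module handles quoting and escaping automatically.  A minimal sketch, with made-up data in the same shape as user_day_files_map:

```python
import csv
from datetime import date

# Made-up data in the same shape as user_day_files_map
user_day_files_map = {
    (date(2014, 5, 31), 'S-1-5-21-1'): {('msu1', 'file1'), ('msu1', 'file2')},
}

with open('weekend.warriors.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)  # quoting and escaping are handled for us
    for (dt, user), files in user_day_files_map.items():
        writer.writerow([dt.strftime("%m/%d/%y"), user, len(files)])
```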

Now that we have computed which employees accessed which files on the weekend (in user_day_files_map) and saved the results to a CSV, the data can be loaded into the analysis software of your choosing.  If the file is of reasonable size, opening it in Excel allows us to chart and apply statistical functions with little added effort.
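For a quick look without leaving Python, the same map can also be ranked directly; the sample data below is made up:

```python
from datetime import date

# Made-up counts in the same shape as user_day_files_map
user_day_files_map = {
    (date(2014, 5, 31), 'S-1-5-21-1'): {('m', 'a'), ('m', 'b'), ('m', 'c')},
    (date(2014, 5, 31), 'S-1-5-21-2'): {('m', 'a')},
}

# Rank (day, user) pairs by the number of distinct files touched
ranked = sorted(user_day_files_map.items(), key=lambda kv: len(kv[1]), reverse=True)
for (dt, user), files in ranked:
    print(dt.strftime("%m/%d/%y"), user, len(files))
```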

As you can see, we can run custom analytics on Data Insight indices with very little effort by leveraging DQL, Python and a little code.