RootChutney

Came across another interesting problem on hackerrank related to range sum query in 3 dimensions.

The question goes as below -

You are given a 3-D Matrix in which each block contains 0 initially. The first block is defined by the coordinate (1,1,1) and the last block is defined by the coordinate (N,N,N). There are two types of queries.

UPDATE x y z W

updates the value of block (x,y,z) to W.

QUERY x1 y1 z1 x2 y2 z2

calculates the sum of the value of blocks whose x coordinate is between x1 and x2 (inclusive), y coordinate between y1 and y2 (inclusive) and z coordinate between z1 and z2 (inclusive).

We will stick to creating a fenwick tree here in case of a cube although this question has larger implication in the database world for calculating sums and averages and has much better solutions like lazy propagation of segment trees, quad trees and octet trees.

I wont go deeper into explaining fenwick trees/ binary indexed tree as there are many blogs and videos explaining the same but I highly recommend this link to visualize the workings of it for any application. A question arises as to why use Fenwick tree and not segment tree for this question and the general rule of thumb would be its easier to code Fenwick tree and any application that can be solved using that, should make use of fenwick trees as segment trees have a wide range of applications and provide more flexibility in exchange for heavy maintainability of code/application.

The central idea around creation of Fenwick trees is to affect the minimal number of nodes during an update operation and also use minimum while calculating a metric from it. The main gist around doing this is as follows :-

Populating next/child node ->

2's complement of current node index
AND it with current node index
ADD to current node index

Calculating parent node ->

2's complement of current node index
AND it with current node index
SUBTRACT it from current node index.

For a range sum query on a two dimensional space from point (1,1) to (3,3), the calculation on the Fenwick tree can be visualized as below :-

fig 2 : 2D Range sum query

which can be formulated as -

sum(row2+1, col2+1) - sum(row1, col2+1) - sum(row2+1, col1) + sum(row1,col1) = ∑(row1, col1) to (row2,col2)

Similarly, for a range sum query on a three dimensional space from point (1,1,1) to (4,4,3), the calculation on the Fenwick tree can be visualized as below :-

fig 3 : 3D Range sum query

which can be formulated as -

sum(x2+1,y2+1,z2+1) - sum(x1,y1,z1) - sum(x1,y2+1,z2+1) - sum(x2+1,y1,z2+1) - sum(x2+1,y2+1,z1) + sum(x1,y1,z2+1) + sum(x1,y2+1,z1) + sum(x2+1,y1,z1) = ∑(x1, y1, z1) to (x2, y2, z2)

Pasting the solution code in Java below. A minor confusion can be caused when submitting solution to hackerrank as for this problem, the arrays start from 1 instead of 0 and hence an array with n as 3 goes from 1,1,1 to 3,3,3.

New Document

import java.io.*;
import java.util.*;

public class Solution {
    long[][][] tree;
    long[][][] nums;
    private int dimensions = 0;
    
    public Solution(int dimensions) {
        if (dimensions == 0) return;
        this.dimensions = dimensions;
        tree = new long[dimensions+1][dimensions+1][dimensions+1];
        nums = new long[dimensions][dimensions][dimensions];
    }
    
    public void update(int x, int y, int z, int value) {
        long delta = value - nums[x][y][z];
        nums[x][y][z] = value;
        for (int i = x + 1; i <= dimensions; i += i & (-i)) {
            for (int j = y + 1; j <= dimensions; j += j & (-j)) {
                for (int k = z + 1; k <= dimensions; k += k & (-k)) {
                    tree[i][j][k] +=  delta;
                }
            }
        }
    }
    
    public void query(int x1, int y1, int z1, int x2, int y2, int z2) {
        long result = sum(x2+1,y2+1,z2+1) - sum(x1,y1,z1) - sum(x1,y2+1,z2+1) - sum(x2+1,y1,z2+1) - sum(x2+1,y2+1,z1) + sum(x1,y1,z2+1) + sum(x1,y2+1,z1) + sum(x2+1,y1,z1);
        System.out.println(result);
    }
    
    public long sum(int x, int y, int z) {
        long sum = 0l;
        for (int i = x; i > 0; i -= i & (-i)) {
            for (int j = y; j > 0; j -= j & (-j)) {
                for (int k = z; k > 0; k -= k & (-k)) {
                    sum += tree[i][j][k];
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        /* Enter your code here. Read input from STDIN. Print output to STDOUT. Your class should be named Solution. */
        Scanner ob = new Scanner(System.in);
  Solution solution = null;
        int testcases = ob.nextInt();
        ob.nextLine();
        for (int i = 0; i < testcases; i++) {
            String numsLine = ob.nextLine();
            String[] numsLineParts = numsLine.trim().split(" ");
            int dimensions = Integer.valueOf(numsLineParts[0]);
            int numOperations = Integer.valueOf(numsLineParts[1]);
            solution = new Solution(dimensions);
            for (int j = 0; j < numOperations; j++) {
                String line = ob.nextLine();
                String[] lineParts = line.split(" ");
                if (lineParts[0].equals("UPDATE")) {
                    solution.update(Integer.valueOf(lineParts[1])-1, Integer.valueOf(lineParts[2])-1,             Integer.valueOf(lineParts[3])-1, Integer.valueOf(lineParts[4]));
                }
                if (lineParts[0].equals("QUERY")) {
                    solution.query(Integer.valueOf(lineParts[1])-1, Integer.valueOf(lineParts[2])-1,             Integer.valueOf(lineParts[3])-1, Integer.valueOf(lineParts[4])-1, Integer.valueOf(lineParts[5])-1, Integer.valueOf(lineParts[6])-1);
                }
            }
        }
    }
}

The time complexity of this solution is O(log(X) * log(Y) * log(Z)) = O(log(N))³

Some hues of color [Turnagain arm, Seward Highway]

Ice, Ice baby !! [Harding Icefield/Exit Glacier]

Crash and Burn [Holgate Glacier]

For the love of fall colors [Glenn Highway]

Cracks & Racks [Matanuska Glacier]

Ice in front of us and supposedly underneath us [Matanuska Glacier]

No savage souls around !! [Savage River, Denali National Park]

Please Mr.Fog, let us see the snow caps [Sable Pass, Denali National Park]

Just another view too good to ignore [Toklat River, Denali National Park]

Bcoz some mountains have big bottoms [Mt.Denali (center) , Denali National Park]

No effects, just some raindrops !! [Roosevelt Arch]

Looking Wyoming, Walking Montana [ Location - Gardner, Montana]

Literally get lost in Montana [Somewhere in Montana]

Different hues of glacial water [Many Glacier Road]

Imminent loss of glaciers and pertinent loss of beauty [Swiftcurrent Lake]

Grinnell Glacier in the center pouring streams into the lake [Grinnell Lake]

The Horn chillin in the midst of ice and fire [Going to the sun road]

Lush meadow, Hidden lake and over-powering mountains [Logan Pass]

Going indeed to the sun [Going-to-the-sun road]

Bear choosing the next branch to ingest [Going-to-the-sun road]

Came across an interesting question on hackerrank which goes as follows -

We have a string of length N. Can you figure out the number of occurrences of the most frequent substring in this string? We are only interested in substring of length from K to L and in each substring the number of distinct characters must not exceed M. The string contains only lower-case letters(a-z).

Constraints:
2<=N<=100000
2<=K<=L<=26,L<N
2<=M<=26

Sample Input :-
5
2 4 26 // value of K L M
abcde

Sample Output :-
1

The solution to the above problem is

-Any substring between length K and L will be inserted into the trie.

-Each trieNode also has a frequency counter which will be incremented when an already existing substring is inserted.

-While inserting, we need to check for the number of distinct characters in the given substring and not insert any substring which has distinct characters greater than m. In case, the number of distinct characters is greater than m, we return -1 as frequency from insertTrie function so as to know that the given substring was invalid and not inserted into trie.
- While inserting, we can keep track of the current maximum frequency and return that at the end as the answer.

Implementation in java is as below :-

New Document

import java.util.*;

public class MostOftenSubstring {
    TrieNode root;
    Set distinctCharacters = new HashSet();
 
    public int getHighestFrequencyOfSubstring(int n, int k, int l, int m, String s) {
        if (s == null || n == 0) {
            return 0;
        }
        if (k < 0 || l < 0 || m <= 0) {
            return 0;
        }
        int maxFrequency = 1;
        root = new TrieNode();
        for (int i = 0; i < n; ++i) {
            for (int j = k; j <= l && j <= n; ++j) {
                String substring = s.substring(i,j);
                // return -1 if distincts more than m
                int currentFrequency = insertIntoTrie(substring, m);
                if (currentFrequency == -1) 
                    break;
                maxFrequency = (currentFrequency > maxFrequency) ? currentFrequency : maxFrequency;                 
            }
            k += 1;
            l += 1;
        }
        return maxFrequency;
    }
     
     public int insertIntoTrie(String word, int m) {
         distinctCharacters.clear();
         TrieNode parent = root;
         for(char c : word.toCharArray()) {
             distinctCharacters.add(c);
             if (distinctCharacters.size() > m) {
                 return -1;
             }
             int i = c - 'a';
             if(parent.next[i] == null) 
                 parent.next[i] = new TrieNode();
             parent = parent.next[i];
         }
         if (parent.word != null) {
            parent.frequency += 1;
         } else {
             parent.word = word;
             parent.frequency = 1;
         }
         return parent.frequency;
     }

     class TrieNode {
         TrieNode[] next = new TrieNode[26];
         String word;
         int frequency = 0;
     }
}

Time Complexity is O(N.(L-K)) without considering trie insertion and O(N. (L-K)²) if considering trie insertion.

In the midst of the DeCaLiBron - Mt. Democrat to the right, Mt. Bross to the far left[Alma, Colorado]

Moment of contemplation - left (Mt.Democrat) or right(Mt.Cameron) [Cameron-Democrat saddle]

Heavy legs and tired lungs - Summit of Mt. Democrat(14154 feet !!)

Football field at 14k feet !! - Summit of Mt. Cameron

A water-filled kite on the ground - Looking down at Kite Lake Trailhead [Alma, Colorado]

Summit Lake below Mt.Evans

Looking down Mt.Evans Road and Scenic Byway [Summit of Mt. Evans]

The scramble awaits - Looking up Mt. Bierstadt summit [Clear Creek, Colorado]

Sun-lit sawtooth ridge - Looking down from summit of Mt. Bierstadt

Of fleeting lakes and perennial ridges - Summit of Mt. Bierstadt

Scary Sawtooth ridge from the side

For some lines are drawn pretty fine - The Continental Divide

Road to Aspen aint that way !! [Independence Pass, Colorado]

El famous Maroon Bells - [Maroon Bells, Colorado]

Beautiful Breck - [Main St, Breckenridge, Colorado]

Had some time to deep dive into GraphX and as it is only supported in Scala, was a good experience. Coming from experimenting on graph databases, graphx seemed an easy platform and makes one not worry about sharding/partitioning your data. The graphx paper and bob ducharme's blog seemed good starters to getting started with playing on graphx.

So I managed to get all Indian movie data for 10 different languages(Hindi, Marathi, Punjabi, Telugu, Tamil, Bhojpuri, Kannada, Gujarati, Bengali, Malayalam) with the actors and directors associated with the films.

Now came the part of building the relationships between the above entities. GraphX uses VertexId associated with each vertex in the graph which is Long(64 bit). So I went about creating the same for all entities and replacing them in the movie file for easier parsing later when creating edges(relationships). The essential part which makes GraphX super easy and super fun is that the partitioning takes place on vertices,i.e., the vertices go along with the edge when edges are on a different node but the VertexID associated with that vertex will still be the same as any other node. The essential syntax to consider is -

For Vertex -> RDD[(VertexId, (VertexObject))

In my case, the vertex RDD is -> RDD[(VertexId, (String, String))

For Edge -> Edge(srcVertexId, dstVertexId, "<edgeValueString>")

You can also have a URI or other data types as your edgeValue(can be a weight to associate to the edge when you happen to be choosing the destination to choose based on edge value).

These two when combined together in this line create the graph object for us -

val graph = Graph(entities,relationships)

Graph depiction of data

So, I ended up connecting a movie to all its actors and back, its director and back and the language the movie was in and back. This led to some half a million edges for my graph(in essence half of them are edges but I made two edges between all sets of entities and movie). Also, I added literal properties to the movie(which are not nodes) such as imdbRating and potentialAction(webUrl of Imdb for that movie). Literal properties can also be thought of adding to the vertexObject directly if the Object is a class with the properties as different variables. The code to construct and query the graph is below. The main program can be replaced by a scalatra webserver and be used for the same but I was more focused on the graph functionality.

New Document

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.io.Source
import org.json4s._
import org.json4s.jackson.JsonMethods._
import scala.collection.mutable.ListBuffer

object MoviePropertyGraph {
  val baseURI = "http://kb.rootchutney.com/propertygraph#"

  class InitGraph(val entities:RDD[(VertexId, (String, String))], val relationships:RDD[Edge[String]],
                  val literalAttributes:RDD[(Long,(String,Any))]){
    val graph = Graph(entities,relationships)

    def getPropertyTriangles(): Unit ={
          // Output object property triples
       graph.triplets.foreach( t => println(
         s"<$baseURI${t.srcAttr._1}> <$baseURI${t.attr}> <$baseURI${t.dstAttr._1}> ."
       ))
    }

    def getLiteralPropertyTriples(): Unit = {
      entities.foreach(t => println(
        s"""<$baseURI${t._2._1}> <${baseURI}role> \"${t._2._2}\" ."""
      ))
    }

    def getConnectedComponents(): Unit = {
      val connectedComponents = graph.connectedComponents()
      graph.vertices.leftJoin(connectedComponents.vertices) {
        case(id, entity, component) => s"${entity._1} is in component ${component.get}"
      }.foreach{ case(id, str) => println(str) }
    }

    def getMoviesOfActor(): Unit = {
      println("Enter name of actor:")
      val actor = Console.readLine()
      graph.triplets.filter(triplet => (triplet.srcAttr._1.equals(actor)) && triplet.attr.equals("inFilm"))
        .foreach(t => println(t.dstAttr._1))
    }

    def getMoviesOfDirector(): Unit = {
      println("Enter name of director:")
      val director = Console.readLine()
      graph.triplets.filter(triplet => (triplet.srcAttr._1.equals(director)) && triplet.attr.equals("movie"))
        .foreach(t => println(t.dstAttr._1))
    }

    def getMoviesInLanguage(): Unit = {
      println("Enter language:")
      val lang = Console.readLine()
      graph.triplets.filter(triplet => (triplet.dstAttr._1.equals(lang)) && triplet.attr.equals("inLanguage"))
        .foreach(t => println(t.srcAttr._1))
    }

    def getTopMoviesBeginningWith(): Unit = {
      println("Enter prefix/starting tokens:")
      val prefix = Console.readLine()
      graph.vertices.filter(vertex => (vertex._2._1.startsWith(prefix)) && vertex._2._2.equals
        ("movie"))
        .distinct()
        .leftOuterJoin(literalAttributes.filter(literalTriple => (literalTriple._2._1.equals("imdbRating"))))
        .sortBy(element => element._2._2.get._2.asInstanceOf[Double], false)
        .foreach(t => println(t._2._1 +" " + t._2._2.get._2.asInstanceOf[Double]))
    }

    def getActorsInMovie(): Unit = {
      println("Enter movie:")
      val movie = Console.readLine()
      graph.triplets.filter(triplet => (triplet.srcAttr._1.equals(movie)) && triplet.attr.equals("actor"))
        .foreach(t => println(t.dstAttr._1))
    }
  }

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "MoviePropertyGraph", "127.0.0.1")
    val movieFile = ""
    val actorFile = ""
    val directorFile = ""
    val languageFile = ""

    val movies = readJsonsFile(movieFile, "movie", sc)
    val actors = readCsvFile(actorFile, "actor", sc)
    val directors = readCsvFile(directorFile, "director", sc)
    val languages = readCsvFile(languageFile, "language", sc)

    val entities = movies.union(actors).union(directors).union(languages)
    val relationships: RDD[Edge[String]] = readJsonsFileForEdges(movieFile, sc)
    val literalAttributes = readJsonsFileForLiteralProperties(movieFile, sc)
    val graph = new InitGraph(entities, relationships, literalAttributes)
// Some supported functionality in graphx
//    graph.getPropertyTriangles()
//    graph.getLiteralPropertyTriples()
//    graph.getConnectedComponents()

    var input = 0
    do {
      println("1. Movies of actor \n2. Movies of director \n3. Movies in language \n4. Top Movies beginning with " +
        "\n5. Actors in movie \n6. Exit\n")
      input = Console.readInt()
      input match {
        case 1 => graph.getMoviesOfActor()
        case 2 => graph.getMoviesOfDirector()
        case 3 => graph.getMoviesInLanguage()
        case 4 => graph.getTopMoviesBeginningWith()
        case 5 => graph.getActorsInMovie()
        case 6 => "bye !!!"
        case something => println("Unexpected case: " + something.toString)
      }
    } while (input != 6);
    sc.stop
  }

  def readJsonsFile(filename: String, entityType: String, sparkContext:SparkContext):  RDD[(VertexId, (String, String))] = {
    try {
     val entityBuffer = new ListBuffer[(Long,(String,String))]
      for (line <- case="" catch="" classof="" entitybuffer.append="" entitybuffer="" entitytype="" ex:="" exception="" filename="" getlines="" head.tolong="" head="" jsondocument="" movieid="" name="" return="" source.fromfile="" sparkcontext.parallelize="" tring="" val=""> println("Exception in reading file:" + filename + ex.getLocalizedMessage)
        return sparkContext.emptyRDD
    }
  }

  def readCsvFile(filename: String, entityType: String, sparkContext:SparkContext): RDD[(VertexId, (String, String))] = {
    try {
      val entityBuffer = new ListBuffer[(Long,(String,String))]
      for (line <- -="" .map="" .split="" 1="" array="" case="" entitybuffer.appendall="" filename="" getlines="" k="" line.length="" line.substring="" source.fromfile="" split="" v=""> (k.substring(2, k.length-1), v.substring(1, v.length).toLong)}
          .map { case(k,v) => (v,(k,entityType))})
      }
      return sparkContext.parallelize(entityBuffer)
    } catch {
      case ex: Exception => println("Exception in reading file:" + filename + ex.getLocalizedMessage)
        return sparkContext.emptyRDD
    }
  }

  def readJsonsFileForEdges(filename: String, sparkContext:SparkContext): RDD[Edge[String]] = {
    try {
      val relationshipBuffer = new ListBuffer[Edge[String]]
      for (line <- actor.tolong="" actor="" actors="" case="" catch="" classof="" dge="" director="" ex:="" exception="" filename="" for="" getlines="" head.tolong="" infilm="" inlanguage="" jsondocument="" movie="" movieid="" nt="" println="" relationshipbuffer.append="" relationshiprdd.count="" relationshiprdd="" return="" source.fromfile="" tring="" val=""> println("Exception in reading file:" + filename + ex.getLocalizedMessage)
        return sparkContext.emptyRDD
    }
  }

  def readJsonsFileForLiteralProperties(filename: String, sparkContext:SparkContext): RDD[(Long,(String,Any))] = {
    try {
      val literalsBuffer = new ListBuffer[(Long,(String,Any))]
      for (line <- case="" catch="" classof="" ex:="" exception="" filename="" getlines="" head.tolong="" head="" imdbrating="" jsondocument="parse(line)" literalsbuffer.append="" literalsbuffer="" movieid="" ouble="" potentialaction="" return="" source.fromfile="" sparkcontext.parallelize="" tring="" val=""> println("Exception in reading file:" + filename + ex.getLocalizedMessage)
        return sparkContext.emptyRDD
    }
  }
}

The function "getTopMoviesBeginningWith" is the coolest of all functions as it sorts the entities based on literal properties(imdb) joined into the same RDD and provides a lame autosuggest for movies starting with a prefix.
My actors, director and languages files were in the form of csv containing name and vertexID values that I had pre-generated. For example,
'shah rukh khan': 75571848106
'subhash ghai': 69540999294
'hindi': 10036785003, 'telugu': 61331709614
My movie file was in the form of a json containing all this connected data. For example -
{"inLanguage": 10036785003,
"director": 52308324411,
"movieId": "0112870",
"potentialAction": "http://www.imdb.com/title/tt0112870/",
"name": "dilwale dulhania le jayenge",
"imdbRating": 8.3,
"actors": [75571848106, 19194912859, 79050811553, 37409465222, 44878963029, 10439885041, 85959578654, 31779460606, 39613049947, 29336046116, 64948799477, 71910110134, 1911950262, 19092081589, 77693233734]
}

If you happen to be needing these files, do let me know and I can email them. Last but not the least, I found building my project using SBT really easier than non SBT method as it gave a gradle like method to build projects.
Here is my sample sdt file that I used to build the project -

name := "PropertyGraph"

version := "1.0"

scalaVersion := "2.10.4"

connectInput in run := true

// additional libraries
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.5.0",
  "org.apache.spark" %% "spark-hive" % "1.5.0",
  "org.apache.spark" %% "spark-streaming" % "1.5.0",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.5.0",
  "org.apache.spark" %% "spark-streaming-flume" % "1.5.0",
  "org.apache.spark" %% "spark-mllib" % "1.5.0",
  "org.apache.spark" %% "spark-graphx" % "1.5.0",
  "org.apache.commons" %% "commons-lang3" % "3.0",
  "org.eclipse.jetty"  %% "jetty-client" % "8.1.14.v20131031",
  "org.json4s" %% "json4s-native" % "{latestVersion}",
  "org.json4s" %% "json4s-jackson" % "{latestVersion}"
)

The experimentation did not give me a real time computation(like Neo4j or Virtuoso graph databases) of the graph as spark processing takes time on a single spark node and I safely assume increasing number of nodes will increase the time due to parallel distribution and fetching. I hope this continues to get better with future versions of spark(soon to be out 1.6 !!)