aashishdugar/PageRank-Algorithm

PageRank Algorithm (using Spark)

This is a Python script that implements the PageRank algorithm using Spark.

  • The input file and the number of iterations are taken from command-line arguments. Run the script as:

    python pagerank.py [input-file] [no. of iterations]
    
  • The Spark implementation begins by collecting the pairs of URLs and grouping together the pairs that share the same key.

  • Each pair is initialized with a weight of 1.

  • For the number of iterations read from the command-line arguments, each key's weight is divided among the keys it links to, and the shares that each key receives are added up.

  • A small "damping" step handles spider traps. In short, spider traps are dead ends or infinite loops where the iteration gets caught in a repetitive chain or cannot move on to another point or key in the algorithm. Damping uses a constant d, which is normally d = 0.85, but I've assumed 0.80. The line of code for this is:

    ranks = urls_joined.reduceByKey(add).mapValues(lambda rank: rank * 0.80 + 0.20)
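
The steps above can be sketched in plain Python, without Spark, to make the arithmetic easier to follow. This is only an illustration and not the code in pagerank.py: the function name `pagerank`, the `edges` sample data, and the choice to skip keys that have no current rank (mimicking an inner join) are all assumptions.

```python
from collections import defaultdict

DAMPING = 0.80  # value quoted in this README; 0.85 is the more common choice


def pagerank(pairs, iterations, damping=DAMPING):
    """Plain-Python sketch of the steps above (hypothetical helper, no Spark)."""
    # Group destination URLs under the URL that links to them (the "key").
    links = defaultdict(list)
    for src, dst in pairs:
        links[src].append(dst)

    # Every key starts with a weight of 1.
    ranks = {src: 1.0 for src in links}

    for _ in range(iterations):
        contribs = defaultdict(float)
        for src, dsts in links.items():
            # Assumption: keys without a current rank contribute nothing,
            # similar to what an inner join of links and ranks would do.
            if src not in ranks:
                continue
            # Split the key's weight evenly among the URLs it links to.
            share = ranks[src] / len(dsts)
            for dst in dsts:
                contribs[dst] += share
        # Damping step: rank * d + (1 - d), i.e. rank * 0.80 + 0.20 here.
        ranks = {url: total * damping + (1 - damping)
                 for url, total in contribs.items()}
    return ranks


# Hypothetical sample graph: a -> b, a -> c, b -> c, c -> a
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
result = pagerank(edges, iterations=1)
```

After one iteration on this sample graph, `a` receives all of `c`'s weight, `b` receives half of `a`'s, and `c` receives half of `a`'s plus all of `b`'s, giving ranks of approximately 1.0, 0.6, and 1.4 once damping is applied.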
    
