Hobo Script II - Using PHP to scrape data from multiple sites, and combine it into one list.  
     
      By Brendan CM <verifex (a t) g m a i l (0) com>  
     
  Most PHP scripts start out the same way: a small tool, then a medium-sized tool, and soon you have a full-blown web app written entirely in PHP. Well, as of now this script is still in its baby stages. It pulls data about public proxy servers from multiple sites, then combines the results into one big list, even though the sites do not all format their listings the same way.  
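  To make the "different sites, different layouts" idea concrete, here is a minimal sketch of trying one pattern after another until one matches. The helper name extract_proxies(), the sample pages, and the documentation-range IP addresses are all made up for illustration, and the table pattern is a simplified stand-in rather than the exact regex the script uses.

```php
<?php
// Hypothetical helper: try each pattern in order and return the rows
// from the first one that matches anything on the page.
function extract_proxies(string $page, array $patterns): array
{
    foreach ($patterns as $pattern) {
        if (preg_match_all($pattern, $page, $hits, PREG_SET_ORDER) > 0) {
            $rows = array();
            foreach ($hits as $hit) {
                $rows[] = array(
                    'ip'      => $hit[1],
                    'port'    => $hit[2],
                    'type'    => trim($hit[3]),
                    'country' => trim($hit[4]),
                );
            }
            return $rows;
        }
    }
    return array(); // no layout recognised
}

$patterns = array(
    // Table layout: one <td> per field (simplified, an assumption).
    '/<td[^>]*>(\d+\.\d+\.\d+\.\d+)<\/td>\s*<td[^>]*>(\d+)<\/td>\s*<td[^>]*>([a-z +]+)<\/td>\s*<td[^>]*>([a-zA-Z0-9() ]+)<\/td>/',
    // Plain-text layout: "ip:port type Country".
    '/([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+):([0-9]+)[ ]+([a-z \+]+)([A-Z][A-Za-z]+)/ms',
);

// Made-up sample pages, one per layout.
$tablePage = "<tr><td>192.0.2.10</td>\n <td>8080</td>\n <td>anonymous</td>\n <td>Germany</td></tr>";
$textPage  = "198.51.100.7:3128 elite proxy France";

foreach (array($tablePage, $textPage) as $page) {
    foreach (extract_proxies($page, $patterns) as $row) {
        echo implode(' | ', $row), "\n";
    }
}
```

  Both sample pages end up as rows with the same four fields, which is what makes it possible to merge them into one list later on.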
     
       
<?PHP

  // Use my good old friend the HttpClient class
  include("HttpClient.class.php");

  // Here is an array of public proxy lists
  $links = array(
    "http://www.publicproxyservers.com/page1.html",
    "http://www.proxy4free.com/page1.html",
    "http://www.anonymitychecker.com/page1.html",
    "http://www.samair.ru/proxy/index.htm",
    "http://www.samair.ru/proxy/proxy-20.htm"
  );

  // A very simple regular expression to pull the domain out of a URL
  $pattern3 = '/http\:\/\/([a-z-\.0-9]+)\//U';

  // Here is the first regex to pull info from a proxy site (table layout).
  // This pulls in the IP, port, type and country.
  $pattern = '/<td.*>([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)<\/td>\n[ ]+<td.*>([0-9]+)<\/td>\n[ ]+ <td.*>([a-z \+]+)<\/td>\n[ ]+<td.*>([a-zA-Z0-9\(\) ]+)<\/td>/U';

  // Here is the second regex to pull info from a proxy site (plain "ip:port" layout).
  // This pulls in the same four fields.
  $pattern2 = '/([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+):([0-9]+)[ ]+([a-z \+]+)([A-Z][A-Za-z]+)/ms';

  // Combined results: [1] = IPs, [2] = ports, [3] = types, [4] = countries.
  // Initialised up front so the final count works even if nothing was found.
  $prox = array(1 => array(), 2 => array(), 3 => array(), 4 => array());

  foreach ($links as $id1 => $link) {
    preg_match_all($pattern3, $link, $domain);

    // Make sure each link is going to a new domain
    $client = new HttpClient($domain[1][0]);
    $client->setDebug(false);
    $client->setPersistCookies(true);
    $client->timeout = 20;
    $client->max_redirects = 50;
    // Each domain needs its own cookie host
    $client->cookie_host = $domain[1][0];
    $client->setUserAgent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.3a) Gecko/20021207');
    $client->get($link);
    $proxypage = $client->getContent();

    preg_match_all($pattern, $proxypage, $proxylist);

    // The [1] record of proxylist contains results; if it is empty the
    // first regex failed, so go to the second one.
    if (count($proxylist[1]) == 0) {
      unset($proxylist);
      preg_match_all($pattern2, $proxypage, $proxylist);
    }

    // Run through all the results found from the raw html.
    for ($i = 0; $i < count($proxylist[1]); $i++) {
      // If the type does not include "anony" or "elite", don't add it
      if (strpos($proxylist[3][$i], "anony") !== FALSE || strpos($proxylist[3][$i], "elite") !== FALSE) {
        // The in_array function of PHP can save time searching for duplicates in an array
        if (!in_array($proxylist[1][$i], $prox[1])) {
          $prox[1][] = $proxylist[1][$i];
          $prox[2][] = $proxylist[2][$i];
          $prox[3][] = $proxylist[3][$i];
          $prox[4][] = $proxylist[4][$i];
        }
      }
    }
  }

  // After all links have been processed, count the final array, and display results.
  $max = count($prox[1]);
  for ($i = 0; $i < $max; $i++) {
    echo "ip: ".$prox[1][$i]." ";
    echo "port: ".$prox[2][$i]." ";
    echo "type: ".$prox[3][$i]." ";
    echo "country: ".$prox[4][$i]."<br>";
  }
  echo $max."<br>";
?>
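  The in_array() check rescans the whole IP list for every candidate, which is fine for a few pages of proxies. As the list grows, one alternative (not what the script above does) is to key the combined list by IP so the duplicate test becomes a single isset(). A rough sketch, where merge_proxies() and the sample rows are hypothetical:

```php
<?php
// Hypothetical helper: keep only anonymous/elite rows and skip IPs
// that are already in the merged list. The list is keyed by IP.
function merge_proxies(array $merged, array $rows): array
{
    foreach ($rows as $row) {
        $wanted = strpos($row['type'], 'anony') !== false
               || strpos($row['type'], 'elite') !== false;
        if ($wanted && !isset($merged[$row['ip']])) {
            $merged[$row['ip']] = $row;
        }
    }
    return $merged;
}

// Made-up rows standing in for the results from two different sites.
$siteA = array(
    array('ip' => '192.0.2.10', 'port' => '8080', 'type' => 'anonymous',   'country' => 'Germany'),
    array('ip' => '192.0.2.11', 'port' => '80',   'type' => 'transparent', 'country' => 'Brazil'),
);
$siteB = array(
    array('ip' => '192.0.2.10',   'port' => '8080', 'type' => 'anonymous',   'country' => 'Germany'),
    array('ip' => '198.51.100.7', 'port' => '3128', 'type' => 'elite proxy', 'country' => 'France'),
);

$merged = merge_proxies(array(), $siteA);
$merged = merge_proxies($merged, $siteB);

foreach ($merged as $p) {
    echo "ip: {$p['ip']} port: {$p['port']} type: {$p['type']} country: {$p['country']}\n";
}
echo count($merged), "\n";
```

  The trade-off is that the result is one array of rows instead of four parallel arrays, so the display loop changes shape as well.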
       
   
     
 
Creative Commons: Some Rights Reserved