# Scrape Links
Scraping links works much like scraping images. You can retrieve a plain list of URLs without any additional information, or a detailed list that includes `rel`, `target`, and other attributes.
# Simple Link List
The following example parses a web page for links and returns an array of absolute URLs:
$web = new \Spekulatius\PHPScraper\PHPScraper;
/**
* Navigate to the test page. It contains 6 links to placekitten.com with different attributes:
*
* <h2>Different ways to wrap the attributes</h2>
* <p><a href="https://placekitten.com/408/287" target=_blank>external kitten</a></p>
* <p><a href="https://placekitten.com/444/333" target="_blank">external kitten</a></p>
* <p><a href="https://placekitten.com/444/321" target='_blank'>external kitten</a></p>
*
* <h2>Named frame/window/tab</h2>
* <p><a href="https://placekitten.com/408/287" target=kitten>external kitten</a></p>
* <p><a href="https://placekitten.com/444/333" target="kitten">external kitten</a></p>
* <p><a href="https://placekitten.com/444/321" target='kitten'>external kitten</a></p>
*/
$web->go('https://test-pages.phpscraper.de/links/target.html');
// Print the number of links.
echo "This page contains " . count($web->links) . " links.\n\n";
// Loop through the links
foreach ($web->links as $link) {
    echo " - " . $link . "\n";
}
/**
* Combined this will print out:
*
* This page contains 6 links.
*
* - https://placekitten.com/408/287
* - https://placekitten.com/444/333
* - https://placekitten.com/444/321
* - https://placekitten.com/408/287
* - https://placekitten.com/444/333
* - https://placekitten.com/444/321
*/
If the page contains no links, an empty array is returned.
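Because the result is an ordinary PHP array, standard array functions can post-process it. The sketch below removes the duplicate URLs seen in the output above. It uses a hard-coded copy of that output so it runs without network access; in real use, `$links` would be assigned from `$web->links`.

```php
<?php
// Sample data copied from the documented output above. In real use this
// would be `$links = $web->links;` after calling `$web->go(...)`.
$links = [
    'https://placekitten.com/408/287',
    'https://placekitten.com/444/333',
    'https://placekitten.com/444/321',
    'https://placekitten.com/408/287',
    'https://placekitten.com/444/333',
    'https://placekitten.com/444/321',
];

// array_unique() keeps the first occurrence of each value but preserves
// the original keys, so array_values() reindexes the result from 0.
$uniqueLinks = array_values(array_unique($links));

echo count($links) . " links, " . count($uniqueLinks) . " unique.\n";
// Prints: 6 links, 3 unique.
```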
# Links with Details
If you need more details, you can access them in the same way as for images. The example below retrieves the detailed data for the first link on the page:
$web = new \Spekulatius\PHPScraper\PHPScraper;
/**
* Navigate to the test page. This page contains several links with different rel attributes. To save space, only the first one is shown here:
*
* <a href="https://placekitten.com/432/287" rel="nofollow">external kitten</a>
*/
$web->go('https://test-pages.phpscraper.de/links/rel.html');
// Get the first link on the page.
$firstLink = $web->linksWithDetails[0];
/**
* $firstLink now contains:
*
* [
* 'url' => 'https://placekitten.com/432/287',
* 'protocol' => 'https',
* 'text' => 'external kitten',
* 'title' => null,
* 'target' => null,
* 'rel' => 'nofollow',
* 'isNofollow' => true,
* 'isUGC' => false,
* 'isNoopener' => false,
* 'isNoreferrer' => false,
* ]
*/
If you require more data, you will either need to extend the library or submit an issue for consideration.
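The boolean flags in the detailed result make filtering straightforward. The following is a minimal sketch that keeps only links marked `nofollow`. The helper `filterNofollow()` is hypothetical and not part of PHPScraper, and the second sample entry is invented for illustration; only the first entry mirrors the structure documented above. In real use, the input would be `$web->linksWithDetails`.

```php
<?php
/**
 * Hypothetical helper (not part of PHPScraper): returns only the entries
 * whose 'isNofollow' flag is true, reindexed from 0.
 */
function filterNofollow(array $linksWithDetails): array
{
    return array_values(array_filter(
        $linksWithDetails,
        fn (array $link): bool => ($link['isNofollow'] ?? false) === true
    ));
}

// Sample data. The first entry follows the documented structure; the
// second entry is made up for this example.
$sample = [
    [
        'url' => 'https://placekitten.com/432/287',
        'protocol' => 'https',
        'text' => 'external kitten',
        'title' => null,
        'target' => null,
        'rel' => 'nofollow',
        'isNofollow' => true,
        'isUGC' => false,
        'isNoopener' => false,
        'isNoreferrer' => false,
    ],
    [
        'url' => 'https://example.com/about',
        'protocol' => 'https',
        'text' => 'about',
        'title' => null,
        'target' => null,
        'rel' => null,
        'isNofollow' => false,
        'isUGC' => false,
        'isNoopener' => false,
        'isNoreferrer' => false,
    ],
];

$nofollow = filterNofollow($sample);

foreach ($nofollow as $link) {
    echo $link['url'] . "\n";
}
// Prints: https://placekitten.com/432/287
```

The same pattern should work for the other flags (`isUGC`, `isNoopener`, `isNoreferrer`) by changing the key that is checked.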
# Internal Links and External Links
PHPScraper can also return only the internal or only the external links. The following example demonstrates both:
$web = new \Spekulatius\PHPScraper\PHPScraper;
// Navigate to the test page.
$web->go('https://test-pages.phpscraper.de/links/base-href.html');
// Get the list of internal links (in this example, the only internal link points to an image)
var_dump($web->internalLinks);
/**
* [
* 'https://test-pages.phpscraper.de/assets/cat.jpg'
* ]
*/
// Get the list of external links
var_dump($web->externalLinks);
/**
* [
* 'https://placekitten.com/408/287'
* ]
*/
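To illustrate the idea behind the internal/external split, here is a simplified sketch that compares the host of a link with the host of the page it was found on. This is an assumption made for illustration, not necessarily how PHPScraper classifies links; the helper `hasSameHost()` is hypothetical. Note that this naive comparison would treat a different subdomain of the same site as external.

```php
<?php
/**
 * Hypothetical helper (not part of PHPScraper): treats a link as internal
 * when its host equals the host of the page URL, ignoring letter case.
 */
function hasSameHost(string $linkUrl, string $pageUrl): bool
{
    $linkHost = parse_url($linkUrl, PHP_URL_HOST);
    $pageHost = parse_url($pageUrl, PHP_URL_HOST);

    // parse_url() returns null when the URL has no host and false when
    // the URL is seriously malformed; neither can match.
    if (!is_string($linkHost) || !is_string($pageHost)) {
        return false;
    }

    return strcasecmp($linkHost, $pageHost) === 0;
}

$page = 'https://test-pages.phpscraper.de/links/base-href.html';

var_dump(hasSameHost('https://test-pages.phpscraper.de/assets/cat.jpg', $page)); // bool(true)
var_dump(hasSameHost('https://placekitten.com/408/287', $page));                 // bool(false)
```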