# Scrape CSV, XML and JSON

PHPScraper can process common plain-text file types such as CSV, JSON, and XML, either from strings or from URLs. Most functionality described below works for all three types; special cases are noted.

# Parsing of CSV/XML/JSON strings

If you have a string that contains CSV, XML, or JSON data, PHPScraper can assist in validating and parsing it:

$web = new \Spekulatius\PHPScraper\PHPScraper;

// Parse a JSON string
$json = $web->parseJson($jsonString);

// Parse an XML string
$xml = $web->parseXml($xmlString);

// Parse a CSV string
$csv = $web->parseCsv($csvString);

This can be useful when chaining steps or accessing embedded elements such as schema data.
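As an illustration of the embedded schema data case, the sketch below pulls a JSON-LD block out of an HTML string and decodes it. It uses only PHP built-ins (`DOMDocument`, `DOMXPath`, `json_decode`) so that it runs on its own; the sample HTML is made up for this example, and with PHPScraper the extracted string could presumably be handed to `parseJson` instead of `json_decode`.

```php
<?php
// Made-up sample page with embedded schema.org data (JSON-LD).
$html = <<<'HTML'
<html><head>
<script type="application/ld+json">{"@type": "Article", "headline": "Example"}</script>
</head><body></body></html>
HTML;

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Locate the first JSON-LD script element.
$xpath = new DOMXPath($dom);
$node = $xpath->query('//script[@type="application/ld+json"]')->item(0);

// Decode the embedded JSON string into an associative array.
$schema = json_decode($node->textContent, true, 512, JSON_THROW_ON_ERROR);

echo $schema['headline'], PHP_EOL;
```

`JSON_THROW_ON_ERROR` makes invalid JSON raise a `JsonException` rather than silently returning `null`.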

# Fetching and Parsing of CSV/XML/JSON URLs

PHPScraper can assist with fetching and parsing the contents of remote resources (URLs) containing JSON, CSV, or XML data:

$web = new \Spekulatius\PHPScraper\PHPScraper;

// Fetches URL and parses contents to JSON.
$json = $web
    ->parseJson('https://test-pages.phpscraper.de/index.json');

// Fetches URL and parses contents to XML.
$xml = $web
    ->parseXml('https://test-pages.phpscraper.de/sitemap.xml');

// Fetches URL and parses contents into a simple array.
$csv = $web
    ->parseCsv('https://test-pages.phpscraper.de/test.csv');

// Fetches URL and generates an associative array (map) with the first line as keys.
$csv = $web
    ->parseCsvWithHeader('https://test-pages.phpscraper.de/test.csv');

Each of the methods above can be called in several ways. Using parseCsv as an example, you can use any of the methods as follows:

$web = new \Spekulatius\PHPScraper\PHPScraper;

// Option 1: Pass in the absolute URL
$csv = $web
    ->parseCsv('https://test-pages.phpscraper.de/test.csv');

// Option 2: Navigate to a relative URL for parsing.
$csv = $web
    ->go('https://test-pages.phpscraper.de/meta/feeds.html')
    ->parseCsv('/test.csv');

// Option 3: Navigate with `go` or `clickLink` and call the parser.
$csv = $web
    ->go('https://test-pages.phpscraper.de/test.csv')
    ->parseCsv();

Multiple Methods

The examples above apply to the following methods:

  • parseJson
  • parseXml
  • parseCsv
  • parseCsvWithHeader (resolves into an associative array)

# Parsing a CSV String with Headers

CSV can be parsed into various data structures. PHPScraper comes with two options built-in to parse CSV. Given the following example file:

$ curl https://test-pages.phpscraper.de/test.csv

date,value
1945-02-06,4.20
1952-03-11,42

The standard parser parseCsv returns a simple array with cast values:

$web = new \Spekulatius\PHPScraper\PHPScraper;

print_r(
    $web->parseCsv('https://test-pages.phpscraper.de/test.csv')
);
/**
 * [
 *     ['date', 'value'],
 *     ['1945-02-06', 4.20],
 *     ['1952-03-11', 42],
 * ]
 */

parseCsvWithHeader uses the first line as headers and returns each remaining row as an associative array (map):

$web = new \Spekulatius\PHPScraper\PHPScraper;

print_r(
    $web->parseCsvWithHeader('https://test-pages.phpscraper.de/test.csv')
);

/**
 * [
 *      ['date' => '1945-02-06', 'value' => 4.20],
 *      ['date' => '1952-03-11', 'value' => 42],
 * ]
 */

Type Casting

Values that look like native types, such as int and float, are automatically cast to the corresponding PHP-native types.
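The header mapping and type casting described above can be approximated in plain PHP. The following is a minimal sketch, not PHPScraper's actual implementation: `castValue` and `parseCsvWithHeaderSketch` are hypothetical helper names, and the casting rules (via `filter_var`) are an assumption about what "automatic casting" means here.

```php
<?php
// Hypothetical helper: cast numeric-looking strings to int or float.
function castValue(string $value)
{
    if (filter_var($value, FILTER_VALIDATE_INT) !== false) {
        return (int) $value;
    }
    if (filter_var($value, FILTER_VALIDATE_FLOAT) !== false) {
        return (float) $value;
    }
    return $value; // Anything else stays a string.
}

// Hypothetical helper: use the first CSV line as keys for all other rows.
// Splits on line breaks, so quoted fields containing newlines are not supported.
function parseCsvWithHeaderSketch(string $csv): array
{
    $lines = preg_split('/\R/', trim($csv));
    $header = str_getcsv(array_shift($lines), ',', '"', '\\');

    return array_map(
        // array_combine() expects each row to have as many fields as the header.
        fn (string $line) => array_combine(
            $header,
            array_map('castValue', str_getcsv($line, ',', '"', '\\'))
        ),
        $lines
    );
}

$rows = parseCsvWithHeaderSketch("date,value\n1945-02-06,4.20\n1952-03-11,42\n");
print_r($rows);
```

Here the dates remain strings, `4.20` becomes a float, and `42` becomes an int, which matches the shape of the output shown earlier.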

# Providing CSV Parsing Parameters

You might want to define which separator, enclosure, and escape characters to use. You can do so by passing them as additional arguments:

$web = new \Spekulatius\PHPScraper\PHPScraper;

// Direct access:
$csv = $web
    ->parseCsv('https://test-pages.phpscraper.de/test-custom.csv', '|', '"');

// Alternative syntax using `go` first:
$csv = $web
    ->go('https://test-pages.phpscraper.de/test-custom.csv')
    ->parseCsv(null, '|', '"');
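To show what the separator and enclosure arguments control, here is a small sketch using PHP's built-in `str_getcsv`, which takes the same kind of parameters. The sample line is invented for illustration and is not the content of `test-custom.csv`.

```php
<?php
// Invented sample line: pipe-separated, with one quoted field that
// itself contains the separator character.
$line = 'date|"value | unit"|4.20';

// Arguments: input, separator, enclosure, escape.
$fields = str_getcsv($line, '|', '"', '\\');

print_r($fields);
```

The enclosure keeps `value | unit` together as a single field even though it contains the separator, so the line yields three fields rather than four.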