PHP DOMDocument: how to make sure I get 2 elements that are an h3 + table one after another, and not h3 + p

I wrote this piece of code, which parses page templates whose content looks like this:

    <h3>Title 1</h3>
    <table>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
    </table>
    <h3>Title 2</h3>
    <table>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
    </table>
    <h3>Title 3</h3>
    <table>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
    </table>
    <h3>Title 4</h3>
    <table>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
    </table>

This is my code:

    $url = "http://www.example.com";

    // Set CURL
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_ENCODING, 1);
    $res = curl_exec($ch);
    curl_close($ch);

    // Get XML parsed page.
    $dom = new DOMDocument;
    $dom->loadXML($res);

    // Get relevant HTML page part from the XML and turn to objects
    $page = new DOMDocument;
    $page->loadHTML($dom->documentElement->textContent);
    $xpath = new DomXpath($page);

    $tablesum = $page->getElementsByTagName('h3')->length;
    $title = $page->getElementsByTagName('h3');
    $tbl = $page->getElementsByTagName('table');

    // Valid indexes run from 0 to $tablesum - 1, so the bound is "<", not "<="
    for ($i = 0; $i < $tablesum; $i++) {
        // Set title + code Var
        $theTitle = $title->item($i)->textContent;
        $theTitle = str_replace("[edit]", "", $theTitle);
        echo "<b>".$theTitle."</b>";
        echo "<br/>";

        // The same index is used for the title list and the table list
        $currentTable = $tbl->item($i);

        // Foreach table row
        foreach ($currentTable->getElementsByTagName('tr') as $tr) {
            // Get tds
            $tds = $tr->getElementsByTagName('td');
            var_dump($tds->item(0));
            if ($tds->item(0)) {
                $output1 = $tds->item(0)->textContent;
                echo "<u>output1</u>: ".$output1;
                echo "<br/>";
                $output2 = $tds->item(1)->textContent;
                echo "<u>output2</u>: ".$output2;
                echo "<br/>";
                $output3 = $tds->item(2)->textContent;
                echo "<u>output3</u>: ".$output3;
                echo "<br/>";
                $output4 = $tds->item(3)->textContent;
                echo "<u>output4</u>: ".$output4;
                echo "<br/>";
                $output5 = $tds->item(4)->textContent;
                echo "<u>output5</u>: ".$output5;
                echo "<br/>";
                $output6 = $tds->item(5)->textContent;
                echo "<u>output6</u>: ".$output6;
                echo "<br/>";
            }
            echo "<br/>";
        }
        echo "<br/><br/>";
    }

So my output should look something like this:

    title 1
    output1: 1st td content from under title1
    output2: 2nd td content from under title1
    output3: 3rd td content from under title1
    output4: 4th td content from under title1
    output5: 5th td content from under title1

    output1: 1st td content from under title1
    output2: 2nd td content from under title1
    output3: 3rd td content from under title1
    output4: 4th td content from under title1
    output5: 5th td content from under title1

    output1: 1st td content from under title1
    output2: 2nd td content from under title1
    output3: 3rd td content from under title1
    output4: 4th td content from under title1
    output5: 5th td content from under title1

    title 2
    output1: 1st td content from under title2
    output2: 2nd td content from under title2
    output3: 3rd td content from under title2
    output4: 4th td content from under title2
    output5: 5th td content from under title2

    output1: 1st td content from under title2
    output2: 2nd td content from under title2
    output3: 3rd td content from under title2
    output4: 4th td content from under title2
    output5: 5th td content from under title2

    output1: 1st td content from under title2
    output2: 2nd td content from under title2
    output3: 3rd td content from under title2
    output4: 4th td content from under title2
    output5: 5th td content from under title2

    title 3
    output1: 1st td content from under title3
    output2: 2nd td content from under title3
    output3: 3rd td content from under title3
    output4: 4th td content from under title3
    output5: 5th td content from under title3

    output1: 1st td content from under title3
    output2: 2nd td content from under title3
    output3: 3rd td content from under title3
    output4: 4th td content from under title3
    output5: 5th td content from under title3

    output1: 1st td content from under title3
    output2: 2nd td content from under title3
    output3: 3rd td content from under title3
    output4: 4th td content from under title3
    output5: 5th td content from under title3

and so on…

All of my echoes looked fine until I noticed that some h3 elements in my HTML are followed by a <p> instead of a <table>. The <p> is not parsed, so the titles and the table outputs get out of step: a title is printed with the contents of a table that belongs to a later title.

A small example:

    <h3>Title 1</h3>
    <table>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
    </table>
    <h3>Title 2</h3>
    <p> this is not supposed to be parsed and title should be left alone with an empty echo or no echo. </p>
    <h3>Title 3</h3>
    <table>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
    </table>
    <h3>Title 4</h3>
    <table>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
        <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
    </table>

I am currently using $i as the index into both the h3 list and the table list, which would work if I did not get any <p> tags in my HTML.
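To make the mismatch concrete, here is a minimal sketch on a hypothetical, shortened version of the sample markup (the `$html` string and the cell values are made up for illustration). It only counts the two node lists and shows which table a shared index ends up pointing at.

```php
<?php
// Hypothetical, shortened copy of the sample markup: "Title 2" is followed by a <p>.
$html = '<h3>Title 1</h3><table><tr><td>a</td></tr></table>'
      . '<h3>Title 2</h3><p>no table here</p>'
      . '<h3>Title 3</h3><table><tr><td>c</td></tr></table>';

$page = new DOMDocument;
$page->loadHTML($html);

$titles = $page->getElementsByTagName('h3');
$tables = $page->getElementsByTagName('table');

// Three titles but only two tables, so the two lists cannot share one index.
$titleCount = $titles->length;
$tableCount = $tables->length;

// Index 1 pairs "Title 2" with the table that actually follows "Title 3".
$pairedTitle = $titles->item(1)->textContent;
$pairedCell  = $tables->item(1)->getElementsByTagName('td')->item(0)->textContent;

echo "$titleCount titles, $tableCount tables\n";
echo "$pairedTitle is paired with cell: $pairedCell\n";
```

From the first `<p>` onward, every title is matched with the table of a later title, and the last index has no table at all.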

As far as I know, I can't grab 2 elements at once using DOMDocument.

I would prefer to still echo the titles that have no table after them (perhaps echoing "no table here") and to avoid the overlap that occurs.
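One possible way to get that behaviour, sketched under assumptions rather than offered as the definitive fix: walk the h3 elements and, for each one, ask DOMXPath for its immediately following sibling element, keeping it only if it is a table. The `$html` string, the cell values and the `$lines` array are hypothetical and exist only for the demonstration. The XPath expression `following-sibling::*[1][self::table]` is standard XPath 1.0.

```php
<?php
// Hypothetical, shortened copy of the sample markup: "Title 2" is followed by a <p>.
$html = '<h3>Title 1</h3><table><tr><td>a</td><td>b</td></tr></table>'
      . '<h3>Title 2</h3><p>no table here</p>'
      . '<h3>Title 3</h3><table><tr><td>c</td></tr></table>';

$page = new DOMDocument;
$page->loadHTML($html);
$xpath = new DOMXPath($page);

$lines = [];
foreach ($page->getElementsByTagName('h3') as $h3) {
    $title = trim(str_replace('[edit]', '', $h3->textContent));

    // The element right after this <h3>, kept only if it is a <table>.
    $table = $xpath->query('following-sibling::*[1][self::table]', $h3)->item(0);

    if ($table === null) {
        // The title is followed by something else (for example a <p>).
        $lines[] = "$title: no table here";
        continue;
    }

    // Collect the cell texts of every row of the table that belongs to this title.
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        $lines[] = $title . ': ' . implode(', ', $cells);
    }
}

echo implode("\n", $lines), "\n";
```

Because each table is looked up relative to its own h3, a title without a table no longer shifts the titles that come after it.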