Removing similar elements from an array

    Array
    (
        [0] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
        [1] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
        [2] => The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.
        [3] => Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
        [4] => The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.
        [5] => For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:
        [6] => The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.
        [7] => The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.
        [8] => For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:
        [9] => The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system's base warranty and any Lenovo warranty upgrade.
    )

These are not quite the duplicates that array_unique can remove. Some elements are made obsolete by another element that contains exactly the same data plus more, and sometimes a few words differ.

How can I filter them out?


First of all, the problem is not that simple and not well enough specified: you do not want to remove identical elements, you want to remove similar ones, so your first problem is deciding which elements count as similar.

Since the similarity can occur anywhere in the string, it is not enough to check that two strings start with the same characters. For example, take these two sentences (adapted from your question):

    Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.

    The rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.

They are very similar, yet they do not start with the same string. One way to measure their similarity is the Smith-Waterman algorithm, for which a PHP implementation exists.
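To make the point concrete, here is a minimal sketch using shortened versions of the two sentences (the shortening is mine, not from the question). It assumes PHP's built-in similar_text() as the similarity measure; the variable names are illustrative only.

```php
<?php
// Two near-identical sentences that differ only at the very beginning.
$a = 'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo.';
$b = 'The rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo.';

// A prefix test fails: neither string starts with the other.
$isPrefix = strpos($b, $a) === 0 || strpos($a, $b) === 0;

// similar_text() writes a similarity percentage into its third argument.
similar_text($a, $b, $percent);

var_dump($isPrefix);              // bool(false)
printf("%.2f%%\n", $percent);     // well above 90
```

A prefix check therefore misses this pair, while a similarity score catches it.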

-- Later edit --

Here is an implementation that uses PHP's built-in similar_text():

    /**
     * @param mixed $array input array
     * @param int $minSimilarity minimum similarity for an item to be removed (percentage)
     * @return array
     */
    function applyFilter ($array, $minSimilarity = 90) {
        $result = [];

        foreach ($array as $outerValue) {
            $append = true;

            foreach ($result as $key => $innerValue) {
                $similarity = null;
                // similar_text() stores the similarity percentage in $similarity
                similar_text($innerValue, $outerValue, $similarity);

                if ($similarity >= $minSimilarity) {
                    if (strlen($outerValue) > strlen($innerValue)) {
                        // always keep the longer one
                        $result[$key] = $outerValue;
                    }
                    $append = false;
                    break;
                }
            }

            if ($append) {
                $result[] = $outerValue;
            }
        }

        return $result;
    }

    $test = [
        'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
        'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
        'The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.',
        'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.',
        'The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.',
        'For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:',
        'The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.',
        'The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.',
        'For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:',
        // the apostrophe must be escaped inside a single-quoted string
        'The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system\'s base warranty and any Lenovo warranty upgrade.',
    ];

    var_dump(applyFilter($test));

-- End of later edit --

Here is the complete working code using the Smith-Waterman algorithm:

    class SmithWatermanGotoh
    {
        private $gapValue;
        private $substitution;

        /**
         * Constructs a new Smith Waterman metric.
         *
         * @param gapValue
         *            a non-positive gap penalty
         * @param substitution
         *            a substitution function
         */
        public function __construct($gapValue = -0.5, $substitution = null)
        {
            if ($gapValue > 0.0) throw new Exception("gapValue must be <= 0");
            //if(empty($substitution)) throw new Exception("substitution is required");
            if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0);
            else $this->substitution = $substitution;
            $this->gapValue = $gapValue;
        }

        public function compare($a, $b)
        {
            if (empty($a) && empty($b)) {
                return 1.0;
            }

            if (empty($a) || empty($b)) {
                return 0.0;
            }

            // best possible score: every character of the shorter string matches
            $maxDistance = min(mb_strlen($a), mb_strlen($b))
                    * max($this->substitution->max(), $this->gapValue);
            return $this->smithWatermanGotoh($a, $b) / $maxDistance;
        }

        private function smithWatermanGotoh($s, $t)
        {
            // $v0 is the previous row of the scoring matrix, $v1 the current one
            $v0 = [];
            $v1 = [];
            $t_len = mb_strlen($t);
            $max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0));

            for ($j = 1; $j < $t_len; $j++) {
                $v0[$j] = max(0, $v0[$j - 1] + $this->gapValue,
                        $this->substitution->compare($s, 0, $t, $j));

                $max = max($max, $v0[$j]);
            }

            // Find max
            for ($i = 1; $i < mb_strlen($s); $i++) {
                $v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0));

                $max = max($max, $v1[0]);

                for ($j = 1; $j < $t_len; $j++) {
                    $v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue,
                            $v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j));

                    $max = max($max, $v1[$j]);
                }

                for ($j = 0; $j < $t_len; $j++) {
                    $v0[$j] = $v1[$j];
                }
            }

            return $max;
        }
    }

    class SmithWatermanMatchMismatch
    {
        private $matchValue;
        private $mismatchValue;

        /**
         * Constructs a new match-mismatch substitution function. When two
         * characters are equal a score of <code>matchValue</code> is assigned. In
         * case of a mismatch a score of <code>mismatchValue</code>. The
         * <code>matchValue</code> must be strictly greater than
         * <code>mismatchValue</code>
         *
         * @param matchValue
         *            value when characters are equal
         * @param mismatchValue
         *            value when characters are not equal
         */
        public function __construct($matchValue, $mismatchValue)
        {
            if ($matchValue <= $mismatchValue) throw new Exception("matchValue must be > mismatchValue");

            $this->matchValue = $matchValue;
            $this->mismatchValue = $mismatchValue;
        }

        public function compare($a, $aIndex, $b, $bIndex)
        {
            // note: this indexes bytes, while the lengths above come from mb_strlen()
            return ($a[$aIndex] === $b[$bIndex] ? $this->matchValue : $this->mismatchValue);
        }

        public function max()
        {
            return $this->matchValue;
        }

        public function min()
        {
            return $this->mismatchValue;
        }
    }

    /**
     * @param mixed $array input array
     * @param int $minSimilarity minimum similarity for an item to be removed (percentage)
     * @return array
     */
    function applyFilter ($array, $minSimilarity = 90) {
        $swg = new SmithWatermanGotoh();

        $result = [];

        foreach ($array as $outerValue) {
            $append = true;

            foreach ($result as $key => $innerValue) {
                $similarity = $swg->compare($innerValue, $outerValue) * 100;

                if ($similarity >= $minSimilarity) {
                    if (strlen($outerValue) > strlen($innerValue)) {
                        // always keep the longer one
                        $result[$key] = $outerValue;
                    }
                    $append = false;
                    break;
                }
            }

            if ($append) {
                $result[] = $outerValue;
            }
        }

        return $result;
    }

    $test = [
        'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
        'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
        'The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.',
        'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.',
        'The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.',
        'For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:',
        'The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.',
        'The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.',
        'For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:',
        // the apostrophe must be escaped inside a single-quoted string
        'The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system\'s base warranty and any Lenovo warranty upgrade.',
    ];

    var_dump(applyFilter($test));

Now you only need to tune the $minSimilarity variable to your needs. For example, in your case, keeping the default of 90% removes the 1st element (it is 99.86% similar to the 2nd). Setting a lower value (80%) would also remove the 8th element.
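The effect of the threshold can be seen on a small input. The sketch below is an illustration only: filterSimilar() is a hypothetical helper that repeats the same filtering loop but scores pairs with similar_text() instead of Smith-Waterman, and the three short strings are made up for the demonstration.

```php
<?php
/**
 * Hypothetical helper: keeps one representative per group of strings whose
 * similar_text() percentage reaches $minSimilarity.
 */
function filterSimilar(array $items, float $minSimilarity): array
{
    $result = [];
    foreach ($items as $candidate) {
        $append = true;
        foreach ($result as $key => $kept) {
            similar_text($kept, $candidate, $percent);
            if ($percent >= $minSimilarity) {
                if (strlen($candidate) > strlen($kept)) {
                    $result[$key] = $candidate; // keep the longer variant
                }
                $append = false;
                break;
            }
        }
        if ($append) {
            $result[] = $candidate;
        }
    }
    return $result;
}

$items = [
    'tape drives and RAID storage',
    'tapes drives and RAID storage',
    'The adapters carry a one-year limited warranty.',
];

// A stricter threshold keeps more items; a threshold of 0 collapses everything.
foreach ([100, 90, 0] as $threshold) {
    printf("threshold %3d%% keeps %d item(s)\n", $threshold, count(filterSimilar($items, $threshold)));
}
```

At 100% all three strings survive, at 90% the two "tape" variants collapse into the longer one, and at 0% only a single string is left, so the useful range lies somewhere in between and depends on the data.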

Hope this helps!

Assuming the shorter value always appears at the very beginning of the longer one, you can do something like this:

    $arr = ["Some Text.", "Some Text. And more details."];

    foreach($arr as $key => $value) {
        // Look for the value in every element
        foreach($arr as $key2 => $value2) {
            // Remove element if its value appears at the beginning of another element
            if ($key !== $key2 && strpos($value2, $value) === 0) {
                unset($arr[$key]);
                continue 2;
            }
        }
    }

    // Re-index array
    $arr = array_values($arr);

This also works if the elements appear in the opposite order.
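To check the claim about ordering, the loop can be wrapped in a function and called with both orders. The name dropPrefixes() is hypothetical; the body is the same loop as above.

```php
<?php
/**
 * Hypothetical wrapper around the loop above: drops every element that is a
 * prefix of another element, then re-indexes the result.
 */
function dropPrefixes(array $arr): array
{
    foreach ($arr as $key => $value) {
        foreach ($arr as $key2 => $value2) {
            // strpos() === 0 means $value2 starts with $value
            if ($key !== $key2 && strpos($value2, $value) === 0) {
                unset($arr[$key]);
                continue 2;
            }
        }
    }
    return array_values($arr);
}

print_r(dropPrefixes(['Some Text.', 'Some Text. And more details.']));
print_r(dropPrefixes(['Some Text. And more details.', 'Some Text.']));
```

Both calls return only the longer string. Note that this approach does not catch strings that differ by a word in the middle.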

You can still use array_filter with your own callback: use substr_count to check whether a value occurs more than once in the joined array, that is, whether it is contained in another element.

    $input = array("a","b","c","d","ax","cz");
    $str = implode("|",array_unique($input));
    $output = array_filter($input, function($var) use ($str){
        return substr_count($str, $var) == 1;
    });
    print_r($output);
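Tracing the snippet above on its sample input shows what it keeps. The sketch wraps the same logic in a hypothetical function, keepNonSubstrings(); the separator parameter is my addition and assumes the separator never occurs inside the values.

```php
<?php
/**
 * Hypothetical helper: keeps only the values that occur exactly once in the
 * joined string, i.e. values that are not contained in any other element.
 */
function keepNonSubstrings(array $input, string $separator = '|'): array
{
    $joined = implode($separator, array_unique($input));

    return array_filter($input, function ($value) use ($joined) {
        return substr_count($joined, $value) === 1;
    });
}

// "a" also occurs inside "ax" and "c" inside "cz", so both are dropped.
$output = keepNonSubstrings(['a', 'b', 'c', 'd', 'ax', 'cz']);
print_r($output); // keys 1, 3, 4 and 5 remain: b, d, ax, cz
```

array_filter preserves the original keys, so call array_values() on the result if a re-indexed array is needed.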

"sometimes a few words differ."

As you stated, a few words may differ from one text to another. But in programming you need an exact filtering condition.

You can use a match percentage as the filtering threshold.

Here is a basic example to give you the idea.

    <?php
    $data = ["this is test","this is another test","one test","two test","this is two test"];
    $percentageMatched = 100; // Here you can put your percentage matched to delete

    for($i=0;$i<count($data)-1;$i++){
        $value = explode(" ",$data[$i]);
        /* check each word in another text */
        for($k=$i+1;$k<count($data);$k++){
            $nextArray = explode(" ",$data[$k]);
            $foundCount = 0;
            for($j=0;$j<count($value);$j++){
                if(in_array($value[$j],$nextArray)){
                    $foundCount++;
                }
            }
            $fromLine = $i;
            $toLine = $k;
            $percentage = $foundCount/count($value)*100;
            echo "EN $fromLine matched $percentage % with EN $toLine \n";
            if($percentage >= $percentageMatched){
                // blank the element so that array_filter drops it below
                $data[$i] = "";
                break;
                //array_values($data);
            }
        }
        echo ".............\n";
    }

    print_r(array_filter($data));
    ?>

Live demo: https://eval.in/706478

If the input is:

    Array
    (
        [0] => this is test
        [1] => this is another test
        [2] => one test
        [3] => two test
        [4] => this is two test
    )

it produces the following result. With a match threshold of 100%, the elements at index 0 and 3 match another element at 100% and are filtered out:

    EN 0 matched 100 % with EN 1
    .............
    EN 1 matched 25 % with EN 2
    EN 1 matched 25 % with EN 3
    EN 1 matched 75 % with EN 4
    .............
    EN 2 matched 50 % with EN 3
    EN 2 matched 50 % with EN 4
    .............
    EN 3 matched 100 % with EN 4
    .............
    Array
    (
        [1] => this is another test
        [2] => one test
        [4] => this is two test
    )
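The percentage used in that loop can be isolated into a small function, which makes its asymmetry easier to see. This is a sketch; wordOverlapPercent() is a hypothetical name, and like the example above it splits on single spaces only.

```php
<?php
/**
 * Hypothetical helper: percentage of the words of $from that also occur in $to.
 * The measure is not symmetric: it is relative to the word count of $from.
 */
function wordOverlapPercent(string $from, string $to): float
{
    $fromWords = explode(' ', $from);
    $toWords = explode(' ', $to);

    $found = 0;
    foreach ($fromWords as $word) {
        if (in_array($word, $toWords, true)) {
            $found++;
        }
    }

    return $found / count($fromWords) * 100;
}

echo wordOverlapPercent('this is test', 'this is another test'), "\n"; // 100
echo wordOverlapPercent('this is another test', 'this is test'), "\n"; // 75
```

Because the shorter text is fully covered by the longer one but not the reverse, comparing each element only against later elements, as the loop does, makes the outcome depend on the order of the array.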

Using array_filter is a good option:

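    $array1 = ["Some Text. And more details.", "Some Text.", "Other text."];
    $temp = "";

    function prefixmatch($x){
        global $temp;
        // remember the previous element, then store the current one for the next call
        $previous = $temp;
        $temp = $x;
        if ($previous === "") {
            return true; // nothing to compare the first element with
        }
        // do an optimistic linear search to determine if there's a prefix match
        $bool = true;
        for($i=0; $i < min([strlen($x), strlen($previous)]); $i++){
            $bool = $bool && ($x[$i] === $previous[$i]);
        }
        // negate the result just because of array_filter
        return(!$bool);
    }

    // only adjacent elements are compared, so the longer text must come first
    print_r(array_filter($array1, "prefixmatch"));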

I think stemming and lemmatization may also prove useful in this scenario. If we take the first two elements of the array, the only difference is the singular "tape" versus the plural "tapes".

    Array
    (
        [0] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
        [1] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
    )

If you tokenize the string and pass it through a stemmer, for example PHP Stemmer, the words "tape" and "tapes" are reduced to their stem, i.e. "tape". After stemming, you can compare the array elements. I am sure this would remove many redundant elements.
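The idea can be sketched without any library. In the code below, naiveStem() is a deliberately crude stand-in for a real stemmer (it only strips one trailing "s"), and normalize() and uniqueByStem() are hypothetical helper names; a real stemmer would handle far more cases.

```php
<?php
/**
 * Toy stand-in for a real stemmer (assumption): strips a single trailing "s"
 * from words longer than three characters, leaving words ending in "ss" alone.
 */
function naiveStem(string $word): string
{
    if (strlen($word) > 3 && substr($word, -1) === 's' && substr($word, -2) !== 'ss') {
        return substr($word, 0, -1);
    }
    return $word;
}

/** Lower-cases the text, splits it on whitespace and stems every token. */
function normalize(string $text): string
{
    $tokens = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_map('naiveStem', $tokens));
}

/** Keeps the first element of every group that normalizes to the same text. */
function uniqueByStem(array $items): array
{
    $seen = [];
    foreach ($items as $item) {
        $key = normalize($item);
        if (!isset($seen[$key])) {
            $seen[$key] = $item; // first occurrence wins
        }
    }
    return array_values($seen);
}

$items = ['servers and tape drives', 'servers and tapes drives', 'RAID storage systems'];
print_r(uniqueByStem($items)); // the "tapes" variant is dropped
```

Once the texts are normalized this way, exact comparison is enough for variants that differ only in word endings.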

You can also go one step further and lemmatize the strings. For example, in English the verb "to walk" may appear as "walk", "walked", "walks", "walking". The base form, "walk", which one might look up in a dictionary, is called the lemma for the word (from Wikipedia).

I have personally used the Stanford NLP Java library. There is also a PHP implementation, PHP-Stanford-NLP.

The solution depends on your definition of "similarity" and on your data set. It can differ a lot from one context to another.

One solution that may meet your need is cosine similarity. Here is some sample code: cosine similarity vs Hamming distance
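As a rough illustration of cosine similarity applied to text, the sketch below treats each string as a bag of words and compares the word-count vectors. The function name and the tokenization rule (split on non-word characters, lower-cased) are my assumptions, not taken from the linked code. It uses arrow functions, so it needs PHP 7.4 or later.

```php
<?php
/**
 * Cosine similarity between the word-count vectors of two strings.
 * Returns a value between 0.0 (no shared words) and 1.0 (same word counts).
 */
function cosineSimilarity(string $a, string $b): float
{
    $tokenize = fn(string $s) => preg_split('/\W+/', strtolower($s), -1, PREG_SPLIT_NO_EMPTY);
    $countsA = array_count_values($tokenize($a));
    $countsB = array_count_values($tokenize($b));

    // Dot product over the words of $a; words missing from $b contribute 0.
    $dot = 0;
    foreach ($countsA as $word => $count) {
        $dot += $count * ($countsB[$word] ?? 0);
    }

    // Euclidean length of each count vector.
    $normA = sqrt(array_sum(array_map(fn($c) => $c * $c, $countsA)));
    $normB = sqrt(array_sum(array_map(fn($c) => $c * $c, $countsB)));

    if ($normA == 0 || $normB == 0) {
        return 0.0; // an empty text has no direction to compare
    }
    return $dot / ($normA * $normB);
}

printf("%.3f\n", cosineSimilarity('this is test', 'this is another test')); // 0.866
```

Unlike a prefix check, this measure ignores word order, which may or may not be what you want for your data.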

In PHP you can use the array_unique function to remove duplicates from an array.

Example from php.net:

    <?php
    $input = array("a" => "green", "red", "b" => "green", "blue", "red");
    $result = array_unique($input);
    print_r($result);
    ?>

Output:

    Array
    (
        [a] => green
        [0] => red
        [1] => blue
    )
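Keep in mind that array_unique only removes elements that compare equal as strings. A small sketch with made-up values shows that a near-duplicate survives:

```php
<?php
// "tape drives" appears twice; "tapes drives" differs by a single character.
$input = ['tape drives', 'tapes drives', 'tape drives'];

$result = array_unique($input);
print_r($result); // the exact duplicate is removed, the near-duplicate stays
```

So for texts that differ by a word or a letter, as in the question, it needs to be combined with some similarity measure.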

Hope this is what you were looking for.