Эффективный алгоритм для поиска всех «символьно-равных» строк?

Как мы можем написать эффективную функцию, которая выводит « гомоглифные эквиваленты » входной строки?

Пример 1 (псевдокод):

homoglyphs_list = [ ["o", "0"], // "o" and "0" are homoglyphs ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs ] input_string = "someinput" output = [ "someinput", "s0meinput", "somelnput", "s0melnput", "some1nput", "s0me1nput" ]

Пример 2 :

 homoglyphs_list = [ ["rn", "m", "nn"], ] input_string = "rnn" output = ["rnn", "rm", "mn", "rrn", "nnn", "nm", "nrn"]

Пример 3 :

 homoglyphs_list = [ ["d", "ci", "a"], // "o" and "0" are homoglyphs ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs ] /* notice that with the two rules above, we can infer "d" = "ci" = "a" = "cl" = "c1" */ input_string = "pacerier" output = [ "pacerier", "pacerler", "pacer1er", "pcicerier", "pcicerler", "pcicer1er", "pclcerier", "pc1cerier", "pclcerler", "pc1cerler", "pclcer1er", "pc1cer1er", "pdcerier", "pdcerler", "pdcer1er" ]

Примечание: порядок членов в выходном массиве не важен, и мы можем предположить, что данные гомоморфизмы считаются правильными (входы не дают нам «бесконечной петли»).

Мой текущий алгоритм работает, но он использует raw bruteforcing и производительность ужасна. Например, ввод "mmmmm" с гомофильными ["rn", "m", "nn"] занимает 38 секунд:

 // This is php code (non-pseudo just so we could test the running time), // but the question remains language-agnostic public function Func($in, Array $mappings){ $out_table = array(); $out_table[$in] = null; while(true){ $number_of_entries_so_far = count($out_table); foreach(array_keys($out_table) as $key){ foreach($mappings as $mapping){ foreach($mapping as $value){ for($a=0, $aa=strlen($key); $a<$aa; ++$a){ $pos = strpos($key, $value, $a); if($pos === false){ continue; } foreach($mapping as $equal_value){ if($value === $equal_value){ continue; } $out_table[substr($key, 0, $pos) . $equal_value . substr($key, $pos + strlen($value))] = null; } } } } } if($number_of_entries_so_far === count($out_table)){ // it means we have tried bruteforcing but added no additional entries, // so we can quit now break; } } return array_keys($out_table); } в // This is php code (non-pseudo just so we could test the running time), // but the question remains language-agnostic public function Func($in, Array $mappings){ $out_table = array(); $out_table[$in] = null; while(true){ $number_of_entries_so_far = count($out_table); foreach(array_keys($out_table) as $key){ foreach($mappings as $mapping){ foreach($mapping as $value){ for($a=0, $aa=strlen($key); $a<$aa; ++$a){ $pos = strpos($key, $value, $a); if($pos === false){ continue; } foreach($mapping as $equal_value){ if($value === $equal_value){ continue; } $out_table[substr($key, 0, $pos) . $equal_value . substr($key, $pos + strlen($value))] = null; } } } } } if($number_of_entries_so_far === count($out_table)){ // it means we have tried bruteforcing but added no additional entries, // so we can quit now break; } } return array_keys($out_table); }

Как мы можем реализовать эффективный (быстрый) гомоглифный алгоритм расширения?

Вы должны получить некоторую производительность из рекурсивной реализации, но я не думал о реальной производительности. Это не должно зацикливаться на выходной строке несколько раз и рассчитывать выходные данные на каждой итерации. Кроме того, я добавил небольшой «кеш», чтобы предотвратить вычисление одного и того же гомоглифа дважды, для простоты он не проверяет сопоставления, но вы можете реализовать его или просто очистить кеш перед каждым вызовом, где изменяются сопоставления (но это уродливые и легко вводит ошибки).

Код выглядит так:

 cache = {} def homoglyph(input_string, mappings): output = [] character = input_string[0] if input_string in cache: return cache[input_string] elif not input_string: return [] for charset in mappings: if character in charset: temp = input_string sub_homoglyphs = homoglyph(input_string[1:], mappings) for char in charset: temp[0] = char output.append(temp) #adding the homoglyph character to the rest of the string for subhomoglyph in sub_homoglyphs: output.append(char+subhomoglyph) cache[input_string] = output return output

(Это написано с помощью python, так как я не очень разбираюсь в синтаксисе php, я на самом деле не запускаю его, чтобы могли быть ошибки, но суть логики есть)