Help with assertions in a regular expression


11 replies to this topic

#1 Sanva

    Young Padawan

  • Members
  • Pip
  • 15 posts

Posted 16 December 2008 - 01:36 PM

Hello!

I'm trying to write a regular expression that matches text not wrapped in XHTML paragraph tags. For example, in abcde it must match all the text, in abcd<p>efgh</p> only abcd, and in <p>abcde</p> it must match nothing.

I've been thinking about it, and the only expression I could come up with is this:

/((?<!<p>).+(?!<\/p>))/

But it doesn't work, because it matches all of the text even when it is wrapped in paragraph tags...

Could anybody help me? Thanks a lot for your time.
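
To illustrate the failure described above, here is a minimal sketch that runs the pattern against two of the sample strings from this post. The helper name firstMatch is made up for the example. The likely reason for the result: the lookbehind is only checked at the position where the match starts and the lookahead only where it ends, so nothing stops `.+` from consuming the tags in between.

```php
<?php
// The pattern from the post, unchanged.
$pattern = '/((?<!<p>).+(?!<\/p>))/';

// Hypothetical helper: returns the full match for $subject, or null when nothing matches.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

// At offset 0 nothing precedes the match, so the lookbehind succeeds;
// the greedy .+ then swallows the whole line, tags included.
echo firstMatch($pattern, '<p>abcde</p>'), "\n";     // <p>abcde</p>
echo firstMatch($pattern, 'abcd<p>efgh</p>'), "\n";  // abcd<p>efgh</p>
```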

#2 derek.sullivan

    Jedi In Training

  • Members
  • PipPip
  • 341 posts
  • Gender:Male
  • Location:Georgia
  • Interests:preaching, programming, music, friends, outdoors, moves, books

Posted 16 December 2008 - 02:20 PM

/((?<!<p>)(.*?)(?!<\/p>))/

umm, maybe that will help. maybe do preg_match_all

#3 Sanva

Posted 16 December 2008 - 06:10 PM

No, it doesn't work... It matches all the letters :S .

#4 rc69

    PHP Master PD

  • P2L Staff
  • PipPipPipPip
  • 3,827 posts
  • Gender:Male
  • Location:Here
  • Interests:Web Development

Posted 16 December 2008 - 08:27 PM

1. Could you explain how this is different from your original topic? http://www.pixel2lif...showtopic=43160

2. If you want every bit of text that is not in a paragraph tag, i would recommend using preg_split() rather than preg_match(), as i don't think preg_match() is truly capable of this:
$match = preg_split('#<p>.*?</p>#', $text);
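
As a rough check of the preg_split() suggestion, the sketch below runs it on the three sample strings from the first post. The function name textOutsideParagraphs and the PREG_SPLIT_NO_EMPTY flag are additions for this example, not part of the original one-liner; the flag just drops the empty strings that are left around each paragraph.

```php
<?php
// Hypothetical helper: returns the pieces of $text that sit outside <p>...</p>.
function textOutsideParagraphs(string $text): array
{
    // Split on whole paragraphs; PREG_SPLIT_NO_EMPTY discards empty pieces.
    return preg_split('#<p>.*?</p>#', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Sample inputs taken from the first post in this thread.
foreach (['abcde', 'abcd<p>efgh</p>', '<p>abcde</p>'] as $sample) {
    var_dump(textOutsideParagraphs($sample));
}
```

This gives ['abcde'], ['abcd'] and an empty array respectively. Note that without the s modifier the dot does not match newlines, so a paragraph spanning several lines would not be recognised by this pattern.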

Edited by rc69, 16 December 2008 - 08:28 PM.


#5 Sanva

Posted 17 December 2008 - 10:19 AM

Quote

Could you explain how this is different from your original topic? http://www.pixel2lif...showtopic=43160

Sorry... it's a bit difficult for me to explain my doubts in English, so I thought it would be easier to create a new topic rather than try to continue the other one. My apologies if that is a problem.

Quote

If you want every bit of text that is not in a paragraph tag, i would recommend using preg_split() rather than preg_match(), as i don't think preg_match() is truly capable of this:

It works fine, but I really need to "match" the text, because I want to use preg_replace to replace the matched text with itself wrapped in paragraph tags. I don't know if I'm explaining it correctly or not... :S
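
One possible way to get a real match for preg_replace is sketched below. It relies on the PCRE backtracking verbs (*SKIP)(*FAIL), which are a swapped-in technique and not something proposed in this thread; the function name wrapLooseText is also made up. Treat it as an illustration under those assumptions rather than a tested solution.

```php
<?php
// Hypothetical helper: wraps every run of text that is not already inside <p>...</p>.
function wrapLooseText(string $text): string
{
    // First alternative: consume a whole paragraph, then discard it with
    // (*SKIP)(*FAIL) so the search resumes after the closing tag.
    // Second alternative: lazily match text up to the next <p> or the end of the subject.
    $pattern = '#<p>.*?</p>(*SKIP)(*FAIL)|.+?(?=<p>|\z)#s';

    return preg_replace($pattern, '<p>$0</p>', $text);
}

echo wrapLooseText('abcde'), "\n";            // <p>abcde</p>
echo wrapLooseText('abcd<p>efgh</p>'), "\n";  // <p>abcd</p><p>efgh</p>
echo wrapLooseText('<p>abcde</p>'), "\n";     // <p>abcde</p>
```

Whitespace between two existing paragraphs counts as loose text here and would get wrapped as well, so real input would probably need trimming first.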

#6 rc69

Posted 17 December 2008 - 12:45 PM

preg_split() would return an array of text that wasn't wrapped in <p> tags just like preg_match() would.

What i think you're wanting to do is actually use preg_replace() to replace the strings, not preg_match(). While preg_split() wouldn't be the ideal solution for you then, it will work since you can loop through the matches and replace the strings from there.
$match = preg_split('#<p>.*?</p>#', $text);
// each() was removed in PHP 8, so iterate with foreach instead.
foreach ($match as $string) {
	$text = str_replace($string, '<p>'.$string.'</p>', $text);
}
Disclaimer: I make typos and didn't test anything.

Edited by rc69, 17 December 2008 - 12:46 PM.


#7 derek.sullivan

Posted 17 December 2008 - 02:24 PM

disclaimer: rc knows what he is talking about lol. not too often does he make mistakes.. :)

#8 Sanva

Posted 18 December 2008 - 08:17 AM

I think your code is not really good... :S
I think that it has an important design-related bug: if $text has two (or more) paragraphs with the same text, str_replace would wrap both the first and the second occurrence in the tags, and on a later iteration it would replace those occurrences again, so each occurrence would end up with two opening and two closing tags. I don't know if I'm writing correct English or not... it's horrible. I didn't test your code, because I had written another version just a few minutes before reading yours...
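
The double wrapping described above can be reproduced with a short sketch. It is the loop from post #6, rewritten here with foreach and put in a function whose name (wrapWithStrReplace) is made up for the example; the check for empty pieces is an addition so that str_replace is never called with an empty search string.

```php
<?php
// Hypothetical wrapper around the str_replace loop from post #6.
function wrapWithStrReplace(string $text): string
{
    $pieces = preg_split('#<p>.*?</p>#', $text);
    foreach ($pieces as $piece) {
        if ($piece === '') {
            continue; // nothing to wrap
        }
        // str_replace rewrites EVERY occurrence of $piece, including
        // the ones that an earlier iteration has already wrapped.
        $text = str_replace($piece, '<p>' . $piece . '</p>', $text);
    }
    return $text;
}

// "foo" appears twice outside the paragraph, so it is wrapped twice.
echo wrapWithStrReplace('foo<p>x</p>foo'), "\n";
// <p><p>foo</p></p><p>x</p><p><p>foo</p></p>
```

The same thing would happen if a loose piece of text also occurred inside an existing paragraph.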

Well... this is the code, what do you think about it?

$text = preg_split('#<p>(.*?)</p>#', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
$text = '<p>' . implode('</p><p>', $text) . '</p>';

I've written it inspired by your last post, so thanks, thanks a lot :P .
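
For anyone who wants to try the two lines above, here is a small sketch that puts them in a function (the name paragraphize is made up) and runs them on a few inputs, including one with repeated text.

```php
<?php
// The split-and-implode approach from this post, wrapped in a hypothetical function.
function paragraphize(string $text): string
{
    // PREG_SPLIT_DELIM_CAPTURE keeps the text captured inside each <p>...</p>,
    // so wrapped and unwrapped pieces come back in document order.
    $parts = preg_split('#<p>(.*?)</p>#', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

    return '<p>' . implode('</p><p>', $parts) . '</p>';
}

echo paragraphize('abcd<p>efgh</p>ij'), "\n";  // <p>abcd</p><p>efgh</p><p>ij</p>
echo paragraphize('foo<p>x</p>foo'), "\n";     // <p>foo</p><p>x</p><p>foo</p>
echo paragraphize('<p>abcde</p>'), "\n";       // <p>abcde</p>
```

One edge case worth knowing about: an empty input string produces `<p></p>`, because implode on an empty array returns an empty string.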

Edited by Sanva, 18 December 2008 - 08:24 AM.


#9 rc69

Posted 18 December 2008 - 06:29 PM

1. Very good point.
2. Very nice job improvising ;)

#10 Sanva

Posted 24 December 2008 - 10:52 AM

I'm still having problems with assertions and (X)HTML.

The next problem is how to find tags that aren't closed (or opened) properly.

For example, in this text I need to match the tag <strong> to remove it:

<p>This is an <strong>error</p>

I thought it would be easy, just an assertion-related problem, and I created this pattern:

#(<(\w+) ?(?-s:.)*?>.*?(?!</\1>))#s

But it doesn't work... It matches both the p and strong elements. I've been looking for documentation and examples about regular expressions, but I haven't found anything about my problem.
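
A small sketch that shows what the pattern above actually matches. The likely explanation is that the lazy `.*?` is satisfied with zero characters, so the negative lookahead is tested immediately after the opening tag, where no closing tag follows. Also note that `\1` refers to the outer group; the tag name is captured by group 2.

```php
<?php
// The pattern from the post, unchanged.
$pattern = '#(<(\w+) ?(?-s:.)*?>.*?(?!</\1>))#s';
$text = '<p>This is an <strong>error</p>';

preg_match_all($pattern, $text, $matches);

// Each match stops right after the opening tag, because .*? matched nothing
// and the lookahead only had to hold at that one position.
print_r($matches[0]);  // <p> and <strong>
print_r($matches[2]);  // p and strong (the tag names)
```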

Could anyone help me?

Thanks a lot for your time.

#11 rc69

Posted 30 December 2008 - 02:19 PM

Now you're attempting to go beyond the scope of what regex was intended for. In order to properly do what you are asking, you need to make an HTML parser. Also, before you consider attempting to make such a parser, i should warn you that making one is not for the faint of heart (they are quite difficult).

I would recommend a recursive descent parser as they are relatively easy to program. However, you will need to find the HTML language spec in order to properly do that.

Short of a full-blown parser, you could probably try a stack-based solution (which as i'm thinking about it, seems easy enough), but would probably still not be the ideal solution.
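
A rough sketch of what such a stack-based check might look like. Everything here is an assumption for illustration: the function name findUnclosedTags, the simple tag pattern, and the expectation of XHTML-style input where empty elements are written as `<br />`. It ignores comments, CDATA and attribute values that contain `>`, so it is not a replacement for a parser.

```php
<?php
// Hypothetical helper: returns the names of tags that were opened but never closed properly.
function findUnclosedTags(string $html): array
{
    $stack = [];     // names of the elements that are currently open
    $unclosed = [];  // opening tags that never got a matching close

    preg_match_all('#<(/?)(\w+)[^>]*?(/?)>#', $html, $tags, PREG_SET_ORDER);

    foreach ($tags as [, $closing, $name, $selfClosing]) {
        if ($selfClosing === '/') {
            continue;              // <br /> and similar never go on the stack
        }
        if ($closing === '') {
            $stack[] = $name;      // opening tag: push it
            continue;
        }
        // Closing tag with no matching open element: ignore it in this sketch.
        if (!in_array($name, $stack, true)) {
            continue;
        }
        // Pop until the matching opener; everything popped on the way was left unclosed.
        while (($top = array_pop($stack)) !== $name) {
            $unclosed[] = $top;
        }
    }

    // Whatever is still on the stack was never closed either (innermost first).
    return array_merge($unclosed, array_reverse($stack));
}

print_r(findUnclosedTags('<p>This is an <strong>error</p>'));  // strong
```

For the example from post #10 this reports strong, which is the tag that needed to be removed.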

#12 Sanva

Posted 31 December 2008 - 09:01 AM

Quote

you could probably try a stack-based solution
I've been thinking about it, but I think you're right that a recursive descent parser would be better. However, doing it properly would take a long time, because I'd need to learn the HTML and XHTML specifications, and with the February exams so close, I think it would be better to do it in the future, as a learning exercise... For now I can use the W3C's tool named Tidy, learn a bit about it [1][2][3], and integrate it into my paragraph maker :angrylooking:

Thanks a lot for your time!




