Markdown
Markdown is a lightweight, easy-to-use syntax for styling all forms of writing on the modern web. Check out this excellent guide by GitHub to learn everything about Markdown.
HTML::Pipeline intro
HTML::Pipeline is a set of HTML processing filters and utilities. It includes a small framework for defining DOM-based content filters and applying them to user-provided content. Read an introduction to HTML::Pipeline in this blog post. GitHub uses HTML::Pipeline to implement Markdown.
Implementing Markdown
[ Markdown Content ] -> [ RenderMarkdown ] -> [ HTML ]
Content goes into our pipeline and HTML comes out. As simple as that!
Let's implement RenderMarkdown.
Install HTML::Pipeline & dependency for Markdown
First we'll need to install HTML::Pipeline and associated dependencies for each feature:
# Gemfile
gem "github-markdown"
gem "html-pipeline"
1-min HTML::Pipeline tutorial
require "html/pipeline"
filter = HTML::Pipeline::MarkdownFilter.new("Hi **world**!")
filter.call
Filters can be combined into a pipeline:
pipeline = HTML::Pipeline.new [
  HTML::Pipeline::MarkdownFilter,
  # more filters ...
]
result = pipeline.call "Hi **world**!"
result[:output].to_s
Each filter hands its output to the next filter's input:
--------------- Pipeline ----------------------
| |
| [Filter 1] -> [Filter 2] ... -> [Filter N] |
| |
-----------------------------------------------
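The hand-off in the diagram can be sketched in plain Ruby. This is only an illustration of the chaining idea, not how HTML::Pipeline is implemented: `ToyPipeline` and the three lambdas are made-up names, and each "filter" is simply an object that responds to `call`.

```ruby
# A toy pipeline: every filter is any object that responds to #call(input).
# Illustration only; ToyPipeline is a hypothetical name, not part of HTML::Pipeline.
class ToyPipeline
  def initialize(filters)
    @filters = filters
  end

  # Feed the input to the first filter, then pass each result to the next one.
  def call(input)
    @filters.reduce(input) { |text, filter| filter.call(text) }
  end
end

strip_whitespace = ->(text) { text.strip }
shout            = ->(text) { text.upcase }
wrap_paragraph   = ->(text) { "<p>#{text}</p>" }

toy = ToyPipeline.new([strip_whitespace, shout, wrap_paragraph])
puts toy.call("  hi world  ")
```

With these three lambdas the call prints `<p>HI WORLD</p>`: the order of the filters in the array is the order in which they run.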
RenderMarkdown
We can then implement the RenderMarkdown class by leveraging HTML::Pipeline:
class RenderMarkdown
  def initialize(content)
    @content = content
  end

  def call
    pipeline = HTML::Pipeline.new [
      HTML::Pipeline::MarkdownFilter
    ]
    pipeline.call(content)[:output].to_s
  end

  private

  attr_reader :content
end
To use it:
RenderMarkdown.new("Hello, **world**!").call
=> "<p>Hello, <strong>world</strong>!</p>"
It works and it is very easy!
Avoid HTML markup
Sometimes users may be tempted to try something like:
<img src='' onerror='alert(1)' />
This is a common trick for making an alert box pop up on the page, and we don't want every user to see one.
Because Markdown allows inline HTML, this markup passes straight through. You can use HTML::Pipeline's built-in SanitizationFilter to sanitize it.
The problem with SanitizationFilter is that disallowed tags are discarded. That is fine for the regular use case of HTML sanitization, where we want to let users enter some HTML. But we never want HTML: any HTML entered should be displayed as-is.
For example, writing:
hello <script>i am sam</script>
Should not result in the usual sanitized output (GitHub's behavior):
hello
Instead, it should output escaped HTML:
hello &lt;script&gt;i am sam&lt;/script&gt;
So here we take a different approach. We can add a NoHtmlFilter that simply replaces < with &lt;:
class NoHtmlFilter < HTML::Pipeline::TextFilter
  def call
    # keep `>` since markdown needs that for blockquotes
    @text.gsub('<', '&lt;')
  end
end
Put this NoHtmlFilter before our Markdown filter:
class NoHtmlFilter < HTML::Pipeline::TextFilter
  def call
    @text.gsub('<', '&lt;')
  end
end

class RenderMarkdown
  def initialize(content)
    @content = content
  end

  def call
    pipeline = HTML::Pipeline.new [
      NoHtmlFilter,
      HTML::Pipeline::MarkdownFilter,
    ]
    pipeline.call(content)[:output].to_s
  end

  private

  attr_reader :content
end
We keep > because Markdown needs it for blockquotes. Let's try this:
RenderMarkdown.new("<img src='' onerror='alert(1)' />").call
=> "<p>&lt;img src='' onerror='alert(1)' /&gt;</p>"
Although < and > are escaped in the output, it still looks the same from the user's perspective.
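The escaping step on its own can be tried without any gems. A minimal sketch, where `escape_opening_brackets` is a hypothetical helper name that mirrors what NoHtmlFilter does to its text:

```ruby
# Escape only "<"; leave ">" alone so a leading ">" can still start a
# Markdown blockquote. escape_opening_brackets is a hypothetical helper name.
def escape_opening_brackets(text)
  text.gsub("<", "&lt;")
end

puts escape_opening_brackets("> quoted <b>bold</b>")
# prints: > quoted &lt;b>bold&lt;/b>
```

The blockquote marker at the start of the line is untouched, while no tag can open anymore because every `<` is gone.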
But what if we want to talk about some HTML in a code tag?
> content = <<~CONTENT
  > quoted text

  123`<img src='' onerror='alert(1)' />`45678
CONTENT
> RenderMarkdown.new(content).call
=> "<blockquote>\n<p>quoted text</p>\n</blockquote>\n\n<p>123<code>&amp;lt;img src='' onerror='alert(1)' /&gt;</code>45678</p>"
The & of our &lt; got escaped again inside the code tag, which we don't want. Let's fix this:
require "securerandom"

class NohtmlMarkdownFilter < HTML::Pipeline::MarkdownFilter
  def call
    # Pick a random string that does not occur anywhere in the text
    while @text.index(unique = SecureRandom.hex); end
    # Hide every `<` behind it so Markdown has nothing to escape twice
    @text.gsub!("<", unique)
    # Render Markdown, then turn the placeholder into an escaped `<`
    super.gsub(unique, "&lt;")
  end
end

class RenderMarkdown
  def initialize(content)
    @content = content
  end

  def call
    # NohtmlMarkdownFilter already renders Markdown through `super`
    pipeline = HTML::Pipeline.new [
      NohtmlMarkdownFilter,
    ]
    pipeline.call(content)[:output].to_s
  end

  private

  attr_reader :content
end
> RenderMarkdown.new(content).call
=> "<blockquote>\n<p>quoted text</p>\n</blockquote>\n\n<p>123<code>&lt;img src='' onerror='alert(1)' /&gt;</code>45678</p>"
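Both the double-escaping problem and the placeholder fix can be reproduced with the standard library alone. In the sketch below, `render_code_spans` is a hypothetical stand-in for the Markdown renderer: it only converts backtick spans to code tags and HTML-escapes their contents, which is an assumption about the part of the renderer's behavior that matters here.

```ruby
require "cgi"
require "securerandom"

# Hypothetical stand-in for a Markdown renderer: it only converts `code`
# spans and HTML-escapes their contents.
def render_code_spans(text)
  text.gsub(/`([^`]*)`/) { "<code>#{CGI.escapeHTML(Regexp.last_match(1))}</code>" }
end

# Naive approach: escape "<" first, then render. The "&" of "&lt;" is
# escaped a second time inside code spans.
def escape_then_render(text)
  render_code_spans(text.gsub("<", "&lt;"))
end

# Placeholder approach: hide "<" behind a token that is not in the text,
# render, then turn the token into "&lt;".
def render_with_placeholder(text)
  token = SecureRandom.hex
  token = SecureRandom.hex while text.include?(token)
  render_code_spans(text.gsub("<", token)).gsub(token, "&lt;")
end

input = "see `<b>` here"
puts escape_then_render(input)      # see <code>&amp;lt;b&gt;</code> here
puts render_with_placeholder(input) # see <code>&lt;b&gt;</code> here
```

The token is plain hexadecimal, so the renderer's escaping leaves it alone, and the entity is only written after rendering has finished.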
This is awesome, but here comes another bug report: autolinks no longer work.
content = "hey Juanito <juanito@example.com>"
> RenderMarkdown.new(content).call
=> "<p>hey Juanito <a href=\"mailto:&lt;juanito@example.com\">&lt;juanito@example.com</a>&gt;</p>"
The fix is to add a space after our unique string when replacing the <:
class NohtmlMarkdownFilter < HTML::Pipeline::MarkdownFilter
  def call
    # The trailing space keeps the placeholder from fusing with the next word
    while @text.index(unique = "#{SecureRandom.hex} "); end
    @text.gsub!("<", unique)
    super.gsub(unique, "&lt;")
  end
end

class RenderMarkdown
  def initialize(content)
    @content = content
  end

  def call
    # NohtmlMarkdownFilter already renders Markdown through `super`
    pipeline = HTML::Pipeline.new [
      NohtmlMarkdownFilter,
    ]
    pipeline.call(content)[:output].to_s
  end

  private

  attr_reader :content
end
Now autolinks work as usual:
content = "hey Juanito <juanito@example.com>"
> RenderMarkdown.new(content).call
=> "<p>hey Juanito &lt;<a href=\"mailto:juanito@example.com\">juanito@example.com</a>&gt;</p>"
Other cases come in too, so the final version matches the unique string with or without the space that follows it:
class NohtmlMarkdownFilter < HTML::Pipeline::MarkdownFilter
  def call
    while @text.index(unique = SecureRandom.hex); end
    @text.gsub!("<", "#{unique} ")
    super.gsub(Regexp.new("#{unique}\\s?"), "&lt;")
  end
end
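The optional whitespace in that regular expression is what makes the last version robust. Here is a small stdlib-only sketch of the idea; `mask_brackets` and `unmask_brackets` are hypothetical helper names, and `strip` merely simulates a renderer that drops trailing whitespace, which is an assumption about one way the space can get lost.

```ruby
require "securerandom"

# Replace "<" with "token + space" so the token never fuses with the
# following word (which is what broke autolinks).
def mask_brackets(text, token)
  text.gsub("<", "#{token} ")
end

# Turn the token back into "&lt;", whether or not the space survived.
def unmask_brackets(text, token)
  text.gsub(/#{Regexp.escape(token)}\s?/, "&lt;")
end

token = SecureRandom.hex

# The space is still there: it is consumed together with the token.
puts unmask_brackets(mask_brackets("a <b", token), token)       # a &lt;b

# The space was dropped (simulated with strip): the token still matches.
puts unmask_brackets(mask_brackets("1 <", token).strip, token)  # 1 &lt;
```

Matching a fixed "token plus space" string, as the previous version did, would leave the raw token in the output in the second case.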
Sanitization
While we can display escaped HTML, we still need to add sanitization.
Add SanitizationFilter after our Markdown has been translated into HTML:
# Gemfile
gem "sanitize"
# RenderMarkdown
class RenderMarkdown
  ...
  def call
    pipeline = HTML::Pipeline.new [
      NohtmlMarkdownFilter,
      HTML::Pipeline::SanitizationFilter,
    ]
    ...
  end
  ...
end
Now our HTML is safe!
Nice to have
Syntax Highlight with Rouge
No more Pygments dependency: highlight syntax with Rouge.
# Gemfile
gem "html-pipeline-rouge_filter"
# RenderMarkdown
class RenderMarkdown
  ...
  def call
    pipeline = HTML::Pipeline.new [
      NohtmlMarkdownFilter,
      HTML::Pipeline::SanitizationFilter,
      HTML::Pipeline::RougeFilter
    ]
    ...
  end
  ...
end
Twemoji instead of gemoji (more emojis)
HTML::Pipeline originally came with an EmojiFilter, which uses gemoji under the hood, but there is an alternative: twemoji.
# Gemfile
gem "twemoji"
# new file
class EmojiFilter < HTML::Pipeline::Filter
  def call
    Twemoji.parse(doc,
      file_ext: context[:file_ext] || "svg",
      class_name: context[:class_name] || "emoji",
      img_attrs: context[:img_attrs] || {},
    )
  end
end
# RenderMarkdown
class RenderMarkdown
  ...
  def call
    pipeline = HTML::Pipeline.new [
      NohtmlMarkdownFilter,
      HTML::Pipeline::SanitizationFilter,
      EmojiFilter,
      HTML::Pipeline::RougeFilter
    ]
    ...
  end
  ...
end
Wrap Up
We now have a Markdown renderer that offers:
- Escaped HTML output
- Syntax highlighting with Ruby's Rouge
- Better emoji support via Twemoji
See JuanitoFatas/markdown@eb7f434...377125 for full implementation!