techx-georgiask/sound-medicine-academy-blog-scraper

Sound Medicine Academy Blog Scraper

Extract Sound Medicine Academy blog posts into structured HTML, JSON, or plain text for analysis, reporting, and content workflows. This scraper collects blog listings and (optionally) detailed post content, including authors, categories, and publish dates. Use it when you need reliable, repeatable blog content extraction without manual copy-paste.

Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for sound-medicine-academy-blog-scraper, you've just found your team. Let's chat.

Introduction

Sound Medicine Academy Blog Scraper crawls the blog index, gathers post URLs, and then fetches each post’s details based on your configuration. It solves the problem of turning blog pages into clean, structured data that’s ready for dashboards, spreadsheets, search indexing, or content auditing. It’s built for developers, data teams, and automation-minded users who want consistent output formats and flexible filtering.

Built for blog research and content ops

  • Scrapes the full blog list first, then enriches results with post-level details when enabled
  • Supports filtering by search term, author, or category to target specific subsets of content
  • Exports post details in HTML, Plain Text, or JSON for easy downstream processing
  • Accepts explicit blog URLs when you only need specific posts
  • Produces structured records designed for reporting and integration pipelines
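
The list-first, detail-second flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `runPipeline`, `fetchList`, `fetchDetail`, and the stub fetchers are hypothetical names, and the fetchers are injected so the sketch runs without network access.

```javascript
// Hypothetical sketch of a list-then-detail scrape. All names are illustrative.
async function runPipeline({ fetchList, fetchDetail, scrapeBlogDetails = false }) {
  // Phase 1: collect listing-level records (title, summary, url, ...).
  const listing = await fetchList();
  if (!scrapeBlogDetails) return listing;

  // Phase 2: enrich each listing record with post-level fields.
  const enriched = [];
  for (const post of listing) {
    const detail = await fetchDetail(post.url);
    enriched.push({ ...post, ...detail });
  }
  return enriched;
}

// Stub fetchers standing in for real HTTP requests and HTML parsing.
const stubList = async () => [
  { title: "Post A", url: "https://example.com/blog?p=post-a" },
];
const stubDetail = async (url) => ({ content: `Body fetched from ${url}` });

runPipeline({ fetchList: stubList, fetchDetail: stubDetail, scrapeBlogDetails: true })
  .then((records) => console.log(records));
```

Fetching details sequentially keeps the sketch simple; a real crawler would likely bound concurrency instead.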

Features

| Feature | Description |
| --- | --- |
| Blog list crawling | Collects blog post listings and counts to establish a complete scrape scope. |
| Optional detail scraping | When enabled, fetches full post details including content, author info, and metadata. |
| Multiple export types | Exports post details as HTML, Plain Text, or JSON depending on your workflow needs. |
| Filtering support | Filter by search query, author, or categories to scrape only what matters. |
| URL targeting | Provide specific blog URLs to skip discovery and scrape only selected posts. |
| Metadata enrichment | Captures canonical URL, SEO title/description, timestamps, read time, and media fields. |
| Structured, dataset-ready output | Produces consistent records suitable for spreadsheets, databases, and APIs. |

What Data This Scraper Extracts

| Field Name | Field Description |
| --- | --- |
| id | Numeric identifier for the blog post record. |
| title | Post title as shown on the blog page. |
| summary | Short excerpt/preview text from the listing or page. |
| content | Full post content (populated when blog details scraping is enabled). |
| slug | URL-friendly post slug or identifier. |
| featuredImage | Primary featured image URL for the post. |
| featuredImageWebm | Optional WebM media URL (if available). |
| featuredImageMp4 | Optional MP4 media URL (if available). |
| publishedAt | Human-readable publish date string. |
| publishedAtIso8601 | Machine-readable publish timestamp in ISO 8601. |
| updatedAt | Human-readable last updated date string. |
| updatedAtIso8601 | Machine-readable updated timestamp in ISO 8601. |
| keyword | Keyword used for filtering/search (when applicable). |
| seoTitle | SEO/meta title for the page. |
| seoDescription | SEO/meta description for the page. |
| categories | Post categories/tags (string list or objects, depending on extraction mode). |
| author.id | Author identifier (when present). |
| author.name | Author display name. |
| author.slug | Author slug/handle. |
| author.photo | Author photo URL. |
| author.bio | Author bio text (when present). |
| readtime | Estimated reading time string (e.g., "7 minute read"). |
| pinned | Pinned flag (when present). |
| url | Direct URL to the blog post. |
| canonicalUrl | Canonical URL for SEO and deduplication. |
| headTitle | Page head title (often matches seoTitle/title). |
| headDescription | Page head description (often matches seoDescription). |
| rssTitle | RSS feed title (when present). |
| rssUrl | RSS feed URL (when present). |
| og_image | OpenGraph image URL for sharing previews. |
| noindex | Whether the page indicates it should not be indexed. |
| h1Title | Primary H1 title extracted from the page. |

Example Output

[
  {
    "id": 14,
    "title": "What are carbon fiber composites and should you use them?",
    "summary": "Everyone loves PLA and PETG! They’re cheap, easy, and a lot of people use them exclusively. Some of you might get wild and try out some ABS every once",
    "content": "What are carbon fiber composites and should you use them?\n...\nTL;DR\nWhat you need to know about carbon fiber composites:\nCost: $50/€45 -$200/€185 per kg\n...",
    "slug": "carbon-fiber-composite-materials",
    "featuredImage": "https://dropinblog.net/34259178/files/featured/carbon-fiber-1-k2wil.png",
    "publishedAt": "March 17th, 2025",
    "publishedAtIso8601": "2025-03-17T08:10:00-05:00",
    "updatedAt": "March 18th, 2025",
    "updatedAtIso8601": "2025-03-18T03:18:21-05:00",
    "seoTitle": "What are carbon fiber composites and should you use them?",
    "seoDescription": "Carbon fiber composites are an amazing but sometimes confusing category of materials. Find out how they work and what they can be used for!",
    "categories": ["Features", "Guides", "Challenge", "Community Spotlight"],
    "author": {
      "id": 68114,
      "name": "Arun Chapman",
      "slug": "arun-chapman",
      "photo": "https://dropinblog.net/34259178/authors/A.Chapman Profile Picture (2).jpg"
    },
    "readtime": "7 minute read",
    "url": "https://www.soundmedicineacademy.com/blog?p=carbon-fiber-composite-materials",
    "canonicalUrl": "https://www.soundmedicineacademy.com/blog?p=carbon-fiber-composite-materials"
  }
]
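
Records shaped like the example above lend themselves to simple post-processing. The sketch below deduplicates by `canonicalUrl` (falling back to `url`) and sorts newest first using `publishedAtIso8601`. The helper name `dedupeAndSort` is hypothetical and is not part of the scraper; it assumes every record carries a parseable ISO 8601 timestamp.

```javascript
// Hypothetical post-processing helper for scraper output; not part of the scraper.
function dedupeAndSort(records) {
  const byCanonical = new Map();
  for (const record of records) {
    // Prefer canonicalUrl as the identity key; fall back to url when it is missing.
    const key = record.canonicalUrl || record.url;
    if (!byCanonical.has(key)) byCanonical.set(key, record);
  }
  // Newest first. Assumes publishedAtIso8601 is present and parseable on every record.
  return [...byCanonical.values()].sort(
    (a, b) => Date.parse(b.publishedAtIso8601) - Date.parse(a.publishedAtIso8601)
  );
}

const sampleRecords = [
  { canonicalUrl: "https://example.com/blog?p=a", publishedAtIso8601: "2025-03-17T08:10:00-05:00" },
  { canonicalUrl: "https://example.com/blog?p=a", publishedAtIso8601: "2025-03-17T08:10:00-05:00" },
  { canonicalUrl: "https://example.com/blog?p=b", publishedAtIso8601: "2025-03-18T03:18:21-05:00" },
];
console.log(dedupeAndSort(sampleRecords).map((r) => r.canonicalUrl));
```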

Directory Structure Tree

Sound Medicine Academy Blog Scraper/
├── src/
│   ├── index.js
│   ├── cli.js
│   ├── runner/
│   │   ├── runActor.js
│   │   └── validateInput.js
│   ├── config/
│   │   ├── defaults.json
│   │   └── input.schema.json
│   ├── crawlers/
│   │   ├── blogListCrawler.js
│   │   └── blogDetailCrawler.js
│   ├── extractors/
│   │   ├── parseBlogList.js
│   │   ├── parseBlogDetail.js
│   │   └── normalizeFields.js
│   ├── filters/
│   │   ├── buildFilter.js
│   │   └── applyFilter.js
│   ├── exporters/
│   │   ├── exportHtml.js
│   │   ├── exportJson.js
│   │   ├── exportText.js
│   │   └── index.js
│   └── utils/
│       ├── http.js
│       ├── urls.js
│       ├── time.js
│       └── logger.js
├── data/
│   ├── input.example.json
│   └── sample.output.json
├── tests/
│   ├── parseBlogList.test.js
│   ├── parseBlogDetail.test.js
│   └── fixtures/
│       ├── list.html
│       └── detail.html
├── .gitignore
├── package.json
├── package-lock.json
├── LICENSE
└── README.md

Use Cases

  • Content marketers use it to export post metadata and categories, so they can plan campaigns and fill content gaps faster.
  • SEO specialists use it to capture canonical URLs, titles, and descriptions, so they can audit consistency and detect duplicates.
  • Data teams use it to collect blog text into JSON, so they can run topic modeling, clustering, or search indexing.
  • Editors use it to filter by author or category, so they can review specific contributors or sections without manual browsing.
  • Developers use it to integrate blog extraction into pipelines, so they can automate reporting and monitoring.

FAQs

How do I control how many blog posts are scraped?
Set maxBlogs to a positive integer. If omitted, the scraper will attempt to collect all available posts from the blog listing before applying any URL targeting or filtering.

Can I scrape only specific blog URLs instead of the full blog listing?
Yes. Provide blogUrls with an array of post URLs. When set, discovery can be skipped and only those posts will be processed, which is useful for incremental updates or targeted audits.
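
When preparing a blogUrls array, it can help to normalize and deduplicate the entries first. The sketch below assumes post URLs carry the slug in a `p` query parameter, as the `url` field in the Example Output does. `slugFromBlogUrl` and `normalizeBlogUrls` are hypothetical helper names, not functions from this repository.

```javascript
// Hypothetical helper: assumes the slug lives in the `p` query parameter,
// matching the url shown in the Example Output.
function slugFromBlogUrl(rawUrl) {
  try {
    return new URL(rawUrl.trim()).searchParams.get("p");
  } catch {
    return null; // not a string, or not a parseable absolute URL
  }
}

// Keep only URLs with a slug, dropping duplicates that point at the same post.
function normalizeBlogUrls(blogUrls) {
  const seen = new Set();
  const result = [];
  for (const raw of blogUrls) {
    const slug = slugFromBlogUrl(raw);
    if (slug && !seen.has(slug)) {
      seen.add(slug);
      result.push({ url: raw.trim(), slug });
    }
  }
  return result;
}

console.log(
  slugFromBlogUrl("https://www.soundmedicineacademy.com/blog?p=carbon-fiber-composite-materials")
);
```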

How do filtering options work (search, author, categories)?
Enable filterBy and set filterType to one of search, author, or categories, then provide filterValue. The scraper will include only posts matching the filter criteria based on listing and/or post metadata.
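
One way to picture the three filter types is as a predicate built from the input. The sketch below is illustrative only: `buildPostFilter` is a hypothetical name, and the matching rules (case-insensitive, substring match on title and summary for search, exact match for author and categories, a `name` property on category objects) are assumptions, not documented scraper behavior.

```javascript
// Hypothetical filter builder; matching rules are assumptions for illustration.
function buildPostFilter({ filterBy, filterType, filterValue }) {
  if (!filterBy) return () => true; // filtering disabled: keep every post
  const needle = String(filterValue).toLowerCase();
  switch (filterType) {
    case "search":
      // Substring match against listing-level text fields.
      return (post) =>
        [post.title, post.summary].some((text) => (text || "").toLowerCase().includes(needle));
    case "author":
      return (post) => ((post.author && post.author.name) || "").toLowerCase() === needle;
    case "categories":
      // Categories may be strings or objects; objects are assumed to expose `name`.
      return (post) =>
        (post.categories || []).some((c) => String((c && c.name) || c).toLowerCase() === needle);
    default:
      throw new Error(`Unknown filterType: ${filterType}`);
  }
}

const samplePosts = [
  { title: "Healing with sound", summary: "", author: { name: "Arun Chapman" }, categories: ["Guides"] },
  { title: "Other", summary: "x", author: { name: "Someone Else" }, categories: [{ name: "Features" }] },
];
const onlyGuides = buildPostFilter({ filterBy: true, filterType: "categories", filterValue: "Guides" });
console.log(samplePosts.filter(onlyGuides).map((p) => p.title));
```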

Why is my content field empty in the output?
If scrapeBlogDetails is false, only listing-level fields are returned and content will remain empty. Turn on scrapeBlogDetails and choose blogDetailExportType (HTML, Plain Text, or JSON) to populate post content.
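
Putting the input fields from these FAQs together, a configuration might look like the `exampleInput` object below. The field names come from the answers above; the exact accepted values for blogDetailExportType, and the validation rules in the hypothetical `checkInput` helper, are assumptions rather than the scraper's documented schema.

```javascript
// Accepted export type strings are assumed; check the actual input schema before relying on them.
const EXPORT_TYPES = ["HTML", "Plain Text", "JSON"];
const FILTER_TYPES = ["search", "author", "categories"];

// Hypothetical validator: returns a list of human-readable problems (empty when valid).
function checkInput(input) {
  const errors = [];
  if (input.maxBlogs !== undefined && !(Number.isInteger(input.maxBlogs) && input.maxBlogs > 0)) {
    errors.push("maxBlogs must be a positive integer");
  }
  if (input.blogUrls !== undefined && !Array.isArray(input.blogUrls)) {
    errors.push("blogUrls must be an array of post URLs");
  }
  if (input.filterBy) {
    if (!FILTER_TYPES.includes(input.filterType)) {
      errors.push("filterType must be one of: " + FILTER_TYPES.join(", "));
    }
    if (!input.filterValue) {
      errors.push("filterValue is required when filterBy is enabled");
    }
  }
  if (input.scrapeBlogDetails && !EXPORT_TYPES.includes(input.blogDetailExportType)) {
    errors.push("blogDetailExportType must be one of: " + EXPORT_TYPES.join(", "));
  }
  return errors;
}

const exampleInput = {
  maxBlogs: 50,
  scrapeBlogDetails: true,
  blogDetailExportType: "JSON",
  filterBy: true,
  filterType: "author",
  filterValue: "Arun Chapman",
};
console.log(checkInput(exampleInput)); // an empty array means no problems were found
```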


Performance Benchmarks and Results

Primary Metric: Average extraction speed of 25–45 posts per minute on typical broadband connections when scraping listings plus full post details.

Reliability Metric: 97–99% successful post retrieval in repeated runs, with automatic retries handling intermittent network failures.
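
The README does not describe how retries are implemented. As a rough illustration of the general technique, the sketch below retries a failing async task with exponential backoff. `withRetries` and its default values are hypothetical, not taken from this repository.

```javascript
// Hypothetical retry wrapper; attempt count and delays are illustrative defaults.
async function withRetries(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
        const delayMs = baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}

// Example: a task that fails on its first attempt and succeeds on the second.
withRetries(
  async (attempt) => {
    if (attempt === 0) throw new Error("simulated network failure");
    return "fetched";
  },
  { baseDelayMs: 10 }
).then((result) => console.log(result));
```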

Efficiency Metric: Peak memory usage typically falls in the 250–400 MB range for runs of up to 1,000 posts, achieved by streaming results and limiting in-memory HTML retention.

Quality Metric: 95–98% field completeness for core metadata (title, summary, URL, timestamps, author, categories), with content accuracy dependent on page structure consistency and export type.
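
Field completeness can be measured on your own output as the share of core fields that are non-empty across all records. The sketch below is one possible definition, not necessarily the one behind the figure above: `fieldCompleteness` is a hypothetical helper, and mapping "timestamps" to `publishedAtIso8601` is an assumption.

```javascript
// Core metadata fields from the Quality Metric; "timestamps" is assumed to mean publishedAtIso8601.
const CORE_FIELDS = ["title", "summary", "url", "publishedAtIso8601", "author", "categories"];

// A value counts as present when it is not null/undefined, not a blank string, and not an empty array.
function isPresent(value) {
  if (value === null || value === undefined) return false;
  if (typeof value === "string") return value.trim() !== "";
  if (Array.isArray(value)) return value.length > 0;
  return true;
}

// Returns the fraction (0 to 1) of core fields that are populated across all records.
function fieldCompleteness(records, fields = CORE_FIELDS) {
  if (records.length === 0 || fields.length === 0) return 0;
  let filled = 0;
  for (const record of records) {
    for (const field of fields) {
      if (isPresent(record[field])) filled += 1;
    }
  }
  return filled / (records.length * fields.length);
}

const completenessSample = [
  { title: "A", summary: "s", url: "u", publishedAtIso8601: "2025-03-17T08:10:00-05:00", author: { name: "x" }, categories: ["Guides"] },
  { title: "B", summary: "", url: "u2", publishedAtIso8601: null, author: { name: "y" }, categories: [] },
];
console.log(fieldCompleteness(completenessSample)); // 9 of 12 fields present: 0.75
```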

Review 1

"Bitbash is a top-tier automation partner, innovative, reliable, and dedicated to delivering real results every time."

Nathan Pennington
Marketer
★★★★★

Review 2

"Bitbash delivers outstanding quality, speed, and professionalism, truly a team you can rely on."

Eliza
SEO Affiliate Expert
★★★★★

Review 3

"Exceptional results, clear communication, and flawless delivery. Bitbash nailed it."

Syed
Digital Strategist
★★★★★