Imagine reading and cataloguing every English-language article published on the web. That’s what public relations platform Muck Rack does on a daily basis.
Every story — including this one — is automatically scraped by its platform, categorized and pinned to the corresponding author’s profile. To make that happen, Muck Rack’s engineering team needs to bring order to the chaotic data generated by an industry where every publication has a different content system, page layout and text format.
This summer, the New York-based company hired Matt Dennewitz as its new VP of engineering. Dennewitz has spent the last few months digging into Muck Rack’s codebase to learn how it works and what makes it tick.
One of the first things that stood out to him was how clean it was — which makes sense, considering that it’s built to make sense of utter chaos.
Clean code 101
- Writing clean code saves onboarding time. It helps new hires get a handle on what each part of the codebase actually does, and it helps engineers working in a new language get up to speed.
- Build a culture around clean code. From the outset, Muck Rack CTO Lee Semel encouraged engineers to prioritize writing clean code, creating a legacy that’s passed down as new hires join the team.
- Provide tools and create rules around clean code. As a policy, Muck Rack uses tools like ESLint, Black and isort to check code for errors and formatting issues before it’s committed. The team also conducts pull request reviews and evaluates new hires based on their ability to write clean code.
Muck Rack picked Django for code readability
In order to quickly scrape millions of websites for news stories, Muck Rack needs to identify what a news story looks like, differentiate between headlines and the story itself, and sort every story by the person who wrote it.
It’s a tall order that requires navigating custom-built content management systems and websites that block bots. And once you have the data, you need to appropriately sort it to make it useful.
Muck Rack CTO Lee Semel built the platform in Django, a Python-based framework, because of its emphasis on readability and clarity in its code structure. He had previously written in PHP, but found it difficult to maintain a clean structure in that language.
According to Dennewitz, Python has made it easier to scale the platform as the company and its engine have grown. The platform also uses a MySQL database and the Elasticsearch search engine.
Looking to patterns when proper tags are missing
When the scraping engine finds a news publication, it strips the story down to its text level and looks for metadata tags identifying headline, byline (or author information), dateline and text. If there are no tags, the company’s algorithm is trained to identify common patterns for what these elements typically look like, and where they are usually placed.
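The tag-first, pattern-second approach described above can be sketched in a few lines of Python. This is only an illustration built on the standard library, not Muck Rack’s actual scraper: the class `MetaTagParser`, the function `extract_story_fields`, the choice of the `og:title` and `author` meta tags, and the simple "By Firstname Lastname" byline pattern are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser
from typing import Optional


class MetaTagParser(HTMLParser):
    """Collects <meta> key/content pairs and the text chunks on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}
        self.text_chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content:
            self.meta[key.lower()] = content.strip()

    def handle_data(self, data):
        if data.strip():
            self.text_chunks.append(data.strip())


# Hypothetical fallback pattern: a line such as "By Jane Doe".
BYLINE_PATTERN = re.compile(r"^By\s+([A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*){1,3})$")


def extract_story_fields(html: str) -> dict[str, Optional[str]]:
    """Prefer metadata tags; fall back to a text pattern for the byline."""
    parser = MetaTagParser()
    parser.feed(html)

    headline = parser.meta.get("og:title")
    author = parser.meta.get("author")

    if author is None:
        # No author tag: look for text that is shaped like a byline.
        for chunk in parser.text_chunks:
            match = BYLINE_PATTERN.match(chunk)
            if match:
                author = match.group(1)
                break

    return {"headline": headline, "author": author}
```

A production system would need far more patterns (and likely a trained model, as the article suggests), but the order of operations is the same: trust explicit tags first, then infer from layout and wording.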
From there, the platform scans the story for signals like author name, publisher and content to match it with the correct author. This is how it knows a story was written by, say, Stephen A. Smith, noted ESPN hot take analyst, rather than another writer named Stephen Smith.
“We can infer associations based on publisher or recency of other publications,” Dennewitz said. “You can’t always avoid author collisions, but you can set up smart rules.”
Those rules can include linking domains and publications that an author has frequently appeared in to their profile, so the algorithm knows where to direct the content. While most of the process is automated, an editorial team is also on hand to fix any mixups manually or work with publications with paywalls.
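As a rough sketch of what such rules might look like, the Python below scores same-name author profiles by whether the story’s domain is already linked to the profile and by how recently that author published. The `AuthorProfile` fields, the `pick_author` function, the point values and the 365-day window are hypothetical choices for illustration, not details from Muck Rack.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class AuthorProfile:
    author_id: str
    name: str
    known_domains: set[str] = field(default_factory=set)
    last_published: Optional[date] = None


def pick_author(
    byline: str,
    domain: str,
    published: date,
    candidates: list[AuthorProfile],
    recency_days: int = 365,
) -> Optional[AuthorProfile]:
    """Return the best-matching profile, or None if the match is ambiguous."""
    same_name = [c for c in candidates if c.name.lower() == byline.lower()]

    def score(candidate: AuthorProfile) -> int:
        points = 0
        if domain in candidate.known_domains:
            points += 2  # the author has appeared on this domain before
        if candidate.last_published is not None:
            gap = abs((published - candidate.last_published).days)
            if gap <= recency_days:
                points += 1  # the author has published recently
        return points

    ranked = sorted(same_name, key=score, reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and score(ranked[0]) == score(ranked[1]):
        return None  # a collision the rules cannot settle; leave it to a human
    return ranked[0]
```

Returning `None` on a tie mirrors the article’s point that collisions cannot always be avoided: in this sketch, anything the rules cannot settle would be left for manual review.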
Establishing a clean code base starts with hiring
When Dennewitz looked into modules inside the Muck Rack codebase, one of the first things that stood out to him was how clean the code was. He saw well-formatted Python with annotated types and variables, thorough documentation and obvious naming that make everything easy to read and understand.
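To make those qualities concrete, here is a small, invented contrast in Python. Neither function comes from Muck Rack’s codebase; the `Story` class and both function names are hypothetical. The two functions apply the same filter, but only the second tells the reader what goes in and what comes out.

```python
from dataclasses import dataclass


@dataclass
class Story:
    headline: str
    publisher: str


# Terse version: it works, but a reader has to guess what a, b and "p" mean.
def f(a, b):
    return [x for x in a if x["p"] == b]


# Annotated version: types, a docstring and descriptive names carry the intent.
def stories_from_publisher(stories: list[Story], publisher: str) -> list[Story]:
    """Return only the stories that were published by ``publisher``."""
    return [story for story in stories if story.publisher == publisher]
```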
“It makes it easy to go inside this complex monolithic application and follow your way through how the code is being used,” Dennewitz said. “Everything is where you’d expect it to be, and everything is thoughtfully written by humans for humans to go back and work on it again.”
The emphasis on clean code stemmed from the culture CTO and co-founder Lee Semel started 10 years ago, Dennewitz said. Semel’s initial work in the codebase served as the example for all new hires to follow, and later informed the documentation new engineers receive today.
The company also makes clean code a priority in hiring. Every candidate receives a test project, and clean, understandable code is one of the primary skills they’re measured on.
Checking for clean code is a multi-step process
Before any code is written, engineers are encouraged to reach out to the product team to discuss the feature they’re requesting. Because the product team understands the tools the engineers use, whether it’s diagnostic charts, error traces or logs, they are able to offer helpful insight, Dennewitz said. Those exchanges help engineers establish a clear idea of what the code needs to do and prevent black-box situations where the code can only be interpreted by one engineer.
One or two non-author engineers then join the author for a routine pull request review to offer a fresh perspective and ensure the code is clean and readable before it’s merged. At Muck Rack, readable code means that it’s safe, scalable and easy for others to follow. Reviewers also look for plain-language descriptiveness and type annotations that remove ambiguities.
If it isn’t clear, the non-author engineers are on hand to suggest and discuss changes to meet those goals.
“This process is meant to increase visibility, discussion and quality of life,” Dennewitz said. “So most all changes to Muck Rack come in the form of a pull request.”
“We can just talk about what’s actually happening rather than debating the semantics of spaces versus tabs,” Dennewitz said.
Clean code is more about people than computers
While a clean codebase isn’t always critical for performance, it’s important for the people who maintain it. It makes it easier for others to decipher what the team is doing and gives everyone a shared language.
When Dennewitz worked his way through the codebase, he was able to quickly pick up how the platform worked. Even if an engineer is new to Django or Python, the time it takes to commit new code stays low because it’s easy to follow what’s been done before. The benefit also carries over into other working relationships, such as enabling the product team to ask questions and make meaningful suggestions.