This article is part of an ongoing series showcasing some of the best hacks we encountered through the Hack the North competition. To view more of these projects, please check out the full list of hacks using the Dropbase API.
Do you know what sort of personal information you've shared online? Maybe it was a tweet where you replied to a friend with a phone number, or a Reddit comment that mentioned the town you live in. On their own, these details may seem harmless, but to malicious users, or "doxxers," information like this can be used to stalk you, harass you, or even hack your accounts.
For four first-year software engineering students at the University of Waterloo, this concept formed the basis of their project at Hack the North 2020. Despite having their first semester of university entirely online, Sunny Zuo, Aurik Datta, Wolf Van Dierdonck and Matt Zhang found ways to connect with one another and decided to enter Hack the North as a team. Working from around the world, from Calgary to the Maldives, the team managed to put together an impressive project in less than 36 hours.
The project was inspired by the recent news that the right-wing social media site Parler was not scrubbing the metadata from photos posted to the site, which led to personal information and locations being leaked for all of its users. Oversights like these can be incredibly damaging and dangerous.
The team decided to find a way to stop doxxers from gaining access to your personal data. Their solution was to beat doxxers at their own game: they created DoxMy.tech to let you dox yourself. By feeding the application the usernames of your social media profiles, the tool scrapes data from those sources and tries to surface personal information about you. This lets you make informed decisions about what content you might want to delete, and what personal information you're comfortable with the world having access to.
In the React frontend, a user enters the usernames for their Reddit, Twitter and Facebook profiles, then goes through an authorization step to ensure that the person being queried is actually the person doing the search. The app then calls the Twitter, Reddit and Facebook APIs to gather parsed data from the user's past posts. This data is sent through Azure's natural language processing to identify entities and word patterns that could reveal personal information. The application was built to identify your name, email, address, location, and phone number, flag any potential data breaches, and generate some cool data visualizations.
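The actual project relies on Azure's NLP service for entity detection, but the core idea, scanning post text for patterns that look like personal information, can be sketched with simple regexes. Everything here (pattern set, function name, sample posts) is illustrative, not the project's real code:

```python
import re

# Hypothetical stand-in for the NLP entity-detection step: regex
# patterns for a couple of common kinds of personal information.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def find_personal_info(posts):
    """Scan a list of post texts and collect unique matches per category."""
    found = {}
    for text in posts:
        for label, pattern in PATTERNS.items():
            for match in pattern.findall(text):
                found.setdefault(label, set()).add(match)
    return found

posts = [
    "DM me, or call 555-867-5309 after 6pm",
    "my email is jane.doe@example.com if that's easier",
]
print(find_personal_info(posts))
```

A real NLP model goes much further than this, catching names, locations and context-dependent mentions that no fixed pattern can, which is why the team reached for a hosted service rather than hand-written rules.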
These visualizations included sentiment analysis on your posts (whether you tend to post positively or negatively), word clouds of your most commonly used words, and post-frequency data showing when you post the most and least throughout each day and week.
Because the team had to use multiple APIs to gather their data, each API returned information in a unique format. To streamline data processing and aggregation, the team set up multiple data pipelines through Dropbase to ingest the different formats and clean the data so it could be fed into the natural language processing engine. Once the first user's data had been cleaned, the process was replicable without any human intervention, allowing subsequent data to be cleaned programmatically.
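The pattern described above, one adapter per source mapping platform-specific responses into a common record, can be sketched as follows. The field names (`body`, `created_utc`, and so on) mirror what those APIs commonly return, but the common schema itself is an assumption, not the team's actual Dropbase pipeline:

```python
# Illustrative per-source adapters: each maps one platform's raw post
# shape into a single common record before any further processing.
def from_reddit(item):
    return {"source": "reddit", "text": item["body"], "created": item["created_utc"]}

def from_twitter(item):
    return {"source": "twitter", "text": item["text"], "created": item["created_at"]}

ADAPTERS = {"reddit": from_reddit, "twitter": from_twitter}

def ingest(source, raw_items):
    """Run every raw item through its source's adapter. Once an adapter
    exists, new data from that source needs no manual cleaning."""
    return [ADAPTERS[source](item) for item in raw_items]

records = ingest("reddit", [{"body": "hello", "created_utc": 1610000000}])
print(records[0]["text"])  # -> hello
```

This is what makes the cleaning step replicable: the human effort goes into writing the adapter once, and every later batch from that platform flows through automatically.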
The team said their next big goal is to expand the number of social platforms from which they aggregate data, to build a broader view of your online self. Platforms on their immediate roadmap include Instagram and deeper integration with Facebook. Beyond that, they said they'd love to find other interesting ways of visualizing the collected data, and perhaps other types of identifying information that they could help make users aware of.