Natural Language Processing for OSINT & Threat Analysis (W54)





In this course, we apply Natural Language Processing (NLP) to cyber threat analysis and OSINT, assessing and analyzing data gathered from open sources and social networks. The course is project-based: after learning each concept, we immediately put it into practice on a dataset. We will work on election-related topics and use the gathered datasets to evaluate tweets for hate speech, measure popularity, and plot the most common words and monthly popularity.

This course is for everyone who wants to become familiar with Natural Language Processing and use it for OSINT and cyber threat analysis, including:

  • Forensics professionals
  • OSINT enthusiasts
  • Security analysts
  • Penetration testers

Machine learning is no longer an unfamiliar field; it has spread into nearly every area of technology. NLP is one of the major fields of machine learning and aims to process and extract information from text data. With the rapid growth of social networks, cyber threat analysis using these tools is more critical than ever.

This is a short but valuable course that aims to help students strengthen their OSINT and threat hunting skills and take a first step into machine learning, especially NLP. It offers a new and unique approach to OSINT and web crawling.

Course benefits:

What tools will you use?

  • Bash
  • Python
  • NLTK
  • spaCy
  • Matplotlib
  • NetworkX
  • vis.js
  • Pyplot

What skills will you gain?

  • Text analysis and text mining.
  • Parsing raw data and using NLP pipelines.
  • Crawling information from open sources.

Course general information:

DURATION: 7 hours

CPE POINTS: On completion you get a certificate granting you 7 CPE points.


Course format:

  • Self-paced
  • Pre-recorded
  • Accessible even after you finish the course
  • No preset deadlines
  • Materials are video, labs, and text
  • All videos captioned


In this course, we work on the Kali Linux distribution. It can be installed on a virtual machine or run as a live system.


  • Before beginning this course, make sure you have a good knowledge of Python.
  • No prior experience with NLP is needed! 


Saeed is a security researcher and project leader at OWASP and an e-learning instructor.
He has extensive experience in security areas such as network security, secure coding, server security, human resource vulnerabilities, DevOps, and more. He has 5 years of research experience and works with companies in the software engineering and cybersecurity fields. He is also a mentor in Google Summer of Code 2021, with 25 students actively working on an OSINT meta search engine project.



Module 0

Before the course

Introduction: Wordcloud and histogram for term-frequency.
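
As a rough illustration of what a term-frequency histogram shows, here is a minimal sketch using only the Python standard library. The sample sentence and the function names `term_frequencies` and `text_histogram` are assumptions made for this example, not course material; the course itself lists Matplotlib for plotting.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def text_histogram(freqs, top_n=5):
    """Render the top_n most common terms as rows of '#' bars."""
    rows = []
    for term, count in freqs.most_common(top_n):
        rows.append(f"{term:<10} {'#' * count} ({count})")
    return "\n".join(rows)


# Made-up sample text, not a course dataset.
sample = "vote early and vote often; every vote counts in every election"
freqs = term_frequencies(sample)
print(text_histogram(freqs, top_n=3))
```

The same counts could be passed to a plotting library to draw a bar chart; a text rendering is used here only to keep the sketch dependency-free.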

Module 1

An introduction to NLP

In this section, we first learn basic NLP concepts and then implement NLP preprocessing pipelines to produce a clean dataset and corpus. Some of the topics may be unfamiliar to some students, but they are simple. The course assumes that participants have no prior experience with machine learning or NLP.

  • File formats
  • Regular expressions
  • Punctuation
  • Tokenization
  • Standardization
  • Stopwords
  • Lemmatization
  • Stemming
  • N-grams
  • Wordcloud
  • Term-frequency
  • Histogram for term-frequency
  • TF-IDF


Writing a pipeline for preprocessing.
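
Several of the steps listed above (standardization, punctuation handling, tokenization, stopword removal, n-grams) can be chained into a small pipeline. The sketch below uses only the standard library so that it runs anywhere; the stopword list is a tiny made-up sample, and all function names are hypothetical. Lemmatization and stemming are left out because they need a library such as NLTK or spaCy.

```python
import re
import string

# Tiny illustrative stopword list (an assumption for this sketch);
# NLP libraries ship much fuller lists.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def normalize(text):
    """Lowercase the text and replace every punctuation mark with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)


def tokenize(text):
    """Split normalized text on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text):
    """Run normalization, tokenization, and stopword removal in order."""
    return remove_stopwords(tokenize(normalize(text)))


tokens = preprocess("The candidates are debating in the capital, again!")
print(tokens)
print(ngrams(tokens, 2))
```

Note that replacing all punctuation with spaces also strips characters such as `#` and `@`, so hashtags and mentions should be extracted before this step if they matter for the analysis.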

Module 2

Exploring cyberspace

In this module, we will learn how to scrape Twitter, Reddit, Google, and PubMed.

  • Extracting snippets from Google
  • Extracting article abstracts from PubMed
  • Extracting tweets from Twitter without the API
  • Extracting trending hashtags from Twitter
  • Extracting posts from Reddit based on topicality


Extract hashtags, phone numbers, and email addresses from content with regular expressions.
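
A minimal sketch of this kind of extraction with Python's `re` module follows. The three patterns are simplified assumptions written for this example: real phone and email formats vary widely, so these will miss some valid values and may accept some invalid ones. The sample text and the name `extract_entities` are also made up.

```python
import re

# Simplified patterns (assumptions for illustration, not exhaustive).
HASHTAG_RE = re.compile(r"#\w+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional leading '+', then at least nine characters of digits and
# common separators, starting and ending with a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_entities(text):
    """Return the hashtags, emails, and phone-like strings found in text."""
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


# Made-up sample content.
sample = "Tip line: +1 555-010-9999 or tips@example.org #Election2024 #vote"
print(extract_entities(sample))
```

In practice, patterns like these are usually tuned per source, since a phone pattern loose enough to catch international formats will also match other long digit runs such as IDs or timestamps.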

Module 3

Gaining knowledge from data

In this module, we process the gathered data to extract information from it. We will classify, label, merge, and build semantic clusters from the snippets gathered from Google and PubMed. Then, we use a sentiment analysis algorithm and a hate speech model to gauge the popularity of subjects in tweets and Reddit posts.

  • Classification and clustering, and using them in action
  • Similarity formulas: cosine, Manhattan, Euclidean
  • Merge algorithms to sort results accordingly
  • Sentiment analysis algorithms, and using them in action
  • Evaluating election tweets with a hate speech model
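
The three similarity formulas named in the list above can be sketched in a few lines of plain Python. The vectors here are made-up term-count vectors, and the function names are chosen for this example. Cosine is a similarity (higher means more alike), while Manhattan and Euclidean are distances (lower means more alike).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical term-count vectors over a three-word vocabulary.
doc_a = [1, 2, 0]
doc_b = [2, 4, 0]  # same direction as doc_a, twice the length
doc_c = [0, 0, 3]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
print(manhattan_distance(doc_a, doc_b))
print(euclidean_distance(doc_a, doc_b))
```

The example is built so that `doc_a` and `doc_b` point in the same direction: cosine treats them as maximally similar even though the two distance measures are nonzero, which is one reason cosine is often preferred when documents differ in length.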


Implement a simple document retriever.
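
One common way to build a simple document retriever is to weight terms with TF-IDF and rank documents by cosine similarity to the query. The sketch below takes that approach as an assumption; the course may implement its retriever differently. The class name `TfidfRetriever`, the smoothed IDF variant, and the sample documents are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfRetriever:
    """Rank documents against a query by cosine similarity of TF-IDF vectors."""

    def __init__(self, documents):
        self.documents = documents
        doc_tokens = [tokenize(d) for d in documents]
        n_docs = len(documents)
        # Document frequency: in how many documents each term appears.
        doc_freq = Counter()
        for tokens in doc_tokens:
            doc_freq.update(set(tokens))
        # Smoothed IDF, so terms present in every document keep a positive weight.
        self.idf = {
            term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()
        }
        self.doc_vectors = [self._vectorize(tokens) for tokens in doc_tokens]

    def _vectorize(self, tokens):
        """Sparse TF-IDF vector as a dict; terms outside the vocabulary are dropped."""
        counts = Counter(tokens)
        return {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}

    @staticmethod
    def _cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def search(self, query, top_k=3):
        """Return up to top_k (document, score) pairs with a nonzero score."""
        query_vec = self._vectorize(tokenize(query))
        scored = [
            (self._cosine(query_vec, vec), i)
            for i, vec in enumerate(self.doc_vectors)
        ]
        # Highest score first; ties broken by original document order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.documents[i], round(s, 3)) for s, i in scored[:top_k] if s > 0]


# Made-up documents standing in for gathered snippets.
docs = [
    "Election turnout rose in the capital",
    "New phishing campaign targets voters by email",
    "Football results from the weekend",
]
retriever = TfidfRetriever(docs)
results = retriever.search("phishing email", top_k=2)
print(results)
```

Storing vectors as dictionaries keeps the sketch short for small collections; for larger corpora, a sparse-matrix representation would be the more usual choice.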

If you have any questions, please contact us at [email protected].



© HAKIN9 MEDIA SP. Z O.O. SP. K. 2013