Python Bytes is a weekly podcast hosted by Michael Kennedy and Brian Okken. The show is a short discussion on the headlines and noteworthy news in the Python, developer, and data science space.
#276 Tracking cyber intruders with Jupyter and Python
March 23, 2022
00:45:04
39.44 MB
Watch the live stream:
Watch on YouTube
About the show
Sponsored by FusionAuth: pythonbytes.fm/fusionauth
Special guest: Ian Hellen
Brian #1: gensim.parsing.preprocessing
- Problem I’m working on
- Turn a blog title into a possible url
- example: “Twisted and Testing Event Driven / Asynchronous Applications - Glyph”
- would like, perhaps: “twisted-testing-event-driven-asynchronous-applications”
- Sub-problem: remove stop words ← this is the hard part
- I started with an article called Removing Stop Words from Strings in Python
- It covered how to do this with NLTK, Gensim, and SpaCy
- I was most successful with remove_stopwords() from Gensim
from gensim.parsing.preprocessing import remove_stopwords
- It’s part of the gensim.parsing.preprocessing package
- I wonder what else is in there?
- a treasure trove
- gensim.parsing.preprocessing.preprocess_string is one
- this function applies filters to a string, with the defaults almost being just what I want:
- strip_tags()
- strip_punctuation()
- strip_multiple_whitespaces()
- strip_numeric()
- remove_stopwords()
- strip_short()
- stem_text() ← I think I want everything except this
- this one turns “Twisted” into “Twist”, not good.
- There’s lots of other text processing goodies in there also.
- Oh, yeah, and Gensim is also cool.
- topic modeling for training semantic NLP models
- So, I think I found a really big hammer for my little problem.
- But I’m good with that
Michael #2: DevDocs
- via Loic Thomson
- Gather and search a bunch of technology docs together at once
- For example: Python + Flask + JavaScript + Vue + CSS
- Has an offline mode for laptops / tablets
- Installs as a PWA (sadly not on Firefox)
Ian #3: MSTICPy
- MSTICPy is a toolset for CyberSecurity investigations and hunting in Jupyter notebooks.
- What is CyberSec hunting/investigating? - responding to security alerts and threat intelligence reports, trawling through security logs from cloud services and hosts to determine if it’s a real threat or not.
- Why Jupyter notebooks?
- SOC (Security Ops Center) tools can be excellent but all have limitations
- You can get data from anywhere
- Use custom analysis and visualizations
- Control the workflow, and the workflow is repeatable
- Open source package - originally created to support MS Sentinel Notebooks but now supports lots of providers. When I started this 3+ years ago I thought a lot of this would already be on PyPI - but no 😞
- MSTICPy has 4 main functional areas:
- Data querying - import log data (Sentinel, Splunk, MS Defender, others… working on Elasticsearch)
- Enrichment - is this IP Address or domain known to be malicious?
- Analysis - extract more info from data, identify anomalies (simple example - spike in logon failures)
- Visualization - more specialized than traditional graphs - timelines, process trees.
- All components use pandas, Bokeh for visualizations
- Current focus on usability, discovery of functionality and being able to chain
- Always looking for collaborators and contributors - code, docs, queries, critiques
- https://github.com/microsoft/msticpy
- https://msticpy.readthedocs.io/
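The "spike in logon failures" idea from the Analysis bullet can be illustrated without any security tooling. This is only a rough sketch using the standard library: it is not MSTICPy's API (MSTICPy works on pandas DataFrames), and the data, the find_spikes name, and the 2-sigma threshold are all assumptions made up for the example.

```python
from statistics import mean, pstdev

# Hypothetical data: failed logons counted per hour (list index = hour offset).
failures_per_hour = [2, 1, 3, 2, 2, 1, 40, 2, 3, 1, 2, 2]


def find_spikes(counts, n_sigma=2.0):
    """Return the indexes whose count is more than n_sigma std devs above the mean."""
    threshold = mean(counts) + n_sigma * pstdev(counts)
    return [hour for hour, count in enumerate(counts) if count > threshold]


spike_hours = find_spikes(failures_per_hour)
print(spike_hours)  # [6]
```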
Brian #4: The Right Way To Compare Floats in Python
- David Amos
- Definitely an easier read than the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic
- What many of us remember
- floating point numbers aren’t exact due to representation limitations and rounding error,
- errors can accumulate
- comparison is tricky
- Be careful when comparing floating point numbers, even simple comparisons, like:
>>> 0.1 + 0.2 == 0.3
False
>>> 0.1 + 0.2 <= 0.3
False
- David has a short but nice introduction to the problems of representation and rounding.
- Three reasons for rounding
- more significant digits than floating point allows
- irrational numbers
- rational but non-terminating
- So how do you compare:
- math.isclose()
- be aware of rel_tol and abs_tol and when to use each.
- numpy.allclose(), returns a boolean comparing two arrays
- numpy.isclose(), returns an array of booleans
- pytest.approx(), used a bit differently
- 0.1 + 0.2 == pytest.approx(0.3)
- Also allows rel and abs comparisons
- Discussion of Decimal and Fraction types
- And the memory and speed hit you take on when using them.
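A quick standard-library sketch of the comparisons above. The values and variable names are just illustrative picks; the point is that rel_tol scales with the size of the numbers, so comparisons against zero need abs_tol instead.

```python
import math
from decimal import Decimal
from fractions import Fraction

total = 0.1 + 0.2

exact_match = total == 0.3               # False: binary rounding error
close_match = math.isclose(total, 0.3)   # True: default rel_tol is 1e-09

# A relative tolerance is useless near zero, because rel_tol * 0.0 is 0.0.
tiny = 1e-12
near_zero_rel = math.isclose(tiny, 0.0)                 # False
near_zero_abs = math.isclose(tiny, 0.0, abs_tol=1e-9)   # True

# Decimal and Fraction store these values exactly, at a memory and speed cost.
decimal_match = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")       # True
fraction_match = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)   # True

print(exact_match, close_match, near_zero_rel, near_zero_abs)
print(decimal_match, fraction_match)
```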
Michael #5: Pypyr
- Task runner for automation pipelines
- For when your shell scripts get out of hand. Less tricky than a makefile.
- Script sequential task workflow steps in yaml
- Conditional execution, loops, error handling & retries
- Have a look at the getting started guide.
Ian #6: Pygments
- Python package that’s useful for anyone who wants to display code
- Jupyter notebook Markdown and GitHub markdown let you display code with syntax highlighting. (Jupyter uses Pygments behind the scenes to do this.)
- There are tools that convert code to image format (PNG, JPG, etc) but you lose the ability to copy/paste the code
- Pygments can intelligently render syntax-highlighted code to HTML (and other formats)
- Applications:
- Documentation (used by Sphinx/ReadtheDocs) - render code to HTML + CSS
- Displaying code snippets dynamically in readable form
- Lots (maybe 100s) of code lexers - Python (code, traceback), Bash, C, JS, CSS, HTML, also config and data formats like TOML, JSON, XML
- Easy to use - 3 lines of code - example:
from IPython.display import display, HTML
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
code = """
def print_hello(who="World"):
    message = f"Hello {who}"
    print(message)
"""
display(HTML(
    highlight(code, PythonLexer(), HtmlFormatter(full=True, nobackground=True))
))
# use HtmlFormatter(style="stata-dark", full=True, nobackground=True)
# for dark themes
- Output to HTML, LaTeX, image formats.
- We use it in MSTICPy for displaying scripts used in attacks.
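Outside a notebook the same pieces work without IPython. This is a small sketch, assuming Pygments is installed, that looks a lexer up by name and renders one snippet to both HTML and LaTeX; the JSON snippet is made up for illustration.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter, LatexFormatter
from pygments.lexers import get_lexer_by_name

# Made-up JSON snippet, just to show a non-Python lexer.
snippet = '{"user": "admin", "failed_logons": 3}'

# Lexers can be looked up by a short alias instead of importing the class.
lexer = get_lexer_by_name("json")

# HTML fragment: a <div class="highlight"> that still needs CSS rules to be coloured.
html_fragment = highlight(snippet, lexer, HtmlFormatter())
css_rules = HtmlFormatter().get_style_defs(".highlight")

# Same snippet, rendered as LaTeX source instead.
latex_source = highlight(snippet, lexer, LatexFormatter())

print(html_fragment)
print(latex_source)
```

Unlike full=True in the example above, this gives an HTML fragment plus separate CSS, which is usually easier to drop into an existing page.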
Extras
Brian:
- smart-open
- one of the 3 Gensim dependencies
- It’s for streaming large files, from really anywhere, and looks just like Python’s open().
Michael:
- Python 3.10.3 is out.
- git fixup (follow up from last week, via Adam Parkin)
Joke: What’s your secret?