I wonder if this is the same service that they use to scan public GitHub repositories for secret AWS keys. I'll admit that I've accidentally committed a private key to a public repo before, and I received an email from AWS letting me know about it shortly after.
I suppose it's in Amazon's best interest not to have people hijacking accounts and spinning up as many EC2 instances as possible to mine Bitcoin.
Not from GitHub itself, but the TruffleHog project on GitHub might be of interest to you. There is also SourceClear, which does the same kind of secret scanning for GitHub repos.
Note: AWS also monitors access key use and API call thresholds to keep you informed.
>The San Diego-based startup, co-founded by a team that includes two former NSA employees
>Harvest.ai’s flagship, patent-pending AI product is called MACIE Analytics. It uses AI to monitor how a customer’s intellectual property is being accessed in real-time, assessing who is looking at, copying or moving particular documents, and where they are when they’re doing this, in order to identify suspicious patterns of behavior and flag potential data breaches before they’ve taken place. It bills the service as a way to combat the risk of insider attacks.
Did they get the idea after seeing what happens at the NSA with contractors/whoever downloading data to wherever?
Data Insight is targeted at more user-oriented unstructured content repositories (CIFS, NFS, SharePoint, OneDrive, SharePoint Online, Box), but the fundamentals are very similar: content classification, data profiling, risk scoring, access pattern anomaly detection, and access control remediation.
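For a rough idea of what "access pattern anomaly detection" boils down to in these products, here's a minimal sketch: a per-user baseline with a z-score threshold. The log format and model are hypothetical; real products are far more sophisticated.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical access log: one (user, resource) tuple per access event.
events = [
    ("alice", "finance/q3.xlsx"), ("alice", "finance/q3.xlsx"),
    ("bob", "finance/q3.xlsx"),
]

def daily_counts(events):
    """Collapse raw events into per-user access counts for one day."""
    counts = defaultdict(int)
    for user, _resource in events:
        counts[user] += 1
    return counts

def flag_anomalies(history, today, threshold=3.0):
    """Flag users whose volume today deviates sharply from their baseline.

    history: {user: [count_day1, count_day2, ...]} from past logs.
    today:   {user: count} for the current day.
    """
    findings = []
    for user, count in today.items():
        baseline = history.get(user, [])
        if len(baseline) < 2:
            continue  # not enough data to model this user yet
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (count - mu) / sigma > threshold:
            findings.append((user, count, mu))
    return findings
```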
Classic, selling the poison and the cure. Access controls shouldn't be so convoluted and opaque that it requires a separate service to analyze your configurations. Crazy that we've made such a mess of the security landscape that we need AI systems to tell us if we're leaking info.
Not the case. I've seen seasoned developers (not to single them out) make simple, stupid mistakes with S3 bucket ACLs, permissions, and policies. The issue has to do with the sheer laziness of the "let's create unstructured data buckets, write once, and forget it all" mentality. At some point, this sort of service can be useful in identifying the "crown jewels" within the buckets. Beyond that, buckets deny access by default, so I can't agree with your assertion that AWS is somehow making this convoluted in order to sell more services and drive vendor lock-in.
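Catching the most common of those mistakes doesn't even need a paid service. A minimal boto3 sketch that flags buckets whose ACLs grant access to everyone (assumes credentials that can read the bucket ACLs):

```python
import boto3

# The classic mistake: an ACL grant to the AllUsers or AuthenticatedUsers group.
PUBLIC_GRANTEES = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

def public_grants(s3, bucket_name):
    """Return any ACL grants on the bucket that apply to everyone."""
    acl = s3.get_bucket_acl(Bucket=bucket_name)
    return [
        g for g in acl["Grants"]
        if g["Grantee"].get("URI") in PUBLIC_GRANTEES
    ]

s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    grants = public_grants(s3, bucket["Name"])
    if grants:
        print(bucket["Name"], [g["Permission"] for g in grants])
```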
Yes, thanks for the link, but I fail to see the connection. This tool scans public HTTPS endpoints, using keywords from its dictionary, to discover misconfigured buckets. AWS doesn't manage the bucket permissions/ACL; the customer does. AWS's shared-responsibility model clearly defines all of this. The customer is responsible for the bucket ACL, and the same would apply if I ran my stack in a data center and configured Apache/NGINX with open directory indexes that allowed anyone to traverse them.
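Roughly what such scanners do, sketched in Python with requests (the bucket names are hypothetical; this only sees what any anonymous client on the internet can see):

```python
import requests

# Probe candidate bucket names against the public S3 endpoint, unauthenticated.
# 200 with a ListBucketResult body means the bucket is publicly listable;
# 403 means it exists but denies anonymous access; 404 means the name is free.
wordlist = ["acme-backups", "acme-logs", "acme-dev"]  # hypothetical names

for name in wordlist:
    r = requests.get(f"https://{name}.s3.amazonaws.com/", timeout=5)
    if r.status_code == 200 and "<ListBucketResult" in r.text:
        print(f"{name}: publicly listable")
    elif r.status_code == 403:
        print(f"{name}: exists, access denied")
```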
If you have data that matters, it needs dual controls. The idea that a company would place PII on a publicly accessible site protected only by an ACL is ridiculous.
Instead of futzing with machine learning, use network or crypto controls to prevent access, and have a different chain of command manage that access in your company.
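For example, a deny-by-default S3 bucket policy pinned to a single VPC endpoint is one such network control. A sketch with boto3 (the bucket name and endpoint ID are hypothetical, and a broad Deny like this can lock out admins too, so apply with care):

```python
import json
import boto3

# Deny all S3 actions on the bucket unless the request arrives through
# one specific VPC endpoint. Everything else, including the console, is cut off.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyOutsideVpce",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::example-pii-bucket",
            "arn:aws:s3:::example-pii-bucket/*",
        ],
        "Condition": {
            "StringNotEquals": {"aws:sourceVpce": "vpce-0123456789abcdef0"}
        },
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="example-pii-bucket", Policy=json.dumps(policy)
)
```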
CloudTrail is indeed very cheap for customers: we record nearly all API calls and access to AWS resources and deliver these events to our subscribing customers. The events are delivered for free, with the exception of S3 and Lambda "data events" (object-level gets and puts, and function invocations), which are billed at a very low rate.
(We recently released our AWS Lambda integration — you can now record all Lambda function invocations with us!)
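For anyone wanting to turn those data events on, it's a per-trail setting. A minimal boto3 sketch (the trail and bucket names are hypothetical):

```python
import boto3

# Management events are recorded by default; S3 object-level and Lambda
# invocation "data events" are opt-in per trail and billed separately.
cloudtrail = boto3.client("cloudtrail")
cloudtrail.put_event_selectors(
    TrailName="my-trail",  # hypothetical trail name
    EventSelectors=[{
        "ReadWriteType": "All",
        "IncludeManagementEvents": True,
        "DataResources": [
            # Object-level gets/puts for one bucket:
            {"Type": "AWS::S3::Object", "Values": ["arn:aws:s3:::my-bucket/"]},
            # Invocations for all Lambda functions in the account:
            {"Type": "AWS::Lambda::Function", "Values": ["arn:aws:lambda"]},
        ],
    }],
)
```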
Disclaimer: I’m a Software Engineer with the AWS CloudTrail team.
If I'm reading this right, you now have two paid services for detecting CloudTrail anomalies: GuardDuty, which is nosebleed expensive, and Macie, which is practically free. What's the difference between the two?
Macie analyzes a subset of CloudTrail, not all actions, and is about historical behavior (though for high-severity actions it is more point-in-time).
GuardDuty is looking for specific threats/attacks and can combine multiple sources of telemetry for more advanced correlation, e.g. a combination of VPC Flow Logs + CloudTrail + DNS that triggers an alert when taken together, where a single CloudTrail event would not.
Within CT, what are examples of things Macie will catch, vs. things you'd need GD to catch?
If GD weren't so expensive, I wouldn't really care that much. But GD is so expensive that it can be hard to recommend, which is especially weird since the pricing for Macie CT is so low, and even weirder when you note that the pricing for Macie S3 is so high!
It flagged a healthy amount of data that looked like PII based on data ranges, plus potential secrets in buckets, CSVs, JSON files, and CloudTrail dumps, but it also generated reports on dummy data; without fingerprinting of the live data, it wouldn't know what's real or not. The CloudTrail feature is also useful, since it provides user behavior analytics based on usage.
Google's Data Loss Prevention is provided on G Suite and Google Cloud Platform (GCP). Both products use the same unified classifier codebase. G Suite DLP allows admins to enforce policy on Gmail and Drive files. On GCP, the Data Loss Prevention API allows developers to classify and redact sensitive data in virtually any data source in real-time or at-rest (e.g. Google Cloud Storage, BigQuery, AWS Redshift, AWS S3, Salesforce, Slack, on-prem, custom apps, etc.).
DLP API scans are not limited to 20MB and can scale up to virtually any size. API results can be used for programmatic automation of alerts, IAM/ACL settings, or other remediation and can be sent automatically into BigQuery for detailed analysis or reporting. In addition to classification, Google’s DLP API provides data masking tools for structured and unstructured data including format-preserving encryption, bucketing, and tokenization. This helps developers reduce unnecessary PII when collecting, storing, or sharing data.
(Note: I am the Product Manager for DLP API at Google Cloud)
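To make the API part concrete, here is a minimal inspection call with the google-cloud-dlp Python client (the project ID and sample text are made up; real scans can also point at GCS, BigQuery, etc.):

```python
from google.cloud import dlp_v2

# Inspect a string for a couple of built-in infoTypes and print the findings.
client = dlp_v2.DlpServiceClient()
response = client.inspect_content(
    request={
        "parent": "projects/my-project",  # hypothetical project ID
        "inspect_config": {
            "info_types": [
                {"name": "EMAIL_ADDRESS"},
                {"name": "US_SOCIAL_SECURITY_NUMBER"},
            ],
            "include_quote": True,  # return the matched text itself
        },
        "item": {"value": "Contact jane@example.com, SSN 123-45-6789"},
    },
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood, finding.quote)
```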
On the one hand, AWS Macie only scans S3. Google DLP API works on S3, Gmail, Drive, GCS, DynamoDB, Redshift, BigQuery, Slack, SQL, Oracle, Oracle RAC, Zendesk, Twilio, Salesforce, and everything you can point an API at. If you want to use the same engine to test all your repos then Google DLP API is the right solution for you.
On the other hand, Macie has a GUI wizard. DLP API is an API. So if you can't code and just want to scan S3 then Macie might be for you, until Google DLP builds a GUI, if there's demand for that.
Someone should do a comparison of how successful each engine is at picking up sensitive data. I suspect Google DLP will be tuned better, but someone should do the test on a dummy data set and release results. That would be the most interesting comparison.
https://aws.amazon.com/macie/pricing/